Perl Unicode Cookbook: Case-insensitive Comparisons
℞ 21: Unicode case-insensitive comparisons
Unicode is more than an expanded character set. Unicode is a set of rules about how characters behave and a set of properties about each character.
Comparing strings for equivalence often requires normalizing them to a standard form. That normalized form often requires that all characters be in a specific case. ℞ 20: Unicode casing demonstrated that converting between upper- and lower-case Unicode characters is more complicated than simply mapping [A-Z] to [a-z]. (Remember also that many characters have a title case form!)
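As a rough illustration of why a simple case mapping is not enough, the following sketch compares two pairs of strings after lowercasing them with lc(). The variable names are made up for this example, and it assumes the source file is saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 encoded

# "ß" is already lowercase, while "SS" lowercases to "ss",
# so the two lowercased strings still differ.
my $lc_equal = lc("tschüß") eq lc("TSCHÜSS") ? 1 : 0;

# Greek has two lowercase sigmas: "σ" and the word-final "ς".
# lc() maps every "Σ" to "σ", so the last letters do not match.
my $sigma_equal = lc("Σίσυφος") eq lc("ΣΊΣΥΦΟΣ") ? 1 : 0;
```

Both flags end up false, even though a reader would consider each pair the same word in a different case.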
The proper solution for normalized comparisons is to perform casefolding instead of mapping a subset of characters to another. Perl 5.16 added the fc() ("foldcase") function, enabled through the fc feature, which performs the same Unicode casefolding that the /i pattern modifier has always provided. Older Perls can get the same function from the CPAN module Unicode::CaseFold:
use feature "fc"; # fc() function is from v5.16
# OR
use Unicode::CaseFold;
# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;
# both are true:
fc("tschüß") eq fc("TSCHÜSS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
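The fragments above can be assembled into a small standalone script. This is only a sketch: the helper same_ignoring_case() and the sample list are hypothetical names chosen for illustration, and it assumes Perl v5.16 or later and a UTF-8 source file:

```perl
use strict;
use warnings;
use utf8;             # the string literals below are UTF-8 encoded
use feature 'fc';     # fc() requires Perl v5.16 or later

# Hypothetical helper: true when two strings are equal after casefolding.
sub same_ignoring_case {
    my ($left, $right) = @_;
    return fc($left) eq fc($right) ? 1 : 0;
}

# Sort by the casefolded form so that case does not affect the order.
my @list   = ("cherry", "Banana", "apple", "Date");
my @sorted = sort { fc($a) cmp fc($b) } @list;

print join(", ", @sorted), "\n";    # prints "apple, Banana, cherry, Date"
```

A plain cmp sort on the same list would place "Banana" and "Date" ahead of the lowercase words, because uppercase ASCII letters have lower code points. Comparing the folded forms leaves the original strings untouched and only changes how they are ordered.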
The article "Fold cases properly" goes into more detail about case folding in Perl.
Previous: ℞ 20: Unicode Casing
Series Index: The Standard Preamble