Perl Unicode Cookbook: Case-insensitive Comparisons

℞ 21: Unicode case-insensitive comparisons

Unicode is more than an expanded character set. Unicode is a set of rules about how characters behave and a set of properties about each character.

Comparing strings for equivalence often requires normalizing them to a standard form. That normalized form often requires that all characters be in a specific case. ℞ 20: Unicode casing demonstrated that converting between upper- and lower-case Unicode characters is more complicated than simply mapping [A-Z] to [a-z]. (Remember also that many characters have a title case form!)

The proper solution for normalized comparisons is casefolding, not mapping one subset of characters onto another. Perl 5.16 added the fc() function, or “foldcase”, which performs the same Unicode casefolding that the /i pattern modifier has always used. On older Perls, the CPAN module Unicode::CaseFold provides the same function:

 use feature "fc"; # fc() function is from v5.16
 # OR
 use Unicode::CaseFold;

 # sort case-insensitively
 my @sorted = sort { fc($a) cmp fc($b) } @list;

 # both comparisons are true (the literals need "use utf8" in effect):
 print "same\n" if fc("tschüß")  eq fc("TSCHÜSS");
 print "same\n" if fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ");
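
To see why folding matters, it can help to run lc() and fc() side by side on the same strings. The following is a minimal sketch, assuming Perl 5.16 or later. The variable names are made up for illustration, and the strings are written with \x{...} escapes so the result does not depend on how the source file is encoded.

```perl
use v5.16;      # turns on strict and the feature bundle that includes fc()
use warnings;

# "tschüß" and "TSCHÜSS", spelled with escapes
my $german_lower = "tsch\x{FC}\x{DF}";
my $german_upper = "TSCH\x{DC}SS";

# "Σίσυφος" (ends in final sigma, U+03C2) and "ΣΊΣΥΦΟΣ"
my $greek_mixed = "\x{3A3}\x{3AF}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2}";
my $greek_upper = "\x{3A3}\x{38A}\x{3A3}\x{3A5}\x{3A6}\x{39F}\x{3A3}";

# lc() leaves "ß" and final sigma alone, so the lowercased strings differ.
my $german_lc_equal = lc($german_lower) eq lc($german_upper) ? 1 : 0;
my $greek_lc_equal  = lc($greek_mixed)  eq lc($greek_upper)  ? 1 : 0;

# fc() folds "ß" to "ss" and final sigma to ordinary sigma, so they match.
my $german_fc_equal = fc($german_lower) eq fc($german_upper) ? 1 : 0;
my $greek_fc_equal  = fc($greek_mixed)  eq fc($greek_upper)  ? 1 : 0;

# The same idea applied to sorting: plain cmp puts uppercase ASCII first.
my @words  = qw(banana apple Cherry);
my @plain  = sort @words;
my @folded = sort { fc($a) cmp fc($b) } @words;

say "lc: german=$german_lc_equal greek=$greek_lc_equal";
say "fc: german=$german_fc_equal greek=$greek_fc_equal";
say "plain sort:  @plain";
say "folded sort: @folded";
```

Comparing folded copies leaves the original strings untouched, so the folded form serves only as a comparison key and is not meant for display.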

The article “Fold cases properly” goes into more detail about case folding in Perl.

Previous: ℞ 20: Unicode Casing

Series Index: The Standard Preamble

Next: ℞ 22: Match Unicode Linebreak Sequence
