Perl Unicode Cookbook: Always Decompose and Recompose
℞ 1: Generic Unicode-savvy filter
Unicode allows multiple representations of the same character. For example, é can be a single precomposed code point or an e followed by a combining acute accent. Comparing such strings for equivalence (sorting, searching, exact matching) requires care, including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.
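To make the problem concrete, here is a minimal sketch comparing two spellings of "é". The variable names are illustrative only; NFC and NFD are the functions exported by Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of the same user-perceived character "é".
my $precomposed = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{301}";   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A plain string comparison sees different code point sequences.
my $raw_equal = $precomposed eq $combining ? 1 : 0;

# After normalizing both sides to the same form, they compare equal.
my $nfd_equal = NFD($precomposed) eq NFD($combining) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($combining) ? 1 : 0;

printf "raw: %d, NFD: %d, NFC: %d\n", $raw_equal, $nfd_equal, $nfc_equal;
# prints "raw: 0, NFD: 1, NFC: 1"
```

Either form works for comparison as long as both sides use the same one.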
To handle Unicode effectively, always decompose on the way in, then recompose on the way out.
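 use Unicode::Normalize;   # exports NFD and NFC by default
 # Assumes input has already been decoded to characters
 # (for example, by the series' standard preamble).
 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
     ...             # placeholder: your processing of the decomposed text
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }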
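As one illustration of what the loop body might do, the sketch below applies the same decompose-in, recompose-out pattern to a single string. The helper name bracket_e and its bracketing behavior are hypothetical, chosen only to show that a pattern written against decomposed text matches both spellings of an accented letter.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: wrap every "e", together with any combining
# marks attached to it, in square brackets.
sub bracket_e {
    my ($line) = @_;
    my $work = NFD($line);            # decompose on the way in
    $work =~ s/(e\p{M}*)/[$1]/g;      # "e" plus zero or more combining marks
    return NFC($work);                # recompose on the way out
}

binmode(STDOUT, ':encoding(UTF-8)');  # encode characters as UTF-8 on output

my $out = bracket_e("caf\x{E9} creme");
print "$out\n";                       # prints "caf[é] cr[e]m[e]"
```

Because the work happens on NFD text, passing "cafe\x{301} creme" (with a combining accent) yields the same result as the precomposed input.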
See the Unicode Normalization FAQ for more details.
Series Index: The Standard Preamble