Perl Unicode Cookbook: Always Decompose and Recompose

℞ 1: Generic Unicode-savvy filter

Unicode allows multiple representations of the same characters: "é", for example, can be the single code point U+00E9 or an "e" followed by U+0301 COMBINING ACUTE ACCENT. Comparing such strings for equivalence (sorting, searching, exact matching) requires care, including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.

To handle Unicode effectively, always decompose on the way in, then recompose on the way out.

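 use Unicode::Normalize;   # exports NFD and NFC by default

 # Assumes the lines read here are already decoded to characters,
 # for example by the I/O setup in the series' Standard Preamble.
 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
     # ... work on the decomposed text here ...
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }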
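
To see why the normalization step matters, here is a minimal sketch that compares two spellings of the same word. The sample strings and variable names are illustrative assumptions, not part of the original recipe; only NFD and NFC come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Two spellings of the same word (illustrative sample data):
my $composed   = "caf\x{E9}";    # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "cafe\x{301}";  # ends in "e" + U+0301 COMBINING ACUTE ACCENT

# As raw code point sequences the two strings differ...
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# ...but they agree once both are brought to the same normalization form.
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Decomposition splits the accented letter; composition merges it back.
my $nfd_length = length NFD($composed);    # c a f e + combining accent
my $nfc_length = length NFC($decomposed);  # c a f é

print "raw equal: $raw_equal\n";    # raw equal: 0
print "NFD equal: $nfd_equal\n";    # NFD equal: 1
print "NFC equal: $nfc_equal\n";    # NFC equal: 1
print "NFD length: $nfd_length\n";  # NFD length: 5
print "NFC length: $nfc_length\n";  # NFC length: 4
```

Comparing under either form works, as long as both sides use the same one. Decomposing on input also means later processing sees base letters and combining marks as separate code points.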

See the Unicode Normalization FAQ for more details.

