Perl Unicode Cookbook: Always Decompose and Recompose

Apr 3, 2012 by Tom Christiansen

℞ 1: Generic Unicode-savvy ﬁlter

Unicode allows multiple representations of the same characters. Comparing such strings for equivalence (sorting, searching, exact matching) requires care—including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.

To handle Unicode effectively, always decompose on the way in, then recompose on the way out.

 use Unicode::Normalize;

 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
     ...
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }

See the Unicode Normalization FAQ for more details.

Series Index: The Standard Preamble

Next: ℞ 2: Fine-Tuning Unicode Warnings

Tags

unicode