Perl Unicode Cookbook: Always Decompose and Recompose

℞ 1: Generic Unicode-savvy filter

Unicode allows multiple representations of the same characters: é, for example, can be the single code point U+00E9 or the letter e (U+0065) followed by U+0301 COMBINING ACUTE ACCENT. Comparing such strings for equivalence (sorting, searching, exact matching) requires care, including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.
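As a small illustration of why this matters, the sketch below compares a precomposed and a decomposed spelling of the same character, first as raw strings and then after normalization. The variable names are made up for this example; only NFD and NFC come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Two spellings of the same user-visible character (illustrative names).
my $precomposed = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A plain string comparison sees different code point sequences.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# Normalizing both sides to the same form makes them compare equal.
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

printf "raw: %d, NFD: %d, NFC: %d\n", $raw_equal, $nfd_equal, $nfc_equal;
# prints "raw: 0, NFD: 1, NFC: 1"
```

Either form works for comparison, as long as both sides use the same one.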

To handle Unicode effectively, always decompose on the way in, then recompose on the way out.

 use Unicode::Normalize;

 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
     # ... process the decomposed line here ...
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }
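The same decompose-in, recompose-out pattern can be tried on an in-memory string instead of a file. This is only a sketch: with_normalization is a hypothetical helper invented for this example, not part of Unicode::Normalize, and the uppercase step merely stands in for whatever processing a real filter would do.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: decompose the input, hand it to a callback,
# then recompose whatever the callback returns.
sub with_normalization {
    my ($text, $process) = @_;
    my $decomposed = NFD($text);
    my $result     = $process->($decomposed);
    return NFC($result);
}

my $in = "caf\x{E9}";    # "café" with a precomposed é (4 code points)
my $seen_length;         # length the callback observes

my $out = with_normalization($in, sub {
    my ($s) = @_;
    $seen_length = length $s;   # é is now e + U+0301, so 5 code points
    return uc $s;               # stand-in for real processing
});

printf "inside: %d code points, outside: %d code points\n",
    $seen_length, length $out;
# prints "inside: 5 code points, outside: 4 code points"

# Show the result as code points to avoid needing an output encoding layer.
print join(" ", map { sprintf "U+%04X", ord } split //, $out), "\n";
# prints "U+0043 U+0041 U+0046 U+00C9"
```

Inside the callback the base letter and its accent are separate code points, which makes them easy to inspect or match individually; the caller still gets the compact composed form back.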

See the Unicode Normalization FAQ for more details.

Series Index: The Standard Preamble

Next: ℞ 2: Fine-Tuning Unicode Warnings
