Perl Unicode Cookbook: Always Decompose and Recompose
℞ 1: Generic Unicode-savvy filter
Unicode allows multiple representations of the same characters. Comparing such strings for equivalence (sorting, searching, exact matching) requires care—including a coherent and consistent strategy of normalizing these representations to well-understood forms. Enter Unicode::Normalize.
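As a small illustration of the problem, the sketch below compares two spellings of "é": the single precomposed code point U+00E9, and "e" followed by U+0301 COMBINING ACUTE ACCENT. The variable names are made up for this example; only NFD and NFC come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Two spellings of the same character, as seen by a reader.
my $precomposed = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A plain string comparison works on code points, so the raw strings differ.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both sides to the same form, they compare equal.
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

print "raw: $raw_equal, NFD: $nfd_equal, NFC: $nfc_equal\n";
# prints "raw: 0, NFD: 1, NFC: 1"
```

Either form works for comparison, as long as both operands use the same one.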
To handle Unicode effectively, always decompose on the way in, then recompose on the way out.
    use Unicode::Normalize;

    while (<>) {
        $_ = NFD($_);   # decompose + reorder canonically
        ...
    } continue {
        print NFC($_);  # recompose (where possible) + reorder canonically
    }
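One reason to work on decomposed text inside the loop is that base letters and their combining marks become separate code points, so a pattern such as \p{Mn} can see the marks on their own. The following is only a sketch of the round trip on a single in-memory string; the sample text and the variable names ($input, $work, $plain, $out) are assumptions made for illustration, not part of the recipe.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# "crème brûlée" written with precomposed letters (12 code points).
my $input = "cr\x{E8}me br\x{FB}l\x{E9}e";

# On the way in: decompose. Each accented letter becomes base letter + mark.
my $work = NFD($input);
my $len_in   = length $input;   # 12
my $len_work = length $work;    # 15 (three combining marks added)

# While decomposed, the marks can be matched separately from the letters.
(my $plain = $work) =~ s/\p{Mn}//g;   # copy with nonspacing marks removed

# On the way out: recompose. This restores the precomposed letters.
my $out = NFC($work);
my $round_trip = $out eq $input ? 1 : 0;

print "$len_in -> $len_work code points, plain: $plain, round trip: $round_trip\n";
# prints "12 -> 15 code points, plain: creme brulee, round trip: 1"
```

The round trip is exact here because the input was already in NFC; input in some other form would come out as NFC rather than as the original code points.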
See the Unicode Normalization FAQ for more details.