Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (regex)
℞ 30: Extract by grapheme instead of by codepoint (regex)
Remember that Unicode defines a grapheme as “what a user thinks of as a character”. A codepoint is an integer value in the Unicode codespace. While ASCII conflates the two, effective Unicode use respects the difference between user-visible characters and their representations.
Use the \X regex metacharacter when you need to extract graphemes from a string instead of codepoints:
# match and grab the first five graphemes
my ($first_five) = $str =~ /^ ( \X{5} ) /x;
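To see why this matters, it may help to compare \X against the dot on a string that contains combining marks. The following is only a sketch: the sample string, the variable names, and the count_graphemes helper are made up for illustration and are not part of the original recipe. The string spells "crème brûlée" in decomposed form, so each accented letter is a base letter followed by a combining codepoint.

```perl
use strict;
use warnings;

# Emit UTF-8 so the accented output prints without "Wide character" warnings.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample data: "crème brûlée" with combining marks.
# U+0300 = combining grave, U+0302 = combining circumflex, U+0301 = combining acute.
my $str = "cre\x{300}me bru\x{302}le\x{301}e";

# Hypothetical helper: count graphemes by matching \X repeatedly in list context.
sub count_graphemes {
    my ($s) = @_;
    my $count = () = $s =~ /\X/g;
    return $count;
}

# The dot matches one codepoint at a time, so the combining grave
# uses up one of the five slots.
my ($by_codepoint) = $str =~ /^ ( .{5}  ) /xs;

# \X matches one grapheme at a time: a base character plus any
# combining marks attached to it.
my ($by_grapheme)  = $str =~ /^ ( \X{5} ) /x;

printf "%s: %d codepoints, %d graphemes\n",
    $by_codepoint, length($by_codepoint), count_graphemes($by_codepoint);
printf "%s: %d codepoints, %d graphemes\n",
    $by_grapheme, length($by_grapheme), count_graphemes($by_grapheme);
```

With this input, the .{5} capture holds five codepoints but only four user-visible characters ("crèm"), while the \X{5} capture holds five user-visible characters ("crème") built from six codepoints. Note that length always counts codepoints, which is why the helper counts \X matches instead.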
Previous: ℞ 29: Match Unicode Grapheme Cluster in Regex
Series Index: The Standard Preamble
Next: ℞ 31: Extract by Grapheme Instead of Codepoint (substr)