Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (regex)

℞ 30: Extract by grapheme instead of by codepoint (regex)

Remember that Unicode defines a grapheme as “what a user thinks of as a character”. A codepoint is an integer value in the Unicode codespace. While ASCII conflates the two, effective Unicode use respects the difference between user-visible characters and their representations.

Use the \X regex metacharacter when you need to extract graphemes from a string instead of codepoints:

 # match and grab the first five graphemes
 my ($first_five) = $str =~ /^ ( \X{5} ) /x;
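
To make the difference concrete, here is a minimal sketch that extracts the same number of units from one string, first by codepoint and then by grapheme. The sample string and the variable names ($by_codepoint, $by_grapheme, $grapheme_count) are illustrative assumptions, not part of the original recipe. The string is "résumé" in decomposed form, written with escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical sample: "résumé" in decomposed form, where each "é" is
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $str = "re\x{301}sume\x{301}";

# Two codepoints: the accent is cut off from its base letter.
my ($by_codepoint) = $str =~ /^ ( .{2} ) /xs;

# Two graphemes: the accent stays attached to the "e".
my ($by_grapheme) = $str =~ /^ ( \X{2} ) /x;

# Count graphemes by matching \X repeatedly in list context.
my $grapheme_count = () = $str =~ /\X/g;

printf "codepoints in string:  %d\n", length $str;          # 8
printf "graphemes in string:   %d\n", $grapheme_count;      # 6
printf "codepoints via .{2}:   %d\n", length $by_codepoint; # 2
printf "codepoints via \\X{2}: %d\n", length $by_grapheme;  # 3
```

The codepoint-based match returns "re" and leaves the combining accent behind, while the \X-based match returns the "r" plus the complete accented "e", which is what a reader would count as two characters.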

Previous: ℞ 29: Match Unicode Grapheme Cluster in Regex

Series Index: The Standard Preamble

Next: ℞ 31: Extract by Grapheme Instead of Codepoint (substr)
