Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (substr)

℞ 31: Extract by grapheme instead of by codepoint (substr)

Unicode Standard Annex #29 (Unicode Text Segmentation) defines the boundaries between grapheme clusters, the units that users perceive as “characters”. The CPAN module Unicode::GCString lets you treat a Unicode string as a sequence of these grapheme clusters.

While you may use \X to match graphemes within a regex, Unicode::GCString provides a substr() method that extracts a run of grapheme clusters, with the offset and length counted in grapheme clusters rather than code points:

 # cpan -i Unicode::GCString
 use Unicode::GCString;

 my $gcs        = Unicode::GCString->new($str);  # $str is a decoded character string
 my $first_five = $gcs->substr(0, 5);            # the first five grapheme clusters
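To make the difference concrete, here is a small sketch that compares the core substr() with the Unicode::GCString version on the same input. The sample string and the variable names are illustrative assumptions, not part of the original recipe. The string is built from \x{...} escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Unicode::GCString;

# "creme brulee" with combining accents (sample data for illustration):
# three of the user-visible characters take two code points each.
my $str = "cre\x{300}me bru\x{302}le\x{301}e";

# Core substr() counts code points. The fifth code point is "m", so the
# result ends after "m": 4 graphemes, 5 code points.
my $by_codepoint = substr($str, 0, 5);

# Unicode::GCString counts grapheme clusters, so the accented "e" and its
# combining mark stay together: 5 graphemes, 6 code points.
# substr() returns another GCString object; as_string() converts it back.
my $gcs         = Unicode::GCString->new($str);
my $by_grapheme = $gcs->substr(0, 5)->as_string;

printf "code points in string:     %d\n", length($str);           # 15
printf "grapheme clusters:         %d\n", $gcs->length;           # 12
printf "substr by code point kept: %d code points\n", length($by_codepoint);  # 5
printf "substr by grapheme kept:   %d code points\n", length($by_grapheme);   # 6
```

The code-point version returns one user-visible character fewer than requested, because the combining grave accent uses up one of the five positions. The same thing can split a base letter from its accent when the cut lands between them.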

The module also provides an iterator interface (its next, pos, and eos methods) for stepping through the grapheme clusters of a string one at a time.
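As a rough sketch of that iterator interface, the loop below walks a string one grapheme cluster at a time with next(). It assumes next() returns each cluster as a Unicode::GCString object and returns undef once the end of the string is reached. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::GCString;

# "creme" with a combining grave accent on the first "e" (sample data).
my $gcs = Unicode::GCString->new("cre\x{300}me");

# Record how many code points each grapheme cluster contains.
my @code_points_per_cluster;
while (defined(my $cluster = $gcs->next)) {
    # Each cluster is itself a GCString object; convert it to a plain
    # string before measuring it with the core length().
    push @code_points_per_cluster, length($cluster->as_string);
}

print "@code_points_per_cluster\n";   # 1 1 2 1 1
```

The third cluster reports two code points because the base letter and its combining mark travel together as one unit.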

Previous: ℞ 30: Extract by Grapheme Instead of Codepoint (regex)

Series Index: The Standard Preamble

Next: ℞ 32: Reverse String by Grapheme
