Perl Unicode Cookbook: Get Character Categories

℞ 23: Get character category

Unicode is a set of characters and a list of rules and properties applied to those characters. The Unicode Character Database collects those properties. The core module Unicode::UCD provides access to these properties.

One of these properties is the general category, which sorts characters into groups such as uppercase or lowercase letters, punctuation, math symbols, and more. (See Unicode::UCD’s general_categories() for the full list.)

The charinfo() function returns a hash reference containing a wealth of information about a given Unicode codepoint. In particular, the category key of that hash holds the short name of the character’s general category.

To find the general category of a numeric codepoint:

 use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)->{category};  # "Lu"
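
The category is only one of the fields in the hash that charinfo() returns. As a minimal sketch, the following prints a few of the others for the same codepoint. The selection of fields is arbitrary, and the guard assumes that charinfo() returns undef when it has no data for a codepoint (for example, an unassigned one):

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Assumption: charinfo() yields undef for codepoints it has no data for,
# so check the result before dereferencing it.
my $info = charinfo(0x3A3)
    or die "No character data for U+03A3\n";

# Print a handful of the fields stored in the hash.
for my $field (qw(code name category script block)) {
    printf "%-8s %s\n", $field, $info->{$field};
}
```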

To translate this category into something more human-friendly:

 use Unicode::UCD qw( charinfo general_categories );
 my $categories = general_categories();
 my $cat        = charinfo(0x3A3)->{category};  # "Lu"
 my $full_cat   = $categories->{ $cat }; # "UppercaseLetter"
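
Both lookups take a numeric codepoint, so a character held in a string has to go through ord() first. The sketch below wraps the two steps in one function. Note that category_name() is a hypothetical helper written for this example, not part of Unicode::UCD, and that it assumes charinfo() returns undef when it has no data for a codepoint:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo general_categories);

# Hash reference mapping short category names to long ones.
my $categories = general_categories();

# category_name() is a hypothetical helper, not part of Unicode::UCD.
# It takes a one-character string and returns the long category name,
# or undef when charinfo() has no data for that character.
sub category_name {
    my ($char) = @_;
    my $info = charinfo(ord $char) or return;
    return $categories->{ $info->{category} };
}

# Try it on capital sigma, small sigma, and an ASCII digit.
for my $char ("\x{3A3}", "\x{3C3}", "7") {
    printf "U+%04X %s\n", ord($char), category_name($char) // 'unknown';
}
```

Calling general_categories() once and reusing the hash reference avoids rebuilding the table on every lookup.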

Previous: ℞ 22: Match Unicode Linebreak Sequence

Series Index: The Standard Preamble

Next: ℞ 24: Disable Unicode-awareness in Builtin Character Classes
