Perl Unicode Cookbook: Match Unicode Properties in Regex
℞ 25: Match Unicode properties in regex with \p, \P
Every Unicode codepoint has one or more properties, indicating the rules which apply to that codepoint. Perl’s regex engine is aware of these properties; use the \p{} metacharacter sequence to match a codepoint possessing a given property, and its inverse, \P{}, to match a codepoint lacking that property.
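As a minimal sketch of the difference between the two sequences, the snippet below counts codepoints with and without the Letter property. The sample string and variable names are illustrative assumptions, not part of the original recipe; the non-ASCII characters are written as escapes so the example does not depend on the source file's encoding.

```perl
use v5.14;

# "naïve café №5", with the non-ASCII characters written as escapes
my $text = "na\x{EF}ve caf\x{E9} \x{2116}5";

# \p{L} matches each codepoint that has the Letter property
my @letters = $text =~ /\p{L}/g;

# \P{L} matches each codepoint that lacks it (here: spaces, a symbol, a digit)
my @others = $text =~ /\P{L}/g;

say scalar(@letters), " letters, ", scalar(@others), " non-letters";
```

Every codepoint in the string falls into exactly one of the two lists, because \P{L} is the complement of \p{L}.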
Each property has a short name and a long name. For example, to match any codepoint which has the Letter property, you may use \p{Letter} or \p{L}. Similarly, to match any codepoint which lacks the Uppercase property, you may use \P{Uppercase} or \P{Upper}. The “Unicode Character Properties” section of perldoc perlunicode describes these properties in greater detail.
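The equivalence of long and short names can be checked with a small sketch like the following. The test word and variable names are assumptions chosen for illustration.

```perl
use v5.14;

# GREEK CAPITAL LETTER OMEGA followed by ASCII "mega"
my $word = "\x{3A9}mega";

# Long and short names select the same set of codepoints
my $letters_long  = () = $word =~ /\p{Letter}/g;
my $letters_short = () = $word =~ /\p{L}/g;

# Both spellings of the negated form match codepoints that are not uppercase
my $not_upper_long  = () = $word =~ /\P{Uppercase}/g;
my $not_upper_short = () = $word =~ /\P{Upper}/g;

say "Letter: $letters_long, L: $letters_short";
say "not Uppercase: $not_upper_long, not Upper: $not_upper_short";
```

The `() =` idiom forces list context so that each scalar receives the number of matches rather than a true/false value.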
Examples of these properties useful in regex include:
\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script=Latin}, \p{script=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
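A few of the properties listed above can be tried against a mixed-script string, as in this hedged sketch. The sample string and variable names are assumptions made for the example; the counts depend on the Unicode data shipped with your perl, so the script prints them rather than hard-coding them.

```perl
use v5.14;

# Latin "abc", Greek alpha/beta/gamma, DIGIT FOUR, ROMAN NUMERAL FOUR,
# and the CJK ideograph U+4E2D, separated by single spaces
my $mixed = "abc \x{3B1}\x{3B2}\x{3B3} 4 \x{2163} \x{4E2D}";

my @greek  = $mixed =~ /\p{Script=Greek}/g;            # codepoints in the Greek script
my @fours  = $mixed =~ /\p{Numeric_Value=4}/g;         # anything whose numeric value is 4
my @wide   = $mixed =~ /\p{East_Asian_Width=Wide}/g;   # East Asian wide characters
my @number = $mixed =~ /\pN/g;                         # general category Number
my @space  = $mixed =~ /\pZ/g;                         # general category Separator

printf "%-8s %d\n", $_->[0], scalar @{ $_->[1] }
    for [ Greek => \@greek ], [ 'NV=4' => \@fours ], [ Wide => \@wide ],
        [ Number => \@number ], [ Separator => \@space ];
```

Note that \p{Numeric_Value=4} matches both the ASCII digit and the Roman numeral, which a plain `4` in a pattern would not.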