Perl Unicode Cookbook: The Standard Preamble

Apr 2, 2012 by Tom Christiansen

Editor’s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. This is the first recipe in the series.

℞ 0: Standard preamble

Unless otherwise noted, all examples in this cookbook require this standard preamble to work correctly, with the #! adjusted to work on your system:

 #!/usr/bin/env perl

 use utf8;      # so literals and identifiers can be in UTF-8
 use v5.12;     # or later to get "unicode_strings" feature
 use strict;    # quote strings, declare variables
 use warnings;  # enable warnings (use v5.12 does not turn them on)
 use warnings  qw(FATAL utf8);    # fatalize encoding glitches
 use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

Because use open puts an encoding layer on every stream you do not declare otherwise, even Unix programmers must binmode their binary streams, or open them with :raw, but that’s the only way to get at them portably anyway.
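
As a minimal sketch of both options, the following writes and reads a few binary bytes under the preamble. The scratch file from File::Temp and the sample bytes are assumptions made for illustration; they are not part of the original recipe.

```perl
#!/usr/bin/env perl
use utf8;
use v5.12;
use strict;
use warnings;
use warnings  qw(FATAL utf8);
use open      qw(:std :encoding(UTF-8));
use File::Temp qw(tempfile);

# Hypothetical scratch file, for illustration only.
my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp) or die "can't close $path: $!";

my $bytes = "\x89PNG\r\n\x1A\n";    # the 8-byte PNG signature

# Option 1: name the :raw layer in the open call itself.
open(my $out, '>:raw', $path) or die "can't write $path: $!";
print {$out} $bytes;
close($out) or die "can't close $path: $!";

# Option 2: open with the default layer, then binmode before any I/O.
open(my $in, '<', $path) or die "can't read $path: $!";
binmode($in) or die "can't binmode $path: $!";
my $got = do { local $/; <$in> };   # slurp the whole file
close($in) or die "can't close $path: $!";

say length($got);                   # 8
say $got eq $bytes ? "round trip ok" : "bytes were mangled";
```

An explicit layer in a three-argument open takes precedence over the default set by use open, so only the handles you leave undeclared are treated as UTF-8 text.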

WARNING: use autodie and use open do not get along with each other.
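
One way to sidestep the conflict, sketched here as an assumption rather than as the author's recommendation, is to leave use open out when autodie is in effect and name the encoding layer in every open call. The scratch file and the sample text are hypothetical. Whether your installed autodie still has this problem depends on its version.

```perl
#!/usr/bin/env perl
use utf8;
use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use autodie;                      # open, close, and binmode now die on failure
use File::Temp qw(tempfile);

# Without "use open", set the encoding of STDOUT by hand.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical scratch file, for illustration only.
my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp);

# Name the layer in every open rather than relying on a default.
open(my $out, '>:encoding(UTF-8)', $path);
print {$out} "café\n";
close($out);

open(my $in, '<:encoding(UTF-8)', $path);
chomp(my $line = <$in>);
close($in);

say length($line);                # 4
say $line;
```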

This combination of features sets Perl to a known state of Unicode compatibility and strictness, so that subsequent operations behave as you expect.
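
To make that concrete, here is a small sketch of what a few of those lines buy you. The sample strings are assumptions chosen for illustration, and the file must itself be saved as UTF-8 for the literal to be read correctly.

```perl
#!/usr/bin/env perl
use utf8;
use v5.12;
use strict;
use warnings;
use warnings  qw(FATAL utf8);
use open      qw(:std :encoding(UTF-8));
use charnames qw(:full :short);
use Encode    qw(encode);

# With "use utf8", this literal is five characters, not six bytes.
my $word  = "naïve";
my $chars = length($word);
my $bytes = length(encode('UTF-8', $word));
say "$chars characters, $bytes bytes";    # 5 characters, 6 bytes

# The "unicode_strings" feature (via "use v5.12") applies Unicode
# casing rules even to a string that fits in Latin-1.
my $upper = uc("stra\xDFe");              # \xDF is ß
say $upper;                               # STRASSE

# charnames: a full name and a short script-qualified alias.
my $alpha_full  = "\N{GREEK SMALL LETTER ALPHA}";
my $alpha_short = "\N{greek:alpha}";
say $alpha_full eq $alpha_short ? "same character" : "different";

# STDOUT already has an :encoding(UTF-8) layer from "use open qw(:std ...)",
# so printing a non-Latin-1 character raises no "Wide character" warning.
say $alpha_full;
```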

The recipes in this cookbook are:

℞ 0: The Standard Preamble
℞ 1: Always Decompose and Recompose
℞ 2: Fine-Tuning Unicode Warnings
℞ 3: Enable UTF-8 Literals
℞ 4: Characters and Their Numbers
℞ 5: Unicode Literals by Number
℞ 6: Get Character Names by Number
℞ 7: Get Character Number by Name
℞ 8: Unicode Named Characters
℞ 9: Unicode Named Character Sequences
℞ 10: Custom Named Characters
℞ 11: Names of CJK Codepoints
℞ 12: Explicit encode/decode
℞ 13: Decode @ARGV as UTF-8
℞ 14: Decode @ARGV as Local Encoding
℞ 15: Decode Standard Filehandles as UTF-8
℞ 16: Decode Standard Filehandles as Locale Encoding
℞ 17: Make File I/O Default to UTF-8
℞ 18: Make All I/O Default to UTF-8
℞ 19: Specify a File’s Encoding
℞ 20: Unicode Casing
℞ 21: Case-insensitive Comparisons
℞ 22: Match Unicode Linebreak Sequence
℞ 23: Get Character Categories
℞ 24: Disable Unicode-awareness in Builtin Character Classes
℞ 25: Match Unicode Properties in Regex
℞ 26: Custom Character Properties
℞ 27: Unicode Normalization
℞ 28: Convert non-ASCII Unicode Numerics
℞ 29: Match Unicode Grapheme Cluster in Regex
℞ 30: Extract by Grapheme Instead of Codepoint (regex)
℞ 31: Extract by Grapheme Instead of Codepoint (substr)
℞ 32: Reverse String by Grapheme
℞ 33: String Length in Graphemes
℞ 34: Unicode Column Width for Printing
℞ 35: Unicode Collation
℞ 36: Case- and Accent-insensitive Sorting
℞ 37: Unicode Locale Collation
℞ 38: Make cmp Work on Text instead of Codepoints
℞ 39: Case- and Accent-insensitive Comparison
℞ 40: Case- and Accent-insensitive Locale Comparisons
℞ 41: Unicode Linebreaking
℞ 42: Unicode Text in Stubborn Libraries
℞ 43: Unicode Text in DBM Files (the easy way)
℞ 44: Demo of Unicode Collation and Printing
℞ 45: Further Resources
