Perl Unicode Cookbook: Unicode Text in Stubborn Libraries

℞ 42: Unicode text in DBM hashes, the tedious way

While Perl 5 has long been very careful about handling Unicode correctly inside the world of Perl itself, every time you leave the Perl internals, you cross a boundary at which something may need to handle decoding and encoding. This happens when performing I/O across a network or to files, when speaking to a database, or even when using XS to call into a shared library from Perl.

For example, consider the core module DB_File, which allows you to use Berkeley DB files from Perl—persistent storage for key/value pairs.

Using a regular Perl string as a key or value for a DBM hash will trigger a wide character exception if any codepoints won’t fit into a byte. Here’s how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

 # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings;
    # the third argument, 1, is Encode::FB_CROAK (die on malformed data)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

 # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);
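
To illustrate what the encode and decode calls above do, independent of any DBM file, here is a minimal sketch comparing a character string with its UTF-8 octets. The sample string is an assumption chosen for illustration. The LEAVE_SRC flag is also an addition: according to the Encode documentation, a CHECK argument without it allows the input string to be overwritten in place, which matters if you plan to reuse the same variable later.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text (an assumption for illustration): "naive" with a diaeresis,
# a space, and U+263A, which does not fit into a single byte.
my $uni = "na\x{EF}ve \x{263A}";

# FB_CROAK: die on invalid data. LEAVE_SRC: do not overwrite the input.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $octets = encode("UTF-8", $uni, $check);    # byte string, safe to store
my $back   = decode("UTF-8", $octets, $check); # character string again

printf "characters: %d\n", length $uni;        # 7 characters
printf "octets:     %d\n", length $octets;     # 10 bytes of UTF-8
print  "round trip ok\n" if $back eq $uni;
```

The octet string is what crosses the boundary into the DBM file; the character string is what the rest of your program should see.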

By performing this manual encoding and decoding yourself, you know that your storage file will have a consistent representation of your data. The correct encoding depends on the type of data you store and the capabilities of the external code, of course.
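
Put together, the STORE and FETCH steps can be wrapped in a pair of small helpers. The following is only a sketch under stated assumptions: the names store_text and fetch_text are hypothetical, the temporary file path exists only so the example cleans up after itself, and LEAVE_SRC is added so that Encode does not modify its arguments. It assumes DB_File is installed with a working Berkeley DB library.

```perl
use strict;
use warnings;
use DB_File;
use Encode qw(encode decode);
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

# FB_CROAK: die on invalid data. LEAVE_SRC: do not overwrite the input.
my $CHECK = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical helper: encode key and value, then store the octets.
sub store_text {
    my ($db, $uni_key, $uni_value) = @_;
    $db->{ encode("UTF-8", $uni_key, $CHECK) }
        = encode("UTF-8", $uni_value, $CHECK);
    return;
}

# Hypothetical helper: encode the key, fetch the octets, decode the value.
sub fetch_text {
    my ($db, $uni_key) = @_;
    my $enc_value = $db->{ encode("UTF-8", $uni_key, $CHECK) };
    return defined $enc_value ? decode("UTF-8", $enc_value, $CHECK) : undef;
}

# Temporary location (an assumption; a real program would use its own path).
my $path = tempdir(CLEANUP => 1) . "/demo.db";

tie my %dbhash, "DB_File", $path, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot open $path: $!";
store_text(\%dbhash, "\x{3B1}\x{3B2}\x{3B3}", "caf\x{E9} \x{263A}");
untie %dbhash;

# Reopen the file to check that the stored octets decode back correctly.
tie my %again, "DB_File", $path, O_RDWR, 0666, $DB_HASH
    or die "Cannot reopen $path: $!";
my $value = fetch_text(\%again, "\x{3B1}\x{3B2}\x{3B3}");
untie %again;

print "round trip ok\n" if defined $value && $value eq "caf\x{E9} \x{263A}";
```

Keeping the encoding in one place like this means every key and value passes through the same conversion, so the file never ends up with a mix of representations.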

