Perl Unicode Cookbook: Names of CJK Codepoints
℞ 11: Names of CJK codepoints
CJK refers to Chinese, Japanese, and Korean. In the context of Unicode, it usually refers to the Han ideographs used in the modern Chinese and Japanese writing systems. As you can expect, pictoral languages such as Chinese make Unicode handling more complex.
Sinograms like “東京” come back with character names of CJK UNIFIED IDEOGRAPH-6771
and CJK UNIFIED IDEOGRAPH-4EAC
, because their “names” vary between languages. The CPAN Unicode::Unihan
module has a large database for decoding these (and a whole lot more), provided you know how to understand its output.
# cpan -i Unicode::Unihan
use Unicode::Unihan;
my $str = "東京";
my $unhan = Unicode::Unihan->new;
for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
printf "CJK $str in %-12s is ", $lang;
say $unhan->$lang($str);
}
prints:
CJK 東京 in Mandarin is DONG1JING1
CJK 東京 in Cantonese is dung1ging1
CJK 東京 in Korean is TONGKYENG
CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
If you have a specific romanization scheme in mind, use the specific module:
# cpan -i Lingua::JA::Romanize::Japanese
use Lingua::JA::Romanize::Japanese;
my $k2r = Lingua::JA::Romanize::Japanese->new;
my $str = "東京";
say "Japanese for $str is ", $k2r->chars($str);
prints:
Japanese for 東京 is toukyou
Previous: ℞ 10: Custom Named Characters
Series Index: The Standard Preamble
Tags
Feedback
Something wrong with this article? Help us out by opening an issue or pull request on GitHub