Perl - Getting a Unicode Character's Hex Codepoint
Posted in perl, unicode Fri, 29 Sep 2006 18:00:00 GMT
I recently responded to someone on DevShed Forums who asked how to get a Unicode hex codepoint from a Unicode literal. Since I think it may be more generally useful, here's my solution. The following function takes a Unicode literal, converts it to its integer codepoint using unpack, and then formats that number as hex using sprintf:
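sub codepoint_hex {
    my $char = shift;
    # Test length rather than truth so that the literal '0' is not skipped.
    return unless defined $char && length $char;
    # unpack yields the codepoint as an integer; sprintf formats it as hex.
    return sprintf '%2.2x', unpack('U0U*', $char);
}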
my $cp = codepoint_hex('カ'); # eq '30ab'
Using unpack is a nice solution here because the U0 option requires the literal to be strict UTF-8 (aka utf-8-strict) rather than Perl's more liberal utf8 variant, and it throws a warning if the literal is not valid UTF-8. The other nice thing about unpack is that it works whether or not the UTF-8 flag is on, which removes one check.
An alternative way to get the integer codepoint from the literal is to use ord instead of unpack, but ord requires the UTF-8 flag to be on. The flag can be turned on with Encode::_utf8_on, but that only sets the flag without checking for strict UTF-8, so you'll have to validate the data with a separate check.
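To make the difference concrete, here is a minimal sketch of what ord sees for the same character as raw bytes and as a decoded string. The variable names are made up for illustration, and decoding is done with Encode::decode here rather than by flipping the flag by hand:

```perl
use strict;
use warnings;
use Encode ();

# The UTF-8 encoding of U+30AB, written as raw bytes (UTF-8 flag off).
my $bytes = "\xe3\x82\xab";

# The same data decoded into a character string (UTF-8 flag on).
my $chars = Encode::decode('UTF-8', $bytes);

# On the byte string, ord only sees the first byte of the sequence.
my $from_bytes = sprintf '%02x', ord $bytes;   # 'e3'

# On the decoded string, ord returns the codepoint.
my $from_chars = sprintf '%04x', ord $chars;   # '30ab'

print "bytes: $from_bytes, chars: $from_chars\n";
```

utf8::is_utf8 can be used to inspect the flag on either variable if you want to confirm which kind of string you are holding.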
Update: miyagawa just showed me how to use ord with Encode:
use Encode;
sub codepoint_hex {
    sprintf "%04x", ord Encode::decode("UTF-8", shift);
}
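If you also want malformed input to be rejected rather than silently decoded with substitution characters, one possible variation is to pass a CHECK argument to decode. This is only a sketch under my own assumptions: the name codepoint_hex_strict is hypothetical, and returning undef on bad input is a design choice, not something from the original answer.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the hex codepoint of the first character,
# or undef if the input is empty or is not well-formed UTF-8.
sub codepoint_hex_strict {
    my $bytes = shift;
    return undef unless defined $bytes && length $bytes;

    # FB_CROAK makes decode die on malformed data; LEAVE_SRC keeps
    # decode from modifying the input string in place.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return undef unless defined $chars && length $chars;

    return sprintf '%04x', ord $chars;
}

my $ok  = codepoint_hex_strict("\xe3\x82\xab");   # '30ab'
my $bad = codepoint_hex_strict("\xff\xfe");       # undef, not valid UTF-8
```

Wrapping the call in eval keeps the failure local, so the caller can decide whether a bad literal should be fatal.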
If you need the character names from the code number, look at the charnames module.
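As a small sketch of that lookup, charnames::viacode takes a codepoint and returns its name, and charnames::vianame goes the other way. The variable names below are only for illustration:

```perl
use strict;
use warnings;
use charnames ();

# Look up the name for a hex codepoint string such as the one
# returned by codepoint_hex above.
my $name = charnames::viacode(hex '30ab');   # 'KATAKANA LETTER KA'

# And back again: from the full name to the integer codepoint.
my $code = charnames::vianame('KATAKANA LETTER KA');

printf "%s is U+%04X\n", $name, $code;
```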