<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Dev411 Blog: Category unicode</title>
  <subtitle type="html">John Wang on Technology</subtitle>
  <id>tag:www.dev411.com,2005:Typo</id>
  <generator uri="http://www.typosphere.org" version="4.0">Typo</generator>
  <link href="http://www.dev411.com/blog/xml/atom/category/feed.xml" rel="self" type="application/atom+xml"/>
  <link href="http://www.dev411.com/blog/tag/unicode" rel="alternate" type="text/html"/>
  <updated>2007-06-16T12:30:25-05:00</updated>
  <entry>
    <author>
      <name>John Wang</name>
    </author>
    <id>urn:uuid:559515eb-db34-4760-832f-3b103e81fc08</id>
    <published>2006-10-02T10:35:00-05:00</published>
    <updated>2007-06-16T12:30:25-05:00</updated>
    <title type="html">Perl, MySQL and UTF-8</title>
    <link href="http://www.dev411.com/blog/2006/10/02/perl-mysql-and-utf-8" rel="alternate" type="text/html"/>
    <category term="perl" scheme="http://www.dev411.com/blog/tag/perl" label="perl"/>
    <category term="mysql" scheme="http://www.dev411.com/blog/tag/mysql" label="mysql"/>
    <category term="unicode" scheme="http://www.dev411.com/blog/tag/unicode" label="unicode"/>
    <category term="orm" scheme="http://www.dev411.com/blog/tag/orm" label="orm"/>
    <summary type="html">&lt;p&gt;One of the mysteries of Perl to me is why there is still no UTF-8 support in DBD::mysql, even though the issue has been discussed on the msql-mysql-modules list since at least 2003 (according to the MARC archives) and MySQL itself supports UTF-8.&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;One of the mysteries of Perl to me is why there is still no UTF-8 support in DBD::mysql, even though the issue has been discussed on the msql-mysql-modules list since at least 2003 (according to the MARC archives) and MySQL itself supports UTF-8.&lt;/p&gt;

&lt;p&gt;When I first looked into this I found a few articles on this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.simplicidade.org/notes/archives/2005/12/utf8_and_dbdmys.html"&gt;utf-8 and DBD::mysql by Pedro Melo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html"&gt;Movable Type, MySQL, Perl, Unicode by Zakaria "Zack" Ajmal: provides a patch for Movable Type 3.2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pedro's article mentions that the reason this hasn't been done for DBD::mysql is that the DBI and DBD::mysql folks cannot decide where to put the UTF-8 implementation, i.e. in DBI itself or in the DBD drivers. As a result, there is still no built-in support. To get around this, numerous patches have been produced. Andrew Forrest even put together UTF-8 versions of DBI and CGI.pm (the link seems to be broken at the moment). However, some of these patches seem to have problems and are non-standard.&lt;/p&gt;

&lt;p&gt;If you prefer to use an ORM, DBIx::Class and Class::DBI get around this by implementing UTF-8 support in their own libraries with DBIx::Class::UTF8Columns and Class::DBI::utf8 respectively. I'd recommend DBIx::Class over Class::DBI since it has more functionality (e.g. built-in JOIN support) and is supposed to generate more efficient SQL.&lt;/p&gt;

&lt;p&gt;The interesting thing is that DBD::Pg for PostgreSQL has had built-in UTF-8 support for some time. While not an issue specific to the MySQL database, the UTF-8 Perl driver issue is something to consider when choosing between MySQL and PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Thanks to Dominic Mitchell for mentioning that the latest developer release, &lt;a href="http://search.cpan.org/~capttofu/DBD-mysql-3.0007_1/"&gt;DBD::mysql 3.0007_1&lt;/a&gt;, released on 8 Sep 2006, has integrated UTF-8 support. It's a developer release but good things are finally happening!&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>John Wang</name>
    </author>
    <id>urn:uuid:6b24eecf-4fda-43d5-94da-7497d3ed337c</id>
    <published>2006-09-29T13:21:00-05:00</published>
    <updated>2007-06-16T12:30:25-05:00</updated>
    <title type="html">Perl - Strictify utf8 to UTF-8</title>
    <link href="http://www.dev411.com/blog/2006/09/29/perl-strictify-utf8-to-UTF-8" rel="alternate" type="text/html"/>
    <category term="postgresql" scheme="http://www.dev411.com/blog/tag/postgresql" label="postgresql"/>
    <category term="perl" scheme="http://www.dev411.com/blog/tag/perl" label="perl"/>
    <category term="unicode" scheme="http://www.dev411.com/blog/tag/unicode" label="unicode"/>
    <summary type="html">&lt;p&gt;Perl has two UTF-8 encodings: &lt;span class="fix"&gt;utf8&lt;/span&gt;, which is Perl's liberal version, and &lt;span class="fix"&gt;UTF-8&lt;/span&gt;, which is a strict interpretation, aka &lt;span class="fix"&gt;utf-8-strict&lt;/span&gt;. The liberal version allows encoded characters outside the UTF-8 character set; however, you can run into problems when interoperating with applications that expect &lt;span class="fix"&gt;utf-8-strict&lt;/span&gt;, such as PostgreSQL. Here's a function I wrote to strictify &lt;span class="fix"&gt;utf8&lt;/span&gt; to &lt;span class="fix"&gt;UTF-8&lt;/span&gt; using the Encode core module:&lt;/p&gt;

&lt;pre&gt;use Encode;

sub strictify_utf8 {
    my $data = shift;
    if (Encode::is_utf8($data) &amp;&amp; !Encode::is_utf8($data,1)) {
        Encode::_utf8_off($data);
        Encode::from_to($data, 'utf8', 'UTF-8');
        Encode::_utf8_on($data);
    }
    return $data;
}&lt;/pre&gt;</summary>
    <content type="html">&lt;p&gt;Perl has two UTF-8 encodings: &lt;span class="fix"&gt;utf8&lt;/span&gt;, which is Perl's liberal version, and &lt;span class="fix"&gt;UTF-8&lt;/span&gt;, which is a strict interpretation, aka &lt;span class="fix"&gt;utf-8-strict&lt;/span&gt;. The liberal version allows encoded characters outside the UTF-8 character set; however, you can run into problems when interoperating with applications that expect &lt;span class="fix"&gt;utf-8-strict&lt;/span&gt;, such as PostgreSQL. Here's a function I wrote to strictify &lt;span class="fix"&gt;utf8&lt;/span&gt; to &lt;span class="fix"&gt;UTF-8&lt;/span&gt; using the Encode core module:&lt;/p&gt;

&lt;pre&gt;use Encode;

sub strictify_utf8 {
    my $data = shift;
    if (Encode::is_utf8($data) &amp;&amp; !Encode::is_utf8($data,1)) {
        Encode::_utf8_off($data);
        Encode::from_to($data, 'utf8', 'UTF-8');
        Encode::_utf8_on($data);
    }
    return $data;
}&lt;/pre&gt;
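
&lt;p&gt;To see that the two names really are distinct encodings, here is a minimal sketch (separate from the function above) that asks Encode which encoding object each spelling resolves to. The variable names are only for illustration:&lt;/p&gt;

&lt;pre&gt;use strict;
use warnings;
use Encode qw(find_encoding);

# Encode resolves the two spellings to different encoding objects.
my $liberal = find_encoding('utf8')-&gt;name;     # 'utf8'
my $strict  = find_encoding('UTF-8')-&gt;name;    # 'utf-8-strict'

print "utf8  resolves to $liberal\n";
print "UTF-8 resolves to $strict\n";&lt;/pre&gt;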

</content>
  </entry>
  <entry>
    <author>
      <name>John Wang</name>
    </author>
    <id>urn:uuid:72b931ef-8049-4229-aa6b-7f8eaff60fbe</id>
    <published>2006-09-29T13:00:00-05:00</published>
    <updated>2007-06-16T12:30:25-05:00</updated>
    <title type="html">Perl - Getting a Unicode Character's Hex Codepoint</title>
    <link href="http://www.dev411.com/blog/2006/09/29/perl-getting-a-unicode-characters-hex-codepoint" rel="alternate" type="text/html"/>
    <category term="perl" scheme="http://www.dev411.com/blog/tag/perl" label="perl"/>
    <category term="unicode" scheme="http://www.dev411.com/blog/tag/unicode" label="unicode"/>
    <summary type="html">&lt;p&gt;I recently responded to someone asking how to get a Unicode hex codepoint from a Unicode literal on DevShed Forums. Since I think it may be more generally useful, here's my solution. The following function takes a Unicode literal, converts it to a decimal representation using &lt;span class="fix"&gt;unpack&lt;/span&gt; and then converts it to hex using &lt;span class="fix"&gt;sprintf&lt;/span&gt;:&lt;/p&gt;

&lt;pre&gt;sub codepoint_hex {
    if (my $char = shift) {
        return sprintf '%2.2x', unpack('U0U*', $char);
    }
}

my $cp = codepoint_hex('&#12459;'); # eq '30ab'&lt;/pre&gt;</summary>
    <content type="html">&lt;p&gt;I recently responded to someone asking how to get a Unicode hex codepoint from a Unicode literal on DevShed Forums. Since I think it may be more generally useful, here's my solution. The following function takes a Unicode literal, converts it to a decimal representation using &lt;span class="fix"&gt;unpack&lt;/span&gt; and then converts it to hex using &lt;span class="fix"&gt;sprintf&lt;/span&gt;:&lt;/p&gt;

&lt;pre&gt;sub codepoint_hex {
    if (my $char = shift) {
        return sprintf '%2.2x', unpack('U0U*', $char);
    }
}

my $cp = codepoint_hex('&#12459;'); # eq '30ab'&lt;/pre&gt;

&lt;p&gt;Using &lt;span class="fix"&gt;unpack&lt;/span&gt; is a nice solution here because the U0 option will require the literal to be strict &lt;span class="fix"&gt;UTF-8&lt;/span&gt;, aka &lt;span class="fix"&gt;utf-8-strict&lt;/span&gt; and not Perl's liberal &lt;span class="fix"&gt;utf8&lt;/span&gt; version. It throws a warning if the literal is not valid &lt;span class="fix"&gt;UTF-8&lt;/span&gt;. The other nice thing about using unpack is it works whether the UTF-8 flag is on or not, removing one check.&lt;/p&gt;

&lt;p&gt;An alternative way to get the decimal representation from the literal is to use &lt;span class="fix"&gt;ord&lt;/span&gt; instead of &lt;span class="fix"&gt;unpack&lt;/span&gt;, but &lt;span class="fix"&gt;ord&lt;/span&gt; requires the UTF-8 flag to be on. This can be turned on with &lt;span class="fix"&gt;Encode::_utf8_on&lt;/span&gt;, but that just turns the flag on without checking for strict UTF-8, so you'll have to do that with another check.&lt;/p&gt;
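
&lt;p&gt;As a rough sketch of the flag-then-check approach described above, the snippet below shows what &lt;span class="fix"&gt;ord&lt;/span&gt; returns before and after the flag is turned on. It assumes that &lt;span class="fix"&gt;Encode::is_utf8&lt;/span&gt; with a true second argument is an acceptable well-formedness check for your data; note that it validates Perl's internal encoding rather than strict UTF-8. The variable names are illustrative:&lt;/p&gt;

&lt;pre&gt;use strict;
use warnings;
use Encode ();

# Raw UTF-8 bytes for KATAKANA LETTER KA (U+30AB); the UTF-8 flag is off.
my $bytes = "\xE3\x82\xAB";

# Without the flag, ord sees only the first byte.
my $first_byte = sprintf '%02x', ord $bytes;    # 'e3'

# Turn the flag on (no validation happens here), then check separately.
my $chars = $bytes;
Encode::_utf8_on($chars);
die "not well-formed utf8\n" unless Encode::is_utf8($chars, 1);

my $codepoint = sprintf '%04x', ord $chars;     # '30ab'&lt;/pre&gt;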

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; &lt;a href="http://bulknews.typepad.com/"&gt;miyagawa&lt;/a&gt; just showed me how to use &lt;span class="fix"&gt;ord&lt;/span&gt; with Encode:&lt;/p&gt;

&lt;pre&gt;use Encode;

sub codepoint_hex {
    sprintf "%04x", ord Encode::decode("UTF-8", shift);
}&lt;/pre&gt;</content>
  </entry>
</feed>

