<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Dev411 Blog: Category xapian</title>
  <subtitle type="html">John Wang on Technology</subtitle>
  <id>tag:www.dev411.com,2005:Typo</id>
  <generator uri="http://www.typosphere.org" version="4.0">Typo</generator>
  <link href="http://www.dev411.com/blog/xml/atom/category/feed.xml" rel="self" type="application/atom+xml"/>
  <link href="http://www.dev411.com/blog/tag/xapian" rel="alternate" type="text/html"/>
  <updated>2009-01-26T23:40:31-06:00</updated>
  <entry>
    <author>
      <name>John Wang</name>
    </author>
    <id>urn:uuid:c8c378a0-ac23-442c-8790-de04233f28a4</id>
    <published>2009-01-25T23:53:00-06:00</published>
    <updated>2009-01-26T23:40:31-06:00</updated>
    <title type="html">Xapian and Lucene - Kudos to Xapian from YouSport.com</title>
    <link href="http://www.dev411.com/blog/2009/01/25/xapian-and-lucene-kudos-to-xapian-from-yousport-com" rel="alternate" type="text/html"/>
    <category term="xapian" scheme="http://www.dev411.com/blog/tag/xapian" label="xapian"/>
    <summary type="html">&lt;p&gt;A few years ago I picked Xapian after evaluating a number of solutions. More recently, the popularity surge of Lucene had me curious to learn about it. I needed to do a rip and replace of MySQL fulltext search due to scaling issues, so I decided to check out clucene. I quickly found out that its API was not as up to date as Lucene's (a fast-moving target) and that the mailing list had seen only 4 posts in the last year or so. That led me to move away from clucene. After that, I was told to check out Solr as an easy way to use Lucene without needing to write Java. I replaced MySQL with Xapian but still had Solr in the back of my mind to check out.&lt;/p&gt;

&lt;p&gt;Recently, &lt;a href="http://lists.xapian.org/pipermail/xapian-discuss/2009-January/006345.html"&gt;an email from Jonathan Drake&lt;/a&gt;, Senior Developer at YouSport.com, came across the xapian-discuss mailing list. It said:&lt;/p&gt;

&lt;blockquote&gt;&lt;em&gt;We were using Solr before but it was constantly causing headaches in terms of scalability and complexity. I gave Xapian a go and so far I'm blown away by how awesome it is. Its incredibly lightweight, its scaled a 100 times better and everyone involved is happier.&lt;/em&gt;&lt;/blockquote&gt;

&lt;p&gt;I'm curious to hear what scaling and complexity problems they faced, but it's good to hear a strong endorsement of Xapian from a former Solr developer. That, and a quick check of the &lt;a href="http://xapian.org/users.php"&gt;current users&lt;/a&gt; page listing del.icio.us with over 100 million documents, seems to indicate that Xapian remains a strong contender in the search space. That being said, I work with very scalable Lucene-based solutions as well, just in Java projects.&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;A few years ago I picked Xapian after evaluating a number of solutions. More recently, the popularity surge of Lucene had me curious to learn about it. I needed to do a rip and replace of MySQL fulltext search due to scaling issues, so I decided to check out clucene. I quickly found out that its API was not as up to date as Lucene's (a fast-moving target) and that the mailing list had seen only 4 posts in the last year or so. That led me to move away from clucene. After that, I was told to check out Solr as an easy way to use Lucene without needing to write Java. I replaced MySQL with Xapian but still had Solr in the back of my mind to check out.&lt;/p&gt;

&lt;p&gt;Recently, &lt;a href="http://lists.xapian.org/pipermail/xapian-discuss/2009-January/006345.html"&gt;an email from Jonathan Drake&lt;/a&gt;, Senior Developer at YouSport.com, came across the xapian-discuss mailing list. It said:&lt;/p&gt;

&lt;blockquote&gt;&lt;em&gt;We were using Solr before but it was constantly causing headaches in terms of scalability and complexity. I gave Xapian a go and so far I'm blown away by how awesome it is. Its incredibly lightweight, its scaled a 100 times better and everyone involved is happier.&lt;/em&gt;&lt;/blockquote&gt;

&lt;p&gt;I'm curious to hear what scaling and complexity problems they faced, but it's good to hear a strong endorsement of Xapian from a former Solr developer. That, and a quick check of the &lt;a href="http://xapian.org/users.php"&gt;current users&lt;/a&gt; page listing del.icio.us with over 100 million documents, seems to indicate that Xapian remains a strong contender in the search space. That being said, I work with very scalable Lucene-based solutions as well, just in Java projects.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <author>
      <name>John Wang</name>
    </author>
    <id>urn:uuid:54765f65-4ef7-4bd7-834b-8631923cf4ce</id>
    <published>2006-10-02T03:08:00-05:00</published>
    <updated>2007-06-16T12:30:25-05:00</updated>
    <title type="html">Encoding Hashed UIDs: Base64 vs. Hex vs. Base32</title>
    <link href="http://www.dev411.com/blog/2006/10/02/encoding-hashed-uids-base64-vs-hex-vs-base32" rel="alternate" type="text/html"/>
    <category term="perl" scheme="http://www.dev411.com/blog/tag/perl" label="perl"/>
    <category term="mysql" scheme="http://www.dev411.com/blog/tag/mysql" label="mysql"/>
    <category term="xapian" scheme="http://www.dev411.com/blog/tag/xapian" label="xapian"/>
    <summary type="html">&lt;p&gt;I recently looked at using various encodings for hashed UIDs, e.g. UIDs generated by a cryptographic hash algorithm such as SHA-1 or MD5. These are often useful when the UID does not need to have human meaning but should exhibit some uniformity, such as character set and length. I considered Base64 and hexadecimal first because they are commonly used by crypto libraries and then decided on Base64 and Base32 where appropriate. Base36 is actually the most compact case-insensitive encoding (using Arabic numerals and Roman letters) but is not an option for me at the moment because there's no Perl module for it that will take arbitrary text and binary input. &lt;a href="http://search.cpan.org/~rhenssel/Math-Base36-0.02/Base36.pm"&gt;Math::Base36&lt;/a&gt; exists but only handles numbers.&lt;/p&gt;</summary>
    <content type="html">&lt;p&gt;I recently looked at using various encodings for hashed UIDs, e.g. UIDs generated by a cryptographic hash algorithm such as SHA-1 or MD5. These are often useful when the UID does not need to have human meaning but should exhibit some uniformity, such as character set and length. I considered Base64 and hexadecimal first because they are commonly used by crypto libraries and then decided on Base64 and Base32 where appropriate. Base36 is actually the most compact case-insensitive encoding (using Arabic numerals and Roman letters) but is not an option for me at the moment because there's no Perl module for it that will take arbitrary text and binary input. &lt;a href="http://search.cpan.org/~rhenssel/Math-Base36-0.02/Base36.pm"&gt;Math::Base36&lt;/a&gt; exists but only handles numbers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Base64&lt;/strong&gt; has the advantage of generating shorter representations because it uses a 64-character set instead of hexadecimal's 16. While Base64 can use any 64 characters, many implementations follow &lt;a href="http://tools.ietf.org/html/rfc1421"&gt;IETF PEM RFC 1421&lt;/a&gt;, which specifies A-Z, a-z, 0-9, / and +, with = used for padding. Because + and / need to be URI escaped, Wikipedia mentions a variant that uses * and - in place of + and / respectively and removes line breaks as well as padding. I used this for a while but, unfortunately, * cannot be used in DOM ids. Because of this I've switched to using _ and - instead.&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Hexadecimal&lt;/strong&gt; uses 0-9 and A-F and is case insensitive. The representations are longer due to the smaller character set, but case insensitivity can be advantageous in certain situations, such as full-text indexing with Xapian, which is case sensitive, and with MySQL, which is case insensitive by default. It's common practice to lowercase all the indexed terms in Xapian while reserving uppercase characters for boolean and probabilistic term prefixes. After thinking about how the query parser works for a bit, my conclusion is that lowercasing all the indexed terms will make life with Xapian a lot easier. To &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html"&gt;enable case sensitivity in MySQL&lt;/a&gt;, the table definition needs to be changed.&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Base32&lt;/strong&gt; is the middle ground between the 16-character hex and 64-character Base64 encodings, but is rarely included in crypto libraries. For UIDs, however, it is especially attractive because it uses a larger character set than hexadecimal, resulting in shorter representations, and is case insensitive for applications that work better that way. Base32, as defined in &lt;a href="http://tools.ietf.org/html/rfc3548"&gt;IETF RFC 3548&lt;/a&gt;, uses the characters A-Z and 2-7. Since Base32 isn't included in many crypto libraries, an extra encoding step is needed. However, creating a hashed UID only happens once per entry, so the extra step seems reasonable.&lt;/li&gt;
&lt;/ul&gt;
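
&lt;p&gt;The trade-offs above can be sketched in a few lines. This is only an illustration, not the code used on this site: it is written in Python rather than Perl, the function name encode_uid is hypothetical, and the Base64 branch assumes the URL-safe alphabet in which - replaces + and _ replaces /. The Base32 output is lowercased to match the lowercase indexing convention described for Xapian.&lt;/p&gt;

&lt;pre&gt;
```python
import base64
import hashlib


def encode_uid(data, encoding):
    """Hash bytes with SHA-1 and return the 20-byte digest as text.

    encode_uid is a hypothetical helper written for this illustration.
    """
    digest = hashlib.sha1(data).digest()
    if encoding == "hex":
        # 16-character alphabet, 4 bits per character
        return digest.hex()
    if encoding == "base32":
        # A-Z and 2-7 alphabet, 5 bits per character; lowercased for
        # case-insensitive use, with any = padding stripped
        return base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    if encoding == "base64url":
        # Assumed URL-safe alphabet (- and _ instead of + and /),
        # 6 bits per character, with = padding stripped
        return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    raise ValueError("unknown encoding: " + encoding)


if __name__ == "__main__":
    for name in ("hex", "base32", "base64url"):
        uid = encode_uid(b"example entry", name)
        print(name, len(uid), uid)
```
&lt;/pre&gt;

&lt;p&gt;For a 160-bit SHA-1 digest this yields 40 characters in hex, 32 in Base32 and 27 in unpadded Base64, which is the length ordering described in the list above.&lt;/p&gt;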

&lt;p&gt;There are some modules on CPAN for Base32 including &lt;a href="http://search.cpan.org/~danpeder/MIME-Base32-1.01/Base32.pm"&gt;MIME::Base32&lt;/a&gt; and &lt;a href="http://search.cpan.org/~miyagawa/Convert-Base32-0.02/lib/Convert/Base32.pm"&gt;Convert::Base32&lt;/a&gt;. For Base32, I'm currently using a patched copy of MIME::Base32 because the CPAN version only handles uppercase. For Base64, I'm using _ and - as the two non-alphanumeric characters.&lt;/p&gt;</content>
  </entry>
</feed>

