Encoding Hashed UIDs: Base64 vs. Hex vs. Base32
I recently looked at encodings for hashed UIDs, i.e. UIDs generated by a cryptographic hash algorithm such as SHA-1 or MD5. These are useful when the UID does not need to carry human meaning but should be uniform in character set and length. I considered Base64 and hexadecimal first because they are commonly provided by crypto libraries, and then settled on Base64 and Base32 where appropriate. Base36 is actually the most compact case-insensitive encoding (using Arabic numerals and Roman letters), but it is not an option for me at the moment because there is no Perl module for it that takes arbitrary text and binary input. Math::Base36 exists but only handles numbers.
- Base64 has the advantage of generating shorter representations because it uses a 64-character set instead of hexadecimal's 16 (6 bits per character versus 4). While Base64 can use any 64 characters, many implementations follow IETF PEM RFC 1421, which specifies A-Z, a-z, 0-9, + and /, with = for padding. Because + and / need to be URI-escaped, Wikipedia mentions a variant that uses * and - in place of + and / respectively and removes line breaks as well as padding. I used this for a while, but unfortunately * cannot be used in DOM ids. Because of this I've switched to using _ and - instead.
- Hexadecimal uses 0-9 and A-F and is case insensitive. The representations are longer due to the smaller character set, but case insensitivity can be advantageous in certain situations, such as full-text indexing with Xapian, which is case sensitive, and storage in MySQL, which compares strings case insensitively by default. It's common practice to lowercase all indexed terms in Xapian while reserving uppercase characters for boolean and probabilistic term prefixes. After thinking about how the query parser works for a bit, my conclusion is that lowercasing all the indexed terms will make life with Xapian a lot easier. To enable case sensitivity in MySQL, the table definition needs to be changed, for example by giving the column a binary or case-sensitive collation.
- Base32 is the middle ground between 16-character hex and 64-character Base64, but it is not usually included in crypto libraries. For UIDs, however, it is especially attractive: its larger character set gives shorter representations than hexadecimal, and it is case insensitive for applications that work better that way. Base32, as defined in IETF RFC 3548, uses the characters A-Z and 2-7. Since Base32 isn't included in many crypto libraries, an extra encoding step is needed; however, a hashed UID is only created once per entry, so the cost seems reasonable.
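To make the length trade-off concrete, here is a minimal sketch in Python (used only for illustration, not the Perl setup described in this post) that encodes a single SHA-1 digest in all three forms. The input string is a made-up placeholder; only the resulting lengths matter.

```python
import base64
import hashlib

# A SHA-1 digest is 20 raw bytes (160 bits). The input is a placeholder.
digest = hashlib.sha1(b"example entry").digest()

hex_form = digest.hex()                              # 4 bits per character
b32_form = base64.b32encode(digest).decode("ascii")  # 5 bits per character
b64_form = base64.b64encode(digest).decode("ascii")  # 6 bits per character

# Prints: 40 32 28
print(len(hex_form), len(b32_form), len(b64_form))

# The Base64 form ends in one "=" pad character; dropping it leaves 27.
print(len(b64_form.rstrip("=")))
```

For a 20-byte digest the Base32 form needs no padding, since 160 bits divide evenly into 5-bit groups. A 16-byte MD5 digest would not divide evenly and would come out padded.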
There are some modules on CPAN for Base32, including MIME::Base32 and Convert::Base32. For Base32, I'm currently using a patched copy of MIME::Base32 because the CPAN version only handles uppercase. For Base64, I'm using _ and - as the two non-alphanumeric characters.
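As a rough sketch of the two UID styles described here, the following Python (again only illustrative, not the patched MIME::Base32 code) produces an unpadded Base64 UID using - and _, and a lowercase Base32 UID. The function names b64_uid and b32_uid are hypothetical. The mapping of - for + and _ for / is an assumption borrowed from the common URL-safe alphabet, since the post does not say which character replaces which.

```python
import base64
import hashlib


def b64_uid(text: str) -> str:
    """Hypothetical helper: SHA-1 of text as unpadded URL-safe Base64."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # urlsafe_b64encode swaps "+" for "-" and "/" for "_" (assumed mapping).
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def b32_uid(text: str) -> str:
    """Hypothetical helper: SHA-1 of text as lowercase, unpadded Base32."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # b32encode emits uppercase A-Z and 2-7; lowercase it for indexing.
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


print(b64_uid("example entry"))  # 27 characters from A-Z, a-z, 0-9, - and _
print(b32_uid("example entry"))  # 32 characters from a-z and 2-7
```

Lowercasing the Base32 output loses nothing because the alphabet has no case-dependent symbols: uppercasing the UID again recovers the standard form, which can be decoded back to the original digest.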