Planning Ahead for Open Source Storage Scaling
Posted in scalability Tue, 24 Oct 2006 16:56:00 GMT
Recently eWeek ran an article on eHarmony's choice of storage scaling solution, which discussed how they went with proprietary solutions from 3PAR and ONStor. I was hoping to learn something interesting about their deployment architecture, but the most interesting things I learned were that eHarmony has 8+ million users and 9+ million photos, and which proprietary vendors they chose. Some interesting quotes from Mark Douglas, eHarmony's VP of Technology:
- "We find ourselves having to buy storage about every 90 days."
- "The other solutions we considered had a learning curve and a level of complexity that we just didn't want to undertake."
- "There was going to be a lot of hands-on work to do with our six years' worth of data. We wanted a more automated system, for sure."
It seems like what happened is that they didn't plan for growth, and by the time it hit them they were too busy and didn't want to deal with it. Going with proprietary solutions seemed like the easy way out. However, one has to wonder whether relying on proprietary solutions is a good decision for future scaling needs. In Steve Bryant's article "Top 10 Reasons It's Almost Impossible to Compete with Google" he lists distributed infrastructure as the very first reason:
- Huge, Distributed Infrastructure -- The obvious advantage is Google's huge infrastructure, which is distributed across 450,000+ servers across the globe. By distributing its infrastructure, Google decreases router and switch delays and delivers faster performance to its worldwide users. Not only is search faster, but products work better too.
One thing that has been popularized about Google is their massive use of cheap, commodity hardware rather than large proprietary systems like those that 3PAR and ONStor seem to build. While Google uses the closed-source GoogleFS, there are some similar FOSS solutions, namely SixApart/Danga's MogileFS. MogileFS was built for LiveJournal because the alternatives were, according to Brad Fitzpatrick's 2005 OSCON presentation (pdf):
- closed, non-existent, expensive, in development, complicated, ...
- scary/impossible when it came to data recovery
The PDF presentation is a bit long at 80 slides because it covers all of LiveJournal. I've extracted the MogileFS slides, just 13 of them, to give you an overview. If you are interested, it does make sense to read the full presentation because it also goes over Perlbal and memcached. One great thing about MogileFS is that its automatic ability to make multiple backup copies means RAID and tape backup are not required, a big difference compared to the "big iron" solutions that 3PAR and ONStor seem to provide. SixApart continues to support MogileFS as a FOSS project and recently held a MogileFS Users/Developers Summit at their San Francisco headquarters. Although MogileFS is the primary stable and proven FOSS DFS at the moment, there are others in development, including the Hadoop DFS, which is part of the Apache Lucene project.
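To make the "multiple copies instead of RAID" idea concrete, here is a minimal sketch of placing each file's copies on distinct storage hosts, so that losing any one host still leaves other copies available. This is not MogileFS's actual placement policy or API; the function names, host names, and the hash-based ranking (a simple form of rendezvous hashing) are all assumptions made up for illustration.

```python
import hashlib


def place_replicas(key, hosts, copies=2):
    """Pick `copies` distinct hosts for a file key.

    Hypothetical helper for illustration only; it does not reproduce
    how MogileFS itself chooses devices.
    """
    if copies > len(hosts):
        raise ValueError("not enough hosts for the requested copy count")
    # Rank hosts by a hash of (key, host) so placement is deterministic
    # and different keys tend to land on different hosts.
    ranked = sorted(
        hosts,
        key=lambda h: hashlib.md5(f"{key}:{h}".encode()).hexdigest(),
    )
    return ranked[:copies]


def surviving_copies(placement, failed_host):
    """Return the hosts that still hold the file after one host fails."""
    return [h for h in placement if h != failed_host]


# Hypothetical cluster of four commodity storage hosts.
hosts = ["store1", "store2", "store3", "store4"]
placement = place_replicas("userpic/12345.jpg", hosts, copies=3)
```

Because every copy sits on a different machine, a whole-host failure costs at most one copy of any file, which is the property that lets cheap hardware stand in for RAID in this kind of design.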
The interesting thing about eHarmony's choice is that MogileFS is a free open-source solution that can more than fulfill their needs. LiveJournal has 60+ million images comprising 6-7TB of information stored in their MogileFS. A little forward planning can help your site scale storage without having to rely on proprietary solutions.
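As a rough back-of-the-envelope check on those numbers, the sketch below works out LiveJournal's average image size and applies it to eHarmony's 9+ million photos. The big assumption is that eHarmony's photos are about the same size as LiveJournal's images, which neither article states, and that a terabyte means 10^12 bytes. The three-copy figure is likewise just an illustrative replication level, not a known setting.

```python
TB = 10**12  # decimal terabyte; an assumption, the post does not say which unit


def avg_bytes_per_image(total_tb, image_count):
    """Average stored bytes per image for a given total size."""
    return total_tb * TB / image_count


def estimate_tb(image_count, avg_bytes, copies=1):
    """Estimated storage in TB for image_count images kept `copies` times."""
    return image_count * avg_bytes * copies / TB


# LiveJournal figures from the post: 60+ million images in 6-7 TB.
lj_low = avg_bytes_per_image(6, 60_000_000)
lj_high = avg_bytes_per_image(7, 60_000_000)

# eHarmony figure from the eWeek article: 9+ million photos.
# Assumes (hypothetically) the same average image size as LiveJournal.
eh_low = estimate_tb(9_000_000, lj_low)
eh_high = estimate_tb(9_000_000, lj_high)
eh_high_3x = estimate_tb(9_000_000, lj_high, copies=3)

print(f"LiveJournal average: {lj_low / 1000:.0f}-{lj_high / 1000:.0f} KB per image")
print(f"eHarmony estimate: {eh_low:.2f}-{eh_high:.2f} TB, "
      f"or up to {eh_high_3x:.2f} TB with three copies")
```

Under those assumptions the averages come out to roughly 100 to 117 KB per image, which puts 9 million photos at around 1 TB for a single copy and a little over 3 TB with three copies, comfortably below what LiveJournal already stores.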