Moments of Computer Science: Why distributed data stores matter

After our recent discussion of Chord, a distributed hash table (DHT) implementation, I wanted to understand how its ideas relate to the other persistent storage technologies that have been getting attention recently, such as the NoSQL solutions. Chord's decentralized approach to data lookup keeps it fault-tolerant as nodes enter and leave the system. Most companies that use such data stores don't have to worry about autonomous nodes entering and leaving at will, but they do need an easy way to add nodes as their business scales upward. Their storage systems, which consist of large clusters of commodity servers, must also handle node failures gracefully, and that requires both redundant data storage and stabilization algorithms. Lastly, many companies need to perform distributed computations on large data sets, which the distributed nature of these data stores makes possible.
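To make the ring idea concrete, here is a minimal single-process sketch of successor-based key placement with redundant copies. It is only an illustration under my own assumptions, not Chord's actual protocol: there are no finger tables, no networking, and no stabilization, and the names `HashRing`, `ring_position`, and `owners` are hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str, bits: int = 32) -> int:
    """Map a string onto a ring of 2**bits positions using SHA-1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class HashRing:
    """Toy hash ring: a key belongs to the first node at or after its position."""

    def __init__(self, replicas: int = 2):
        self.replicas = replicas   # how many nodes hold a copy of each key
        self._positions = []       # sorted ring positions of the nodes
        self._nodes = {}           # ring position -> node name

    def add_node(self, name: str) -> None:
        pos = ring_position(name)
        if pos in self._nodes:
            raise ValueError("ring position already taken")
        bisect.insort(self._positions, pos)
        self._nodes[pos] = name

    def remove_node(self, name: str) -> None:
        pos = ring_position(name)
        self._positions.remove(pos)
        del self._nodes[pos]

    def owners(self, key: str) -> list:
        """Return the key's successor node plus the next nodes clockwise."""
        if not self._positions:
            return []
        start = bisect.bisect_left(self._positions, ring_position(key))
        count = min(self.replicas, len(self._positions))
        size = len(self._positions)
        return [self._nodes[self._positions[(start + i) % size]]
                for i in range(count)]


if __name__ == "__main__":
    ring = HashRing(replicas=2)
    for name in ("node-a", "node-b", "node-c"):
        ring.add_node(name)
    print(ring.owners("user:42"))
    ring.remove_node("node-b")   # simulate a node failure or departure
    print(ring.owners("user:42"))
```

When a node is removed, only the keys it was responsible for change their primary owner, and the new primary is the node that already held the second copy. That property is what makes adding and losing nodes cheap compared to rehashing everything.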

A few interesting commercial and open source implementations of distributed databases are the Apache Cassandra project, Google's proprietary BigTable, and the Hadoop-compatible Hypertable. While Cassandra, based on Dynamo, uses a DHT internally, BigTable and Hypertable both use a different data storage format. Wikipedia describes BigTable as a "sparse, distributed multi-dimensional sorted map".
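As a rough illustration of what a "sparse, multi-dimensional sorted map" can look like, here is a small in-memory sketch. It assumes cells are keyed by (row, column, timestamp), leaves out the "distributed" part entirely, and uses a hypothetical class name, `SparseSortedMap`; it is not how BigTable or Hypertable is implemented.

```python
import bisect


class SparseSortedMap:
    """Toy sorted map keyed by (row, column, timestamp).

    Only cells that were actually written take up space (sparse), and keys
    are kept in sorted order so a range of rows can be scanned cheaply.
    """

    def __init__(self):
        self._keys = []     # sorted list of (row, column, timestamp) tuples
        self._values = {}   # (row, column, timestamp) -> value

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_rows(self, start_row: str, end_row: str):
        """Yield (key, value) pairs with start_row <= row < end_row, in key order."""
        # A 1-tuple sorts before every 3-tuple that starts with the same row.
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (end_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


if __name__ == "__main__":
    table = SparseSortedMap()
    table.put("com.example", "contents", 2, "<html>v2</html>")
    table.put("com.example", "contents", 1, "<html>v1</html>")
    table.put("com.cnn", "contents", 1, "<html>news</html>")
    table.put("org.apache", "contents", 1, "<html>asf</html>")
    for key, value in table.scan_rows("com", "con"):
        print(key, value)
```

Because the keys are sorted, a contiguous range of rows could be handed to a single server, which is a different way of splitting data across machines than hashing keys onto a ring.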

While Chord itself may not have caught on in large commercial organizations like Yahoo, Google, or Facebook, it would be interesting to learn how much DHTs like Chord have influenced the design of the next generation of persistent data stores, and which problems those stores focus on solving compared to Chord.

Thursday, September 16, 2010
