Saturday, September 05, 2009

Distributed Hash Tables

I've recently taken a look at DHTs, also called Distributed Hash Tables. A DHT is not only a way to store data in some cloud of computers, it can also serve as a resilient, robust cloud of storage. It's basically a technology with a very simple manipulation interface (get, set, remove). The differences between specific DHTs relate to the geography and organization of nodes, data versioning, hash generation, language of implementation and so on. If you want a clean implementation, look for projects like bamboo, openchord, opendht, and the like. Other projects, like Project Voldemort, already build in some more functionality.
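To make the interface concrete, here's a minimal sketch of that get/set/remove surface. This is just an illustration with a local dict standing in for the network; a real DHT would route each hashed key to the node responsible for it.

```python
import hashlib

class ToyDHT:
    """Toy stand-in for a DHT client: same interface, no network."""

    def __init__(self):
        self._store = {}

    def _hash(self, key):
        # Keys are hashed; in a real DHT the hash decides which node owns the entry.
        return hashlib.sha1(key.encode()).hexdigest()

    def set(self, key, value):
        self._store[self._hash(key)] = value

    def get(self, key):
        return self._store.get(self._hash(key))

    def remove(self, key):
        self._store.pop(self._hash(key), None)

dht = ToyDHT()
dht.set("/work/myfiles/important-document", "node-7")
print(dht.get("/work/myfiles/important-document"))   # node-7
dht.remove("/work/myfiles/important-document")
print(dht.get("/work/myfiles/important-document"))   # None
```

That's really all there is to the core interface; everything else in this post is built on top of these three operations.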

There are some disadvantages to the use of DHT:
  • It doesn't provide absolute guarantees on data consistency and integrity (but doesn't necessarily make a total mess either, check Amazon's paper on "Dynamo").
  • It's not very useful for "group" queries, range queries or other kinds of data lookups.
  • It doesn't natively support events or triggers very well.
  • There is no authority in the network, so nodes have to cooperate among themselves whenever certain decisions need to be made.
  • Lookup of data is O(log(n)) and may take two or more seconds, depending on the physical locations of the nodes, how many there are, and the individual latencies between them.
There are clear advantages too:
  • It's highly resilient to network leavers/joiners or other big changes around the network. Most implementations handle node changes very well.
  • Data is automatically distributed, according to specific configuration options. How the data is eventually organized depends on the strategy chosen in the design; for example, the way nodes are arranged together (in a ring, for instance) is very important.
  • Data is replicated across nodes, so it's difficult to lose it.
  • The removal of any single node doesn't impact the network overall in any way.
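The last advantage comes from consistent hashing: nodes and keys live on the same hash ring, and removing a node only reassigns the keys that node owned. A rough sketch (node names and key counts are made up for the illustration):

```python
import hashlib
from bisect import bisect

def h(s):
    """Hash a string onto the ring as a big integer."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)
        self.hashes = [hv for hv, _ in self.ring]

    def owner(self, key):
        # A key belongs to the first node clockwise from its hash.
        i = bisect(self.hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
before = HashRing(nodes)
after = HashRing([n for n in nodes if n != "nodeC"])  # nodeC leaves

keys = [f"key-{i}" for i in range(1000)]
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved")  # only keys nodeC owned move
```

Every key that was not on nodeC keeps its owner, which is exactly why a single node leaving barely disturbs the rest of the network.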
Now, if you read the Wikipedia page, you'll also see products that use this technology. BitTorrent is a protocol that is very effective at distributing file parts, but before you can start sharing a file, you need a way to find it. Files are found by creating hash keys for them, which act as a kind of alternate key. By setting up a server where you can browse content, backed by an index of the files in the network, you can find what you're looking for; downloading a .torrent file then gives you the hash key used to find peers in the network that can serve that content. From that point onwards, you can communicate with the individual peers to start downloading.

Because keys allow multiple values, you can add yourself as a peer sharing the file as well, so that others can negotiate with you over the file parts you may have that they don't. So, in this context, the DHT is used as a very effective mechanism for finding peers to communicate with.
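The multi-value key idea can be sketched like this. A dict of sets stands in for the DHT, and the announce/lookup names are my own, not BitTorrent's actual protocol messages:

```python
from collections import defaultdict

peers_by_infohash = defaultdict(set)   # stand-in for the real DHT

def announce(infohash, peer):
    """Register ourselves as a peer serving this content."""
    peers_by_infohash[infohash].add(peer)

def lookup(infohash):
    """Find peers we can negotiate file parts with."""
    return sorted(peers_by_infohash[infohash])

announce("abc123", "10.0.0.5:6881")
announce("abc123", "10.0.0.9:6881")   # we join as a peer too
print(lookup("abc123"))               # ['10.0.0.5:6881', '10.0.0.9:6881']
```

The key is the content's hash; the values are whoever currently serves it, and any downloader can append itself to that set.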

There are other uses of DHT once you start building some logic on top of this simple interface. You could create a virtual file system, for example. The key "/" can be loaded with a set of values, which are the 'subdirectories' valid under /. You then keep descending until there are no more files. To get a file directly, you should be able to look up a key like "/work/myfiles/important-document" to get information about the location of that file.
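A rough sketch of that virtual file system, again with a dict playing the role of the DHT's get/set. Each directory key holds the names of its children, so you can walk down from "/" or jump straight to a full path:

```python
dht = {}

def mkfile(path, location):
    """Register a file: populate each directory key along the path."""
    parts = path.strip("/").split("/")
    cur = "/"
    for part in parts:
        dht.setdefault(cur, set()).add(part)   # add child to directory listing
        cur = cur.rstrip("/") + "/" + part
    dht[path] = location                       # leaf key stores file metadata

mkfile("/work/myfiles/important-document", {"node": "node-3"})
print(dht["/"])                                # {'work'}
print(dht["/work/myfiles"])                    # {'important-document'}
print(dht["/work/myfiles/important-document"]) # {'node': 'node-3'}
```

Browsing is repeated get calls down the tree; a direct lookup is a single get on the full path.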

Projects like Hadoop replicate large files differently (in blocks of X MB) across a huge cluster; a DHT layer under such a system could store the locations of each replicated block. Nodes could then manage their own files this way. This could possibly remove the need for the authoritative, centralized HDFS file directory server and make the network overall more robust and resilient (the HDFS central server is a single point of failure).

It's also possible to store files under two kinds of keys: the /x/y/z form, which is meant for finding information, and the /nodeX/files form, which lists the files located on a single machine. If each node manages and maintains that information itself, the rest of the network can react to node crashes by simply looking up this information and acting on it.
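A small sketch of that two-key scheme: one logical key per file for lookups, plus a per-node inventory key so the network can see what a crashed machine held. All names are illustrative, and the dict again stands in for the DHT:

```python
dht = {}

def store_file(path, node):
    dht[path] = {"node": node}                         # /x/y/z lookup key
    dht.setdefault(f"/{node}/files", set()).add(path)  # per-node inventory key

def handle_crash(node):
    """Find everything the dead node held (e.g. to re-replicate it)."""
    return sorted(dht.get(f"/{node}/files", set()))

store_file("/work/report.txt", "node-2")
store_file("/media/song.mp3", "node-2")
print(handle_crash("node-2"))  # ['/media/song.mp3', '/work/report.txt']
```

When node-2 disappears, any surviving node can read "/node-2/files" and start restoring replicas, with no central directory server involved.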

The above method for storing information doesn't deviate much from Prefix Hash Trees. A complication with DHTs is that you cannot access data by parts of a key (a search), because keys are hashed and therefore unreadable; you can only find data when you have the key in its entirety. You could create a database in this network as a kind of directory, but that introduces a single point of failure again. The schemes above make the data slightly more accessible. This is also the reason why DHTs are not the right solution for every problem. They work well when you access lots of data by the same primary key: Facebook uses this for showing entire user profiles, Slashdot for storing and retrieving rendered comments. The only thing you need next is a method for expiring entries.
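One common way to handle that expiry is to store a timestamp next to each value and treat stale entries as missing. A sketch, with a plain dict instead of a DHT and a deliberately tiny TTL so the example is observable:

```python
import time

class ExpiringStore:
    """Key/value store where entries silently expire after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}

    def set(self, key, value):
        self._data[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]      # lazily drop expired entries on read
            return None
        return value

cache = ExpiringStore(ttl_seconds=0.05)
cache.set("profile:123", "<rendered profile html>")
print(cache.get("profile:123"))      # fresh: the stored value
time.sleep(0.1)
print(cache.get("profile:123"))      # expired: None
```

Expiring lazily on read keeps the scheme simple; a background sweep would be needed only if memory pressure matters.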

DHTs are generally stored in memory, making lookups really fast. Most of the time this is acceptable, but sometimes you want a bit more persistence, with individual nodes storing their bits of the network's data on disk. The biggest problems arise once a node goes offline for a longer period of time and then reconnects. The data it stores may by then be partly expired, and you don't want those bits of information to leak back into the network. It is also possible that certain updates have produced newer versions of the information that the old keys don't reflect. Amazon's Dynamo has more or less solved this through the use of "vector clocks". At some point, you need reconciliation.
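The vector clock idea can be sketched in a few lines: each replica counts its own writes, and comparing the counters tells you whether one version descends from the other, or whether the two were written concurrently and need reconciliation. This is a rough illustration of the concept, not Dynamo's actual implementation:

```python
def descends(a, b):
    """True if clock a has seen every event clock b has."""
    return all(a.get(node, 0) >= n for node, n in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict: reconcile"

v1 = {"nodeA": 2, "nodeB": 1}
v2 = {"nodeA": 2, "nodeB": 2}   # nodeB wrote again on top of v1
v3 = {"nodeA": 3, "nodeB": 1}   # nodeA wrote concurrently on top of v1
print(compare(v2, v1))          # a is newer
print(compare(v2, v3))          # conflict: reconcile
```

A reconnecting node can then discard versions that the network's clocks dominate, and only genuinely concurrent versions have to be reconciled.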

You could store virtually anything in this network, but of course you need to know what something is when you retrieve it. If you're only looking at non-binary data, JSON could be an efficient format for storing it. I doubt you'd need "protocol buffers", as they're designed for streaming access to large amounts of data, but you could possibly still use that project's compiler to store data in the described format. That at least covers the storage part, so that later you can create apps on top of the DHT in python, C or Java.

What is so nice about DHT? The design, if done right of course, allows you to work on top of a distributed system without worrying about fault tolerance within the application. Basically, once you are connected, you just read and write to your heart's content and it should work.
