Distributed Systems: Consistent Hashing
We will learn consistent hashing, its usage in distributed systems, and how it plays a role in designing distributed systems such as databases and caches.
Join the DZone community and get the full member experience.
Join For FreeWelcome to the distributed systems series. In this article, we are going to learn about consistent hashing and its usage in distributed systems. Why consistent hashing is important and how it plays a role in designing distributed systems such as databases, cache, etc. Let’s first understand what is hashing and how it is used to distribute data across machines. Then, we will understand what is consistent hashing.
Hashing
Hashing is a technique that generates a unique ID for an object. A simple example would be the hashcode function in Java, which returns a unique ID for an immutable object. This returned ID is used to choose the bucket from an array of buckets for storage and retrieval. In order for this hashing function to return the correct value, the object or key that we use to hash should be immutable. This is how hashing works in Java to store and retrieve the value in the HashMap data structure. If you know how hashmap works, the concept is pretty similar in distributed systems. In distributed systems, we have an array of machines to store the data, and we have to decide which machines should hold the specific data. The following diagram explains how hashing is used to store {key, value} data on different machines.
The above example is pretty good for storing and distributing a large number of objects. Since the data is partitioned horizontally, a simple hash function based on key hash(key)%N would decide which machine to store the given {key, value}. N represents the number of machines in the cluster. What is the problem with the above approach? If we have to add or remove machines from the cluster, Everything {key, value} object that is stored on the cluster should be redistributed. This is not efficient due to the movement of all the keys in the cluster.
Why is this inefficient? Let’s just imagine we have a cache cluster of 100 machines. We have to redistribute all the keys when a new machine is added or removed from the cluster. We have to recalculate hash(key)%N for all keys and move them to corresponding machines. Surely, it’s not efficient. A better way to solve this problem is to use a different hashing mechanism called consistent hashing.
Consistent Hashing
Consistent hashing is a technique used in distributed systems to store and manage data efficiently. Let’s first understand how it works. It works by creating a hashring cluster with multiple points that range from start and end. The ring cluster is sorted in order. When a new machine is added to the ring cluster, It will take care of managing certain points in the circle. The below diagram explains how consistent hashing works. We created a simple hashring that has points ranging from 0 to 100. The points are ordered from 0 to 100, and each machine would take care of handling a subset of the entire range. As specified below, the First machine will store points up to 20, which means if the hash(key) returns a value that is ≤ 20, then that {key, value} is stored in the machine {20}.
How do we find the machine from a hash(key)? We can simply iterate or do a binary search in the range {0...100} to find the machine since points are already sorted. {key, value} is stored on the machine, which has the next higher range for the hash ID. As specified in the below example, hash(k1) returns 18, and hash(k2) returns 38. k1 is stored on machine {20} and k2 is stored on machine {40}.
So far, we have seen how to add {key, value} to the respective machine using consistent hashing. The process is the same for deleting objects from the cluster. We have to get the hash(key) to find the node that has higher points for the computed hash ID. Once it is found, we could simply delete the object from the cluster using a key.
In our example, we have built a distributed ring cluster with points that range from 0 to 100. But what if the hash ID for the key is higher than 100? In that case, we will add {key, value} to the first node {20} of the cluster. The previously mentioned scenario is applicable for deleting {key, value} from the cluster. If the hash(key) is higher than 100, then it would search the {key, value} in the first node to delete.
Adding or Removing Nodes
In the previous section of this post, we saw how to add or remove data from the cluster using consistent hashing. In this section, we will see how adding or removing nodes from the cluster affects the movement of the data in the cluster.
While adding a new node to the cluster, we will calculate a hash(serverId) to find where this node can be placed on our ring cluster. In the below example, we have removed two nodes {40} {50} from the cluster. When we remove these nodes, we could simply copy all the data that is stored on those nodes to node {60}. Now node{60} will take care of all the requests for hash(k2) and hash(k3). If we really look at this example, Instead of moving all the keys across the cluster, we have only moved the subset of the entire key ranges. This is the advantage of consistent hashing. Now, we have reduced the movement of key ranges to k/n, where k is the total number of keys and n is the number of machines in the cluster.
Let’s add a new node to our cluster. Hash (serverId) lies somewhere between {30} and {60}. Let’s consider this new node can take points up to {40}. When a new node is added, It has to copy all the data that belongs to {40} from {60}. Once it is done, {40} will take care of storing and retrieving hash(k2), which is 38, and node{60} will serve the request for hash(k3), which is 48.
So far, we have seen how consistent hashing is used to add or remove {key, value} from the cluster. How does it effectively minimize the data movements across the cluster while a new node is added or removed from the cluster?
One more thing to understand here is that even though consistent hashing helps to minimize data movement, It will not guarantee that data is uniformly distributed across all the nodes of the cluster. This completely depends on the hashing algorithm that we use.
Conclusion
In this article, we learned what consistent hashing is, why it is used in distributed systems such as databases and caches, and how it minimizes the data movement across the cluster. We have also understood how adding or removing a node in the cluster affects the movement of a subset of key ranges in the cluster, and finally, we understood why consistent hashing is used in distributed systems.
Published at DZone with permission of Vetriselvan M. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments