In this topic we shall understand what is Consistent Hashing and why it is important for system design or in Distributed System.
So before understanding about Consistent Hashing, let us understand about Hashing, working of Hashing and why we moved to Consistent Hashing.
What is hashing?
Hashing is a technique to index the data. With the help of hashing we can store and retrive data very easily.
How indexing will happen?
For indexing we use something called as “hash function”. A hash function will use a mathematical formula to create a value. That value will be stored in a “Hash Table”.
If you are storing in a key value pair, the search will be of O(n). But by using Hash Function the search will be of O(1). This is because the “Hash Function”.
Hashing will also help in sharding the keys. Now a day the data is distributed in multiple locations, hence we need to distribute the keys across multiple locations. Hashing will achieve the purpose.
Let us take an example to store the data into hash table.
We have below values, that are needed to be stored in hash table.
12, 27, 56, 84, 93, 96
My hash table consist of values 0 to 9. [Assume we have 10 servers to store the values]
My hash function is as below:
H(n) = n mod 10. It means it will take the modulus of the value and store it.
Hence 12 mod 10 is 2, hence stored in 2nd index/server
Hence 27 mod 10 is 7, hence stored in 7nd index/server
Hence 56 mod 10 is 6, hence stored in 6nd index/server
Hence 84 mod 10 is 4, hence stored in 4nd index/server
Hence 93 mod 10 is 3, hence stored in 3nd index/server
Hence 96 mod 10 is 6, hence stored in 6nd index/server
Hence when we need to check the value 12 is present or not, we just take the mod 10 value of 12 and then check if the value is present. This reduces the time to search.
Complications in Hashing:
As we can see from above both “56” and “96” both are in the same index/server. Hence only the latest value will be stored.
Because of this we can say that our hash function is not an optimal one.
An optimal Hash function should satisfy below points:
1. A hash function can be easily computed
2. All the values should be evenly distributed
3. There should be less or minimal collision
How to resolve the collision?
There are multiple ways to resolve collision. Some of the points are discussed below:
1. In case of multiple values at the same index, add the new values to a linked list. But searching and deleting will take time. This is also called as Open Hashing.
2. In this case of multiple values at the same index, check if the next index is free, then insert the new value in the next index. If the next index is not free, then go to next index and insert the value. This is called as Closed Hashing.
3. Quadratic Probing
4. Double Hashing
Below are the some of the list of hash function
“Jenkins hash function”
“NIST hash function competition”
Need for Consistent Hashing
So we got a basic idea about hashing in the above section. Now shall look at the need for Consistant Hashing.
As the data size increases, we add multiple servers to handle the data. In the above example, we did “mod 10” assuming that there are 10 servers. But if suppose we add 2 more servers, then we have to do “mod 12”. Then we have to update the hash values for all the values present. The same holds good when you remove the servers.
This requires lot of efforts and in some cases it will not be feasible.
Hence we need consistent hashing.
what is consistent hashing?
Consistent Hashing is way to store the key value pair in such a way that, during the service addition or deletion the data changes will be minimum.
In normal hashing our hash function was doing “k mod n”, where n is the number of servers.
But in consistent hashing we do “k/n” where n is the number of server and k is the key.
Let’s suppose we have 4 servers to store the data. We assign range of values to each of the servers as shown below:
Server 1 will hold the key from 0 to 10
Server 2 will hold the key from 11 to 20
Server 3 will hold the key from 21 to 30
Server 4 will hold the key from 31 to 40
In normal hashing the servers are arranged in the form of an array starting from 0 to n-1. But in consistent hashing we connect all the servers starting from 0 to n-1 and n-1 will connected to 0.
The advantage of this is if you add any new server, or need to remove a server, only set of values needs to be changed, instead of changing all the values.
Consistent hashing and Data persistence
So now we have done consistent hashing, but what if a server is not responding? Then all the values associated with those servers will not be accessible until that server is up and running.
Hence we need to replicate the data to the next servers. Here we can replicate the data from one server to the next 2 servers. Hence a total of 3 servers will be having the same data. Hence we have solved hardware down problem.
Now we have solved one issue, but this presents with another issue.
Now consider the scenario. If a server is down, then we store the latest value in another server. Once the original server is up, how to check if the data present is the latest?
Hence we follow a basic below rules in the situation.
When I write in to the server, there should be at least 2 servers up. If there is only 1 server available, then we skip that write, and write it in another time.
When I read, read from 2 servers.
Rule of thumb is “Number of server I Read + Number of server I write should be greater than the number of servers”. This will guarantee the data read is the latest data.
Open Source Tools using consistent hashing.