Overview
In this tutorial, we will try to understand directory-based sharding. Besides directory-based sharding, there are two other common types of sharding:
- Key Based Sharding
- Range Based Sharding
Directory-based sharding maintains a static lookup table that records which shard holds which data. Let’s consider an example. Assume you have to keep track of all restaurants in different parts of the US. Each restaurant belongs to a certain location within the US. We can divide the US into multiple zones, and store each zone in a separate shard.
Restaurant Table
Restaurant Name | Zone |
R1 | ZA |
R2 | ZB |
R3 | ZC |
R4 | ZA |
R5 | ZD |
R6 | ZB |
Now we will maintain a lookup table that stores the mapping from each zone to its shard:
Lookup Table
Zone | Shard |
ZA | 1 |
ZB | 2 |
ZC | 3 |
ZD | 4 |
It could very well be the case that two or more zones are stored in the same shard. That is also allowed. The lookup table is generally kept outside both the application and the sharded database.
When to use Directory-Based Sharding
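The routing step described above can be sketched in a few lines of Python. This is only an illustration: the dictionary stands in for the external lookup table, and the names ZONE_TO_SHARD and shard_for_zone are assumptions made for this example, not part of any particular library.

```python
# Hypothetical in-memory stand-in for the external lookup table (zone -> shard id).
ZONE_TO_SHARD = {"ZA": 1, "ZB": 2, "ZC": 3, "ZD": 4}


def shard_for_zone(zone: str) -> int:
    """Return the shard id that holds every row belonging to the given zone."""
    try:
        return ZONE_TO_SHARD[zone]
    except KeyError:
        # A zone without a directory entry cannot be routed anywhere.
        raise KeyError(f"zone {zone!r} has no entry in the lookup table") from None


# Rows from the Restaurant Table above: (restaurant name, zone).
restaurants = [
    ("R1", "ZA"), ("R2", "ZB"), ("R3", "ZC"),
    ("R4", "ZA"), ("R5", "ZD"), ("R6", "ZB"),
]

# Every read or write first consults the directory to find the target shard.
placement = {name: shard_for_zone(zone) for name, zone in restaurants}
```

With the tables above, R1 and R4 both land on shard 1 because they share zone ZA.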
- When the cardinality of the column on which the lookup table is based is low.
For example, in the above case, we have a fixed number of zones within the US. Even if there are a million restaurant records, the number of zones stays fixed, so the cardinality of the zone column in the Restaurant Table is low.
As a counter-example, we cannot use employee_id in the lookup table. If there are, let’s say, 20K employees, then there will be 20K different employee_id values and the lookup table will need 20K entries. It keeps growing with every new employee, because the cardinality of the employee_id column is high.
Another important factor is that, besides having low cardinality, the column value should not change over time. For example, consider an Order table. An order can have one of the following statuses:
- Init
- Success
- Failed
- Shipped
Here the cardinality of the status column is low, but an order can move from Init to Success or Failed, and from Success it can move to Shipped. Hence the status column is not the right field to use for the lookup table: the order would have to move from one shard to another every time its status changes.
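To make the problem concrete, the sketch below counts how many times a single order would have to be moved between shards if status were used as the lookup column. The STATUS_TO_SHARD mapping and the function name are hypothetical and exist only for this illustration.

```python
# Hypothetical directory keyed on a column whose value changes over time.
STATUS_TO_SHARD = {"Init": 1, "Success": 2, "Failed": 3, "Shipped": 4}


def migrations_needed(status_history):
    """Count the cross-shard moves an order makes as its status changes."""
    shards = [STATUS_TO_SHARD[status] for status in status_history]
    # Each pair of consecutive statuses that maps to different shards is one move.
    return sum(1 for prev, cur in zip(shards, shards[1:]) if prev != cur)


# An order that succeeds and is then shipped moves twice: 1 -> 2 -> 4.
moves = migrations_needed(["Init", "Success", "Shipped"])
```

A column such as zone never changes for a given restaurant, so the same calculation would always give zero moves.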
Disadvantages of Directory-Based Sharding
- Overhead of a lookup table
There is an overhead in consulting the lookup table every time we want to know which shard a piece of data belongs to, and this lookup happens on every read and every write.
- Lookup table being Single Point of Failure
The lookup table is maintained in a separate place, such as Redis or some other database. If that store goes down, no request can be routed to a shard, so the lookup table can become a single point of failure. To avoid this, we have to maintain multiple copies of the lookup table.
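One possible way to soften both disadvantages is sketched below: keep several copies of the lookup table, try them in order, and cache each answer in the application process. The sketch assumes the mapping is static, as described earlier, so cached entries are never invalidated. Replicas are modelled as plain Python callables. In a real system they would be calls to Redis or another store. The class and function names are hypothetical.

```python
from typing import Callable, Dict, List


class CachedDirectory:
    """Resolve zone -> shard by trying replicas in order and caching hits."""

    def __init__(self, replicas: List[Callable[[str], int]]):
        self._replicas = replicas
        self._cache: Dict[str, int] = {}

    def shard_for(self, zone: str) -> int:
        # Serve from the local cache when possible to skip the remote lookup.
        if zone in self._cache:
            return self._cache[zone]
        last_error = None
        for fetch in self._replicas:
            try:
                shard = fetch(zone)
            except ConnectionError as exc:
                # This copy is unreachable, so fall through to the next one.
                last_error = exc
                continue
            self._cache[zone] = shard
            return shard
        raise RuntimeError("no lookup-table replica is reachable") from last_error


TABLE = {"ZA": 1, "ZB": 2, "ZC": 3, "ZD": 4}
remote_calls = []


def down_replica(zone: str) -> int:
    raise ConnectionError("replica unavailable")


def healthy_replica(zone: str) -> int:
    remote_calls.append(zone)
    return TABLE[zone]


directory = CachedDirectory([down_replica, healthy_replica])
first = directory.shard_for("ZA")   # falls back to the healthy replica
second = directory.shard_for("ZA")  # answered from the local cache
```

If the mapping could change, the cache would also need an expiry or invalidation mechanism, which this sketch deliberately leaves out.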
The Problem of Hot Shard
Like any other sharding technique, directory-based sharding can also result in hot shards. So when directory-based sharding is used, the sharding key is generally chosen such that the data is evenly divided between the shards. In the above case, we can argue that some of the zones might contain more restaurants than others. In that case, there are two ways to fix it.
Solution 1
- Map several zones with fewer restaurants to the same shard, but give each zone with a large number of restaurants its own shard. For example, let’s assume for the case above that ZA and ZB have a large number of restaurants, and ZC and ZD have fewer restaurants. Then the lookup table can look like this:
Lookup Table
Zone | Shard |
ZA | 1 |
ZB | 2 |
ZC | 3 |
ZD | 3 |
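The article does not say how such a mapping is chosen. One simple option is a greedy heuristic: take the zones from largest to smallest and place each one on the shard that currently holds the fewest restaurants. The sketch below assumes made-up restaurant counts per zone and three shards. The function name and the numbers are hypothetical.

```python
def assign_zones(zone_counts, num_shards):
    """Greedy largest-first packing of zones onto shards; returns zone -> shard id."""
    loads = {shard: 0 for shard in range(1, num_shards + 1)}
    assignment = {}
    # Biggest zones first; ties are broken by zone name to keep the result stable.
    for zone, count in sorted(zone_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        # Pick the least-loaded shard, preferring the lowest shard id on ties.
        target = min(loads, key=lambda shard: (loads[shard], shard))
        assignment[zone] = target
        loads[target] += count
    return assignment


# Assumed restaurant counts: ZA and ZB are large, ZC and ZD are small.
counts = {"ZA": 500, "ZB": 450, "ZC": 60, "ZD": 40}
lookup = assign_zones(counts, num_shards=3)
```

With these assumed counts the heuristic reproduces the lookup table above: ZA and ZB each get their own shard, while ZC and ZD share shard 3.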
Solution 2
Use a hybrid approach. We can have the combination of zone and subzone decide the shard. For example, suppose ZA contains so many restaurants that it simply cannot fit in a single shard. Then the combination of zone and subzone decides the shard.
Restaurant Table
Restaurant Name | Zone | Sub Zone |
R1 | ZA | X |
R2 | ZB | X |
R3 | ZC | X |
R4 | ZA | Y |
R5 | ZD | X |
R6 | ZB | Y |
Lookup Table
Zone_SubZone | Shard |
ZA_X | 1 |
ZA_Y | 2 |
ZB_X | 3 |
ZB_Y | 4 |
ZC_X | 5 |
ZC_Y | 5 |
ZD_X | 6 |
ZD_Y | 6 |
Each subzone of zones ZA and ZB maps to its own shard, while the two subzones of ZC share one shard and the two subzones of ZD share another.
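The hybrid lookup can be sketched the same way as the simple one, with the directory keyed on the zone and subzone joined by an underscore. The dictionary again stands in for the external lookup table, and the names used here are assumptions for illustration.

```python
# Hypothetical stand-in for the hybrid lookup table ("<zone>_<subzone>" -> shard id).
SUBZONE_TO_SHARD = {
    "ZA_X": 1, "ZA_Y": 2,
    "ZB_X": 3, "ZB_Y": 4,
    "ZC_X": 5, "ZC_Y": 5,
    "ZD_X": 6, "ZD_Y": 6,
}


def shard_for(zone: str, subzone: str) -> int:
    """Return the shard id for a (zone, subzone) pair using the composite key."""
    return SUBZONE_TO_SHARD[f"{zone}_{subzone}"]


# Rows from the Restaurant Table above: (restaurant name, zone, subzone).
restaurants = [
    ("R1", "ZA", "X"), ("R2", "ZB", "X"), ("R3", "ZC", "X"),
    ("R4", "ZA", "Y"), ("R5", "ZD", "X"), ("R6", "ZB", "Y"),
]

placement = {name: shard_for(zone, sub) for name, zone, sub in restaurants}
```

R1 and R4 are both in zone ZA but now land on different shards, because their subzones differ.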
Conclusion
That is all about directory-based sharding. I hope you liked the article.