Table of Contents

Overview

In this tutorial, we will try to understand directory-based sharding. Other than directory-based sharding there are two types of sharding as well

Key Based Sharding

Range Based Sharding

Directory-based sharding keeps a static lookup table that keeps track of what shard is holding what data. Let’s consider an example. Assume you have to keep track of all restaurants present in different parts of the US. Each of the restaurants will belong to a certain location within the US. We can divide the US into multiple zones. Each of the zones will be stored in a separate shard.

Restaurant Table

Restaurant Name	Zone
R1	ZA
R2	ZB
R3	ZC
R4	ZA
R5	ZD
R6	ZB

Now we will maintain a lookup table that will store the mapping of the zone to the shard

Lookup Table

Zone	Shard
ZA	1
ZB	2
ZC	3
ZD	4

It could very well be the case that two of the zones are stored in the shard. That is also allowed. This look-up table is saved generally outside the application and the database

When to use Directory Based Sharding.

When the cardinality of the column on which the lookup table is based is low.

For example, in the above case, we will have a fixed number of zones within the US. Even if there exist a million restaurant records, the number of zones is fixed. And hence the cardinality of the zone column in the Restaurants Table is low.

As a counter-example, we cannot use the employee_id in the lookup table because if there are let’s say 20K employees then there will be 20K different employee_id and the size of the lookup table will be unbounded as here the cardinality of the employee_id column is high

Another important factor is that other than the cardinality of a column being low, the column value should also not change. For example, consider an Order table. An order will have different status

Init

Success

Failed

Shipped

Here the cardinality of the status column is low but the order can move from Init to Success or Failed. From Success, it can move to Shipped. Hence status column is not the right field here to use for the lookup table as otherwise, the order would move from one shard to another shard once the order status changes.

Disadvantages of Directory-Based Sharding

Overhead of a lookup table

There is an overhead consulting the table every time we want to know which shard the user data belongs to. And this lookup will happen for every read and write

Lookup table being Single Point of Failure

The lookup table can be maintained in a separate place may be Redis or some other database. This lookup table can be a single point of failure. So we have to maintain multiple copies of a lookup table.

The Problem of Hot Shard

Like any other sharding technique, this sharding can also result in several hot shards. So generally when the directory-based sharding is done, the sharding key is chosen such that the data is evenly divided between the shards. In the above case, we can argue that some of the zones might contain more restaurants than others. In that case, we can two ways to fix it

Solution 1

Map a few zones with less number of restaurants to the same shard. But map zone with a large number of restaurants to one shard. For example, let’s assume for the case above that ZA and ZB have a large number of restaurants, and ZC and ZD will have less number of restaurants. Then the look table can be like this

Lookup Table

Zone	Shard
ZA	1
ZB	2
ZC	3
ZD	3

Solution 2

Use the hybrid approach. We can have the combination of zone and subzone decide the shard. For example, ZA contains so many restaurants that it cannot simply fit in one of the shards. Then the combination of zone and subzone will decide the shard

Restaurant Table

Restaurant Name	Zone	Sub Zone
R1	ZA	X
R2	ZB	X
R3	ZC	X
R4	ZA	Y
R5	ZD	X
R6	ZB	Y

Zone	Shard
ZA_X	1
ZA_Y	2
ZB_X	3
ZB_Y	4
ZC_X	5
ZC_Y	5
ZD_X	6
ZD_Y	6

Both the subzones of zone ZA and ZB map to a different shard. While subzones of zone ZC and ZD map to the same shard.

Conclusion

This is all about directory-based sharding. Hope you have liked the article