Skip to content

Welcome to Tech by Example

Menu
  • Home
  • Posts
  • System Design Questions
Menu

Understanding Directory-Based Sharding

Posted on August 29, 2022August 29, 2022 by admin

Table of Contents

  • Overview
  • When to use Directory Based Sharding.
  • Disadvantages of Directory-Based Sharding
  • The Problem of Hot Shard
    • Solution 1
    • Solution 2
  • Conclusion

Overview

In this tutorial, we will try to understand directory-based sharding. Other than directory-based sharding there are two types of sharding as well

  • Key Based Sharding
  • Range Based Sharding

Directory-based sharding keeps a static lookup table that keeps track of what shard is holding what data. Let’s consider an example.  Assume you have to keep track of all restaurants present in different parts of the US. Each of the restaurants will belong to a certain location within the US. We can divide the US into multiple zones. Each of the zones will be stored in a separate shard.

Restaurant Table

Restaurant NameZone
R1ZA
R2ZB
R3ZC
R4ZA
R5ZD
R6ZB

Now we will maintain a lookup table that will store the mapping of the zone to the shard

Lookup Table

ZoneShard
ZA1
ZB2
ZC3
ZD4

It could very well be the case that two of the zones are stored in the shard. That is also allowed. This look-up table is saved generally outside the application and the database

When to use Directory Based Sharding.

  • When the cardinality of the column on which the lookup table is based is low.

For example, in the above case, we will have a fixed number of zones within the US. Even if there exist a million restaurant records, the number of zones is fixed. And hence the cardinality of the zone column in the Restaurants Table is low.

As a counter-example, we cannot use the employee_id in the lookup table because if there are let’s say 20K employees then there will be 20K different employee_id and the size of the lookup table will be unbounded as here the cardinality of the employee_id column is high

Another important factor is that other than the cardinality of a column being low, the column value should also not change. For example, consider an Order table. An order will have different  status

  • Init
  • Success
  • Failed
  • Shipped

Here the cardinality of the status column is low but the order can move from Init to  Success or Failed. From Success, it can move to Shipped. Hence status column is not the right field here to use for the lookup table as otherwise, the order would move from one shard to another shard once the order status changes.

Disadvantages of Directory-Based Sharding

  • Overhead of a lookup table

There is an overhead consulting the table every time we want to know which shard the user data belongs to. And this lookup will happen for every  read and write

  • Lookup table being Single Point of Failure

The lookup table can be maintained in a separate place may be Redis or some other database. This lookup table can be a single point of failure.  So we have to maintain multiple copies of a lookup table. 

The Problem of Hot Shard

Like any other sharding technique, this sharding can also result in several hot shards. So generally when the directory-based sharding is done,  the sharding key is chosen such that the data is evenly divided between the shards. In the above case, we can argue that some of the zones might contain more restaurants than others. In that case, we can two ways to fix it

Solution 1

  • Map a few zones with less number of restaurants to the same shard.  But map zone with a large number of restaurants to one shard. For example, let’s assume for the case above that ZA and ZB have a large number of restaurants, and ZC and ZD will have less number of restaurants. Then the look table can be  like this

Lookup Table

ZoneShard
ZA1
ZB2
ZC3
ZD3

Solution 2

Use the hybrid approach. We can have the combination of zone and subzone decide the shard. For example, ZA contains so many restaurants that it cannot simply fit in one of the shards. Then the combination of zone and subzone will  decide the shard

Restaurant Table

Restaurant NameZoneSub Zone
R1ZAX
R2ZBX
R3ZCX
R4ZAY
R5ZDX
R6ZBY
ZoneShard
ZA_X1
ZA_Y2
ZB_X3
ZB_Y4
ZC_X5
ZC_Y5
ZD_X6
ZD_Y6

Both the subzones of zone ZA and ZB map to a different shard. While subzones of zone ZC and ZD map to the same shard.

Conclusion

This is all about directory-based sharding. Hope you have liked the article

©2025 Welcome to Tech by Example | Design: Newspaperly WordPress Theme