Pastebin System Design - Welcome to Tech by Example

Overview

First, let’s look at the definition of Pastebin. A Pastebin service lets you share text and images through a link that can be passed on to multiple users. In short, Pastebin is a service for sharing data temporarily over a link.
Requirement analysis is a critical part of any system design question. It is divided into two parts.

Functional Requirements – Defines the Business Requirements

Non-Functional Requirements – Defines the quality attributes of a system such as performance, scalability, security, etc.

Let’s see some of the functional and non-functional requirements of the Pastebin service.

Functional Requirements

Users should be able to paste text or upload an image and then share it with others using a unique URL generated by the Pastebin service

Users should be able to set an expiry for a URL. If no expiry is specified, the default is 1 week

Users can be either logged in or anonymous

A logged-in user should be able to view the pastes that they have created

Other users can access the paste text content or image whenever they access the paste URL.

Non Functional Requirements

The system should be highly durable. The unique URL once generated should persist.

The system should be strongly consistent. What it means is that once a paste is generated then the system should be able to return that paste in the next immediate call.

The system should be highly available

The system should be fault-tolerant

There should be no single point of failure

Capacity Estimate

We will do a capacity estimate for two things

Data Estimation

Traffic Estimation

Traffic Estimation

Assume the number of active users per day is 200K

Number of pastebins created per day – 200K

Each of the pastebins created is read 10 times. Total reads per day – 200K*10 = 2000K

So our system is more read-heavy than write-heavy.

Total writes per sec = 200K/(24*60*60) ~ 2.3, so roughly 3 requests/sec. Considering peak traffic, assume 200 requests/sec.

Total reads per sec = 2000K/(24*60*60) ~ 23 requests/sec. Considering peak traffic, assume 2000 requests/sec.
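The averages in this traffic estimate can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative, and the peak figures are assumptions in the text rather than values derived here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

pastes_per_day = 200_000   # pastes created (writes) per day
reads_per_paste = 10       # each paste is read 10 times
reads_per_day = pastes_per_day * reads_per_paste

# Average request rates, spread evenly over the day
writes_per_sec = pastes_per_day / SECONDS_PER_DAY
reads_per_sec = reads_per_day / SECONDS_PER_DAY

print(f"average writes/sec: {writes_per_sec:.1f}")
print(f"average reads/sec:  {reads_per_sec:.1f}")
```

The script gives about 2.3 writes/sec and about 23.1 reads/sec on average, so reads outnumber writes 10 to 1.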

Data Estimation

While creating Pastebin a user can specify either text or an image. Assume the max text size that is allowed to the user is 5MB and the maximum image size is 10 MB.

Also, assume that the ratio of text upload vs image upload is 9:1

Text Upload

Number of text upload = 180K

Max size of text upload – 5MB

Average size of text upload – 10KB

Total Size per day = 10KB * 180K = 1.8 GB per day

Assuming no paste expires, the total size required for 3 years will be 1.8 GB * 3 * 365 ~= 2 TB

Image Upload

Number of image uploads = 20K

Max size of image upload – 10MB

Average size of image upload – 100KB

Total Size per day = 100KB * 20K = 2 GB per day

Assuming no paste expires, the total size required for 3 years will be 2 GB * 3 * 365 ~= 2.2 TB

Total Size Needed = 2TB + 2.2TB =~ 4.2 TB for 3 years
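The storage numbers above can be reproduced with a short calculation. This is a sketch that assumes decimal units (1 GB = 1,000,000 KB, 1 TB = 1,000 GB), which is the rounding the figures above appear to use; the variable names are illustrative.

```python
KB_PER_GB = 1_000_000   # decimal units (assumption)
GB_PER_TB = 1_000
DAYS = 3 * 365          # three years, no paste ever expires

text_uploads_per_day = 180_000
avg_text_kb = 10
image_uploads_per_day = 20_000
avg_image_kb = 100

text_gb_per_day = text_uploads_per_day * avg_text_kb / KB_PER_GB
image_gb_per_day = image_uploads_per_day * avg_image_kb / KB_PER_GB

text_tb = text_gb_per_day * DAYS / GB_PER_TB
image_tb = image_gb_per_day * DAYS / GB_PER_TB
total_tb = text_tb + image_tb

print(f"text:  {text_gb_per_day:.1f} GB/day, {text_tb:.2f} TB over 3 years")
print(f"image: {image_gb_per_day:.1f} GB/day, {image_tb:.2f} TB over 3 years")
print(f"total: {total_tb:.2f} TB over 3 years")
```

The exact totals come out slightly under the rounded 2 TB and 2.2 TB, which is why the combined figure is quoted as roughly 4.2 TB.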

Database Schema

We need to store each paste created by the user along with the corresponding text or image. As already mentioned, text can be up to 5 MB in size and an image can be up to 10 MB.

Storing images in the database would be a bad design because it would involve a lot of database IO. Text should also be kept out of the database when it is large. So the below strategy can be adopted for text and image storage

Text Storage

If the text is less than 10 KB then it can be stored as part of the database.

If not, the text will be stored in blob storage. We can use Amazon S3 here. The S3 link will be stored in the database

Image Storage

The image will always be stored in S3
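The storage rules above can be expressed as a small decision function. This is a minimal sketch; the function name, the return values, and the choice of 1 KB = 1024 bytes are assumptions made for illustration.

```python
INLINE_TEXT_LIMIT_BYTES = 10 * 1024  # the 10 KB threshold (1 KB = 1024 bytes assumed)


def storage_target(paste_type: str, size_bytes: int) -> str:
    """Return "database" or "s3" for a paste, following the rules above."""
    if paste_type == "image":
        # Images always go to blob storage
        return "s3"
    if paste_type == "text":
        # Only text smaller than the threshold is stored inline in the database
        return "database" if size_bytes < INLINE_TEXT_LIMIT_BYTES else "s3"
    raise ValueError(f"unknown paste_type: {paste_type!r}")
```

For example, storage_target("text", 500) selects the database, while a 2 MB text paste or any image is sent to S3.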

We don’t have any ACID requirements so we can use a NoSQL database to store the pastes that will be created. We can use Cassandra in this case.

We will have the below tables

User Table

Paste Table

User Table

It will have the below fields

user_id
user_name
password_encrypted
created
updated

Paste Table

It will have the below fields

paste_id – It will be a UUID
paste_type – It will be either text or image
text – Populated if the paste_type is text and the text size is less than 10 KB
s3_url – Populated in two cases: when the paste_type is text and the text size is 10 KB or more, and when the paste_type is image
user_id
created
updated

How Unique URLs will be generated

Since a paste is meant to be created and then shared with other users, it is good to have a short URL so that it is easy to share. So now we will look at how we can generate the short URL for the paste. For that, we will have another service called the Key Generation Service, which will generate a short key that is then used in the URL of the created paste

High-Level Design

At a high level, let’s discuss the overall flow and which services will exist.

There will be an API gateway on which every request from all the users will land.

There will be a Pastebin service. This service holds the responsibility of generating all the paste URLs.

There will be a Key Generation Service that holds the responsibility of generation of short keys

The Pastebin service will call the key generation service whenever it needs new keys.

When the Pastebin service has exhausted all the key ranges it will publish a Kafka/SNS message specifying that the key ranges have been exhausted.

This message will be picked by a Key Recovery service which will be a worker. It will mark the key range as free so that it can be picked again. This worker will also delete the paste created for all the keys in the range from the database.

We will cache the most recently created pastes, as they are more likely to be shared and accessed soon after they are created

Below is the high-level diagram of the service

Let’s see some of the details of the Paste Bin service and the Key Generation Service

Paste Bin Service

This service will be the interface to all the APIs that exist in our system. The Pastebin service will expose one API to create a paste and another API to read that paste.

Create Paste – It will interact with the Key Generation Service to get the short keys. These keys will be used for generating the URL. It will then make an entry into the database for that paste. It will also make an entry into the cache.

Read Paste – It will first see if the paste already exists in the cache. If yes then it will return that paste. If not it will fetch the paste from the database.
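The create and read flows above can be sketched with in-memory dictionaries standing in for the cache and the database. The class name, the TTL value, and the way the key is passed in are assumptions for illustration; in the design above the key would come from the Key Generation Service.

```python
import time
from typing import Dict, Optional, Tuple


class PasteStore:
    """In-memory stand-ins for the cache and the database (hypothetical)."""

    def __init__(self, cache_ttl_seconds: float = 3600.0) -> None:
        self.cache_ttl_seconds = cache_ttl_seconds
        self.db: Dict[str, str] = {}
        # key -> (content, time after which the cache entry is stale)
        self.cache: Dict[str, Tuple[str, float]] = {}

    def create_paste(self, key: str, content: str) -> None:
        # Create Paste: write to the database, then add an entry to the cache
        self.db[key] = content
        self.cache[key] = (content, time.monotonic() + self.cache_ttl_seconds)

    def read_paste(self, key: str) -> Optional[str]:
        # Read Paste: try the cache first
        entry = self.cache.get(key)
        if entry is not None:
            content, expires_at = entry
            if time.monotonic() < expires_at:
                return content
            del self.cache[key]  # drop the stale cache entry
        # Cache miss: fall back to the database
        return self.db.get(key)
```

A real deployment would use a shared cache and a replicated database instead of dictionaries, but the order of operations stays the same.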

Making Create and Read Paste more efficient

When the text is large or the paste contains an image, passing the content through the Pastebin service for both creating and reading a paste is a challenge. Here we can do one optimization. We can upload the large text or image directly to the blob storage, which could be Amazon S3 or HDFS. How do we do that?

For Upload

Let’s say User A wants to create a paste that contains an image. The client will send a request to the server asking for a pre-signed URL to which the client can upload the image

The server will respond with a pre-signed URL whose validity can be a few hours. You can read this article to get an idea of the pre-signed URL https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html . Basically, it is a URL that already carries a signature, so it can be used to upload directly to that URL without requiring any further authentication. This is also called direct upload. The server will also return the image_id here

The client will upload the image to that URL. It will directly be stored in S3

Now the client will call the Create Paste API, passing the image_id in the request body. From the image_id, the server knows that the image for this paste has already been uploaded to the blob storage, which is S3 here. It will save the S3 path as a field while saving the paste in the DB

With the above approach, we avoid sending the bytes of the image or text through the Pastebin service, which is an optimization in terms of cost and performance
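The server side of the upload flow might look like the sketch below. It assumes s3_client is a boto3 S3 client (or any object exposing the same generate_presigned_url method); the function name, the bucket argument, and the images/ key layout are hypothetical.

```python
import uuid
from typing import Any, Dict


def create_upload_slot(s3_client: Any, bucket: str, expires_in: int = 3600) -> Dict[str, str]:
    """Return an image_id plus a pre-signed URL the client can PUT the image to."""
    image_id = str(uuid.uuid4())
    upload_url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": f"images/{image_id}"},  # hypothetical key layout
        ExpiresIn=expires_in,  # validity of the URL in seconds
    )
    return {"image_id": image_id, "upload_url": upload_url}
```

The client uploads the image to upload_url and then sends image_id in the Create Paste request, so the image bytes never pass through the Pastebin service.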

For download

User A has created a paste that will be shared with User B. So user B will be reading the paste

While reading, the reverse happens. The PasteBin service fetches the paste from cache or DB.

Once it fetches the paste, it will see if it contains a large text or image. If yes then it will fetch the S3 location from the DB.

Then it will generate a pre-signed URL for that S3 location. The client is returned both the paste information and the pre-signed S3 URL

Using this Presigned S3 URL, the client can directly download the corresponding big paste or image file.
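The read path can be sketched in the same way. Here the paste is a plain dictionary, s3_client is again assumed to be a boto3 S3 client, and s3_key is a hypothetical field holding the object key of the stored content (None when the text is stored inline).

```python
from typing import Any, Dict, Optional


def build_read_response(paste: Dict[str, Any], s3_client: Any, bucket: str,
                        expires_in: int = 3600) -> Dict[str, Any]:
    """Return paste metadata plus either the inline text or a pre-signed download URL."""
    response: Dict[str, Any] = {
        "paste_id": paste["paste_id"],
        "paste_type": paste["paste_type"],
    }
    s3_key: Optional[str] = paste.get("s3_key")
    if s3_key is None:
        # Small text paste: the content is stored inline in the database row
        response["text"] = paste["text"]
    else:
        # Large text or image: let the client download directly from S3
        response["download_url"] = s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": s3_key},
            ExpiresIn=expires_in,
        )
    return response
```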

Key Generation Service

There will be a Key Generation Service (KGS) that holds the responsibility of generating the keys. First, let’s see what the length of each key should be. Possible lengths are 6, 7, and 8. Only base64 URL-safe characters can be used to generate the key. Therefore

For 6- We have 64^6= 68.7 billion options

For 7 – We have 64^7 = ~4.4 trillion options

For 8 – We have 64^8 = ~281 trillion options
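The size of the key space for each length is simply 64 raised to that length, which a short loop can confirm. The constant name is illustrative.

```python
ALPHABET_SIZE = 64  # a-z, A-Z, 0-9, '-' and '_'

for length in (6, 7, 8):
    # Number of distinct keys of this length
    print(f"{length} characters: {ALPHABET_SIZE ** length:,} keys")
```

This prints 68,719,476,736 for length 6, 4,398,046,511,104 for length 7, and 281,474,976,710,656 for length 8.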

We can now assume that 68.7 billion entries will be enough so we can have 6 characters for the key. Now the question is how these are going to be maintained in the Database. If we are storing 68 billion entries in the database then it might be too many entries and a waste of resources.

One option is to store ranges of keys in the databases. We can have a range of 64 where we only store the first five characters which will act as a prefix for all 64 keys which can be generated from this prefix.

Let’s say we have the below prefix

adcA2

Then below 64 keys can be generated from this

adcA2[a-z] – 26 keys

adcA2[A-Z] – 26 keys

adcA2[0-9] – 10 keys

adcA2[-_] – 2 keys
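Expanding a stored prefix into its 64 keys takes only a few lines. This is a minimal sketch; the function name and the ordering of the alphabet are assumptions.

```python
import string
from typing import List

# 26 lowercase + 26 uppercase + 10 digits + '-' and '_' = 64 characters
URL_SAFE_ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + "-_"


def keys_for_prefix(prefix: str) -> List[str]:
    """Return every key that can be formed by appending one character to the prefix."""
    return [prefix + ch for ch in URL_SAFE_ALPHABET]


keys = keys_for_prefix("adcA2")
print(len(keys), keys[0], keys[-1])
```

For the prefix adcA2 this yields 64 distinct six-character keys, from adcA2a to adcA2_.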

We can store these ranges in the DB. So for 6 characters, we will have 64^5 (about 1.07 billion) entries in the database overall. The keys will be returned by the Key Generation Service to the Pastebin service in ranges and batches only. The Pastebin service will then use this prefix to generate 64 keys and serve 64 different create paste requests. This is an optimization, as the Pastebin service needs to call the Key Generation Service only when it has exhausted all its 64 keys. So there will be one call from the Pastebin service to the Key Generation Service for every 64 short URLs.

Let’s now see the points for the KGS service

Database Schema

Which database to use

How to resolve concurrency issues

How to recover key_prefix

What happens if the key ranges are getting exhausted

What if the paste never expires

Isn’t the KGS service a single point of failure?

Database Schema

There will just be a single table that will store the ranges of keys, i.e. the prefixes. Below are the fields in the table

key_prefix

key_length – It will always be 6 for now. This field exists in case we need keys of length 7 in any scenario

used – If this is true then the key prefix is currently in use. If false then it is free to be used

created

updated

Which Database to Use

We don’t have any ACID requirements so we can use a NoSQL database. Also, we might have very large data to save, so NoSQL might be better suited. This system will be write-heavy as well as read-heavy. So we can use Cassandra here. We can do the capacity estimates of the DB and based on that we can decide on the number of shards we want to have. Each of the shards would be properly replicated as well

There is one more optimization we can do here to improve the latency. We can refill free key ranges in the cache and the KGS service can directly pick from there instead of going to the database every time.

How to resolve concurrency issues

It could very well happen that two requests see the same prefix or range as free. Since there are multiple servers reading from the key DB simultaneously we might have a scenario where two or more servers read the same key as free from the key DB. There are two ways to resolve the concurrency issues we just mentioned

Two or more servers read the same key, but only one server will be able to mark that key_prefix as used in the database. Concurrency is handled at the DB level, that is, each row is locked before being updated, and we can leverage that here. The DB will report back to the server whether any record was updated or not. If no record was updated, the server can fetch a new key. If the record was updated, that server has got the right key.

The other option is to use a transaction that does Find and Update in one transaction. Each Find and Update will return a unique key_prefix every time. This is probably not a recommended option because of the load it puts on the database
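The first approach, where only one server succeeds in marking a prefix as used, can be illustrated with a conditional update. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the key database; the table name key_ranges and the helper names are hypothetical, and how a conditional update is expressed depends on the database actually chosen.

```python
import sqlite3


def make_key_db() -> sqlite3.Connection:
    """Create an in-memory key table (stand-in for the real key database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE key_ranges ("
        "key_prefix TEXT PRIMARY KEY, used INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def claim_prefix(conn: sqlite3.Connection, key_prefix: str) -> bool:
    """Mark the prefix as used only if it is still free; return True if we won it."""
    cur = conn.execute(
        "UPDATE key_ranges SET used = 1 WHERE key_prefix = ? AND used = 0",
        (key_prefix,),
    )
    conn.commit()
    # rowcount is 1 only for the caller whose update actually changed the row
    return cur.rowcount == 1
```

If two servers try to claim the same prefix, the first call returns True and the second returns False, so the second server simply fetches another prefix.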

How to recover key_prefix

Once the Pastebin service has exhausted a range of keys, it will enter that range into another table, from which it can be recovered and marked as free after 2 weeks. This assumes that no paste lives longer than two weeks, so after two weeks every key in the range is guaranteed to be free

What happens if the key ranges are getting exhausted

This will be an unexpected condition. There will be a background worker that will check if the key ranges are getting exhausted. If yes, it can generate ranges for keys of length 7. But how will it know that the key ranges are getting exhausted? To keep a rough count, there could be another table that stores the number of free key ranges.

Whenever any range is allotted by the KGS to the Pastebin service, it will publish a message that will be picked up by an asynchronous worker, which will decrease the count of free key ranges by 1.

Similarly, whenever a range is freed we can increment this counter.
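The bookkeeping described here can be sketched as a small counter. This sketch assumes the counter tracks the number of free key ranges (decremented when a range is allotted, incremented when one is freed); the class name, method names, and the low-watermark threshold are hypothetical.

```python
class FreeRangeCounter:
    """Rough count of free key ranges, updated by the worker (hypothetical)."""

    def __init__(self, free_ranges: int, low_watermark: int) -> None:
        self.free_ranges = free_ranges
        self.low_watermark = low_watermark  # assumed threshold for "getting exhausted"

    def on_range_allotted(self) -> None:
        # A range was handed out by the KGS
        self.free_ranges -= 1

    def on_range_freed(self) -> None:
        # A range was recovered and marked as free again
        self.free_ranges += 1

    def needs_longer_keys(self) -> bool:
        # The background worker would start generating 7-length ranges when this is True
        return self.free_ranges <= self.low_watermark
```

Since only a rough count is needed, the counter can lag slightly behind the key database without causing harm.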

What if the paste never expires

It is easy to extend the above service to serve paste that never expires.

The only change is that our short string will not be limited to 6 characters. We can use 7, 8, or even 9 characters as the need arises.

There will be no key recovery service

Once a key_range has been allotted we can remove it from the key DB as it is never meant to be freed or recovered

Isn’t the KGS service a single point of failure?

To prevent this, we will have proper replication of the key database. Also, there will be multiple app servers for the service itself. We will also have proper autoscaling set up. We can also have disaster recovery in place

Other common components

Other common components could be

User Service – It holds the user profile information.

Token/Auth Service – Management of User tokens

SMS Service – It is used for sending any kind of message back to the user. For example – OTP

Analytics Service – This could be used to track any kind of analytics

Non-Functional Requirements

Let’s discuss some non-functional requirements now

Scalability

The first thing to consider with the above design is the scalability factor. The scalability of each of the components in the system is very important. Here are scalability challenges you can face and their possible resolutions

Each of the machines in the Pastebin service and the KGS service can only serve a limited number of requests. Hence each service should have proper autoscaling set up so that instances can be added or removed based on the number of requests

Your Kafka/SNS system might not be able to take that much load. We can scale horizontally but to a limit. If that is becoming a bottleneck then depending upon the geography or userId we can have two or more such systems. Service discovery could be used to figure out which Kafka system a request needs to go to.

Another important factor of scalability here is that we have designed our system in such a way that none of the services is bogged down with too many things to do. There is a separation of concerns, and wherever a service carried too much responsibility, we have broken it down

Low latency

We can cache a paste as soon as it is created, with some expiry of course. A newly created paste is more likely to be accessed shortly after creation, so this will reduce latency for many of the read calls.

We also created batches of keys, or key ranges. This prevents the Pastebin service from calling the KGS service every time and improves the overall latency.

As mentioned for the KGS service, there is one more optimization to improve the latency. We can preload free key ranges into the cache so that the KGS service can pick them from there instead of going to the database every time.
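The effect of key batching on the Pastebin service can be sketched as a small allocator that calls the KGS only when its local batch of 64 keys runs out. The class name and the fetch_prefix callback (standing in for the call to the KGS) are hypothetical.

```python
import string
from typing import Callable, List

# 64 URL-safe characters: a-z, A-Z, 0-9, '-' and '_'
URL_SAFE_ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + "-_"


class KeyAllocator:
    """Hands out keys locally and calls the KGS only when the batch is empty."""

    def __init__(self, fetch_prefix: Callable[[], str]) -> None:
        self._fetch_prefix = fetch_prefix  # hypothetical call to the KGS
        self._pending: List[str] = []
        self.kgs_calls = 0  # counts round trips to the KGS

    def next_key(self) -> str:
        if not self._pending:
            # Local batch exhausted: ask the KGS for a fresh prefix
            prefix = self._fetch_prefix()
            self.kgs_calls += 1
            self._pending = [prefix + ch for ch in URL_SAFE_ALPHABET]
        return self._pending.pop(0)
```

With this allocator, 65 create requests result in only two calls to the KGS: one for the first 64 keys and one for the 65th.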

Availability

In order for the system to be highly available, it is very important to have redundancy/backups for almost all components in the system. Here are some of the things that need to be done.

In the case of our DB, we need to enable replication. There should be multiple slaves for each of the master shard nodes.

For Redis we also need replication.

For data redundancy, we could go multi-region as well. This helps if one of the regions goes down.

Disaster Recovery could also be set up

Alerting and Monitoring

Alerting and monitoring is also a very important non-functional requirement. We should monitor each of our services and set up proper alerts as well. Some of the things that could be monitored are

API Response Time

Memory Consumption

CPU Consumption

Disk Space Consumption

Queue Length

….

Moving closer to user location

There are a couple of architectures that could be followed here. One such is Cell Architecture. You can read more about cell architecture here – https://github.com/wso2/reference-architecture/blob/master/reference-architecture-cell-based.md

Avoiding Single Point of Failures

A single point of failure is a part of the system that, when it stops working, causes the entire system to fail. We should try to prevent any single point of failure in our design. Redundancy and going multi-region help prevent this

Conclusion

This is all about the system design of the Pastebin service. Hope you liked this article. Please share feedback in the comments