Overview
First, let’s look at the definition of Pastebin. A Pastebin service lets you share text and images through a link that can be shared with multiple users. In short, it is a service for sharing data temporarily over a link.
Requirement analysis is a critical part of any system design question. It is divided into two parts.
- Functional Requirements – Defines the Business Requirements
- Non-Functional Requirements – Defines the quality attributes of a system such as performance, scalability, security, etc.
Let’s see some of the functional and non-functional requirements of the Pastebin service.
Functional Requirements
- Users should be able to paste text or upload an image and share it with others through a unique URL generated by the Pastebin service
- Users should be able to set an expiry for a URL. If none is specified, the default expiry is 1 week
- Users can be either logged in or anonymous
- A logged-in user should be able to view the pastes they have created
- Other users can view the pasted text or image whenever they open the paste URL.
Non-Functional Requirements
- The system should be highly durable. A unique URL, once generated, should persist.
- The system should be strongly consistent. This means that once a paste is created, the very next call should be able to return it.
- The system should be highly available
- The system should be fault-tolerant
- There should be no single point of failure
Capacity Estimate
We will do a capacity estimate for three things
- Network Estimation
- Data Estimation
- Traffic Estimation
Traffic Estimation
- Assume the number of daily active users is 200K
- Number of pastes created per day – 200K
- Each paste is read 10 times. Total reads per day – 200K * 10 = 2000K
So our system is more read-heavy than write-heavy.
- Total writes per sec = 200K / (24*60*60) ~ 2.3 requests/sec. Considering peak traffic = 200 requests/sec
- Total reads per sec = 2000K / (24*60*60) ~ 23 requests/sec. Considering peak traffic = 2000 requests/sec
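The average request rates can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and the inputs are the assumptions listed above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

pastes_per_day = 200_000   # assumed writes per day
reads_per_paste = 10       # assumed reads for each paste

reads_per_day = pastes_per_day * reads_per_paste

# Average rates; peak traffic is a separate assumption and is not derived here.
writes_per_sec = pastes_per_day / SECONDS_PER_DAY
reads_per_sec = reads_per_day / SECONDS_PER_DAY

print(f"writes/sec ~ {writes_per_sec:.1f}")
print(f"reads/sec  ~ {reads_per_sec:.1f}")
```

Note the parentheses around the number of seconds in a day: dividing by 24 and then multiplying by 60*60 would give a very different number.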
Data Estimation
- While creating a paste, a user can provide either text or an image. Assume the maximum text size allowed is 5 MB and the maximum image size is 10 MB.
- Also, assume that the ratio of text upload vs image upload is 9:1
Text Upload
- Number of text upload = 180K
- Max size of text upload – 5MB
- Average size of text upload – 10KB
- Total Size per day = 10KB * 180K = 1.8 GB per day
- Assuming no paste ever expires, the total size required for 3 years will be 1.8 GB * 3 * 365 ~= 2 TB
Image Upload
- Number of image uploads = 20K
- Max size of image upload – 10MB
- Average size of image upload – 100KB
- Total Size per day = 100KB * 20K = 2 GB per day
- Assuming no paste ever expires, the total size required for 3 years will be 2 GB * 3 * 365 ~= 2.2 TB
Total Size Needed = 2TB + 2.2TB =~ 4.2 TB for 3 years
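The storage estimate can be reproduced with the sketch below. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 1,000 GB), which the estimate above implicitly uses; the names are illustrative.

```python
KB = 1_000
GB = 1_000_000_000
TB = 1_000 * GB

pastes_per_day = 200_000
text_uploads = pastes_per_day * 9 // 10          # 9:1 text-to-image ratio
image_uploads = pastes_per_day - text_uploads

avg_text_bytes = 10 * KB
avg_image_bytes = 100 * KB
days = 3 * 365                                   # 3 years, no paste expires

text_per_day = text_uploads * avg_text_bytes     # bytes of text per day
image_per_day = image_uploads * avg_image_bytes  # bytes of images per day

text_total = text_per_day * days
image_total = image_per_day * days

print("per day (GB):", text_per_day / GB, image_per_day / GB)
print("3 years (TB):", round(text_total / TB, 2), round(image_total / TB, 2))
print("total (TB):", round((text_total + image_total) / TB, 1))
```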
Database Schema
We need to store each paste created by the user along with the corresponding text or image. As already mentioned, text can be up to 5 MB and an image can be up to 10 MB.
Storing images in the database would be a bad design, as it involves a lot of database IO. Large text should not be stored in the database either. So the below strategy can be adopted for text and image storage
Text Storage
- If the text is less than 10 KB then it can be stored as part of the database.
- If not, the text will be stored in blob storage. We can use Amazon S3 here. The link to the S3 object will be stored in the database
Image Storage
- The image will always be stored in S3
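The placement rule above can be written as a small function. This is a minimal sketch: the function name, the return values, and the choice of 1 KB = 1,024 bytes for the threshold are assumptions made for illustration.

```python
INLINE_TEXT_LIMIT_BYTES = 10 * 1024  # 10 KB threshold (binary KB assumed here)


def choose_storage(paste_type: str, payload: bytes) -> str:
    """Return "database" or "blob" depending on the type and size of the paste."""
    if paste_type == "image":
        return "blob"  # images always go to blob storage
    if paste_type == "text":
        # Small text is stored inline; anything else goes to blob storage.
        return "database" if len(payload) < INLINE_TEXT_LIMIT_BYTES else "blob"
    raise ValueError(f"unknown paste_type: {paste_type!r}")


print(choose_storage("text", b"hello world"))
print(choose_storage("text", b"x" * 20_000))
print(choose_storage("image", b"\x89PNG"))
```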
We don’t have any ACID requirements, so we can use a NoSQL database to store the pastes that are created. Cassandra is one option here.
We will have the below tables
- User Table
- Paste Table
User Table
It will have the below fields
- user_id
- user_name
- password_encrypted
- created
- updated
Paste Table
It will have the below fields
- paste_id – It will be a UUID
- paste_type – It will be either text or image
- text – Populated only if the paste_type is text and the text size is less than 10 KB
- s3_url – Populated in two cases: when the paste_type is text and the text size is 10 KB or more, or when the paste_type is image
- user_id
- created
- updated
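The Paste table fields and the rule for which of text and s3_url is populated can be sketched as a Python dataclass. This is an in-memory illustration, not a Cassandra schema; treating user_id as None for anonymous users is an assumption, and the S3 path in the usage lines is hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Paste:
    paste_type: str                    # "text" or "image"
    user_id: Optional[str] = None      # assumption: None for anonymous users
    text: Optional[str] = None         # inline text for small text pastes
    s3_url: Optional[str] = None       # blob location for large text or images
    paste_id: uuid.UUID = field(default_factory=uuid.uuid4)
    created: datetime = field(default_factory=_now)
    updated: datetime = field(default_factory=_now)

    def __post_init__(self) -> None:
        if self.paste_type not in ("text", "image"):
            raise ValueError("paste_type must be 'text' or 'image'")
        if self.paste_type == "image" and self.s3_url is None:
            raise ValueError("image pastes need s3_url")
        # Exactly one of text / s3_url may be populated.
        if (self.text is None) == (self.s3_url is None):
            raise ValueError("exactly one of text or s3_url must be set")


small = Paste(paste_type="text", text="hello")
image = Paste(paste_type="image", s3_url="s3://example-bucket/pastes/img-1")
print(small.paste_id, image.paste_type)
```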
How Unique URLs will be generated
Since a paste is meant to be created and then shared with other users, it is good to have a short URL so that it is easy to share. So now we will look at how to generate the short URL for a paste. For that, we will have another service called the Key Generation Service, which generates a short key that is then used in the paste URL
High-Level Design
At a high level, let’s discuss what the overall flow will be and which services will exist.
- There will be an API gateway on which every request from all the users will land.
- There will be a Pastebin service. This service holds the responsibility of generating all the paste URLs.
- There will be a Key Generation Service that holds the responsibility of generation of short keys
- The Pastebin service will call the key generation service whenever it needs new keys.
- When the Pastebin service has used up all the keys in its allotted key ranges, it will publish a Kafka/SNS message saying that those key ranges have been exhausted.
- This message will be picked up by a Key Recovery service, which will be a worker. It will mark the key range as free so that it can be picked again. This worker will also delete from the database the pastes created for all the keys in the range.
- We will cache the most recently created pastes, as they are more likely to be shared and accessed soon after they are created
Below is the high-level diagram of the service
Let’s see some of the details of the Paste Bin service and the Key Generation Service
Paste Bin Service
This service will be the interface to all the APIs that exist in our system. The PasteBin service will expose one API to create a paste and another API to read it.
Create Paste – It will interact with the Key Generation Service to get the short keys. These keys will be used for generating the URL. It will then make an entry into the database for that paste. It will also make an entry into the cache.
Read Paste – It will first see if the paste already exists in the cache. If yes then it will return that paste. If not it will fetch the paste from the database.
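The Create Paste and Read Paste flows can be sketched as follows. Plain dictionaries stand in for the database and the cache, and next_key stands in for the call to the Key Generation Service; all of these names are hypothetical. Writing the paste back to the cache after a database read is an extra step assumed here, not something the flow above requires.

```python
from typing import Callable, Dict, Optional


class PasteService:
    def __init__(self, next_key: Callable[[], str]) -> None:
        self._next_key = next_key          # stand-in for the Key Generation Service
        self.db: Dict[str, str] = {}       # stand-in for the paste database
        self.cache: Dict[str, str] = {}    # stand-in for the cache

    def create_paste(self, content: str) -> str:
        key = self._next_key()
        self.db[key] = content             # entry in the database
        self.cache[key] = content          # entry in the cache
        return key

    def read_paste(self, key: str) -> Optional[str]:
        if key in self.cache:              # cache hit
            return self.cache[key]
        content = self.db.get(key)         # cache miss: go to the database
        if content is not None:
            self.cache[key] = content      # assumption: repopulate the cache
        return content


keys = iter(["adcA2a", "adcA2b"])
service = PasteService(next_key=lambda: next(keys))
paste_key = service.create_paste("hello")
print(paste_key, service.read_paste(paste_key))
```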
Making Create and Read Paste more efficient
When the text is larger than a few KBs, or when the paste contains an image, creating and reading the paste through the PasteBin service is a challenge. Here we can do one optimization. We can upload the large text or image directly to the blob storage, which could be Amazon S3 or HDFS. How do we do that?
For Upload
- Let’s say User A wants to create a paste that contains an image. The client will ask the server for a pre-signed URL to which it can upload the image
- The server will respond with a pre-signed URL whose validity can be a few hours. You can read this article to get an idea of the pre-signed URL https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html . Basically, it is a URL that is already signed with a token, so it can be used to upload directly to that URL without requiring any further authentication. This is also called direct upload. The server will also return the image_id here
- The client will upload the image to that URL. It will directly be stored in S3
- Now the client will call the Create Paste API. While calling the Create Paste API, it will pass in the image_id in the request body as well. The server will then have the image_id. Since it has image_id, it will know that the image corresponding to the Paste has already been uploaded to the Blob Storage which is S3 here. It will save the path as a field while saving the paste in DB
With the above approach, the bytes of the image or large text never pass through the PasteBin service, which is an optimization in terms of cost and performance
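The upload steps can be illustrated with a toy model. This is not how S3 signs URLs: the HMAC scheme, the blob.example.com host, the secret, and the in-memory BLOB_STORE are all assumptions made to show the idea that a signed, time-limited URL lets the client upload without going through the PasteBin service. In a real system the pre-signed URL would come from the storage provider's SDK.

```python
import hashlib
import hmac
import uuid
from typing import Dict, Tuple

SECRET = b"demo-secret"            # hypothetical server-side signing key
BLOB_STORE: Dict[str, bytes] = {}  # stand-in for the blob storage


def _sign(image_id: str, expires_at: int) -> str:
    message = f"{image_id}:{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def issue_upload_url(now: int, ttl_seconds: int = 3600) -> Tuple[str, str]:
    """Server side: return (image_id, signed URL) for a direct upload."""
    image_id = uuid.uuid4().hex
    expires_at = now + ttl_seconds
    sig = _sign(image_id, expires_at)
    url = f"https://blob.example.com/{image_id}?expires={expires_at}&sig={sig}"
    return image_id, url


def upload(url: str, data: bytes, now: int) -> bool:
    """Blob storage side: accept the upload only if the URL is valid and unexpired."""
    path, query = url.split("?", 1)
    image_id = path.rsplit("/", 1)[1]
    params = dict(pair.split("=", 1) for pair in query.split("&"))
    expires_at = int(params["expires"])
    valid = hmac.compare_digest(params["sig"], _sign(image_id, expires_at))
    if not valid or now > expires_at:
        return False
    BLOB_STORE[image_id] = data
    return True


def create_paste(image_id: str) -> dict:
    """Server side: the Create Paste call only carries the image_id, not the bytes."""
    if image_id not in BLOB_STORE:
        raise KeyError("image has not been uploaded")
    return {"paste_type": "image", "s3_url": f"blob://{image_id}"}


image_id, url = issue_upload_url(now=1000)
print(upload(url, b"image-bytes", now=1500), create_paste(image_id))
```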
For download
- User A has created a paste that will be shared with User B. So user B will be reading the paste
- While reading, the reverse happens. The PasteBin service fetches the paste from cache or DB.
- Once it fetches the paste, it will see if it contains a large text or image. If yes then it will fetch the S3 location from the DB.
- Then it will generate a pre-signed URL again for that S3 location. The client is returned both the paste information and the pre-signed S3 URL
- Using this Presigned S3 URL, the client can directly download the corresponding big paste or image file.
Key Generation Service
There will be a KGS service that holds the responsibility of generating the keys. First, let’s see what should be the length of each key. Possible options of length are 6,7,8. Only base64 URL-safe characters could be used to generate the key. Therefore
- For 6 – We have 64^6 = ~68.7 billion options
- For 7 – We have 64^7 = ~4.4 trillion (~4400 billion) options
- For 8 – We have 64^8 = ~281 trillion options
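The size of the key space for each length is easy to verify. The sketch below also estimates how long the 6-character space would last at 200K new pastes per day if no key were ever recovered, using the traffic assumption from the capacity estimate.

```python
ALPHABET_SIZE = 64  # URL-safe base64 alphabet: a-z, A-Z, 0-9, '-' and '_'

keyspace = {length: ALPHABET_SIZE ** length for length in (6, 7, 8)}
for length, size in keyspace.items():
    print(length, f"{size:,}")

pastes_per_day = 200_000
days_to_exhaust = keyspace[6] / pastes_per_day
print(f"6-character keys last about {days_to_exhaust / 365:.0f} years without reuse")
```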
We can now assume that 68.7 billion entries will be enough so we can have 6 characters for the key. Now the question is how these are going to be maintained in the Database. If we are storing 68 billion entries in the database then it might be too many entries and a waste of resources.
One option is to store ranges of keys in the databases. We can have a range of 64 where we only store the first five characters which will act as a prefix for all 64 keys which can be generated from this prefix.
Let’s say we have the below prefix
adcA2
Then below 64 keys can be generated from this
- adcA2[a-z] – 26 keys
- adcA2[A-Z] – 26 keys
- adcA2[0-9] – 10 keys
- adcA2[-_] – 2 keys
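Expanding a 5-character prefix into its 64 keys can be sketched like this. The function name is illustrative; the alphabet is the URL-safe base64 character set listed above.

```python
import string
from typing import List

# a-z, A-Z, 0-9, '-' and '_' : 26 + 26 + 10 + 2 = 64 characters
URL_SAFE_ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "-_"
)


def keys_for_prefix(prefix: str) -> List[str]:
    """Return every key that can be formed by appending one character to the prefix."""
    return [prefix + ch for ch in URL_SAFE_ALPHABET]


keys = keys_for_prefix("adcA2")
print(len(keys), keys[:3], keys[-2:])
```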
We can store these ranges in the DB. So for 6 characters, we will have 64^5 (about 1.07 billion) entries overall in the database. The keys will be returned by the Key Generation Service to the PasteBin service in ranges and batches only. The PasteBin service will then use the prefix to generate 64 keys and serve 64 different create paste requests. This is an optimization, as the PasteBin service needs to call the Key Generation Service only when it has exhausted all its 64 keys. So there will be one call from the PasteBin service to the Key Generation Service for every 64 short URLs generated.
Let’s now see the points for the KGS service
- Database Schema
- Which database to use
- How to resolve concurrency issues
- How to recover key_prefix
- What happens if the key ranges are getting exhausted
- What if the paste never expires
- Is not KGS service a single point of failure?
Database Schema
There will be just a single table that stores the ranges of keys, i.e. the prefixes. Below are the fields in the table
- key_prefix
- key_length – It will always be 6 for now. This field exists in case we need keys of length 7 in some scenario
- used – If this is true then the key prefix is currently in use. If false then it is free to be used
- created
- updated
Which Database to Use
We don’t have any ACID requirements, so we can use a NoSQL database. We might also have a very large amount of data to store, for which NoSQL may be better suited. This system will be both write-heavy and read-heavy, so we can use Cassandra here. We can do the capacity estimates for the DB and, based on them, decide on the number of shards we want to have. Each of the shards would be properly replicated as well
There is one more optimization we can do here to improve latency. We can keep the cache filled with free key ranges so that the KGS service picks them from there instead of going to the database every time.
How to resolve concurrency issues
It could very well happen that two requests see the same prefix or range as free. Since there are multiple servers reading from the key DB simultaneously we might have a scenario where two or more servers read the same key as free from the key DB. There are two ways to resolve the concurrency issues we just mentioned
- Two or more servers may read the same key, but only one of them will be able to mark that key_prefix as used in the database. Concurrency is handled at the DB level: each row is locked before being updated, and we can leverage that here. The DB tells the server whether any record was updated. If no record was updated, the server fetches a new key. If the record was updated, that server has got the key.
- The other option is to use a transaction that does Find and Update in one transaction. Each Find and Update will return a unique key_prefix every time. This is probably not a recommended option because of the load it puts on the database
How to recover key_prefix
Once the PasteBin service has exhausted a range of keys, it will enter that range into another table, from which it can be recovered and marked free again after 2 weeks. This assumes that two weeks is the longest expiry a paste can have, so every key in the range is guaranteed to be free by then
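A recovery worker's selection step might look like the sketch below. The dictionary stands in for the table of exhausted ranges, and the function name is hypothetical; the two-week delay is the figure used in this section.

```python
from datetime import datetime, timedelta
from typing import Dict, List

RECOVERY_DELAY = timedelta(weeks=2)  # delay before an exhausted range is reusable


def recoverable_prefixes(exhausted_at: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the prefixes that were exhausted at least RECOVERY_DELAY ago."""
    return sorted(
        prefix
        for prefix, when in exhausted_at.items()
        if now - when >= RECOVERY_DELAY
    )


exhausted = {
    "adcA2": datetime(2024, 1, 1),
    "adcA3": datetime(2024, 1, 10),
}
print(recoverable_prefixes(exhausted, now=datetime(2024, 1, 16)))
```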
What happens if the key ranges are getting exhausted
This will be an unexpected condition. There will be a background worker that checks whether the key ranges are getting exhausted. If so, it can generate ranges for keys of length 7. But how will it know that the key ranges are getting exhausted? To keep a rough count, there could be another table that stores the count of used key ranges.
- Whenever a range is allotted by the KGS to the PasteBin service, it will publish a message that will be picked up by an asynchronous worker, which increments the count of used key ranges by 1.
- Similarly, whenever a range is freed, the counter is decremented.
What if the paste never expires
It is easy to extend the above service to serve pastes that never expire.
- Our short string will not be limited to 6 characters. We can use 7, 8, or even 9 characters as the need arises.
- There will be no key recovery service
- Once a key range has been allotted, we can remove it from the key DB, as it is never meant to be freed or recovered
Is not KGS service a single point of failure?
To prevent this, we will have proper replication of the key database. There will also be multiple app servers for the service itself, with proper autoscaling set up. We can also have disaster recovery in place
Other common components
Other common components could be
- User Service – It holds the user profile information.
- Token/Auth Service – Management of User tokens
- SMS Service- It is used for sending any kind of message back to the user. For example – OTP
- Analytics Service – This could be used to track any kind of analytics
Non-Functional Requirements
Let’s discuss some non-functional requirements now
Scalability
The first thing to consider with the above design is the scalability factor. The scalability of each of the components in the system is very important. Here are scalability challenges you can face and their possible resolutions
- Each machine in the PasteBin service and the KGS service can serve only a limited number of requests. Hence each service should have proper autoscaling set up so that instances are added or removed based on the number of requests
- Your Kafka/SNS system might not be able to take that much load. We can scale horizontally but to a limit. If that is becoming a bottleneck then depending upon the geography or userId we can have two or more such systems. Service discovery could be used to figure out which Kafka system a request needs to go to.
- Another important factor for scalability is that we have designed the system so that none of the services is bogged down with too many things to do. There is a separation of concerns, and wherever a service carried too much responsibility, we have broken it down
Low latency
- We can cache a newly created paste at creation time, with some expiry of course. A paste is most likely to be accessed soon after it is created, so this reduces latency for many of the read calls.
- We also hand out keys in batches, i.e. key ranges. This saves the PasteBin service from calling the KGS service every time and improves the overall latency.
- Free key ranges can also be kept in the cache so that the KGS service picks them from there instead of going to the database every time.
Availability
In order for the system to be highly available, it is very important to have redundancy/backups for almost all components in the system. Here are some of the things that need to be done.
- In the case of our DB, we need to enable replication. There should be multiple replicas for each primary shard node.
- For the cache (Redis, for example), we also need replication.
- For data redundancy, we could go multi-region as well. This helps if one of the regions goes down.
- Disaster Recovery could also be set up
Alerting and Monitoring
Alerting and monitoring are also very important non-functional requirements. We should monitor each of our services and set up proper alerts as well. Some of the things that could be monitored are
- API Response Time
- Memory Consumption
- CPU Consumption
- Disk Space Consumption
- Queue Length
- ….
Moving closer to user location
There are a couple of architectures that could be followed here. One such is Cell Architecture. You can read more about cell architecture here – https://github.com/wso2/reference-architecture/blob/master/reference-architecture-cell-based.md
Avoiding Single Point of Failures
A single point of failure is a part of a system that, when it stops working, causes the entire system to fail. We should avoid any single point of failure in our design. Redundancy and going multi-region help prevent this
Conclusion
This is all about the system design of a Pastebin service. Hope you liked this article. Please share feedback in the comments