How would you design Pastebin

How would you design Pastebin

If you are new to this blog, then read How to ace the system design interview blog first. I have described the guidelines involved in solving a system interview problem. We will be applying those steps here. 

Interview Scenario:

Interviewer: Walks in the room. After the initial introductions, “ I would like to test you on your system design skills”

You: Sure! (Internally you are just hoping that the problem is a easy one :))

Interviewer: Let’s start by designing a system for a service similar to pastebin. How would you go about designing such a system?

You: Ok. Here is what I would do

Then you follow the steps given below.

We will use the following high level steps to solve this problem. You can find the step by step guide to solve such problems here

High Level Steps:

  • Scope the problem and clarify the requirements
  • Do estimations 
  • High Level Design
  • Design core components
    • Define API
  • Detailed design (Depends on the time allocated for the interview)
  • Resolve any bottlenecks in the design
  • Summarize the solution

Step 1: Scope the problem and clarify the requirements

Define the functionality of the product (PasteBin) to the interviewer so that you are aligned. Pastebin is a website where users can paste text and the app generates a unique URL to access the pasted text. The URL is randomly generated by the app. 

The scope is limited to:

  • User enters random text and app generates a random and unique URL
  • User can enter the random URL to view the original content pasted
  • User can be anonymous
  • System is always available
  • System is reliable
  • Unique/ random URL should not be an easily guessable value

Out of scope:

  • User login or account registration
  • Custom URL links by the user

You can proceed to the next step only after confirming these assumptions with the interviewer

If the interviewer challenges you on out of scope requirements, then you can still stick to your script by letting them know that you will revisit the requirements at the end

Step 2: Do estimations

Before starting estimations, you would need to state some base assumptions to kickstart the calculations.

In this case, we are looking at the following assumptions and estimations…………

  • Avg text paste size is 10KB
  • Read to write ratio is 10:1
  • 1 million users
  • 1 million paste a day and 365M paste a year
  • 10 million reads per million writes/pastes
  • Paste per sec = 1M/24*3600 = 12 pastes/sec
  • Read per sec = 10M/24*3600 = 116 reads/sec
  • Data storage – Assume links are stored for 1 year
  • Data stored in 1 day  = 10KB*1M = 10 GB/day
  • Data stored in 1 year = 10GB * 365 = ~4TB data to be stored

You can use the above calculations to create a high level design. 

Step 3: High Level Design

What are the components involved in this design? You have a web server that talks to the end client. The server routes the pasted data into storage. We can use object level storage such as S3 or NoSQL document storage for storing the data. We would also need another database to store the encoded URL values and related data such as timestamp and ip address. 

Pastebin high level system design

User pastes the test. The server generates a unique URL and stores the data in the data storage and the URL data in the other database. When the user invokes a read request, then the server checks the URL against the URL database and if there is a match, then forwards the pasted contents back to the user. 

Step 4: Design core components

How do we generate the unique URL? Solving this piece of the puzzle is critical in the design process. We need to generate the URL string from a known data. We will hash the IP address and the time stamp that we obtained from the user. Applying an algorithm such as MD5 should produce a 128 bit hash value. Next we can encode the hash into Base 62 or 64. 

Base 62 encodes to [a-zA-Z0-9] without any special characters. Base 64 includes special characters. There is only one hash result for the original input and Base 62 is deterministic. We can then pick the first 6 digits out of the hash to use as the URL string.

How many links are estimated if we save the data for 1 year = 365 M pastes a yr. The number of combinations are 62^6 or 64^6. If we pick Base 62 encoding the number of keys generated is 

62^6 = ~57 Billion strings should cover the possible cases that we need. This holds true even if we need to store the URL records for more than 1 yr.

The encoding produces 128/6 = produces 21 characters. We are planning to use 6 characters to define the unique URL. We can pick the first 6 characters and then add that to the URL. We should have a check to check that the URL is not duplicated before sending to the client. If the character string is duplicated then we need to discard the data and regenerate the keys again.

API Definition:

There are two API needed, one to write and one to read. 

For writing, we can use a Write API with the following parameters – pasted data, timestamp, IP address, expiry data.

For read, we can use the ReadAPI with the following parameters – Generated URL string

Step 5: Detailed Design

How do we speed up the design and how do we improve the responsiveness of the design?

We can improve the speed of the design by generating keys offline and then using those keys in real time to generate unique URLs for the pasted text. We can leverage an offline key service that generates the keys in advance and stores them in a database. Everytime we get a new paste request, we can append the 6 digit keys generated to the URL. No need to encode the timestamp or IP address. The encoded keys once used will be stored in a different database. Hence there will be two databases for the keys. One with the new keys and other with the used keys. We can use the MD5 hash on a random value and then use base 62 encoding to generate the keys. We need to allocate storage for the keys. 

What is the storage size for offline keys. We have calculated earlier that the total possible combination of keys is 57 billion strings. There are six characters per key and assuming 1 byte per character

1 Key = 6 Bytes 

57 Billion keys  = 57 *6 = 342GB storage for all possible keys. 

Once we get a read request the server will first check the keys datastore to check if the key is used and then use the key to find the actual pasted data stored in the object database.

We will add a load balancer to the design to improve the responsiveness. Load balancers can be added between the client and the web server. The load balancers can also be added between the server to the data storage. 

Pastebin detailed system level design (Interview Guide)

Step 6: Resolve Bottlenecks

The key generation service is a bottleneck and we will solve it by adding a backup key server that will be a mirror copy of the original server. We will cache 20% of the requests for faster response time. 

Step 7: Summary

Finally summarize the detailed design to the interviewer by going through the flow and confirming that the design meets the initial assumptions and constraints. Acknowledge that the next steps would be to work on excluded scope such as the user login.

Hopefully, this example helps you understand solving system design questions. If you would like me to attempt other questions, then please leave a comment or reach out at [email protected]

0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments