How to design a Google Web Crawler


If you are new to this blog, then read How to ace the system design interview blog first. I have described the guidelines involved in solving a system design interview problem. We will be applying those steps here.

Interview Scenario:

Interviewer: Walks into the room. After the initial introductions, “I would like to test you on your system design skills.”

You: Sure! (Internally you are just hoping that the problem is an easy one :))

Interviewer: Let’s start by assuming that you are a program manager working at Google. You are tasked with creating a new bot to crawl and index the web pages in the world. This is a complex task. I would like to see how you would go about designing such a system. I am also interested in your thoughts about scaling this system.

You: Ok. Here is what I would do

This design question can be asked in multiple ways. For example,

  • How do you design a web crawler?
  • How do you design a crawler bot for Google?

We will use the following high level steps to solve this problem.

High Level Steps:

  1. Scope the problem and clarify the requirements
  2. Do estimations
  3. High-Level Design
  4. Design core components
    1. Define API
  5. Detailed design (Depends on the time allocated for the interview)
  6. Resolve any bottlenecks in the design
  7. Summarize the solution

Step 1: Scope the problem and clarify the requirements

Define the product and service first. What is your startup/ service?

What does a search engine like Google do? Google indexes all the web pages in the world and stores the metadata along with the downloaded pages and media content. Whenever anyone in the world searches for something, Google will serve content relevant to the search term from this storage.

How does Google index and serve the content? There are two main components to this service:

  • A bot that crawls through the entire WWW to find, download, and index all the contents.
  • The clever search algorithm that will understand the context of the search term and find the right URLs and pages to display for the user.

We will be creating a system for the first component here (bot crawler). A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what every webpage on the web is about so that the information can be retrieved when it’s needed. They’re called “web crawlers” because crawling is the technical term for automatically accessing a website and obtaining data via a software program.

The scope is limited to:

  • We will limit the crawler to HTML pages only
  • The system is highly scalable
  • Service is highly available

Out of scope:

  • Indexing media content
  • Search personalization. All searches are anonymous

You can proceed to the next step only after confirming these assumptions with the interviewer.

If the interviewer challenges you on out-of-scope requirements, then you can still stick to your script by letting them know that you will revisit those requirements at the end.

Step 2: Do estimations

Before starting estimations, you would need to state some base assumptions to kickstart the calculations.

In this case, we are looking at the following assumptions and estimations (a quick sanity-check calculation follows the list):

  • Crawl 10 billion pages a month
  • We get 100 billion search queries a month
  • Rate of bot crawl = 10B / (30 × 24 × 3600 s) ≈ 3,900 web pages/sec
  • Storage: we are crawling only HTML pages and assume the average page size is 1 MB
  • Total storage needed is 10B × 1 MB = 10 PB per month
  • Total storage keeps growing by roughly 10 PB every month, so the design must plan for hundreds of PB within a few years
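The arithmetic above can be double-checked with a few lines of Python. This is only a back-of-the-envelope sketch; the page count and the 1 MB average page size are the assumptions stated in the bullets.

```python
# Back-of-the-envelope check of the estimation bullets above.
pages_per_month = 10_000_000_000        # 10 billion crawled pages per month (assumed)
seconds_per_month = 30 * 24 * 3600      # ~2.6 million seconds in a month

crawl_rate = pages_per_month / seconds_per_month
print(f"Crawl rate: ~{crawl_rate:,.0f} pages/sec")      # ~3,858, rounded to ~3,900

avg_page_size_mb = 1                    # assumed average HTML page size
storage_pb_per_month = pages_per_month * avg_page_size_mb / 1_000_000_000  # MB per PB
print(f"Storage: ~{storage_pb_per_month:.0f} PB per month")                # ~10 PB
```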

You can use the above calculations to create a high level design.

Step 3: High Level Design

What are the components involved in this design?

We will be designing the crawler service that crawls through the world wide web hunting for URLs to download and index. The service does not know where to start, so we kickstart the crawl by providing a list of seed URLs to the system. The crawler visits the seed URLs, scans the pages, and starts downloading them. It then parses each downloaded document looking for new URLs; any new URLs found are extracted and stored in the URL storage, while the downloaded pages are stored in the data storage. The crawler service then checks the URL store for new URLs to parse and goes back to its crawl process.

A high level overview of the crawler design is shown below.

Web Crawler System Design
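To make that flow concrete, here is a minimal, single-threaded sketch of the crawl loop in Python. The fetch, extract_links, and store_page callables stand in for the HTML downloader, content extractor, and data storage designed in the following steps; a real crawler would be distributed and run many of these loops in parallel.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, store_page):
    """Single-threaded sketch of the crawl loop: seed -> fetch -> store -> extract -> repeat."""
    url_frontier = deque(seed_urls)   # URL storage, seeded with the start list
    seen = set(seed_urls)             # avoid re-queueing the same URL

    while url_frontier:
        url = url_frontier.popleft()
        html = fetch(url)             # HTML downloader
        if html is None:
            continue
        store_page(url, html)         # data storage
        for link in extract_links(url, html):   # URL extractor
            if link not in seen:
                seen.add(link)
                url_frontier.append(link)       # back into the URL storage
```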

Step 4: Design core components

HTML Downloader:

The HTML downloader will read URLs from the URL storage and start fetching/crawling the corresponding sites. It will then download the relevant pages so that their content can be extracted.
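A bare-bones version of the downloader might look like the sketch below. It uses the third-party requests library, and the user-agent string and timeout are illustrative assumptions, not part of the original design.

```python
import requests

def fetch(url, timeout=5):
    """Fetch a URL and return its HTML, or None if the fetch fails."""
    try:
        resp = requests.get(
            url,
            timeout=timeout,
            headers={"User-Agent": "example-crawler-bot/0.1"},  # placeholder identity
        )
    except requests.RequestException:
        return None
    # Keep only HTML pages; media content is out of scope for this design.
    if resp.status_code == 200 and "text/html" in resp.headers.get("Content-Type", ""):
        return resp.text
    return None
```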

Content Extractor:

The content extractor serves two functions

  1. URL extractor: The service will extract all the links from the downloaded pages. The extractor will also check for duplicate URLs and remove them. The new URL links are then stored in the URL store (a sketch follows this list).
  2. Document Extractor: The service will process the downloaded pages and extract the relevant document info to be stored in the data store. In this design we only deal with document extraction; in a real system, the service should also be capable of extracting media content.
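The URL-extraction half can be sketched with Python's standard-library HTML parser; de-duplication falls out of collecting the links in a set. This is a simplified illustration, not a production parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute, de-duplicated links from a downloaded HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()            # a set removes duplicate URLs automatically

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.add(urljoin(self.base_url, value))

def extract_links(url, html):
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links
```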

Step 5: Detailed Design

We add two more components to the design. One is the URL filter and the other is the URL Queue.

URL Filter: The URL filter will check for redundancy and filter out URLs that should not be downloaded. The service will also check URLs against a blacklist and block malicious ones.
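A simple in-memory version of the filter might look like the sketch below. The blacklist and seen-URL set are assumed inputs; a production crawler would typically use a Bloom filter or a distributed store for the seen-URL check.

```python
from urllib.parse import urlparse

def make_url_filter(blacklist, seen_urls):
    """Build a filter that drops blacklisted hosts and already-seen URLs.

    blacklist -- set of blocked host names (assumed input)
    seen_urls -- set of URLs already queued or downloaded
    """
    def allow(url):
        host = urlparse(url).netloc
        if host in blacklist:      # block blacklisted / malicious hosts
            return False
        if url in seen_urls:       # redundancy check: skip duplicates
            return False
        seen_urls.add(url)
        return True

    return allow
```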

URL Queue: We need a service that reads URLs from the URL storage and queues them for the HTML downloader service. The queue can also include a priority service to decide which URLs should move to the front of the queue for download.
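The queue with its priority service can be sketched with Python's heapq module; the numeric priorities are placeholders (for example, lower numbers for fresher or more important sites).

```python
import heapq
import itertools

class URLQueue:
    """Priority queue feeding the HTML downloader; lower priority value = fetched sooner."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order stable

    def push(self, url, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

queue = URLQueue()
queue.push("https://example.com/news", priority=1)   # high priority
queue.push("https://example.com/archive")            # default priority
print(queue.pop())                                    # -> https://example.com/news
```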

Data Storage:

As we start scaling to accommodate more downloaded pages and media content, we need to plan for storing a large amount of data. Storing it efficiently and scalably is a big challenge, so we should shard the data.

Sharding, at its core, is splitting your data into smaller chunks spread across distinct, separate buckets. A bucket could be a table, a Postgres schema, or a different physical database. Then, as you need to continue scaling, you can move your shards to new physical nodes, thus improving performance.
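As an illustration, a simple hash-based sharding scheme could pick the bucket for each page like this. The shard count and hash function are assumptions; consistent hashing is a common refinement when shards need to be added or removed.

```python
import hashlib

NUM_SHARDS = 16   # assumed number of storage buckets / databases

def shard_for(url):
    """Pick a shard by hashing the URL so pages spread evenly across buckets."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The data store would then write each downloaded page to the database
# selected by shard_for(url).
```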

Web Crawler detailed system design

Step 6: Resolve Bottlenecks

Add redundancy to the design with backup servers. Because of data sharding, we will also have multiple distributed databases, so we need servers that aggregate data back from the different shards. These aggregator servers connect to the application servers. We also need to add load balancers to the design for traffic distribution.

We will add a content distribution network (CDN) to the design for scaling purposes. A content delivery network (CDN) refers to a geographically distributed group of servers that work together to provide fast delivery of Internet content.

A CDN allows for the quick transfer of assets needed for loading Internet content including HTML pages, JavaScript files, stylesheets, images, and videos. The popularity of CDN services continues to grow, and today the majority of web traffic is served through CDNs, including traffic from major sites like Facebook, Netflix, and Amazon.

Step 7: Summary

Finally, summarize the detailed design to the interviewer by going through the flow and confirming that the design meets the initial assumptions and constraints. Acknowledge that the next steps would be to work on the excluded scope, such as indexing media content and search personalization.

Hopefully, this example helps you understand how to solve system design questions. If you would like me to attempt other questions, then please leave a comment or reach out at [email protected]
