Design A Web Crawler - System Design

System requirements

Functional:

+ Create a database of websites with metadata.

Non-Functional:

+ Resilient

+ Consistent

Capacity estimation

Our system should be resilient and should tolerate partial failures.

The system should be able to process billions of documents (websites).
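To make "billions of documents" concrete, here is a rough back-of-the-envelope sketch. Every input below is an illustrative assumption (1 billion pages per crawl cycle, 100 KiB of HTML per page, one full cycle every 30 days); none of these figures come from the requirements.

```python
# All inputs are illustrative assumptions, not figures from the design.
PAGES = 1_000_000_000        # assumed: 1 billion documents per crawl cycle
AVG_PAGE_BYTES = 100 * 1024  # assumed: 100 KiB of HTML per page
CYCLE_DAYS = 30              # assumed: one full crawl every 30 days


def fetch_rate_per_second(pages: int, cycle_days: int) -> float:
    """Average number of pages that must be fetched per second."""
    return pages / (cycle_days * 24 * 60 * 60)


def raw_storage_tib(pages: int, avg_page_bytes: int) -> float:
    """Raw HTML storage for one crawl cycle, in TiB (no compression)."""
    return pages * avg_page_bytes / 1024**4


rate = fetch_rate_per_second(PAGES, CYCLE_DAYS)
storage = raw_storage_tib(PAGES, AVG_PAGE_BYTES)
```

Under these assumptions the crawler needs to sustain roughly 386 fetches per second and hold about 93 TiB of raw HTML per cycle, which is why the work has to be spread across many machines.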

API design

The crawler only kick-starts the job.

The html fetcher has the following functions:

should_fetch(robots_txt: str, is_legal: bool) -> bool:

# Determines whether a website should be fetched and indexed, taking into account its robots.txt rules and whether the website is legal.
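A minimal sketch of `should_fetch`, using the standard library's `urllib.robotparser`. The `url` and `user_agent` parameters are assumptions added for illustration (the signature above does not include them), and `is_legal` is treated as a boolean that some upstream check has already computed.

```python
from urllib.robotparser import RobotFileParser


def should_fetch(
    robots_txt: str,
    is_legal: bool,
    url: str,                              # assumed extra parameter
    user_agent: str = "example-crawler",   # assumed, hypothetical agent name
) -> bool:
    """Return True only if the site is legal and robots.txt allows the URL."""
    if not is_legal:
        return False
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
allowed = should_fetch(robots, True, "https://example.com/public")
blocked = should_fetch(robots, True, "https://example.com/private/x")
```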

extract_urls(website):

# Returns a list of URLs extracted from the `website` HTML.
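One possible way to implement `extract_urls` with only the standard library is sketched below. The `base_url` parameter is an assumption: it is needed to resolve relative links, and it is not part of the signature above. Only `<a href="...">` links are collected here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(website: str, base_url: str) -> list[str]:
    collector = _LinkCollector(base_url)
    collector.feed(website)
    return collector.urls


html = '<a href="/about">A</a><a href="https://other.example/x">B</a><p>no link</p>'
links = extract_urls(html, "https://example.com/")
```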

All of these work items go into a storage system.

The deduper system reads from both storages and removes URLs that are either exactly the same or that look different but point to the same content (for example, URLs that differ only in their fragment).

is_site_same(website1, website2):

Returns True if the two URLs are the same or if their content is the same.
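A minimal sketch of `is_site_same`, assuming each website is represented by a small `Page` record (a hypothetical type holding the URL and the HTML). URLs are compared after dropping the fragment and lowercasing the scheme and host; content is compared with a SHA-256 hash of the whitespace-collapsed HTML, which only catches exact duplicates, not near-duplicates.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class Page:  # hypothetical record type, not defined in the design
    url: str
    html: str


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def content_fingerprint(html: str) -> str:
    """Hash of the HTML with runs of whitespace collapsed."""
    collapsed = " ".join(html.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def is_site_same(website1: Page, website2: Page) -> bool:
    if normalize_url(website1.url) == normalize_url(website2.url):
        return True
    return content_fingerprint(website1.html) == content_fingerprint(website2.html)
```

The path is deliberately not lowercased, since paths can be case-sensitive.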

extract_metadata(website) -> metadata

The metadata extractor takes a website and returns everything we can understand from it.
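As a small illustration of `extract_metadata`, the sketch below pulls out only the page title and any `<meta name="..." content="...">` pairs. What "everything we can understand" means in practice is left open by the design, so the returned dictionary shape is an assumption.

```python
from html.parser import HTMLParser


class _MetadataParser(HTMLParser):
    """Collects the <title> text and named <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            self.metadata[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so append.
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(website: str) -> dict[str, str]:
    parser = _MetadataParser()
    parser.feed(website)
    return parser.metadata


page = '<html><head><title>Shop</title><meta name="description" content="Buy things"></head><body></body></html>'
metadata = extract_metadata(page)
```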

The extracted metadata is then stored in the DB.

Database design

Documents DB

DocumentId

URL

LastUpdatedDate

Metadata DB

PornMetadata

ShoppingMetadata

KnowledgeMetadata
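The tables above could be sketched as follows, using an in-memory SQLite database purely for illustration. Collapsing the per-category metadata into one `metadata` table with a `category` column and a JSON `payload` is an assumption, as are the column types and the helper names.

```python
import sqlite3

# Assumed schema: column names follow the fields listed above,
# everything else (types, constraints) is illustrative.
SCHEMA = """
CREATE TABLE documents (
    document_id       INTEGER PRIMARY KEY,
    url               TEXT NOT NULL UNIQUE,
    last_updated_date TEXT NOT NULL
);
CREATE TABLE metadata (
    document_id INTEGER NOT NULL REFERENCES documents(document_id),
    category    TEXT NOT NULL,   -- e.g. 'ShoppingMetadata', 'KnowledgeMetadata'
    payload     TEXT NOT NULL    -- JSON-encoded metadata
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_document(conn: sqlite3.Connection, url: str, last_updated_date: str) -> int:
    """Insert the document, or refresh its date if the URL already exists."""
    conn.execute(
        "INSERT INTO documents (url, last_updated_date) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET last_updated_date = excluded.last_updated_date",
        (url, last_updated_date),
    )
    row = conn.execute(
        "SELECT document_id FROM documents WHERE url = ?", (url,)
    ).fetchone()
    return row[0]
```

The `UNIQUE` constraint on `url` means a re-crawl updates `LastUpdatedDate` rather than creating a second row for the same document.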

High-level design

At a high level we need two different processing pipelines. One runs less frequently and refreshes items that will probably not change often (like Wikipedia).

The other fetches content that is updated frequently (e.g. news websites, some shopping sites).
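One way to express the split between the two pipelines is a per-pipeline refresh interval. The pipeline names (`"slow"`, `"fast"`) and the intervals (30 days, 1 hour) below are assumptions chosen only to illustrate the idea.

```python
from datetime import datetime, timedelta

# Assumed intervals; the design does not specify actual values.
REFRESH_INTERVALS = {
    "slow": timedelta(days=30),  # rarely changing content, e.g. Wikipedia
    "fast": timedelta(hours=1),  # frequently changing content, e.g. news
}


def next_fetch_time(last_fetched: datetime, pipeline: str) -> datetime:
    return last_fetched + REFRESH_INTERVALS[pipeline]


def is_due(last_fetched: datetime, pipeline: str, now: datetime) -> bool:
    """True if the document should be re-fetched by the given pipeline."""
    return now >= next_fetch_time(last_fetched, pipeline)
```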

Request flows

- Start from the crawler.

- Fetch the website HTML and its related websites (out-links).

- Store the HTML in intermediate storage.

- Run a deduping pipeline so that only items that are actually different get processed.

- Extract metadata from these documents.

- Store the result in a DB.
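The flow above can be sketched as a single function over in-memory stand-ins. This is plain Python rather than a real pipeline framework, the callables passed in (`fetch`, `dedupe_key`, `extract_metadata`) are hypothetical, and following out-links is omitted to keep the example short.

```python
from typing import Callable, Iterable


def run_pipeline(
    seed_urls: Iterable[str],
    fetch: Callable[[str], str],
    dedupe_key: Callable[[str], object],
    extract_metadata: Callable[[str], dict],
    db: dict,
) -> dict:
    # Steps 1-3: fetch each page and keep the HTML in intermediate storage.
    intermediate = {url: fetch(url) for url in seed_urls}

    # Step 4: only keep the first page for each distinct dedupe key.
    seen_keys = set()
    for url, html in intermediate.items():
        key = dedupe_key(html)
        if key in seen_keys:
            continue
        seen_keys.add(key)
        # Steps 5-6: extract metadata and store the result.
        db[url] = extract_metadata(html)
    return db


fake_web = {
    "https://a.example/": "<p>hello</p>",
    "https://b.example/": "<p>hello</p>",  # duplicate content
    "https://c.example/": "<p>other</p>",
}
result = run_pipeline(
    list(fake_web),
    fetch=fake_web.__getitem__,
    dedupe_key=lambda html: html,
    extract_metadata=lambda html: {"length": len(html)},
    db={},
)
```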

Detailed component design

For the pipeline, I would go with Apache Beam. Each box in the high-level diagram corresponds roughly to one step in the pipeline.

For the database, we can go with either a relational or a non-relational store. The requirements do not specify how the data will be read, which would make a difference here.

Trade offs/Tech choices


Failure scenarios/bottlenecks

In the current design, how do we avoid fetching the same website over and over?

Proposal: Use a cache (Redis) with an LRU eviction policy. This will allow us to avoid re-fetching popular websites such as facebook.com.
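To show the eviction behaviour, here is a small in-process LRU cache built on `collections.OrderedDict`. It is a stand-in for illustration only, not a Redis client; with Redis itself, LRU eviction is a server-side setting (for example `maxmemory-policy allkeys-lru`). The class and method names are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class LRUFetchCache:
    """Keeps the most recently used pages, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, url: str) -> Optional[str]:
        if url not in self._entries:
            return None
        # Mark as most recently used.
        self._entries.move_to_end(url)
        return self._entries[url]

    def put(self, url: str, html: str) -> None:
        self._entries[url] = html
        self._entries.move_to_end(url)
        if len(self._entries) > self.capacity:
            # Drop the least recently used entry (front of the ordering).
            self._entries.popitem(last=False)


cache = LRUFetchCache(capacity=2)
cache.put("https://a.example/", "A")
cache.put("https://b.example/", "B")
cache.get("https://a.example/")       # touch A so B becomes the oldest
cache.put("https://c.example/", "C")  # evicts B
```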

How do we avoid recursion?

We can keep a separate table of the visited websites.
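A minimal sketch of the visited-table idea, with an in-memory set standing in for the table and a breadth-first frontier. The `get_out_links` callable and the `max_pages` limit are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seed: str,
    get_out_links: Callable[[str], Iterable[str]],
    max_pages: int = 1000,  # assumed safety limit
) -> list[str]:
    """Visit each URL at most once, even when pages link back to each other."""
    visited = {seed}          # stand-in for the visited-websites table
    frontier = deque([seed])
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_out_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order


# A tiny link graph with cycles: a -> b -> a, c -> a.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
visit_order = crawl("a", lambda url: graph.get(url, []))
```

Marking a URL as visited when it is enqueued, rather than when it is processed, keeps the same URL from entering the frontier twice.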

What happens if one of the machines dies at any of the steps?

Our underlying framework, Apache Beam, should be able to retry failed work. We should also serialize the output of each step so that we can restart the pipeline if any step fails completely.
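The retry-and-serialize idea can be sketched independently of any framework. The helper below is hypothetical: it retries a step a fixed number of times and writes the step's output to a JSON checkpoint file, so a restarted pipeline can skip steps that already completed. It assumes step outputs are JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_step_with_checkpoint(
    name: str,
    step: Callable[[Any], Any],
    inputs: Any,
    checkpoint_dir: str,
    max_attempts: int = 3,  # assumed retry budget
) -> Any:
    checkpoint = Path(checkpoint_dir) / f"{name}.json"
    # On restart, reuse the serialized output instead of recomputing.
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())

    last_error = None
    for _ in range(max_attempts):
        try:
            output = step(inputs)
            break
        except Exception as error:  # retry on any failure in this sketch
            last_error = error
    else:
        raise RuntimeError(
            f"step {name!r} failed after {max_attempts} attempts"
        ) from last_error

    checkpoint.write_text(json.dumps(output))
    return output
```

A production version would likely also write the checkpoint atomically and add a delay between attempts; both are left out here for brevity.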

Future improvements

How can we design an e2e test system that validates the pipeline before we start it?