Design A Web Crawler - System Design

System requirements

Functional:

+ Create a database of websites with metadata.

Non-Functional:

+ Resilient

+ Consistent

Capacity estimation

Our system should be resilient and should tolerate partial failures.

The system should be able to process billions of documents (websites).
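To make "billions of documents" concrete, here is a rough back-of-the-envelope sketch. All three inputs (1 billion pages, 100 KB average page size, a 30-day refresh cycle) are assumptions chosen for illustration; they are not stated in the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(num_pages: int, avg_page_bytes: int, refresh_days: int):
    """Return (raw storage in bytes, average fetches per second)."""
    storage_bytes = num_pages * avg_page_bytes
    fetches_per_second = num_pages / (refresh_days * SECONDS_PER_DAY)
    return storage_bytes, fetches_per_second


# Assumed inputs: 1 billion pages, 100 KB each, refreshed every 30 days.
storage_bytes, fetches_per_second = estimate_capacity(
    num_pages=1_000_000_000,
    avg_page_bytes=100_000,
    refresh_days=30,
)
print(f"raw HTML storage: {storage_bytes / 10**12:.0f} TB")
print(f"average fetch rate: {fetches_per_second:.0f} pages/s")
```

Under these assumptions the raw HTML alone is on the order of 100 TB, and the fetchers need to sustain a few hundred pages per second on average, which is why the work has to be spread over many workers.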

API design

The crawler just kickstarts the job.

The HTML fetcher has the following functions:

should_fetch(robots_txt: str, is_legal: str):

# Determines whether a website should be fetched and indexed, based on its robots.txt and on whether the website is legal.
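A minimal sketch of should_fetch using Python's standard urllib.robotparser. The extra `url` and `user_agent` parameters, the "example-crawler" agent name, and treating `is_legal` as a boolean decided elsewhere are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def should_fetch(robots_txt: str, is_legal: bool, url: str,
                 user_agent: str = "example-crawler") -> bool:
    """Return True only if the site is legal and robots.txt allows the URL."""
    # Assumption: legality is decided by some other component and passed in.
    if not is_legal:
        return False
    parser = RobotFileParser()
    # parse() takes the robots.txt body as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(should_fetch(robots, True, "https://example.com/public/page"))
print(should_fetch(robots, True, "https://example.com/private/page"))
```

The first call is allowed and the second is blocked by the Disallow rule. A real crawler would also cache the parsed robots.txt per host instead of re-parsing it for every URL.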

extract_urls(website):

# Returns the list of URLs (out-links) found in the `website` HTML.

All these work items go into a storage system.
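One possible shape for extract_urls, using only the standard library. The `base_url` parameter is an assumption added here so that relative links can be resolved; only `<a href>` links are collected.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html: str, base_url: str) -> List[str]:
    """Return absolute URLs for all links found in the HTML, in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    # urljoin resolves relative links against the page's own URL.
    return [urljoin(base_url, href) for href in collector.hrefs]


page = '<a href="/about">About</a> <a href="https://other.example/x">X</a>'
print(extract_urls(page, "https://example.com/"))
```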

The deduper system reads from both storages and removes URLs that are either exactly the same or that look different but point to the same content (for example, URLs that differ only in their fragment).

is_site_same(website1, website2):

# Returns True if the two URLs are the same or if their content is the same.
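A minimal sketch of this check. It assumes the deduper has both the URL and the fetched HTML for each site, so the four-argument signature is an assumption. URLs are compared after dropping the fragment, and content is compared by exact SHA-256 hash, which only catches byte-identical pages; near-duplicate detection would need a different technique.

```python
import hashlib
from urllib.parse import urldefrag


def normalize_url(url: str) -> str:
    """Drop the #fragment, which does not change which document is fetched."""
    without_fragment, _fragment = urldefrag(url)
    return without_fragment


def content_fingerprint(html: str) -> str:
    """Exact-match fingerprint of the page body."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def is_site_same(url1: str, html1: str, url2: str, html2: str) -> bool:
    """True if the URLs match after normalization or the content is identical."""
    if normalize_url(url1) == normalize_url(url2):
        return True
    return content_fingerprint(html1) == content_fingerprint(html2)


print(is_site_same("https://example.com/a#top", "<p>x</p>",
                   "https://example.com/a#end", "<p>y</p>"))
```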

extract_metadata(website) -> metadata:

# Takes a website and returns everything we can understand from it.

Then we store the result in the DB.
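What counts as "everything we can understand" is left open, so the sketch below picks just two fields as an assumption: the page title and the meta description. It uses only the standard library.

```python
from html.parser import HTMLParser


class _MetadataParser(HTMLParser):
    """Collects the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # Attribute values can be None when the attribute has no value.
            if (attr_map.get("name") or "").lower() == "description":
                self.description = attr_map.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return a small metadata dict; the chosen fields are illustrative."""
    parser = _MetadataParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "description": parser.description}


sample = ('<html><head><title>Example</title>'
          '<meta name="description" content="A sample page"></head></html>')
print(extract_metadata(sample))
```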

Database design

Documents DB

DocumentId

URL

LastUpdatedDate
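As an illustration only, the three fields above could map to a table like the one below. SQLite is used just so the sketch runs anywhere; the column types, the UNIQUE constraint on the URL, and the helper names are assumptions, not a recommendation for the production store.

```python
import sqlite3


def create_documents_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            document_id INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE,
            last_updated_date TEXT NOT NULL  -- ISO 8601 timestamp
        )
        """
    )


def upsert_document(conn: sqlite3.Connection, url: str, last_updated_date: str) -> None:
    """Update the row for this URL, or insert it if it does not exist yet."""
    cursor = conn.execute(
        "UPDATE documents SET last_updated_date = ? WHERE url = ?",
        (last_updated_date, url),
    )
    if cursor.rowcount == 0:
        conn.execute(
            "INSERT INTO documents (url, last_updated_date) VALUES (?, ?)",
            (url, last_updated_date),
        )


conn = sqlite3.connect(":memory:")
create_documents_table(conn)
upsert_document(conn, "https://example.com/", "2024-01-01T00:00:00Z")
upsert_document(conn, "https://example.com/", "2024-02-01T00:00:00Z")
print(conn.execute("SELECT url, last_updated_date FROM documents").fetchall())
```

Re-crawling the same URL updates LastUpdatedDate in place rather than creating a second document.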

High-level design

At a high level, we need two different processing pipelines. One runs less frequently and refreshes items that will probably not change often (like Wikipedia).

The other fetches content that is updated frequently (e.g. news websites, some shopping sites, etc.).
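One way to express the two pipelines is as two refresh intervals and a check for whether a document is due. The pipeline names and the specific intervals (30 days and 1 hour) are placeholders assumed for this sketch.

```python
from datetime import datetime, timedelta

# Assumed intervals; real values would be tuned per source.
REFRESH_INTERVALS = {
    "slow": timedelta(days=30),   # rarely changing content
    "fast": timedelta(hours=1),   # news, shopping, etc.
}


def is_due(pipeline: str, last_fetched: datetime, now: datetime) -> bool:
    """True if the document should be re-fetched by the given pipeline."""
    return now - last_fetched >= REFRESH_INTERVALS[pipeline]


last_fetched = datetime(2024, 1, 1)
now = datetime(2024, 1, 2)
print(is_due("fast", last_fetched, now))
print(is_due("slow", last_fetched, now))
```

With a one-day gap, the document is due in the fast pipeline but not in the slow one.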

Request flows

-from crawler

-fetch website HTML and related websites (out links)

-store the HTML into some intermediate storage

-run a deduping pipeline to only process items that are indeed different

-extract metadata from these documents

-store the result in a DB
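The flow above can be sketched as a single in-memory loop. The `fetch`, `extract_urls`, and `extract_metadata` callables are injected, the `run_flow` name and the `max_pages` limit are assumptions, and a dict stands in for the final DB. In the real system each step would be a separate pipeline stage with durable storage in between.

```python
import hashlib
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def run_flow(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    extract_urls: Callable[[str, str], List[str]],
    extract_metadata: Callable[[str], dict],
    max_pages: int = 100,
) -> Dict[str, dict]:
    frontier = deque(seeds)          # work items coming from the crawler
    seen_urls = set(frontier)
    seen_hashes = set()
    db: Dict[str, dict] = {}         # stand-in for the documents DB
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        html = fetch(url)            # fetch website HTML
        if html is None:             # partial failure: skip and carry on
            continue
        fetched += 1
        for link in extract_urls(html, url):   # out-links become new work
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:    # dedupe: identical content already seen
            continue
        seen_hashes.add(digest)
        db[url] = extract_metadata(html)       # extract and store
    return db
```

A failed fetch only drops that one URL, which matches the requirement that the system tolerates partial failures.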

Detailed component design

For the pipelines, I would go with Apache Beam. Each box in the high-level diagram maps roughly to one step in the pipeline.

For the database, we can go with either a relational or a non-relational store. The requirements don't specify how the data will be read, and the read pattern is what would drive this choice.

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?