+ Create a database of websites with metadata.
+ Resilient
+ Consistent
Our system should be resilient and should tolerate partial failures.
The system should be able to process billions of documents (websites).
The crawler just kickstarts the job.
The HTML fetcher has the following functions:
should_fetch(robots_txt: str, is_legal: bool):
# Determines whether a certain website should be indexed or not, based on its robots.txt and on whether the website is legal or illegal.
extract_urls(website):
# Returns a list of URLs (out links) extracted from the `website` HTML
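A minimal sketch of the two fetcher functions, using only the Python standard library. The extra `url`, `user_agent`, and `base_url` parameters are assumptions added for illustration (the notes above only name `robots_txt`, `is_legal`, and `website`), and how `is_legal` gets computed is left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def should_fetch(robots_txt: str, is_legal: bool, url: str, user_agent: str = "*") -> bool:
    """Return True only if the site is legal and robots.txt allows `url`.

    `url` and `user_agent` are assumed parameters, not part of the original notes.
    """
    if not is_legal:
        return False
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _LinkCollector(HTMLParser):
    """Collects the raw href values of all <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(website: str, base_url: str) -> list:
    """Return absolute out links found in the `website` HTML.

    `base_url` is an assumed parameter used to resolve relative links.
    """
    collector = _LinkCollector()
    collector.feed(website)
    return [urljoin(base_url, href) for href in collector.hrefs]
```

Possible usage: `should_fetch(robots_txt, True, "https://example.com/page")` before fetching, then `extract_urls(html, "https://example.com/")` on the fetched body.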
All these work items will go into a storage system.
The deduper system reads from both of the storages and removes URLs that are either exactly the same or that look different but point to the same content (for example, URLs that differ only in their fragment).
is_site_same(website1, website2):
# Returns True if the two URLs are the same or if the content is the same.
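One way the deduper check could look, as a sketch. The signature here takes the URL and the HTML of each website separately, which is an assumption (the notes only show `website1, website2`). `normalize_url` and `content_fingerprint` are hypothetical helpers. An exact hash only catches identical content; near-duplicate detection would need a different technique and is not attempted here.

```python
import hashlib
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the fragment and lowercase scheme and host (hypothetical helper)."""
    without_fragment, _fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_fingerprint(html: str) -> str:
    """Hash the HTML after collapsing whitespace (hypothetical helper)."""
    collapsed = " ".join(html.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def is_site_same(url1: str, html1: str, url2: str, html2: str) -> bool:
    """True if the normalized URLs match or the content hashes match."""
    if normalize_url(url1) == normalize_url(url2):
        return True
    return content_fingerprint(html1) == content_fingerprint(html2)
```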
extract_metadata(website): metadata
# The metadata extractor takes a website and returns everything we can understand from it.
Then we will store the result in the DB.
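A small sketch of what `extract_metadata` might return, assuming for illustration that "everything we can understand" starts with the page title and the meta description. The `Metadata` fields are assumptions; the notes do not list which fields are extracted.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class Metadata:
    # Assumed fields; the real extractor would likely return many more.
    title: Optional[str] = None
    description: Optional[str] = None


class _MetadataParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.metadata = Metadata()
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr_map.get("name") or "").lower() == "description":
            self.metadata.description = attr_map.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.metadata.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(website: str) -> Metadata:
    """Parse the HTML and return the fields we could recognize."""
    parser = _MetadataParser()
    parser.feed(website)
    parser.close()
    return parser.metadata
```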
Documents DB
DocumentId
URL
LastUpdatedDate
MetadataDB
PornMetadata
ShoppingMetadata
KnowledgeMetadata
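The two stores above could be modeled roughly as follows. This is only a sketch: linking metadata to a document through `document_id`, and keeping the category-specific fields in a free-form `attributes` dict, are assumptions, since the notes do not specify the columns of the metadata records.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class MetadataKind(Enum):
    # The three metadata categories named in the notes.
    PORN = "PornMetadata"
    SHOPPING = "ShoppingMetadata"
    KNOWLEDGE = "KnowledgeMetadata"


@dataclass
class Document:
    # Mirrors the Documents DB fields: DocumentId, URL, LastUpdatedDate.
    document_id: str
    url: str
    last_updated_date: datetime


@dataclass
class MetadataRecord:
    document_id: str  # assumed link back to Document.document_id
    kind: MetadataKind
    attributes: Dict[str, Any] = field(default_factory=dict)  # assumed free-form payload


doc = Document("doc-1", "https://example.com/item", datetime(2024, 1, 1, tzinfo=timezone.utc))
record = MetadataRecord(doc.document_id, MetadataKind.SHOPPING, {"price": "9.99"})
```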
At a high level we need two different processing pipelines: one that runs less frequently and refreshes items that will probably not change often (like Wikipedia), and another that fetches content that is updated frequently (e.g. news websites, some shopping sites).
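A rough sketch of how a website could be routed to one of the two pipelines. The intervals, the one-day threshold, and the idea of deciding from past observed change intervals are all assumptions made for illustration; the notes do not say how the split is decided.

```python
from datetime import timedelta
from typing import List

# Hypothetical refresh intervals for the two pipelines.
SLOW_REFRESH = timedelta(days=7)
FAST_REFRESH = timedelta(hours=1)
# Hypothetical threshold separating "changes often" from "changes rarely".
CHANGE_THRESHOLD = timedelta(days=1)


def choose_refresh_interval(observed_change_intervals: List[timedelta]) -> timedelta:
    """Pick the fast or slow pipeline from how often the site changed in the past."""
    if not observed_change_intervals:
        # Unknown site: assume it changes often until we learn otherwise.
        return FAST_REFRESH
    total = sum(observed_change_intervals, timedelta())
    average = total / len(observed_change_intervals)
    return FAST_REFRESH if average < CHANGE_THRESHOLD else SLOW_REFRESH
```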
-from crawler
-fetch website HTML and related websites (out links)
-store the html into some intermediate storage
-run a deduping pipeline to only process items that are indeed different
-extract metadata from these documents
-store the result in a db
For the pipeline choices, I would go with Apache Beam. Each square in the high-level diagram corresponds almost directly to a step in the pipeline.
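To make the step list concrete, here is a sketch of the stages chained together in plain Python generators. It deliberately avoids Apache Beam calls so that it stays dependency-free; in Beam each function would presumably become its own transform. `fetch`, `extract`, `html_store`, and `db` are stand-ins passed in by the caller, and following out links is left out.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, Tuple

Page = Tuple[str, str]  # (url, html)


def fetch_stage(urls: Iterable[str], fetch: Callable[[str], str]) -> Iterator[Page]:
    """Fetch the HTML for each URL coming from the crawler."""
    for url in urls:
        yield url, fetch(url)


def store_html_stage(pages: Iterable[Page], html_store: Dict[str, str]) -> Iterator[Page]:
    """Persist raw HTML into intermediate storage and pass it along."""
    for url, html in pages:
        html_store[url] = html
        yield url, html


def dedupe_stage(pages: Iterable[Page]) -> Iterator[Page]:
    """Only let through pages whose content has not been seen yet."""
    seen = set()
    for url, html in pages:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield url, html


def extract_stage(pages: Iterable[Page], extract: Callable[[str], dict]) -> Iterator[Tuple[str, dict]]:
    """Extract metadata from each remaining document."""
    for url, html in pages:
        yield url, extract(html)


def run_pipeline(seed_urls, fetch, extract, html_store: Dict[str, str], db: Dict[str, dict]) -> Dict[str, dict]:
    pages = fetch_stage(seed_urls, fetch)
    pages = store_html_stage(pages, html_store)
    pages = dedupe_stage(pages)
    for url, metadata in extract_stage(pages, extract):
        db[url] = metadata  # final step: store the result in a db
    return db
```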
For the database, we can go with either a relational or a non-relational store. There's no specification of how the data would be read, which would make a difference here.
Explain any trade offs you have made and why you made certain tech choices...
In the current design, how do we avoid fetching the same website over and over?
Proposal: Use a cache (Redis) with an LRU eviction policy. This will allow us to avoid re-fetching popular websites like facebook.com.
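The idea can be sketched with an in-memory stand-in for the cache. `LRUFetchCache` and `fetch_with_cache` are hypothetical names; in the real system Redis would hold the entries and handle eviction itself, so this only illustrates the LRU behavior.

```python
from collections import OrderedDict
from typing import Callable, Optional


class LRUFetchCache:
    """In-memory stand-in for the Redis cache, evicting the least recently used URL."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries = OrderedDict()  # url -> html, oldest first

    def get(self, url: str) -> Optional[str]:
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return self._entries[url]

    def put(self, url: str, html: str) -> None:
        self._entries[url] = html
        self._entries.move_to_end(url)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


def fetch_with_cache(url: str, cache: LRUFetchCache, fetch: Callable[[str], str]) -> str:
    """Return cached HTML when available, otherwise fetch and cache it."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    html = fetch(url)
    cache.put(url, html)
    return html
```

One open point with this approach: cached entries never expire on their own, so frequently updated websites would also need some time-based expiry.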
How do we avoid recursion?
We can keep a separate table of the visited websites.
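A sketch of the visited-table idea, with an in-memory set standing in for the separate table. `crawl`, `get_out_links`, and `max_pages` are hypothetical names used for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed_urls: Iterable[str], get_out_links: Callable[[str], List[str]], max_pages: int = 1000) -> List[str]:
    """Breadth-first traversal that never processes the same URL twice."""
    visited = set()  # stand-in for the separate table of visited websites
    order = []
    frontier = deque(seed_urls)
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue  # already processed: this is what breaks link cycles
        visited.add(url)
        order.append(url)
        for link in get_out_links(url):
            if link not in visited:
                frontier.append(link)
    return order
```

With websites that link to each other in a loop, each one is still visited once, because a URL already in `visited` is skipped when it comes back around.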
What happens if one of the machines dies at any of the steps?
Our underlying framework, Apache Beam, should be able to retry failed work. We should also serialize the output of each step so that we can restart the pipeline from the last completed step if any of the steps fails completely.
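The serialize-and-restart idea could look roughly like this. It is a sketch that does not model Apache Beam's own retry behavior: `run_step_with_checkpoint`, the JSON file layout, and the retry parameters are assumptions for illustration.

```python
import json
import os
import time
from typing import Any, Callable


def run_step_with_checkpoint(
    name: str,
    step: Callable[[Any], Any],
    step_input: Any,
    checkpoint_dir: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> Any:
    """Run `step`, retrying on failure, and persist its output for later restarts."""
    path = os.path.join(checkpoint_dir, name + ".json")
    if os.path.exists(path):
        # A previous run already completed this step: reuse its serialized output.
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            output = step(step_input)
            break
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
            time.sleep(backoff_seconds * attempt)
    else:
        raise RuntimeError("step %r failed after %d attempts" % (name, max_attempts)) from last_error

    # Write to a temporary file first so a crash cannot leave a half-written checkpoint.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(output, f)
    os.replace(tmp_path, path)
    return output
```

Restarting the pipeline then means calling the same steps again with the same `checkpoint_dir`: completed steps return immediately from disk and only the failed step and the ones after it do real work.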
How can we design an e2e system that ensures the pipeline works before we start it?