+ Create a database of websites with metadata.
+ Resilient
+ Consistent
Our system should be resilient and should tolerate partial failures: a failure in one component or on one document should not stop the rest of the work.
The system should be able to process billions of documents (websites).
The crawler just kickstarts the job.
The HTML fetcher has the following functions:
should_fetch(robots_txt: str, is_legal: str):
# Determines whether a website should be fetched and indexed, based on its robots.txt and on whether the website is legal or illegal.
extract_urls(website):
# Returns a list of URLs extracted from the `website` HTML
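A minimal sketch of the two fetcher functions, using only the Python standard library. The `url`, `user_agent` and `base_url` parameters are assumptions added so the example can run, and `is_legal` is treated as a boolean flag, which the notes do not confirm.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def should_fetch(robots_txt: str, is_legal: bool, url: str, user_agent: str = "*") -> bool:
    """Return True only if the site is legal and robots.txt allows fetching `url`."""
    if not is_legal:
        return False
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html: str, base_url: str) -> List[str]:
    """Return the absolute out-link URLs found in `html` (base_url is an assumption)."""
    collector = _LinkCollector()
    collector.feed(html)
    # Resolve relative links against the page's own URL.
    return [urljoin(base_url, href) for href in collector.hrefs]
```

Whether legality is decided inside `should_fetch` or by a separate upstream check is a design choice the notes leave open.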
All these work items will go into a storage system.
The deduper system reads from both storages and removes URLs that are either exactly the same or that differ textually while pointing to the same content (for example, URLs that differ only in their fragment).
is_site_same(website1, website2):
# Returns True if the two URLs are the same or if the content is the same.
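One way `is_site_same` could be sketched: normalize both URLs (drop the fragment, lower-case the scheme and network location) and fall back to comparing a hash of the HTML. The `Website` type and the `normalize_url` helper are hypothetical names introduced for this example, and only exact content matches are detected; near-duplicate detection would need a different technique.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urldefrag, urlsplit, urlunsplit


@dataclass(frozen=True)
class Website:
    """Hypothetical container for a fetched page."""
    url: str
    html: str


def normalize_url(url: str) -> str:
    """Drop the fragment and lower-case the scheme and network location."""
    without_fragment, _fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def content_digest(html: str) -> str:
    """SHA-256 of the raw HTML; catches byte-identical content only."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def is_site_same(website1: Website, website2: Website) -> bool:
    """True if the normalized URLs match or the content is identical."""
    if normalize_url(website1.url) == normalize_url(website2.url):
        return True
    return content_digest(website1.html) == content_digest(website2.html)
```

At the scale of billions of documents, pairwise comparison would not be practical; the deduper would more likely index documents by normalized URL and by digest and look for collisions.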
extract_metadata(website):
# Takes a website and returns its metadata: everything we can understand from it.
Then we store the extracted metadata in the DB.
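As an illustration of what `extract_metadata` might return, the sketch below pulls the page title and the meta description out of the HTML with the standard library parser. The choice of these two fields is an assumption; the notes only say "everything we can understand" from the website.

```python
from html.parser import HTMLParser
from typing import Dict


class _MetadataParser(HTMLParser):
    """Collects the <title> text and named <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: Dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> Dict[str, str]:
    """Return a small metadata dict; the field selection is illustrative only."""
    parser = _MetadataParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
    }
```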
Documents DB
DocumentId
URL
LastUpdatedDate
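If the relational option is chosen, the Documents table could look like the sketch below. SQLite is used here purely as a stand-in so the example runs anywhere; the uniqueness constraint on URL, the ISO-8601 text timestamp and the `upsert_document` helper are all assumptions, not part of the original schema.

```python
import sqlite3
from datetime import datetime, timezone


def create_documents_table(conn: sqlite3.Connection) -> None:
    """Create the Documents table with the three fields listed above."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS Documents (
            DocumentId INTEGER PRIMARY KEY,
            URL TEXT NOT NULL UNIQUE,
            LastUpdatedDate TEXT NOT NULL
        )
        """
    )


def upsert_document(conn: sqlite3.Connection, url: str, updated_at: datetime) -> None:
    """Refresh LastUpdatedDate for a known URL, or insert a new row."""
    timestamp = updated_at.isoformat()
    cursor = conn.execute(
        "UPDATE Documents SET LastUpdatedDate = ? WHERE URL = ?", (timestamp, url)
    )
    if cursor.rowcount == 0:
        conn.execute(
            "INSERT INTO Documents (URL, LastUpdatedDate) VALUES (?, ?)", (url, timestamp)
        )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_documents_table(connection)
    upsert_document(connection, "https://example.com/", datetime.now(timezone.utc))
    print(connection.execute("SELECT * FROM Documents").fetchall())
```

Writing the same URL twice updates the existing row instead of creating a second one, which keeps repeated pipeline runs idempotent.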
At a high level we need two different processing pipelines: one that runs less frequently and refreshes items that probably do not change often (like Wikipedia), and another that fetches content that is updated frequently (e.g. news websites, some shopping sites).
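One possible way to route a site to the slow or the fast pipeline is to look at how often its content changed across past fetches. Everything in this sketch is hypothetical: the pipeline names, the 0.5 threshold and the refresh intervals are placeholders, not values from the design.

```python
from datetime import datetime, timedelta

# Hypothetical refresh intervals for the two pipelines.
REFRESH_INTERVALS = {
    "slow": timedelta(days=30),
    "fast": timedelta(hours=1),
}


def choose_pipeline(changes_seen: int, fetches_made: int, threshold: float = 0.5) -> str:
    """Route a site by the fraction of past fetches that found changed content."""
    if fetches_made == 0:
        # No history yet: start in the fast pipeline and demote later.
        return "fast"
    return "fast" if changes_seen / fetches_made >= threshold else "slow"


def is_due(last_fetched: datetime, now: datetime, pipeline: str) -> bool:
    """True if the site should be refreshed again by its pipeline."""
    return now - last_fetched >= REFRESH_INTERVALS[pipeline]
```

The LastUpdatedDate field of the Documents DB could serve as the `last_fetched` input here.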
-from crawler
-fetch website HTML and related websites (out links)
-store the HTML into some intermediate storage
-run a deduping pipeline to only process items that are indeed different
-extract metadata from these documents
-store the result in a db
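The steps above can be sketched as a single in-memory function. This is plain Python rather than pipeline-framework code, and it is only meant to show the order of the stages: the injected callables (`fetch_html`, `extract_urls`, `extract_metadata`), the `max_pages` bound and the use of dicts as storage are assumptions made for illustration.

```python
import hashlib
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def run_pipeline(
    seed_urls: Iterable[str],
    fetch_html: Callable[[str], Optional[str]],
    extract_urls: Callable[[str], List[str]],
    extract_metadata: Callable[[str], dict],
    max_pages: int = 100,
) -> Dict[str, dict]:
    frontier = deque(seed_urls)          # from crawler
    visited = set()
    html_storage: Dict[str, str] = {}    # intermediate storage

    # Fetch HTML and follow out-links until the page budget is used up.
    while frontier and len(html_storage) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch_html(url)
        if html is None:
            # Partial failure: skip this page and keep going.
            continue
        html_storage[url] = html
        frontier.extend(extract_urls(html))

    # Dedupe by content hash, extract metadata, store the result.
    seen_digests = set()
    documents_db: Dict[str, dict] = {}
    for url, html in html_storage.items():
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        documents_db[url] = extract_metadata(html)
    return documents_db
```

In a real deployment each stage would be a separate distributed step reading from and writing to durable storage, so that a failure in one stage does not lose the work of the previous ones.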
For the pipeline choice, I would go with Apache Beam. Each square in the high-level diagram maps almost directly to a step in the pipeline.
For the database, we can go with either a relational or a non-relational store. There's no specification of how the data would be read, which would make a difference here.
Explain any trade offs you have made and why you made certain tech choices...
Try to discuss as many failure scenarios/bottlenecks as possible.
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?