+ Create a database of websites with metadata.
+ Resilient
+ Consistent
out of scope:
analytics
details on the computed metadata, since that topic can get too big and it is not yet clear how this indexer will be used.
Our system should be resilient and should tolerate partial failures.
The system should be able to process billions of documents (websites).
The crawler just kickstarts the job.
The html fetcher has the following functions:
should_fetch(robots_txt: str, is_legal: str):
# Determines whether a certain website should be fetched and indexed, given its robots.txt and whether the website is legal or illegal.
extract_urls(website):
# Returns a list of URLs (out links) extracted from the `website` HTML.
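A minimal sketch of the two fetcher functions, using only the Python standard library. The extra `url`, `user_agent`, and `base_url` parameters and the boolean `is_legal` are assumptions made for illustration; the notes above do not pin down the exact signatures.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def should_fetch(robots_txt: str, url: str, is_legal: bool, user_agent: str = "*") -> bool:
    """Return True only if the site is legal and robots.txt allows the URL."""
    if not is_legal:
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html: str, base_url: str) -> list[str]:
    """Return absolute out links found in the HTML, in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]
```

Relative links are resolved against `base_url` so that the deduper later sees absolute URLs only.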
All these work items will go into a storage system.
The deduper system reads from both storages and collapses URLs that are either exactly the same or that differ textually but refer to the same resource (for example, URLs that differ only in their fragment).
is_site_same(website1, website2):
# Returns True if the two URLs are the same or if the content is the same.
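One possible reading of `is_site_same`, as a sketch: compare the URLs with fragments stripped, and compare a checksum of the content. Passing the HTML alongside each URL and using SHA-256 are assumptions; the notes only say "same URL or same content".

```python
import hashlib
from urllib.parse import urldefrag


def content_checksum(html: str) -> str:
    """Hex digest used to compare page content byte for byte."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def is_site_same(url1: str, html1: str, url2: str, html2: str) -> bool:
    """True if the URLs match once fragments are dropped, or the content is identical."""
    same_url = urldefrag(url1).url == urldefrag(url2).url
    same_content = content_checksum(html1) == content_checksum(html2)
    return same_url or same_content
```

An exact checksum only catches identical content; near-duplicate detection would need a different technique and is not covered here.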
extract_metadata(website) -> metadata
The metadata extractor takes a website and returns everything we can understand from it.
Then we store the result in the DB.
Documents DB
DocumentId
URL
LastUpdatedDate
MetadataDB
PornMetadata
ShoppingMetadata
KnowledgeMetadata
At a high level we need two different processing pipelines. One refreshes items that will probably not change often (like Wikipedia) and runs less frequently.
The other fetches content that is updated frequently (e.g. news websites, some shopping sites, etc.).
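A rough sketch of how the two pipelines could decide what to re-fetch, based on the `LastUpdatedDate` column of the Documents DB. The pipeline names and the interval values are hypothetical; the notes do not give concrete numbers.

```python
from datetime import datetime, timedelta

# Hypothetical refresh intervals; tune per content type.
REFRESH_INTERVALS = {
    "slow": timedelta(days=30),  # e.g. encyclopedic pages
    "fast": timedelta(hours=1),  # e.g. news, some shopping pages
}


def needs_refresh(last_updated: datetime, pipeline: str, now: datetime) -> bool:
    """True if the document is older than the interval of its pipeline."""
    return now - last_updated >= REFRESH_INTERVALS[pipeline]
```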
-from crawler
-fetch website HTML and related websites (out links)
-store the html into some intermediate storage
-run a deduping pipeline to only process items that are indeed different
-extract metadata from these documents
-store the result in a db
For the pipeline framework, I would go with Apache Beam. Each square in the high-level diagram is roughly one step in the pipeline.
For the database, I would go with a NoSQL system with eventual consistency (like Dynamo), since this is well suited to a write-heavy workload.
The deduper cleans URLs. This makes sure that abc.com#fragment is treated the same as abc.com. The cleaner part makes sure that if a website is non-responsive, we don't move it to the next step.
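The cleaning and filtering described above could look roughly like this. `is_responsive` is a hypothetical callable injected by the caller (for example a health check done by the fetcher), and lowercasing the host is an extra normalization not mentioned in the notes.

```python
from typing import Callable, Iterable
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Drop the fragment and lowercase the host (assumes no user:password@ part)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, parts.query, ""))


def dedupe_and_clean(urls: Iterable[str], is_responsive: Callable[[str], bool]) -> list[str]:
    """Keep one cleaned copy of each URL, and only if the site responds."""
    seen = set()
    kept = []
    for url in urls:
        cleaned = clean_url(url)
        if cleaned in seen:
            continue
        seen.add(cleaned)
        if is_responsive(cleaned):
            kept.append(cleaned)
    return kept
```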
The metadata extractor should be able to handle different types of resources: video, text, image, etc.
I am serializing the output of each step after it completes, since I want to make sure that the pipeline can be restarted if needed. This comes at the expense of additional latency; it's a tradeoff of resilience vs. latency.
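To make the restart idea concrete, here is a small sketch of per-step checkpointing. It uses JSON files in a local directory as a stand-in for the intermediate storage; `run_step` and the file layout are hypothetical names, not part of Apache Beam.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, func, inputs, checkpoint_dir: Path):
    """Run one step, or reload its output if a checkpoint already exists."""
    checkpoint = checkpoint_dir / f"{name}.json"
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())
    outputs = func(inputs)
    # Not atomic: a real system would write to a temp file and rename.
    checkpoint.write_text(json.dumps(outputs))
    return outputs


# Usage: the second call reloads the checkpoint instead of recomputing.
with tempfile.TemporaryDirectory() as tmp:
    first = run_step("upper", lambda urls: [u.upper() for u in urls], ["a.com"], Path(tmp))
    second = run_step("upper", lambda urls: ["recomputed"], ["a.com"], Path(tmp))
```

The extra write and read per step is where the added latency comes from.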
In the current design, how do we avoid fetching the same website over and over?
Proposal: Use a cache (Redis) with an LRU eviction policy. This will allow us to avoid re-fetching popular websites like facebook.com, etc.
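To illustrate what LRU eviction does, here is an in-process stand-in built on `OrderedDict`. It is not the Redis API; the class and method names are hypothetical and only show the eviction behavior.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` pages; evicts the least recently used one."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, url: str):
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url: str, html: str) -> None:
        self._items[url] = html
        self._items.move_to_end(url)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used
```

The fetcher would call `get` first and only hit the network on a miss.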
How do we avoid recursion?
We can keep a separate table of the visited websites.
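A sketch of the visited-table idea, with an in-memory set standing in for the table. `get_links` is a hypothetical callable that returns the out links of a URL.

```python
from collections import deque


def crawl(seed: str, get_links) -> list[str]:
    """Breadth-first crawl that never enqueues a URL twice."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Even if pages link back to each other, each URL is processed once, so the crawl terminates.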
What happens if one of the machines dies at any of the steps?
Our underlying framework, Apache Beam, should be able to retry failed work. We should also serialize the output of each step so that we can restart the pipeline if any of the steps completely fails.
What happens if we can't fetch the HTML for one of the websites?
We add it to the end of the list again and retry periodically, up to a given number of times. If we still can't fetch it, we should log it so that engineers can debug it.
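The re-queue and retry behavior can be sketched as follows. `fetch` is a hypothetical callable that raises on failure, and `max_attempts=3` is an arbitrary default. This version retries as soon as the URL reaches the front of the queue again, with no delay or backoff.

```python
import logging
from collections import deque

logger = logging.getLogger("fetcher")


def fetch_all(urls, fetch, max_attempts: int = 3):
    """Fetch every URL; failed ones go to the back of the queue until they run out of attempts."""
    queue = deque((url, 1) for url in urls)
    pages, failed = {}, []
    while queue:
        url, attempt = queue.popleft()
        try:
            pages[url] = fetch(url)
        except Exception as error:
            if attempt < max_attempts:
                queue.append((url, attempt + 1))
            else:
                logger.error("giving up on %s after %d attempts: %s", url, attempt, error)
                failed.append(url)
    return pages, failed
```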
How can we design an e2e check that validates the pipeline before we start it?
Assuming we identify a list of websites that we don't want to index, how can we improve the system so that the next time we run the pipeline, we don't index those websites?
It's possible that different websites contain the same content. In that case it is probably not worth computing metadata again. We could have a table that maps checksum -> metadata; this way we could save some resources.
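A sketch of the checksum -> metadata table, with a dict standing in for the table. `MetadataStore` is a hypothetical name, SHA-256 is an assumed choice of checksum, and `extract_metadata` is whatever extractor the pipeline uses.

```python
import hashlib


class MetadataStore:
    """Computes metadata once per distinct content checksum."""

    def __init__(self, extract_metadata):
        self._extract = extract_metadata
        self._by_checksum = {}

    def get(self, html: str):
        checksum = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if checksum not in self._by_checksum:
            self._by_checksum[checksum] = self._extract(html)
        return self._by_checksum[checksum]
```

Two URLs that serve byte-identical HTML then share one metadata computation.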