The system must be able to fetch and store content from a list of urls. And extract any other hyperlinks from their content.
The type of content it has to be able to download are files and html files.
Configurability on limits.
We will limit it to only html crawling
Scalability
Performance
Cost Efficiency
Multi-Threaded
100M urls x
10 pages per second per thread
10 threads
=
1M segs
Define what APIs are expected from the system...
URL_PROCESSING_STATUS
// url after dynamic parameter and query parameter processing
simpleUrl: string , primary key
parentUrl: string | null
valid: boolean
depth: number
contentUrl: text
retry: number
lastStep: validation | fetching | parsing
lastStepStarted: timestamp
So the high level workflow is that
Urls go through a pipeline with
1- validation
2- content fetching
3- url parsing
Each consisting of workers, message queues and cron job supervisors and updating the status in the key/value DB.
A batch of urls are sent to an orchestrator service that controls the whole pipeline for enqueueing, stopping, retrying jobs manually and observing them. As well as fetching content or extracting data.
Through a user interface an initial set of urls are fetched, as well as configuration details that are loaded into the orchestrator. These details include limits on depth, retries, total url processing, and domain filtering. These urls go through the pipeline where workers in different steps update a key/value DB and save content to object storage.
The pipeline is made of three steps.
Each step has a cron job supervisor, to check if jobs have timed-out and need re-processing. A message queue so messages are not lost and are not taken more than once by workers. Workers that do the steps, update status and save results.
The validation worker makes sure urls hadn't already been visited and avoids similar but non meaningful urls from being visited, as when dynamic urls generate part of the content and result in infinite content, example a calculator with url parameters for operations. It also enforces configuration settings to make sure we don't visit certain urls, based on depth and other parameters.
The content fetching workers make requests and store resulting content to the object storage.
The parsing workers, load the content from the object storage and extract any new urls that might be needed to be crawled or validated
The supervisors are multiple cron jobs assigned to each step. They take the jobs from the key/value DBs .To split the jobs, deterministic partitioning is done dividing the whole number of timedout jobs, jobs that are marked as unfinished after their timeout has exceeded, and push them back into the queues.
We choose a multi step pipeline approach with workers, rather than doing everything in a single step without message queues, for the processing scalability, additional debugging capabilities, decoupling and reproduceability.
We choose messageQueues vs Kafka as they achieve persistence, single processing of each job, good scalability by horizontal scaling while being lower in cost.
We choose a key/value storage sacrificing uneeded relation querying capabilities or complex objects as in a SQL or documentDB. For their lower cost and easiness of scaling, while being very fit for our use case, keeping track of a URL_PROCESSING_STATUS.
We use cron jobs vs an always running supervisor, to be able to deterministically split the load, scale horizontally and provide predictability. If the jobs where running all the time and not at the same time in fixed intervals, they would not be able to deterministically divide the jobs.
The key/value storage could fail in which case jobs would not be processable until it was back. This would affect only certain shards usually and the supervisors would schedule these back as soon as they are up.
If workers are unavailable the messageQueue keep the persistance.
If the messageQueues go down the key value storage keeps the job status which allows the supervisor to reschedule them once they are back.
If the supervisors themselves fail. The key/value DB state can be used to re-schedule the jobs once back.
In any case, even if the different pieces fully failed, which is harder as all components are horizontally scalable and decoupled. We can recover for errors and re-processing is not done.
We could dive deeper into how the content is fetched and streamed into and from the key/value storage. We could compress and abstract it before storage.
We could involve an analysis step as well as how when we delete content after processing once our web crawler is used for a particular use case.