Design A Web Crawler - System Design

System requirements

Functional:

Allow users to add their website to be crawled on a registration website. They paste a URL to be crawled. By default, we keep the rate at which the website is polled low.
Users will be shown an error when trying to add the same URL twice.
We need to proactively check for up-to-date information on a website.
We will need to be able to search this information we crawled.
We want the search to be relevant, so we'll need to rank these searches according to what we believe to be relevant.
I was told proactively searching the entirety of the world-wide web is not in scope for this. We only need to search the links we're given.
Search needs to update as we receive updates from clients (both the searchers and the services which have signed up to be searched). We want this to be as fast it can be.
Optional: Also giving the option for users to proactively update their website's state by sending us state-updates whenever their site has a major change. They need to determine how often that should be. However, we need to set a policy where we don't allow more than 1 update every 1 hour to prevent network flooding.

Non-Functional:

Search needs to be fast, latency 200-500 ms.
Adding of URLs -- the registration page needs to be fast, but the processing of urls quickly is less important. Can be offloaded to workers.

Capacity estimation

I was told I can expect 500 visits to the URL registration page a day. Of those, about 100 of those users will actually add a URL.

Let's say 2% of those people are looking to delete a URL too.

so in general, the number of updates we get only grows over time.

100 * 365 = 36,500 links per year.

with 10% growth rate, let's say about

40k links this year.

next year, 45k, etc.

in five years can be handling 250k+ links.

That'll be 250k requests we'll have to do repeatedly.

Let's say we limit the updates to be once a day. That's

250k/24 = 10k requests an hour /60 = 200 requests a minute = 4 TPS.

We will need some logic in order to make sure that those 250k requests are spread out evenly though.

And let's say 1% of links decide to implement manual crawling. We implemented that to get users to optimize how often they send us requests. But let's assume the worst case -- that they send us requests once every hour.

250k/100 = 250 req an hour / 60 ~= 5 TPS. So 10 TPS total, which can grow over time.

API design

registerURL(url) to allow for the registration of a URL

throw Exception if URL already registered)

deleteURL(url) remove a url (and if it has children, its children) from the system.

updateURL(url) available to 3Ps, allows for 3Ps to update a URL proactively.

throws ThrottleException if the url is being requested for update too many times in a row.

search(searchTerm) gets list of urls with metadata.

Database design

IndexScheduleDB - stores all the URLS which are scheduled to be crawled. Will have

ScheduledUrls

triggerTime -- useful for retrieving the entry when querying (stored in UTC).

recurrenceTime -- time to wait in between updates. defaults to full day in seconds. Used to calculate the next triggerTime.

url - url to crawl

timeLastCrawled - possibly optional. To make sure we're not crawling too soon.

Wrapped by a RedisCache to make the retrieval of items for a triggerTime quick.

BackLinksDB - GraphDB. Represents each url and the websites which link to it. Useful for determining the ranking algorithm.

SearchDB

Vector database. This is written to by recommendation algorithm and contains the space in which SearchService will draw urls from.

URLMetadata

{

url

description

thumbnail

metadata

}

URLMetadata is written to by the IndexerService and stores the metadata related to an indexed url.

High-level design

The design should work like this.

All of these services will be horizontally scalable, behind load balancers in an ECS Fargat instance:

URLRegistrationService - This service, behind a load balancer, takes in requests from registration frontend to add a url to the crawled DB.
ScheduleDeterminationService - This Service is what determines exactly when the user's URL should be crawled. It takes in a URL, the time in which the URL was requested to be called, and calls IndexScheduleDB. On the high level it will call IndexScheduleDB and get a range of times (will add a blockTriggerTime field to the IndexScheduleDB or also shard the DB to separate by time blocks) to get the number of requests at a given time. That's the naive implementation, but actually you can optimize that by adding a caching layer in front of the DB. This will store the number of items in each triggerTime and use that metric in order to determine the desired scheduleDetermination time within a given timeframe. That will have near constant performance since there will only be 86400 seconds to choose from, and the determiner should already have a time range it's searching for. Once it finds a bucket which doesn't have too many schedules and also isn't too off in the future, it'll make a write to IndexSchedule DB to save this time.
IndexScheduleDB - stores the schedules to be triggered at any given point in time, as in the urls that should be crawled at X time. This is written to whenever a user makes a request to register their Url.
There is CrawlerSchedulerService. Every 1 second (X seconds), this service will read IndexScheduleDB for data on what entries have a TriggerTime. If the TriggerTime matches, the service will first second off a request to crawl that page to SQS queue. Then send a request to SQS queue to update IndexScheduleDB.
SQS Queue + CrawlerService - runs processes to crawl pages coming into its SQS queue. Connects to Redis Cache which stores urls it has crawled in the last hour. If a url which was fanned out while crawling is found, it skips that URL.
1. If an outbound link is found, write that to BacklinkDB
IndexerService - Takes the results of CrawlerService and generates an index of each page, including metadata to help with search such as a vector representation of the page, description, icon etc. Stores in IndexDB
PageRankService - Takes the results of IndexerService + reads the BacklinkDB + page age + recentUserInteraction to generate the page's ranked weight so that when it's retrieved during a search, it has a higher chance of showing up. Writes these weights to a vectorDB/k clusting datab ase called searchDB.
SearchDB - allows for search. Searches searchDB, gets top urls and matches with indexDB to get metadata to return search results.

Request flows

RegisterURL -> url -> API Gateway + Rate Limiter -> Load Balancer -> ScheduleDeterminationService -> IndexScheduleDB <-- polled by CrawlerSchedulerService --> SQS Queue -> CrawlerService -> writes to BackLinkDB, reads RedisCache to check for crawling duplicates -> SQS -> IndexerService -> writes to MetadataDB/IndexDB + sends request to SQS -> PageRankService (spawns workers) -> SearchDB
UpdateURL -> url -> Rate Limiter -> API Gateway -> LB -> ScheduleDeterminationService (rest is same flow as RegisterURL)
DeleteUrl -> API Gateway -> LB -> DeleteUrlService -> Publishes event to delete URL -> SQS queue to ScheduleDeterminationService which takes in that event and deletes from TriggerDB + SQS to UrlMetadataDeletionService which makes a call to SearchDB to delete URL and IndexDB to delete index and backlinkDB to delete link.
Search -> search term -> LB -> searchService takes request, reads searchDB, then returns result.

Detailed component design

For assigning a schedule time -- We can use Bucket sort + hashing to pick the ideal trigger time.

We have a RedisCache store buckets for certain intervals -- 1 hour + 30 minutes + 1 minute + 1 second for the scheduling.

ScheduleDeterminationService should for example, look for a random time in the next twelve hours. Once it finds a random time, it could use the RedisCache to get how many triggers live at that time. If the number is high, then we pick the next available time.

Actually, since the time is only restricted to one day, let's have the redisCache store a key {timeMap}: {jsonMap of all the times and how many respective triggers there are in that bucket}. That way we only need to do one DB call to a cache in order to determine the optimal result.

The purpose of doing this by the way is to get an even distribution of times in the trigger. We could also just to a random number (which would give an even distribution over time as well) and if SLA to update a given page is NOT a priority, then that is actually what I would recommend since it would be faster. But this is very fast as well and allows for SLA to first update to be shorter.

The search results could use something called K clustering. This is borrowed from concepts in machine learning. Imagine you have a multidimensional space with a number of data points. this can be n dimensions, but let's pretend it's only three for easy visualization. let's say the three are (location,

Trade offs/Tech choices

Choosing to use SQS queue for CrawlerService and IndexService (event-based) vs. processing all in an API call.

Pros:

Allows for long processing of heavy operations (both of these operations are very heavy).
Async calls allowed.

Cons:

Means updates are not immediate.

Considering the use cases and the amount of traffic we're getting, this is a worth-it tradeoff. Crawling does not have to be immediate -- the user isn't even on our service when the pages are crawled, they wouldn't notice if the page was crawled now or in an hour.

Failure scenarios/bottlenecks

Too many requests pile up in CrawlerService
- Increase pollers, scale up service.
Search fails, either the DB fails or the service fails
- Service fails - scale service
- DB fails - Use read replication DBs to add redundancy. Use Cache in front of DB to cache frequently searched terms.
Outdated or Missing urls:
- Allow users to request updates proactively (updateURL flow).

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?