I was told I can expect 500 visits to the URL registration page a day. Of those, about 100 users will actually add a URL.
Let's say 2% of those people are looking to delete a URL too.
Deletions are far rarer than additions, so in general the number of URLs we have to keep updated only grows over time.
100 * 365 = 36,500 links per year.
With a 10% growth rate, let's say about
40k links this year,
45k next year, etc.
In five years we could be handling 250k+ links.
That'll be 250k requests we'll have to do repeatedly.
Let's say we limit updates to once a day. That's
250k / 24 ~= 10.4k requests an hour; / 60 ~= 174 requests a minute; / 60 ~= 2.9 TPS. Call it 4 TPS to leave some margin.
We will need some logic in order to make sure that those 250k requests are spread out evenly though.
And let's say 1% of links decide to implement manual crawling. We implemented that to get users to optimize how often they send us requests. But let's assume the worst case -- that they send us requests once every hour.
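250k / 100 = 2,500 links, each sending one request an hour = 2,500 requests an hour / 3600 ~= 0.7 TPS. Added to the 4 TPS of scheduled updates, that is about 5 TPS total; planning for 10 TPS leaves room for this to grow over time.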
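The estimate above can be rechecked with a few lines of arithmetic. This is only a sketch: the constants come from the notes, while the helper names (`projected_links`, `tps`) are made up for illustration, and growth is assumed to compound once a year on the number of new links.

```python
# Back-of-envelope capacity check using the figures from the notes.
LINKS_PER_DAY = 100      # users who actually add a URL each day
GROWTH_RATE = 0.10       # assumed yearly growth in new links
YEARS = 5
PLANNED_LINKS = 250_000  # rounded-up planning figure used in the notes
MANUAL_SHARE = 0.01      # share of links that use manual crawling


def projected_links(links_per_day: int, growth: float, years: int) -> int:
    """Total links accumulated after `years`, with yearly additions growing by `growth`."""
    yearly = links_per_day * 365
    total = 0.0
    for _ in range(years):
        total += yearly
        yearly *= 1 + growth
    return round(total)


def tps(requests: float, period_seconds: int) -> float:
    """Average requests per second if `requests` are spread evenly over the period."""
    return requests / period_seconds


total_links = projected_links(LINKS_PER_DAY, GROWTH_RATE, YEARS)
scheduled_tps = tps(PLANNED_LINKS, 24 * 3600)            # one update per link per day
manual_tps = tps(PLANNED_LINKS * MANUAL_SHARE, 3600)     # worst case: one request per hour

print(f"links after {YEARS} years: {total_links:,}")
print(f"scheduled: {scheduled_tps:.2f} TPS, manual: {manual_tps:.2f} TPS")
```

Starting from the unrounded 36,500 links a year, the projection lands a little under the 250k planning figure, and both rates are averages that only hold if the requests really are spread out evenly.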
registerURL(url) allows for the registration of a URL.
throws Exception if the URL is already registered.
deleteURL(url) removes a URL (and, if it has children, its children) from the system.
updateURL(url) available to 3Ps, allows 3Ps to update a URL proactively.
throws ThrottleException if the URL is being requested for update too many times in a row.
search(searchTerm) gets a list of URLs with metadata.
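As a rough illustration of how the registration, deletion, and throttled-update calls could behave, here is an in-memory sketch. Several things are assumptions rather than part of the design: the `UrlRegistry` class, the `parent` argument used to model child URLs, the `is_registered` helper, and the choice to interpret "too many times in a row" as a minimum interval between accepted updates. search is left out.

```python
import time
from typing import Optional


class ThrottleException(Exception):
    """Raised when a URL is asked to update more often than allowed."""


class UrlRegistry:
    """In-memory stand-in for the API; a real service would sit on a database."""

    def __init__(self, min_update_interval_s: float = 3600.0, clock=time.monotonic):
        self._min_interval = min_update_interval_s  # hypothetical throttle window
        self._clock = clock                         # injectable so it can be faked
        self._children = {}                         # url -> set of child urls
        self._last_update = {}                      # url -> time of last accepted update

    def is_registered(self, url: str) -> bool:
        return url in self._children

    def register_url(self, url: str, parent: Optional[str] = None) -> None:
        if url in self._children:
            raise ValueError(f"{url} is already registered")
        if parent is not None and parent not in self._children:
            raise KeyError(f"parent {parent} is not registered")
        self._children[url] = set()
        if parent is not None:
            self._children[parent].add(url)

    def delete_url(self, url: str) -> None:
        if url not in self._children:
            raise KeyError(f"{url} is not registered")
        stack = [url]
        while stack:  # remove the URL and everything beneath it
            current = stack.pop()
            stack.extend(self._children.pop(current, ()))
            self._last_update.pop(current, None)
        for kids in self._children.values():  # detach from any surviving parent
            kids.discard(url)

    def update_url(self, url: str) -> None:
        if url not in self._children:
            raise KeyError(f"{url} is not registered")
        now = self._clock()
        last = self._last_update.get(url)
        if last is not None and now - last < self._min_interval:
            raise ThrottleException(f"{url} was updated too recently")
        self._last_update[url] = now


# Usage with a fake clock so the throttle window can be stepped by hand.
now = [0.0]
registry = UrlRegistry(min_update_interval_s=3600, clock=lambda: now[0])
registry.register_url("https://example.com")
registry.register_url("https://example.com/docs", parent="https://example.com")
registry.update_url("https://example.com")  # first update is accepted
```

A second `update_url` call for the same URL before the clock advances by an hour raises `ThrottleException`; deleting the parent also removes `https://example.com/docs`.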
IndexScheduleDB - stores all the URLs which are scheduled to be crawled. Will have a
ScheduledUrls table with these fields:
triggerTime -- useful for retrieving the entry when querying (stored in UTC).
recurrenceTime -- time to wait in between updates. Defaults to a full day in seconds. Used to calculate the next triggerTime.
url -- URL to crawl.
timeLastCrawled -- possibly optional. Used to make sure we're not crawling too soon.
Wrapped by a RedisCache to make the retrieval of items for a triggerTime quick.
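To make the relationship between triggerTime and recurrenceTime concrete, here is a small sketch of a ScheduledUrls entry. The class and method names (`ScheduledUrl`, `is_due`, `mark_crawled`) are illustrative, and skipping over missed periods is an assumption about the desired behaviour, not something the design states.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class ScheduledUrl:
    url: str
    trigger_time: datetime                       # stored in UTC
    recurrence_seconds: int = SECONDS_PER_DAY    # defaults to a full day
    time_last_crawled: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        return now >= self.trigger_time

    def mark_crawled(self, now: datetime) -> None:
        """Record the crawl and move trigger_time to the next future slot."""
        self.time_last_crawled = now
        if now >= self.trigger_time:
            step = timedelta(seconds=self.recurrence_seconds)
            # Skip any missed periods so the next trigger is strictly after `now`.
            periods = (now - self.trigger_time) // step + 1
            self.trigger_time += periods * step


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
entry = ScheduledUrl(url="https://example.com", trigger_time=t0)
if entry.is_due(t0 + timedelta(minutes=5)):
    entry.mark_crawled(t0 + timedelta(minutes=5))
print(entry.trigger_time.isoformat())
```

Keeping the trigger on its original time-of-day slot, rather than setting it to "now plus one day", preserves whatever even spread the scheduler chose when the URL was registered.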
BackLinksDB - GraphDB. Represents each URL and the websites which link to it. Useful as an input to the ranking algorithm.
SearchDB
Vector database. This is written to by the recommendation algorithm and contains the space from which SearchService will draw URLs.
URLMetadata
{
url
description
thumbnail
metadata
}
URLMetadata is written to by the IndexerService and stores the metadata related to an indexed url.
The design should work like this.
All of these services will be horizontally scalable, behind load balancers, running on ECS Fargate:
For assigning a schedule time -- we can use bucket sort + hashing to pick the ideal trigger time.
We have a RedisCache that stores buckets for certain intervals -- 1 hour + 30 minutes + 1 minute + 1 second for the scheduling.
ScheduleDeterminationService should, for example, look for a random time in the next twelve hours. Once it finds a random time, it can use the RedisCache to get how many triggers live at that time. If the number is high, then we pick the next available time.
Actually, since the time is only restricted to one day, let's have the RedisCache store a key {timeMap}: {jsonMap of all the times and how many respective triggers there are in that bucket}. That way we only need one call to the cache in order to determine the optimal result.
The purpose of doing this, by the way, is to get an even distribution of trigger times. We could also just pick a random number (which would give an even distribution over time as well), and if the SLA to update a given page is NOT a priority, then that is actually what I would recommend since it would be faster. But this is very fast as well and allows the SLA for the first update to be shorter.
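The timeMap lookup described above could be sketched roughly as follows. A JSON string stands in for the value held in the RedisCache, and the one-minute bucket size, the `cap` threshold for "the number is high", and the function names are all assumptions made for the example. A real cache would also need the read-and-increment to be atomic, which this sketch ignores.

```python
import json
import random

BUCKETS_PER_DAY = 24 * 60  # one-minute buckets (assumed granularity)


def pick_bucket(time_map_json, now_bucket, window=12 * 60, cap=200, rng=random):
    """Pick a trigger bucket within the next `window` minutes.

    Starts at a random offset and probes forward (wrapping inside the window)
    until it finds a bucket holding fewer than `cap` triggers.
    """
    counts = {int(k): v for k, v in json.loads(time_map_json).items()}
    candidates = [(now_bucket + i) % BUCKETS_PER_DAY for i in range(window)]
    start = rng.randrange(window)
    for step in range(window):
        bucket = candidates[(start + step) % window]
        if counts.get(bucket, 0) < cap:
            return bucket
    # Every bucket is at the cap: fall back to the least-loaded one.
    return min(candidates, key=lambda b: counts.get(b, 0))


def reserve(time_map_json, bucket):
    """Return an updated timeMap with one more trigger in `bucket`."""
    counts = json.loads(time_map_json)
    key = str(bucket)
    counts[key] = counts.get(key, 0) + 1
    return json.dumps(counts)


# Minute 600 (10:00 UTC) is already full, so it is never chosen.
time_map = json.dumps({"600": 200, "601": 3})
bucket = pick_bucket(time_map, now_bucket=600)
time_map = reserve(time_map, bucket)
print(bucket)
```

Because the whole map arrives in one read, the probing happens in memory; the only extra round trip is the write that records the reservation.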
The search results could use something called K clustering. This is borrowed from concepts in machine learning. Imagine you have a multidimensional space with a number of data points. This can be n dimensions, but let's pretend it's only three for easy visualization. Let's say the three are (location,
Choosing to use SQS queue for CrawlerService and IndexService (event-based) vs. processing all in an API call.
Pros:
Cons:
Considering the use cases and the amount of traffic we're getting, this is a worthwhile tradeoff. Crawling does not have to be immediate -- the user isn't even on our service when the pages are crawled, so they wouldn't notice whether the page was crawled now or in an hour.
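The event-based shape of that tradeoff can be illustrated with Python's standard `queue` module standing in for the SQS queue. Nothing here talks to AWS, and `schedule_crawl`, `fake_crawl`, and `drain` are hypothetical names: the point is only that the API side returns as soon as the job is enqueued, while a separate worker does the crawling later.

```python
import queue

crawl_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for the SQS queue


def schedule_crawl(url: str) -> str:
    """API-side handler: enqueue the job and return instead of crawling inline."""
    crawl_queue.put(url)
    return "accepted"


def fake_crawl(url: str) -> dict:
    """Hypothetical placeholder for the work CrawlerService would do."""
    return {"url": url, "status": "crawled"}


def drain(max_messages: int = 10) -> list:
    """Worker-side loop: pull up to `max_messages` jobs and process them."""
    results = []
    for _ in range(max_messages):
        try:
            url = crawl_queue.get_nowait()
        except queue.Empty:
            break
        results.append(fake_crawl(url))
        crawl_queue.task_done()
    return results


status = schedule_crawl("https://example.com")  # returns right away
processed = drain()                             # runs later, on the worker
print(status, processed)
```

The caller never waits on the crawl itself, which matches the point that nobody is watching for the result in real time.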
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?