robots.txt file to only crawl areas of the website that are allowed to be crawled.
Initial estimation is 100 * 10^6 new URLs daily
HTML content is 100 KB per page
Media is 500 KB per image
5 media items per page
100 pages per URL
With these numbers we can estimate QPS and size requirements:
1) QPS = 100 * 10^6 / (24 * 60 * 60) ≈ 1157 QPS
2) Peak QPS = QPS * 2 ≈ 2300 QPS
3) Total number of pages = 100 * 10^6 * 100 = 10 billion pages per day
4) Total number of pages * 100 KB / 1024 / 1024 / 1024 ≈ 931 TB of storage for HTML
5) Total number of pages * 5 * 500 KB / 1024 / 1024 / 1024 ≈ 23283 TB of storage for media
6) If one server can hold 1 TB, we need about 932 servers for HTML and about 23284 servers for media
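The arithmetic can be rechecked with a few lines of Python. This is a minimal sketch: the default values are the assumptions listed above (100 KB of HTML per page, 500 KB per media item, 5 media items per page, 100 pages per URL), the function name `estimate` is hypothetical, and the peak factor of 2 is taken from the peak QPS line.

```python
# Back-of-envelope capacity estimate; defaults mirror the stated assumptions.
SECONDS_PER_DAY = 24 * 60 * 60
KB_PER_TB = 1024 ** 3  # KB -> MB -> GB -> TB


def estimate(new_urls_per_day=100 * 10**6, pages_per_url=100,
             html_kb_per_page=100, media_kb_per_item=500, media_per_page=5):
    """Return average/peak QPS and daily storage needs in TB."""
    qps = new_urls_per_day / SECONDS_PER_DAY
    total_pages = new_urls_per_day * pages_per_url
    return {
        "qps": qps,
        "peak_qps": 2 * qps,  # peak assumed to be twice the average
        "total_pages": total_pages,
        "html_tb": total_pages * html_kb_per_page / KB_PER_TB,
        "media_tb": total_pages * media_per_page * media_kb_per_item / KB_PER_TB,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

Changing `html_kb_per_page` or `media_kb_per_item` shows how sensitive the storage totals are to the assumed page and media sizes.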
GET /crawled/html
GET /crawled/media
POST /url
POST /url/batch
POST /url/schedule
POST /url/filter
POST /url/robots
POST /rate
GET /settings
URL <--Page
URL : +String URL PK FK
URL : +int WorkerId
URL : +DateTime LastUpdated INDEX
Page <--Media
Page : +String URL PK FK
Page : +String Html
Page : +int MediaKey
Page : +DateTime LastUpdated INDEX
Media : +int MediaKey PK FK
Media : +String MediaLink
Media : +DateTime LastUpdated INDEX
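For illustration, the three entities could be mirrored in application code as plain records. This is only a sketch: the field names follow the diagram, while the use of Python dataclasses, the snake_case naming and the concrete types are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Url:
    url: str                 # primary key
    worker_id: int           # worker that owns this URL
    last_updated: datetime   # indexed in storage


@dataclass
class Page:
    url: str                 # primary key, refers to Url.url
    html: str
    media_key: int           # refers to Media.media_key
    last_updated: datetime   # indexed in storage


@dataclass
class Media:
    media_key: int           # primary key
    media_link: str
    last_updated: datetime   # indexed in storage
```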
1) Rate limiter - limits the number of user requests and the maximum number of submitted URLs per user using a leaky bucket or queue-based algorithm
2) URL frontier - a service that manages the queues of URL downloaders and parsers
3) HTML downloader - downloads a URL from the Internet; contacts the DNS cache and the DNS service to resolve the URL to an IP address
4) Content parser - parses web page content and extracts media and links
5) Content filter - filters content based on user rules such as adult-only or country-banned content
6) DuplicateHashSet - a set that checks whether we have already seen a page; used to reject duplicates
7) HTML storage - a blob or document storage for web pages
8) URL extractor - extracts URLs from a web page
9) URL duplicate hash set - a set that checks whether we have already seen a URL; used to reject already seen URLs so they are not crawled twice
10) URL storage - stores already seen URLs
11) Media storage - a blob storage for media
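The rate limiter in item 1 could be sketched as a leaky bucket used as a meter. This is a minimal, single-process illustration, not the service itself: the class name `LeakyBucket`, its parameters and the injectable `clock` are assumptions made for the example.

```python
import time


class LeakyBucket:
    """Leaky bucket used as a meter: requests fill it, time drains it."""

    def __init__(self, capacity, leak_rate_per_sec, clock=time.monotonic):
        self.capacity = capacity            # maximum burst size
        self.leak_rate = leak_rate_per_sec  # sustained allowed rate
        self.clock = clock                  # injectable for testing
        self.level = 0.0
        self.last = clock()

    def allow(self, cost=1):
        """Return True if a request of the given cost fits in the bucket."""
        now = self.clock()
        # Drain the bucket for the time elapsed since the last call.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + cost > self.capacity:
            return False  # bucket would overflow: reject the request
        self.level += cost
        return True
```

A batch submission of N URLs can be charged as `allow(cost=N)`, which covers the per-user limit on submitted URLs with the same mechanism.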
1) A user submits a set of seed (starting) URLs to crawl
2) Rate limiter checks whether the URLs are within the limit
3) On success, the rate limiter passes the URLs to the URL frontier
4) URL frontier computes the priority of the link based on the keywords inside the link, then chooses a queue worker among the HTML downloaders based on hash(link) % workerCount
5) The URL is passed to the HTML downloader worker, which issues a network request to DNS (if the DNS cache is empty), downloads the contents, parses them with ContentParser and filters them with ContentFilter
6) Parsed content is passed to the DuplicateHashSet, which computes the hash of the contents and checks whether we have already seen the web page. If yes, and the difference between download dates is smaller than the predefined parameter, we pass the page further; if not, the page is rejected as a duplicate
7) Page HTML content is saved in HtmlStorage
8) Page media content is saved in MediaStorage
9) Page URLs are extracted and passed to the UrlDuplicateHashSet, which computes the hash of each link and checks whether we have seen the URL. If we have not, the URL is passed further; if we have, the URL is rejected
When extracting new URLs, the system respects robots.txt and ignores URLs that it disallows
10) The URL is saved in URL storage
11) Newly extracted URLs are passed back to the URL frontier recursively to continue crawling
12) Notification service notifies the client about operation completion
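Steps 4 and 9 can be illustrated with a short sketch. The helper names (`stable_hash`, `priority`, `pick_worker`, `UrlDuplicateHashSet.add_if_new`) and the keyword-count priority are assumptions for the example. SHA-256 is used instead of Python's built-in `hash()` because string hashes from `hash()` are randomized per process, which would send the same link to different workers after a restart.

```python
import hashlib


def stable_hash(text):
    """Process-independent 64-bit hash of a string."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def priority(url, keywords):
    """Hypothetical priority: number of keywords that occur in the link."""
    lowered = url.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)


def pick_worker(url, worker_count):
    """Step 4: choose a downloader queue as hash(link) % workerCount."""
    return stable_hash(url) % worker_count


class UrlDuplicateHashSet:
    """Step 9: remember hashes of seen URLs and reject repeats."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True if the URL was not seen before, False otherwise."""
        h = stable_hash(url)
        if h in self._seen:
            return False
        self._seen.add(h)
        return True
```

Storing only the hash keeps the set small, at the cost of a very small chance that two different URLs collide and one of them is skipped.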
1) URL frontier service is responsible for scheduling the workers that crawl the Internet for pages. It consists of the following:
2) For DNS, every worker has a local DNS cache. On a cache miss, the worker contacts a global DNS cache shared by all workers. If that also misses, the worker contacts a DNS server
3) When a URL is fully downloaded, the notification service notifies the user about success
To notify the user we use a long polling strategy, because HTML downloading is a long-running process and we do not need real-time updates
For the queuing system we can use Apache Kafka with priority computation or our own solution
Care must be taken when multiple workers access shared resources such as the global DNS cache or the DuplicateSets. For ease of implementation, and because we do not expect much contention, we choose a straightforward locking mechanism for the whole service.
3) To scale page and media storage we can shard the storage by region or by IP address range
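The DNS lookup order (local cache, then shared cache, then DNS server) combined with a single coarse lock could look roughly like this. It is a sketch under assumptions: the class names are hypothetical, the shared cache is modeled as an in-process dictionary, and `socket.gethostbyname` stands in for the DNS service.

```python
import socket
import threading


class GlobalDnsCache:
    """Cache shared by all workers, guarded by one coarse lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def get(self, host):
        with self._lock:
            return self._entries.get(host)

    def put(self, host, ip):
        with self._lock:
            self._entries[host] = ip


class WorkerDnsResolver:
    """Per-worker resolver: local cache, then global cache, then DNS."""

    def __init__(self, global_cache, dns_lookup=socket.gethostbyname):
        self.local = {}                  # private to this worker, no lock
        self.global_cache = global_cache
        self.dns_lookup = dns_lookup     # injectable for testing

    def resolve(self, host):
        ip = self.local.get(host)
        if ip is None:
            ip = self.global_cache.get(host)
            if ip is None:
                # Two workers may race here and both query DNS;
                # the duplicate lookup is harmless, so no extra locking.
                ip = self.dns_lookup(host)
                self.global_cache.put(host, ip)
            self.local[host] = ip
        return ip
```

Real entries would also need expiry based on the DNS record's TTL, which is omitted here.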
1) If a web crawler fails in the middle of a download, we can restart the worker with the same URLs
2) In the case of data overflow we can spawn more workers
3) The DFS algorithm is not used because link depth can be very large on the Internet; instead we use BFS
4) To prevent overly deep recursive links we can limit the URL length, or maintain the set of already seen URLs for every worker and check whether we have already seen one
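Items 3 and 4 can be combined in a small BFS loop. This is a sketch only: `get_links` is a hypothetical callable that returns the links found on a page, and the default `max_url_length` is an arbitrary placeholder.

```python
from collections import deque


def crawl_bfs(seeds, get_links, max_url_length=2048):
    """Visit URLs breadth-first, skipping seen and overly long URLs."""
    seen = set()
    queue = deque()
    for url in seeds:
        if url not in seen and len(url) <= max_url_length:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue:
        url = queue.popleft()          # FIFO order gives BFS
        visited.append(url)
        for link in get_links(url):
            if len(link) > max_url_length:
                continue               # likely a recursive link trap
            if link in seen:
                continue               # already queued or crawled
            seen.add(link)
            queue.append(link)
    return visited
```

For example, with an in-memory link graph that contains a cycle and one very long link, the loop terminates and visits each page once:

```python
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["x" * 50]}
order = crawl_bfs(["a"], lambda u: graph.get(u, []), max_url_length=10)
```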
1) Add a more sophisticated filter for the page content, e.g. use ML to filter pages
2) Log the time a worker took to process a page and add the data to the monitoring subsystem, to monitor system performance and to support an adaptive strategy for different hosts