RESTful API
/api/addURL
/api/getNextURL
/api/crawlPage
/api/getCrawledData
/api/removeURL
/api/getMediaForURL
URL table: urlId, crawlId
CrawledData table: urlId, crawlId, metadata, cssContent, mediaId, htmlContent
Media table: mediaId, crawlId, mediaUrl, mediaType
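The three tables above could be sketched as a SQLite schema. This is only an illustration: the column names come from the notes, but the table name Url, the column types, the primary keys, and the helper names create_schema and table_columns are assumptions.

```python
import sqlite3

# Column names are taken from the notes; types and keys are assumptions.
SCHEMA = """
CREATE TABLE Url (
    urlId   INTEGER PRIMARY KEY,
    crawlId INTEGER
);
CREATE TABLE CrawledData (
    urlId       INTEGER,
    crawlId     INTEGER,
    metadata    TEXT,
    cssContent  TEXT,
    mediaId     INTEGER,
    htmlContent TEXT
);
CREATE TABLE Media (
    mediaId   INTEGER PRIMARY KEY,
    crawlId   INTEGER,
    mediaUrl  TEXT,
    mediaType TEXT
);
"""

def create_schema(conn):
    """Create the three tables on an open SQLite connection."""
    conn.executescript(SCHEMA)

def table_columns(conn, table):
    """Return the column names of a table, in declaration order."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

For a quick check, create_schema(sqlite3.connect(":memory:")) builds the tables in an in-memory database, and table_columns lists what was created.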
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Optimization Strategies:
Rate limiting strategies
Generally, a web crawler should avoid sending too many requests to the same host within a short period. Sending too many requests is considered “impolite” and may even be treated as a denial-of-service (DoS) attack. For example, without any constraint, a crawler could send thousands of requests per second to the same website and overwhelm its web servers.
The general idea of enforcing politeness is to download one page at a time from the same host, optionally adding a delay between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to downloader (worker) threads. Each downloader thread has its own FIFO queue and only downloads URLs obtained from that queue.
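A minimal sketch of the hostname-to-worker mapping described above, assuming a single process and a caller-supplied clock. The class name PoliteFrontier, the round-robin host assignment, and the delay_seconds parameter are hypothetical choices made for illustration, not part of the original design.

```python
from collections import deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """Routes every URL of a host to one worker's FIFO queue (illustrative)."""

    def __init__(self, num_workers, delay_seconds=1.0):
        self.queues = [deque() for _ in range(num_workers)]
        self.host_to_worker = {}   # hostname -> worker index
        self.next_allowed = {}     # hostname -> earliest time of next download
        self.delay_seconds = delay_seconds

    def add_url(self, url):
        """Enqueue a URL on the queue of the worker that owns its host."""
        host = urlsplit(url).hostname
        if host not in self.host_to_worker:
            # Assumption: new hosts are assigned to workers round-robin.
            self.host_to_worker[host] = len(self.host_to_worker) % len(self.queues)
        worker = self.host_to_worker[host]
        self.queues[worker].append(url)
        return worker

    def next_url(self, worker, now):
        """Return the worker's next URL, or None if idle or still in the delay window."""
        queue = self.queues[worker]
        if not queue:
            return None
        host = urlsplit(queue[0]).hostname
        if now < self.next_allowed.get(host, 0.0):
            # Strict FIFO: the worker waits rather than skipping ahead in its queue.
            return None
        self.next_allowed[host] = now + self.delay_seconds
        return queue.popleft()
```

Because all URLs of a host land in one queue and a worker handles one URL at a time, no host is hit by two downloads at once. In a real crawler each worker thread would call next_url with time.monotonic() and sleep when it gets None.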
How to maintain Freshness?
Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep the data set fresh. Recrawling all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:
Explain any trade offs you have made and why you made certain tech choices...
BFS and problems with BFS
Breadth-first search (BFS) is commonly used by web crawlers and is typically implemented with a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:
Standard BFS does not take the priority of a URL into account. The web is large, and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs by PageRank, web traffic, update frequency, and so on.
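One way to address the priority problem is to replace the plain FIFO queue with a priority queue. The sketch below assumes each URL arrives with a numeric priority score already computed elsewhere (for example from PageRank or traffic); the class name PriorityFrontier and the scoring scale are hypothetical.

```python
import heapq
import itertools

class PriorityFrontier:
    """URL frontier that dequeues the highest-priority URL first (illustrative)."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority URLs keep FIFO order.
        self._counter = itertools.count()

    def add_url(self, url, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def next_url(self):
        """Return the highest-priority URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

With equal priorities everywhere this degrades to the FIFO behavior of standard BFS, so the priority score is the only thing that changes the crawl order.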
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?