Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...
Crawler workers: multi-threaded
Message queue, including a dead letter queue (for failures such as HTTP error status codes, expired TTLs, etc.)
Frontier: seed website URLs are loaded into the queue
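Each crawler worker resolves DNS and checks robots.txt to see whether it may fetch from the site; if it may, it fetches the page and saves the HTML in the DB
Then it moves on to the next URL in the queue
If the site is not found or the fetch otherwise fails, the URL is added to the dead letter queue
If the worker would have to wait on a site because of DNS or a robots.txt wait time, it skips ahead to the next URL in the queue that is not for the same site
Repeat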
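
The worker flow above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not a definitive implementation: it runs one worker in-process, uses a deque as a stand-in for the message queue, a dict for the DB, and a list for the dead letter queue. The names crawl, fetch, is_allowed, crawl_delay, and FetchError are hypothetical, and sending robots.txt-disallowed URLs to the dead letter queue is an assumption the notes do not state.

```python
import time
from collections import deque
from urllib.parse import urlparse


class FetchError(Exception):
    """Hypothetical error raised by the injected fetch function."""


def crawl(seed_urls, fetch, is_allowed, crawl_delay=1.0,
          now=time.monotonic, sleep=time.sleep):
    """Run one worker's loop until the frontier is empty.

    fetch(url) returns the HTML or raises FetchError.
    is_allowed(url) stands in for the DNS/robots.txt check.
    Returns (pages, dead_letters).
    """
    frontier = deque(seed_urls)   # stand-in for the message queue
    pages = {}                    # stand-in for the DB: url -> html
    dead_letters = []             # stand-in for the DLQ: (url, reason)
    next_ok = {}                  # host -> earliest time we may hit it again
    deferred = 0                  # consecutive URLs skipped while waiting

    while frontier:
        url = frontier.popleft()
        host = urlparse(url).netloc

        if not is_allowed(url):
            # Assumption: disallowed URLs are parked in the dead letter queue.
            dead_letters.append((url, "disallowed"))
            deferred = 0
            continue

        if next_ok.get(host, float("-inf")) - now() > 0:
            # Host is still in its wait window: requeue and try another site.
            frontier.append(url)
            deferred += 1
            if deferred >= len(frontier):
                # Every remaining URL is waiting; sleep for the shortest wait.
                waits = (next_ok[urlparse(u).netloc] - now() for u in frontier)
                sleep(max(0.0, min(waits)))
                deferred = 0
            continue

        deferred = 0
        next_ok[host] = now() + crawl_delay
        try:
            pages[url] = fetch(url)
        except FetchError as exc:
            # Site not found or other fetch failure goes to the DLQ.
            dead_letters.append((url, str(exc)))

    return pages, dead_letters
```

In a real deployment, fetch would wrap an HTTP client, is_allowed could be backed by something like urllib.robotparser, and several threads would pull from a shared queue rather than a local deque. Injecting now and sleep keeps the loop testable without real waiting.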
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.