"Web traps" and other problems can be controlled by a breadth-first crawling strategy.
A very simple and reliable web crawler can be built as something that takes a list of in-urls, fetches the pages, parses them, and emits a list of out-urls.
The out-urls from that phase become the in-urls of the next phase.
If you do this, you'll find the size of each generation gets larger for a while, but eventually it starts shrinking; look at what's being crawled at that point and you'll find it is almost all junk, like online calendars that link forward into the future forever.
At some point you stop crawling, maybe between generations 8 and 12, and you've dodged the web-trap bullet without even trying.
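A minimal sketch of the generational approach, with a hypothetical `fetch_and_parse` standing in for real fetching and parsing (all names and the toy link graph here are assumptions, not from the original; the "calendar" pages simulate a web trap that links forward forever):

```python
# Hypothetical stand-in for fetch + parse: returns the links found on a page.
# "calendar/N" pages are a web trap: each one links to "calendar/N+1" forever.
def fetch_and_parse(url):
    if url.startswith("calendar/"):
        month = int(url.split("/")[1])
        return [f"calendar/{month + 1}"]
    graph = {
        "home": ["about", "blog", "calendar/1"],
        "about": ["home"],
        "blog": ["post1", "post2"],
    }
    return graph.get(url, [])

def crawl(seeds, max_generations=10):
    """Breadth-first crawl by generations: each generation's out-urls
    become the next generation's in-urls. Stopping after a fixed number
    of generations bounds how deep any trap can pull the crawler."""
    seen = set(seeds)
    in_urls = list(seeds)
    generation_sizes = []
    for _ in range(max_generations):
        if not in_urls:
            break
        generation_sizes.append(len(in_urls))
        out_urls = []
        for url in in_urls:
            for link in fetch_and_parse(url):
                if link not in seen:      # dedup across generations
                    seen.add(link)
                    out_urls.append(link)
        in_urls = out_urls
    return seen, generation_sizes
```

Running `crawl(["home"])` on this toy graph shows the pattern described above: the generations grow at first, then shrink to a trickle of trap pages, and the generation cap cuts the trap off without any trap-specific logic.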