"Web traps" and other problems can be controlled by a breadth-first crawling strategy.

A very simple and reliable web crawler can be built from one component that takes a list of in-urls, fetches the pages, parses them, and emits a list of out-urls.

The out-urls from that phase become the in-urls of the next phase.

If you do this, you'll find the size of each generation grows for a while, but eventually it starts shrinking, and if you inspect the late generations you'll find that almost everything being crawled is junk -- like online calendars that link forward into the future forever.

At some point you stop crawling, maybe between generation 8 and 12, and you've dodged the web trap bullet without even trying.
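The generational loop described above can be sketched in a few dozen lines of standard-library Python. This is an illustrative toy, not production code: the `max_generations` cap, the `LinkParser` helper, and the error handling are all assumptions, and a real crawler would add politeness delays, robots.txt checks, and deduplication by canonical URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkParser(HTMLParser):
    """Collects absolute hrefs from <a> tags, resolved against a base URL."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

def crawl(seed_urls, max_generations=10):
    """Breadth-first crawl: each generation's out-urls seed the next.

    Stopping after a fixed number of generations (8-12 in the comment
    above) is what sidesteps web traps like infinite calendars.
    """
    seen = set(seed_urls)
    in_urls = list(seed_urls)
    for generation in range(max_generations):
        out_urls = []
        for url in in_urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip unreachable pages
            parser = LinkParser(url)
            parser.feed(html)
            for link in parser.links:
                if link not in seen and urlparse(link).scheme in ("http", "https"):
                    seen.add(link)
                    out_urls.append(link)
        print(f"generation {generation}: {len(in_urls)} in, {len(out_urls)} out")
        in_urls = out_urls  # out-urls become the next phase's in-urls
        if not in_urls:
            break
    return seen
```

In practice you would watch the per-generation counts printed by the loop: once they peak and start falling, the remaining frontier is mostly trap pages, so a hard cap around generation 8-12 loses little of value.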


