Hacker News
Ask YC: Existing Crawlers in Python
13 points by groovyone on Sept 9, 2008 | 7 comments
Hi there. Hope you don't mind me posting. We're trying to create a system for analyzing web pages and classifying them. We have done the classification using CRM114 (thanks for the link we were passed previously), but now we're looking to build a reliable, fast and robust crawler. We have gone through Twisted and created something basic, but the question has been raised: what is out there that has already been tested, doesn't overload sites, conforms to robots.txt, and can work across multiple servers? We've looked at Pyro for the multiple-server element (which looks fine) but we're struggling a little. I thought I'd ask here if anyone has pointers to a great, compact Python crawler we could use?

Thanks in advance

Neil
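The politeness requirements in the question (honoring robots.txt, not overloading sites) can be sketched with just the standard library's urllib.robotparser. This is a minimal sketch, not a production crawler: the fetch function is injected, and a hypothetical in-memory "site" stands in for real HTTP so it runs without network access; a real version would plug in urllib.request or Twisted.

```python
import time
import urllib.robotparser
from collections import deque

def crawl(seed, fetch, robots_txt, delay=1.0, max_pages=100):
    """Breadth-first crawl that honors robots.txt and rate-limits requests.

    `fetch(url)` must return (html_text, list_of_links); injecting it keeps
    the sketch testable without touching the network.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    seen, queue, pages = set(), deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen or not rp.can_fetch("*", url):
            continue  # already visited, or disallowed by robots.txt
        seen.add(url)
        html, links = fetch(url)
        pages[url] = html
        queue.extend(l for l in links if l not in seen)
        time.sleep(delay)  # be polite: pause between requests
    return pages

# Hypothetical toy "site" so the sketch runs offline.
site = {
    "http://example.com/": ("index", ["http://example.com/a",
                                      "http://example.com/private/x"]),
    "http://example.com/a": ("page a", []),
    "http://example.com/private/x": ("secret", []),
}
robots = "User-agent: *\nDisallow: /private/\n"
pages = crawl("http://example.com/", lambda u: site[u], robots, delay=0)
print(sorted(pages))  # the /private/ page is skipped per robots.txt
```

The per-host delay and robots check are the two behaviors the question asks for; distributing the queue across servers (e.g. via Pyro) would sit on top of this loop.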



Does it have to be Python? I'm sure you can use any webcrawler to actually crawl, and use Python to analyze the results.

Nutch (http://lucene.apache.org/nutch/) is a project to create a search engine, with a big crawling component. You can also find a list of crawlers here: http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawler...


+1 for Nutch. We're using it at my startup HubSpot and it has worked well for us.



I would also take a look at Heritrix (http://crawler.archive.org/) -- it's what powers the Wayback Machine.


Thanks for the plug!

As a developer of Heritrix, I can't honestly say it's compact or Python, but it is well-behaved, highly customizable (both by settings and by many Java extension points), and capable of high-volume crawling for many purposes.

You could also embed Python code via Jython with a little work, if necessary.


wget. I integrated it with a C# application just fine. It outputs the pages to files and produces a nice parsable crawl log. It is single-threaded, but unless you're crawling Wikipedia, you won't have a problem. Small tools that work well are a good start. I had problems with many of the multi-threaded crawlers; they seemed to trip over themselves. wget was fast and rock solid.
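If wget appeals, it can be driven from Python the same way: build the command line and hand it to subprocess. A hedged sketch (the flags are real wget options; the URL and output directory are placeholders, and the actual run is left commented out since it would hit the network):

```python
import subprocess

def wget_command(url, dest, wait=1, level=2):
    """Build a polite recursive wget invocation (pass it to subprocess.run)."""
    return ["wget", "--recursive", f"--level={level}",
            f"--wait={wait}",            # seconds between requests, to avoid overloading
            "-e", "robots=on",           # honor robots.txt (wget's default in recursive mode)
            "--no-parent",               # don't ascend above the seed URL
            "--directory-prefix", dest,  # where mirrored pages land
            url]

cmd = wget_command("http://example.com/", "crawl-out")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually crawl
```

The downloaded tree and wget's log can then be fed into the Python classification step, keeping the crawl and the analysis decoupled.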


Thanks for these. They both look 'high end' rather than small and customizable, but I'll check them out. Thanks for the tips and links



