Hacker News
Ask YC: Existing Crawlers in Python
13 points by groovyone on Sept 9, 2008 | 7 comments
Hi there. Hope you don't mind me posting. We're trying to create a system for analyzing web pages and classifying them. We have done the classification using CRM114 (thanks for the link we were passed previously), but now we're looking to build a reliable, fast and robust crawler. We have gone through Twisted and created something basic, but the question has been raised: what is out there that has already been tested, doesn't overload sites, conforms to robots.txt, and can work across multiple servers? We've looked at Pyro for the multiple-server element (which looks fine) but we're struggling a little. I thought I'd ask here if anyone has pointers to a great, compact Python crawler we could use?

Thanks in advance

Neil
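The politeness requirements in the question (honoring robots.txt, not overloading sites) can be sketched with just the standard library's urllib.robotparser. This is a minimal sketch, not a production crawler: the fetch function is injected, and a hypothetical in-memory "site" stands in for real HTTP so it runs without network access; a real version would plug in urllib.request or Twisted.

```python
import time
import urllib.robotparser
from collections import deque

def crawl(seed, fetch, robots_txt, delay=1.0, max_pages=100):
    """Breadth-first crawl that honors robots.txt and rate-limits requests.

    `fetch(url)` must return (html_text, list_of_links); injecting it keeps
    the sketch testable without touching the network.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    seen, queue, pages = set(), deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen or not rp.can_fetch("*", url):
            continue  # already visited, or disallowed by robots.txt
        seen.add(url)
        html, links = fetch(url)
        pages[url] = html
        queue.extend(l for l in links if l not in seen)
        time.sleep(delay)  # be polite: pause between requests
    return pages

# Hypothetical toy "site" so the sketch runs offline.
site = {
    "http://example.com/": ("index", ["http://example.com/a",
                                      "http://example.com/private/x"]),
    "http://example.com/a": ("page a", []),
    "http://example.com/private/x": ("secret", []),
}
robots = "User-agent: *\nDisallow: /private/\n"
pages = crawl("http://example.com/", lambda u: site[u], robots, delay=0)
print(sorted(pages))  # the /private/ page is skipped per robots.txt
```

The per-host delay and robots check are the two behaviors the question asks for; distributing the queue across servers (e.g. via Pyro) would sit on top of this loop.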



Does it have to be Python? I'm sure you can use any webcrawler to actually crawl, and use Python to analyze the results.

Nutch (http://lucene.apache.org/nutch/) is a project to create a search engine, with a big crawling component. You can also find a list of crawlers here: http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawler...


+1 for Nutch. We're using it at my startup HubSpot and it has worked well for us.



I would also take a look at Heritrix (http://crawler.archive.org/) -- it's what powers the Wayback Machine.


Thanks for the plug!

As a developer of Heritrix, I can't honestly say it's compact or Python, but it is well-behaved, highly customizable (both by settings and by many Java extension points), and capable of high-volume crawling for many purposes.

You could also embed Python code via Jython with a little work, if necessary.


wget. I integrated it with a C# application just fine. It outputs the pages to files and produces a nice parsable crawl log. It is single-threaded, but unless you're crawling Wikipedia, you won't have a problem. Small tools that work well are a good start. I had problems with many of the multi-threaded crawlers; they seemed to trip over themselves. wget was fast and rock solid.
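If wget appeals, it can be driven from Python the same way: build the command line and hand it to subprocess. A hedged sketch (the flags are real wget options; the URL and output directory are placeholders, and the actual run is left commented out since it would hit the network):

```python
import subprocess

def wget_command(url, dest, wait=1, level=2):
    """Build a polite recursive wget invocation (pass it to subprocess.run)."""
    return ["wget", "--recursive", f"--level={level}",
            f"--wait={wait}",            # seconds between requests, to avoid overloading
            "-e", "robots=on",           # honor robots.txt (wget's default in recursive mode)
            "--no-parent",               # don't ascend above the seed URL
            "--directory-prefix", dest,  # where mirrored pages land
            url]

cmd = wget_command("http://example.com/", "crawl-out")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually crawl
```

The downloaded tree and wget's log can then be fed into the Python classification step, keeping the crawl and the analysis decoupled.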


Thanks for these. They both look 'high end' rather than small and customizable, but I'll check them out. Thanks for the tips and links



