Hi there. Hope you don't mind me posting. We're trying to build a system for analyzing and classifying web pages. We've done the classification using CRM114 (thanks for the link we were passed previously), and we're now looking to create a reliable, fast, and robust crawler. We've put together something basic with Twisted, but the question has been raised: what's already out there that has been tested, doesn't overload sites, conforms to robots.txt, and can work across multiple servers? We've looked at Pyro for the multiple-server element (which looks fine), but we're struggling a little. I thought I'd ask here if anyone has pointers to a great, compact Python crawler we could use?
Thanks in advance
Neil
Nutch (http://lucene.apache.org/nutch/) is a project to create a search engine, with a substantial crawling component. You can also find a list of crawlers here: http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawler...
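For the robots.txt part of the question, it's worth noting the standard library already handles the parsing. Here's a minimal sketch using Python's `urllib.robotparser` (the rules and user-agent name are made-up examples, not anything from a real site):

```python
import urllib.robotparser

def make_robot_checker(robots_txt_lines):
    """Build a checker from the lines of a site's robots.txt.

    In a real crawler you'd fetch http://<host>/robots.txt and feed
    its lines in here (or use RobotFileParser.set_url() + read()).
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp

# Example: a robots.txt that blocks /private/ for all user agents
rules = make_robot_checker([
    "User-agent: *",
    "Disallow: /private/",
])

print(rules.can_fetch("MyCrawler", "http://example.com/private/page"))  # False
print(rules.can_fetch("MyCrawler", "http://example.com/public/page"))   # True
```

You'd still need to add per-host politeness delays and distribution on top of this, but it covers the robots.txt conformance cheaply.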