
This is a great starting point. Can anyone recommend any resources for how to best set up a remote scraping box on AWS or another similar provider? Pitfalls, best tools to help manage/automate scripts, etc. I've found a few "getting started" tutorials like this one, but I haven't been able to find anything good that discusses scraping beyond running basic scripts on your local machine.


http://scrapy.org/ is a more systematic approach than these ad-hoc solutions, and http://scrapyd.readthedocs.org/en/latest/ is the daemon into which you can deploy your spiders if you want a little more structure.
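To give a sense of how little the scrapyd side takes: a sketch of a `scrapy.cfg` for a hypothetical project named `myproject` (the project name, settings module, and localhost URL are all placeholders for your own setup).

```
# Hypothetical scrapy.cfg: the [deploy] section tells
# scrapyd-deploy which scrapyd instance to push the
# packaged spiders to.
[settings]
default = myproject.settings

[deploy]
url = http://localhost:6800/
project = myproject
```

With that in place you deploy with `scrapyd-deploy` and kick off runs through scrapyd's HTTP API, e.g. `curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider`.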

The plugin architecture alone makes Scrapy a hands-down winner over whatever you might dream up on your own. As with any good plugin architecture, it ships with several optional toys, you can always add your own, and there is a pretty good community around it (e.g. https://github.com/darkrho/scrapy-redis).

http://crawlera.com/ will enter the discussion unless you have a low-volume crawler, and http://scrapinghub.com/ are the folks behind Crawlera who (AFAIK) sponsor (or actually do) the development of Scrapy.
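Crawlera plugs into Scrapy through a downloader middleware, so switching a crawler onto it is mostly a settings change. A hypothetical `settings.py` fragment, assuming the `scrapy-crawlera` package is installed; the API key is a placeholder.

```python
# Hypothetical settings.py fragment enabling Crawlera via the
# scrapy-crawlera downloader middleware (pip install scrapy-crawlera).
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-api-key>"  # placeholder, not a real key
```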



