I remember how painful it was to write custom scrapers every time (I used to do it with Perl, btw).
They have a custom browser with a nice interface, but the biggest thing is the so-called "Connectors": you instruct the system on how to query and parse results, and Import.IO gives you an API endpoint for that query, now automated.
One can, say, create a "connector" which can query Airbnb and parse results, then create another "connector" which queries booking.com. Now it is possible to use the API to make a query for Boa Vista, Roraima (my city) and get the dataset.
I am not affiliated with them in any way, just a very happy old-school scraper.
Other browser-based screen scrapers in the space are 80legs, Kimono Labs, Mozenda and OutWit Hub; I'm sure there are more. Last time I checked, import.io was a fairly lightweight browser wrapper.
I also write web scrapers in Perl and Python, and recently I have been gravitating towards Python because the code is more readable. I don't use browser-based scrapers: the sites I scrape are usually more complex, so it is just easier to write my own code; they also lack functionality and control over the data, and there is the overhead of learning their terminology and how they work.
I can recommend scrapy[0] if you work on a somewhat bigger problem. And once you are familiar with scrapy, it's incredibly fast to write a simple scraper with your data neatly exported to .json.
I don't recommend scrapy. Classic example of a framework that should have been a library. It will work up until a point and then it will railroad your app and you will have a really painful time breaking out of the 'scrapy' way of doing things. Classic 'framework' problem.
I prefer a combination of celery (distributed task management), mechanize (pretend web browser) and pyquery (jquery selectors for python).
I'm not sure how you would design a library for event-loop based website navigation where the event loop is explicit. Scrapy (which is a wrapper over Twisted) is already quite close to this IMHO. You can plug anything into the same event loop if needed (think Twisted web services, etc).
You can parallelize synchronous mechanize/requests scripts via celery, but it is less efficient in terms of resource usage when the bottleneck is I/O; it also has larger fixed costs per task.
N Scrapy processes, each processing 1/N of the total urls, is an easy enough way to distribute load; if that is not enough, then a shared queue like https://github.com/darkrho/scrapy-redis is also an option.
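The 1/N split can be done with a round-robin by index, which keeps the shards disjoint while covering every url; a minimal sketch (the function name is mine, not Scrapy's):

```python
def shard(urls, n_procs, proc_index):
    """Give process number proc_index every n_procs-th url, so the
    shards are disjoint and together cover the whole list."""
    return [url for i, url in enumerate(urls) if i % n_procs == proc_index]
```

Each process then runs its own Scrapy crawl over its shard; scrapy-redis replaces this static split with a shared queue, so slow shards don't leave workers idle.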
I think it is not the "scrapy" way of doing things that causes the problems; it is the inherent complexity of concurrency. You either give up some concurrency or build your solution around it.
# It requires scrapy from github.
# Save it to tickets.py and execute
# "scrapy runspider tickets.py" from the command line
from urlparse import urljoin

import scrapy


class TicketSpider(scrapy.Spider):
    name = 'tickets'
    start_urls = ['http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets']

    def parse(self, response):
        for listing in response.css('p.row'):
            price_txt = listing.css('span.price').re(r'(\d+)')
            if not price_txt:
                continue
            price = int(price_txt[0])
            if 100 < price <= 250:
                url = urljoin(response.url, listing.css('a::attr(href)').extract()[0])
                print ' '.join(listing.css('::text').extract())
                print url
                print
There is no reason to prefer Scrapy for extracting information from a single webpage, but on the other hand it is not any harder than BS+pyquery+requests.
Shameless Plug: I work for an NYC-based startup - SeatGeek.com - that is basically this[1]. We used to do forecasting but found that wasn't really useful[2] or worth the time it took to maintain, so we nixed it.
- [1]: We haven't included Craigslist because the data is much less structured and inexperienced users may have a Bad Time™. YMMV
- [2]: It was also a royal pain in the ass to maintain. I know because I had to update the underlying data provided to the model, and also modify it whenever the available data changed :( . Here is a blog post on why we removed it from the product in general: http://chairnerd.seatgeek.com/removing-price-forecasts
Combine this with Pushover[0] to get alerted whenever there is a new lowest price. I had to resort to scraping+pushover to snatch a garage parking spot in SF.
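The "alert on a new lowest price" part reduces to remembering the cheapest price seen so far across scrape runs; here is a sketch under names of my own choosing (only the endpoint URL is Pushover's real messages API; the token and user keys come from your Pushover account):

```python
try:                                      # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen
except ImportError:                       # Python 2, as in the snippets above
    from urllib import urlencode
    from urllib2 import Request, urlopen

PUSHOVER_API = 'https://api.pushover.net/1/messages.json'


def send_pushover(token, user, message):
    # Pushover takes a form-encoded POST with your app token and user key.
    data = urlencode({'token': token, 'user': user, 'message': message})
    return urlopen(Request(PUSHOVER_API, data.encode('ascii')))


class LowestPriceWatcher(object):
    """Tracks the cheapest price seen so far."""

    def __init__(self):
        self.lowest = None

    def check(self, price):
        """Return True (i.e. alert) only when price sets a new low."""
        if self.lowest is None or price < self.lowest:
            self.lowest = price
            return True
        return False
```

After each scrape you would do something like `if watcher.check(price): send_pushover(token, user, 'New low: $%d' % price)`, so you only get pinged when the floor actually drops.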
I did this recently when trying to get tickets to a sold-out Cloud Nothings show. I'd scrape Craigslist for postings every 10 minutes, then send myself a text if any of the posts were new. I ended up getting tickets the day before the show.
Since the show was at a very small venue (capacity of maybe 500), I didn't have to worry about a constant stream of false positives. I would have needed to handle these if I were searching for tickets to a sold out <popular band> show, since ticket brokers just spam Craigslist constantly with popular terms.
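The "text me only when a post is new" step is just a set difference against the URLs already seen; a minimal sketch (persisting the seen set between runs, and the SMS itself, are left out):

```python
def fresh_posts(scraped_urls, seen):
    """Return posts not seen in earlier runs, and mark them as seen."""
    fresh = [url for url in scraped_urls if url not in seen]
    seen.update(fresh)
    return fresh
```

For the popular-band case you would additionally need to filter out the broker spam before alerting, since those posts are "new" every time they are reposted under a fresh URL.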
If you don't mind my asking: who pays for the bandwidth cost of running "giggr"? It doesn't look like you have any ads running. Or are you monetizing it in some other way?
Doesn't work for me. Which Python version is required?
Traceback (most recent call last):
  File "./tickets.py", line 20, in <module>
    for listing in soup.findall('p', {'class': 'row'}):
TypeError: 'NoneType' object is not callable
Lol, there's some ridiculous stuff on that page: "Management of the meeting places, with for each some stats for the number of contacts met, your success rate, which seduction methods have been used, if they worked or not, if you’ve slept with the contact, which sexual positions, etc."
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

URL = 'http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets'
BASE = 'http://philadelphia.craigslist.org/cpg/'

response = requests.get(URL)
soup = BeautifulSoup(response.content)

for listing in soup.find_all('p', {'class': 'row'}):
    # Loop body reconstructed to match the Scrapy spider above
    # (the original snippet was cut off here).
    link = listing.find('a')
    if link is not None:
        print ' '.join(listing.stripped_strings)
        print urljoin(BASE, link['href'])