Finding the best ticket price – Simple web scraping with Python (danielforsyth.me)
44 points by danielforsyth on June 17, 2014 | hide | past | favorite | 33 comments


A shorter, more comprehensible version:

    import requests
    from bs4 import BeautifulSoup
    from urlparse import urljoin

    URL = 'http://philadelphia.craigslist.org/search/sss?sort=date&quer...
    BASE = 'http://philadelphia.craigslist.org/cpg/'

    response = requests.get(URL)
    soup = BeautifulSoup(response.content)

    for listing in soup.find_all('p', {'class': 'row'}):
        if listing.find('span', {'class': 'price'}):
            price = int(listing.text[2:6])
            if 100 < price <= 250:
                print listing.text
                print urljoin(BASE, listing.a['href']) + '\n'
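One brittle spot in the snippet above: `int(listing.text[2:6])` only works when the price happens to land in those exact character positions. A small regex is more robust; a sketch (the sample listing text in the comment is made up for illustration):

```python
import re

def extract_price(text):
    """Pull the first dollar amount out of a listing's text.

    Returns the price as an int, or None if no price is present.
    """
    match = re.search(r'\$(\d+)', text)
    return int(match.group(1)) if match else None

# e.g. extract_price('$150 Firefly ticket - (Philadelphia)') -> 150
```

This keeps working for three-, four-, or five-digit prices and skips listings with no price at all instead of raising.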


Thanks for posting this, I am still very new to python and your website has taught me a lot. Appreciate the feedback.


thanks


Some months ago I found https://import.io/ and it just blew my mind.

I remember the pain it was to write custom scrapers every time (I used to do it with Perl, btw).

They have a custom browser with a nice interface, but the biggest feature is the so-called "Connectors": you teach the system how to query a site and parse the results, and Import.IO gives you an API endpoint for that query, now automated.

One can, say, create a "connector" which can query Airbnb and parse results, then create another "connector" which queries booking.com. Now it is possible to use the API to make a query for Boa Vista, Roraima (my city) and get the dataset.
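The workflow above reduces to hitting one endpoint per connector with the same query. A sketch of that pattern, where the connector IDs and the endpoint path are hypothetical placeholders (import.io assigns the real IDs when you build a connector):

```python
from urllib.parse import urlencode

# Hypothetical connector IDs -- import.io assigns real ones per connector.
CONNECTORS = {'airbnb': 'abc123', 'booking': 'def456'}

# Endpoint shape is an assumption for illustration, not the documented API.
BASE = 'https://api.import.io/store/connector/{id}/_query'

def connector_url(site, query):
    """Build the query URL for one connector."""
    return BASE.format(id=CONNECTORS[site]) + '?' + urlencode({'input': query})

# Query both sites for the same city, then merge the result sets client-side.
urls = [connector_url(site, 'Boa Vista, Roraima') for site in CONNECTORS]
```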

I am not affiliated with them in any way, just a very happy old-school scraper.

Nice walkthrough: http://www.youtube.com/watch?v=_16O10Wx2W4

UPDATE:

Unsurprisingly, import.io was Hacker News stuff in the past: https://news.ycombinator.com/item?id=7582858


Other browser-based screen scrapers in the space are 80legs, Kimono Labs, Mozenda and OutWit Hub; I'm sure there are more. Last time I checked, import.io was a fairly lightweight browser wrapper.

I also write web scrapers using Perl and Python, and recently I've been gravitating towards Python because the code reads better. I don't use browser-based scrapers: the sites I scrape are usually complex enough that it's easier to write my own code, browser tools lack functionality and control over the data, and there's the overhead of learning their terminology and how they work.


I can recommend scrapy[0] if you're working on a somewhat bigger problem. But even for small jobs, if you're familiar with Scrapy it's incredibly fast to write a simple scraper with your data neatly exported as JSON.

[0]: http://scrapy.org/


I don't recommend scrapy. Classic example of a framework that should have been a library. It will work up until a point and then it will railroad your app and you will have a really painful time breaking out of the 'scrapy' way of doing things. Classic 'framework' problem.

I prefer a combination of celery (distributed task management), mechanize (pretend web browser) and pyquery (jquery selectors for python).


Agreed. I used BeautifulSoup in combination with Celery.

To me scraping is such a specific thing it's best to write your own 'framework'.


I'm not sure how you would design a library for event-loop-based website navigation where the event loop is explicit. Scrapy (which is a wrapper over Twisted) is already quite close to this IMHO. You can plug anything into the same event loop if needed (think Twisted web services, etc).

You can parallelize synchronous mechanize/requests scripts via celery, but it is less efficient in terms of resource usage when the bottleneck is I/O; it also has larger fixed costs per task.

N Scrapy processes, each processing 1/N of total urls is an easy enough way to distribute load; if that is not enough then a shared queue like https://github.com/darkrho/scrapy-redis is also an option.

I think it is not the "scrapy" way of doing things that causes the problems; it is the inherent complexity of concurrency. You either give up some concurrency or build your solution around it.
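The N-process split described above can be as simple as slicing the URL list. A minimal sketch, assuming the worker index comes from the command line or an environment variable (not part of any Scrapy API, just plain list arithmetic):

```python
def partition(urls, n_workers, worker_index):
    """Give worker i every n-th URL so the load spreads evenly.

    Striding (urls[i::n]) keeps each slice balanced even when
    len(urls) is not an exact multiple of n_workers.
    """
    return urls[worker_index::n_workers]

# Worker 1 of 3 takes urls[1], urls[4], urls[7], ...
```

Each worker then feeds its slice to its own spider's start_urls.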


A Scrapy spider that does exactly the same:

    # It requires scrapy from github.
    # Save it to tickets.py and execute 
    # "scrapy runspider tickets.py" from the command line

    from urlparse import urljoin
    import scrapy
    
    class TicketSpider(scrapy.Spider):
        name = 'tickets'
        start_urls = ['http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets']
    
        def parse(self, response):
            for listing in response.css('p.row'):
            price_txt = listing.css('span.price').re(r'(\d+)')
                if not price_txt:
                    continue
                price = int(price_txt[0])
                if 100 < price <= 250:
                    url = urljoin(response.url, listing.css('a::attr(href)').extract()[0])
                    print ' '.join(listing.css('::text').extract())
                    print url
            print

There is no reason to prefer Scrapy for extracting information from a single webpage, but on the other hand it is no harder than BS/pyquery/requests.


I had a go "just for fun" using curl, grep, sed, and tr. Probably too much regex?

    #!/bin/sh
    #
    # tickets.sh - A "no BS" ticket price scraper. Output in CSV format.
    #              Uses standard issue Unix utilities only.
    #              No soup for you!
    
    
    URL="http://philadelphia.craigslist.org"
    QUERY="firefly+tickets"
    
    RESULTS=`curl -s -m 10 "$URL/search/sss?sort=date&query=$QUERY" \
            | grep '<p class=\"row' \
            | sed 's!^[ \t]*!!; \
                   s!>[ \t]*<!><!g; \
                   s![,:]! !g; \
                   s!<p class=\"row[^/]*\"\([^\"]*\)\" class=\"[^#]*\">&#x0024;\([0-9]\{1,\}\)</span>[^.]*>\([A-Z]\{1\}[a-z]\{2\} \{1,\}[0-9]\{1,2\}\)[^.]*<a h[^>]*\.html">\([^<]*\)</a>\([^.]*</p>\)!\1,$\2,\3,\4:!g; \
                   s!   *! !g; \
                   s!,  *!,!g' \
            | tr ':' '\n'`
    
    echo "$RESULTS"


Shameless Plug: I work for an NYC-based startup - SeatGeek.com - that is basically this[1]. We used to do forecasting but found that wasn't really useful[3] or worth the time it took to maintain, so we nixed it.

- [1]: As an example, here is the Firefly event the OP was scraping. : https://seatgeek.com/firefly-music-festival-tickets

- [2]: We haven't included Craigslist because the data is much less structured and inexperienced users may have a Bad Time™. YMMV

- [3]: It was also a royal pain in the ass to maintain. I know because I had to update the underlying data provided to the model, and also modify it whenever available data changed :( . Here is a blog post on why we removed it from the product in general: http://chairnerd.seatgeek.com/removing-price-forecasts


Combine this with Pushover[0] to get alerted whenever there is a new lowest price. I had to resort to scraping+pushover to snatch a garage parking spot in SF.

[0] https://pushover.net/
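For reference, sending a Pushover notification is a single authenticated POST to its messages endpoint; a minimal sketch, where the app token and user key are placeholders you'd get from your own Pushover account:

```python
import requests

PUSHOVER_URL = 'https://api.pushover.net/1/messages.json'

def pushover_payload(token, user_key, message):
    """Build the form fields Pushover's messages endpoint expects."""
    return {'token': token, 'user': user_key, 'message': message}

def notify(token, user_key, message):
    """POST a push notification; raises on a non-2xx response."""
    resp = requests.post(PUSHOVER_URL,
                         data=pushover_payload(token, user_key, message))
    resp.raise_for_status()

# notify('APP_TOKEN', 'USER_KEY', 'New lowest price: $180')
```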


Useful article. I use lxml myself, and find this a good resource: http://jakeaustwick.me/python-web-scraping-resource/


I did this recently when trying to get tickets to a sold out Cloud Nothings show. I'd scrape Craiglist for postings every 10 minutes, and then send myself a text if any of the posts were new. I ended up getting tickets the day before the show.

Since the show was at a very small venue (capacity of maybe 500), I didn't have to worry about a constant stream of false positives. I would have needed to handle these if I were searching for tickets to a sold out <popular band> show, since ticket brokers just spam Craigslist constantly with popular terms.
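The "text me only when a post is new" check described above comes down to diffing each scrape against the set of URLs already seen. A sketch (persisting `seen` between runs, e.g. to a text file or pickle, is left out here):

```python
def new_listings(current_urls, seen):
    """Return listings not seen in any earlier scrape, updating `seen`.

    `seen` is a set of URLs carried over from previous runs; anything
    not in it is new and worth an alert.
    """
    fresh = [url for url in current_urls if url not in seen]
    seen.update(fresh)
    return fresh
```

Run this every 10 minutes and only fire the text message when `fresh` is non-empty; the false-positive problem the commenter mentions is then just a matter of how noisy `current_urls` is.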


This reminds me of something I knocked up back in 2006. It's not a scraper, it's not Python, but here you are:

http://giggr.com/?q=klaxons

Searches multiple UK ticket sites and returns the artist page matching the query.

Clicking a header label (i.e. Ticketweb) switches to that provider.

Double-clicking the header re-searches based on the value of the search box.

I use it for the 9am scramble for newly released tickets.

Oh, it seems Ticketmaster has broken. Maybe I'll fix that one day... I haven't used it in a while.


If you don't mind me asking: who pays for the bandwidth cost of running "giggr"? It doesn't look like you have any ads running. Or are you monetizing it in some other way?


It's a static web page on a Linode I use for other projects and purposes.

Even if millions of people decided to suddenly use it, the cost would be almost nothing.

I might even consider putting the free CloudFlare in front of it to ensure the cost is nothing (one static HTML file cached forever).

Heh, just looked at the source code again... it's a single request web page, not even an external CSS or JavaScript file.

You can't get cheaper really.


Looks like it just uses browser-side js requests to get the search pages, so it would use minimal bandwidth.


Doesn't work for me. Which Python version is required?

    Traceback (most recent call last):
      File "./tickets.py", line 20, in <module>
        for listing in soup.findall('p', {'class': 'row'}):
    TypeError: 'NoneType' object is not callable


I am using 2.7.6


Maybe wrong BS version?

    $ sudo pip install BeautifulSoup
    Downloading/unpacking BeautifulSoup
      Downloading BeautifulSoup-3.2.1.tar.gz
      Running setup.py (path:/tmp/pip_build_root/BeautifulSoup/setup.py) egg_info for package BeautifulSoup
        
    Installing collected packages: BeautifulSoup
      Running setup.py install for BeautifulSoup
        
    Successfully installed BeautifulSoup
    Cleaning up...


Ah yes, that's the problem. I am using beautifulsoup4==4.3.2.

Try pip install beautifulsoup4


I found the problem. I think your listing ate the underscores: it should be 'soup.find_all' instead of 'soup.findall' and 'link_end' instead of 'linkend'.


Good find! Fixed it, thanks!


You should integrate this in weboob [0]

[0] http://weboob.org/


That really isn't a good name.


Seems to be their thing

QHandJoob

QFlatBoob


Wow. Spurred on by your discoveries, I found this: http://weboob.org/applications/qhavedate

"QHaveDate is a graphical application able to interact with dating websites, and help you manage your numerous conquests."

I, uh, don't really know where to start with that.


Lol, there's some ridiculous stuff on that page: "Management of the meeting places, with for each some stats for the number of contacts met, your success rate, which seduction methods have been used, if they worked or not, if you’ve slept with the contact, which sexual positions, etc."


It's funny because the previous name of qhavedate was qhavesex.

It's sad that the only thing that people can see from this project is the name.


This is not very good code. Here's a slightly better refactor - https://github.com/realpython/interview-questions/blob/maste...


There is also ifttt.com, which can poll a specific CL search and email you when something hits.



