
Like the OP, I needed more control over the crawling behaviour for a project. All the scraping code quickly became a mess though, so I wrote a library that lets you declaratively define the data you're interested in (think Django forms). It also provides decorators that allow you to specify imperative code for organizing the data cleanup before and after parsing. See how easy it is to extract data from the HN front page: https://github.com/aGHz/structominer/blob/master/examples/hn...

I'm still working on proper packaging so for the moment the only way to install Struct-o-miner is to clone it from https://github.com/aGHz/structominer.
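To give a rough idea of what "declarative fields plus cleanup decorators" can look like, here is a minimal stdlib-only sketch. It is not Struct-o-miner's actual API: the `Field`, `Document`, and `FrontPage` names, the regex-based extraction, and the sample markup are all made up for illustration.

```python
import re


class Field:
    """Hypothetical field: declares one piece of data via a regex with one capture group."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern, re.S)
        self.cleaners = []

    def clean(self, func):
        """Decorator that registers imperative cleanup to run on each extracted value."""
        self.cleaners.append(func)
        return func

    def extract(self, html):
        values = self.regex.findall(html)
        for cleaner in self.cleaners:
            values = [cleaner(v) for v in values]
        return values


class Document:
    """Hypothetical base class: finds the Fields declared on a subclass and runs them."""

    def __init__(self, html):
        fields = {}
        for klass in reversed(type(self).__mro__):
            for name, attr in vars(klass).items():
                if isinstance(attr, Field):
                    fields[name] = attr
        self.data = {name: field.extract(html) for name, field in fields.items()}


class FrontPage(Document):
    # The declarative part: say what you want, not how to walk the page.
    titles = Field(r'<a class="title"[^>]*>(.*?)</a>')
    points = Field(r'<span class="score">(\d+) points?</span>')

    # The imperative part: cleanup hooks attached to individual fields.
    @titles.clean
    def _strip(value):
        return value.strip()

    @points.clean
    def _to_int(value):
        return int(value)


# Made-up markup, not the real HN front page structure.
SAMPLE = """
<a class="title" href="/a"> First story </a><span class="score">12 points</span>
<a class="title" href="/b">Second story</a><span class="score">1 point</span>
"""

page = FrontPage(SAMPLE)
print(page.data)
```

A real library would presumably use a proper HTML parser with CSS or XPath selectors rather than regexes; the point here is only the split between the declared fields and the decorated cleanup functions.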



Hey,

I've actually built something similar to this myself, and I plan on writing an article along these lines in the future.

Yours looks pretty polished though, good job!


Yeah, I'm pretty sure anyone having to do any moderate amount of scraping eventually arrives at a similar solution.

Thanks! It's still a work in progress, so if you have anything you'd like to see in there, I'd love to hear about it (I also welcome code contributions if you're so inclined).



