
Also recommend Mechanize for Ruby (uses nokogiri under the covers).

http://mechanize.rubyforge.org/

http://phantomjs.org/ or its easier-to-deal-with cousin http://casperjs.org/ for very client-heavy sites.

FWIW, you can get away with HTML-only scrapers most of the time; you just need to look harder to find all the data. I totally recommend using "View page source", since it always gives you the original HTML rather than the possibly altered DOM (after JS has run on the page) that you might see in Dev Tools/Firebug.



Mechanize is far and away the best and easiest way to scrape with Ruby, right up until anything is rendered in JavaScript, which it explicitly does not support.

I tend to use Mechanize until I can't, then switch to Watir. Over time, I've found myself just straight up picking Watir first, since it drives your browser directly and supports JavaScript rendering as a result.
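A minimal sketch of the Watir pattern (the URL is hypothetical; real use needs `gem install watir` plus a chromedriver). The extraction lives in a helper that only needs an object responding to `goto` and `links`, so the same code works against a real `Watir::Browser` or a stub:

```ruby
# Collect link hrefs from a (possibly JS-rendered) page.
# `browser` is anything Watir-like: responds to #goto and #links.
def collect_link_hrefs(browser, url)
  browser.goto(url)
  browser.links.map(&:href)
end

# Real usage (hypothetical URL), with an actual browser:
#   require 'watir'
#   browser = Watir::Browser.new :chrome, headless: true
#   puts collect_link_hrefs(browser, 'https://example.com/')
#   browser.close
```

Because Watir drives a real browser, the page's JavaScript has already run by the time `#links` reads the DOM, which is exactly what Mechanize can't give you.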


How is performance with Watir? With CasperJS a page takes me 5-10 seconds on average to process.


Not great. About the same...


I'd recommend Selenium before PhantomJS in situations where Mechanize/Nokogiri don't cut the mustard.

I've found Selenium scripts much easier to comprehend, modify, and maintain over time than PhantomJS scripts.
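For comparison, a Selenium sketch in the same style (assumes the `selenium-webdriver` gem and a matching driver binary; the URL and selector are hypothetical). Again the logic is factored into a helper that only needs the driver's `navigate`/`find_element` interface:

```ruby
# Fetch a page and return the text of its first <h1>.
# `driver` is anything Selenium-like: responds to #navigate and #find_element.
def first_heading_text(driver, url)
  driver.navigate.to(url)
  driver.find_element(css: 'h1').text
end

# Real usage (hypothetical URL), with an actual browser:
#   require 'selenium-webdriver'
#   driver = Selenium::WebDriver.for :chrome
#   begin
#     puts first_heading_text(driver, 'https://example.com/')
#   ensure
#     driver.quit  # always shut the browser down
#   end
```

Keeping the scraping logic in plain methods like this is a big part of why Selenium scripts stay readable and maintainable over time.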


Check out CasperJS; it should make life easier. PhantomJS by itself is extremely cumbersome in my experience.



