Can it run inline JavaScript as the page loads, or do I have to explicitly tell it what JS to run? I want to scrape some pages that use JS packers to obfuscate their code so that it's only loaded by real browsers. If I just use curl, all I see is JS that needs to be evaluated before I can get anything useful out of it.
It actually is in the case I'm talking about. I'm talking about illegal websites where the only money generated comes from advertisements shown to human eyeballs. They go way out of their way to make sure no scrapers or robots can see the videos on the page, since bandwidth costs them money. In addition to referrer checking and CAPTCHAs, they also have inline JavaScript that evals itself to deobfuscate itself and load the video on the page, so that if someone somehow beats the first two methods and loads the page from a command-line client, they still don't get the URL to the video.
I sometimes `wget` the URLs in my spambox out of genuine curiosity as to what people are actually sending me, and there are a bunch of common patterns.
What's very surprising is the "obfuscated eval" statement -- some term which ultimately evaluates to 'window' is queried for something crazy like:
which ultimately is 'eval'. This is often combined with some sort of self-decrypting almost-binary-looking payload hidden in a div somewhere and requested by div.innerHTML. Replacing the "eval" with "console.log" can give you the decrypted payload, usually a redirect to another redirect to something which runs a Flash script, which is where my analysis stops.
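A rough sketch of the "replace eval with console.log" trick, for the simple case where the script literally starts with `eval(`. The names `captureEvalPayload`, `__capture`, and the toy `sample` string are made up for illustration, and this does not handle the obfuscated-`window` lookup described above. Note that it still executes the unpacking code, so it only belongs on a throwaway VM.

```javascript
// Sketch only: this still *executes* the unpacker, so run it in a throwaway
// VM or container, never on a machine you care about.
function captureEvalPayload(packedSource) {
  // Swap the leading `eval(` for a call to our own collector function.
  const rewritten = packedSource.replace(/^\s*eval\s*\(/, "__capture(");
  if (rewritten === packedSource) {
    throw new Error("source does not start with eval(");
  }
  let payload = null;
  const collector = (value) => { payload = String(value); };
  // `__capture` is a hypothetical parameter name bound to the collector.
  new Function("__capture", rewritten)(collector);
  return payload;
}

// Toy stand-in for a self-decoding script: it just reverses a string.
const sample =
  'eval(function(s){return s.split("").reverse().join("")}(";)1(trela"))';
console.log(captureEvalPayload(sample)); // prints: alert(1);
```

If the payload is itself another `eval(...)` wrapper, the same function can be applied again to its output.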
I am not sure why they do this. My first thought is "to prevent being automatically taken down by The Man", but The Man could afford to automatically dispose of a computer while monitoring its network traffic, rebooting from a fixed disk image like a LiveCD afterwards. So it shouldn't be too hard to automatically discover the domains, IPs, and malicious programs involved. I don't know why you'd obfuscate a redirection.
The signature for them is that they start off with:
eval(function(p,a,c,k,e,d){...
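If you want to flag these automatically, a check along these lines might be enough. It is a minimal sketch that matches only the parameter list quoted above (with optional whitespace); `PACKER_SIGNATURE` and `looksPacked` are names I made up, and other packers or variants with different parameter names would slip past it.

```javascript
// Assumed pattern: matches only the `p,a,c,k,e,d` parameter list quoted
// above, tolerating whitespace between tokens.
const PACKER_SIGNATURE =
  /eval\s*\(\s*function\s*\(\s*p\s*,\s*a\s*,\s*c\s*,\s*k\s*,\s*e\s*,\s*d\s*\)/;

// Returns true when the source text contains the packer's opening signature.
function looksPacked(source) {
  return PACKER_SIGNATURE.test(source);
}

console.log(looksPacked("eval(function(p,a,c,k,e,d){return p}('x',1,1,[],0,{}))"));
console.log(looksPacked("console.log('hi')"));
```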
I've seen them most commonly used on not-quite-so-legal streaming websites, where the biggest problem is stopping robots from scraping the site and costing them advertising revenue.