Hacker News

Can it run inline JavaScript as the page is loaded, or do I have to explicitly tell it what JS to run? I want to scrape some pages that use JS packers to obfuscate their code so that it's only loaded by real browsers, but if I just use curl all I see is JS that needs to be evaluated before I can get anything useful out of it.


"JS packers to obfuscate their code so that it's only loaded by real browsers"

this is probably not what's happening. more likely, it's obfuscated for other reasons. curl doesn't parse or execute javascript.


It actually is in the case I'm talking about. I'm talking about illegal websites where the only money generated is by advertisements on human eyeballs. They go way out of their way to make sure no scrapers/robots can see the videos on the page since it costs them money for bandwidth. In addition to referrer checking and captcha, they also have inline javascript that evals itself to un-obfuscate itself and load the video on the page so that if someone somehow beats the first two methods and loads it by a command line interface, they still don't get the URL to the video.


pretty cool, wasn't aware of this at all. thanks for the explanation. but even if it were unpacked, curl wouldn't execute it.


On the other hand, I've learned something new about the paranoia of people who don't actually work in web development.

Explains those NoScript people quite well.


What do you mean? Who in your scenario here doesn't actually work in web development?


Can you provide some examples of such sites? I would love to learn more about this technique.


I sometimes `wget` the URLs in my spambox out of genuine curiosity as to what people are actually sending me, and there are a bunch of common patterns.

What's very surprising is the "obfuscated eval" statement -- some term which ultimately evaluates to 'window' is queried for something crazy like:

    w[(typeof 3)[4] + "va" + (document.body + "")[17]]

which ultimately is 'eval'. This is often combined with some sort of self-decrypting almost-binary-looking payload hidden in a div somewhere and requested by div.innerHTML. Replacing the "eval" with "console.log" can give you the decrypted payload, usually a redirect to another redirect to something which runs a Flash script, which is where my analysis stops.
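
For anyone who wants to see the trick without opening a browser, here is a minimal sketch that runs under Node. It assumes document.body stringifies to "[object HTMLBodyElement]", as it does in browsers. `fakeDocument`, the stand-in `w` object, and the payload string are all made up for illustration; the point is only that the expression spells "eval" and that swapping in a logger captures the source instead of running it.

```javascript
// Stand-in for the browser's document (assumption: in a real browser,
// document.body + "" gives "[object HTMLBodyElement]").
const fakeDocument = { body: { toString: () => "[object HTMLBodyElement]" } };

// Same construction as the snippet above, with fakeDocument replacing document:
// (typeof 3) is "number", so index 4 is "e"; index 17 of the body string is "l".
const hiddenName = (typeof 3)[4] + "va" + (fakeDocument.body + "")[17];

// Hypothetical "window" whose eval records its argument instead of executing it.
const captured = [];
const w = { eval: (src) => { captured.push(src); } };

// Hypothetical stand-in for whatever the page would have decrypted.
const payload = 'location.href = "http://example.invalid/next"';
w[hiddenName](payload);

console.log(hiddenName);   // the property name the page was hiding
console.log(captured[0]);  // the source that would have been evaluated
```

Nothing here is evaluated; the payload is only stored and printed, which is the same idea as replacing "eval" with "console.log" by hand.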

I am not sure why they do this. My first thought is "to prevent being automatically taken down by The Man", but The Man could afford to automatically dispose of a computer while monitoring its network traffic, rebooting from a fixed disk image like a LiveCD afterwards. So it shouldn't be too hard to automatically discover the domains, IPs, and malicious programs involved. I don't know why you'd obfuscate a redirection.


The signature for them is that they start off with:

    eval(function(p,a,c,k,e,d){...

I've seen them most commonly used on not-quite-so-legal streaming websites, where their biggest problem is robots scraping their site and costing them advertising revenue.
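
If you just want a scraper to notice when it has hit one of these, a rough check for that opening is enough. This is only a sketch: `looksPacked` and `PACKER_SIGNATURE` are names I made up, and allowing whitespace between tokens is my assumption about how much the output varies. It only detects the wrapper; it does not unpack or run anything.

```javascript
// Matches the opening of the packed wrapper quoted above. Optional whitespace
// between tokens is an assumption, in case a site reformats the output.
const PACKER_SIGNATURE =
  /eval\s*\(\s*function\s*\(\s*p\s*,\s*a\s*,\s*c\s*,\s*k\s*,\s*e\s*,\s*d\s*\)/;

// Returns true when the script source appears to start a packed payload.
function looksPacked(source) {
  return PACKER_SIGNATURE.test(source);
}

// Harmless made-up inputs: one shaped like packer output, one plain script.
const packedLike = "eval(function(p,a,c,k,e,d){return p}('x',1,1,''.split('|'),0,{}))";
const plain = "console.log('eval is only mentioned here')";

console.log(looksPacked(packedLike));
console.log(looksPacked(plain));
```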


okay then, that's the usual daftlogic[1] output.

I thought you were talking about something new.

[1] http://www.daftlogic.com/projects-online-javascript-obfuscat...



