Six Months of HackerNews Front Page Data (mattmazur.com)
80 points by matt1 on March 7, 2010 | 16 comments


I hacked together a quick script to generate new HN story titles from your list of existing ones. It is a pretty shoddy job -- no syntax-level modeling -- but some of the ones it generated are still pretty amusing:

mark cuban: how to say facebook

how ravelry scales to remove ipad stories from wind

google patents its way to copenhagen

bert and rails ecosystem white paper

how phusion built a blog posting

2010 conference makes the last days of android phones

is like sex. it's better when harvard teaches networking

thunderbird and one hell of the illusion of music

customer development and lying with nginx

toddlers develop individualized rules for scalewell startup fund

bill gates sums up massive data failure leads to control robots

the fuel for running our financial system

i have become a programming language

the future of instant approval

scheme that 'cancer-proofs' rodent's cells

the design and getting your business

h.264 to reach 1 billion rows into the expression problem

the bible that runs on your vc "closing" fees

ask pg: quick tips on different sql implementations

the insanely great in the free version of iphone

scalable apps on vetting opportunities

mona lisa's smile a frozen sculpture of programming

coelacanth: lessons from moleskine to rule your code

results with people: do what would never launch


I put the python scripts here if anyone wants to play with them:

http://people.csail.mit.edu/eob/files/hn/

The code wasn't written to be anything more than a quick toy, so don't zing me for its poor quality :)
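For anyone who wants to try something similar without downloading the scripts: one common way to get titles like the ones above is a word-level Markov chain. This is only a sketch of that general technique, not necessarily what the linked scripts do. The function names and the sample titles are made up for illustration.

```python
import random
from collections import defaultdict

# Sentinel tokens marking the start and end of a title (names are arbitrary).
START, END = "<s>", "</s>"

def build_chain(titles):
    """Map each word to the list of words observed immediately after it."""
    chain = defaultdict(list)
    for title in titles:
        words = [START] + title.lower().split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate_title(chain, rng=random, max_words=15):
    """Random-walk the chain from START until END or max_words is reached.

    Assumes the chain was built from at least one title.
    """
    word, output = START, []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

if __name__ == "__main__":
    # Made-up sample titles; swap in the real list from the dataset.
    sample_titles = [
        "how to scale a blog",
        "how to build a startup",
        "the future of a startup",
    ]
    print(generate_title(build_chain(sample_titles)))
```

Because each step looks only at the previous word, the output is locally plausible but has no overall grammar, which matches the "no syntax-level modeling" caveat.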


Based on that data, it looks like the best time to submit a story is between 12 and 16 UTC.

Edit: This is from thirty seconds of tossing it through awk. The distribution is pretty even, so the difference may be insignificant. I only counted articles that reached 1st place; you should parse it yourself rather than take my word for it, of course. And a graph would be nice.
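A rough Python version of that kind of count, for anyone who would rather not use awk. The row layout here (story id, submission time as a Unix timestamp, observed rank) is an assumption, not the dataset's documented format, so adjust it to whatever the export actually contains.

```python
from collections import Counter
from datetime import datetime, timezone

def submission_hours_of_top_stories(rows):
    """Count stories that ever reached rank 1, bucketed by UTC submission hour.

    `rows` yields (story_id, submitted_unix_ts, rank) tuples. That layout is
    hypothetical; the real dataset may use different columns.
    """
    submitted = {}
    for story_id, submitted_at, rank in rows:
        if int(rank) == 1:
            # Keyed by story id so each story is counted only once.
            submitted[story_id] = int(submitted_at)
    return Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour
        for ts in submitted.values()
    )
```

Calling `.most_common()` on the result lists the hours from busiest to quietest, which is enough to eyeball whether any window really stands out.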


I happened to be playing with R today, so I took a stab at making a chart:

http://tinyurl.com/hnrank


Rule of thumb for when to submit seems to be: whenever PST people are awake, and not eating meals.


Just today I discussed the prospects of analyzing HN front page posts with a friend.

I promise to come up with interesting results. Thank you.


and my friend didn't even wait up for me.

http://news.ycombinator.com/item?id=1175223


Thanks for the dataset. FWIW: %s/"//g takes it from 170M to 100M

Of course, compression negates the saving, but it still seemed odd.
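If you want to check this on your own copy, here is a small sketch that compares raw and gzip-compressed sizes with and without the quotes. The function name and the sample rows are made up. Note that blindly stripping every quote is only safe when no field contains the delimiter.

```python
import gzip

def size_report(data: bytes) -> dict:
    """Return byte sizes of `data` with and without double quotes, raw and gzipped."""
    stripped = data.replace(b'"', b"")
    return {
        "raw": len(data),
        "raw_stripped": len(stripped),
        "gzip": len(gzip.compress(data)),
        "gzip_stripped": len(gzip.compress(stripped)),
    }

if __name__ == "__main__":
    # Made-up rows standing in for the real export.
    sample = b'"1","example title","42"\n' * 1000
    print(size_report(sample))
```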


Good point. I just went with the default export settings--if I do it again in the future, I'll definitely do it this way.


The current data format is hard to read using Python's csv module. This code will convert it to Python-compatible CSV: http://gist.github.com/325195

It's a bit slow (~15 seconds), but it's a one-time job.


Thanks a lot Matt. That should be one heck of a dataset to play with.


Does anyone have a full dump of HN posts and comments?


Some were posted roughly a year ago, but they're no longer up. I might have them somewhere, give me some time to dig.


Thank you so much. This is all I needed to make my HN points predictor for newly submitted stories.

Now if only I can find a nice chunk of time on a lazy weekend...


Hrmm... so now we will see if your tool explodes when a URL to the site reaches the front page.

Kinda like when you google google.


Doing statistical coolness now. Will post results later. Stay tuned...



