Hacker News | JasonPunyon's comments

Thanks for taking it for a spin! I'm working on why this is slow now.


You may remember a carbon copy of this event from a year ago. https://meta.stackexchange.com/questions/389922/june-2023-da...

Discussion from then https://news.ycombinator.com/item?id=36257523


If anyone wants their data back in a way they can use it, it's right here https://seqlite.puny.engineering

And I'd be remiss if I didn't point out that their trade dress is MIT licensed. https://stackoverflow.design

Have fun.


Good catch! Fix going up now.


Pretty neat article!

(Don't have much to say, just didn't want "typo" to be my only feedback)


Thanks so much, that’s really kind.


Nice post!

Yep, my process is similar. It goes...

  - decompress (users|posts)
  - split into batches of 10,000
  - xsltproc the batch into sql statements
  - pipe the batches of statements into sqlite in parallel using flocks for coordination

On my M1 Max it takes about 40 minutes for the whole network. Then I compress each database with brotli, which takes about 5 hours.
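
For anyone who wants to try the batching idea without xsltproc, here is a rough single-process sketch in Python. It is not the pipeline described above: it swaps xsltproc and flock for the standard library's `iterparse` and `sqlite3`, and the `users` table layout, the `load_users` helper, and the `Id`/`DisplayName` attributes on `<row>` elements are assumptions made for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice

BATCH_SIZE = 10_000  # batch size mentioned in the comment above


def iter_rows(xml_file):
    """Stream <row .../> elements without loading the whole file."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # free memory held by the parsed element


def batches(iterable, size=BATCH_SIZE):
    """Yield lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_users(xml_file, conn, size=BATCH_SIZE):
    """Insert rows batch by batch; the schema here is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, display_name TEXT)"
    )
    total = 0
    for chunk in batches(iter_rows(xml_file), size):
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO users (id, display_name) VALUES (?, ?)",
                [(int(r["Id"]), r.get("DisplayName")) for r in chunk],
            )
        total += len(chunk)
    return total
```

Wrapping each batch in its own transaction is what keeps the inserts fast; running several loaders in parallel against one database would still need some form of locking, which is the job flock does in the process described above.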


This site is on a Cloudflare R2 bucket because (and only because) they have free egress. While not datacenter-sized, some of these files are large. Just opening up your credit card to 10 cents a gigabyte will be a bad time anywhere else.


I mean fine, write an article that overweights recent events to weave a tale of the decline of Stack Exchange...but don't leave out the moderator strike, the data dump fiasco, or the CEO that simply doesn't get it. Jon Ericson's writing on the issue is much more informative. https://jlericson.com/



Stack Overflow has over 500 employees?!


I don't know why people are so often surprised about the number of employees in a company. My company has half the number of employees, and we're not remotely as relevant as SO.


Why is that surprising?


It's basically a wiki with well under a terabyte of data total and a billion requests a month (modest load in the grand scheme of web apps). It runs on less than 10 servers (https://www.datacenterdynamics.com/en/news/stack-overflow-st...). It's kind of bonkers to have hundreds of engineers supporting a handful of servers.


- Sales people, account managers, etc. for their ads business

- Sales people, account managers, etc. for their Teams product.

- Sales people, account managers, etc. for their Enterprise self-hosted product.

- Sales people, account managers, etc. for sponsored tags, collectives, etc.

- Support for the above (and the public Stack Exchange sites).

- Engineers for the above (and the public Stack Exchange sites).

- Community managers (who, among other things, fight abuse).

It all adds up. From what I remember most people working for SO weren't engineers, not even years ago (many were involved with the jobs site back then). There used to be an "Our Team" page which listed everyone who worked for SO, but it seems that's gone now.


500 employees, not 500 engineers.


PCA/SVD aim for maximizing explained variance, not preserving distance. They tend to "preserve" large distances at the expense of smaller ones, but that's not an explicit goal, nor can you bound the distortion. The [answer here](https://stats.stackexchange.com/a/176801/60) gives a pretty good intuition about why.

You can also compare them on computational complexity, where random projection (O(numPoints * numOriginalDimensions * numProjectedDimensions)) smokes PCA or SVD, which are cubic in the number of original dimensions.

And then there's simplicity. The random projection method turns on sampling from a normal distribution and then doing a matrix multiplication. There's a whole lot more about PCA to understand (standardizing your data, calculating the covariance matrix, eigenvector decomposition). I doubt I could implement it correctly myself, and I surely couldn't do it in high dimension.
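
To make the "sample from a normal distribution, then multiply" point concrete, here is a minimal pure-Python sketch of Gaussian random projection. The function names are made up for illustration, and the 1/sqrt(numProjectedDimensions) scaling is an assumption (the usual Johnson-Lindenstrauss convention), not something stated above.

```python
import math
import random


def random_projection_matrix(orig_dims, proj_dims, rng):
    """orig_dims x proj_dims matrix of N(0, 1) samples, scaled by 1/sqrt(proj_dims)."""
    scale = 1.0 / math.sqrt(proj_dims)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(proj_dims)]
        for _ in range(orig_dims)
    ]


def project(points, matrix):
    """Plain matrix multiplication: O(numPoints * orig_dims * proj_dims)."""
    proj_dims = len(matrix[0])
    return [
        [sum(x * row[j] for x, row in zip(point, matrix)) for j in range(proj_dims)]
        for point in points
    ]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    rng = random.Random(0)
    points = [[rng.random() for _ in range(500)] for _ in range(3)]
    matrix = random_projection_matrix(500, 100, rng)
    projected = project(points, matrix)
    # Ratio of projected to original distance; it should land near 1.
    print(distance(projected[0], projected[1]) / distance(points[0], points[1]))
```

That is the whole method: no centering, no covariance matrix, no eigendecomposition.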


Nothing is dying. Our data dumps are live at https://archive.org/details/stackexchange as they've always been.


I wish your older snapshots weren't deleted. Any chance to get them back?


I had no idea this existed. This is awesome. Thanks for the link.

