- decompress (users|posts)
- split into batches of 10,000
- run xsltproc on each batch to turn it into SQL statements
- pipe the batches of statements into SQLite in parallel, using flocks for coordination
On my M1 Max it takes about 40 minutes for the whole network. Then I compress each database with brotli, which takes about 5 hours.
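For a sense of how the batch-load step could look, here's a rough Python sketch. The stylesheet name (`posts.xslt`), the batch file names, the lock file, and the worker count are all hypothetical, and the real pipeline is presumably plain shell rather than Python:

```python
import fcntl
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs: XML batches produced by the split step, plus an
# XSLT stylesheet that emits INSERT statements. Names are illustrative.
BATCH_FILES = [f"batch_{i:05d}.xml" for i in range(4)]
DB_PATH = "posts.sqlite3"
LOCK_PATH = DB_PATH + ".lock"

def load_batch(batch_file: str) -> None:
    # xsltproc turns one XML batch into a stream of SQL statements.
    sql = subprocess.run(
        ["xsltproc", "posts.xslt", batch_file],
        check=True, capture_output=True, text=True,
    ).stdout
    # An advisory flock serializes writers, so only one batch is being
    # piped into sqlite3 at a time.
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        subprocess.run(["sqlite3", DB_PATH], input=sql, check=True, text=True)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(load_batch, BATCH_FILES))
```

The lock only guards the SQLite write, so the comparatively slow xsltproc transforms can still run in parallel.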
This site is on a Cloudflare R2 bucket because (and only because) they have free egress. While not datacenter-sized, some of these files are large. Opening up your credit card to 10 cents a gigabyte of egress would be a bad time anywhere else.
I mean, fine, write an article that overweights recent events to weave a tale of the decline of Stack Exchange... but don't leave out the moderator strike, the data dump fiasco, or the CEO who simply doesn't get it. Jon Ericson's writing on the issue is much more informative: https://jlericson.com/
I don't know why people are so often surprised by the number of employees at a company.
My company has half as many employees, and we're not remotely as relevant as SO.
It's basically a wiki with well under a terabyte of data total and a billion requests a month (modest load in the grand scheme of web apps). It runs on fewer than 10 servers (https://www.datacenterdynamics.com/en/news/stack-overflow-st...). It's kind of bonkers to have hundreds of engineers supporting a handful of servers.
- Sales people, account managers, etc. for their ads business.
- Sales people, account managers, etc. for their Teams product.
- Sales people, account managers, etc. for their Enterprise self-hosted product.
- Sales people, account managers, etc. for sponsored tags, collectives, etc.
- Support for the above (and the public Stack Exchange sites).
- Engineers for the above (and the public Stack Exchange sites).
- Community managers (who, among other things, fight abuse).
It all adds up. From what I remember, most people working for SO weren't engineers, not even years ago (many were involved with the jobs site back then). There used to be an "Our Team" page that listed everyone who worked for SO, but it seems that's gone now.
PCA/SVD aim to maximize explained variance, not to preserve distances. They tend to "preserve" large distances at the expense of smaller ones, but that's not an explicit goal, nor can you bound the distortion. The [answer here](https://stats.stackexchange.com/a/176801/60) gives a pretty good intuition for why.
You can also compare them on computational complexity, where random projection (O(numPoints * numOriginalDimensions * numProjectedDimensions)) smokes PCA and SVD, which are cubic in the number of original dimensions.
And then there's simplicity. The random projection method amounts to sampling from a normal distribution and then doing a matrix multiplication. There's a whole lot more to understand about PCA (standardizing your data, computing the covariance matrix, eigenvector decomposition). I doubt I could implement it correctly myself, and I certainly couldn't do it in high dimensions.
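To make the simplicity point concrete, here's a minimal NumPy sketch of random projection; the data matrix, dimensions, and seed are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: num_points vectors in a high-dimensional space.
num_points, original_dim, projected_dim = 1_000, 10_000, 256
X = rng.normal(size=(num_points, original_dim))

# Gaussian projection matrix, scaled by 1/sqrt(projected_dim) so that
# squared Euclidean distances are preserved in expectation.
R = rng.normal(size=(original_dim, projected_dim)) / np.sqrt(projected_dim)

# The whole method is one matrix multiply:
# O(num_points * original_dim * projected_dim).
X_proj = X @ R

# Spot-check how well a pairwise distance survives the projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_proj[0] - X_proj[1])
print(d_proj / d_orig)  # typically close to 1
```

The 1/sqrt(projected_dim) scaling is what makes the distance check meaningful: it keeps squared distances unbiased, and the Johnson-Lindenstrauss lemma bounds how much individual distances can stray.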