- decompress (users|posts)
- split into batches of 10,000
- run xsltproc on each batch to turn it into SQL statements
- pipe the batches of statements into SQLite in parallel, using flocks for coordination
On my M1 Max it takes about 40 minutes for the whole network. Then I compress each database with brotli, which takes about 5 hours.
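For a sense of how the batch-load step could look, here's a rough Python sketch. The stylesheet name (`posts.xslt`), the batch file names, the lock file, and the worker count are all hypothetical, and the real pipeline is presumably plain shell rather than Python:

```python
import fcntl
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs: XML batches produced by the split step, plus an
# XSLT stylesheet that emits INSERT statements. Names are illustrative.
BATCH_FILES = [f"batch_{i:05d}.xml" for i in range(4)]
DB_PATH = "posts.sqlite3"
LOCK_PATH = DB_PATH + ".lock"

def load_batch(batch_file: str) -> None:
    # xsltproc turns one XML batch into a stream of SQL statements.
    sql = subprocess.run(
        ["xsltproc", "posts.xslt", batch_file],
        check=True, capture_output=True, text=True,
    ).stdout
    # An advisory flock serializes writers, so only one batch is being
    # piped into sqlite3 at a time.
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        subprocess.run(["sqlite3", DB_PATH], input=sql, check=True, text=True)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(load_batch, BATCH_FILES))
```

The lock only guards the SQLite write, so the comparatively slow xsltproc transforms can still run in parallel.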
This site is on a Cloudflare R2 bucket because (and only because) they have free egress. While not datacenter-sized, some of these files are large. Opening up your credit card to 10 cents a gigabyte of egress would be a bad time anywhere else.
I mean, fine, write an article that overweights recent events to weave a tale of the decline of Stack Exchange... but don't leave out the moderator strike, the data dump fiasco, or the CEO who simply doesn't get it. Jon Ericson's writing on the issue is much more informative: https://jlericson.com/
I don't know why people are so often surprised by the number of employees at a company.
My company has half as many employees, and we're not remotely as relevant as SO.
It's basically a wiki with well under a terabyte of data total and a billion requests a month (modest load in the grand scheme of web apps). It runs on fewer than 10 servers (https://www.datacenterdynamics.com/en/news/stack-overflow-st...). It's kind of bonkers to have hundreds of engineers supporting a handful of servers.
- Sales people, account managers, etc. for their ads business.
- Sales people, account managers, etc. for their Teams product.
- Sales people, account managers, etc. for their Enterprise self-hosted product.
- Sales people, account managers, etc. for sponsored tags, collectives, etc.
- Support for the above (and the public Stack Exchange sites).
- Engineers for the above (and the public Stack Exchange sites).
- Community managers (who, among other things, fight abuse).
It all adds up. From what I remember, most people working for SO weren't engineers, not even years ago (many were involved with the jobs site back then). There used to be an "Our Team" page that listed everyone who worked for SO, but it seems that's gone now.
PCA/SVD aim to maximize explained variance, not to preserve distances. They tend to "preserve" large distances at the expense of smaller ones, but that's not an explicit goal, nor can you bound the distortion. The [answer here](https://stats.stackexchange.com/a/176801/60) gives a pretty good intuition for why.
You can also compare them on computational complexity, where random projection (O(numPoints * numOriginalDimensions * numProjectedDimensions)) smokes PCA and SVD, which are cubic in the number of original dimensions.
And then there's simplicity. The random projection method amounts to sampling from a normal distribution and then doing a matrix multiplication. There's a whole lot more to understand about PCA (standardizing your data, computing the covariance matrix, eigenvector decomposition). I doubt I could implement it correctly myself, and I certainly couldn't do it in high dimensions.
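To make the simplicity point concrete, here's a minimal NumPy sketch of random projection; the data matrix, dimensions, and seed are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: num_points vectors in a high-dimensional space.
num_points, original_dim, projected_dim = 1_000, 10_000, 256
X = rng.normal(size=(num_points, original_dim))

# Gaussian projection matrix, scaled by 1/sqrt(projected_dim) so that
# squared Euclidean distances are preserved in expectation.
R = rng.normal(size=(original_dim, projected_dim)) / np.sqrt(projected_dim)

# The whole method is one matrix multiply:
# O(num_points * original_dim * projected_dim).
X_proj = X @ R

# Spot-check how well a pairwise distance survives the projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_proj[0] - X_proj[1])
print(d_proj / d_orig)  # typically close to 1
```

The 1/sqrt(projected_dim) scaling is what makes the distance check meaningful: it keeps squared distances unbiased, and the Johnson-Lindenstrauss lemma bounds how much individual distances can stray.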