Hacker News

In the projects I work on, things go down all the time, for various reasons (hardware issues, networking problems, cascading programming errors). It's the various additional measures we have put in place which prevent us from having frequent outages... Before the current system was adopted, poor stability of our platform was one of the main complaints.

I agree that for many projects it may be overkill.



Networking issues and even hardware issues are very unlikely if you can fit everything into one box, and you can get a lot in one box nowadays (TB+ RAM, 128+ core servers are now commodity). MTBF on servers is on the order of years, so hardware failure is genuinely rare until you get too many servers into one distributed system. And even then, two identical boxes (instead of binpacking into a cluster, increasing failure probability) go a very long way.
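The redundancy argument can be sketched numerically. The figures below are hypothetical (a 5% annual per-server failure chance, chosen for illustration, not a measured MTBF), and the model assumes independent failures; it shows why a cluster that needs all N nodes gets *less* reliable as N grows, while two independent redundant boxes only go down together:

```python
# Rough availability sketch (illustrative numbers, not measurements).
# Assumes each server independently fails at some point in a year with
# probability p; real MTBF figures vary widely by hardware and workload.

def p_any_failure(n: int, p: float) -> float:
    """Probability that at least one of n servers fails in a year
    (relevant when the system needs all n nodes up)."""
    return 1 - (1 - p) ** n

def p_both_fail(p: float) -> float:
    """Probability that both of two redundant, independent boxes fail."""
    return p * p

p = 0.05  # hypothetical: 5% chance a given server fails this year

print(p_any_failure(1, p))    # single box: 0.05
print(p_any_failure(20, p))   # 20-node cluster needing every node: ~0.64
print(p_both_fail(p))         # two redundant boxes: 0.0025
```

The independence assumption is generous to the cluster (correlated failures in shared networks or power make it worse) and generous to the pair too (failover itself has to work), but the shape of the comparison holds either way.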

It's a vicious circle. We build distributed multi-node systems, overlay software-configured networks, self-healing clusters, separate distributed control planes, and split everything into microservices, but all of it makes systems more fragile unless enough effort is spent supporting that infrastructure. Google may have no choice but to scale horizontally, but the overwhelming majority of companies do have the option of scaling vertically. Hell, even StackOverflow still scales vertically after all these years! I know startups with no customers who use more servers than StackOverflow does.



