
I'm not going to put this down, because it sounds like they're quite happy with the results. But they haven't written about a few things that I find to be important details:

First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc. Now, they have two ways of doing everything (one for their k8s stuff that remains in the cloud, one for what they have migrated). And the stuff in the cloud is the mature, been-using-it-for-years stuff, and the new stuff seemingly hasn't been battle-tested beyond a couple small services.

Now that's fine, and migrating a small service and hanging the Mission Accomplished banner is a win. But it's not a win that says "we're ready to move our big, money-making services off of k8s". My suspicion is that handling the most intensive services means replacing all of the moving parts of k8s with lots of k8s-shaped things, things which are probably less easily glued together than the k8s originals are.

Another thing that strikes me is that if you look at their cloud spend [0], three of their four top services are _managed_ services. You simply will not take RDS and swap it out 1:1 for Percona MySQL; it is not the same for clusters of substance. You will not simply throw Elasticsearch at some Linux boxes and get the same result as managed OpenSearch. You will not simply install Redis/memcached on some servers and get ElastiCache. The managed services have substantial margin, but unless you have Elasticsearch experts, memcached/Redis experts, and DBAs on hand to make the thing do the stuff, you're also likely going to end up spending more than you expect to run those things on hardware you control. I don't think about SSDs or NVMe or how I'll provision new servers for a sudden traffic spike when I set up an Aurora cluster, but you can't not think about it when you're running it yourself.

Said another way, I'm curious as to how they will reduce costs AND still have equally performant/maintainable/reliable services while replacing some unit of infrastructure N with N+M (where M is the currently-managed bits). And also while not being able to just magically make more computers (or computers of a different shape) appear in their datacenter at the click of a button.

I'm also curious how they'll handle scaling. Is scaling your k8s clusters up and down in the cloud really more expensive than keeping enough machines to handle unexpected load on standby? I guess their load must be pretty consistent.
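
A rough back-of-envelope on that standby question, with entirely made-up numbers (not their actual figures), just to show how the comparison hinges on how spiky the load is:

    # Sketch only: comparing "own enough hardware for peak" vs "autoscale
    # in the cloud" for a spiky daily load. Every figure here is a made-up
    # assumption for illustration, not anyone's real pricing or fleet.

    HOURS_PER_MONTH = 730

    baseline_servers = 20            # hypothetical steady-state load
    peak_servers = 60                # hypothetical daily peak
    peak_hours_per_day = 4

    cloud_price_per_server_hour = 0.40       # assumed on-demand rate
    onprem_monthly_cost_per_server = 150.0   # assumed amortized capex + power/cooling

    # Cloud: pay only for the hours each server actually runs.
    cloud_hours = (baseline_servers * HOURS_PER_MONTH
                   + (peak_servers - baseline_servers) * peak_hours_per_day * 30)
    cloud_monthly = cloud_hours * cloud_price_per_server_hour

    # On-prem: you own enough for the peak (plus headroom), all month long.
    onprem_monthly = peak_servers * 1.2 * onprem_monthly_cost_per_server

    print(f"cloud:   ${cloud_monthly:,.0f}/month")    # ~ $7,760 with these numbers
    print(f"on-prem: ${onprem_monthly:,.0f}/month")   # ~ $10,800 with these numbers

Flatten the load (peak close to baseline) and the on-prem number shrinks much faster than the cloud one does, and the comparison flips, which is presumably why a consistent load makes their math work.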

[0] https://dev.37signals.com/our-cloud-spend-in-2022/



> First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc.

You can also hire people with direct relevant experience with these tools. You have to ramp up new developers on the bespoke in-house tooling instead.


I think the entire scaling thing is a bit like "manual memory management vs GC", both have advantages and disadvantages.


Yes and no. Different types of memory management essentially accomplish the same thing. The way you build for them and their performance characteristics vary. In that way, scaling is the same.

But scaling is different in that your physical ability to scale up on-prem is bounded by physically procuring/installing/running servers, whereas in the cloud that's already been done by someone else weeks or months ago. When you shut off on-prem hardware, you don't get a refund on the capex cost (you're only saving on power/cooling, maybe some wear and tear).

It's not just that you need to plan differently, it's that you need to design your systems to be less elastic. You have fixed finite resources that you cannot exceed, which means even if you have money to throw at a problem, it doesn't matter: you cannot buy your way out of a scaling problem in the short-medium term. If you run out of disk space, you're out of disk space. If you run out of servers with enough RAM for caching, you're evicting data from your cache. The systems you build need to work predictably weeks or months out, and that is a fundamentally different way of building large systems.
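
To make that concrete, here is a minimal sketch (purely hypothetical numbers) of the kind of forward-looking check fixed capacity forces on you, the sort of thing the cloud's click-a-button elasticity lets you skip:

    # Sketch only, hypothetical numbers: with fixed hardware, "do we have
    # enough disk?" becomes a forecasting question you have to answer
    # further out than your procurement lead time.

    total_disk_tb = 100.0        # capacity you physically own
    used_disk_tb = 72.0          # current usage
    growth_tb_per_week = 1.8     # observed growth trend
    procurement_lead_weeks = 8   # order -> delivered -> racked -> in service

    headroom_tb = total_disk_tb - used_disk_tb
    weeks_until_full = headroom_tb / growth_tb_per_week   # ~15.6 weeks here

    if weeks_until_full <= procurement_lead_weeks:
        print(f"order now: only {weeks_until_full:.1f} weeks of headroom "
              f"vs {procurement_lead_weeks} weeks of lead time")
    else:
        print(f"{weeks_until_full:.1f} weeks of headroom; revisit within "
              f"{weeks_until_full - procurement_lead_weeks:.1f} weeks")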


This is it, and what so many anti-cloud people are missing. For startups, how can you possibly take a gamble on trying to predict what your traffic is going to be and pay upfront for dedicated servers? It puts you in a lose-lose situation: if your product is not the right fit, you've got a dedicated server you are not using. If your product is a success, well, now you need to go and order another server and hope you can get it spun up in time before everything falls over. I worked at a startup where we saw a 1000x increase in load in a day due to a customer's app going viral. On-prem would have killed us; cloud saved us.

And you are bang on about managed services. RDS is expensive, no doubt, but having your 4-person dev team burn through your seed round messing around with database backups and failover is a far higher cost.

Of course some companies grow out of the cloud: they have full-time ops engineers and can predict traffic ahead of time, so for sure, go back to on-prem. But for people to hold up articles like this and say "I always said cloud was pointless!" is just absurd.


OK, if you don't want to get good at planning as a company, that's fine. It's OK, just please don't pretend that it's impossible.

I worked at a startup that did the crazy scaling with physical servers just fine. No problem. The marketing department knew ahead of time when something was likely to go viral, IT/Dev knew how much capacity was needed per user, and procurement knew the lead time on hardware and could keep the vendors in the loop so that hardware would be ready on short notice.

With good internal communication it really is possible to be good at capacity management and get hardware on short notice if required.

Normally we would have servers racked and ready about 2 weeks after ordering, but it could be done in under half a day if required.
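
For what it's worth, the planning loop described above boils down to very simple arithmetic once the inputs exist; the numbers below are hypothetical stand-ins for the kind of figures marketing, IT/Dev and procurement were each supplying:

    import math

    # Hypothetical inputs, one per team, purely for illustration.
    forecast_peak_users = 500_000        # marketing: expected reach of the campaign
    requests_per_user_per_sec = 0.05     # IT/Dev: observed per-user load
    requests_per_server_per_sec = 400    # IT/Dev: measured capacity of one server
    safety_factor = 1.5                  # headroom for estimation error

    current_servers = 40
    lead_time_days = 14                  # procurement: normal racking turnaround

    needed = math.ceil(forecast_peak_users * requests_per_user_per_sec
                       * safety_factor / requests_per_server_per_sec)   # 94 here
    to_order = max(0, needed - current_servers)                         # 54 here

    print(f"need ~{needed} servers; order {to_order} at least "
          f"{lead_time_days} days before the expected spike")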

Edit: (we had our own datacentre and the suppliers were in a different state)


> The marketing department knew ahead of time when something was likely to go viral

That's fine when it's your product. The situation I'm talking about was a SaaS product providing backend services for customers' apps. Our customers didn't know if their app was going to go viral; there is no way we could have known. I maintain on-prem would have been totally inappropriate in this situation.

Also, "the marketing department knew ahead of time when something was likely to go viral"...that is quite a statement. They must have beeen some marketing department.


Eh... I used to think they were just a normal marketing department. I have since learned that they were good and most places have bad ones.



