
> Within an hour of V8 pushing the fix for this, our build automation alerted me that it had picked up the patch and built a new release of the Workers Runtime for us. I clicked a button to start rolling it out. After quick one-click approvals from EM and SRE, the release went to canary. After running there for a short time to verify no problems, I clicked to roll it out world-wide, which is in progress now. It will be everywhere within a couple hours. Rolling out an update like this causes no visible impact to customers.

Great workflow! I long for the day when I can work for a company that actually has automation as efficient as this.

A few questions: do you have a way of differentiating critical patches like this one? If so, does that trigger an alert for the on-call person? Or do you still wait until working hours before such a change is pushed?



Look for a company whose business model includes uptime, security, and scalability, that is big enough not to outsource those parts, and that operates in a mature market where customers can tell the difference.


I once worked for a company that tried to set up a new service and asked for 99.99999% uptime. This worked really well for the 'ops' team, which focused on the AWS setup and automation, but meanwhile the developers (of which I was one, though I didn't have any say in things because I was 'just' a front-ender) fucked about with microservices, first built in NodeJS (with a Postgres database storing mostly JSON blobs behind them), then in Scala. Not because either was the best solution (neither microservices nor Scala), but because the developers wanted to, and the guys responsible for hiring were afraid they'd get mediocre developers if they went for 'just' Java.

I'm just so tired of the whole microservices and prima donna developer bullshit.


99.99999% uptime is about 3 seconds of downtime per year. Yikes! Does any service on Earth have that level of uptime?


No. In some sense it doesn't matter, though. There are plenty of services that deliver less than their claimed reliability:

* They set an easy measurement that doesn't match customer experience, so they say they're in-SLO when common sense suggests otherwise.

* They require customers to jump through hoops to get a credit after a major incident.

* The credits are often not total and/or are tiered by how badly reliability was missed (so an SLA can promise 100% uptime yet not give a 100% discount when some errors are served). At the very most, they give the customer a free month. It's not as if they make the customer whole on their lost revenue.

With a standard industry SLA, you can have a profitable business claiming uptime you never ever achieve.


Also look at their job ads. If they are looking to hire a DevOps engineer to own their CI/CD pipeline, that means they don't have one (and, with that approach, never will).


My guess is that the main feature which enables this kind of automation is that they can take down any node without consequences. So they can just install an update on all the machines, and then reboot/restart the software on the machines sequentially. If you have implemented redundancy correctly, then software updating becomes simple.
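
In code, that sequential approach might look something like this rough Go sketch. The host list and the drainNode/updateNode/isHealthy helpers are made-up placeholders for whatever your fleet tooling actually provides:

    package main

    import (
        "fmt"
        "log"
        "time"
    )

    // Placeholder helpers standing in for whatever your fleet tooling provides.
    func drainNode(host string) error  { return nil }  // stop routing new traffic to host
    func updateNode(host string) error { return nil }  // install the new build and restart
    func isHealthy(host string) bool   { return true } // poll the node's health endpoint

    func main() {
        hosts := []string{"node-1", "node-2", "node-3"} // placeholder fleet

        // Update one node at a time; redundancy means the rest keep serving.
        for _, h := range hosts {
            if err := drainNode(h); err != nil {
                log.Fatalf("drain %s: %v", h, err)
            }
            if err := updateNode(h); err != nil {
                log.Fatalf("update %s: %v", h, err)
            }
            // Wait until the node reports healthy before moving on,
            // so a bad build never takes out more than one node.
            for !isHealthy(h) {
                time.Sleep(5 * time.Second)
            }
            fmt.Println("updated", h)
        }
    }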


We actually update each machine while it is serving live traffic, with no downtime.

We start a new instance of the server, warm it up (pre-load popular Workers), then move all new requests over to the new instance, while allowing the old instance to complete any requests that are in-flight.

Fewer moving parts makes it really easy to push an update at any time. :)
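
For anyone curious what that drain-then-swap pattern looks like in general: this is not the Workers Runtime's actual code, just a minimal Go sketch of the "stop taking new requests, let in-flight ones finish" half, using the standard library's graceful shutdown. How new traffic reaches the replacement instance (SO_REUSEPORT, a fronting balancer, etc.) is assumed to be handled elsewhere:

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"}

        go func() {
            if err := srv.ListenAndServe(); err != http.ErrServerClosed {
                log.Fatal(err)
            }
        }()

        // Wait for the signal that a replacement instance is up and has
        // taken over new connections (handled outside this sketch).
        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM)
        <-stop

        // Stop accepting new connections and let in-flight requests
        // finish, with an upper bound on how long the drain may take.
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("drain did not complete cleanly: %v", err)
        }
    }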


What happens if you have long-running tasks in the worker?


Can you include me too?! I wish I could automate the hell out of everything like they do :P


You can!


Specifically, you can do two things: 1) plan incremental improvements, and 2) move to simpler designs.

For 1), write down the entire manual workflow. Start automating pieces that are easy to automate, even if someone has to run the automation manually. Continue to automate the in-between/manual pieces. For this you can use autonomation to fall back to manual work if complete automation is too difficult/risky.
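
As a rough illustration of that "automate the easy steps, keep a human in the loop for the rest" idea, here is a Go sketch; the step names and commands are invented:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "os/exec"
    )

    // step is one piece of the written-down workflow. Automated steps run
    // a command; the others fall back to a manual prompt (autonomation).
    type step struct {
        name      string
        automated bool
        cmd       []string // only set for automated steps
    }

    func main() {
        // Invented steps and commands; substitute your real workflow.
        steps := []step{
            {name: "run tests", automated: true, cmd: []string{"make", "test"}},
            {name: "build release", automated: true, cmd: []string{"make", "release"}},
            {name: "review canary metrics", automated: false}, // still a human call
            {name: "roll out everywhere", automated: true, cmd: []string{"make", "deploy"}},
        }

        in := bufio.NewReader(os.Stdin)
        for _, s := range steps {
            if s.automated {
                fmt.Println("running:", s.name)
                out, err := exec.Command(s.cmd[0], s.cmd[1:]...).CombinedOutput()
                if err != nil {
                    log.Fatalf("%s failed: %v\n%s", s.name, err, out)
                }
                continue
            }
            // Manual fallback: stop and wait for a person to confirm.
            fmt.Printf("manual step %q, press Enter when done: ", s.name)
            if _, err := in.ReadString('\n'); err != nil {
                log.Fatal(err)
            }
        }
        fmt.Println("workflow complete")
    }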

For 2), look at your system's design. See where the design/tools/implementation/etc limit the ability to easily automate. To replace a given workflow section, you can a) replace some part of your system with a functionally-equivalent but easier to automate solution, or b) embed some new functionality/logic into that section of the system that extends and slightly abstracts the functionality, so that you can later easily replace the old system with a simpler one.

To get extra time/resources to spend on the automation, you can do a cost-benefit analysis. Record the manual processes' impact for a month, and compare this to an automated solution scaled out to 12-36 months (and the cost to automate it). Also include "costs" like time to market for deliverables and quality improvements. Business people really like charts, graphs, and cost saving estimates.
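
A back-of-the-envelope version of that comparison, with all of the numbers invented:

    package main

    import "fmt"

    func main() {
        // Invented figures; measure your own for a month.
        manualHoursPerMonth := 20.0 // team time spent on the manual process
        hourlyCost := 80.0          // loaded cost per engineer-hour
        automationCost := 15000.0   // one-off cost to build the automation
        horizonMonths := 24.0       // somewhere in the 12-36 month range

        manualTotal := manualHoursPerMonth * hourlyCost * horizonMonths
        breakEven := automationCost / (manualHoursPerMonth * hourlyCost)

        fmt.Printf("manual cost over %.0f months: $%.0f\n", horizonMonths, manualTotal)
        fmt.Printf("automation pays for itself after %.1f months\n", breakEven)
    }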


Thank you for the rundown!



