
> Within an hour of V8 pushing the fix for this, our build automation alerted me that it had picked up the patch and built a new release of the Workers Runtime for us. I clicked a button to start rolling it out. After quick one-click approvals from EM and SRE, the release went to canary. After running there for a short time to verify no problems, I clicked to roll it out world-wide, which is in progress now. It will be everywhere within a couple hours. Rolling out an update like this causes no visible impact to customers.

Great workflow! I long for the day when I can work for a company that actually has automation as efficient as this.

A few questions: do you have a way of differentiating critical patches like this one? If so, does that trigger an alert for the on-call person? Or do you still wait until working hours before such a change is pushed?



Look for a company whose business model includes uptime, security, and scalability, that is big enough not to outsource those parts, and that operates in a mature market where customers can tell the difference.


I once worked for a company that tried to set up a new service and asked for 99.99999% uptime. This worked really well for the 'ops' team, which focused on the AWS setup and automation, but meanwhile the developers (of which I was one, though I didn't have any say in things because I was 'just' a front-ender) fucked about with microservices, first built in NodeJS (with a Postgres database storing mostly JSON blobs behind them), then in Scala. Not because either was the best solution (neither microservices nor Scala), but because the developers wanted to, and the guys responsible for hiring were afraid they'd get mediocre developers if they went for 'just' Java.

I'm just so tired of the whole microservices and prima donna developer bullshit.


99.99999% uptime is about 3 seconds of downtime per year. Yikes! Does any service on Earth have that level of uptime?


No. In some sense it doesn't matter, though. There are plenty of services that deliver less than their claimed reliability:

* They set an easy measurement that doesn't match customer experience, so they say they're in-SLO when common sense suggests otherwise.

* They require customers to jump through hoops to get a credit after a major incident.

* The credits are often not total and/or are tiered by how badly reliability was missed (so an SLA can promise 100% uptime yet not give a 100% discount when some errors are served). At the very most, they give the customer a free month. It's not as if they make the customer whole on their lost revenue.

With a standard industry SLA, you can have a profitable business claiming uptime you never ever achieve.


Also look at their job ads. If they are looking to hire a DevOps engineer to own their CI/CD pipeline, that means they don't have one (and, with that approach, never will).


My guess is that the main feature which enables this kind of automation is that they can take down any node without consequences. So they can just install an update on all the machines, and then reboot/restart the software on the machines sequentially. If you have implemented redundancy correctly, then software updating becomes simple.
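
In code, that sequential approach might look something like this rough Go sketch. The host list and the drainNode/updateNode/isHealthy helpers are made-up placeholders for whatever your fleet tooling actually provides:

    package main

    import (
        "fmt"
        "log"
        "time"
    )

    // Placeholder helpers standing in for whatever your fleet tooling provides.
    func drainNode(host string) error  { return nil }  // stop routing new traffic to host
    func updateNode(host string) error { return nil }  // install the new build and restart
    func isHealthy(host string) bool   { return true } // poll the node's health endpoint

    func main() {
        hosts := []string{"node-1", "node-2", "node-3"} // placeholder fleet

        // Update one node at a time; redundancy means the rest keep serving.
        for _, h := range hosts {
            if err := drainNode(h); err != nil {
                log.Fatalf("drain %s: %v", h, err)
            }
            if err := updateNode(h); err != nil {
                log.Fatalf("update %s: %v", h, err)
            }
            // Wait until the node reports healthy before moving on,
            // so a bad build never takes out more than one node.
            for !isHealthy(h) {
                time.Sleep(5 * time.Second)
            }
            fmt.Println("updated", h)
        }
    }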


We actually update each machine while it is serving live traffic, with no downtime.

We start a new instance of the server, warm it up (pre-load popular Workers), then move all new requests over to the new instance, while allowing the old instance to complete any requests that are in-flight.

Fewer moving parts makes it really easy to push an update at any time. :)
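
For anyone curious what that drain-then-swap pattern looks like in general: this is not the Workers Runtime's actual code, just a minimal Go sketch of the "stop taking new requests, let in-flight ones finish" half, using the standard library's graceful shutdown. How new traffic reaches the replacement instance (SO_REUSEPORT, a fronting balancer, etc.) is assumed to be handled elsewhere:

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"}

        go func() {
            if err := srv.ListenAndServe(); err != http.ErrServerClosed {
                log.Fatal(err)
            }
        }()

        // Wait for the signal that a replacement instance is up and has
        // taken over new connections (handled outside this sketch).
        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM)
        <-stop

        // Stop accepting new connections and let in-flight requests
        // finish, with an upper bound on how long the drain may take.
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("drain did not complete cleanly: %v", err)
        }
    }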


What happens if you have long-running tasks in the worker?


Can you include me too?! I wish I could automate the hell out of everything like they do :P


You can!


Specifically, you can do two things: 1) plan incremental improvements, and 2) move to simpler designs.

For 1), write down the entire manual workflow. Start automating pieces that are easy to automate, even if someone has to run the automation manually. Continue to automate the in-between/manual pieces. For this you can use autonomation to fall back to manual work if complete automation is too difficult/risky.
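
As a rough illustration of that "automate the easy steps, keep a human in the loop for the rest" idea, here is a Go sketch; the step names and commands are invented:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "os/exec"
    )

    // step is one piece of the written-down workflow. Automated steps run
    // a command; the others fall back to a manual prompt (autonomation).
    type step struct {
        name      string
        automated bool
        cmd       []string // only set for automated steps
    }

    func main() {
        // Invented steps and commands; substitute your real workflow.
        steps := []step{
            {name: "run tests", automated: true, cmd: []string{"make", "test"}},
            {name: "build release", automated: true, cmd: []string{"make", "release"}},
            {name: "review canary metrics", automated: false}, // still a human call
            {name: "roll out everywhere", automated: true, cmd: []string{"make", "deploy"}},
        }

        in := bufio.NewReader(os.Stdin)
        for _, s := range steps {
            if s.automated {
                fmt.Println("running:", s.name)
                out, err := exec.Command(s.cmd[0], s.cmd[1:]...).CombinedOutput()
                if err != nil {
                    log.Fatalf("%s failed: %v\n%s", s.name, err, out)
                }
                continue
            }
            // Manual fallback: stop and wait for a person to confirm.
            fmt.Printf("manual step %q, press Enter when done: ", s.name)
            if _, err := in.ReadString('\n'); err != nil {
                log.Fatal(err)
            }
        }
        fmt.Println("workflow complete")
    }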

For 2), look at your system's design. See where the design/tools/implementation/etc limit the ability to easily automate. To replace a given workflow section, you can a) replace some part of your system with a functionally-equivalent but easier to automate solution, or b) embed some new functionality/logic into that section of the system that extends and slightly abstracts the functionality, so that you can later easily replace the old system with a simpler one.

To get extra time/resources to spend on the automation, you can do a cost-benefit analysis. Record the manual processes' impact for a month, and compare this to an automated solution scaled out to 12-36 months (and the cost to automate it). Also include "costs" like time to market for deliverables and quality improvements. Business people really like charts, graphs, and cost saving estimates.
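
A back-of-the-envelope version of that comparison, with all of the numbers invented:

    package main

    import "fmt"

    func main() {
        // Invented figures; measure your own for a month.
        manualHoursPerMonth := 20.0 // team time spent on the manual process
        hourlyCost := 80.0          // loaded cost per engineer-hour
        automationCost := 15000.0   // one-off cost to build the automation
        horizonMonths := 24.0       // somewhere in the 12-36 month range

        manualTotal := manualHoursPerMonth * hourlyCost * horizonMonths
        breakEven := automationCost / (manualHoursPerMonth * hourlyCost)

        fmt.Printf("manual cost over %.0f months: $%.0f\n", horizonMonths, manualTotal)
        fmt.Printf("automation pays for itself after %.1f months\n", breakEven)
    }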


Thank you for the rundown!



