Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes the triple-checker will make mistakes too, but way less than the person who puts less effort in.
But that has nothing to do with blaming processes vs people.
If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.
If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?
If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.
Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.
But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.
You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.
Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.
I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.
To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.
If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.
To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.
What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.
But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.
The issue is that this view always relies on stuff like "make people triple check everything".
- How does that relate to making a config change?
- How do you practically implement a system where someone has to triple check everything they do?
- How do you stop them just clicking 'confirm' three times?
- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?
I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.
And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?
This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.
(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)