Human error is never a root cause, but systems can always be improved and made more resilient.
When analysing an incident or problem, it can be tempting to use human error as a root cause. If we dig in deeper, though, what appears to be human error is caused by an underlying failure of process or environment. How can that be? Here are some possibilities:
– A fragile, poorly instrumented, or overly complex system can cause humans to make mistakes.
– A process that doesn’t take into account human needs, such as sleep, context, or skill, can also cause humans to make mistakes.
– A process of hiring and training operators may be broken, allowing the wrong operators into the environment.
Furthermore, “root cause” itself is a problematic statement, as there is rarely a single issue that leads to errors and incidents. Complex systems lead to complex failures, and adding humans into the mix complicates things further. Instead of thinking in terms of root cause, I suggest you consider a list of contributing factors, prioritised by risk and impact.
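One way to make such a list concrete is to score each factor and sort. The sketch below is only an illustration: the factor descriptions, the 1 to 5 scales, and the risk-times-impact score are all assumptions made up for the example, not a standard method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContributingFactor:
    description: str
    risk: int    # assumed scale: 1 (unlikely to recur) to 5 (very likely)
    impact: int  # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # Hypothetical scoring rule: a simple product of risk and impact.
        return self.risk * self.impact


def prioritise(factors):
    """Return the factors ordered from highest to lowest score."""
    return sorted(factors, key=lambda f: f.score, reverse=True)


# Made-up factors from an imagined incident review.
factors = [
    ContributingFactor("Runbook out of date", risk=2, impact=3),
    ContributingFactor("No alert on replication lag", risk=4, impact=4),
    ContributingFactor("On-call engineer awake for 20 hours", risk=3, impact=5),
]

for factor in prioritise(factors):
    print(f"{factor.score:>2}  {factor.description}")
```

Note that no single entry is singled out as "the" cause; the list simply tells you where to spend effort first.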
Being able to recover quickly from failure is more important than having failures less often. This is in part due to the increased complexity of failures today.
When you create a system that rarely breaks, you create a system that is inherently fragile. Will your team be ready to do repairs when the system does fail? Will it even know what to do? Systems whose frequent failures are controlled and mitigated, so that their impact is negligible, have teams that know what to do when things go sideways. Processes are well documented and honed, and automated remediation becomes actually useful rather than hiding in the dark corners of your system.
While I’m definitely not saying failure should be an acceptable condition, I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure as on trying to prevent it.
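A little arithmetic shows why recovery time matters so much. The usual steady-state availability formula is MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to recover. The sketch below compares two imaginary systems; the figures are invented purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Invented figures: one system fails rarely but takes a long time to repair,
# the other fails ten times as often but recovers in about six minutes.
rarely_fails = availability(mtbf_hours=1000, mttr_hours=10)
recovers_fast = availability(mtbf_hours=100, mttr_hours=0.1)

print(f"Rarely fails, slow recovery:  {rarely_fails:.4%}")
print(f"Fails often, fast recovery:   {recovers_fast:.4%}")
```

With these made-up numbers, the system that fails ten times as often is still up roughly 99.9% of the time, against roughly 99.0% for the one that rarely fails.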
Ronald Heifetz is the King Hussein bin Talal Senior Lecturer in Public Leadership at Harvard University’s John F. Kennedy School of Government. For the past twenty years, he has generated critical works that have influenced leadership theory in every domain. Heifetz often draws on the metaphor of the dance floor and the balcony.
Let’s say you are dancing in a big ballroom. . . . Most of your attention focuses on your dance partner, and you reserve whatever is left to make sure you don’t collide with dancers close by. . . . When someone asks you later about the dance, you exclaim, “The band played great, and the place surged with dancers.”
But, if you had gone up to the balcony and looked down on the dance floor, you might have seen a very different picture. You would have noticed all sorts of patterns. . . you might have noticed that when slow music played, only some people danced; when the tempo increased, others stepped onto the floor; and some people never seemed to dance at all. . . . the dancers all clustered at one end of the floor, as far away from the band as possible. . . . You might have reported that participation was sporadic, the band played too loud, and you only danced to fast music.
. . .The only way you can gain both a clearer view of reality and some perspective on the bigger picture is by distancing yourself from the fray. . . .
If you want to affect what is happening, you must return to the dance floor.*
So you need to be both among the dancers and up on the balcony. That’s where the magic is, going back and forth between the two, using one to leverage the other.
* Heifetz, R., and Linsky, M. Leadership on the Line: Staying Alive Through the Dangers of Leading. Boston: Harvard Business School Press, 2002.
Following on from my previous post, “There’s no such thing as a small change”…
Please do not make any changes to a production system – a live system – without first testing for any side effects. For example, please do not read a blog post or a book chapter, and then check your system and find you are using manual memory management – and then just turn on automatic memory management. Query plans may change and performance may be impacted. One of three things could happen:
- Things run exactly the same
- Things run better than they did before
- Things run much worse than they did before
Exercise caution before making changes; test the proposed change first!
Universities used to be centres of learning. Now most of them are corporations with huge marketing divisions, massive administration costs, crazy slogans, a fixation with dodgy rankings, an obsession with what is often low grade and banal research and an increasing reliance on casual low-cost staff. Otherwise it’s all good!
If an application’s performance limits the business processes it is supposed to support, the application must be tuned.
Today’s businesses depend heavily on their databases. Should applications and data become unavailable, the entire business may halt. Revenue and customers may be lost and penalties may be incurred. Bad press can have a lasting effect on both customers and stock prices. Certainly, providing continuous data availability is essential for today’s businesses.
It’s hard to believe that Microsoft is 41 years old. In that time, it’s had its ups (think Windows XP with around one billion sales) and its downs (think Windows ME, which lasted for less than 18 months). But one thing that’s clear is that Microsoft has cleverly reinvented itself, rebooted, and disrupted its own business in a massive way. Some would argue Microsoft is now “cool” again.