Aside

Root Cause

Human error is never a root cause, but systems can always be improved and made more resilient.

When analysing an incident or problem, it can be tempting to cite human error as the root cause. If we dig deeper, though, what appears to be human error turns out to be caused by an underlying failure of process or environment. How can that be? Here are some possibilities:

– A fragile, poorly instrumented, or overly complex system can cause humans to make mistakes.

– A process that doesn’t take into account human needs, such as sleep, context, or skill, can also cause humans to make mistakes.

– The process for hiring and training operators may be broken, allowing the wrong operators into the environment.

Furthermore, “root cause” is itself a problematic term, as there is rarely a single issue that leads to errors and incidents. Complex systems lead to complex failures, and adding humans into the mix complicates things further. Instead of thinking in terms of a root cause, I suggest you consider a list of contributing factors, prioritised by risk and impact.

Aside

Why MTTR Over MTBF?

Being able to recover quickly from failure is more important than having failures less often. This is in part due to the increased complexity of failures today.

When you create a system that rarely breaks, you create a system that is inherently fragile. Will your team be ready to do repairs when the system does fail? Will it even know what to do? Systems whose frequent failures are controlled and mitigated so that their impact is negligible are run by teams that know what to do when things go sideways. Processes are well documented and honed, and automated remediation becomes genuinely useful rather than hiding in the dark corners of your system.

While I’m definitely not saying failure should be an acceptable condition, I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure as on trying to prevent it.
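As a rough illustration, take the standard steady-state availability formula: availability = MTBF / (MTBF + MTTR). A service with an MTBF of 1,000 hours and an MTTR of 10 hours is available about 99.0% of the time. Doubling MTBF to 2,000 hours lifts that to roughly 99.5% – and simply halving MTTR to 5 hours gets you exactly the same 99.5%, which a team that practises recovery regularly will usually find far cheaper to buy.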

Aside

The Dance Floor and the Balcony

Ronald Heifetz is the King Hussein bin Talal Senior Lecturer in Public Leadership at Harvard University’s John F. Kennedy School of Government. For the past twenty years, he has generated critical works that have influenced leadership theory in every domain. Heifetz often draws on the metaphor of the dance floor and the balcony.

Let’s say you are dancing in a big ballroom. . . . Most of your attention focuses on your dance partner, and you reserve whatever is left to make sure you don’t collide with dancers close by. . . . When someone asks you later about the dance, you exclaim, “The band played great, and the place surged with dancers.”

But, if you had gone up to the balcony and looked down on the dance floor, you might have seen a very different picture. You would have noticed all sorts of patterns. . . you might have noticed that when slow music played, only some people danced; when the tempo increased, others stepped onto the floor; and some people never seemed to dance at all. . . . the dancers all clustered at one end of the floor, as far away from the band as possible. . . . You might have reported that participation was sporadic, the band played too loud, and you only danced to fast music.

. . .The only way you can gain both a clearer view of reality and some perspective on the bigger picture is by distancing yourself from the fray. . . .

If you want to affect what is happening, you must return to the dance floor.*

So you need to be both among the dancers and up on the balcony. That’s where the magic is, going back and forth between the two, using one to leverage the other.

_______

* Heifetz, R., and Linsky, M. Leadership on the Line: Staying Alive Through the Dangers of Leading. Boston: Harvard Business School Press, 2002.

Aside

Test Your Changes

Following on from my previous post, “There is no such thing as a small change”…

Please do not make any changes to a production system – a live system – without first testing for any side effects. For example, please do not read a blog post or a book chapter, notice that your system is using manual memory management, and then simply turn on automatic memory management. Query plans may change and performance may be impacted. One of three things could happen:

  • Things run exactly the same
  • Things run better than they did before
  • Things run much worse than they did before

Exercise caution before making changes; test the proposed change first!

There is no such thing as a small change

“We want to limit the length of a review in the product to 140 characters, because we may want to use SMS at some stage. That’s a small change, right?”

Wrong.

There are no small changes when you’re committed to delivering quality software. Let’s look at the above case. A naïve programmer may well get this coded in three minutes—after all, it’s just an if-statement.

A background in consulting, where you are paid for your time, teaches you to ask a few questions before proceeding with ‘small changes’. Let’s start with some easy questions.

What happens when the review is above 140 characters? Do we crop the string, or display an error message to the user? If we display an error, where does it appear? What does it say? Who is going to write the error message? How do we explain to the user why we’re limiting them to 140 characters? How will these errors look? Do we have a style defined? If not, who is designing it?
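Even the ‘easy’ server-side half of this is a real piece of code that somebody has to own. Here’s a minimal sketch – the module name, function names, and error copy are all invented for illustration, and it picks one answer to the first question above (reject with an error rather than crop):

```typescript
// shared/review-rules.ts -- hypothetical module; names and wording are
// illustrative only.
export const MAX_REVIEW_LENGTH = 140;

export interface ValidationResult {
  ok: boolean;
  error?: string;
}

export function validateReview(text: string): ValidationResult {
  // Count Unicode code points rather than UTF-16 units, so a single emoji
  // isn't treated as two characters.
  const length = [...text].length;
  if (length > MAX_REVIEW_LENGTH) {
    return {
      ok: false,
      // Somebody still has to write, style, and approve this wording.
      error: `Reviews are limited to ${MAX_REVIEW_LENGTH} characters; yours is ${length}.`,
    };
  }
  return { ok: true };
}
```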

But wait, there’s more…

In the unlikely event that we have answers to hand for all of the above concerns, we’re still not finished. Handling this only on the server is a messy way to surface an error; we should validate client-side as well. But if we’re going to do client-side validation, then I’d have a few more questions…

Who’s writing the JavaScript? Does the JavaScript display the same type of error as the server-side code? If not, what’s the new style? How does it behave without JavaScript? How do we ensure that any future change to the 140-character requirement affects both client-side and server-side validation?
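One common way to answer that last question – assuming the browser code and the server code can share a module, or at least a published package – is to import the same rule on both sides, so the limit can only ever change in one place. A sketch, with made-up element IDs:

```typescript
// Client-side use of the same hypothetical shared module the server imports.
import { validateReview } from "./shared/review-rules";

const form = document.querySelector<HTMLFormElement>("#review-form")!;
const textarea = document.querySelector<HTMLTextAreaElement>("#review-text")!;
const errorBox = document.querySelector<HTMLElement>("#review-error")!;

form.addEventListener("submit", (event) => {
  const result = validateReview(textarea.value);
  if (!result.ok) {
    // Same rule and same wording as the server, so the two cannot drift apart.
    errorBox.textContent = result.error ?? "";
    event.preventDefault();
  }
  // Without JavaScript this handler never runs, and the server-side
  // validation is the only line of defence -- which is exactly why it stays.
});
```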

We’re still not done. Look at this from a user’s point of view. They’re already frustrated by having to limit a review to 140 characters for a bizarre reason they won’t understand, and now we’re asking them to guess how long their message is? There must be a better way. Let’s give them a character counter. Oh, well that raises a few more questions…

Nearly there…

Who is going to write this character counter? If we’re using one we found on the net, then who wants to test it in our target browsers (i.e. not just Chrome 27 and beyond)?

Also, where is the count of letters displayed on the screen? What does the count look like? Of course, the style should change as the user approaches zero characters, and should definitely look erroneous when they’ve used more than 140 characters—or should it stop accepting input at that point? If so, what happens when they paste something in? Should we let them edit it down, or alert them?
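For what it’s worth, a counter that takes a position on most of those questions is only a dozen or so lines – but every line below encodes a decision somebody had to make, and the thresholds, class names, and the choice to let users paste over-long text and edit it down are all assumptions of this sketch:

```typescript
// Hypothetical counter wired to the same shared limit as the validation.
import { MAX_REVIEW_LENGTH } from "./shared/review-rules";

const textarea = document.querySelector<HTMLTextAreaElement>("#review-text")!;
const counter = document.querySelector<HTMLElement>("#review-counter")!;

function refreshCounter(): void {
  const remaining = MAX_REVIEW_LENGTH - [...textarea.value].length;
  counter.textContent = String(remaining);
  // Styling decisions: warn over the last 20 characters, look erroneous past 0.
  counter.classList.toggle("counter-warning", remaining >= 0 && remaining <= 20);
  counter.classList.toggle("counter-error", remaining < 0);
}

// "input" fires for typing, cutting, and pasting alike, so pasted text is
// accepted and simply shown as over the limit for the user to edit down.
textarea.addEventListener("input", refreshCounter);
refreshCounter();
```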

When we’ve implemented the character counter, styled all the errors, implemented the server-side validations, and checked it in all of our supported browsers, then it’s just a case of writing tests for it and then deploying it. Assuming your path to production is solid, this bit will be straightforward.

All of this happily ignores the fact that users will wonder why someone wrote an eighty-word review just before them and now they’re only allowed to write a 140-character one. Obviously we’ll need to keep support in the loop on this, and update our documentation, API, and iPhone and Android apps. Also, what do we do with all the previous reviews? Should we crop them, or leave them as is?

Don’t get me started on how we’re gonna deal with all the funky characters that people use these days… good luck sending them in a text message. We’ll probably need to sanitize the input string of rogue characters, and this means new error messages, new server-side code… the list goes on.
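Even the rogue-character check is a decision in disguise. The real constraint for SMS is the GSM 03.38 alphabet, which is fiddlier than anything below; this sketch uses a deliberately simplified allowlist just to show that detecting the characters is the easy part – explaining them to the user is yet more copy and UI:

```typescript
// Deliberately simplified: a real SMS gateway cares about GSM 03.38,
// not this made-up allowlist.
const ALLOWED = /^[A-Za-z0-9 .,!?'"@#$%&()+:;\n-]*$/;

export function findRogueCharacters(text: string): string[] {
  // Distinct characters we would have to reject or replace -- each one is
  // another thing some error message has to explain.
  return [...new Set([...text])].filter((ch) => !ALLOWED.test(ch));
}
```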

Once you get through all of this, you will have your feature in place, and this is just for a character count. Now try something that’s more complex than an if-statement. There are no tiny features when you’re doing things properly. This is why, as a UX designer, you need a good understanding of what it takes to implement a feature before you nod your head and write another bullet point.

You can’t be serious…

Yes, this was a rant. Yes, most of the above decisions can be made on the fly by experienced developers, but not all of them. Yes, you can use maxlength, but this only addresses one of the points above, and even then only in an HTML5 context.

What seems like a two-minute job can often turn into a two-hour job when the bigger picture isn’t considered. Features that seemed like ‘good value’ at a two-minute estimate are rightfully out of scope at two hours.

Key point: Scope grows in minutes, not months. Look after the minutes, and the months take care of themselves.

Agreeing to features is deceptively easy. Coding them rarely is. Maintaining them can be a nightmare. When you’re striving for quality, there are no small changes.

Red means stop, green means go… snow means…?

One more example to remind you that there is no such thing as a “small” change.

LED lights are an excellent lighting solution due to their longevity and power efficiency. Replacing all traffic lights with LED lights is a “small” change, right? Wrong. It turns out that they may not be the best choice in all conditions. Normally, the excess heat generated by incandescent bulbs is enough to melt the snow off the lights so that they remain visible even in freezing conditions. Traffic lights that employ LED lighting, while far more power efficient and reliable than older ones, aren’t able to melt the snow that accumulates.

Snow blocking traffic signals is a significant problem, as it has already led to dozens of accidents and at least one fatality.