Aside

Why MTTR Over MTBF?

Being able to recover quickly from failure is more important than having failures less often. This is in part due to the increased complexity of failures today.

When you create a system that rarely breaks, you create a system that is inherently fragile. Will your team be ready to do repairs when the system does fail? Will it even know what to do? Systems that have frequent failures that are controlled and mitigated such that their impact is negligible have teams that know what to do when things go sideways. Processes are well documented and honed, and automated remediation becomes actually useful rather than hiding in the dark corners of your system.

While I’m definitely not saying failure should be an acceptable condition, I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure rather than trying to prevent it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s