Aside

Root Cause

Human error is never a root cause, but systems can always be improved upon and made to be more resilient.

When analysing an incident or problem, it can be tempting to use human error as a root cause. If we dig in deeper, though, what appears to be human error is caused by an underlying failure of process or environment. How can that be? Here are some possibilities:

– A fragile, poorly instrumented, or overly complex system can cause humans to make mistakes.

– A process that doesn’t take into account human needs, such as sleep, context, or skill, can also cause humans to make mistakes.

– A process for hiring and training operators may be broken, allowing the wrong operators into the environment.

Furthermore, “root cause” is itself a problematic term, as there is rarely a single issue that leads to errors and incidents. Complex systems lead to complex failures, and adding humans into the mix complicates things further. Instead of thinking in terms of a root cause, I suggest you consider a list of contributing factors, prioritised by risk and impact.

Aside

Why MTTR Over MTBF?

Being able to recover quickly from failure is more important than having failures less often. This is in part due to the increased complexity of failures today.

When you create a system that rarely breaks, you create a system that is inherently fragile. Will your team be ready to do repairs when the system does fail? Will it even know what to do? Systems that have frequent failures that are controlled and mitigated such that their impact is negligible have teams that know what to do when things go sideways. Processes are well documented and honed, and automated remediation becomes actually useful rather than hiding in the dark corners of your system.

While I’m definitely not saying failure should be an acceptable condition, I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure as on trying to prevent it.
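
One way to put rough numbers on this is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration: the hour figures are made up, and the function name is my own. It compares doubling the time between failures with halving the time to recover.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only (assumptions, not measurements).
baseline = availability(mtbf_hours=500, mttr_hours=4)
fewer_failures = availability(mtbf_hours=1000, mttr_hours=4)   # MTBF doubled
faster_recovery = availability(mtbf_hours=500, mttr_hours=2)   # MTTR halved

print(f"baseline:        {baseline:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

Under this formula, halving MTTR buys the same availability as doubling MTBF, which is one way to see why recovery time deserves at least as much attention as failure frequency.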

Aside

The Dance Floor and the Balcony

Ronald Heifetz is the King Hussein bin Talal Senior Lecturer in Public Leadership at Harvard University’s John F. Kennedy School of Government. For the past twenty years, he has generated critical works that have influenced leadership theory in every domain. Heifetz often draws on the metaphor of the dance floor and the balcony.

Let’s say you are dancing in a big ballroom. . . . Most of your attention focuses on your dance partner, and you reserve whatever is left to make sure you don’t collide with dancers close by. . . . When someone asks you later about the dance, you exclaim, “The band played great, and the place surged with dancers.”

But, if you had gone up to the balcony and looked down on the dance floor, you might have seen a very different picture. You would have noticed all sorts of patterns. . . you might have noticed that when slow music played, only some people danced; when the tempo increased, others stepped onto the floor; and some people never seemed to dance at all. . . . the dancers all clustered at one end of the floor, as far away from the band as possible. . . . You might have reported that participation was sporadic, the band played too loud, and you only danced to fast music.

. . .The only way you can gain both a clearer view of reality and some perspective on the bigger picture is by distancing yourself from the fray. . . .

If you want to affect what is happening, you must return to the dance floor.*

So you need to be both among the dancers and up on the balcony. That’s where the magic is, going back and forth between the two, using one to leverage the other.

_______

* Heifetz, R., and Linsky, M. Leadership on the Line: Staying Alive Through the Dangers of Leading. Boston: Harvard Business School Press, 2002.

Aside

Test Your Changes

Following on from my previous post on how there’s no such thing as a small change…

Please do not make any changes to a production system – a live system – without first testing for any side effects. For example, please do not read a blog post or a book chapter, check your system, find you are using manual memory management, and then just turn on automatic memory management. Query plans may change and performance may be impacted. One of three things could happen:

  • Things run exactly the same
  • Things run better than they did before
  • Things run much worse than they did before

Exercise caution before making changes; test the proposed change first!

Querying the alert log via SQL

Quick tip regarding the Oracle database alert log (from 11g onwards). There is a fixed table, X$DBGALERTEXT, that you can query:


SQL> select message_text from X$DBGALERTEXT where rownum <= 30;

MESSAGE_TEXT
-----------------------------------------------------------------------------------------------------------------------------------------
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 2
Number of processor cores in the system is 2
Number of processor sockets in the system is 1
Shared memory segment for instance monitoring created
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
IMODE=BR
ILAT =27
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.

ORACLE_HOME = /u01/app/oracle/product/11.2.0/orcl
System name:Linux
Node name:ODIGettingStarted
Release:2.6.39-400.17.1.el6uek.x86_64
Version:#1 SMP Fri Feb 22 18:16:18 PST 2013
Machine:x86_64
Using parameter settings in client-side pfile /u01/app/oracle/admin/orcl/pfile/init.ora on machine ODIGettingStarted
System parameters with non-default values:

30 rows selected.

My personal opinion? This can be useful if you're looking to create some custom alert log monitoring. However, I still prefer to monitor my alert logs using shell scripts, since accessing this X$ table requires the instance to be up and operational. But if you don't have access to the OS, then this could be useful.
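
For comparison, here is a minimal sketch of the file-based approach, written in Python rather than shell. It assumes the alert log is a plain text file and that errors show up as ORA- followed by digits. The function name, the regular expression, and the sample lines are my own illustration, not taken from a real log.

```python
import re
from typing import Iterable, List

# Assumption: errors of interest look like "ORA-" followed by 3 to 5 digits.
ORA_ERROR = re.compile(r"\bORA-\d{3,5}\b")


def find_ora_errors(lines: Iterable[str]) -> List[str]:
    """Return every line that mentions an ORA- error, without the trailing newline."""
    return [line.rstrip("\n") for line in lines if ORA_ERROR.search(line)]


# Made-up sample lines standing in for an alert log.
sample = [
    "Starting ORACLE instance (normal)\n",
    "ORA-00600: internal error code (made-up sample line)\n",
    "Autotune of undo retention is turned on.\n",
    "Errors in trace file (made-up sample line): ORA-700 [dbgrfafr_1]\n",
]

for hit in find_ora_errors(sample):
    print(hit)
```

To scan a real log, pass an open file handle instead of `sample`. The path depends on your diagnostic destination, so it is deliberately not shown here. Because this reads the file directly, it keeps working when the instance is down.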

I also found the following Metalink note:
High CPU for Queries on X$DBGALERTEXT (Doc ID 2056666.1)

APPLIES TO:

Oracle Database – Enterprise Edition – Version 11.2.0.1 and later
Information in this document applies to any platform.

SYMPTOMS

  • Query on X$DBGALERTEXT consumes high CPU, taking a long time to complete. For example:
SELECT count(*)
FROM X$DBGALERTEXT
WHERE to_date(to_char(originating_timestamp, 'dd-mon-yyyy hh24:mi'), 'dd-mon-yyyy hh24:mi') > to_date(to_char(systimestamp - .00694, 'dd-mon-yyyy hh24:mi'), 'dd-mon-yyyy hh24:mi') /* last 10 minutes */
AND (
message_text = 'ORA-00600'
OR message_text LIKE '%fatal%'
OR message_text LIKE '%error%'
OR message_text LIKE '%ORA-%'
OR message_text LIKE '%terminating the instance%'
);
  • It can also cause ORA-700 [dbgrfafr_1].

Comments in your SPFILE

The next time you change a parameter in an SPFILE, consider using the COMMENT= clause to document why the change was made:

SQL> alter system set pga_aggregate_target=512m comment='Changed 04-JUN-2018, AWR recommendation, MR';

System altered.

 

SQL> select value, update_comment from v$parameter where name = 'pga_aggregate_target';

VALUE                UPDATE_COMMENT
-------------------- ---------------------------------------------------------------------------------------------------
536870912            Changed 04-JUN-2018, AWR recommendation, MR
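
If you want every comment to follow the same "Changed <date>, <reason>, <initials>" pattern as above, a small helper can build the statement for you. This is just a sketch: the function name and its arguments are hypothetical, it only produces the SQL text (it does not connect to a database), and it assumes the parameter name and value come from a trusted DBA rather than user input.

```python
from datetime import date

# Fixed month abbreviations so the output does not depend on the locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def alter_system_with_comment(parameter: str, value: str, reason: str,
                              initials: str, changed_on: date) -> str:
    """Build an ALTER SYSTEM statement with a dated COMMENT= clause."""
    stamp = f"{changed_on.day:02d}-{MONTHS[changed_on.month - 1]}-{changed_on.year}"
    comment = f"Changed {stamp}, {reason}, {initials}"
    # Double any single quotes so the comment stays a valid SQL string literal.
    comment = comment.replace("'", "''")
    return f"alter system set {parameter}={value} comment='{comment}';"


print(alter_system_with_comment(
    "pga_aggregate_target", "512m", "AWR recommendation", "MR", date(2018, 6, 4)
))
```

Called with the values from the example above, this reproduces the same statement, ready to paste into SQL*Plus after review.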