Data quality can be a real competitive advantage for companies that get it right. However, many data professionals who realise the long-term benefits of data accuracy still struggle to gain support for comprehensive and effective data quality programs.
Many organisations are facing unprecedented pressure in today’s Amazon-dominated world. The reality is that there is a “garbage in, garbage out” cycle of inaccuracies plaguing supply chain data, which can both create inefficiency and negatively impact the consumer experience. For example, one small measurement error can mean a shipment will not fit into the warehouse space assigned, causing a company to incur thousands of dollars in unnecessary costs. Additionally, a product ingredient missing from a product listing can cause an adverse reaction in a particularly vocal consumer using social media, leading to long-term damage to the brand’s reputation.
There are three pillars that each promote product information accuracy:
Human error is never a root cause, but systems can always be improved upon and made to be more resilient.
When analysing an incident or problem, it can be tempting to use human error as a root cause. If we dig in deeper, though, what appears to be human error is caused by an underlying failure of process or environment. How can that be? Here are some possibilities:
– A fragile, poorly instrumented, or overly complex system can cause humans to make mistakes
– A process that doesn’t take into account human needs, such as sleep, context or skill can also cause humans to make mistakes
– A process of hiring and training operators may be broken, allowing the wrong operators into the environment.
Furthermore, “root cause” itself is a problematic statement, as there is rarely a single issue that leads to errors and incidents. Complex systems lead to complex failures, and adding humans into the mix complicates things further. Instead of thinking in terms of root cause, I suggest you consider a list of contributing factors, prioritised by risk and impact.
Being able to recover quickly from failure is more important than having failures less often. This is in part due to the increased complexity of failures today.
When you create a system that rarely breaks, you create a system that is inherently fragile. Will your team be ready to do repairs when the system does fail? Will it even know what to do? Systems that have frequent failures that are controlled and mitigated such that their impact is negligible have teams that know what to do when things go sideways. Processes are well documented and honed, and automated remediation becomes actually useful rather than hiding in the dark corners of your system.
While I’m definitely not saying failure should be an acceptable condition, I’m positing that since failure will happen, it’s just as important (or in some cases more important) to spend time and energy on your response to failure rather than trying to prevent it.
The SLOs create the rules of the game that we are playing. We use the SLOs to decide what risks we can take, what architectural choices to make, and how to design the processes needed to support those architectures.
Ronald Heifetz is the King Hussein bin Talal Senior Lecturer in Public Leadership at Harvard University’s John F. Kennedy School of Government. For the past twenty years, he has generated critical works that have influenced leadership theory in every domain. Heifetz often draws on the metaphor of the dance floor and the balcony.
Let’s say you are dancing in a big ballroom. . . . Most of your attention focuses on your dance partner, and you reserve whatever is left to make sure you don’t collide with dancers close by. . . . When someone asks you later about the dance, you exclaim, “The band played great, and the place surged with dancers.”
But, if you had gone up to the balcony and looked down on the dance floor, you might have seen a very different picture. You would have noticed all sorts of patterns. . . you might have noticed that when slow music played, only some people danced; when the tempo increased, others stepped onto the floor; and some people never seemed to dance at all. . . . the dancers all clustered at one end of the floor, as far away from the band as possible. . . . You might have reported that participation was sporadic, the band played too loud, and you only danced to fast music.
. . .The only way you can gain both a clearer view of reality and some perspective on the bigger picture is by distancing yourself from the fray. . . .
If you want to affect what is happening, you must return to the dance floor.*
So you need to be both among the dancers and up on the balcony. That’s where the magic is, going back and forth between the two, using one to leverage the other.
* Heifetz, R., and Linsky, M. Leadership on the Line: Staying Alive Through the Dangers of Leading.Boston: Harvard Business School Press, 2002.
Following on from my previous post on there’s no such thing as a small change…
Please do not make any changes to a production system – a live system – without first testing for any side effects. For example, please do not read a blog post or a book chapter, and then check your system and find you are using manual memory management – and then just turn on automatic memory management. Query plans may change and performance may be impacted. One of three things could happen:
- Things run exactly the same
- Things run better than they did before
- Things run much worse than they did before
Exercise caution before making changes; test the proposed change first!
Quick tip regarding the Oracle database alert log (from 11g onwards). There is a fixed table X$DBGALERTEXT:
SQL> select message_text from X$DBGALERTEXT where rownum <= 30; MESSAGE_TEXT ----------------------------------------------------------------------------------------------------------------------------------------- Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Initial number of CPU is 2 Number of processor cores in the system is 2 Number of processor sockets in the system is 1 Shared memory segment for instance monitoring created CELL communication is configured to use 0 interface(s): CELL IP affinity details: NUMA status: non-NUMA system cellaffinity.ora status: N/A CELL communication will use 1 IP group(s): Grp 0: Picked latch-free SCN scheme 3 Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST Autotune of undo retention is turned on. IMODE=BR ILAT =27 LICENSE_MAX_USERS = 0 SYS auditing is disabled Starting up: Oracle Database 11g Enterprise Edition Release 126.96.36.199.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options. ORACLE_HOME = /u01/app/oracle/product/11.2.0/orcl System name:Linux Node name:ODIGettingStarted Release:2.6.39-400.17.1.el6uek.x86_64 Version:#1 SMP Fri Feb 22 18:16:18 PST 2013 Machine:x86_64 Using parameter settings in client-side pfile /u01/app/oracle/admin/orcl/pfile/init.ora on machine ODIGettingStarted System parameters with non-default values: 30 rows selected.
My personal opinion? This can be useful if you're looking to create some custom alert log monitoring. However I still prefer to monitor my alert logs using shell scripts since accessing this X$ table requires the instance to be up and operational. But if you don't have access to the OS then this could be useful.
I also found the following Metalink note:
High CPU for Queries on X$DBGALERTEXT (Doc ID 2056666.1)
Oracle Database – Enterprise Edition – Version 188.8.131.52 and later
Information in this document applies to any platform.
- Query on X$DBGALERTEXT consumes high CPU taking a long time to complete.For example:
WHERE to_date(to_char(originating_timestamp, ‘dd-mon-yyyy hh24:mi’), ‘dd-mon-yyyy hh24:mi’) > to_date(to_char(systimestamp – .00694, ‘dd-mon-yyyy hh24:mi’), ‘dd-mon-yyyy hh24:mi’) /* last 10 minutes */
message_text = ‘ORA-00600’
OR message_text LIKE ‘útal%’
OR message_text LIKE ‘%error%’
OR message_text LIKE ‘%ORA-%’
OR message_text LIKE ‘%terminating the instance%’
- It can also cause ORA-700 [dbgrfafr_1].
Consider using the COMMENT= clause to document why a particular change was made the next time you make a change to a parameter using an SPFILE:
SQL> alter system set pga_aggregate_target=512m comment='Changed 04-JUN-2018, AWR recommendation, MR'; System altered. SQL> select value, update_comment from v$parameter where name = 'pga_aggregate_target';SQL> select value, update_comment from v$parameter where name = 'pga_aggregate_target'; VALUE UPDATE_COMMENT -------------------- --------------------------------------------------------------------------------------------------- 536870912 Changed 04-JUN-2018, AWR recommendation, MR
“We want to limit the length of a review in the product to 140 characters, because we may want to use SMS at some stage. That’s a small change, right?”
There are no small changes when you’re committed to delivering quality software. Let’s look at the above case. A naïve programmer may well get this coded in three minutes—after all it’s just an if-statement.
A background in consulting, where you are paid for your time, teaches you to ask a few questions before proceeding with ‘small changes’. Let’s start with some easy questions.
What happens when the review is above 140 characters? Do we crop the string, or display an error message to the user? If we display an error, where does it appear? What does it say? Who is going to write the error message? How do we explain to the user why we’re limiting them to 140 characters? How will these errors look? Do we have a style defined? If not, who is designing it?
But wait, there’s more…
In the unlikely event that we have answers to hand for all of the above concerns, we’re still not finished. Just doing this server-side is a messy way to handle an error. We should do this client-side. But if we’re going to do client-side validation then I’d have a few more questions…
We’re still not done. Look at this from a users point of view. They’re already frustrated by having to limit a review to 140 characters for a bizarre reason they won’t understand, and now we’re asking them to guess how long their message is? There must be a better way. Let’s give them a character counter. Oh, well that raises a few more questions…
Who is going to write this character counter? If we’re using one we found on the net, then who wants to test it in our target browsers (i.e. not just Chrome 27 and beyond).
Also, where is the count of letters displayed on the screen? What does the count look like? Of course, the style should change as the user approaches zero characters, and should definitely look erroneous when they’ve used more than 140 characters—or should it stop accepting input at that point? If so, what happens when they paste something in? Should we let them edit it down, or alert them?
When we’ve implemented the character counter, styled all the errors, implemented the server-side validations, and checked it in all of our supported browsers then it’s just a case of writing tests for it and then deploying it. Assuming your time to production is solid, this bit will be straightforward.
All of this happily ignores the fact that users will wonder why someone wrote an eighty word review just before them and now they’re only allowed write a 140 character one. Obviously we’ll need to keep support in the loop on this, and update our documentation, API, iPhone, and Android apps. Also, what do we do with all the previous reviews? Should we crop them, or leave them as is?
Don’t get me started on how we’re gonna deal with all the funky characters that people use these days… good luck sending them in a text message. We’ll probably need to sanitize the input string of rogue characters, and this means new error messages, new server-side code… the list goes on.
Once you get through all of this you will have your feature in place, and this is just for a character count. Now try something that’s more complex than an if-statement. There are no tiny features when you’re doing things properly. This is why as a UX designer you need a good understanding of what it takes to implement a feature before you nod your head and write another bullet point.
You can’t be serious…
Yes, this was a rant. Yes, most of the above decisions can be made on the fly by experienced developers, but not all of them. Yes, you can use maxlength, but this only addresses one of the points above, and even then only in an HTML5 context.
Often what seems like a two minute job can often turn into a two hour job when the bigger picture isn’t considered. Features that seemed like ‘good value’ at a two minute estimate are rightfully out of scope at two hours.
Key point: Scope grows in minutes, not months. Look after the minutes, and the months take care of themselves.
Agreeing to features is deceptively easy. Coding them rarely is. Maintaining them can be a nightmare. When you’re striving for quality, there are no small changes.
Red means stop, green means go… snow means…?
One more example to remind you that there is no such thing as a “small” change.
LED lights are an excellent lighting solution due to their longevity and power efficiency. Replace all traffic lights with LED lights, “small” change right? Wrong. It turns out that they may not be the best choice in all conditions. Normally, the excess heat generated by incandescent bulbs is enough to melt the snow off lights so that they remain visible even in freezing conditions. Traffic lights that employ LED lighting, while far more power efficient and reliable than older ones, aren’t able to melt the snow that accumulates.
Snow blocking traffic signals is a significant problem as it has already led to dozens of accidents and at least one fatality.