Why aren’t we learning from (security) incidents – views from resilience and complexity

It’s been way too long since I last blogged, and this particular post has been on my mind for weeks now, so I decided today was the day to get it out.

I’m privileged to be part of a community of Safety and Resilience Engineering experts from whom I learn a lot, and as some of you may know I’m also fairly active in the Wardley Mapping and Cynefin (Complexity) communities. A lot in these different bodies of knowledge is complementary and related, and I’ll be doing some exploration of where I think they meet or are supportive of each other’s approaches.

First, to define the problem I’m referring to. In both security and other types of incidents, the amount of organisational learning that happens as a result is generally low. Most organisations try to identify a “root cause” as quickly as possible (without giving any attention or consideration to the myriad of contributing factors, at both the operations and governance levels, which led to the incident), “whack-a-mole” a solution to deal with that “root cause”, and then go back to business-as-usual and, from a management perspective, no longer care about it… until next time.

I believe this is really unfortunate because, as John Allspaw mentions, “incidents are an investment you already made. might as well get a return on your investment” (I’m probably butchering the quote here but that was the theme).

A view from Resilience

Many organisations say all the right things when it comes to thinking about the long-term vision of the capabilities and qualities of their organisations, but are then unwilling to do the daily work which is required to achieve those long-term plans.

And I don’t say this with any malice. This is hard work, and it’s hard work because you can’t justify it economically.

resilience is: proactive activities aimed at “preparing to be unprepared, without an ability to justify it economically”; sustaining the potential for future adaptive action when conditions change; and something that a system does, not what it has. Another way of thinking of resilience is “sustained adaptive capacity”
John Allspaw, https://www.infoq.com/news/2019/04/allspaw-resilience-engineering/

And because you can’t justify it economically, it’s extremely difficult for an executive to prioritise learning activities when there’s so much competing for the attention and resources they have available: multiple types of incidents, business funding, customers needing and pressuring for new features, new deals dependent on developing certain features, having to deal with technical and process debt, growing teams etc.

But the bottom line is (and we’ll get back to this later) that if you’re not doing the daily work now, that end state you/they long for will never come to be. We need to, collectively, stop this trend of misusing the word resilience to mean the acquisition of robustness. Resilience is about “sustained adaptive capacity”. When our technology stops being able to absorb the variability that the real world is throwing at it, it’s people that act as the buffer to absorb that variability, and if they’re not in the habit of learning and disseminating learning across their teams, you did NOT increase your adaptive capacity. At most you increased the robustness of some of your controls. As I’ve said many times before, there’s nothing wrong with robustness, just know that robust controls/constraints tend to fail catastrophically when design conditions are exceeded.

In summary, as anyone who has ever dieted, built muscle mass or progressed in martial arts knows, if you’re not putting in the work, don’t expect the results.

A view from Complexity

Some of the key ideas from Complexity that help explain why we’re not learning from incidents are those of disintermediation and affordances (which are related to the previous point as well).

I would highly recommend reading this post on Cognitive Edge’s blog on disintermediation:

disintermediation; the removal of interpretative levels between the decision maker and the raw data. There are several reasons for this, both positive and negative and the creation of multiple layers of mediation is one of the reasons leaders get into the mode satirised by the cartoon. You see this in politics and business alike and it is important to realise that in the vast majority of cases it is a consequence of a process rather than a moral choice by the individual.
That means you end up with people who filter that data for you. Unless you are a saint (in which case you are unlikely to be in this position anyway) it will be difficult to prevent those people presenting the data in a form which feeds your preferences. Their power comes from their ability to influence you and they kinda want to please.
Dave Snowden, https://www.cognitive-edge.com/disintermediation/

The lack of disintermediation is a big problem when dealing with incidents, particularly in the tech industry. Executives other than the CTO/CIO are unlikely to be “technical”, so they expect to receive a rundown in the form of a “what happened? what are we doing about it? when will it be fixed? now ensure it doesn’t happen again” cycle.

But by the time even the CTO/CIO gets feedback, it has usually already gone through a good number of interpretative layers, and higher-ups never get to understand what people saw, what was weird about it, what they related it to or why it made sense to them to act in a certain way. The cynic in me suggests there’s a “plausible deniability” element to it, in that “if you don’t know, you can’t act”, but there is also the less cynical issue of failing to understand that, for all the technical detail that may be involved in a tech incident, the response and recovery is largely a human thing that we don’t really need interpreters for. Getting into the habit of having direct access to the raw data of what happens during incident recovery would lead to insights (even for CTOs/CIOs) that there’s no other way to get.

Another important aspect is that of affordances, which is closely related to what I said earlier. I’ll quote myself (urgh) just to reiterate:

And because you can’t justify it economically, it’s extremely difficult for an executive to prioritise learning activities when there’s so much competing for the attention and resources they have available: multiple types of incidents, business funding, customers needing and pressuring for new features, new deals dependent on developing certain features, having to deal with technical and process debt, growing teams etc.
Me, scroll up

We often tend to attribute the fact that executives aren’t prioritising the things “we know” we should be working on to a lack of care, but we fail to appreciate the affordances provided by the environment and context.

Affordance is what the environment offers the individual.
James Gibson, The Senses Considered as Perceptual Systems

“Even if you know what the right thing to do is, you may not be able to do it. The context may not permit it in terms of expectations. If you are designing systems, you need to design for sight, attention and action separately”
Dave Snowden

How does a Product Owner justify spending 2 more hours on an incident after a “root cause” has been identified, when there’s clearly no financial or political incentive to do so? How does an executive justify to investors or the board that the 5 features they had planned for this quarter are delayed because we had a few incidents and decided to ensure we did loads of learning after we had identified a “root cause” and fixed it? Or that a sales deal fell through because we were doing things we can’t justify economically? This is the real challenge of learning from incidents (and it’s something I’ll get back to in a few weeks or months). As Dave Snowden says above, we need to design systems for this, and design independently for sight (to ensure we SEE the variability we struggle to deal with), attention (to ensure we design it in a way which can demand or facilitate attention to it) and action (to ensure we actually do and deliver the things that will improve our resilience).