Resilience engineering: Where do I start?

Originally published by Lorin Hochstein

This is an introductory guide to readings in resilience engineering. Key papers are organized into themes:

The papers linked here should all be accessible to casual readers.

For additional papers, see resilience engineering notes.

You may also be interested in the Resilience Roundup blog by Thai Wood: https://resilienceroundup.com/issues/

What is resilience?

Dr. Richard Cook describing the resilience of bone (video): https://www.youtube.com/watch?v=8LbePBiOvZ4

A resilient organization adapts effectively to surprise.

Here I’m using the definition proposed by David Woods. Before going into more detail about resilience, it’s important to distinguish it from a different concept that Woods calls robustness.

Robustness vs. resilience

When we talk about designing highly available systems, we usually cover techniques such as redundancy, retries, fallbacks, and failovers. We think about what might go wrong (e.g., server failure, network partition), and design our system to gracefully handle these situations.

Woods uses the term robustness to refer to systems that are designed to effectively handle known failure modes.

Resilience, on the other hand, describes how well the system can handle troubles that were not foreseeable by the designer. You can think of robustness as being able to deal well with known unknowns, and resilience as being able to deal well with unknown unknowns.
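
To make the distinction concrete for software systems, here is a minimal sketch of robustness in code (in Python; the service names fetch_profile and fetch_profile_replica are hypothetical stand-ins, not a real API): a retry-then-fallback wrapper that handles exactly the failure modes its designer anticipated, such as a transient timeout.

    import time

    class TransientError(Exception):
        """A failure mode the designer anticipated, e.g. a request timeout."""

    def fetch_profile(user_id: str) -> dict:
        """Hypothetical call to a primary service that may fail transiently."""
        raise TransientError("primary timed out")

    def fetch_profile_replica(user_id: str) -> dict:
        """Hypothetical fallback to a replica that may serve stale data."""
        return {"user_id": user_id, "stale": True}

    def robust_fetch(user_id: str, retries: int = 3, backoff_s: float = 0.1) -> dict:
        # Robustness: retries and a fallback cover the *known* failure modes.
        for attempt in range(retries):
            try:
                return fetch_profile(user_id)
            except TransientError:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
        # The known failure mode persists: fall back to the replica.
        return fetch_profile_replica(user_id)

    print(robust_fetch("alice"))

If a failure mode the designer never imagined shows up, say the replica returning plausible-looking but corrupt data, this wrapper will happily pass it along; adapting to that kind of surprise is what resilience is about, and it is not something you can capture in a snippet like this.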

Changing perspectives on accidents and safety

Resilience engineering as a field emerged from the safety science community. That’s why you’ll often see examples from aviation and medicine, as well as other safety-critical areas like maritime, space flight, nuclear power, and rail.

Because of this history, the earlier papers that we associate with resilience engineering are reactions to previous ways of thinking about accidents in particular and safety in general.

Note that traditional approaches to safety often focus on minimizing the variance associated with humans doing work, using techniques such as documented procedures and enforcement mechanisms that penalize deviation from them.

Those of us who work on cloud web services don’t have this legacy of enforced procedures to contend with.

New look / new view

The “new look” or “new view” refers to a changed perspective on how accidents happen: one that focuses on understanding how the actions taken by the actors involved in an incident were rational, given the information those actors had as events were unfolding.

Johan Bergström of Lund University has three excellent short (<10 minute) videos:

Two great introductory papers (alas, both paywalled) are:

A great book on putting these ideas into practice in incident investigations is:

Safety-II

Safety-II is a perspective on the role that humans play in safety-critical systems, proposed by Erik Hollnagel. In the Safety-II perspective, it is the everyday, normal work of the humans in the system that creates safety, as opposed to the errors of humans that erode it.

Complex systems

Ever wonder why resilience engineering advocates natter on about “no root cause”?

A recurring theme in resilience engineering is reasoning holistically about systems, as opposed to breaking them up into components and reasoning about each component separately. This perspective is known as systems thinking, a school of thought that has been influential in the resilience engineering community.

When you view the world as a system, the idea of cause becomes meaningless, because there’s no way to isolate an individual cause. Instead, the world is a tangled web of influences.

You’ll often hear the phrase socio-technical system. This language emphasizes that systems should be thought of as encompassing both humans and technologies, as opposed to thinking about technological aspects in isolation.

  • How complex systems fail by Richard I. Cook is a great starting point. It’s a short paper and very easy to read.
  • Drift into failure by Sidney Dekker is a book written for a lay audience, so it is also very readable. Dekker draws heavily from systems thinking to propose a theory about how complex systems can evolve into unsafe states.

Coordination

The systems we are interested in often involve a collection of people working together in some way to achieve a task. One particularly relevant example involves a collection of engineers working together to troubleshoot and repair a system during an ongoing incident.

Automation

One thing we software folk do have in common with the safety-critical world is the increased adoption of automation. Automation introduces challenges, and the nature of these challenges is a topic of many resilience engineering papers.

You might hear the phrase joint cognitive system in the context of automation. This term refers to systems that do cognitive work and that are made up of a combination of humans and software. There is an entire research discipline that studies joint cognitive systems, called cognitive systems engineering, initially developed by David Woods and Erik Hollnagel, both of whom would later go on to play a significant role in developing the field of resilience engineering.

Because resilience engineering researchers like Woods and Hollnagel have their roots in cognitive systems engineering, and because of the ever-increasing use of software automation in society, this community is very concerned about the potential brittleness associated with poor use of automation.

  • Ironies of automation by Lisanne Bainbridge is a classic paper on the problems that automation can introduce. The paper was originally written in 1983, and continues to be widely cited.
  • Ten challenges for making automation a team player by Klein et al. is a more recent paper that outlines the requirements for automation to be genuinely effective in socio-technical systems. This work draws heavily from the theme of coordination discussed earlier.

Boundary as a model (Rasmussen)

The late Jens Rasmussen is an enormously influential figure in the resilience engineering community.

In his widely cited 1997 paper “Risk management in a dynamic society: A modelling problem”, Rasmussen advocates for a cross-disciplinary, systems-based approach to thinking about how accidents occur. He argues that accidents occur because the system migrates across a dangerous boundary, and that this migration occurs during the course of normal work.

(Figure: Rasmussen’s diagram of the model, showing the system migrating toward the boundary of acceptable performance.)

David Woods

We’ve already referenced several papers authored or co-authored by David Woods. Woods is a force of nature in the field of resilience engineering, having played a key role in creating the field itself. Woods is incredibly prolific, and has introduced a wide variety of concepts related to resilience engineering.

Woods is interested in resilience engineering principles that apply across an enormous range of systems, from the organs in a biological organism up to organizations like NASA.

Because he’s interested in general principles, many of his papers are written at a very abstract level, where he discusses generic concepts such as units of adaptive behavior or saturation.

Dragons at the boundary

David Woods uses the metaphor of a system moving within a boundary in his writings on resilience engineering, but in a slightly different way than Rasmussen.

Woods sees the boundary as a competence envelope. There are two different regimes of system behavior: far from the boundary and near the boundary.

When a system is far from the boundary, the system (and its environment) behave as expected. By contrast, when a system grows near to the boundary, surprises happen. Woods uses the metaphor of dragons to capture the surprises that occur when a system moves near the boundary, and how the system’s model of the world is violated when it enters this regime.

How units within a system adapt when the system moves near the boundary, and how those units deal with the dragons, is one of Woods’s prime concerns.

Woods’s Essentials of Resilience, revisited discusses behavior at the boundary, although it doesn’t use the dragon metaphor.

The adaptive universe

Woods’s idea of the adaptive universe is characterized by three properties:

  • Resources are finite
  • Surprise is fundamental
  • Change never stops

I haven’t found a good introductory paper for the adaptive universe, as it encompasses an enormous number of topics, including the dragons at the boundary that we discussed earlier.

I recommend watching Woods’s Resilience Engineering short course, which covers this topic. I’ve written my own notes on the short course, which you might find useful. In particular, you might be interested in my summary notes.

Graceful extensibility

Woods introduced the theory of graceful extensibility to capture how successful systems adapt effectively to surprise. The most relevant paper here is:

Four cornerstones/abilities/potentials

Four essential capabilities in a resilient system (Hollnagel, 2009):   

  • Learning from experience requires learning from actual events, both what goes well and what goes wrong, not only from data in databases. It requires selecting what to learn and deciding how that learning is reflected in the organization, i.e., in changes to procedures and practices. This ability is related to coping with the factual.
  • Responding (including readiness to respond) to regular and irregular threats in a robust and flexible manner. A system is designed to provide a limited range of responses, so it must still be able to adjust those responses flexibly to meet unexpected demands. This ability enables coping with the actual.
  • Monitoring in a flexible way means watching both the system’s own performance and the external conditions that may affect the operation, focusing on what is essential to the operation. This makes it possible to identify what could become critical in the near future. This ability enables coping with the critical.
  • Anticipating threats and opportunities requires going beyond risk analysis and having the requisite imagination to see what may happen and to grasp key aspects of the future (Westrum, 1993). It is not only about identifying single events, but about how parts may interact and affect each other. This ability addresses how to deal with irregular, possibly even unexpected, events, thereby allowing the organization to cope with the potential.

Hollnagel, E. (2009). The four cornerstones of resilience engineering. In Nemeth, C., Hollnagel, E., & Dekker, S. (Eds.), Resilience Engineering Perspectives, Vol. 2: Preparation and Restoration. Ashgate, Aldershot, UK.