Asher Balkin Interviews Resilience Roundup’s Thai Wood

By: Beth Lay

Silence.  We have Thai Wood on the phone.  Asher looks at me, puzzled, then mouths “who is interviewing who?”  Asher and I were together at a meeting; 45 minutes earlier I’d asked if he wanted to “lead the interview”.  Asher resiliently recovered and led the conversation below.  Enjoy!

Listen to the interview

About Thai: Thai Wood writes about resilience engineering every week at ResilienceRoundup.com.   As a former Emergency Medical Technician, Thai applies his experience managing emergency situations to the software industry to help teams build more resilient systems and improve their ability to effectively respond to incidents.   Thai got his start in the software industry at Zappos.

About Asher: Asher Balkin is a safety researcher at OSU Cognitive Systems Lab with David Woods.

Asher:  How did you get started doing Resilience Engineering?

Thai:  Many years ago, I spent time as an Emergency Medical Technician (EMT), working 911 service in Las Vegas.  I started to see similar problems in the tech industry, but they had different solutions.  My thinking crystallized during the Re-deploy Conference (https://re-deploy.io/), where I spoke with John Allspaw and Laura Maguire (OSU).

Asher:  What problems, patterns, and similarities did you notice between EMT and software companies?

Thai: Incident response.  For software teams, there wasn’t a procedure for incident response, there was little opportunity to practice, and incident response was not established as a separate skill set.  In EMS and emergency medicine, it’s the opposite: they focus on practice and learning from different scenarios.  They rehearse.

Asher:  How did software companies organize incident responses before? Was it on the fly? 

Thai: I have seen a lot.  “You have a tangential skill set – emergency response – here’s your pager.”  Training may be through shadowing, but this overlooks having a core framework of how to respond and how to communicate.  Looking over the shoulder of an expert may not reveal why they do what they do.  Each responder bears the burden of coping with a lot of uncertainty and has to invent an approach for themselves instead of being given a framework to build on.

Asher:  Organizations across domains struggle with how to train people to do incident management.  How do you do it?

Thai: My suggestion sounds boring, but it is to rehearse and practice.  Sometimes we call it “Game Days” or “Operations D&D” (Dungeons & Dragons); this becomes a precursor to chaos engineering and other things.  It doesn’t matter at first how you practice; what matters is to create the space for practice, to have some sort of scenario-driven training.  It doesn’t have to be high tech.  There can be value in tabletop scenarios related to your system: you were paged for a database disk that’s full, what do you do?  Step through a role-play to highlight what you don’t know.  Look at the documentation: is it sufficient?  You may discover that only one person knows the answer to a question: everyone asks Jane.

Asher: You mentioned “creating space for practice” – this is powerful.  A lot of folks struggle with the idea that people need to be doing production work and believe they can’t afford time for training / rehearsal.  How have you been able to create and maintain the space for practice?

Thai: That struggle is common across all industries.  One point where it’s possible to make change is after a large incident.  It becomes evident what the payoff of investing in this training might be.  You can’t afford not to.  You will see the consequences of not doing so.  It may take an incident to open the space.

Asher:  Dave Woods calls it “Sweeping up after the parade.  All the animals leave droppings, then we get called in to clean up afterwards.”  But when we say “we can help you not have this problem in the beginning for a tenth of the price”, no one is interested.

Thai:  I liken it to asking people who cook, “Do you have a fire extinguisher?”  They reply “no” or they don’t know, but people who’ve been through a fire see the value and say “yes”.

Asher:  People who are most committed to investing in resilience are those who have frequent and tangible experiences with surprise.  The more often and more startling the surprise, the more you are willing to invest resources.

Asher: Earlier you said “Documentation doesn’t support the work.”  We talk about procedures / policies as directing the work.  What do you mean when you say documents support the work vs. direct the work?

Thai:  In any complex system, work as imagined is very different from work as done.  If we could very clearly spec something in a document beyond a few generic responses, we wouldn’t need humans to bridge the gap and be sources of resilience.  More useful documents should allow you to learn, perhaps by explaining how the system works, so that they support the work rather than just saying “type this command”.  That means understanding what the system touches, what disturbances have been seen before, what mysteries were never solved, and what weird things we’ve seen.  All these things help us ask and answer questions post-incident.

Asher: Going back to your past life in emergency medical response, what lesson or two is useful for what you do now?

Thai:  The number one thing is that the work and focus of responders does not need to be, and should not be, to make sure an incident never happens again.  It’s not possible.  It would be ridiculous to tell responders “make sure this car wreck never happens again”.  The first car wreck was in Ohio City, Ohio, about 130 years ago.  The person who made the car was driving the car and hit a tree.  I run up against this in software: the job of incident responders is seen as making sure the incident doesn’t happen again, as opposed to investing in preparation, understanding, and learning.

Asher: Is “never happen again” a legitimate goal?

Thai: It can be aspirational.  Of course, we hope bad things never happen, but making this the thing we ground our practice in is ineffective.  It may be better to dispose of that idea (never happen again) altogether.

Asher: What’s something you learned from your time in software systems that the rest of the world could learn?

Thai:  Compared with my experience in medicine, even though we talk about post-mortems there, I have seen a lot more post-incident analyses, a lot more retrospectives, and a lot more conversations about failure in the software industry.  With the habits and culture in medicine, that didn’t occur very often except in very formalized or disastrous cases.  Software is doing a better job of learning from incidents.

Asher:  In non-software industries, we hear “I could understand if I just had more data, if I could just measure this”.  It’s unique to digital systems that it is easier to collect data.  How has this affected the ability to learn from events?

Thai:  As a whole, I wonder if it is harmful.  I wouldn’t give it up, but it is so easy to collect data and build massive data stores, and so easy to make a dashboard without really looking at who will use it, what they are looking for, and how it works with the human part.  It can feel like making progress, but it has hurt a little when it comes to including humans and making useful things with the data.  I spoke with someone recently: they had easy monitoring and so much information about their system, but no way to take this data and form a picture of the health of the system.

Asher:  In every industry, but especially in digital networks, how do you teach new people to make sense of what they are seeing?

Thai: The only way is experience.  Everything we do is an attempt to find a proxy for experience.  Making space for practice is the most impactful thing, and it accelerates the cycle of gaining experience.  Consider flight simulators: they allow people to get experience more quickly.

Asher: Could you tell us about your newsletter, ResilienceRoundup.com?

Thai: I was doing research, talking to John Allspaw.  He told me about the GitHub site.  I read the papers, 10 to 40 pages long, with abstracts that were only notionally useful in telling me what I would find important by the end.  I thought: if I am going to review the papers anyway, how could my time and effort benefit others?  Each week, I focus on one paper and give my analysis of what technology workers can glean from it.  They are always public access papers.  I want people to read the paper, to go further.  Resilience Roundup is intended to be a bridge to the field of Resilience Engineering.  I offer something to help people get started and keep going.

Asher: Do you have a last thought to share?  

Thai: You must practice these skills.  Emergency response skills are not interchangeable with other skills.  We have to have a place to practice.

Follow: @ThaiWoodHere