Action Item: Ban All Vacations - Resilience Engineering Association

by J. Paul Reed (Netflix, USA)

“I just got a pager alert! Our service is reporting 100% CPU utilization; any ideas?”
“Not sure, but let’s start by reverting the last code deployment and seeing if that helps.”
“Ok… uhh… how do I do that?”
“Not sure; Sue knows all about our release pipeline, but… she’s… on vacation…”

Sound familiar?

As a Senior Applied Resilience Engineer on Netflix’s Critical Operations & Reliability Engineering (CORE) team, one of my jobs is to dig deep into the operational incidents we experience, mining for weak signals warning of high-impact risk, searching for areas for organizational development and learning, and discovering systemic themes about our ever-changing socio-technical ecosystem.

As we’ve invested more in these investigations, we’ve run into many cases where “vacation” turns out to be a contributing factor in how an incident unfolds. But, surprisingly, we’ve seen this play out in ways that go beyond the simple exchange above, where an engineer isn’t available in the moment of a specific incident.

In one case, a new code deployment was causing an issue for a small subset of users. The team involved elected to “fix-forward” and release a new version of the code. The fix didn’t end up fixing the issue, and the decision was made to roll back not to the previous version, but to the version before the previous version. Code deploys were automated, but code rollbacks were not. And the engineer who normally does these sorts of tasks was out on vacation. When another engineer valiantly attempted the rollback, they succeeded… but unfortunately they didn’t fully complete all the required rollback steps, resulting in an impactful incident.

In another case, a service deployed a new version of their code and within about two hours, it started impacting customers. System monitoring detected the issue, and it was remediated relatively quickly. But in the post-incident investigation, it was discovered that a member of the team owning the service had been on vacation. That engineer normally did the weekly deployments after the Monday team sync meeting. But because they weren’t there on Monday, the deployment happened on Tuesday, when they returned. However, another team member had added some code thinking they would have a full week to run that code in the test environment, because… “our team always deploys on Mondays!” (Except for the week their release captain took a three-day weekend!)

This “vacation” risk often presents itself as a more specific case of a broader risk my colleagues and I have dubbed “Something somebody doesn’t know.” These can appear as mental model gaps, operational expertise gaps, observability gaps, or the omnipresent “missing important context.” Because we take extra time to examine our incidents, we find that vacations can uncover missing context or highlight areas of operational expertise to improve… but in the cases above, because the vacation was a minor triggering event, not an explicit utterance during the incident response effort, we might never have fully understood the impact of vacations and leaves if we hadn’t taken the time to more fully understand the incident.

So what’s the solution? Well obviously, everyone’s expertise should be available at all times, so we need to ban all vacations immediately! (Just kidding.)

But these types of investigations do highlight the importance of organizational learning efforts that support the emergent exchange of information, what we’d call “context” at Netflix. They also raise a signal worth acting on: there may be an opportunity to improve a team’s onboarding processes, so team members don’t find themselves on call, responding to an incident, and lacking the details on how to execute important operational tasks. (Or, perhaps, code rollbacks need to be added to the pending feature list, to be automated just like deployments, and the rollbacks should be chaos tested!)

Whatever the path toward remediation is, learning that yes, even vacations and leaves can trigger and contribute to operational incidents in ways we would have never expected is a discovery we would never have made without looking more closely. And that gained insight reveals far more areas we can all work on to increase our resilience prowess.