New PhD Thesis: “Controlling the cognitive costs of coordination in large-scale distributed software systems”

By Laura Maguire, PhD

Critical digital infrastructure (CDI) is increasingly at the core of many high-risk domains that serve societal needs. Electronic health records, military intelligence surveillance, 911 call routing systems and monitoring of high hazard processes like nuclear power plants and air traffic control systems are all software intensive operations (SIO) that critically depend on reliable, timely system performance. Technology advances in recent years have shifted not only the technical architecture but the organizational design and work practices surrounding software engineering and digital service delivery (Kim et al. 2016; Beyer et al. 2016). This has, in turn, led to a concurrent shift in the skill sets and types of expertise needed to manage the infrastructure under these new conditions.  These changes have created new challenges for coordination that have yet to be fully realized.

In addition, SIO are heterogeneous – both in terms of the ways in which they utilize new architectures and the degree to which their practices are integrated across the systems they manage.  The implications of technological advances that generate new forms of dependencies, increased complexity and new coordinative demands have been discussed for many high hazard domains. An important point is that, similar to the patterns from high hazard domains, these new forms of dependencies, increased complexity and coordinative demands have become a reality for many kinds of critical societal infrastructure, and the full implications have yet to be explored. This research, conducted over three years within four mid- and large-sized software intensive organizations, provides generalizable insights into the escalating cognitive and coordination demands during anomaly response.

Responding to anomalies in the critical digital infrastructure domain involves coordination across a distributed system of automated subsystems and multiple human roles (Allspaw, 2015; Grayson, 2019).  When a service outage occurs in CDI, anomaly recognition is often a shared activity between the users of the service, the automated monitoring systems, and the practitioners responsible for developing and operating the service (Allspaw, 2015). This forms the first of the coordination demands, as information about the nature of the problem is sought. Responding adequately to complex systems failures requires multiple, diverse perspectives amongst responders, both for their different views of the system and its behavior and for their ability to recognize unexpected and abnormal conditions. Those whose work depends on the system functioning, both directly and indirectly, get involved either to help with resolution or to seek more information so they can adjust their goals and priorities to account for the loss of a critical service. These stakeholders generate varying but continuous coordination demands for responders, who must provide updates and manage interruptions. While this collaborative interplay and synchronization of roles is critical in anomaly response (Patterson et al, 1999; Patterson & Woods, 2001), the cognitive costs for practitioners (Klein et al, 2005; Klinger & Klein, 1999; Klein, 2006) can be substantial. Cognitive costs of coordination are defined as the additional mental effort, load and delay required to participate in joint activities.

The choreography of joint activity is shown to be subtle and highly integrated into the technical efforts of dynamic fault management. A core contribution of this research was to define the elements of choreography needed for smooth coordination and their corresponding overhead costs. An example of this is recruiting needed resources to an anomaly response already underway.  Effort is expended for the recruiter in:

  • Monitoring current capacity relative to changing demands and identifying additional resource requirements
  • Identifying the skills and experience required
  • Identifying who is available
  • Determining how to contact them
  • Contacting them / alerting them to the event
  • Waiting for a response
  • Adapting current work to accommodate new engagement (waiting, slowing down or speeding up, completing other tasks to aid coordination)
  • Preparing for engagement

    - Anticipating needs
    - Developing a situation assessment or status update
    - Giving access/permissions to tools & coordination channels
    - Generating shared artifacts (dashboards, screenshots)
    - Dealing with access issues (inability to join web conference or trouble establishing audio bridge)

A second contribution from this research was to identify how all participants in smooth coordinative activity incur coordination costs – these costs cannot be proceduralized away or assigned to a single role.  Building off the same example as above – recruitment, where the incurred coordination costs were shown for the recruiter – we see how participants being recruited to an event also incur additional effort.  Effort is expended for the recruited party in:

  • Being interrupted in their work
  • Assessing the request relative to their capabilities
  • Assessing the request relative to their capacity to act
  • Deferring or abandoning their own work
  • Acknowledging their orientation to the problem
  • Communicating about the deferral or abandonment to the parties they coordinate with
  • Gaining access to collaboration tools
  • Assessing available information
  • Clarifying (available data and expectations)
  • Requesting additional information
  • Forming questions about the state of the event or system
  • Determining interruptibility of the participants already in the event
  • Forming interjections
  • Interjecting
  • Determining roles or role reallocation within the existing group
  • Assessing work underway
  • Assessing implications of work underway
  • Considering their contributions relative to problem constraints
  • Assessing how their contributions may influence work underway

These costs are often incurred at points in time when they are least ‘affordable’ – during high tempo, highly demanding cognitive efforts – which can lead to degradations in the joint activities and coordination breakdowns.  This underscores the need to design for coordination to maintain seamless interactions and sustained adaptive capacity. This is particularly important in microservice architectures or siloed organizations where coordination ‘at the boundaries’ becomes increasingly problematic. If each organization or unit has differing strategies and orientations towards coordination, greater costs are incurred and there is less common ground to help resolve coordination breakdowns. Methods of coping with high costs of coordination in one coordinative unit can shift the cost to other units by deferring, delaying or dropping coordination effort.

Two findings have an important implication: 1) that costs of coordination are shared across joint service recovery activities and 2) that strategies for reducing costs can shift them across organizational boundaries. Together they show that generally accepted practices in software operations, such as the incident command structure (ICS), work very differently than domain models suggest. Hierarchical, role-based coordination structures can create workload bottlenecks that slow the response efforts and force responders to adopt strategies to reduce the costs of coordination.

Instead, a distributed model based on sharing coordinative functions across the parties – Adaptive Choreography – enables practitioners to cope with dynamic events and dynamic coordination demands. It also serves to identify when strategies for controlling coordination costs at the boundaries may be inadvertently (or intentionally) shifting that burden to other organizational units. In Adaptive Choreography, the elements of choreography are a shared, overlapping responsibility contingent on available cognitive resources while coping with problem demands. Participants in this form of joint activity incur a lower overall cost of coordination by maintaining an ongoing awareness of event progression, which enables anticipatory action for smoothly integrating cognitive and coordinative demands across the group.

As the speed, scale and criticality of digital infrastructure increase, well-designed coordination across both formal and ad hoc multi-party incident response is becoming increasingly relevant.  The patterns outlined as elements of choreography and their corresponding costs are analogous to those in other high hazard domains with distributed work systems. A renewed focus on controlling the cognitive costs of coordination can aid in managing the new forms of dependencies, increased complexity and coordinative demands inherent in CDI and other safety critical domains.

Dissertation available soon on OhioLink.