CoRecursive: Coding Stories

Episode Summary — Tech Talk: Incident Response with Emil Stolarsky

Host: Adam Gordon Bell
Guest: Emil Stolarsky, Production Engineer at Shopify
Date: January 5, 2018

Episode Overview

This episode delves into the practice and philosophy of incident response in complex software systems, drawing from Emil Stolarsky’s experience at Shopify and his research into emergency management, aerospace, and transportation industries. Emil argues that software engineers have much to learn from these fields, particularly regarding process rigor, the use of checklists, communication protocols, and moving beyond traditional, ad-hoc approaches. The conversation unpacks the evolution from “move fast and break things” to a discipline where stability and responsibility are paramount.

Key Discussion Points & Insights

1. What is Incident Response? ([01:14])

Definition:
- Incident response is about understanding and recovering from failures in both systems and organizations. It encompasses mitigating the fallout, organizing human responders, restoring normal operations, and carrying out retrospectives to learn from and prevent similar failures.
- Quote:
  “Incident response is a field where we look at how systems can fail, both organizational and systems we build, and how we can optimize recovering them back to their normal state and everything around that.” – Emil ([01:20])

2. Four Phases of Incident Response ([02:07])

Mitigation:
- Reducing risk and ensuring failures are as safe as possible. Examples from aviation and software: tracking critical parts (in planes) or using bulkheads and circuit breakers (in services).
Preparedness:
- Human readiness, e.g., on-call rotations, having clear roles for when things go wrong.
Response:
- The process of actually fixing what’s broken. E.g., switching to a backup system.
Recovery:
- Returning to normal operations and conducting a retrospective.
Aviation Analogy: Every airplane part is meticulously tracked. If software tracked function invocations and the “age” of code, we might catch and prioritize maintenance like airlines do ([05:53]).

Notable Quote:

“We need to be constantly tracking this to know, because we know that certain parts will fail under these conditions or will fail after these many uses.” – Emil ([08:17])
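
The circuit breakers mentioned under Mitigation can be illustrated with a minimal sketch. This is not Shopify's implementation; the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for illustration. The idea is that after repeated failures the breaker stops calling a broken dependency and fails fast, then allows a single trial call once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # max_failures and reset_after are illustrative defaults, not recommendations.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("dependency disabled; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the circuit and restart the cool-down.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each call to a flaky service in `breaker.call(...)` keeps one failing dependency from tying up every request that touches it, which is the "fail as safely as possible" goal of the mitigation phase.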

3. Risk Assessment and Mapping ([09:02])

Resiliency Matrices:
- Mapping system components against potential states of underlying services to assess what fails gracefully and what fails catastrophically.
Probabilistic Risk Assessment:
- Industries that run complex systems (chemical, aerospace) model dependencies and assign a failure probability to each. Emil suggests adopting similar visualizations in software to spot systemic vulnerabilities.
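
As a rough sketch of both ideas, a resiliency matrix can be written as a table of components against dependencies, and a very simple probabilistic estimate can be computed from per-dependency failure probabilities. The component names, outcomes, and probabilities below are hypothetical, and the independence assumption is a simplification that real risk assessments would not make lightly.

```python
from math import prod

# Hypothetical components (rows) and dependencies (columns).
# Each cell records how the component behaves when that dependency is down.
RESILIENCY_MATRIX = {
    "storefront": {"cache": "degraded", "database": "down", "search": "degraded"},
    "checkout":   {"cache": "ok",       "database": "down", "search": "ok"},
    "admin":      {"cache": "ok",       "database": "down", "search": "down"},
}


def catastrophic_dependencies(matrix):
    """Return, per dependency, the components that go fully down without it."""
    result = {}
    for component, row in matrix.items():
        for dependency, outcome in row.items():
            if outcome == "down":
                result.setdefault(dependency, []).append(component)
    return result


def failure_probability(dependency_failure_probs):
    """Probability that at least one hard dependency fails.

    Assumes failures are independent, which is a simplifying assumption.
    """
    return 1 - prod(1 - p for p in dependency_failure_probs)
```

Calling `catastrophic_dependencies(RESILIENCY_MATRIX)` lists which components have no graceful fallback for each dependency, which is where mitigation work would likely pay off first.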

4. Preparedness: The Incident Command System (ICS) ([11:45])

Origin:
- Developed after mismanaged wildfire responses in California, addressing confusion when multiple agencies were involved.
Key Principle:
- One person (the “incident commander”) maintains overall control, spreading responsibility in a structured way.
Shopify Example:
- Shopify uses the “IMOC” role (Incident Manager On Call) to formalize this, aided by chat bots that manage communication and checklists ([14:00]).
- Quote:
  “...with having a dedicated IMOC role... you can not only sort of clarify who’s going to be doing that role, but then you can also roll out appropriate training and give the best techniques.” – Emil ([16:36])

5. The Power of Checklists ([17:11])

Origin Story:
- A B-17 bomber crash led to the first aviation checklist, after it became evident that the aircraft's complexity was causing pilot error.
Adoption in Tech:
- Checklists are crucial for critical, repetitive, and easy-to-miss tasks (e.g., “lock deploys” during incidents).
- Quote:
  “The system was so complex that you couldn't remember every single component right when you needed it in your brain at all times.” – Emil ([18:34])
Checklists as Cognitive Offloading:
- Not a crutch, but a way to avoid basic mistakes and preserve mental energy for truly novel problems.
Example:
- If you forget to “lock deploys” during an incident, new code may be shipped and worsen the situation. Checklists catch this ([21:28]).
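
A checklist of this kind is simple enough to sketch in a few lines. The class and every step other than "lock deploys" are hypothetical; the episode does not describe how Shopify's chat bots store or track their checklists.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentChecklist:
    steps: list
    done: set = field(default_factory=set)

    def complete(self, step):
        # Reject typos so a step is never silently "completed" by mistake.
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self):
        # Preserve the original order so the most urgent steps stay on top.
        return [s for s in self.steps if s not in self.done]


# Hypothetical steps; only "lock deploys" comes from the episode.
checklist = IncidentChecklist(steps=[
    "lock deploys",
    "assign incident commander",
    "post status update",
])
checklist.complete("assign incident commander")
print(checklist.remaining())  # ['lock deploys', 'post status update']
```

A bot that posts `remaining()` into the incident channel makes the forgotten "lock deploys" step visible without anyone having to remember it.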

6. Communication: Crew Resource Management ([23:42])

Lesson from Aviation:
- United Airlines Flight 173 ran out of fuel because warnings from the crew went unacknowledged. This led to the development of “crew resource management” (CRM), which formalizes communication in high-stress situations.
CRM Principles:
- Explicitly state the issue, address a specific person, explain the evidence, and seek acknowledgment.
Key Insight for Tech:
- Explicit ownership and acknowledgment prevent issues from being ignored.
- Quote:
  “Forcing that acknowledgement is really powerful... Making a direct statement to somebody forces a conversation around it.” – Emil ([29:02])
Breaking Down Hierarchies:
- CRM popularized the idea of encouraging equal participation during a response, so that nervousness or deference to rank does not keep anyone from speaking up. The same applies to software teams ([29:29]).
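
The CRM principles listed above (state the issue, address a specific person, explain the evidence, seek acknowledgment) map naturally onto a small message structure. The sketch below is an illustration only; the `Callout` name, its fields, and the message format are assumptions, not something described in the episode.

```python
from dataclasses import dataclass


@dataclass
class Callout:
    addressee: str   # the specific person being addressed
    issue: str       # the explicit statement of the problem
    evidence: str    # what the sender observed
    acknowledged: bool = False

    def __post_init__(self):
        # A callout missing any of the three parts is not a valid CRM statement.
        for name in ("addressee", "issue", "evidence"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def render(self):
        return (
            f"@{self.addressee}: {self.issue}. "
            f"Evidence: {self.evidence}. Please acknowledge."
        )

    def acknowledge(self, responder):
        # Only the person addressed can close the loop.
        if responder != self.addressee:
            raise ValueError("only the addressee can acknowledge")
        self.acknowledged = True
```

Requiring an acknowledgment from the named addressee is the part that prevents a warning from being heard by everyone and owned by no one.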

7. Postmortems and Root Cause Analysis ([32:12])

Tech vs. Other Industries:
- Tech tends to focus less on “operator error” and more on systemic/process failures.
Bias in Analysis:
- Watch out for “attribution bias,” where individuals are blamed instead of the deeper flaws in the system or process being uncovered.
Tools & Approaches:
- “Five Whys,” “causal factor trees” (from NASA), and linear timelines all have limitations and inherent biases. Being aware of those biases and combining methods is key.
Quote:
"Retrospectives are trying to figure out what went wrong and controlling for bias. People as humans are biased for many different reasons..." – Emil ([33:00])

8. Sharing Incident Knowledge ([40:29])

Aviation Safety Reporting System Example:
- An industry-wide, anonymous database of accidents and near misses transformed how aviation learns and shares best practices.
Proposal for Software:
- Emil suggests software should have a similar cross-industry, anonymized repository of root cause analyses (RCAs) to rapidly spread useful lessons and avoid repeated failures.
- Quote:
  “A million times yes... only the lessons are staying silent... Imagine if there was a third-party database... where every company can submit their service disruption reports...” – Emil ([40:29])
Why Anonymous?
- Focus on the lesson, not blame. Anonymity protects individuals, fosters sharing, and shifts focus to systemic improvement ([44:29]).

9. The SRE Role and Operational Evolution ([45:17])

SRE (Site Reliability Engineering):
- Originated at Google, similar to “Production Engineer” at Shopify.
- Shifts responsibility and perspective: SREs build tools to empower developers, blurring the traditional dev-ops divide.
Scaling Efficiency:
- SREs aim for headcount that scales logarithmically rather than linearly with infrastructure; by automating repetitive ops work, fewer humans can manage more infrastructure.
Pets vs. Cattle Analogy:
- Manual, individualized server management (“pets”) vs. automated, homogeneous infrastructure (“cattle”).
- Quote:
  “With the SRE model... you want to automate away everything. You want to automate away all this toil. And so you'll treat your computers like cattle…” – Emil ([47:09])

10. Moving Beyond "Move Fast and Break Things" ([49:07])

Changing Stakes:
- Software’s outsized impact on society now makes cavalier attitudes toward failure dangerous and irresponsible.
- Quote:
  “When our services fail... the consequences can be terrifying. People can't travel, banking grinds to a halt, our 911 response services can't work anymore...”
Call to Action:
- Borrow more rigor from industries with higher reliability demands to ensure societal stability and safety.

Memorable Quotes

“It's almost malice not to start using a checklist, surprisingly.” – Emil ([20:07])
"Making a direct statement to somebody forces a conversation around it." – Emil ([29:02])
"Trust in the people in your organization is very important. And figuring out the systems around them and ensuring that they have the right tools to not cause issues is where postmortems should focus on." – Emil ([37:13])
“Imagine if there was a third party database... where every company can submit their service disruption reports, their retrospectives, talk about the lessons they've learned, and anybody else can go and read about them.” – Emil ([41:44])
“We can't just let other people pay the price of our systems breaking.” – Emil ([50:36])

Timestamps for Important Segments

Incident response overview – [01:14]
Emergency management’s four phases – [02:07]
Aviation part tracking and codebase analogy – [05:53]
Resiliency matrices and risk mapping – [09:02]
Incident Command System origins and Shopify IMOC – [11:45]
Checklist power and aviation accident – [17:11]
Pilot communication and CRM – [23:42]
Root cause analysis and bias – [32:12]
The proposal for shared, anonymous RCAs – [40:29]
SRE role and 'pets vs cattle' – [45:17]
Critique of 'move fast and break things' – [49:07]

Tone & Closing

The episode maintains a thoughtful, reflective, and practical tone throughout, with Emil drawing clear, compelling analogies and advocating for humility, rigor, and cross-discipline learning in software engineering.

For developers, engineering leads, or anyone involved in systems reliability, this episode offers a wealth of actionable insights, timely warnings, and persuasive arguments for taking incident response much more seriously as software becomes ever more integral to daily life.