
Tech Talks are in-depth technical discussions. As a system becomes more complex, the chance of failure increases. At a large enough scale, failures are inevitable. Incident response is the practice of preparing for and effectively recovering from these...
A
Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host. All complex systems eventually fail, and software is becoming more and more complex. Incident response is preparing for and effectively recovering from these failures. Emil Stolarsky is a production engineer at Shopify, where his role shares many similarities with Google's Site Reliability Engineers, also known as SREs. In this interview, Emil argues that the academic study of emergency management and industries such as aerospace and transportation have a lot to teach software engineers about responding to problems. You'll hear Emil argue that we need to move beyond tribal knowledge and incorporate practices such as an incident command system and rigorous use of checklists, and that we need to move beyond a move fast and break things mindset. I think you'll enjoy this interview.
B
Hi, Adam. Thanks for having me.
A
So you've given some talks about incident response. What is incident response?
B
Incident response is a field where we look at how systems can fail, both organizational systems and the systems we build, and how we can optimize recovering them back to their normal state, and everything around that. So that's mitigating system failure and figuring out how to organize the human response component. That's bringing the system back to running, and then doing a retrospective: looking back at the system, seeing what lessons we can learn from it failing, and making sure that it doesn't fail the same way in the future, or, if it does, that we can minimize the impact it has.
A
So is incident response made up of several pieces or steps?
B
Yeah. When you start looking into this field, to maybe my naive surprise, I discovered there's this whole body of work, with institutes that study it, and the term used there is emergency management. There it's broken down into four components: mitigation, preparedness, response, and recovery. Mitigation is: systems will fail, things will break, so how do we reduce the risks and make sure we have a safe failure? An example is a construction site, where you might mark off zones under the crane where people can't walk while the crane is operating, because if the crane breaks for whatever reason, something can drop in that area. For us in software, that might be something like having bulkheading or circuit breakers for when, say, a remote service is not working. Preparedness is, I think, the human component. The analogy I always think of in tech is on-call. You don't ever want your system to break, but you assume it will, and figuring out the organization, like who is going to be the person who gets alerted when it breaks, would be preparedness. Response is actually fixing it: the service is broken and you need to bring it back up. And then recovery is getting back to business as usual, and going back and doing a retro on it. If we go back to the tech analogy, say you have a highly available database and your primary goes down while your secondary is up. Switching over to the secondary is the response. The recovery is standing up a new secondary so that the database that's now running is highly available again.
A
So you had mentioned bulkheading. What is bulkheading?
B
So bulkheading is for when you have, say, a slow external service and a bunch of app servers calling out to it. The idea is that you don't want all your app servers to be waiting on a response from it, so you allow only a certain percentage of the connections in your app fleet to be connected to that one service. If the number of connections goes over that limit, you assume that the latency in that service is too high, so you fail fast: you just return a failure, and the caller has to go into its own degraded state. It comes from ships, where you have bulkheads so that if water is leaking into a particular portion of the ship, you can isolate the damage. You can handle, say, a quarter of your ship being underwater, or a quarter of your app servers being tied up in one service, but you can't handle all of them being tied up, because then you have no more app capacity.
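The bulkhead described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Shopify's implementation: the `Bulkhead` class, the `BulkheadFull` exception, and the idea of capping by a fixed slot count per process are all hypothetical choices made for the example.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the bulkhead has no free slots, so the caller fails fast."""


class Bulkhead:
    """Caps how many callers may wait on one dependency at the same time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: if every slot is taken, assume the dependency
        # is too slow and fail immediately instead of tying up another worker.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("too many in-flight calls to this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            # Always free the slot, even if the wrapped call raised.
            self._slots.release()
```

A caller would wrap each request to the slow service in `bulkhead.call(...)` and treat `BulkheadFull` as its cue to serve a degraded response.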
A
And so it's mitigation because you're mitigating a further disaster of some sort.
B
Right. You're lowering the risk of failure. You're identifying your risks and asking how you can lower the impact if a risk manifests and you have an incident or issue with it. That'd be mitigation. Sorry, go ahead.
A
And mitigation is preventing problems. You had an anecdote about how airlines or airline manufacturers tackle mitigation by tracking parts. Could you share that?
B
Right. So in my Strange Loop talk, I brought up this idea of how every single part in an airplane is tracked meticulously. It's tracked when it was first manufactured, when it was put into the aircraft, when it was last serviced, how many flights it's been on, and what its operational history has been. And then after a certain amount of time, mechanics have to go and inspect the part and decide whether it can continue flying or needs to be fixed, replaced, et cetera. And I was thinking, what would this look like in a codebase? Often when we design systems, we'll build out the system, we'll think a lot about how the system operates, and then we'll ship it. But we don't track usage of, say, individual function calls. So imagine that every time you deployed code, we started tracking how many calls there were to every single function. And you said, after a billion calls to this one function, you'll open an issue to go and examine it: hey, did we design this properly? Is this becoming technical debt? Should we just leave it as is? Should we remove it and move to a different model? Should we refactor it? Of course, if you have a large enough codebase, that's unrealistic. But that idea of tracking the age of code is interesting, I think.
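As a thought experiment, the "inspect after N calls" idea could look something like the sketch below. Everything here is hypothetical: the `track_calls` decorator, the in-memory `call_counts` and `review_queue`, and the tiny threshold are stand-ins for what would realistically be a metrics pipeline and an issue tracker.

```python
import functools
from collections import Counter

call_counts = Counter()   # function name -> number of calls seen so far
review_queue = []         # names of functions that are due for inspection


def track_calls(review_after: int):
    """Count calls to a function and flag it for review once at a threshold."""

    def decorator(func):
        name = func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            call_counts[name] += 1
            # Flag exactly once, on the call that reaches the threshold.
            if call_counts[name] == review_after:
                review_queue.append(name)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@track_calls(review_after=3)  # a real threshold would be far larger
def add(a: int, b: int) -> int:
    return a + b
```

After the third call to `add`, its name lands in `review_queue`, which is the point where a real system might open an issue asking whether the function still deserves its current design.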
A
I like this idea. It's interesting. So if you're tracking it by the number of calls that are happening to it in production, I guess that's going to lead to refactoring kind of the code paths that get called the most. I could think of some alternate ways, like where you could say if the code that this calls or is calling this changes, then maybe it should be reviewed.
B
Also, I think age of code would be very interesting if you had a heat map of least recently touched code, which wouldn't be too hard to generate by looking at the git commits of, say, a codebase. It'd be interesting to see what the oldest and the newest parts of the codebase are. Maybe the oldest parts work the best and you don't need to touch them. But looking at code from that perspective could be very interesting, rather than just shipping and then going back to fix it when it's become this big issue of technical debt.
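The raw data for such a heat map can be pulled from git directly. The sketch below is one possible approach, assuming `git` is on the PATH and the code runs inside a repository; the function names are made up for the example. It separates the git lookup from the ranking so the ranking can be checked without a repository.

```python
import subprocess


def last_commit_timestamp(path: str, repo_dir: str = ".") -> int:
    """Unix time of the most recent commit touching `path` (0 if none)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) if out else 0


def rank_by_staleness(timestamps: dict, now: float) -> list:
    """Return (path, age_in_days) pairs, least recently touched first."""
    ages = [(path, (now - ts) / 86400) for path, ts in timestamps.items()]
    return sorted(ages, key=lambda pair: pair[1], reverse=True)
```

Feeding `rank_by_staleness` a dictionary built with `last_commit_timestamp` for each tracked file, plus `time.time()` as `now`, gives the ordering a heat map would color by.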
A
Like, look, this code hasn't been looked at in a while, it should be reviewed.
B
The practice of meticulously tracking parts in the airline industry came from a series of accidents that were due to part failure. In retrospect, it's sort of, hey, we need to be constantly tracking this, because we know that certain parts will fail under these conditions or will fail after this many uses. I haven't seen an analogy like that that we're currently using in tech, though it could be an interesting approach, and it could lead to different perspectives on how we prioritize maintenance work on our own codebases.
A
Should we be tracking the probability of failure across our software components?
B
So one thing we do at Shopify today is we use resiliency matrices extensively. A resiliency matrix is a grid. On your Y axis, you'll have every component in an application, and on your X axis, you'll have the different services across the entire application architecture. There'll be three entries for every service: the first is healthy, the second is degraded, and the third is down, a complete outage. Then you can track the state of each component in the application to see how it will react in that condition. If the database is partially down, can the application still serve something, or is it just going to return a complete 500? If Elasticsearch is down, search might be down, but you can still complete a checkout. And we've been trying to figure out what the different failure scenarios are. In more complex systems, say in aerospace and airlines, or in complex industrial processes like chemical plants, they'll build diagrams out for every single individual component, and then they'll attach different probabilities or risk factors to that component failing. They'll also map out in these trees the relationships and dependencies between different components. So maybe a cargo door will fail only if two particular bolts both fail, or maybe it'll fail if just one of those bolts fails. And then the bolts have their own dependency trees of what will cause them to fail. Building out these maps of how our systems can fail, and what the chance is of them failing in different scenarios, is powerful. It allows us to realize, hey, maybe a whole large chunk of our application is dependent on this one component.
And then that tells us, oh, we have a very high risk in this component, and it'd be best for us to direct our effort to mitigating that one component and making sure it's a lot more resilient to failure.
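A resiliency matrix like the one described can be held as plain data and queried. The sketch below is illustrative only: the component names, dependency names, cell values, and the choice to treat unlisted cells as unaffected are all assumptions made for the example, not Shopify's actual matrix.

```python
# Expected behaviour of each application component (rows) for each
# dependency state (columns). All names and entries are hypothetical.
RESILIENCY_MATRIX = {
    "checkout": {
        ("database", "degraded"): "degraded",
        ("database", "down"): "unavailable",
        ("elasticsearch", "down"): "available",
    },
    "search": {
        ("elasticsearch", "degraded"): "degraded",
        ("elasticsearch", "down"): "unavailable",
    },
}


def expected_behaviour(component: str, dependency: str, state: str) -> str:
    """Look up one cell; unlisted cells are assumed to be unaffected."""
    if state == "healthy":
        return "available"
    return RESILIENCY_MATRIX[component].get((dependency, state), "available")


def blast_radius(dependency: str, state: str = "down") -> list:
    """Components that stop being fully available in the given scenario."""
    return sorted(
        component for component in RESILIENCY_MATRIX
        if expected_behaviour(component, dependency, state) != "available"
    )
```

A dependency whose `blast_radius` covers most of the rows is the kind of linchpin the matrix is meant to expose. In practice the "unlisted means unaffected" default is risky, and an unverified cell is better marked as unknown.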
A
So it helps you kind of pinpoint, I guess, like, linchpins, or...
B
Exactly.
A
Elements that could have cascading failure stories, I guess.
B
Yeah. So in the talk I talked about probabilistic risk assessment, which is the overarching topic: what are the different ways you can look at a system and figure out the chances of it failing in different scenarios, how the different components rely on each other, et cetera.
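The cargo door and bolts example maps onto a small fault tree. The sketch below assumes the basic events are statistically independent, which real assessments cannot take for granted, and uses made-up probabilities; the `Event` and `Gate` classes are hypothetical names for the illustration.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Event:
    """A basic failure event with its own probability."""
    name: str
    probability: float

    def failure_probability(self) -> float:
        return self.probability


@dataclass
class Gate:
    """AND: fails only if all children fail. OR: fails if any child fails."""
    name: str
    kind: str  # "AND" or "OR"
    children: list = field(default_factory=list)

    def failure_probability(self) -> float:
        probs = [child.failure_probability() for child in self.children]
        if self.kind == "AND":
            return prod(probs)
        if self.kind == "OR":
            # Complement of "no child fails", assuming independence.
            return 1 - prod(1 - p for p in probs)
        raise ValueError(f"unknown gate kind: {self.kind}")


# Hypothetical numbers: each bolt fails with probability 0.01.
bolts = [Event("bolt A", 0.01), Event("bolt B", 0.01)]
door_needs_both = Gate("cargo door", "AND", bolts)
door_needs_either = Gate("cargo door", "OR", bolts)
```

With these numbers, the door that fails only when both bolts fail is roughly two hundred times less likely to fail than the door that fails when either bolt does, which is the kind of difference the trees are built to surface.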
A
What was preparedness in emergency response?
B
So preparedness is how responders prepare, for lack of a better word, for an incident or failure. In my talk I talked about the Incident Command System. The Incident Command System was developed after a series of forest fires in Southern California where the response was mismanaged. What ended up happening is the LA City Fire Department and the LA County Fire Department competed, and there was miscommunication and almost chaos in how they responded to the fire. After the fire was extinguished, people went back and looked at it, and they saw that these two organizations not communicating with each other substantially exacerbated the size of the fire and its impact. So they went off and developed a system for organizing the response to fires, and it was called the Incident Command System. The idea behind it is that during incidents, outages, and failures, you'll have one person who's in charge of responding to the incident, and they'll have complete or almost complete authority over how to respond to it. They'll delegate, and they'll say something like, a portion of people need to go respond to this. Having that structure laid out beforehand was very valuable. An interesting analogy for an incident commander would be a composer in an orchestra. The composer can't play every instrument in the orchestra as well as the individual musicians can, yet without the composer, the final piece won't be as great. So it's this idea that somebody organizing the response is important, and makes the whole much greater than the sum of its individual parts.
A
So the person who is the incident commander, what are they in charge of?
B
So I can give you an example of how this works at Shopify. At Shopify, we have a dedicated IMOC role, an incident responder role, and that's an on-call rotation of production engineers. Production engineer is what we call the SRE role at Shopify. So in addition to their normal on-call, they'll sometimes go onto this other on-call instead. What will happen is, if an incident is severe enough, the IMOC will come in and be the incident commander. That means their job is to make sure that all the on-call personnel who are necessary to mitigate the issue or respond to it are present. They'll facilitate that: if the on-calls need to get somebody else to come online, the IMOC will go and make sure that happens. The IMOC will be the person in charge of tracking the incident, so if a new responder comes online, they'll update them on the situation. They'll also be the one who communicates with stakeholders. In the past, even if an incident was severe enough, it was nobody's job specifically to go and update the status page, or write the status page message, or, say, inform leadership. The IMOC is the formalization of that. And it's interesting: when we first rolled this out, you sort of realize that this role already exists implicitly. If you think back to times when there was an outage or a severe enough incident, there were one or two people managing the process, but they were never elected. They never came in and said, I'm the one who's responding. It just sort of naturally happened. With a dedicated IMOC role, or a dedicated incident commander role, you can not only clarify who's going to be doing that job, but you can also roll out appropriate training and give them the best techniques.
So one thing we have is an IMOC bot, which is a ChatOps bot that's integrated into our main ChatOps tool and helps coordinate the actual incident. During incidents we can add notes that will later show up in our RCA docs. The IMOC bot will also send one-on-one Slack messages to certain people with checklists. So for example, it'll say: make sure to update the status page. It'll say: make sure to lock deploys. Does this look like a broken code issue? Can you roll back? It's been three hours, do you need to swap out your IMOC role right now with somebody else? All this formalization is very powerful. Putting a term on it, putting the idea on it, and then having dedicated people focus on that role has tremendously helped us reduce our time to recovery.
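The reminder behaviour described for the IMOC bot can be approximated with a timed checklist. This is a guess at the shape of such a tool, not the real bot: the `Reminder` type, the schedule, and `due_reminders` are hypothetical, and the messages only paraphrase the prompts mentioned in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reminder:
    after_minutes: int  # minutes since the incident was opened
    message: str


# Hypothetical schedule; the timings are invented for the example.
REMINDERS = [
    Reminder(0, "Lock deploys."),
    Reminder(0, "Update the status page."),
    Reminder(5, "Does this look like a broken code issue? Can you roll back?"),
    Reminder(180, "It's been three hours. Do you need to hand off the IMOC role?"),
]


def due_reminders(elapsed_minutes: int, already_sent: set) -> list:
    """Return reminders that are now due and have not been sent yet."""
    due = [
        r for r in REMINDERS
        if r.after_minutes <= elapsed_minutes and r not in already_sent
    ]
    # Record what was returned so each reminder is delivered only once.
    already_sent.update(due)
    return due
```

A bot loop would call `due_reminders` periodically with the incident's elapsed time and forward whatever comes back as direct messages.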
A
And the IMOC's job is not to actually address the issue, but to coordinate the addressing of the issue, right?
B
Exactly.
A
Another thing you kind of led into there was checklists. Could you expand on checklists?
B
So in my research around emergency management and incidents, I was reading about airlines and the power of checklists, and I happened upon the story of the B-17 and the origin of checklists in aviation. The story goes: in the 30s, the US Army Air Force was trying to procure new bombers, and all the major aircraft manufacturers had developed their own prototypes for this competition. Boeing had developed the B-17, which had all these amazing capabilities. It was more resilient to damage, it could fly farther than any of the competing prototypes, and it could carry a lot more weight, so people were really excited about it. But with all those added features, it was also a much more complex plane. They had brought all these prototypes out to a test airfield out in Seattle, and on the second test flight for the B-17, just after takeoff, the airplane crashed. During the investigation, they realized it was pilot error. There's a particular valve that has to be open just before takeoff and during takeoff, but immediately after it has to be closed, and the pilots had forgotten to close it. In the ensuing investigation, a bunch of the test pilots for the Army went off and started thinking, what do we do? Because these weren't novice pilots. One of the pilots was the chief test pilot for the Army at the time. And when they came back, they didn't roll out additional training. Instead, they came back with this idea of a checklist, and this was the first checklist in aviation. The checklist was quite basic. It had, say, three or four steps each for before engine start, before takeoff, during takeoff, after takeoff, et cetera.
And the reason they rolled it out was that they realized the system was so complex that you couldn't keep every single component in your head at all times, right when you needed it. So they put the most important steps down on this list that pilots could follow. And this took the whole industry by storm. Now, if you think of a profession that uses checklists, pilots are immediately going to jump into your head. In a cockpit, checklists are built into the dashboard with a computer, and they also have binders full of them. Whenever there's an incident or an issue on board an aircraft, the first thing a pilot will do is take out the checklist and start going through the steps. More generally, other industries have also begun to adopt this: the military, medicine. And when you look at the before and after of mistakes, or failures, or mean time to recovery with checklists, everything gets substantially better. It's almost malice not to start using a checklist, surprisingly. It was surprising to me because, personally, I always thought of a checklist as something that took the thinking out of responding to an issue. If a human is responding to a critical issue, why would I, as a critical thinker, need to use a checklist? But the reality turns out to be that humans, while we might be good at solving these complex problems, will often forget the basics, or we'll forget something that's easily overlooked but really important to the recovery. The analogy I used in my talk was this idea of automating thinking. In tech and programming, we automate everything that's manual and repetitive all the time, because why would we redo it? In incident response, with checklists, you're doing that, but for your brain, almost.
So going back to the example of the IMOC bot: if you have an incident in production, lock deploys. Don't let new changes go out unless they're related to the response. While it seems like a very obvious thing to do, there are going to be situations where that thought skips your brain and you forget to lock deploys, and somebody deploys new code and it exacerbates the issue. Checklists let us go: okay, there's an incident in production, what do we do? First, lock deploys. Second, go down the debug checklist. Then you can have a second debugging checklist to start figuring out: how is the database looking, how is the network edge looking, what's our app server capacity, and so on. And then when you actually see that there's an issue in a particular component, that's where you want to spend all your thinking and time, and focus all your energy on that complex problem and figuring it out, because checklists can't always help with those types of problems.
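A debug checklist of the kind described (database, network edge, app server capacity) can be written as an ordered list of checks. The sketch below is only an illustration: `run_debug_checklist` and the check labels are hypothetical, and the lambdas stand in for real health probes.

```python
def run_debug_checklist(checks):
    """Run every check in order and return the labels that look unhealthy.

    `checks` is a list of (label, callable) pairs; each callable returns
    True when its component looks healthy.
    """
    unhealthy = []
    for label, is_healthy in checks:
        try:
            ok = is_healthy()
        except Exception:
            ok = False  # a check that cannot run is treated as a failure
        if not ok:
            unhealthy.append(label)
    return unhealthy


# Placeholder probes; real ones would query monitoring or the service itself.
example_checks = [
    ("database", lambda: True),
    ("network edge", lambda: False),
    ("app server capacity", lambda: True),
]
```

Running the example list points at "network edge", which is where the responders' attention would then go; the checklist handles the routine sweep so the thinking can be saved for the component that is actually broken.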
A
So, maybe this is just a very small detail, but why do we want to lock deploys when an incident happens?
B
Oh, so we follow that process of locking deploys just because failure comes from change. Systems break when something changes, because before they break they're in a stable state. So the idea is that if you lock deploys, you won't be introducing new changes that could affect the response.
A
But there's already an incident taking place, I guess. So aren't you already in a non stable state?
B
Right, but you don't want to be introducing more changes during that. You want to figure out what change brought you to the point of incident and then mitigate that. Fix it or remove it.
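The deploy lock rule (no new changes during an incident unless they are part of the response) can be expressed as a small gate. This is a sketch under assumptions: `DeployGate`, `DeployLocked`, and the idea of tagging a change with the incident it fixes are hypothetical, not a description of Shopify's deploy tooling.

```python
from typing import Optional


class DeployLocked(Exception):
    """Raised when a deploy is attempted while deploys are locked."""


class DeployGate:
    def __init__(self) -> None:
        self.active_incident: Optional[str] = None  # set while locked

    def lock(self, incident_id: str) -> None:
        self.active_incident = incident_id

    def unlock(self) -> None:
        self.active_incident = None

    def check(self, change_id: str, fixes_incident: Optional[str] = None) -> str:
        """Allow a deploy unless locked; during a lock, allow only fixes."""
        if self.active_incident and fixes_incident != self.active_incident:
            raise DeployLocked(
                f"{change_id} blocked: deploys locked for {self.active_incident}"
            )
        return f"{change_id} allowed"
```

The exception for changes tagged with the active incident keeps the lock from blocking the rollback or fix that the responders themselves need to ship.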
A
Makes sense. So checklists are another example of an area where other fields have things to teach us about how to respond to outages. What can pilot communication teach us about responding to incidents?
B
So another really interesting thing that came out of my research into other industries was crew resource management. There was a story that I happened upon about United Airlines Flight 173. It was a flight from JFK to Portland. On the approach to Portland, as they were lowering the landing gear, the pilots heard a thump and the gear-down light didn't turn on, so they weren't sure if their gear was actually down. They aborted their landing and started circling around the airport to try to debug the issue and figure out what had happened. They did that for about an hour, until they decided to start the approach to land, and then all their engines began to burn out. They lost power and the airplane crashed just before the runway. It turned out that the airplane had run out of fuel. And when investigators went and listened to the flight recorders, they heard how both the first officer and the flight engineer had warned the captain that they were running out of fuel, but the captain didn't respond or acknowledge it. They had assumed that he had heard them, but had just chosen not to do anything about it and didn't deem it a problem. Around this time there was a series of incidents where a breakdown in communication was one of the core reasons the accident had happened. Maybe there was a miscommunication between the air traffic controller and the pilots, or between the tower and the pilots, or between pilots within a cockpit, where one pilot might have identified an issue and brought it up to somebody else, but nobody acknowledged it, or they assumed that the person who brought up the issue would fix it, whatever it might be. And this had led to almost countless accidents.
And what happened is the FAA, which is the agency in America that regulates aircraft and aviation, in concert with NASA, went and did a workshop to try to figure out how to rectify these issues. NASA came out with this idea of crew resource management. Crew resource management is a formalization of best practices for communicating in high-stress situations where time is of the essence. Some of the ideas are very basic. It'll be: clearly indicate who you're talking to, specify the issue you're noticing, specify how you noticed the issue (so maybe the gauge is broken), mention how you plan to resolve the issue, and wait for acknowledgement from the person you're talking to. Even as I'm reciting these, it seems so basic, so obvious. Like, of course I'm going to say, hey, Captain, I'm seeing this problem, and this is why I'm seeing it. There's nothing that's going to blow your mind. But what the airline industry noticed is that when they formalize these ideas and train them, almost making it second nature for any pilot to use these techniques, you can't debate the results. It works.
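The five elements listed above (addressee, issue, evidence, plan, acknowledgement) translate naturally into a structured message for an incident channel. The sketch below is one hypothetical way to model it; the `Callout` type and `unacknowledged` helper are invented for the example and are not part of any crew resource management standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Callout:
    addressee: str   # who you are talking to
    issue: str       # what you are noticing
    evidence: str    # how you noticed it
    plan: str        # how you propose to resolve it
    acknowledged_by: Optional[str] = None

    def acknowledge(self, responder: str) -> None:
        # Only the person addressed can close the loop.
        if responder != self.addressee:
            raise ValueError(f"waiting on {self.addressee}, not {responder}")
        self.acknowledged_by = responder


def unacknowledged(callouts: List[Callout]) -> List[Callout]:
    """Callouts still waiting for a reply; these should be repeated."""
    return [c for c in callouts if c.acknowledged_by is None]
```

Keeping the list of unacknowledged callouts visible is the point: a warning that nobody has explicitly picked up stays on the board instead of getting lost in the noise.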
A
When you.
B
In my research, I was listening to talks from pilots about near misses or accidents they were involved in. In every single one of them, the pilot will talk about how they used crew resource management to communicate more effectively during an incident. They might tell their copilot to look at an issue, or they'll make sure that if one person is debugging, the other one is flying the plane. And having that really, really helps. As I was reading through this stuff, I was thinking about past incidents I've been involved in, where say there's a major outage and a bunch of people come in and start helping. There'll be three or four people who'll say, oh, I think this is broken, and they're four separate things. Then one of those gets ignored or gets lost in the noise. An hour later, people circle back around and realize that one of those was really the issue. That was what needed to be fixed at the very beginning. And you go, okay, why did we miss it? It's because we don't have the same structure. When you go and look at emergency management in other industries, sometimes a lot of it is process. Like, I was going through the NTSB... Sorry, go ahead.
A
Sorry to interrupt. Is it the forced acknowledgment? I could see how that would be valuable. I've actually seen this happen, where I think this would help in terms of mitigation: somebody mentions offhand, hey, the master database hard drive is almost full, and everybody just moves on. Is that kind of the piece, that nobody acknowledged, like, oh yeah, that's something important?
B
Yeah. And forcing that acknowledgement is really powerful. I've also often seen somebody just make a statement that's not directed at anyone, so nobody takes ownership of it in that moment. Making the statement directly to somebody can force a conversation around it.
A
So how does that influence how you guys do things at Shopify?
B
So one of the things we're looking at is modifying our on-call training to talk about these ideas: what are the best ways to communicate and point out issues you're seeing. In the United Airlines Flight 173 case, the captain not acknowledging came at a time when the captain was, above all, in charge of the aircraft, and you couldn't challenge them. With crew resource management, the idea is that there might be hierarchy in terms of managing the incident response, but you want to get rid of the human fallibility in social interaction, where you might be a little nervous to say something because somebody is your superior or whatnot. It's: get rid of these ideas. You're trying to fix the problem. You're all equals. How can we best do this? And I was mentioning how this feels a bit like process, and obvious. In the emergency management industry, a lot of stuff sometimes is just process. Like, I was reading NTSB investigation manuals. The NTSB is the National Transportation Safety Board. It's the agency in America that is responsible for figuring out what happened in any transportation-related accident, and then, once they figure out what happened, deciding whether they need to issue new regulations, or a bulletin, or advice on how to avoid deaths in the future. It's an agency of 500 investigators whose job is to investigate and figure out the root cause of accidents. And when you're reading the manual, since it's a government-issued manual, the first few pages are talking about expensing items, and how when you expense an item you should make sure to keep the receipt. And you're reading this like, okay, this is really obvious.
But other stuff, like calling things out to people, or trying to break down the formality in social settings between people, is actually very beneficial. I think one of the reasons the tech industry has been looking more and more into this is that you have to look for these golden nuggets, that needle in the haystack. You have to work through sometimes really thick manuals, but then a few of the pieces in there will be very valuable.
A
Yeah, and it sounds like you've extracted some great nuggets with checklists, crew resource management. Have you learned anything about like root cause analysis from this world?
B
It's interesting how, in some regards, the tech industry is actually better at root causes, at doing postmortems and retrospectives, than other industries. Other industries have very much an operator-error-focused mentality. They'll try to figure out who messed up, and then they'll just fire them. Whereas I find in the tech industry we've been a lot better at asking: what happened, and how do we make sure this doesn't happen again? And it's important. Dave Zwieback talks about this. He is a director of engineering at, I believe, Pandora, and he has a book on postmortems. It's a very interesting book. It tells the story of a technology group in a bank, and how they had an incident and somebody was fired for breaking the system, and how they think through and discuss whether they should have fired the operator or should have fixed the root cause, or whatever it might be. And he talks about how what retrospectives and postmortems really are is trying to figure out what went wrong while controlling for bias. People, as humans, are biased for many different reasons, and the only way we can fight those biases and do effective analysis of what went wrong is by having other people point them out. One bias you might have is attribution error, or attribution bias, where you'll attribute the root cause of an incident to a single person.
A
How does a bias affect generating a root cause analysis?
B
So an example would be if you think of an incident and you build a linear timeline. This idea of a linear timeline is also partially broken, but suppose for now we have one. There could be a point where an operator (a programmer, operations engineer, or production engineer) makes a change and the system breaks. When you look at it in that context, it looks like that person made that decision. They made a decision to ship broken code. They made the decision to delete the wrong database. I think back to the outage that GitLab had a while back, where an operator had logged into the wrong database machine to do maintenance and ended up deleting the wrong database. In that postmortem, when it's written that so-and-so logged in and deleted the primary database, it looks like they had logged in, they knew they were on the primary database, and they decided to delete it. But that's not the case, right? People think they're making the right decision up until the moment they make the mistake. So we have all these different biases when we look back, and it's important to try to build tooling and use different processes to control for them, and to lower the chance of those biases coming into our decisions going forward. Because postmortems and retrospectives are super valuable. You don't want to repeat the same mistake. And if you can figure out the core reason or root cause that's causing multiple other issues throughout your system, that's invaluable. But it's important we get there the right way.
A
So you have to make them not about finding who is at fault, but more about process, or how you would change processes, right? Is that the idea?
B
So, for instance, at Shopify we expose a lot of tooling around, say, flushing our caching system. I don't believe we've ever had this issue, but suppose somebody accidentally flushes the caches without intending to. One approach in the retrospective could be: why on earth did you flush the caching system? You caused issues. But another approach could be: why was it so easy for you to flush the caching system without intending to?
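To make the second question concrete, here is a minimal sketch of a destructive operation that is hard to trigger by accident. It is not Shopify's tooling. The names `flush_cache` and `ConfirmationError`, and the rule that the caller must repeat the cluster name, are assumptions made up for illustration.

```python
class ConfirmationError(Exception):
    """Raised when a destructive action is not explicitly confirmed."""


def flush_cache(cache: dict, cluster: str, confirm: str) -> int:
    """Empty `cache` only if `confirm` repeats the cluster name exactly.

    Hypothetical guard: the operator has to restate which cluster they
    mean, so a stray keystroke or a wrong terminal cannot flush it.
    Returns the number of entries removed.
    """
    if confirm != cluster:
        raise ConfirmationError(
            f"refusing to flush {cluster!r}: confirmation {confirm!r} does not match"
        )
    removed = len(cache)
    cache.clear()
    return removed
```

The design choice is that the safe path is the default: a call with a missing or mistyped confirmation fails loudly and leaves the cache untouched.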
A
Why?
B
Why could somebody make the mistake of being on the wrong database node? Why was it so easy to ship broken code? And maybe the conclusion is that the cost of having perfect code, quote unquote, or a test suite thorough enough that the chance of that particular bug is super low, is too high. Maybe it takes too much time to run the test suite. And so, given the trade-offs, we say: okay, this is a risk we've decided to take. But having trust in the people in your organization is very important. Figuring out the systems around them, and helping ensure they have the right tools to not cause issues, is what postmortems should focus on.
A
I've seen that handled with the five whys. Like, Dave deleted the database, but why? And then you kind of... Is that a useful way to get to these root causes, or...?
B
So that's one of the ways. There are many ways out there, and I'm still going out and trying to categorize them, and each one has its own bias. One interesting one that I really like the idea of is a causal factor tree. This is one used by NASA, where they'll build out a tree of all the different components involved, the different events, and how different components failed. Each of those events or components will then have nodes under it that talk about how it failed, how it got into that state, and what its history is. I like it because the reality is that an incident is very often multiple things going on in parallel, and each thing has its own independent timeline. In a causal factor tree you can lay all that out. But another thing that happens with a causal factor tree is that very often, when you go far enough back, you'll hit an organizational component, and that'll be at the bottom of the tree. An example might be that there weren't enough engineers to fix the technical debt. So it can have a bias in that sense. You're never going to find an approach to a postmortem without any biases. What you should be aiming to do is look at the different tools you can use and the biases that come with them, and then make sure you don't succumb entirely to those biases, that you're aware of them, and that you're accounting for them in your decisions and your conclusions.
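As a rough illustration of the tree described above, the sketch below models each event or component as a node whose children are the factors that led to it. This is not NASA's actual method or format. The `Factor` class, the `leaf_factors` helper, and the example incident are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Factor:
    """One event or condition, plus the factors that contributed to it."""
    description: str
    causes: list["Factor"] = field(default_factory=list)


def leaf_factors(node: Factor) -> list[str]:
    """Return the deepest factors: nodes with no recorded causes of their own."""
    if not node.causes:
        return [node.description]
    leaves: list[str] = []
    for cause in node.causes:
        leaves.extend(leaf_factors(cause))
    return leaves


# Hypothetical incident with two parallel branches.
incident = Factor("Primary database deleted", [
    Factor("Maintenance command ran on the wrong host", [
        Factor("Primary and replica shells look identical"),
    ]),
    Factor("Restore took hours", [
        Factor("Restore procedure was never rehearsed", [
            Factor("Not enough engineers to pay down technical debt"),
        ]),
    ]),
])
```

Calling `leaf_factors(incident)` lists the bottom of each branch. In this made-up example one of them is organizational, which is the bias mentioned above: trees built this way tend to bottom out in staffing or priorities.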
A
So once you've generated your RCA and tried different methods to eliminate biases from it, is there an end result, or what's the goal? Actually, let me rephrase this whole question. Should we as an industry be tracking RCAs in some sort of global, cross-industry manner?
B
Okay, yes, a million times yes. If the rest of this podcast would just be me shouting yes, then maybe we should do that. Another story I came across in my research, and I find it very exciting, is the Aviation Safety Reporting System. The director of the Aviation Safety Board was giving a speech, and he talked about how every single airline has a huge database of all the accidents or near misses they've experienced. An airline is legally required to report accidents that have occurred, but if it's a near miss, they don't have to. And the director was saying that we're not capitalizing on the lessons we're learning, because the lessons are staying siloed. They're not being shared across the entire industry. What came out of that is that a database was started where every pilot can submit an anonymous report of an accident or a near miss that occurred. The database is managed by a neutral third party, in this case NASA, and anybody in the industry, even you, can go and read the reports and see what happened and what environment it happened in. In addition, the FAA liked this idea so much that you actually get legal immunity, I believe for up to five or ten years after an accident has occurred, if you submit the report. So if you submitted a report and you did something that was wrong or illegal, but you talked about it and let other people learn lessons from that mistake, you can't be faulted for that. And that's a really exciting idea, because while we all build our own different flavors of systems, they're all very similar.
If you take a web application that's running Rails, using MySQL as its main database, Elasticsearch for search, and Redis for its job queue system, I bet there have been numerous incidents across all of those companies that are very similar. If the first company had talked about the mistake they made (maybe the replication setup wasn't optimal, or maybe a particular setting has a symptom you don't expect to be a problem until something completely different in the architecture breaks), all those other companies wouldn't have had to pay the same price and figure out that lesson on their own. Imagine if there was a third-party database, one that didn't have any goals of profit but just of bettering our industry, where every company could submit their service disruption reports and their retrospectives, talk about the lessons they've learned, and anybody else could go and read them. Sure, we'd have to anonymize, say, the timestamps. We'd have to anonymize maybe some of the specifics. But the ideas are largely the most important part. I can even imagine that, in the tech industry, this organization would then be able to go off and develop best practice guides. Right? If you look at all the different failure scenarios for, say, NGINX in production, you can say this is one of the most optimal ways to run NGINX in production, because it accounts for all these very common incidents.
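One way the anonymization mentioned above might look, as a sketch only: replace wall-clock timestamps with minutes since the first event, and replace people's names with stable placeholders. No such shared database or report format exists in the source. The `anonymize` function and the `(timestamp, person, action)` event shape are assumptions for illustration.

```python
from datetime import datetime


def anonymize(
    events: list[tuple[datetime, str, str]],
) -> list[tuple[float, str, str]]:
    """Turn (timestamp, person, action) into (minutes_from_start, alias, action).

    Events are sorted by time. Each person gets a stable alias such as
    "responder-1", assigned in order of first appearance.
    """
    if not events:
        return []
    ordered = sorted(events, key=lambda event: event[0])
    start = ordered[0][0]
    aliases: dict[str, str] = {}
    result = []
    for when, person, action in ordered:
        alias = aliases.setdefault(person, f"responder-{len(aliases) + 1}")
        minutes = (when - start).total_seconds() / 60
        result.append((minutes, alias, action))
    return result
```

This keeps the ordering and spacing of events, which is where most of the lesson lives, while dropping who did what and when. It does not scrub names that appear inside the free-text action, so a real submission process would need more than this.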
A
Well, this is a great idea. So I can imagine an NGINX consulting company putting out some sort of white paper where they looked through all the incidents and advise everybody about best practices. Why anonymous?
B
So nobody can get blamed. I guess anonymity provides protection. We don't necessarily care who was involved in the accident, we care what happened. Because the individual... you can swap out a different person. If the processes and systems in place will cause people to make mistakes, then it doesn't matter who that person was. What matters is what happened, what made it happen, and what the repercussions were, so we can go and mitigate them.
A
Yeah, it's a great idea. You had mentioned SREs earlier. That's a Google role, I believe. How has the SRE role influenced how you guys do things at Shopify?
B
So the SRE role is Google's sort of term for it. Production engineer is really a synonym. It adopts the SRE mindset, but the difference is largely only in the name.
A
Okay, could you expand on that? What does the role encompass?
B
So traditionally, in companies there would be a developer role and an operator, or operations engineer, role. Developers would write the software for the service that runs in production, and then they'd throw it over a fence to the operations engineers, who would deploy that service, manage it, and maintain it. So if there's an outage, the operations engineers are the ones trying to fix the service, not the developers who wrote the code. The idea with the SRE role is that instead of having this divide, the operations engineers build the tools and systems that developers can then use to run their own software. So imagine if you had your own internal Heroku. Developers would write their code, push it to Heroku, and then monitor their own application. They can figure out if the application needs more resources of a particular type. The SREs in an organization would build that Heroku, almost, internally. The actual manifestation of what that looks like in reality is different, but conceptually it's very similar.
A
And the value of having such a new position is so you can.
B
One way to think of it is that as your service scales up, the number of machines you're managing also grows. In a traditional model, the number of operations engineers you need scales linearly with that. With an SRE-type role, you can almost think of it as developers with strong systems understanding who are automating a lot of the manual processes you would have in a traditional operations role, so they'll scale logarithmically. You need a substantially smaller organization to be able to manage a large service. It also forces much healthier ideas around interacting with our infrastructure. There's this idea of pets versus cattle. Before, if you were manually managing a system, you would treat each computer or server as a pet. You would manually fix it, you would come in and try to figure out what the problem is. It's very one-on-one. With the SRE model, you want to automate away everything. You want to automate away all this toil. And so you'll treat your computers like cattle, where they'll all be treated the same. There won't be any special snowflakes where one computer has one configuration and another has a different one. If a computer is misbehaving, you can just wipe it and reinstall the same configuration that you had on all the other computers, and treat it not as an individual but as part of a herd.
A
The difference being that a pet, you know, each pet has a name.
B
Yeah.
A
And is unique to you, whereas all the cattle are the same.
B
Exactly. An example would be if you have cache one, cache two, cache three, up to cache N. That would be treating them as cattle. But if you have cache rails, cache page, cache whatever, each cache server is unique in the infrastructure, special and different, and that's not great for long-term manageability.
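The cattle idea can be sketched in a few lines: every host in a role is generated from one shared configuration, and anything that drifts from the herd is flagged so it can be wiped and rebuilt. The function names, the host naming scheme, and the `memory_mb` setting below are hypothetical and not taken from any real provisioning tool.

```python
from collections import Counter


def cattle_fleet(role: str, count: int, config: dict) -> dict[str, dict]:
    """Return hosts named role-1..role-N, each with its own copy of one config."""
    return {f"{role}-{i}": dict(config) for i in range(1, count + 1)}


def snowflakes(fleet: dict[str, dict]) -> list[str]:
    """List hosts whose config differs from the most common config in the fleet.

    Assumes config values are hashable (strings, numbers, and so on).
    """
    if not fleet:
        return []
    fingerprints = {
        host: tuple(sorted(cfg.items())) for host, cfg in fleet.items()
    }
    baseline, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(host for host, fp in fingerprints.items() if fp != baseline)
```

In this sketch a host that shows up in `snowflakes` is not debugged by hand. It is replaced with a fresh copy of the shared configuration, which is the "wipe it and reinstall" step described above.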
A
So I want to be conscious of your time. Before we go, what do you find to be the problem with the Facebook motto of old, move fast and break things?
B
I think two things. One, in the tech industry in the past, when we were younger, the services we built had a much smaller impact on the people around us and on society. People's lives didn't rely on this thing called the Internet. Today, when our services fail, the consequences can be terrifying. People can't travel, banking grinds to a halt, our 911 response services can't work anymore, and so on. The list is endless. And the terrifying thing is that the list is growing by the day. We're constantly modernizing and connecting all these different things that were analog before and are now becoming digital. We need, as an industry, to start appreciating the responsibility we have to people who aren't technologists, and to approach the things and the systems we build accordingly. Maybe not with the extreme rigor of managing a nuclear reactor, but there are a lot of lessons out there that we can learn to make sure our systems are more stable and are built with an understanding of how important it is that they stay up and available. Move fast and break things indicates to me this old idea, this time before, when it was fine if things broke. We need to move to something where we can't just let other people pay the price of our systems breaking.
A
I think that's a great thought, and a great place to leave this. So, Emil, thanks so much for your time and all your great insights.
B
Thank you. I had a ton of fun.
Host: Adam Gordon Bell
Guest: Emil Stolarsky, Production Engineer at Shopify
Date: January 5, 2018
This episode delves into the practice and philosophy of incident response in complex software systems, drawing from Emil Stolarsky’s experience at Shopify and his research into emergency management, aerospace, and transportation industries. Emil argues that software engineers have much to learn from these fields, particularly regarding process rigor, the use of checklists, communication protocols, and moving beyond traditional, ad-hoc approaches. The conversation unpacks the evolution from “move fast and break things” to a discipline where stability and responsibility are paramount.
"We need to be constantly tracking this to know, because we know that certain parts will fail under these conditions or will fail after these many uses." – Emil ([08:17])
Tech vs. Other Industries:
Bias in Analysis:
Tools & Approaches:
Quote:
"Retrospectives are trying to figure out what went wrong and controlling for bias. People as humans are biased for many different reasons..." – Emil ([33:00])
Aviation Safety Reporting System Example:
Proposal for Software:
Why Anonymous?
The episode maintains a thoughtful, reflective, and practical tone throughout, with Emil drawing clear, compelling analogies and advocating for humility, rigor, and cross-discipline learning in software engineering.
For developers, engineering leads, or anyone involved in systems reliability, this episode offers a wealth of actionable insights, timely warnings, and persuasive arguments for taking incident response much more seriously as software becomes ever more integral to daily life.