Transcript
A (0:01)
Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host. All complex systems eventually fail, and software is becoming more and more complex. Incident response is about preparing for and effectively recovering from these failures. Emil Stolarsky is a production engineer at Shopify, where his role shares many similarities with Google's Site Reliability Engineers, also known as SREs. In this interview, Emil argues that the academic study of emergency management and industries such as aerospace and transportation have a lot to teach software engineers about responding to problems. You'll hear Emil argue that we need to move beyond tribal knowledge and incorporate practices such as an incident command system and rigorous use of checklists, and that we need to move beyond a "move fast and break things" mindset. I think you'll enjoy this interview.
B (1:12)
Hi, Adam. Thanks for having me.
A (1:14)
So you've given some talks about incident response. What is incident response?
B (1:20)
Incident response is a field where we look at how systems can fail, both organizational systems and the systems we build, and how we can optimize recovering them back to their normal state, and everything around that. So that's mitigating system failure, organizational failure included, and figuring out how to organize the human response component. That's bringing the system back to running, and then doing a retrospective: looking back at the system, seeing what lessons we can learn from it failing, and making sure that it doesn't fail the same way in the future or, if it does, that we can minimize the impact it has.
A (2:01)
So is incident response made up of several pieces or steps?
B (2:07)
Yeah. When I started researching incident response, I discovered, maybe to my naive surprise, that there's this whole body of work, with institutes that study it. The term used there is emergency management, and it's broken down into four components: mitigation, preparedness, response, and recovery. Mitigation starts from the idea that systems will fail and things will break, so how do we reduce the risks and make sure we have a safe failure? An example of this is on a construction site, where you might mark off zones under the crane where people can't walk while the crane is operating, because if the crane breaks for whatever reason, something can drop in that area. For us in software, that might be something like having bulkheading or circuit breakers in case, say, a remote service is not working. Preparedness is, I think, the human component. The analogy I always think of in tech would be on-call. You don't ever want your system to break, but you assume it will, and figuring out the organization, like who is going to be the person who gets alerted when it breaks, would be preparedness. Response is actually fixing it: the service is broken and you need to bring it back up. And then recovery is doing a retro on it and getting back to business as usual. Going back to the tech analogy, say you have a highly available database, and your first one goes down while your second one is up. Switching over to the secondary would be the response. The recovery would be standing up a new database, so that the one that's now running has a secondary again.
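[Editor's note: Emil mentions circuit breakers as a software example of mitigation. The following is a minimal sketch of that pattern in Python, not anything described in the interview. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration. After a run of consecutive failures, the breaker stops calling the remote service for a cooldown period and fails fast instead.]

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical, illustrative breaker; names and defaults are assumptions."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cooldown in seconds while open
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("remote service marked unavailable")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

[A caller would wrap each remote request, for example `breaker.call(fetch_inventory, sku)` with a hypothetical `fetch_inventory` function, and treat `CircuitOpenError` as a signal to serve a fallback rather than wait on a service that is known to be down.]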
