#726: Single region, zero excuses: Mastering AWS resilience - AWS Podcast

Summary

AWS Podcast Episode #726: Single Region, Zero Excuses – Mastering AWS Resilience

Release Date: June 23, 2025
Host: Gillian Ford
Guests: John Frimento (Principal Product Manager, Resilience Infrastructure and Solutions) and Tariq Makota (Senior Principal Solution Architect, Customer Resilience Engineering)
Duration: Approximately 45 minutes

Introduction

In Episode #726 of the AWS Podcast, titled "Single Region, Zero Excuses: Mastering AWS Resilience," host Gillian Ford delves into the critical topic of building resilient applications within a single AWS region. Joined by AWS experts John Frimento and Tariq Makota, the episode offers deep insights into AWS's resilience architecture, common misconceptions, and practical strategies for developers and IT professionals aiming to enhance their application's reliability and availability.

Understanding AWS Regions and Availability Zones

Gillian Ford kickstarts the discussion by addressing listeners new to AWS, prompting Tariq Makota to elucidate the fundamental building blocks of AWS's infrastructure.

“Both are basically logical constructs under which are the actual physical implementation. Availability zone is one or more data centers that are closely located and then multiple availability zones form an AWS region.”
— Tariq Makota [01:45]

Tariq explains that an Availability Zone (AZ) is one or more closely located data centers, and that a region is made up of multiple AZs. The AZs in a region are physically separated from one another, within roughly 60 miles (100 km), so a localized failure in one AZ is unlikely to affect the others. Regions in turn offer protection against broader events like electrical grid failures or natural disasters. These two constructs are the basis for building resilient and reliable applications on AWS.

AWS's Fault Isolation Boundaries

Delving deeper, Tariq outlines AWS's multi-layered fault isolation strategy, essential for designing robust applications.

Partition Level Isolation:
AWS groups its regions into partitions, such as GovCloud for public sector customers and the commercial regions for enterprise customers. The partition is the outermost fault boundary.

Regional Isolation:
Each region operates independently, akin to the bulkhead pattern in shipbuilding. “If you think about it, very similar to the ship building to the bulkhead pattern where each region is a part of the bulkhead and flooding one of the bulkheads does not basically take the whole ship down.”
— Tariq Makota [03:11]

Availability Zone Isolation:
AZs further subdivide regions, providing additional layers of resilience. Services like Amazon Elastic Block Store (EBS) and Elastic Compute Cloud (EC2) are AZ-bound, ensuring that failures within one AZ don't cascade across others.

Control Plane vs. Data Plane:
Tariq differentiates between control plane operations (e.g., spinning up EC2 instances) and data plane operations (e.g., handling HTTP requests). Understanding this distinction helps customers architect applications that can gracefully handle failures in either plane.
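One consequence discussed in the episode is that recovery should not depend on control plane actions such as launching new instances. The sketch below is a minimal illustration of that idea, not a method stated by the speakers: it assumes load is balanced evenly across AZs, and the helper name `instances_per_az` is hypothetical. It computes how much capacity to keep running in each AZ so that the surviving AZs can carry the full load without any new launches.

```python
import math


def instances_per_az(required_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to keep running in each AZ so the surviving AZs alone carry the load.

    Assumes evenly balanced AZs (an assumption made for this sketch).
    """
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(required_instances / surviving)


# Hypothetical workload: 12 instances are needed to serve peak load across 3 AZs.
per_az = instances_per_az(required_instances=12, az_count=3)
total = per_az * 3
print(f"run {per_az} per AZ, {total} in total")
```

The trade-off is cost: capacity is provisioned ahead of time so that losing an AZ requires only data plane behavior (routing traffic to instances that are already running).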

Cell-Based Architecture:
For highly scalable services, AWS employs cell-based architectures, segmenting large fleets of servers into smaller, isolated cells to minimize the blast radius of potential failures.

Common Misconceptions about Resilience

The conversation shifts to prevalent misunderstandings customers have regarding AWS resilience.

Single Region vs. Multi-Region:

“I think the biggest misconception is that single region architectures cannot achieve high availability or resilience...”
— Tariq Makota [11:25]

Tariq points out that multi-region architectures aren't inherently more resilient if not architected correctly. For instance, an application that depends on synchronous replication between two regions needs both regions to be available, so an impairment in either one takes the whole application down, and two regions will statistically see impairments more often than one. A well-designed single-region, multi-AZ setup can therefore offer better availability than a poorly designed multi-region one.
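The arithmetic behind this point can be sketched as follows. The availability figure is hypothetical (not an AWS number), the function name `serial_availability` is made up for this illustration, and the calculation assumes the two regions fail independently.

```python
def serial_availability(*availabilities: float) -> float:
    """Availability of a request path that needs every listed dependency to be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


REGION_AVAILABILITY = 0.999  # hypothetical figure chosen for illustration only

single_region = serial_availability(REGION_AVAILABILITY)
# Synchronous replication as a hard dependency: both regions must be available.
sync_two_regions = serial_availability(REGION_AVAILABILITY, REGION_AVAILABILITY)

print(f"single region:             {single_region:.4%}")
print(f"two regions, hard-coupled: {sync_two_regions:.4%}")
```

Under these assumptions the hard-coupled two-region design comes out at roughly 99.8%, lower than the single region it was meant to improve on.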

Misuse of Availability Zones:

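Simply deploying across multiple AZs doesn't guarantee resilience. An application that sends a single request back and forth across AZs depends on every one of them, so it must be explicitly architected for AZ failures, for example by keeping traffic within one AZ where possible and by using circuit breakers or similar mechanisms to detect and isolate an impaired AZ.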
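A minimal sketch of Availability Zone affinity, under stated assumptions: the `Endpoint` type, the `pick_endpoints` helper, and the addresses and AZ names are all hypothetical, and how an AZ gets marked as impaired is left out. The caller prefers endpoints in its own AZ and crosses a zonal boundary only when it has no healthy local option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    az: str


def pick_endpoints(endpoints, caller_az, impaired_azs=frozenset()):
    """Prefer healthy endpoints in the caller's AZ; otherwise fall back to other healthy AZs."""
    healthy = [e for e in endpoints if e.az not in impaired_azs]
    local = [e for e in healthy if e.az == caller_az]
    return local or healthy


# Hypothetical fleet with one endpoint per AZ.
fleet = [
    Endpoint("10.0.1.10", "az1"),
    Endpoint("10.0.2.10", "az2"),
    Endpoint("10.0.3.10", "az3"),
]

print(pick_endpoints(fleet, caller_az="az2"))
print(pick_endpoints(fleet, caller_az="az2", impaired_azs={"az2"}))
```

The first call stays inside `az2`; the second shows the fallback when the local endpoints have been taken out of service.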

Strategies for Building Resilient Applications

John Frimento and Tariq Makota offer actionable strategies for enhancing application resilience within a single AWS region.

Failure Mode Analysis:

John emphasizes the importance of identifying and categorizing potential failure modes to inform resilience strategies.

“If you look at your application and it's not just like components try to break it down to say like a user journey or story.”
— John Frimento [18:00]

This involves assessing critical user journeys (e.g., purchase flows for e-commerce sites or balance inquiries for banking applications) and determining how various components interact and potentially fail.
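One way to make this analysis concrete is to enumerate every interaction in a user journey against each failure category. The sketch below is illustrative only: the categories follow the ones John lists in the episode (with misconfiguration and bugs grouped together), while the journey, the component chain, and the function name `analysis_questions` are assumptions chosen for the example.

```python
from itertools import product

# Failure categories as listed in the episode.
FAILURE_CATEGORIES = [
    "shared fate",
    "excessive load",
    "excessive latency",
    "misconfiguration and bugs",
    "single points of failure",
]

# Hypothetical user journey: each tuple is one component-to-component interaction.
JOURNEYS = {
    "check balance": [
        ("API Gateway", "Lambda"),
        ("Lambda", "DynamoDB"),
    ],
}


def analysis_questions(journeys):
    """Yield one review question per (journey, interaction, failure category)."""
    for journey, interactions in journeys.items():
        for (caller, callee), category in product(interactions, FAILURE_CATEGORIES):
            yield f"[{journey}] {caller} -> {callee}: what happens under {category}?"


questions = list(analysis_questions(JOURNEYS))
for question in questions:
    print(question)
```

Each generated question is a prompt for the team to answer, not an automated check; the value is in making sure no interaction on a critical path is skipped.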

Mitigation Techniques:

Tariq discusses several techniques to handle excessive load and other failure scenarios:

Load Shedding:
“You have a decision to make, right? ... load shed some percentage of the customer or the request and am I going to impair, let's say 20% of those...”
— Tariq Makota [19:25]

Deliberately rejecting a share of requests under overload so that the remaining requests succeed, rather than letting resource exhaustion turn a partial impairment into a failure for every customer.

Throttling:
Using AWS services like API Gateway for rate limiting to prevent system overload.

Exponential Backoff and Circuit Breakers:
Techniques to manage retry attempts and isolate failing components, ensuring that temporary issues don’t cascade into widespread outages.
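The following is a minimal sketch of three of these techniques, not code from the episode. The function and class names, thresholds, and delays are hypothetical; the 4-second default echoes the user-abandonment figure Tariq mentions for e-commerce. Per-customer throttling (for example a token bucket) is not shown.

```python
import random
import time


def should_shed(queued_seconds: float, estimated_work_seconds: float,
                client_patience_seconds: float = 4.0) -> bool:
    """Shed a request whose response would arrive after the client has likely given up."""
    return queued_seconds + estimated_work_seconds > client_patience_seconds


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Fail fast after repeated failures, then allow trial calls once a cool-down has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit trial calls once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0)
for _ in range(2):
    breaker.record_failure()

print("call allowed:", breaker.allow())
print("shed request:", should_shed(queued_seconds=3.5, estimated_work_seconds=1.0))
print("retry delay for attempt 3:", backoff_delay(3))
```

The common thread is that each technique gives up a small amount of work on purpose (a rejected request, a delayed retry, a skipped call) to avoid exhausting resources for everyone.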

Tools and Services for Enhancing Resilience

The episode highlights several AWS services and tools designed to facilitate resilience planning and implementation.

Amazon Application Recovery Controller:

John introduces this service, which offers capabilities like Zonal Shift and Zonal AutoShift, enabling automated traffic routing during AZ impairments. It integrates seamlessly with services like EKS clusters, Application Load Balancers (ALB), Network Load Balancers (NLB), and EC2 Auto Scaling groups.

Fault Injection Service:

A critical tool for resilience testing, allowing developers to simulate failure scenarios and validate their application's ability to handle them effectively.

Shared Responsibility Model:

Tariq elaborates on how different AWS services manage varying degrees of resilience, reducing the operational burden on customers. For instance, managed services like Amazon Connect handle much of the resilience infrastructure, whereas running containers on EC2 gives customers more control and responsibility.

“Most of the services give the customer some level of resilience boost... or have the container services manage the fleets and node workers.”
— Tariq Makota [38:58]

Parting Advice for Listeners

As the episode wraps up, both experts share their final thoughts to guide listeners on their resilience journey.

John Frimento:

“Multi AZ is a great place to start. Just out of the box you get so many benefits and so I hope that didn't get missed in today's conversation.”
— John Frimento [42:03]

He advocates for leveraging multi-AZ deployments as a foundational step toward building resilient applications, emphasizing that it's achievable with AWS's built-in capabilities.

Tariq Makota:

“Question everything. I would question recovery time objectives, recovery point objectives... simplify my architecture as much as possible.”
— Tariq Makota [42:49]

Tariq advises a rigorous evaluation of resilience strategies, encouraging customers to regularly reassess their recovery objectives and strive for architectural simplicity to minimize potential failure points.

Continuous Improvement:

John adds that resilience isn't a one-time setup but a continuous process requiring regular testing and updates to adapt to evolving application demands and potential new failure modes.

“It's not a one time deal... think about this as a continuous process.”
— John Frimento [44:22]

Conclusion

Episode #726 of the AWS Podcast offers a comprehensive exploration of single-region resilience, demystifying AWS's infrastructure and providing practical guidelines for building robust, high-availability applications. Through expert insights and actionable advice, listeners gain a deeper understanding of how to navigate the complexities of AWS's fault isolation boundaries, implement effective resilience strategies, and utilize AWS's suite of tools to safeguard their critical workloads.

For those keen on mastering AWS resilience, this episode serves as an invaluable resource, setting the stage for more advanced discussions on multi-region architectures in future episodes.

Notable Quotes:

Tariq Makota [01:45]:
“Both are basically logical constructs under which are the actual physical implementation. Availability zone is one or more data centers that are closely located and then multiple availability zones form an AWS region.”
Tariq Makota [03:11]:
“If you think about it, very similar to the ship building to the bulkhead pattern where each region is a part of the bulkhead and flooding one of the bulkheads does not basically take the whole ship down.”
Tariq Makota [11:25]:
“I think the biggest misconception is that single region architectures cannot achieve high availability or resilience...”
John Frimento [18:00]:
“If you look at your application and it's not just like components try to break it down to say like a user journey or story.”
Tariq Makota [19:25]:
“You have a decision to make, right? ... load shed some percentage of the customer or the request and am I going to impair, let's say 20% of those...”
John Frimento [42:03]:
“Multi AZ is a great place to start. Just out of the box you get so many benefits and so I hope that didn't get missed in today's conversation.”
Tariq Makota [42:49]:
“Question everything. I would question recovery time objectives, recovery point objectives... simplify my architecture as much as possible.”
John Frimento [44:22]:
“It's not a one time deal... think about this as a continuous process.”



Transcript

[00:00]
Tariq Makota
This is episode 726 of the AWS podcast released on June 23rd, 2025.
[00:10]
Gillian Ford
Welcome everyone to the AWS Podcast. I am your host for today, Gillian Ford. And I always love episodes that apply to every person, every single listener. And this episode is one of those. We are going to be talking about single region resilience. So stick around because there's going to be a takeaway for wherever you are at with your single region resilience journey. And I've got two experts on the topic, which I can't wait to pepper them with questions for selfish reasons, of course. So let me. Let's do some introductions. John, why don't you introduce yourself?
[00:50]
John Frimento
Hi everybody, I'm John Frimento. I am a Principal Product Manager in our Resilience Infrastructure and Solutions organization. I focus a lot on working with customers who want to operate critical workloads on AWS. And then I also help build Amazon Application Recovery Controller.
[01:10]
Tariq Makota
Hey, I'm Tariq Makota. I'm a Senior Principal Solution Architect, which means I spend most of my time working with customers on the resilience aspect. I've been at AWS about seven and a half years, and I'm also an extended member of the team we internally call Customer Resilience Engineering.
[01:30]
Gillian Ford
Super cool. I can't wait to ask you all these questions. So let's start really simple because we've got people who are brand new to AWS. So Tariq, can you explain the difference between a region and an Availability Zone?
[01:45]
Tariq Makota
How much time do we have? All right, 30 seconds. Yeah. Both are basically logical constructs under which are the actual physical implementation. What I mean by that is an Availability Zone is one or more data centers that are closely located, and then multiple Availability Zones form an AWS region. An AWS region generally gives you protection against events such as electrical grid failures, natural events, and things like that. Regions are generally spread around geographically about 60 miles, and these two constructs are heavily used to create resilient and reliable applications.
[02:43]
Gillian Ford
Love it. And I know a lot of people listening are like, okay, give me this stuff to be able to implement right now. But I think it helps to really have a better understanding of AWS's fault isolation boundaries before you can start to implement some actual tactical things into your architecture. So Tariq, when you're talking to customers, can you explain how they should think about fault isolation?
[03:11]
Tariq Makota
Yeah. So there's multiple layers of fault isolation. So I'll start from the top and kind of navigate down. I presume we have listeners who are in the public sector that would be utilizing our GovCloud, and then we have the customers who are in enterprise companies basically utilizing what we call commercial cloud. So the first separation when it comes to the fault boundary is what we refer to as a partition. So in this case it would be a partition for the GovCloud and it would be a partition for the commercial AWS regions. So that's the first layer of isolation. Then under that layer of isolation there's a regional isolation. So basically every region itself can fail and have an impairment without impairing other regions. Meaning this is basically, if you think about it, very similar to, in ship building, the bulkhead pattern, where each region is a part of the bulkhead and flooding one of the bulkheads does not basically take the whole ship down. AKA other regions are not impaired. After the regions, the next step, the next fault boundary is the Availability Zones. Availability Zones themselves are just basically an extension of that bulkhead pattern. So if you think about it, region being a part of the bulkhead, then separating that bulkhead into additional layers is basically what the Availability Zones are. So those are the basic constructs. Diving deeper when it comes to the protection. There are services that are at the regional level, there are services that are at the AZ level. So for example EBS would be an Availability Zone fault boundary type of a service, EC2 similar. In addition to that, some services implement what we refer to as cell-based architecture, which is another layer. Well, before the cell-based architecture: a number of services implement concepts of control planes and data planes. Control planes, very simply, are used to persist and make the changes to the service. The data planes actually carry the workload of the service itself.
An example of this would be spinning up the EC2 instance. Basically starting up an EC2 instance, whether it is through the Auto Scaling group or start instance or launching it through the console, would be the control plane function. Meaning I'm instantiating something brand new. Two EC2 instances communicating with each other would be the data plane operation of the EC2 data plane. After those two constructs, a number of services implement cell-based architectures, which are probably out of the scope of this. But it's another layer of isolation where the larger set of fleets, if you think about the data plane, the data plane might be composed of a fleet of servers that might be in the tens of thousands, and they logically get separated into smaller groups called cells, and that contains the blast radius as well.
[06:24]
John Frimento
So you just, you as a customer, right. Like you explained a lot to me about, you know, different boundaries AWS has, and then got into some of the plumbing about how services think about data plane and control plane. Like why does it matter? What can I do with that?
[06:42]
Tariq Makota
The reason it matters is knowing how to use these constructs is important in your own application. So to give you an example, if I am to take my application from my on-prem data center, my fault isolation boundary at that point is my application, AKA my data center. I take that application and transpose it to AWS and I stretch that application across three Availability Zones. Each Availability Zone is a fault isolation boundary. However, since I'm crossing those isolation boundaries, I'm basically poking holes into the bulkhead pattern. So I need to either modify my application to be Availability Zone aware, have Availability Zone isolation, or Availability Zone affinity. There's a number of ways to do it and to take advantage of these things. Also, when it comes to certain failure scenarios, if I am relying on control plane functions to basically repair my application after the impairment, those control plane functions may or may not be available. So understanding what is available, what is not available, and what the constraints are basically helps me construct my application in such a way where I can have the highest availability possible. Is that what you're looking for? You wanted specific details?
[08:24]
John Frimento
No, I think that's, that's how I. And the reason why I poked a little bit is because when I'm talking with customers there's usually like a so-what moment that happens of like, all right, great, John, you're telling me about control planes, you're telling me about data planes, and you're telling me about this zonal construct. And for what it's worth, the way the AWS infrastructure is built is a very unique property of AWS, right? Where we have this regional isolation and something happening in one region shouldn't have an impact in another region. And then the isolation between Availability Zones gives customers very powerful mechanisms to build resilient architectures. And then, so like again, the so what is: as a customer, when you're building for, say, a highly resilient application in a single region and using a multi-AZ approach, understanding the different, like if you think about it from like an API perspective, not just a service, right? So if you think of like EC2, there are some actions which are going to be control plane, and what you don't necessarily want to do, if something is going wrong, is depend on launching instances. Right. For your most critical apps, that's a very complicated process and that's part of the EC2 control plane where it's attaching ENIs and doing a bunch of things within the process of spinning up an EC2 instance. But interfacing with already running EC2 instances, such as that HTTPS connection or whatever port or protocol you're using, is a data plane action. And that's kind of the nuts and bolts of the regular operation of that service. And so like that's just one example. But you know, even thinking about like a database, like you know, spinning up another replica as a mechanism to recover. That's a control plane operation of spinning up a replica. Let's have that replica there already and do, you know, a SQL, you know, promotion, which is more of a data plane operation on an existing data store.
So you kind of need to understand how those relate to the services you're using to make informed choices on a resilience architecture.
[10:36]
Gillian Ford
Yeah, you bring up some really interesting points. Because when I talk to customers, I work with startups, and I think a lot of them see exactly what you were saying about all that AWS does to really make it easy for customers to build these resilient architectures. And it gives people, unfortunately, a false sense of security that they don't really have to do much in order to be resilient, like, oh, I can just use a managed service, for example, and then I'm good. So I think it's great that you were kind of unpacking what are some additional things that people should think about. And I'd love to understand more on that thread of like what are some other misconceptions that maybe you see customers have when they're thinking about designing for single region resilience?
[11:26]
Tariq Makota
I'll go first. I think the biggest misconception is that single region architectures cannot achieve high availability or resilience or, otherwise said, multi-region architectures can achieve the higher resilience. There's a lot of ifs in that statement, but technically, if not done appropriately, a multi-region architecture could have lesser availability. An example of that being if I am synchronously replicating from region A to region B and if I have a dependency on that synchronous replication, that means if any one of the two regions is unavailable, my whole application is unavailable. Statistically speaking, two regions may have an impairment more often than a single region, just as a statistical thing. So I think that's number one. I think the other part of that misconception is usage of Availability Zones. So Availability Zones are physically isolated with independent power, independent cooling, networking infrastructure. Right. But in order to use them correctly, the application needs to be distributed and architected correctly. I spoke a little bit about this earlier. If I quote, unquote, spray my traffic across all three Availability Zones at any point, meaning my traffic increases in Availability Zone 1 for some reason, I send it to Availability Zone 2 because that's where the service is, and then that goes back to Availability Zone 1 and then to Availability Zone 3 and then to the database. I'm basically crossing those zonal boundaries multiple times, which may be fine if I have the ability to detect that any one of those Availability Zones is impaired and have either circuit breakers or some sort of other mechanism that basically would isolate that Availability Zone from my application standpoint.
So just running an application on multiple Availability Zones does not mean that that application is highly available and can use all of the benefits of Availability Zones. The application needs to be very purposely built to take advantage of those. So, in sum, to me the biggest misconception, I believe, is that going multi-region by default gives me more availability and resilience. I think how the application is architected, how the failovers work is just as important.
[14:32]
Gillian Ford
I think what you were just saying there, even within a region, of how to architect for multiple Availability Zones, I think that's going to cause a lot of people to really think more critically about how they're designing it. So I'd love to understand, John, how should customers start to think then about resilience for their critical applications?
[14:54]
John Frimento
Yeah, that's a good question. Like, and maybe I'll tie back to the example I was giving around like control plane, data plane, and like if you would just hear like that snippet, right? Of like, hey, you should rely on running instances only. Like it's the wrong message, right? And so when you think about critical applications, you need to think about, there's normal operations you're going to do, right? Like naturally, as load increases, you want to have scaling to be able to absorb that load. And you're going to naturally, you know, rely on various control planes to do that. And that's a good thing, right? As more customers are using my service or product, I want to be able to scale to absorb that load and then scale back down to, you know, a predefined kind of baseline to save costs. Right? Now with a very critical workload, what you want to think about maybe is like that baseline of like what is the minimum we need at any given time to service our clients. And so then, in my mind, it now becomes you need to really separate like a conversation around critical workloads, around like what are our specific recovery operations. A lot of times customers will think about these in terms of like runbooks or, you know, SOPs for recovering from various failures and things like that, from like the normal state of the application. And so for like recovery procedures, you really need to double click on those for these critical workloads. And that's where you want to make sure like things we're doing are not introducing like a new modality. This would be one of Tariq's favorites, of like bimodal operations. And what we mean by that is what you want to avoid in a critical system is, as a part of recovery, going down a path you never test or normally don't go down, for lack of a better word.
And what we've seen more times than not is there will be this kind of fork in the road, so to speak, when something's going wrong and you go down this new path and you try to do things to recover that you have never tested. Eventually, typically, that backfires because you need these recovery processes to be tested and you want them to be, you know, just regularly exercised. So when you use them it's going to work as you expect. And so I, you know, I think that's a big thing to think about. And it also goes into like the one thing, you know, we talk to customers a lot about, is understanding failure modes and maybe doing like a failure mode analysis. Like for folks out there who are familiar with like FMEA, failure mode and effects analysis, it's kind of like our version of that, so to speak, where we just kind of think about it through a lens of cloud-based systems. And the idea is if you look at any application, there are common failures that just occur and we've categorized these, and just maybe a plug to the document that's available for customers. If you Google or search wherever, like the AWS resilience analysis framework, you'll find this. But we basically categorize, through internal learnings working with customers, common failure categories or modes, and you think about shared fate, excessive load, excessive latency, misconfiguration, bugs and single points of failure. And then so if you look at your application and it's not just like components try to break it down to say like a user journey or story. And like an example of this may be, for an e-commerce site, path to purchase, you'll hear that a lot, like what's the path my customers go through to actually complete a purchase is a critical path. More so than maybe me being able to update my address or phone number, right? Like it's okay if I can't do that right this second, but most likely I want to purchase something and I want that to be available.
And for like a banking application, one of the most common critical paths is like being able to check my balance, right? Like I want that level of trust with that financial provider that I can go and see how much money I have. Like I worked hard for that, right? And so if you think about it as a frame of there's these critical user journeys and there's interactions between components. In this case maybe AWS services, where, you know, say like an API Gateway interacts with a Lambda function that interacts with a DynamoDB table. You can then look at like, well, what are common failure modes that you see? And maybe excessive load is one of them, right? Like what happens if a client all of a sudden is misconfigured and is bombarding my service and overloading it to the point where I can't handle all these requests, like what can I do? And maybe I'll send this to Tariq. What would you do if your application was being bombarded by requests? What's a mitigation you could do?
[19:25]
Tariq Makota
Well, there's a couple of them. They're not popular but, you know, load shedding would be one of the things. Load shedding meaning if I'm getting overwhelmed with requests, I have a decision to make, right? The decision is: am I going to load shed some percentage of the customer or the request and am I going to impair, let's say 20% of those, or am I going to allow this to go on until the resources of the application are exhausted and I potentially have a full-on failure of my application, AKA 100% of the customers are affected.
[20:09]
John Frimento
In that case it may be better to give one client a bad day or a bad 20 minutes and preserve the other customers or clients in my system.
[20:17]
Tariq Makota
Yeah, the analogy that I like, that all of us can relate to, Mike Kakin actually wrote a paper on this in the Builders' Library. It's like when we all go to the restaurant, sometimes we go in there and we are told, hey, 15 minutes wait, 20 minutes wait. We are basically being throttled, right? So they could take all of us into the restaurant and then I could be sitting in John's lap with his family eating dinner at the same time. The experience is bad for both John and myself. Or they can allow John to finish his dinner, open up the table and then I go have my dinner. So that's kind of like an everyday example of these things. So definitely, you know, load shedding, throttling as well. Throttling based on the number of customers. And I could treat customers of my application differently. If I have a customer that needs a high level of utilization, is an important customer, I might give that customer more tokens and more space to use in my application versus other customers. Or I can distribute it evenly. So throttling the customer requests. The last part of it is, you know, the load shedding also allows me, like sometimes, I think in e-commerce applications it's about 4 seconds for user abandonment, meaning the user will wait for about four seconds before they abandon their cart, purchase, whatever. So for example, if I'm having some sort of a slowdown or impairment where the required responses are being returned to the customer, you know, after four or more seconds, I have a decision to make. I already know that there's going to be abandonment by the customer. That's likely. I have projected that the request might take more time than that customer is willing to wait, and if I proceed with that work, that's going to be wasted work at that point. So I have a decision to make. Should I do that work or not, given the time? So those are some of the techniques to use. Exponential backoff is another one.
Or I can just stop certain things from working using circuit breakers. Depending on the impairment that I'm trying to handle, there are a number of techniques that could be used. A lot of these things need to be implemented in my application. Some AWS services can help with these things. API Gateway can help with rate limiting and things like that. ALB can help with making the load shedding easier. It's not out of the box, but I can, with the path for the load shedding, kind of quote unquote black hole the traffic. Long story short, knowing how these services behave, what the capabilities are, and implementing the functionality in my application basically gives me the best chance of having a high level of availability.
[23:20]
Gillian Ford
One theme that I'm hearing from both of you is that it sounds like customers need to really rethink how they are thinking about resiliency. A lot of people will work backwards from, oh, I need a certain number of nines, where it actually sounds like, based on what you're both saying, they should really be thinking about it in terms of failure modes, like all the scenarios you were talking about of different points of failure and how you want to be able to respond to that failure. So John, I'd love to hear from you: walk us through how customers can start to make that shift as part of their resiliency strategy.
[24:00]
John Frimento
Yeah, sure. And yeah, like I think like you know, number of nines, you know, like I want 99 or, or measuring like against service level objectives. SLOs is really common and I'm not saying, you know, change the way you measure, like that makes sense but like what we like to see or what is typically helpful when we talk to customers is thinking about how to like what's the means to get to that end. And it's typically looking at your application and figuring out like hey, yes, these three failure modes are pretty plausible and everything's a trade off with resilience. And to your point earlier, like, yeah, maybe like it's not uncommon for customers to say like yeah, I'm using these, you know, managed services and they're multi az, so I should be good. And quite frankly that's a really good baseline to start. You can build some very resilient applications that way. And so like I guess to, for like the audience and for like this conversation like Tariq and I are really coming through the lens of like the most critical apps. Like you know, flights may not take off, transactions may not clear. Like things that are really going to be impactful. And I'm sure every business has some of these. And typically, and my point is typically it's not every application within a business. And so when you think about these things, you have to make trade offs around how much you invest in the resilience because I'm sure for every application team there's a counterpart in the business that's saying we need these features. And then right. How far do you take this, this thing? And so yeah, with that said, like with, with every kind of failure mode there's a, there's a trade off and we like to think about it in terms of like, what's the plausibility that it could happen, how realistic is it? And then if it did happen, what would be the impact? 
And so when you think about the different categories and how they can manifest into failures for your application, you know, assess it against those two things and if you kind of think of and like obviously like the high highs, like the high probability, high impact, you probably want to take care of those. But now if you get something to like, well hey, the risk is super low, but the impact may be really high, that's probably something you want like that's a harder trade off to make. And then right as those ranges or inputs kind of change, you can make that assessment but you really want to make that assessment against, you know, the engineering effort and the cost of implementing. Because like Tariq's talking about throttling and circuit breaker technology and you know, creating cell based architectures, which is something we do like when we're building services. AWS you will not find a conversation that doesn't incorporate concerns or conversations around blast radius reduction. But that's an engineering trade off to make. And so as a customer you really want to think about the level of complexity you're building into the system and does that get you what you're after? And then again right, and all relates back to the end of the day of like what availability do your customers expect? And I think, you know, most customers nowadays expect everything to be available whenever it is. The heck they want to use it. So it's a challenging environment, right?
[27:08]
Gillian Ford
Totally. Yeah, it really is. And I'm sure there are other customers listening to this, and it's getting them thinking about what the failure modes are. There are probably people who are thinking, "Wait a minute, I don't know what I don't know." So is there some sort of list that AWS has curated that can help me see some of the common failure modes that maybe I should think about?
[27:32]
John Frimento
Tariq, you want to. I almost want to just tell everyone your alias really quick, but that's probably not going to do you any good. So maybe you tell them where they can find it.
[27:41]
Tariq Makota
Yeah. I've actually been playing with AI a lot, and to be honest with you, I was playing with the Q CLI the other day and asked about the failure modes for API Gateway. It gave me 14 failure modes, and then I asked what some of the mitigations are, and it gave me those along with some links to blogs. It was fairly accurate. I shouldn't say each, but most of the services probably have a resilience section in their documentation where they talk about some of these things. Gen AI can definitely help because it has ingested so much public data, so it's a quick way to get to an 80/20 type of result. But as we've been discussing, we've been focusing here on how AWS services can fail. My own application has failure points that these things will not tell me about. For example, most applications have a dependency on some sort of single sign-on system, things like Okta, Ping, and so on. As a customer, I could implement these myself within my own infrastructure and self-manage them. At that point I'm taking responsibility for making them highly available, but my application might depend on single sign-on being there. Or, as John mentioned before, think about the critical path. Say I have a travel application that gives me turn-by-turn directions and, in addition, tells me what the weather is in the city or town I'm going to arrive in. Which one is more critical? Is it critical that I know the weather in the city I'm arriving in? I would say not as much. It's nice to have. So if that's a dependency on weather.com or a weather service of some sort, I may want to make a choice that when that dependency is not available, my application still operates.
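Tariq's travel application example, where directions are on the critical path and weather is nice to have, can be sketched as follows. The function names (`build_trip_response`, `get_directions`, `get_weather`) are hypothetical, and the sketch assumes the two dependencies are passed in as plain callables.

```python
from typing import Callable, Optional


def build_trip_response(
    get_directions: Callable[[], list[str]],
    get_weather: Callable[[], str],
) -> dict:
    """Treat directions as critical and weather as optional."""
    # Critical path: if directions fail, the request fails.
    directions = get_directions()

    # Non-critical dependency: degrade gracefully instead of failing.
    weather: Optional[str]
    try:
        weather = get_weather()
    except Exception:
        weather = None  # the caller renders the page without weather

    return {"directions": directions, "weather": weather}


def broken_weather_service() -> str:
    # Stand-in for an unavailable third-party weather API.
    raise TimeoutError("weather service unavailable")


response = build_trip_response(
    get_directions=lambda: ["Head north", "Turn left"],
    get_weather=broken_weather_service,
)
print(response)
```

A real client would usually also put a short timeout on the weather call so that a slow dependency cannot hold up the critical path.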
So I have to take ownership a lot of the time when it comes to how I build my application. For example, if I have built my application with Availability Zone isolation, then when I do deployments I might be able to deploy to one AZ out of three, or I might be able to use a feature flag to deploy my function to only a subset of things. I can test that, and if there are issues, I roll back. It is more common for deployments, or config changes made through deployments, to cause an outage than anything else. That's a very common thing, even with all of the testing. So there is no single way to get all of the potential failure modes. But one thing I would suggest to teams, and some of our internal teams do this on a weekly basis: as the team goes through the backlog, they might come together and ask, "What happens if the power goes out?" This can be done even if my system runs on premises. What happens if the power to the data center fails? Then I would reason about all of the things that actually happen. Or what happens if the mainframe system fails? Then I reason through all the potential impacts. Once I have those impacts, they tell me the significance to my business. For example, if I'm running a grooming service that is not super important and the mainframe fails, I may say, "Okay, I'll be down for a few hours, then recover," versus if I'm running some sort of critical payment system, where I may need a high level of recovery. What we haven't really talked about is the very basic choices. The first choice I need to make is whether my system requires disaster recovery or high availability. It makes a huge difference in how I'm going to architect and how I'm going to recover.
An analogy would be: am I going to build an airplane, which is high availability, or am I going to build a car? The car would be disaster recovery. An airplane moves by having jet engines. Most airplanes, commercial ones at least, have two jet engines, and pretty much everything else is redundant too, so they can operate on a single engine. That means during an impairment or failure there is high availability because there is redundancy. With the car, if my tire gets deflated or impaired, I have to pull over. I have a spare tire in the back, so I'll take 30 or 40 minutes to replace the tire. That would be, quote unquote, disaster recovery. So making those choices and knowing what's important is critical. Also, a lot of applications have recovery time objectives and recovery point objectives, which technically should be defined by the business continuity and disaster recovery documents. What I find quite often is that we as technologists have much more aggressive recovery time and recovery point objectives than the business necessarily needs. Sometimes this is just due to miscommunication, sometimes it's for other reasons. But deciding between disaster recovery and high availability would be the first step in a lot of these things. I would say disaster recovery is probably a little less complex to handle. With high availability, we're talking about being able to recover within minutes, not hours. That is much more difficult, and it may require different things from my application. To give you a simple example, if my recovery time objective is 12 hours, I may just decide not to do anything, because statistically any impairment that actually happens would probably be over within 12 hours. Or I can wait six hours before deciding that I want to do disaster recovery in a secondary location, whether it's an AWS Region or something else.
Whereas I may not have that ability if I have a critical application. For a critical application, with the airplane concept, the question is then about high availability versus cost. Cloud is provisioned on demand and matched to consumption: we match the demand with the number of resources. Using the airplane scenario, say I'm running in two Availability Zones and both are utilized at, let's say, 80% each. If I lose one, the traffic from the one I have, quote unquote, lost moves to the other. I'm going to have 160% of the traffic going through a single AZ, and I may not have the resources. What John was talking about earlier is the concept of static stability: I should have excess capacity in my Availability Zones. Excess capacity means extra compute that might be idle. In the airplane analogy, yes, it's definitely cheaper to build airplanes with one engine, but airplanes are so critical that if I lose that one engine, basically everyone on board perishes. Apply the same concept to a critical application: if I lose one Availability Zone, I might have reputational risk, I might have financial impact, and things like that. If the financial impact of 30 minutes of outage is, let's say, $20 million, and having static stability, having the excess capacity, is going to cost me $10 million a year, I will probably choose that every time. So it's a trade-off, and I think those basics also come into play. Once I have decided those things, all of the failure mode analysis follows. But I would actually suggest everyone play with the AI. It's actually pretty good at identifying, I would say, 70 to 80% of these failure modes.
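The capacity arithmetic behind Tariq's two-AZ example can be worked through in a short sketch. It assumes identical AZs and traffic that redistributes evenly across the survivors; the function names are hypothetical, and the dollar figures are the ones quoted in the conversation.

```python
def load_after_az_loss(az_count: int, utilization: float) -> float:
    """Per-AZ utilization after losing one AZ, with traffic spread evenly."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return utilization * az_count / (az_count - 1)


def max_safe_utilization(az_count: int) -> float:
    """Highest steady-state utilization that still fits after one AZ is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return (az_count - 1) / az_count


def breakeven_outages_per_year(outage_cost: float, annual_capacity_cost: float) -> float:
    """Outages per year at which excess capacity pays for itself."""
    return annual_capacity_cost / outage_cost


print(load_after_az_loss(az_count=2, utilization=0.80))
print(max_safe_utilization(2), max_safe_utilization(3))
print(breakeven_outages_per_year(20_000_000, 10_000_000))
```

With two AZs at 80% each, the survivor would need to carry 160% of its capacity. To be statically stable, each AZ has to run at no more than 50% with two AZs, or about 67% with three. On the quoted figures, the excess capacity breaks even at 0.5 outages per year, that is, one 30-minute outage every two years.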
[37:23]
Gillian Ford
Wow, that's a really good callout on Q Developer. So what are some other services that can help customers on AWS build resilient applications?
[37:34]
John Frimento
Yeah, shameless plug here: Amazon Application Recovery Controller can help. We have a capability called routing control to help shift traffic between Regions. And, more applicable to this conversation about single-Region, Multi-AZ architectures, we now have two really awesome capabilities: zonal shift and zonal autoshift. Both integrate with EKS clusters, Application Load Balancers, Network Load Balancers, and EC2 Auto Scaling groups. They allow you to essentially say, "Hey, something's going on in this Availability Zone, shift my work out of that AZ." That's a pretty unique capability of AWS. Then there's also Fault Injection Service. We should have talked more about how critical testing is. I mentioned that you need to test your recovery procedures, but testing is paramount to success when it comes to these types of things. Fault Injection Service lets you inject, they call them actions, certain symptoms of various failure modes and test how your application handles them. So there's a couple. I'll also call out a doc again. Tariq talked about using Q; if you'd rather just read something, you can read the doc I mentioned, the resilience analysis framework, which will guide you through that process and give you some ideas on different failure modes.
[38:59]
Tariq Makota
Yeah, just to add to John's point: John mentioned the services that are specifically intended to help me as a customer test, validate, or move my application when there's an impairment. The other aspect is the shared responsibility model in terms of resiliency: with some services, AWS takes on more of the responsibility than with others. To give you an example, if I'm using Amazon Connect, I pretty much just have to define and configure Connect, and everything else, compute, connectivity, et cetera, is managed by the Connect team. So the Connect team is taking more responsibility from the resilience perspective than I am, which makes my resilience easier. If I'm running containers, I have multiple choices. I can run containers on my own, I can use any one of the container services and manage my own fleets, or I can have the container services manage the fleets and worker nodes. In that analogy of static stability, if I use, for example, ECS Fargate or EKS Auto Mode, the new service that was released, then I'm shifting the responsibility for capacity management, for having enough capacity for me, to those services. Lambda would be a similar example. I can run the code on an EC2 instance, or, if it's some sort of request-response application, it can be moved to Lambda. Lambda manages the provisioning and capacity of those instances, and I just have to provide the code. So what I would say is that most of the services give the customer some level of resilience boost. In some cases I, as a customer, may need to know how those services actually operate and build my own resilient application on top of them. In other cases I might be able to hand that heavy lifting to a service like Amazon Connect and have minimal responsibility when it comes to the resilience of that application.
[41:30]
Gillian Ford
So it sounds like there are really a lot of things customers should think about, not just the framework for how they approach their resiliency strategy, but also the services that can help them. I would say that's a huge takeaway for me from this conversation. One last question for both of you, just some parting advice. John, do you have any parting advice for customers who are thinking about their Multi-AZ resiliency strategy?
[42:04]
John Frimento
Yeah, I think Multi-AZ is a great place to start. Just out of the box you get so many benefits, and I hope that didn't get missed in today's conversation. Out of the box, Multi-AZ is the best practice, and you can build a really reliable, resilient application with it. For critical workloads, which, again, if you look at a customer, is not every workload but usually a small percentage, there are things we talked about here that you can do to really get that next level of resilience, still within a single Region. And that saves you, like Tariq mentioned, some of the complexity of going multi-Region, which I think we may talk about another time. But yeah, those would be my parting thoughts.
[42:50]
Tariq Makota
Let me see. I would say question everything. So if I was.
[42:54]
John Frimento
There you go, that's a good one.
[42:56]
Tariq Makota
I would question recovery time objectives and recovery point objectives. Is what I'm being told accurate? Does it match my business continuity plan? Are my regulators actually asking for the level of resilience that I'm trying to implement, or is that adding additional complexity? The reason I said question everything from the beginning is that the more cogs in a wheel, the more likely the failures. So I want to simplify my architecture as much as possible. Having 30 services in my architecture might be cool, but those are 30 potential failure points. Statistically speaking, if I have 10, there are fewer ways to fail. So I would say question all the choices along the way. That then leads into the failure points we talked about: the resilience analysis framework is basically questioning myself about how my application can fail, what other things that support my application could fail, and how. So I guess it requires a high level of scrutiny, questioning everything along the way, to get to optimal results.
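Tariq's statistical point about 30 services versus 10 can be illustrated with the standard serial-availability calculation. The sketch assumes every service is a hard dependency on the critical path and that failures are independent; the 99.9% per-service figure is a hypothetical number chosen for the example, not one quoted in the episode.

```python
def composite_availability(availabilities: list[float]) -> float:
    """Availability of a chain of hard, independent dependencies.

    The application is up only when every dependency is up, so the
    individual availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


PER_SERVICE = 0.999  # hypothetical availability of each service

thirty = composite_availability([PER_SERVICE] * 30)
ten = composite_availability([PER_SERVICE] * 10)

print(f"30 services: {thirty:.4f}")
print(f"10 services: {ten:.4f}")
```

Under these assumptions, 30 hard dependencies give roughly 97.0% composite availability, while 10 give roughly 99.0%. Real systems do better when dependencies are redundant or optional, which is why keeping non-critical services off the critical path matters.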
[44:23]
John Frimento
Yeah, it may be obvious, but one last thing: it's not a one-time deal, right? You don't do it once and you're done, or spend a month doing it and you're good forever. Think about this as a continuous process.
[44:38]
Gillian Ford
Totally. And especially for customers who are on a single Region right now and are thinking, "Oh, maybe I actually do need to use multiple Regions." Which, yes, as John was hinting, is going to be another episode. So you'll definitely want to make sure that you are subscribed to the AWS Podcast so you can be notified when that is available. But thank you so much, John and Tariq, for being here on the AWS Podcast. This was super interesting. I learned a lot. I'm sure the listeners did as well.
[45:14]
John Frimento
Thank you. Appreciate it.
[45:16]
Tariq Makota
Hopefully we didn't scare anyone.
[45:19]
Gillian Ford
Well, hopefully we scared them enough that they're actually going to start looking at their resilience. So that was my secret motive.
[45:25]
John Frimento
There you go. Hopefully mission accomplished. Yes.
[45:29]
Gillian Ford
Awesome. Thank you so much, everyone.
[45:30]
John Frimento
Awesome. Take care, everybody.