
In this episode, Gillian Ford and Rajesh Prajapati, Solutions Architect at AWS, discuss the importance of establishing a resiliency baseline.
A
This is episode 694 of the AWS podcast released on November 11, 2024.
B
Welcome everyone to the AWS Podcast. I'm your host for today, Gillian Ford. And we've got an episode that applies to every single person who is listening today. I don't care if you're at a startup, you work at a large company, or something in between. There is something for you in this episode, because we are talking about resiliency and how every company should have a resiliency baseline, not just for your main core product, but even for other products, even internal applications. And I've got an expert, the man who wrote the paper on this, Rajesh. So Rajesh, please introduce yourself to the listeners here today.
A
Awesome. Thanks, Gillian, for having me. I'm Rajesh Prajapati, a Startup Solutions Architect. I've been working with AWS for five years, and specifically with startups and resiliency for over two years. My primary work involves startups, and part of it is to help them create a baseline for their resiliency framework. But I also work with broader AWS customers, ranging from startups and small and medium sized businesses to enterprises that are getting started on their resiliency journey. So, happy to be here.
B
Let's start to level set for everyone what it actually means for an application to be resilient.
A
So resiliency is the ability of a workload to continue its regular operation in spite of any disruptions. So as long as an application has the ability to resist or recover from disruption, it's considered part of the scope of resiliency.
B
So for any company that's listening, how can one really get started on establishing a resiliency baseline?
A
That's a great question. With most customers that I typically work with, the conversation starts with RTO and RPO, which are recovery time objective and recovery point objective. These are essentially metrics that help you understand what your existing resilience posture is, or the resilience posture you are striving to achieve. Essentially, these mean: how long from the event of a disruption is your workload going to take to recover, and how much data loss can you afford to either recreate or afford to lose? In terms of metrics, typically that's the conversation we start with. But what I'm striving for is that this conversation starts a little bit beyond that. So when you're talking about resilience, there are three distinct mental pillars, if you will. One is disaster recovery, where the RTO/RPO conversations happen. But in addition, there's also an HA conversation, which is high availability. And then there's continuous improvement, that is, whatever standards you are setting, the mechanisms you're developing, how do you make sure that that consistently stays with the organization and scales with your applications, your teams and the new features that you're building? So from a mental model perspective, when I'm thinking about resiliency, it's three things: HA, DR and continuous improvement.
B
These three things. So disaster recovery, high availability, and continuous improvement that scales. I know one of the principles that you talk about in the resiliency white paper, the Startup Resiliency Baseline, and even though it's for startups it can really apply to any company, is this concept of a Resiliency Lifecycle Framework. So can you dive into that a little bit more?
A
Absolutely. So one of the challenges that we've seen is that when we are talking about these three pillars, they kind of feel a little bit distinct. We know that there's some level of overlap between them, but how do we approach them in a way that we can include them in our day to day practices, our SDLC lifecycles, team culture and product development thought processes? So that's where the AWS Resiliency Lifecycle Framework comes into the picture. It essentially takes our decades of working with customers and building our own AWS services and puts it into a framework which helps us incorporate these best practices. The way we have designed the Resiliency Lifecycle Framework, it's a closed loop, flywheel kind of setup where you start with an initial point, you go through the whole cycle, and then you close out the loop so that you can understand what your closed loop feedback looks like when you're thinking about resiliency initiatives. And it also helps you further fine tune each of the different things that are required to have a highly resilient posture. So at a very high level, there are five distinct stages in the Resiliency Lifecycle Framework. Stage one is to set objectives. That's where your key metrics like RTO, RPO, maybe SLAs, SLOs, all of those come into the picture. The second stage is to design and implement for these objectives. Moving on, you go to stage three, where you think about your pre deployment and post deployment activities. Once that is set, then you go into your operations, where you set things like observability, event management and so on. Then the final stage is respond and learn. All the things you've done up to this stage, which is making sure you set the right RTOs, RPOs or design objectives, you design it properly, you create observability metrics, and amid all of that, events will happen. So how do you respond to those and make sure you learn from them? And once you do, you can use these learnings as an input for refining your objectives. And that's how it becomes a closed loop cycle, where you start with a shallow pass of going through your resilience posture, understanding where you find gaps, and then refining it over a period of time.
B
Yeah, I bet a lot of people are like, oh my gosh, there's a lot of different steps in these stages, and it sounds a bit theoretical. So I think it'd be really helpful if you could take this framework and explain it with an example. So let's use an example of a three tier architecture, maybe using Amazon EKS, which is Kubernetes.
A
So let's have a simplified setup where you have Route 53, CloudFront, EKS and then a backend database like RDS. Right. So let's start with stage one, which is defining objectives. One of the common approaches I've seen is teams might look at their existing infrastructure at a component level, the ability to recover from backups, or how quickly they can spin up a new environment. There is a component or workload driven RTO/RPO that customers might set. But this is the place where typically I say start with your business in mind first. So if you are, let's say, a retail company, you need to think about what the impact is, the financial impact, of any kind of downtime. That will help you create a quantifiable metric from a business perspective that says, hey, let's say our average orders per hour is X and our average revenue per order is Y. Then that will help you understand how much revenue impact you would have for a downtime of a given duration. Now that's where the business leadership would help you determine that at a business level, our revenue goals are XYZ, and that's why we need to make sure that our downtimes are within this time frame. That will help you navigate and set what an RTO and RPO would look like. So starting from a business mindset helps you define the required level of RTO and RPO, and also the internal service levels that would be driven from those. That's step one. Right. So the key thing over here is that when you're setting the objectives, they should not depend on the underlying services or technology stack that you're using. The technology stack has to be designed in order to meet the business objectives.
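To make the arithmetic Rajesh describes concrete, here is a minimal Python sketch of working backwards from revenue impact to a candidate RTO; the order volume, revenue per order, and acceptable-loss figures are hypothetical placeholders, not guidance.

```python
# Rough sketch of the business-impact math described above; all figures are
# hypothetical placeholders chosen for illustration only.

AVG_ORDERS_PER_HOUR = 1_200        # X: average orders per hour
AVG_REVENUE_PER_ORDER = 45.00      # Y: average revenue per order (USD)

def downtime_revenue_impact(downtime_hours: float) -> float:
    """Revenue at risk for a given outage duration."""
    return AVG_ORDERS_PER_HOUR * AVG_REVENUE_PER_ORDER * downtime_hours

# Work backwards from the revenue the business is willing to put at risk
# to a candidate recovery time objective (RTO).
max_acceptable_loss = 100_000.00   # set by business leadership (hypothetical)
candidate_rto_hours = max_acceptable_loss / (AVG_ORDERS_PER_HOUR * AVG_REVENUE_PER_ORDER)

print(f"A 1-hour outage costs roughly ${downtime_revenue_impact(1):,.0f}")
print(f"Candidate RTO: {candidate_rto_hours:.1f} hours")
```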
B
Okay, so they first define the business objective starting from the financial impact of having downtime. And that financial impact is then going to dictate the RPO and the RTO. And that's where it sounds like the intersection between the business and DevOps really comes together. So maybe can you elaborate a little bit more on what the role of DevOps is at this stage of the resiliency lifecycle?
A
Absolutely. So once you have your business level objectives, that will help you define the kind of planning you have to do for HA and DR, right, and also give you some indication of where you sit on the continuous improvement piece of it. So depending upon the objectives, let's say you have aggressive RTO/RPO requirements of less than four hours, or maybe one hour. Then the strategies you choose will be dictated by those RTOs and RPOs. Let me give an example from a DR perspective. There are at least four different strategies, broadly speaking. You have backup and restore, you have pilot light, you have warm standby, and then active-active. And on that spectrum, if you think about going from left to right, backup and restore is the least expensive one, but you can expect recovery times, and even recovery point objectives, within hours. And then as you move towards the right, you go from multiple hours to a couple of minutes to a couple of seconds. And in active-active, you're essentially having two primary applications running in parallel, where both can take traffic, right? So from left to right, your cost increases, but your RTOs and RPOs reduce. That's where you get into the strategic implementation of DevOps. So on the left, with backup and restore, you can have a lot more manual processes, complicated systems and runbooks that are executed manually. But as you go to a more rigorous RTO/RPO, which is going to pilot light, warm standby, active-active, your deployments, your backups, your automations, all of that increases significantly. So that's where the continuous improvement or the DevOps piece comes into the picture. That is, more and more stuff will have to be automated, and your DevOps lifecycle should take that into account. So that's where you start getting into stage two, which is design and implementation. Within the DevOps lifecycle of CI/CD, there are a number of factors that you would have to make decisions on. What pieces are fully automated? Do you include testing in it? How frequently are you going to commit and do integration tests? How are you making sure that your infrastructure is immutable? What is your rollback strategy? To what level are you doing versioning? Are you doing canary deployments, blue/green deployments? For a lot of these factors, the decisions will be based on the level of DR strategy that has been chosen. So to summarize, the DevOps strategies that you choose will be driven by the level of RTO/RPO requirements and the disaster recovery strategy that you're choosing, in addition to improving your operational performance on a day to day basis.
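To make the left-to-right spectrum Rajesh describes easier to scan, here is a small illustrative mapping in Python; the RTO/RPO ranges are the rough orders of magnitude from the conversation, not official targets.

```python
# Rough illustration of the DR strategy spectrum described above.
# The RTO/RPO ranges are approximate orders of magnitude, not official AWS targets.
DR_STRATEGIES = [
    # (strategy,                  typical RTO/RPO,            relative cost, automation needed)
    ("backup and restore",        "hours",                    "lowest",      "manual runbooks acceptable"),
    ("pilot light",               "tens of minutes",          "low-medium",  "infrastructure as code expected"),
    ("warm standby",              "minutes",                  "medium-high", "automated failover expected"),
    ("multi-site active-active",  "seconds / near real time", "highest",     "fully automated, both sites live"),
]

for name, rto_rpo, cost, automation in DR_STRATEGIES:
    print(f"{name:<26} RTO/RPO ~{rto_rpo:<26} cost: {cost:<12} {automation}")
```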
B
And for the folks listening who are trying to visualize in their head the diagram that Rajesh was just explaining, there is a blog post that's out there on the AWS Architecture Blog. It's called Understand resiliency patterns and trade-offs to architect efficiently in the cloud. That blog post has a diagram that can help you visualize this concept, showing, from lowest to highest resiliency requirements, the different trade-offs along a lot of different axes. And I think it'll help answer some of those questions if you were thinking, wait, I don't fully understand this.
A
No, absolutely.
B
And same with the resiliency baseline paper that you wrote, which is called the AWS Startup Resiliency Baseline.
A
Great call out. Those are really good starting points when you're thinking about a resiliency baseline, because that's where you do the first few steps of really creating a baseline, which is understanding the key stages to help you set objectives. And then you go into your design and implement phase, where you choose the disaster recovery strategies and your DevOps practices. And the blog that you mentioned is a very good one to help you understand and visualize those strategies.
B
Yeah, this blog that shows the diagram, it shows you the dimensions across cost to implement, operational effort, effort to secure and other dimensions. It compares things across the extremes, from multi-AZ deployment to multi-region active-active, and then someplace in between. So I think this is a good place to start, especially if you're at a stage where you're not really sure what your RTO and RPO are. Everyone wants everything to be available all the time. But then it's like, okay, what's the reality of the cost, the complexity, what do we really need? Especially when it comes down to cost. So definitely take a look at this.
A
Once you've decided what your appropriate DR strategy looks like, there are a couple of other things that would be a good starting point. So setting your objectives addresses stage one of setting the baseline, right. In addition to setting your RTO/RPO requirements, it's not necessary that you set just those two. You can have other metrics that you set from a business perspective to give you a little bit more context and visibility. So in our stack and our retail operations, the example we have chosen, you can create more metrics that matter to you. Similar to the business objectives we were talking about: number of orders per second, revenue per hour, something like that. Right. So when you're defining objectives, it also helps to identify business level objectives in addition to those, and then to map your user stories, which is, as an end user, what are the kinds of features, the kinds of services, the kinds of business functions that act as a single workload? Mapping those also helps you get started quickly in the first stage of setting the objectives.
B
I really like that, thinking about how your KPIs are also important, with respect to the RTO and RPO, to really help you define what your objectives should be for your resiliency baseline. So now people are thinking, okay, I know what my KPIs are, I've defined my RTO, I've defined my RPO, I know what the financial impact is going to be. How can I actually test this in AWS and make sure that my application can meet those RPOs, RTOs and KPIs?
A
Yeah, that's a great question. So in order to test, there are actually two distinct ways and timings where you can test things out. You can test things pre production, before any workload hits production, and then post production as well. Let's quickly talk about some of the pre production things that you can do. Let's say you have a CI/CD process, or you're planning to build one. You can include something called integration tests that allow you to basically understand what dependencies exist or what functionalities may be impacted when a certain scenario happens. The recommendation is to use some kind of circuit breaker pattern or load shedding pattern. We do have different tools at your disposal that you can use. We have an AWS service called Fault Injection Simulator, or Fault Injection Service, FIS, which you can use to run the integration tests as part of your CI/CD. So when code is committed, you can run FIS to check how your application is going to behave. This allows you to test your large scale distributed applications in a managed service fashion, which a lot of our customers are finding pretty useful. You can also automate your deployment pipelines and include FIS as a stage in that process. And the last thing you should be doing is load testing. Make sure your pre prod environments do some form of load testing to make sure that there are no bimodal behaviors when the system goes under extreme stress in production, which it can from time to time, based on increased workloads or some kind of incorrect function of a particular component. So a couple of things that you can do for pre prod testing would be degradation testing, load testing and automated deployment pipelines. Now what about post production? In post production you want to conduct some resilience assessment tests. Similar to security, where you have tooling to ensure that your setup is meeting certain compliance standards that you have set internally, you want to ensure you are using some automated tool to do resilience analysis or resilience assessments. Within AWS, we have AWS Resilience Hub, where you can set your RTO/RPO requirements, and then the tool will look at your environment, run assessments and tell you whether your infrastructure currently meets those requirements or not. The great thing is you can also include this as part of your CI/CD processes. Conducting resilience assessments can be a big game changer, because now you can see, for your production environment, what your RTO/RPO looks like and whether it is meeting those targets. Again, you can use some tooling to test out your environment, and when you're getting started, obviously start in a pre production environment. But customers who have been practicing DR testing successfully can also do this at a low scale within their production environment. There are a couple of other things that you could do, everything from drift detection to synthetic testing. And for really advanced customer use cases, you can also do chaos engineering to some extent, where you inject an evaluated risk with a known steady state and see how the application behaves. You want to make sure that you're not doing random tests where you don't know exactly what the system is going to do. Right. These are pre evaluated, pre vetted tests with an expected outcome that you are expecting the system to produce.
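As a rough sketch of using FIS as a CI/CD gate the way Rajesh describes, here is a minimal boto3 example; the experiment template ID, region, and timeout are hypothetical placeholders for a fault experiment you would have already defined in FIS.

```python
# Minimal sketch of running an AWS Fault Injection Service (FIS) experiment as a
# CI/CD gate, using boto3. The experiment template ID is a hypothetical placeholder
# for a template created beforehand (e.g. "terminate one EKS node group instance").
import time
import boto3

fis = boto3.client("fis", region_name="us-east-1")

EXPERIMENT_TEMPLATE_ID = "EXT1234567890abcdef"  # placeholder, not a real template

def run_fault_injection_gate(template_id: str, timeout_s: int = 900) -> bool:
    """Start an FIS experiment and fail the pipeline stage if it does not complete."""
    experiment = fis.start_experiment(experimentTemplateId=template_id)
    exp_id = experiment["experiment"]["id"]

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = fis.get_experiment(id=exp_id)["experiment"]["state"]
        if state["status"] in ("completed", "stopped", "failed"):
            # Only a cleanly completed experiment passes the gate.
            return state["status"] == "completed"
        time.sleep(30)
    return False

if __name__ == "__main__":
    assert run_fault_injection_gate(EXPERIMENT_TEMPLATE_ID), "Resilience gate failed"
```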
B
Let's go over architecture. Are there architecture decisions, Rajesh, that would make this simpler for someone to achieve their RTO and RPO? Like, for example, maybe if they were to use Amazon Aurora or Amazon DynamoDB?
A
Absolutely. When you're choosing your technology stack for meeting a particular requirement, there could be a number of variables impacting it. It could be your legacy infrastructure, your legacy tooling and applications, your teams' comfort with particular tooling. But if you have an opportunity to start greenfield, where you have the option to choose any service or tool, look at what kind of functionality they offer. If you're thinking about databases, the common routes are managed databases or self managed databases. The pros and cons are: with self managed, you have a lot more tuning capability, a lot more things are under your control, but at the same time that requires a lot more operational expertise and ability to execute. With managed databases, you get an out of the box experience where, depending upon what the managed service offers, certain aspects are taken care of by the provider. So when you're thinking about AWS managed databases like RDS, DynamoDB and so on and so forth, all of these come with a lot of resiliency features built in. Typically, when I see customers who want to start with a solid resiliency foundation, for smaller organizations I do tend to prefer going with AWS managed services, especially regional services like S3. The reason is AWS takes care of building resiliency postures and resiliency functionality into these services for common failure modes such as a single host or a single hardware issue, or even a single AZ issue, right? So S3 automatically replicates your data across multiple Availability Zones. Similarly, choosing the right service can take care of making sure that you are resilient against common fault modes. So let's say you're thinking about deploying containers. A couple of options you have within AWS, in our scenario we talked about EKS, so a customer could do self managed Kubernetes, EKS or ECS, and within these platforms you also have the option of Fargate. Now, what's the difference between choosing EKS versus ECS? Let's say with ECS Fargate, AWS takes care of making sure AWS has enough capacity so that when you are scaling you don't run into an insufficient capacity error, for example. Or if there is a problem with a given AZ, or an AZ is not operating efficiently, AWS takes care of weighing traffic away from that AZ so that your workloads are not impacted. But when you go to EKS, now you have to be more cautious: how many AZs are you deploying into, what's the capacity, and if a particular AZ cannot handle traffic, could the other AZs you're using take that load? Right. So as you move from managed to self managed, the number of parameters, variables and fault modes that you have to design for increases. Now, this could be a good or a bad thing. If you're a customer who has the ability to map out these different fault modes, identify how many you want to tackle and then design for that, then going more semi managed or self managed would be a good approach. But if you're a customer who has certain applications where you want to rely on a provider like AWS to handle some of these basic resiliency checks or fault modes, then going for managed services really helps. Another good thing about AWS managed services is that a lot of these come with built-in functionality to get to a higher resilience posture. Think about ECR, for example, which is the Elastic Container Registry. It allows cross region replication.
So let's say you're building an infrastructure where you choose a pilot light kind of setup, where you want to have your primary application running in region one, but you want to have a pilot light setup in region two in case region one is not operating to your expected levels. You could use the cross region replication functionality of these AWS managed services. S3, for example, ECR, RDS, all of these services have natively built, one click solutions that allow you to design a multi region setup very quickly. So if you know from your previous stages that you will need to go with a pilot light or warm standby or maybe active-active setup, choosing the right services will make it a lot easier to develop the infrastructure that can meet those requirements. For self managed applications, you would need a much higher level of complexity and expertise to meet the standards that you have set.
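As one small illustration of the kind of built-in cross-region replication Rajesh mentions, here is a boto3 sketch enabling Amazon ECR registry replication to a standby region; the account ID and regions are hypothetical placeholders.

```python
# Sketch of enabling cross-region replication for Amazon ECR via boto3.
# Account ID and regions are hypothetical placeholders.
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")   # primary region

ecr.put_replication_configuration(
    replicationConfiguration={
        "rules": [
            {
                "destinations": [
                    {
                        "region": "us-west-2",         # pilot-light / standby region
                        "registryId": "123456789012",  # your own account ID (placeholder)
                    }
                ]
            }
        ]
    }
)
```

S3 and RDS expose comparable replication features, so the standby region can hold current copies of both images and data before any DR event occurs.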
B
I think it'd be helpful, Rajesh, if we could go over these architecture patterns that we've been talking about this whole time, from pilot light to active-active. So can you break down for us what each of the different disaster recovery architecture patterns is?
A
Absolutely. So when you're thinking about the different disaster recovery options in the cloud, broadly speaking there are four: backup and restore, then pilot light, warm standby, and multi-site active-active. Now, as you move from backup and restore to multi-site active-active, your RTO/RPO goes from hours to tens of minutes to a couple of minutes and then almost near real time. So what's the difference between these? At a very high level, in backup and restore, with the kind of architecture you're developing, you can afford to have an RTO/RPO of hours. So to make it cost effective, you essentially make sure that you have backups of both your data and your architecture, your entire infrastructure stack essentially, and you have the ability to restore this from backup. So let's take our scenario where we have Route 53, CloudFront, EKS and RDS. And I think this is also kind of important when you're talking about HA and DR: within HA, you are trying to make sure that the workload is resilient in your primary site. But for DR, you're essentially talking about a secondary site or a secondary location. So for our practical discussion, let's assume that we are talking about a region as your primary site and a secondary region as your DR location. So for backup and restore, think of a scenario where your region one has some kind of impairment, and you, based on your metrics, have decided that it's time to invoke the DR plan. Essentially, you should be able to restore the entire infrastructure, including your data stores, into the secondary region. So you have to think about how we take backups of our RDS, how we ship them to a secondary region, and what about the rest of the infrastructure? Are we using infrastructure as code? Are we doing manual deployment? This somewhat ties into the DevOps decision as well: if you have, let's say, 12 hours to do something and your infrastructure is pretty simple, where you can create it manually, well, awesome, you can just create everything in a secondary region. But you might feel like, okay, we still have 12 hours, but we don't want to manually deploy things, and that's where you can have partial infrastructure as code. And what you will observe is that as we get more aggressive about how quickly you want to restore, infrastructure as code and automation become a requirement. Right? So with backup and restore, the idea is to rebuild the environment in a secondary region completely from scratch and restore the databases and data stores from a previous backup. The second strategy, which has a much more aggressive RTO/RPO, on the order of minutes, is pilot light. In pilot light, you essentially create your foundational infrastructure in a secondary region. So in this example, let's say we had the VPCs, we had the subnets, we have the proper networking setup in the secondary region, and the compute is going to come up on demand when we invoke the DR plan. And because we need something on the order of minutes, our data stores are most likely going to be replicating through continuous replication. So if you're using RDS, we probably need a read replica in the secondary region to restore from. Right. So now your database, which under backup and restore relied on backups, moves to a continuous replication strategy. So that's the difference between pilot light and backup and restore. Now, what's the difference between pilot light and warm standby?
They're quite similar in nature. You do have your foundational infrastructure, your data stores are replicating, but in addition you're also running your compute, so that when a disaster occurs you can simply switch your traffic to the secondary region, take whatever actions you have to in order to promote your databases, switch traffic, update some DNS records, and the secondary site is basically ready to take traffic fairly quickly. So all three of these are more active-passive kinds of setups. And the last one is multi-site active-active, where you have two different regions which are both in active production and taking traffic.
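As a rough sketch of the pilot-light data layer Rajesh describes, here is a boto3 example that keeps an RDS read replica in the standby region and promotes it during failover; the instance identifiers, account ID, and instance class are hypothetical placeholders.

```python
# Sketch of a pilot-light data layer: a cross-Region RDS read replica in the
# standby Region, promoted to primary during a DR event. Identifiers are placeholders.
import boto3

# Client in the *secondary* (DR) Region; the source instance is referenced by its ARN.
rds_dr = boto3.client("rds", region_name="us-west-2")

REPLICA_ID = "orders-db-replica-usw2"
SOURCE_ARN = "arn:aws:rds:us-east-1:123456789012:db:orders-db-primary"  # placeholder

def create_pilot_light_replica() -> None:
    """Continuously replicate the primary database into the DR Region."""
    rds_dr.create_db_instance_read_replica(
        DBInstanceIdentifier=REPLICA_ID,
        SourceDBInstanceIdentifier=SOURCE_ARN,
        DBInstanceClass="db.r6g.large",
    )

def promote_replica_for_failover() -> None:
    """Run only when the DR plan is invoked: make the replica a standalone, writable primary."""
    rds_dr.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)
```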
B
That was like a textbook, amazing explanation. And for those who are listening and wondering where they can read more about this: in the Well-Architected Framework, this falls under the reliability pillar, so I definitely suggest checking it out. And there is even a section that's literally just on disaster recovery within that pillar that dives into what Rajesh was just saying in a lot more detail.
A
Yeah, absolutely. And if a listener has access to their laptop or phone and they pull up disaster recovery strategies in the cloud, or disaster recovery options in the cloud, they'll probably end up on a page which will make it a lot easier to visualize, because we have some good architecture diagrams as well. There are semi-transparent icons that help you visualize what things are running all the time and what things have to be deployed on demand when you execute your DR. So it's definitely a very good resource.
B
And how can customers monitor their resiliency? We can break that into two categories: the high availability aspect and then the disaster recovery aspect.
A
The good thing is, this is where the lifecycle framework that we talked about really helps again, because this goes into stage four, which is operations. A big part of it is observability. Essentially, you need to set up some kind of observability tooling to see what your current metrics look like. And then you are not only monitoring your operational stuff, which is how many requests you're getting or the utilization of resources on different components, but you also create some aggregate metrics for your application, your workload, and then more at a business level. You essentially want to get more visibility into how the workload as a whole is operating, in addition to the component level visibility. There are a lot of AWS tools that can help, and also AWS Partner Network tools that can help you get more visibility. A common thing that you can use is CloudWatch, but in addition to that you could also do synthetic monitoring. You can see what the end user experience looks like, and if there is any drop, that's where you need to have proper alarms. Now, some of the things to keep in mind over here: the observability tooling that you set up also impacts your overall operation. So let's say you have a recovery requirement of an hour, your checks are happening every five minutes, and you've decided it's going to take three failed checks to know if something is not operating properly. You're looking at a timeframe of 15 minutes. So the granularity of the observability that you are setting impacts the mean time to detection, and that also impacts how much time is left to execute the entire operation. This applies to both, right? From an HA perspective, if you want to fail fast, you can have much more aggressive health checks. And from a DR perspective, having more aggressive health checks gives you more visibility into how quickly you can execute something. And there are obviously some trade-offs that you have to consider in terms of complexity, cost and effort. Typically, the way I recommend customers start is with a shallow pass of the metrics that you want to set up, with somewhat relaxed health checks, and then refine it over a period of time. Because as you start getting more aggressive, that's where you have to design for false positives and false negatives, and then your health check engineering becomes important.
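To illustrate how health-check granularity eats into the recovery budget, here is a boto3 sketch of a CloudWatch alarm with the five-minute period and three-datapoint threshold from the example; the metric namespace, metric name, and SNS topic ARN are hypothetical placeholders.

```python
# Sketch of a business-level CloudWatch alarm whose evaluation settings determine
# worst-case detection time. Namespace, metric name and SNS ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

CHECK_PERIOD_S = 300        # checks evaluated every 5 minutes
FAILED_CHECKS_TO_ALARM = 3  # three failed checks before alarming
# Worst-case detection is roughly 3 * 5 = 15 minutes of a 60-minute recovery budget.

cloudwatch.put_metric_alarm(
    AlarmName="orders-api-availability",
    Namespace="Retail/Workload",             # hypothetical custom namespace
    MetricName="SuccessfulOrdersPerMinute",  # hypothetical business-level metric
    Statistic="Sum",
    Period=CHECK_PERIOD_S,
    EvaluationPeriods=FAILED_CHECKS_TO_ALARM,
    DatapointsToAlarm=FAILED_CHECKS_TO_ALARM,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",            # no data is treated as a failure
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```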
B
This has definitely been a really useful conversation. I think we've been able to understand the importance of having a resiliency baseline, being able to work backwards from the financial impact, and being able to then take these concepts and understand how you can actually implement them continuously in AWS. Any other piece of advice? I mean, you've probably spoken to hundreds of companies of different sizes at this point, all about their resiliency posture.
A
It's pretty common for me to see customers who have thought about the first four stages. I think the last stage is pretty important as well, which is how you respond when something happens. So when you're setting up the observability, you are going to create some dashboards, you're going to set up some alerts that are going to alert your operations team, but you also want to design your reactions in those situations. So let's say when an event is triggered, do you have the right runbooks to execute, and does that runbook have enough information to record what's happening during that event, right before the event, and after the event is completed? Because the one thing that helps you close this entire loop on resiliency and really get that end to end visibility is doing some form of incident analysis report. Within Amazon, we use a process called Correction of Errors. Essentially, after every event we look into what led to this event, what was done during the event, and how we can avoid this in the future. Without that piece of incident analysis, you might run into a situation where you have an open loop system, where you fix an issue and it's one and done. A different team, at a different point in time, on a different application, might run into something similar. And we want to ensure that, as an organization, you have some way to make sure that for the event that happened, the root cause analysis and the incident report are created and shared with the rest of the teams. That allows them to incorporate the recommendations or the resolution action steps and proactively avoid similar faults in the future. It also helps to do ongoing operational reviews: whatever events have happened, what the status of those analysis reports is, and then feeding it into the product teams to make sure that corrective actions have been taken. A lot of the time that could be very simple, like we need to tune some alarms, we need to reduce false positives, we need to reduce false negatives, or the same event generated multiple alerts and multiple teams were engaged, so how do we avoid duplicate alerts? All of this is part of responding and learning from it, and then using this to create some kind of closed loop mechanism for training and enablement, or creating an incident knowledge base that allows you to spread this across the different teams. And when the same knowledge is spread through a mechanism, you will see that, intuitively, the RTOs and RPOs that you have set could be much more aggressive. Then you can reevaluate: hey, let's say our RTO/RPO is set to 12 hours, but with the entire pass of the framework that we talked about, we are realistically able to now meet four hours. Well, that's when you, as a business, can decide whether it makes sense to get more aggressive. And if so, then let's start looking at a second pass of this lifecycle. So that's how it all comes together.
B
That is such a good call out, creating a Correction of Errors document. I've seen the same thing. I think it's something that often is overlooked in this process. And there's a great blog post if you're someone that hasn't made that journey. It's called Creating a Correction of Errors document. It's on the AWS Cloud Operations and Migrations blog. It's really good. It also talks about how you can use AWS Systems Manager as an option to help you as part of your correction of errors process.
A
This doesn't have to be very complicated if you don't have a system, right? I work with a lot of startups who say, yeah, but we do not have the resources, bandwidth or the right dedicated tooling to do something like that. It doesn't have to be very fancy or complicated when you're starting it for the first time. It could be a simple ticketing system with a particular tag, or it could be a wiki where you're maintaining a list of incidents and all the different topics that I talked about. So you can start with your existing systems, your ticketing system, wiki, bug tracking features or internal shared tools, whatever you're using, and create that as a baseline. Like, okay, as a team, this is what we start with. And then over a period of time, as the team adopts this culture, this thought process, this framework, that's when even this stage will start getting prioritized.
B
So good. Wow. Rajesh, thank you so much for being here on the AWS Podcast today.
A
Awesome. My pleasure. Thanks for having me.
Podcast Information:
In episode #694 of the AWS Podcast, host Gillian Ford engages in an insightful discussion with Rajesh Prajapati, a Startup Solutions Architect at AWS with five years at the company and over two years focused on resiliency for startups and a diverse range of AWS customers. The episode delves into the critical topic of application resiliency, offering listeners a comprehensive understanding of establishing and maintaining resilient systems in the cloud.
Notable Quote:
[00:09] Gillian Ford: "We are talking about resiliency and how every company should have a resiliency baseline not just for your main core product, but even for other products..."
Rajesh begins by clarifying what it means for an application to be resilient. Resiliency refers to an application's ability to continue regular operations despite disruptions. This encompasses both resisting and recovering from potential disruptions, ensuring minimal impact on business operations.
Notable Quote:
[01:23] Rajesh Prajapati: "Resiliency is the ability of a workload to continue its regular operation in spite of any disruptions."
The conversation transitions to how organizations can establish a resiliency baseline. Rajesh emphasizes starting with Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) to gauge the current resilience posture and set achievable targets. However, he advocates for a broader approach that includes High Availability (HA) and continuous improvement alongside Disaster Recovery (DR).
Key Points:
- RTO (recovery time objective) measures how long it takes a workload to recover; RPO (recovery point objective) measures how much data loss is acceptable.
- A complete resiliency baseline covers high availability, disaster recovery, and continuous improvement, not just RTO/RPO metrics.
Notable Quote:
[01:49] Rajesh Prajapati: "There are three distinct pillars: disaster recovery, high availability, and continuous improvement."
Rajesh introduces the AWS Resiliency Lifecycle Framework, a structured approach incorporating best practices to embed resiliency into daily operations. The framework is depicted as a closed-loop flywheel comprising five key stages:
1. Set objectives (RTO, RPO, SLAs/SLOs)
2. Design and implement
3. Pre- and post-deployment activities (evaluate and test)
4. Operate (observability, event management)
5. Respond and learn
Notable Quote:
[03:34] Rajesh Prajapati: "The AWS Resiliency Lifecycle Framework... helps us incorporate these best practices into our day-to-day practices."
To elucidate the framework, Rajesh uses a three-tier architecture example involving Route 53, CloudFront, Amazon EKS (Kubernetes), and an RDS backend database. He walks through each stage of the framework, starting with setting business-driven objectives, designing the infrastructure to meet these goals, and implementing continuous improvements through DevOps practices.
Notable Quote:
[06:24] Rajesh Prajapati: "Start with your business in mind first... the technology stack has to be designed in order to meet the business objectives."
Rajesh breaks down the various disaster recovery (DR) architecture patterns, each offering different levels of RTO and RPO:
Backups and Restore: the least expensive option; the full stack and data are rebuilt in a secondary region from backups, with RTO/RPO measured in hours.
Pilot Light: foundational infrastructure (networking, continuously replicated data stores) is kept in the secondary region and compute is launched on demand, bringing RTO/RPO down to tens of minutes.
Warm Standby: like pilot light, but compute is already running at reduced scale, so traffic can be switched in minutes after promoting databases and updating DNS.
Multi-Site Active-Active: both regions serve production traffic in parallel, giving near real-time RTO/RPO at the highest cost.
Notable Quote:
[25:08] Rajesh Prajapati: "As you move from backups and restore to multi-site active-active, your RTO/RPO goes from hours to near real-time."
Selecting appropriate AWS services is crucial for simplifying resiliency implementations. Rajesh highlights the advantages of managed services like Amazon Aurora and DynamoDB, which come with built-in resiliency features, versus self-managed services that offer more control but require greater operational expertise.
Key Considerations:
- Managed services handle common failure modes (single host, hardware, or AZ issues) on your behalf; self-managed options offer more control but require you to design for more fault modes.
- Services with native cross-region replication (S3, ECR, RDS) make pilot light, warm standby, and active-active setups much easier to build.
Notable Quote:
[19:37] Rajesh Prajapati: "Choosing the right services will make it a lot easier to develop the infrastructure that can meet those requirements."
Effective resiliency requires thorough testing both pre-production and post-production. Rajesh outlines strategies using AWS tools:
Pre-Production Testing: integration tests with AWS Fault Injection Service (FIS) in the CI/CD pipeline, degradation and load testing, and automated deployment pipelines.
Post-Production Testing: resilience assessments with AWS Resilience Hub, drift detection, synthetic testing, and, for advanced teams, controlled chaos engineering against a known steady state.
Notable Quote:
[15:57] Rajesh Prajapati: "AWS Resilience Hub... allows you to set your RTO/RPO requirements and assess whether your infrastructure meets those needs."
Monitoring is split into two aspects: High Availability and Disaster Recovery. Rajesh emphasizes the importance of observability tools like Amazon CloudWatch to track both operational metrics and business-level KPIs. He also discusses the impact of monitoring granularity on detection times and the subsequent response.
Key Points:
- Monitor component-level, workload-level, and business-level metrics, not just infrastructure utilization.
- Health-check granularity drives mean time to detection, which consumes part of the recovery budget.
- Start with a shallow pass and relaxed health checks, then refine to manage false positives and false negatives.
Notable Quote:
[30:55] Rajesh Prajapati: "You essentially want to create dashboards, set up alerts that notify your operations team, and design reactions for those situations."
The final stage of the framework focuses on responding to incidents and learning from them. Rajesh introduces the concept of a "Correction of Errors" document, which involves conducting root cause analyses post-incident and implementing corrective actions to prevent recurrence. This fosters a culture of continuous improvement and knowledge sharing across teams.
Key Points:
- Maintain runbooks that capture what happened before, during, and after an event.
- Share root cause analyses across teams through an incident knowledge base and ongoing operational reviews.
- Feed corrective actions back into product teams so similar faults are avoided proactively.
Notable Quote:
[34:08] Rajesh Prajapati: "Creating a Correction of Errors document allows you to incorporate recommendations and proactively avoid similar faults in the future."
The episode concludes with Rajesh and Gillian reiterating the importance of a structured resiliency approach and recommending several AWS resources for further learning.
Notable Quote:
[37:15] Gillian Ford: "Where can I read more about this? In the Well-Architected Framework... I definitely suggest you checking it out."
Additional Recommended Resources:
- AWS Startup Resiliency Baseline (whitepaper)
- Understand resiliency patterns and trade-offs to architect efficiently in the cloud (AWS Architecture Blog)
- AWS Well-Architected Framework, reliability pillar (disaster recovery section)
- Disaster recovery strategies in the cloud (documentation with architecture diagrams)
- Creating a Correction of Errors document (AWS Cloud Operations and Migrations blog)
This episode offers a thorough exploration of establishing a robust resiliency framework within AWS, blending strategic planning with practical implementation guidance. Whether you're a startup or an enterprise, the insights shared by Rajesh provide valuable strategies to enhance your application's resilience against disruptions.