
With only nine months to launch Max, Tom Leaman, VP of Site Reliability Engineering at Warner Bros.
Simon Elisha
This is episode 722 of the AWS Podcast, released on May 26, 2025.
Tom Lehman
Hello, everyone, and welcome back to the AWS Podcast and a very special episode, another in our series of the Frugal Architect. And of course, I'm joined by our CTO and living legend, Werner Vogels. G'day, Werner.
Werner Vogels
Farewell. Thank you, Simon.
Simon Elisha
Thanks for everything.
Tom Lehman
He didn't know I was going to say that. That's what I like about Werner. I can still keep him surprised, even after all these years. And we're joined by a very special customer guest today: Tom Lehman, the vice president of site reliability engineering at Warner Brothers Discovery (WBD), where he leads teams responsible for ensuring the reliability, scalability and operability (lots of abilities) of WBD's global technology platforms that support entertainment brands such as Max, Discovery Plus and Bleacher Report. Before that, he had some really extensive leadership roles at Audible, at Vanguard, et cetera. He's done more in, I'd say, DevOps and site reliability than most of us have had hot dinners. Welcome to the podcast, Tom.
Simon Elisha
Thank you very much, Simon.
Werner Vogels
So, actually, the former Amazonian.
Simon Elisha
Yes, yes. I had a brief stint at Audible. So you'll notice as we get into it, I'll probably bring up a lot of different Amazonian activities and actions, but it's great to be here, Simon and Werner.
Tom Lehman
You're amongst friends here. And before I even start, I just want to thank you. I'm in Australia and Max launched recently and I'm a subscriber and it's working perfectly. So, you know, it's always fun in these conversations, I think, and you'll agree with me, Werner, to dig behind the details of the services we kind of take for granted every day. We're going to do that today, aren't we?
Werner Vogels
Yeah, well, look at the past days in Portugal and Spain. Whole countries without power. Yet our AWS data centers kept on humming perfectly. Nobody could watch any of these videos, of course.
Tom Lehman
Everyone's sitting around campfires.
Werner Vogels
We're fine. Yeah, no database corruptions or anything like that. And I think after 20 years, I mean, we know how to plan and how to prepare for this. And I think actually that's also a very large part of what the conversation is going to be about today, because it's not only about planning and execution, it's also about keeping cost in mind while you do it.
Tom Lehman
Well, Tom, let's start with that before we get techie. And believe me, we're going to get techie. Tell us a bit about your role at Warner Brothers Discovery. What does the VP of Site Reliability Engineering even do? What does that role entail?
Simon Elisha
Sure. So Simon, in Site Reliability Engineering at WBD, our mission is to ensure that our content will always be there, always ready to play the moment any viewer presses the watch button. Our goal is to provide the customer uninterrupted and efficient experiences on all of WBD's digital platforms. And in the case that an issue does come up, we want to be able to mitigate it relatively quickly, either through automated means or, if manual operator intervention is necessary, by making sure operators can go in and fix the issue ASAP.
Werner Vogels
So hand in hand with that goes having insight into your systems, because it's not just a matter of operations; there's also a data flow going back to them. I mean for frugality, for example.
Simon Elisha
Oh absolutely. Observability and operational intelligence are absolutely essential to everything that we do. So a big part of what our team facilitates is the standardization of how we visualize all the operational data associated with our global deployments. We have hundreds and hundreds of microservices operating across nine separate regions in AWS right now. And as a part of that process, we need to be able to make sure that we have an understanding of how healthy each of those microservices happens to be, along with all their critical dependencies, their databases, what connections they happen to have, and how traffic is flowing cross-region and within our Kubernetes clusters. It is a lot of data to synthesize and understand. So a big part of our job is to understand exactly how that data flows and the overall health of the system, and to make that tangible and understandable to a broad engineering organization as well.
Tom Lehman
And it's interesting you talked about the scope here. I think there are some elements of scope that are worth picking up, because it's different and somewhat daunting. Firstly, you've got millions of customers for whom any delay, any break in transmission is a frustration in what is a highly competitive market. And it sounds to me that you're really trying to take a standardized approach rather than a firefighting approach. Because I'm guessing at any given time there's all sorts of weird stuff going on and you could just get lost in that, particularly in a leadership position like yours, where you're trying to create this umbrella architecture that can help support that. Help us a bit more around that. Because you touched on the operational metadata schema. I think we want to dive into that first, because I think there's a bit of a mental breakthrough you've made there in terms of how folks can connect these systems together, because I think very often it's spreadsheets and documents and who knows? But you've gone way beyond that.
Simon Elisha
So to dive into that, and before we even get to the operational metadata piece, which is a fun journey to talk about by itself: for us, it all starts with the customer experience. At the end of the day, we're providing a product that, to your point, millions of customers interact with. And for us, what matters is the customer and what they're doing: whether they're getting errors, the stream quality of the different videos that they're watching. Not, hey, is container ABC573 running hot on CPU? That can be fine as long as the customer experience is completely uninterrupted in those spaces. So what we really take a look at, to start off with, is what we call our critical user journeys, right? Can a user log in? Can the user log in and then play back video? Can they browse? Do they get recommendations that are appropriate for them? And we start by trying to capture that information. We are in the process now of really structuring how we think about things like service level objectives and service level indicators so that they map back to those customer user journeys, and then how those journeys translate into the actual components and API calls on our backend, so that we can really trace that customer experience and what might be happening with our backend services, our systems, our databases, and draw the connection between them. When we start looking at this concept called the Operational Metadata Schema, which we created back when Discovery Plus was in the process of launching, originally around three or four years ago or so, the problem that we were trying to solve was a taxonomical one. We were trying to figure out how we actually get our arms around millions of cloud resources that had been deployed across dozens of AWS accounts. When we spun up Discovery Plus, a lot of our focus was: how do we engineer quickly?
How do we get capabilities out as quickly as possible to release the new Discovery Plus streaming product? Each team had effectively landed on their own way of cataloging their services and systems, and isolated in a single location, that was fine for their operational needs. But as we started to wrangle things from a platform and from an overall product perspective, that started to fall apart. It would be the equivalent of using mailing addresses where everybody had a different term for street or a different term for city, and then populated it with different information based on where they are. When you think about something as simple as a mailing address, the power of that and how we've standardized it in different countries is amazing. We can tie back not only critical notifications and communications; we can tie invoices, billing, utilities, all back to that same structure. And the same thing goes for how we identify and organize our cloud and distributed systems. So that was really what we wanted to target: how do we create our own mailing addresses that will be understood and standardized across the organization? That's what operational metadata started with.
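The idea of measuring critical user journeys rather than individual containers can be illustrated with a minimal Python sketch. It assumes a simple availability-style service level indicator (successful attempts divided by total attempts); the journey names, event shape, and targets below are hypothetical and are not taken from WBD's systems.

```python
from dataclasses import dataclass

@dataclass
class JourneyEvent:
    journey: str      # hypothetical journey name, e.g. "login" or "playback"
    succeeded: bool   # did the customer complete this attempt without an error?

def journey_sli(events, journey):
    """Fraction of attempts at one journey that succeeded."""
    attempts = [e for e in events if e.journey == journey]
    if not attempts:
        return None  # no traffic: the SLI is undefined rather than 100%
    return sum(e.succeeded for e in attempts) / len(attempts)

def meets_slo(sli, target):
    """True when the measured SLI is at or above the objective."""
    return sli is not None and sli >= target

# Illustrative, made-up traffic: 1 failed login in 1000, 5 failed playbacks in 100.
events = [JourneyEvent("login", True)] * 999 + [JourneyEvent("login", False)]
events += [JourneyEvent("playback", True)] * 95 + [JourneyEvent("playback", False)] * 5

for name, target in [("login", 0.999), ("playback", 0.99)]:
    sli = journey_sli(events, name)
    print(f"{name}: SLI={sli:.3f} target={target} ok={meets_slo(sli, target)}")
```

The point of the sketch is that the unit of measurement is the journey a customer attempts, so a hot CPU on one container only matters if it shows up as failed attempts.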
Werner Vogels
But of course you had a system that needed to get out quickly, and as such things grew, not necessarily in lockstep with each other. So how did you get everybody to more or less take a step back, stop and adopt your methodology?
Simon Elisha
So there was a sandwich approach that was applied. There was buy-in from not only our engineers, but also senior leadership. And we had to make a business case as to why this was highly valuable. A couple of the areas where it really came into play were security and infrastructure vulnerabilities, as well as cost management. We were partnered directly with our infosec team. And as we were going through and trying to identify where different vulnerabilities landed, we would easily have the AWS resource numbers that were available. We could tie that back. But then when we tried to tie that resource to an individual or team to actually take action on it, it became a very, very difficult process, because many of our teams at that point owned their own infrastructure as code, their own deployments, their own operations for their infrastructure. And more often than not, what we landed on was that the account owner of an AWS account would get landed with all these vulnerabilities. But it was actually 15, 20, 30 different teams that all lived within the same account. And the account owner was there saying, I don't know what to do with this, right? And then security is there saying, we need to fix this. So we had some really good motivating factors across the board, from a business perspective and from a security and risk perspective, that really clicked with a lot of folks. So we were able to get buy-in from our CTO at the time to sponsor a program. And we were able to start shining a spotlight on tagging compliance across all of our accounts. So we sent out weekly reports that would provide a listing: here's what is tagging compliant, here's what isn't, here is the breakdown of the different resource types. And then it made it a lot easier for us to be able to align on that compliance, and teams were able to start self-serving and build that out over time.
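A weekly tagging-compliance report of the kind described here could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the required tag keys, the resource records, and the report format are all hypothetical, and a real report would read resources from a cloud inventory rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical required tag keys; the real schema's key names are not given in the episode.
REQUIRED_TAGS = {"business_service", "service", "component"}

def is_compliant(tags):
    """A resource is compliant when every required tag is present and non-empty."""
    return all(tags.get(key) for key in REQUIRED_TAGS)

def compliance_by_type(resources):
    """Return {resource_type: (compliant_count, total_count)}."""
    counts = defaultdict(lambda: [0, 0])
    for resource in resources:
        counts[resource["type"]][1] += 1
        if is_compliant(resource["tags"]):
            counts[resource["type"]][0] += 1
    return {rtype: tuple(pair) for rtype, pair in counts.items()}

# Made-up inventory for illustration only.
resources = [
    {"type": "rds", "tags": {"business_service": "identity", "service": "user", "component": "user-api"}},
    {"type": "rds", "tags": {"service": "user"}},   # missing two required tags
    {"type": "s3", "tags": {}},                      # untagged
]

report = compliance_by_type(resources)
for rtype, (ok, total) in sorted(report.items()):
    print(f"{rtype}: {ok}/{total} compliant")
```

Breaking the counts down by resource type mirrors the report described above and gives each team a concrete list to self-serve from.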
Werner Vogels
So part of the metadata is not only all the resources that are being used at every place, but also who is responsible for them.
Simon Elisha
Exactly, exactly. So when we were building it out, a couple of the key items of metadata included a three tiered hierarchy of the functionality. Effectively we landed at a point where one tier of hierarchy saying, hey, this is the user service, for instance, didn't provide enough context to make it useful and also doesn't provide an ability to aggregate data in common ways. So we effectively created this three tiered system, starting off with a business service, then a service and then an individual component, where an individual component would map back to a microservice and its direct dependencies like databases, et cetera. So anything tagged with a component tag you could identify this is a microservice. A database like an RDS instance that supports that microservice would also have the same component tag. And then multiple components would roll up into a service, multiple services into a business service. And per Conway's law, our business services, services and components more or less mapped to our organizational structure. So it was easy to create some organization based reporting after the fact.
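The three-tier hierarchy described in this answer (business service, then service, then component) can be sketched as a small data structure. The following Python is only an illustration: the class name, tag values, and resource names are hypothetical, and the one behavior taken from the conversation is that a database supporting a microservice carries the same component tag as that microservice.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class OmdTags:
    """Hypothetical representation of the three tiers of the taxonomy."""
    business_service: str
    service: str
    component: str

def roll_up(resources):
    """Group resource names as business_service -> service -> component."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for name, tags in resources:
        tree[tags.business_service][tags.service][tags.component].append(name)
    return tree

user_api = OmdTags("identity", "user", "user-api")
resources = [
    ("user-api-deployment", user_api),
    ("user-api-rds", user_api),  # the database shares the microservice's component tag
    ("token-service-deployment", OmdTags("identity", "auth", "token-service")),
]

tree = roll_up(resources)
for business_service, services in tree.items():
    for service, components in services.items():
        for component, names in components.items():
            print(business_service, ">", service, ">", component, ":", names)
```

Because every resource carries all three tiers, the same records can be aggregated at whichever level a report needs.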
Tom Lehman
And then you've sort of, I guess, extended this from purely a performance and security landscape to suddenly really being able to get a grip on the efficiency part and the frugality part. And it's interesting to me that you're establishing a lot of measures already at the outset around the customer experience. But frugality, or cost, is often just a corporate thing that we monitor. How did you start to think about, firstly, what the SRE role could even do in that space? Because they're reliability, not necessarily the design and build side of things all the time. It depends. How did you unpack this? I think it's a fascinating sort of story of discovery.
Simon Elisha
Sure. So some of it was just natural evolution, to be frank, because when we originally started off with the OMD taxonomy, it was really about tagging infrastructure. And teams could go about doing that in any number of ways, such as through their own IaC. You could manually apply the tag. Some teams had to rely on that just based on how their systems were put together, but ideally it was done through IaC. And then once we created that, teams identified, hey, it would be really great if, for instance, we could start applying these tags to our actual code repositories, and then we can align not only the infrastructure, but also the repositories, the code, and the configuration that maps to that infrastructure. So now we can draw that alignment: if I need to update RDS instance or Dynamo table A, I can now map back to the repository that actually drives the configuration for that. And once we got that mapping together and we started to build out common CI/CD platforms that everybody ended up aligning to when we were building Max, we then made the connection: you create a repository, you create the mapping to the OMD schema, and then that CI/CD pipeline could not only align that OMD information to the eventual infrastructure that gets deployed, but also add it to the actual running jobs. So you can understand what jobs are associated with a component and which jobs are aligned to a business service. And then that got propagated out; it got added to our observability data, our metrics, our logs. So you have this tracing capability associated with our entire ecosystem, where you can map almost any piece of operational data back to that taxonomy. And much of that operational data then translates into your utilization, which then drives cost. So as this started to evolve over time, it effectively became a no-brainer for us to then start aligning costs and understanding these different efficiencies within the context of the OMD schema itself.
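Once cost data carries the same tags as everything else, aligning cost to the taxonomy becomes a simple aggregation. The Python below is a minimal sketch under assumptions: the line-item shape, tag keys, and dollar amounts are invented for illustration, and untagged spend is kept visible in its own bucket rather than dropped.

```python
from collections import defaultdict

def cost_by_level(line_items, level):
    """Sum cost per value of one taxonomy level, e.g. 'business_service' or 'component'."""
    totals = defaultdict(float)
    for item in line_items:
        key = item["tags"].get(level, "untagged")  # keep untagged spend visible
        totals[key] += item["cost_usd"]
    return dict(totals)

# Made-up billing line items; the tag keys are hypothetical.
line_items = [
    {"cost_usd": 120.0, "tags": {"business_service": "playback", "service": "streaming", "component": "manifest-api"}},
    {"cost_usd": 80.0,  "tags": {"business_service": "playback", "service": "streaming", "component": "manifest-db"}},
    {"cost_usd": 50.0,  "tags": {}},
]

print(cost_by_level(line_items, "business_service"))
print(cost_by_level(line_items, "component"))
```

The same function works at any tier because every tagged item carries all three levels.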
Werner Vogels
So when you think about the CI/CD pipeline, do you then do dynamic tagging for those resources that go, let's say, to staging or to testing? I mean, because those are again different resources than the ones you use in production.
Simon Elisha
Absolutely, absolutely. So the CI/CD pipelines are now shared across all of our services that get deployed to this common SaaS platform that I was talking about earlier. Now, no matter what environment, we have those tags applied, and when those tags get applied, again, it gets into the operational data. And we can see that across dashboards in Grafana, our Prometheus data, and our OpenSearch logs as well.
Tom Lehman
When it comes to cost, you and your team did something that I think is really interesting and often counterintuitive: you chose a deceptively simple measure for efficiency, which I think, once you pick away at it, becomes complicated, as with many things that we do in business. You chose cost per subscriber as your efficiency measure. Unpack that for us. It's a very quick sentence to say, but I reckon there's a lot there.
Simon Elisha
Yeah, yeah. Well, part of it is an acknowledgement that there's a unit cost economics associated with any type of cloud platform, any platform really, whether you're using bare metal and your own data centers or you're out in the cloud. It really just depends on how much of it can be utility priced versus how much of it is static, fixed cost if you happen to be in your own data center where you have to position your hardware. But ideally your costs should be driven by some factor. Something should be in that driver's seat. And when we started looking at how to drive FinOps initiatives out of the site reliability engineering organization, we realized throwing out static dollar figures would be very difficult in an organization trying to grow its streaming business. We knew that as we acquired new subscribers, costs were inevitably going to go up in some way, shape or form. That's actually a really, really good thing, right? So we wanted to find some way to balance and understand what our efficiency was, with an understanding that costs were going to grow in some way, shape or form. So after we launched Max in the US back in 2023, we decided, hey, we're going to start tracking cost per subscriber. We had plans for expanding in the EU, APAC and then a number of different countries within those regions in the coming years. So every time we would launch, there would be a surge of new infrastructure being built out. There would be new subscribers migrated from HBO as well. So we were able to track all of those customer acquisitions or migrations, and that would offset the cloud cost, or effectively become the denominator below it. And then that helped us understand: are we actually building a system that is just as efficient, if not more efficient, than some of the products and platforms that had come before it?
So eventually we were able to track cost per subscriber not only for Max, but for the predecessor HBO Max as well as Discovery Plus. And we could do a comparison. Is this new platform that we're building really more efficient than what we've built in the past?
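The cost-per-subscriber measure described here is simply cloud cost divided by subscriber count, tracked over time so that growth in total spend does not hide a change in efficiency. A minimal Python sketch follows; every figure in it is invented purely for illustration and does not reflect WBD's actual costs or subscriber numbers.

```python
def cost_per_subscriber(cloud_cost_usd, subscribers):
    """Unit cost: total cloud spend divided by the number of subscribers."""
    if subscribers <= 0:
        raise ValueError("subscriber count must be positive")
    return cloud_cost_usd / subscribers

# Hypothetical periods: (label, cloud cost in USD, subscribers). Not real data.
periods = [
    ("before regional launch", 1_000_000, 500_000),
    ("after regional launch", 1_500_000, 1_000_000),
]

for label, cost, subs in periods:
    print(f"{label}: ${cost_per_subscriber(cost, subs):.2f} per subscriber")
```

In this made-up example total cost rises by half after the launch, yet the unit cost falls because subscribers doubled, which is the kind of signal a static dollar target would miss.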
Werner Vogels
Which is interesting, of course, because the economic model for streaming services is subscription based. It's a fixed price, yet your costs per subscriber are variable. Some of them may want to binge watch every night on your service, and then there may be others that now and then pick up a sports event or a new movie or whatever, things like that. So the behavior of your subscriber base must be quite broad. Yeah. And mapping that back to a subscription price must be interesting.
Simon Elisha
So I think you're hitting on one of the areas and opportunities that we have for evolution in the space, Werner. This was our MVP unit price economics that we wanted to get out the door. It was simple, it was easy, it was something that we could get standardized across all of our products and platforms right off the bat. But I certainly have aspirations and a vision to get much more granular with our unit economics, so we can get to exactly what you're talking about, which is more of a price per stream or price per login, really tying it back to specific critical user journeys, effectively using the same type of model, so that we can understand what those critical drivers are and then tie that back to active subscribers and things along those lines.
Werner Vogels
So if I think about the critical user journey at Amazon, the retailer, it's search, browse, checkout, shopping cart, and reviews, because if reviews don't work, people won't buy either. Everything else on the site is secondary. I mean, recommendations are important, and "who bought this also bought X and Y" as well. Are there particular parts? Did you apply something like that to your organization as well? Where, let's say, the critical user journey needs to be four nines available, whereas for other parts of your organization two nines may be fine.
Simon Elisha
Absolutely. So we have a microservice tiering strategy that factors in a combination of the customer experience, business operations, things like legal compliance and things along those lines. So we have a fairly robust set of, effectively, questions that we use to be able to evaluate exactly where every microservice happens to fit in that domain. That then translates into a number of different impacts from an operations and process perspective, whether it be how we examine the cost associated with it, how we handle incident response, or even what type of gates and expectations we set for deployment frequency and quality of testing that happens ahead of time. So that all factors into play. And then certainly when we think about the actual operations of visualizing and understanding those services, we give higher priority to those tier one services that have a much stricter SLA than those that have a lower tier.
Werner Vogels
Is that a conversation that you have with the business, so not necessarily the tech only, but where basically the business helps decide which one should be tier one and tier two?
Simon Elisha
So from a business perspective, we don't get feedback on individual microservices, but we get feedback more on that user journey factor, and then the user journey translates down into the microservices themselves. So effectively, if something happens to be a critical user journey or a microservice that directly impacts it, that's going to be a tier one service in those circumstances. Things that are not critical dependencies as a part of the CUJ wind up being your tier two in all of those situations.
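The rule described here (a service is tier one if a critical user journey directly depends on it, otherwise tier two) can be sketched in a few lines of Python, together with the downtime arithmetic behind "four nines" versus "two nines". The journey names, service names, and the 30-day window are hypothetical, and the availability figures come from Werner's question, not from any stated WBD SLA.

```python
def assign_tier(service, cuj_dependencies):
    """Tier 1 if any critical user journey directly depends on the service, else tier 2."""
    is_critical = any(service in deps for deps in cuj_dependencies.values())
    return 1 if is_critical else 2

def allowed_downtime_minutes(availability, days=30):
    """Downtime budget in minutes for an availability target over a window of days."""
    return (1 - availability) * days * 24 * 60

# Hypothetical mapping of critical user journeys to the services they depend on.
cuj_dependencies = {
    "login": {"auth-api", "user-api"},
    "playback": {"playback-api", "auth-api"},
}

for service in ["playback-api", "avatar-service"]:
    print(service, "-> tier", assign_tier(service, cuj_dependencies))

for label, target in [("four nines", 0.9999), ("two nines", 0.99)]:
    print(f"{label}: {allowed_downtime_minutes(target):.2f} minutes per 30 days")
```

Over a 30-day window, four nines leaves a budget of a few minutes while two nines leaves several hours, which is why the tier drives incident response and deployment gates so differently.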
Tom Lehman
Tom, I think it's interesting too, and there's an insight here I really want to call out because I think it's vital for our listeners: both on the performance and reliability side and on the cost side, you're using business nomenclature the whole time. It's the user journey, it's the cost per subscriber, it's things that I would imagine you can have a very comfortable and open conversation about with the CFO, anyone in the finance department, the head of marketing, whomever. These are not geeky. It's not, hey, you know, what's the back pressure on this particular service? Or, you know, how fast is our storage running? There's none of that. But this ability to have an accessible conversation, I think, is vital to getting that buy-in to this whole process you're trying to do.
Simon Elisha
Yeah, Simon, I'll admit I love a good quirky microservice name, a good Optimus Prime or something along those lines. But at the end of the day, what you said is exactly it, right? We need to be able to provide an understanding of technical systems that is digestible to a large audience, and to do that we need to actually describe the feature and function that's being performed. Naming is not easy, and certainly the system is going to evolve over time, and even when you land on function-based names for the system, there's a good chance that the functionality underneath those names is going to change and twist and be dynamic over time. But that's where we need to try our best to make that alignment, because it makes everything else so much easier.
Werner Vogels
So, speaking of things evolving over time, talk us through the merger and what impact that had. Did you get a chance to start from scratch again?
Simon Elisha
Absolutely. So I have to give a lot of credit to our senior leadership. When we went into the merger, we didn't build completely from scratch. We actually had a mission of best of both. There was a code word: we were building the BoB platform. It was really great from not only a technical perspective but also a people perspective, because it wasn't a "hey, we're coming from this particular company, we're just going to use this, land it." Effectively, the engineering teams from both organizations were brought together, and they collaborated and really critically examined the back end and front end systems, everything from the individual customer-facing services and capabilities to how we manage the actual platform behind the scenes. And it was the engineers that came up with the proposals for which direction we should go. Certainly there were certain product considerations of, hey, how easy will it be to ship certain features based on these different capabilities? But that joint best-of-both approach really put us in a really good spot. Now, once we had landed on that, there was still an actual migration, because the underlying platform, the CI/CD capabilities, the actual compute platforms and Kubernetes, and how we deploy databases and things along those lines did end up getting built from scratch. So even though we had services that might have done XYZ business capability, they still needed to migrate to some of the new functions. But because everything at both companies was containerized, we were still able to port a lot of the code over and get that deployed through the new backend platform. So it was a really exciting time, a lot of development, a lot of work, and really good conversations were had across the board by both the folks that were originally from Discovery Plus and the folks that were from the Warner side. It was really a one-team moment where we came together and operated together.
So some of the key functions that we brought over were certainly the operational metadata for that. That was something that had really served its purpose. Within Discovery plus we had this taxonomy that extended across the entire software development lifecycle. So we were able to more effectively bake that into everything that our engineers did on this new platform. And that helped us get set up right away with easy standardization from everything from how we created our repositories, how we mapped OMD to those taxonomies, to how the CICD platform ended up working, how we deployed both containers as well as infrastructure, and then how we actually ended up understanding it. We could easily create out of the box, dashboards and visualizations for individual services to portfolios of services. And we could also track the costs, incidents, alerts, et cetera, all across the board in a standardized way.
Werner Vogels
Often when there's a merger, especially with two tech-heavy organizations, culture might result in clashes. And in this particular case, you already had a whole methodology as well as a culture around your OMD, the metadata repository. How easy was it to convince the other team, the other side, to actually adopt your methodology?
Simon Elisha
Yeah, so I think yes, there can certainly be situations where there can be some antagonism and things along those lines. But quite frankly, with how we ended up integrating and the way that we ended up coming together, it wasn't two separate teams effectively clashing heads. During the merger, at least from my experience with the platform teams, there were a lot of very good collaborative sessions between the teams where we were truly evaluating the merits of different parts of the system and came up with the recommendations together. And certainly there were some disagreements, but I'm going to go back to my Amazonian days: there was a commitment to "disagree and commit" without any animosity between the engineers on the floor, which was excellent to see. And in a lot of ways, a lot of our solutions were headed in the same direction. So if we took a look at the future state vision for what had been built out at Discovery and the future state vision for what had been built out for HBO Max, while neither had reached their end-state North Star, their North Stars looked pretty similar at the end of the day, including specific technologies that they wanted to use and the ways that they would adapt them. So this was a really great opportunity for platform teams to propel themselves forward, full scale, to that shared North Star vision. So it was a great happenstance and a great opportunity for that type of collaboration to land on what that future state actually looks like.
Werner Vogels
Did the fact that you'd been given a nine month deadline to get everything done, did that help in decision making?
Simon Elisha
It can certainly speed things up, right?
Tom Lehman
It focuses the mind beautifully.
Simon Elisha
Yes, yes, it was very much a, well, we've got to do this, we've got to get it done and get it out the door. So there wasn't any time for hashing and rehashing architectures and systems, because we needed to get to a point. Especially as a platform team, you're the first gate for any of your engineers, for the back end, for the front end, to actually make progress. So we were in the hot seat for the first two months or so, because everybody was waiting for us to be able to build the capabilities for them to build their software.
Tom Lehman
And talking about making decisions, you know, one of the things we talk about in the frugality of architecture is this incremental approach to cost optimization. But you can't change everything. And you have a really interesting approach to, I guess, categorizing design decisions that could continuously cost you in the future. Help us unpack how you look at that.
Simon Elisha
Sure. So if I'm not mistaken, Simon, you're referring to the closed door philosophy.
Tom Lehman
Exactly. Yes.
Simon Elisha
Yeah. So a good analogy that I like to think of is when you're going and buying a house: there are certain things that you can change or update or modify, and there are certain things that you can't. You'll hear real estate agents say location, location, location, and sure, it's a stereotype, but at the end of the day that's a very important distinction. Once you buy a house and the land, you're not going to lift the land or move the land. If you buy a house in a floodplain, you have a house in a floodplain and you have to deal with that going forward. And that's something that is a closed door. You can't change the environment around you unless you have a ridiculous amount of money. Most of us don't have that type of money to be able to do that. So you need to be able to focus on what the things are that are going to be irreversible.
Tom Lehman
Yeah, everything's reversible. It's a question of time and money and usually we're short on both.
Werner Vogels
Some things are just like land. I mean, if you just sold your car to someone, you can't go back a week later and say, sorry, I changed my mind, I want my car back. Yeah, there are one-way doors and there are two-way doors.
Simon Elisha
And there are some doors that get a little stuck and need a little bit of oil, and you can budge those free. So from our perspective, a lot of our closed door, or very difficult to alter, technology decisions come down to things like database choice, deciding whether you're going to go relational versus non-relational. So choosing between RDS or Aurora versus DynamoDB, that's a pretty sizable distinction. And if you get into production and you have users, switching databases for, let's say, a service that happens to be in our critical user journey becomes a big endeavor. So we want to be able to make sure that upfront, those decisions are appropriate for the business use case, for regulatory needs, for reliability needs and for cost needs, right? And also for maintenance and operational needs, of course, too. We don't want to be in a position where we're creating a database that may only be a hundred rows and rarely gets updated, and then land that in Aurora or RDS, because that happens to be a situation where you're going to have to operate and maintain those instances going forward. It's just not the fit. So there's an important conversation to be had with engineers to make sure that they've got the right education on, hey, what are the right ways and what are the right situations to use different types of technologies.
Tom Lehman
And do you document those? Do you use that as a repository of knowledge for future people coming on board?
Simon Elisha
Absolutely. So we have some clearly defined, at least within the database space, use cases for different styles of database. And ideally we help teams understand and self-serve that information. But one of my teams, the Database Reliability Engineering team, partners with teams when they create one. We have a document that's standardized in our organization for documenting architectural decisions. And whenever we're introducing a new microservice, database, et cetera into the ecosystem, our DRE team gets their hands on the ADD, effectively. If there's a database involved, teams flag it. My engineers get engaged in that and they'll do a once-over and just validate: hey, does this seem to make sense from a database perspective and a maintenance perspective? And they partner with the team to see whether there are other opportunities or different ways of using the right database style, different types of deployment methodologies, replication capabilities and things along those lines, up front, before we end up getting anything into a production environment where it could end up costing us a lot of money down the line.
Werner Vogels
But hopefully there's also then the category of decisions that people can just go make without having to talk to others first.
Simon Elisha
So in those spaces, particularly when we're thinking about cost, right, for things like container scaling, we'll typically look at that. And if we're in a situation where we understand that a service is inefficient, when it's just ready for its initial production launch, or we have a major event that's coming up, we may say, hey, it's okay for right now. We will take on that actual real financial debt and scale it up horizontally, right? And we'll keep it scaled up as necessary to handle traffic and load for a short period of time. Because we know that that is a situation where we can come back, we can tune configurations, we can update some of the logic within the service itself and then eventually bring that back down. And we have some good ways of being able to measure and understand those efficiencies and we can work with teams, and teams have the ability to see that information and pare that down a bit. So those are some of those situations where it becomes a, yeah, go ahead. Let's just make sure that we keep track of this and we don't lose sight of it in the future.
Tom Lehman
Well, this leads into that concept that I know you've spoken about previously as well, which is, you know, the difference between frugal and cheap. And the wonderful concept we talk about at Amazon too is frupidity. And I think you've touched on that a little bit there, which is, you know, you don't have to save money all the time because it's not always the right thing to do.
Simon Elisha
Yep. And there are two angles of frupidity. And I can't take credit for introducing it to Warner Brothers. That was one of our senior leaders. But I remember we were going down the route of a cost savings initiative, and every company goes through these throughout their cycles. And we were like, all right, we're going to emphasize cost over this quarter. And he looked at me and he just said, Tom, whatever you do, just make sure we don't do anything stupid. And that has resonated with me and the rest of the company ever since. And the position there was, don't do something in the name of cutting cost that is going to have negative repercussions for our users. But it also has the same implication the other way, and Simon, this is what I believe you were getting at: don't spend tons of time on trying to reduce cost when that time could be better spent improving customer experience. Right. Saving $10 a month and spending a week on building in that efficiency is probably not the right move when it comes to how we're investing our engineering effort.
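The "$10 a month versus a week of engineering" trade-off above can be put into numbers with a simple payback calculation. The following is a minimal sketch, not anything WBD describes using: the `Optimization` class, the 12-month threshold, and the hourly cost figure are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Optimization:
    """A proposed cost optimization (all fields are illustrative)."""
    monthly_savings_usd: float  # expected reduction in the monthly bill
    engineer_hours: float       # effort needed to build the efficiency


def payback_months(opt: Optimization, hourly_cost_usd: float) -> float:
    """Months of savings needed to recover the engineering cost."""
    if opt.monthly_savings_usd <= 0:
        return float("inf")
    return opt.engineer_hours * hourly_cost_usd / opt.monthly_savings_usd


def worth_doing(opt: Optimization, hourly_cost_usd: float,
                max_payback_months: float = 12.0) -> bool:
    """True if the work pays for itself within the chosen horizon."""
    return payback_months(opt, hourly_cost_usd) <= max_payback_months


# One engineer-week (40 hours) to save $10 a month, at an assumed
# fully loaded cost of $100 per hour.
small_win = Optimization(monthly_savings_usd=10, engineer_hours=40)
print(payback_months(small_win, hourly_cost_usd=100))
print(worth_doing(small_win, hourly_cost_usd=100))
```

Under these assumed numbers the change would take centuries of months to pay back, which is the point being made in the conversation: the same week is probably better spent on customer experience.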
Werner Vogels
Now, when efficiencies are one-offs versus the ones that execute every three milliseconds, over and over again. You mentioned nine different regions where you're operating, serving all of the world. Are those regions identical for you? Or do you have different deployment strategies and maybe different regulatory requirements that you have to operate under?
Simon Elisha
Yeah, so we try to keep them about as identical as possible, with all of those considerations in mind. We have a way of thinking about our architecture where we have different classifications for each of our components and services. One is our market specification. So we divvy up those nine regions into four markets, but really it's three at the end of the day: Americas (basically US and Latin America), EMEA and APAC. And each of those is effectively there to serve customers within that particular area. But we also have a branch of what we consider global services. These services, no matter what market they happen to be deployed to, are 100% identical. There's no change in business logic, and they have the exact same infrastructure deployment. So if they have an RDS instance per region in the US, they have an RDS instance per region in EMEA and per region in APAC. For our market-specific ones, there are actually unique databases within a particular market. So the US databases are isolated to the US regions. There will still be databases in the EU, but they are self-contained in that area. And typically in those spaces, we'll have a replication strategy within the market, whereas for global services, it'll be global replication across all nine regions.
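The difference between market-scoped and global replication described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the region labels (`us-1`, `emea-1`, and so on), the split of nine regions across the four markets, and the `replication_targets` function are hypothetical and are not WBD's actual regions or tooling.

```python
# Hypothetical assignment of nine regions to four markets.
MARKET_REGIONS = {
    "US": ["us-1", "us-2", "us-3"],
    "LATAM": ["latam-1", "latam-2"],
    "EMEA": ["emea-1", "emea-2"],
    "APAC": ["apac-1", "apac-2"],
}

ALL_REGIONS = [r for regions in MARKET_REGIONS.values() for r in regions]


def market_of(region: str) -> str:
    """Return the market that a region belongs to."""
    for market, regions in MARKET_REGIONS.items():
        if region in regions:
            return market
    raise ValueError(f"unknown region: {region}")


def replication_targets(scope: str, home_region: str) -> list[str]:
    """Regions a database in home_region replicates to.

    scope "global": every other region in the world.
    scope "market": only the other regions in the same market.
    """
    if scope == "global":
        candidates = ALL_REGIONS
    elif scope == "market":
        candidates = MARKET_REGIONS[market_of(home_region)]
    else:
        raise ValueError(f"unknown scope: {scope}")
    return [r for r in candidates if r != home_region]


print(replication_targets("market", "emea-1"))
print(len(replication_targets("global", "us-1")))
```

Keeping the scope as data on the service, rather than as per-region configuration, is one way to keep deployments identical across regions while still isolating market-specific data.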
Werner Vogels
So from your operational side and your reliability side, do you look at each of those four markets differently, or does your operations team have one global view of the whole world?
Simon Elisha
Yeah, so it's both. Because we care, again, about both, and because we have services that are spread globally, we need to be able to understand global impact as well as individualized market impact. And the market categorization is a part of the OMD taxonomy. So we have that information aligned to each of the services, their individual deployments, and the measurements across the board. So we can easily pull up and understand: here are all of the operational metrics associated with the U.S., here are all the operational metrics associated with EMEA, here are all the operational metrics associated with APAC. And it'll pull that data from those particular AWS regions to bring them up to the forefront. So we can understand that for our databases, our containers, and the CUJ performance in each of those domains.
Werner Vogels
So it's not that you have your data flow to one centralized location, let's say in the US, and then aggregate everything there, but each of the regions is responsible for itself in terms of providing that data?
Simon Elisha
In many cases, yes.
Tom Lehman
Tom, I wanted to pivot a little bit. You talked a lot about some interesting things that the team's working on, and things don't always go right. I mean, we wish they'd go right, we try to make them go right, but they don't always go right. And you talk a lot about celebration of error, and that's a term I hadn't come across before, quite frankly. We do correction of error at Amazon. But celebration of error is an interesting concept. Talk to us about how it works and what impact it's had, even just using it in that way.
Simon Elisha
I didn't coin the term celebration of error, but I think there was definitely a little bit of a rooting in the COE acronym there. And the desire was really that we wanted a positive experience associated with the learning opportunities. So using the term celebration in that space felt like a way by which we could highlight the opportunities and the learnings that come from it. Certainly we don't want errors to happen. The celebration isn't the fact that we had an incident and customers were impacted. No, that's not the end goal. The celebration is really about the shared learnings and how we better understand a combination of our systems, our people and our processes that are in play. So by doing that and changing the nuance and trying to focus on those three functions that all typically get engaged whenever you have an incident, we've seen a really positive approach to how engineering teams reflect on incidents after the fact when SRE gets engaged. And one thing I should be clear on is that at WBD, reliability is a shared responsibility. My organization's name is Site Reliability Engineering, but we aren't on point for the reliability of all the pieces of our systems. The teams that build and deploy Service X are responsible at the end of the day for the reliability of Service X. SRE helps out and helps provide them tools to make that more reliable. But occasionally we do get pulled in for actual engineering work. And typically these are some of the hairier, larger-scale events that happen to occur. And when we go in, we try to look at the incidents from a number of different angles. We really try to understand the observation, hypothesis and action life cycle that typically occurs in many incidents. And we try to identify that throughout the entire celebration of error process. So reconstructing the timeline, tagging and understanding what happened even before we started investigating the incident. 
So what was the state of the system prior to that point, and how did that inform how we actually responded to it? When people come in, what is their perception when they enter the incident? What knowledge do they bring as a part of that process? And then as we're going through that, we're identifying action items and observations of what was happening at the time, and then we use that to inform how we're going to bolster the system and how we're going to bolster the processes, and maybe even translate that into better training and material. We don't want to be pigeonholed into "let's tune an alert" or "let's scale up." We want to be able to identify: hey, maybe if this engineer or operator knew about this dashboard a little bit sooner, do we have an opportunity to do some knowledge transfer or knowledge sharing? Hey, this deployment dashboard could have shown you the correlation with rebuffering rates or playback failures that went up at the same time, and we could have rolled that back faster in that situation. So we wanted to attack all of those different angles. And then we always share these major COEs with the broader organization. So we'll invite everybody from senior leadership to level one engineers and we'll do a review of the COE and exactly what happened.
Werner Vogels
Do COEs in your case have a particular structure? I mean, at Amazon, we have these five whys and descriptions of things where there's a fixed format, or there are fixed questions that you have to answer. Do you have something like that as well?
Simon Elisha
Absolutely. For every one we have, we have a standardized incident creation process. Some of it ends up getting triggered automatically based on our metrics and configured alerts; others are manual creation processes. And if it happens to be what we classify as a SEV2 or a SEV1, it will automatically create a template of a COE, and that template includes a number of core sections. The first four or five are really about the executive summary and decomposing the customer impact, because we always want to understand, again, it goes back to the customer: what were the customers feeling at that point in time? And one thing to call out: it's not always external customers. Our systems support internal business use cases. For platform engineering, we have customers that are developers within our organization. So that customer impact isn't just about our customers that have an active subscription and are trying to stream. So we try to quantify and understand what that happens to be. We have a timeline that we go through to understand, again, each of those different actions, observations and things along those lines of how people interacted throughout the incident. And then we have standard questions. Five whys happens to be a part of it, and then a number of templated questions associated with observability, like: was this automatically detected? Is there something that we need to change in how we monitor or alert as a part of the incident? And then there's certainly an action item section as well. We also have a few sections on what we learned from this: what went well, what didn't go well. Because sometimes you learn more from what went well during an incident than from all the bad things that might have happened.
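The template flow described above (a COE document created automatically for SEV1 and SEV2 incidents, with a fixed set of sections) can be sketched roughly as follows. The section names are paraphrased from the conversation; the `coe_template` function, the incident ID format, and the plain-text layout are assumptions for illustration, not WBD's actual tooling.

```python
from typing import Optional

# Section names paraphrased from the discussion; the real template's
# wording and ordering are not given in the source.
COE_SECTIONS = [
    "Executive summary",
    "Customer impact (external and internal customers)",
    "Timeline of observations, hypotheses and actions",
    "Five whys",
    "Observability: was this automatically detected?",
    "Action items",
    "What went well",
    "What didn't go well",
    "What did we learn",
]


def coe_template(incident_id: str, severity: int) -> Optional[str]:
    """Return a blank COE document for SEV1/SEV2 incidents, else None."""
    if severity not in (1, 2):
        return None  # lower severities do not get an automatic COE
    lines = [f"COE for incident {incident_id} (SEV{severity})", ""]
    for section in COE_SECTIONS:
        lines.append(section)
        lines.append("  TODO")
        lines.append("")
    return "\n".join(lines)


# A hypothetical SEV2 incident gets a template; a SEV3 does not.
print(coe_template("INC-0001", 2))
print(coe_template("INC-0002", 3))
```

Generating the skeleton at incident creation time means the timeline and customer impact sections exist while details are still fresh, rather than being reconstructed later.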
Tom Lehman
So Tom, obviously the future is interesting, exciting and a little scary as well for all of us in technology in terms of what it holds. I guess before we wrap up, because time has gone very quickly: firstly, the journey the organization has gone on is remarkable. And again, I can't overemphasize how hard it is to keep a streaming service up and running reliably. Like, streaming stuff is hard, particularly when you've got hits like Succession and you've got The Last of Us, et cetera. Like, it hits hard. And so clearly the team's done a lot of work. Are there any last thoughts you'd like to share for others who are in this world, or even thinking about getting into it, that you think are relevant, and specifically, obviously, around frugality in the way you're thinking about things?
Simon Elisha
Yeah. So I think at the end of the day, in order to build frugal architectures and enable a culture of frugality, there are some core ingredients to the recipe. First and foremost, you need to be able to align what you have to the business. I truly believe that we've done a lot of that within the WBD space: being able to understand how your costs impact the business and the way that ends up tying back to certain capabilities within your system. You also need to get buy-in that this needs to be something that's done. I think that's easier than in some domains, certainly, because everybody wants to save cost and reduce the cost of operations. You need to be able to provide teams the right visibility. Right. Creating and generating reports on a quarterly basis, half-year basis, yearly basis on where costs are and how they happen to be allocated: the cycle time on that is too long to really embed in a culture. So making it as self-serve as possible and getting that insight into the engineering teams to take action is table stakes in my mind. And then finally, provide the teams the right tools and education to take action on that information. Again, if they have the information and they can't do anything about it, you're not setting them up for success in that space. So get the mission, get the insight, get the tools and education, and I think teams can make a lot of difference in this space.
Tom Lehman
Makes sense. Tom, thanks so much for coming on and sharing all that with us. It's been really fascinating.
Simon Elisha
Thank you Simon, it was great. Thank you.
Tom Lehman
And Werner, always fun to do this. We'll have another one soon, I'm sure. And as always, you can refer to the Frugal Architect webpage as well. Get lots of information, lots of tips. It's the sort of resource that you want to revisit regularly because you'll learn something new every time. I know I find that as well. And some great customer stories there. And of course, until next time, and doing it frugally, keep on building.
AWS Podcast Episode #722: The Frugal Architect w/Werner Vogels - How Warner Bros. Discovery Keeps Streaming Seamless
Release Date: May 26, 2025
In episode #722 of the official AWS Podcast, host Simon Elisha and Amazon CTO Werner Vogels delve into the intricacies of maintaining seamless streaming services at Warner Bros. Discovery (WBD) with Tom Lehman, WBD's Vice President of Site Reliability Engineering. The conversation centers around the strategies, challenges, and innovations that ensure millions of subscribers enjoy uninterrupted entertainment across platforms like Max, Discovery+, and Bleacher Report.
[00:00 - 01:02]
Simon Elisha opens the episode by introducing the "Frugal Architect" series, emphasizing the focus on cost-effective yet robust architectural solutions. He welcomes Werner Vogels and highlights Tom Lehman's role at WBD as Vice President of Site Reliability Engineering (SRE).
Notable Quote:
"It's always fun in these conversations... to dig behind the details of the services we kind of take for granted every day."
— Simon Elisha [01:17]
[02:21 - 04:24]
Tom elaborates on his responsibilities, which include ensuring the reliability, scalability, and operability of WBD’s global technology platforms. The mission is to deliver uninterrupted and efficient streaming experiences to millions of subscribers, mitigating issues swiftly through automation or manual intervention.
Notable Quote:
"Our goal is to provide the customer uninterrupted and efficient experiences... and in the case that an issue does come up, that we are able to mitigate it relatively quickly."
— Tom Lehman [02:37]
[04:24 - 10:46]
The discussion transitions to the creation and implementation of the Operational Metadata (OMD) Schema. Tom explains how standardizing metadata across millions of cloud resources across multiple AWS accounts was crucial for managing dependencies, security vulnerabilities, and cost management.
Notable Quote:
"We wanted to create our own mailing addresses that will be understood and standardized across the organization."
— Tom Lehman [08:31]
[10:46 - 19:18]
Tom discusses the importance of frugality—not just in cost-cutting but in optimizing resource usage to enhance customer experience. WBD adopted "cost per subscriber" as a key efficiency metric, enabling them to balance growing subscriber bases with controllable infrastructure costs.
Notable Quote:
"Cost per subscriber... helps us understand are we actually building a system that is just as efficient, if not more efficient than some of the products and platforms that had come before it."
— Tom Lehman [15:00]
[23:55 - 29:09]
The conversation shifts to the recent merger between Warner Bros. and Discovery, highlighting the integration of diverse engineering teams and technologies. Tom credits senior leadership for fostering a collaborative environment that merged best practices from both organizations without significant friction.
Notable Quote:
"It was a really great opportunity for platform teams to fully scale themselves forward to that shared North Star vision."
— Tom Lehman [28:43]
[40:00 - 45:54]
Tom introduces the concept of "Celebration of Error," a proactive approach to incident management that focuses on learning and improvement rather than merely correcting mistakes. This methodology fosters a positive culture around handling incidents, emphasizing shared learnings and continuous improvement.
Notable Quote:
"The celebration is really about the shared learnings and how we better understand a combination of our systems, our people and our processes."
— Tom Lehman [40:26]
[35:35 - 37:10]
Tom explains WBD’s deployment strategy across nine AWS regions, categorized into three main markets: Americas, EMEA, and APAC. They maintain identical global services across regions while allowing for market-specific deployments to meet regulatory and operational requirements.
Notable Quote:
"We have services that are spread globally, we need to be able to understand globally impact as well as individualized market impact."
— Tom Lehman [38:56]
[46:40 - 48:22]
Concluding the episode, Tom shares insights on building frugal architectures by aligning costs with business objectives, providing visibility through self-service tools, and educating teams to take informed actions. He emphasizes that frugality should enhance, not hinder, the customer experience.
Notable Quote:
"Get the mission, get the insight, get the tools and education and I think teams can make a lot of difference in this space."
— Tom Lehman [46:40]
Episode #722 of the AWS Podcast offers a deep dive into the operational excellence and frugal engineering practices at Warner Bros. Discovery. Through strategic metadata standardization, a collaborative approach to mergers, and a proactive incident management philosophy, WBD successfully delivers seamless streaming experiences to millions worldwide. The conversation underscores the importance of aligning technical strategies with business objectives, fostering a culture of continuous improvement, and maintaining cost efficiency to support scalable growth in the competitive streaming industry.
For more insights and detailed discussions, listeners are encouraged to visit the Frugal Architect webpage.