
Dive deep into the fascinating world of modern data centers with Sr. Principal Engineer at AWS, Stephen Callahan.
Stephen Callahan
This is episode 716 of the AWS Podcast, released on April 14, 2025.
Simon Lesh
Hello, everyone, and welcome back to the AWS Podcast. Simon Lesh here with you. Great to have you back. And I'm joined by a special guest. I'm joined by Stephen Callahan, who's a senior principal engineer at Amazon and he works in the infrastructure services space, and he's been here for a long time. In fact, one of the few Amazonians I get to meet that's been here longer than me. Almost 15 years. I'm only 13. Stephen, welcome to the podcast.
Stephen Callahan
Yes, thank you very much. It has been a roller coaster. When I think back to how things started, going back 15 years, and what we were doing then compared to now, it may be 15 years in one company, but it kind of feels like we've been through three or four different variations since then.
Simon Lesh
Yep, absolutely. Every year is like three to five over here. But that's a good thing for our customers, because it means we've done lots of stuff and done it at scale. So today we are talking about infrastructure. In fact, we're talking about data centers. Now, we don't often talk about data centers at AWS and Amazon, because that's the undifferentiated heavy lifting that supports what our customers do. Customers connect via an API and get all the good stuff. But servers have to run somewhere. Services are created with technology, technology is made of hardware and software, and those things have to come together. And at Amazon, they come together at enormous scale. Because of that, we think about things very deeply, maybe more deeply than if you've run your own data center, as many of you have had the opportunity to do. And before we get into some of the things we've been doing differently and how we've been changing things, let's talk a bit about scale, Stephen, because I think you bring a really interesting perspective on that topic, given what you've seen, and it'll help folks understand the different mental models that you and your team apply.
Stephen Callahan
Yeah. So I suppose, if I look through the different scale aspects, there's a joke I have around my organization that every couple of years we just change the exponent. We started in gigabits, we moved to terabits, and then we're talking about petabits, and it just keeps moving up as we go.
Simon Lesh
Oh, just add zeros.
Stephen Callahan
It's just a different letter, you know. It'll always evolve. And yes, 800-gigabit or 1.6-terabit network interfaces, you know, they were fantasy worlds not too long ago. And I suppose in the data center space, a lot of the changes that we're seeing now... a data center used to be the one thing we were talking about, and now we're planning in clusters, which we'll get onto a little bit later on. But one of the things I was thinking about in preparing for this was, you know, in the non-ML world, we're adding a building. It's like we add a building to AZ1 and then AZ2 and AZ3, and we're trying to balance them out. That's no longer the case now. Now it's a case of: I want six buildings in this location as quickly as possible. And so it's just a different way of looking at it. And like you said, the way that we think about it is a bit different. We go to much deeper levels of ownership in these data centers than I think most people would think is reasonable to do.
Simon Lesh
Well, I think it's interesting, and we'll get onto some of the challenges of scale, particularly with some of the newer technologies. But one thing you said to me when we were preparing for this call is that nothing is uninteresting at scale. What do you mean by that? Because I think it's a fascinating way to look at things.
Stephen Callahan
Yeah, there's a set of core tenets that drive most teams here at Amazon, and that's one of the ones that I advocated for within our team. It's a case of trying to balance speed of delivery with availability, sustainability... there are lots of different things that we're doing, and so we're always looking for those trade-offs. But I don't want us to just sit here and overthink it. When it's at small scale, you can make a faster decision. But as scale starts coming into it, that's when the more interesting things start coming about. And maybe you'll reevaluate a decision you made in the past, because now suddenly we're 10 or 50 times larger than we were before.
Simon Lesh
Gotcha, gotcha. Now, you and the team have been building lots of data centers for a long time. Surely it's just cookie cutter, you know, set and forget, stamp them out, away you go. Why are we rethinking things? What's changing?
Stephen Callahan
Well, I think that's one of those things where, if you're not continually moving forward, you just get into stasis, and that ultimately gets to, as we say here at AWS, day-two thinking, which ultimately leads to being frozen and, ultimately, the death of your innovation and ability to deliver. And so we're continually rethinking some of these things. But I suppose what's different with the pace in the data center space is that, obviously, there's a huge market interest in Gen AI, and those data centers are different. We've gone to much higher density, much more power-intensive processing, and I also look at these as much more correlated workloads. When you have a general purpose cloud data center, you've got a hundred, a thousand customers, they're all doing different things, and you gain some benefits from everyone doing something different. But when the entire data center is correlated and on or off, new trends come up, and anytime that happens, you get a new opportunity to just rethink things.
Simon Lesh
And so you talked about Gen AI, and what we know is that it uses lots of power, and power needs cooling as well. Talk us through the whole thing, because I guess for the uninitiated it's easy to think, well, it's just a bunch of GPUs, you throw them in a rack, how hard could it be? What is so different?
Stephen Callahan
I would love it if that was the case.
Simon Lesh
Well, you wouldn't have a job.
Stephen Callahan
But you know, yeah, that's one thing. A lot of what we're talking about here with Gen AI: when you get to generating tokens, which is ultimately what these machines are doing, they need to do so at as fast a rate as possible. And when you start looking at that, it's how many other machines are working, and what is the scale of the model that you're working with, whether it's training the model and creating it, or inference, where you're using it. There's a lot of correlated compute happening, and so how much concentration of compute can you get? Ultimately, distance is a factor. You know, I did a talk not so long ago where I said the young-faced version of myself never realized I'd be upset that the speed of light on Earth is so slow.
Simon Lesh
You've changed, man. You've changed.
Stephen Callahan
I know. Where we can get things closer to each other, and where we can ensure that we're under a couple of microseconds of latency, that makes a material difference to these systems. And so if you look at, for example, what Nvidia puts out, or what we put out with our Trainium Ultra servers, entire racks are now systems. We used to have servers, and instances, which were slices of servers. Now we're talking about the rack as the entity, or in some cases clusters of racks as the entity. Having that concentration helps these models. And even within that, we don't talk about, you know, five GPUs for this purpose. There are some customers out there who want tens of thousands or hundreds of thousands of GPUs all working together on the same workload. That's where the data center innovation comes into it. If we did want to get 20,000 GPUs all in the same building, because we want high-speed network between them, low latency between them, a lot of bandwidth between them, that's a data center full of GPUs, and accelerators in general. Whereas a few years ago, before this boom, we were using a room of GPUs or a corner of a building, now it's multiple buildings together for a singular purpose.
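To put a rough number on the speed-of-light point, here is a back-of-the-envelope sketch (the constants and distances are illustrative assumptions, not figures from the episode). Light in optical fiber travels at roughly two-thirds of its vacuum speed, so a "couple of microseconds" one way corresponds to a few hundred meters of fiber, which is why machines have to be physically close:

```python
# Back-of-the-envelope propagation delay over optical fiber.
# Assumption (not from the episode): light in fiber travels at roughly
# two-thirds of its vacuum speed (refractive index ~1.5).
C_VACUUM_M_PER_S = 299_792_458
FIBER_SPEED_M_PER_S = C_VACUUM_M_PER_S * 2 / 3

def one_way_delay_us(distance_m: float) -> float:
    """One-way propagation delay in microseconds over distance_m of fiber."""
    return distance_m / FIBER_SPEED_M_PER_S * 1e6

for d_m in (100, 400, 1_000, 10_000):
    print(f"{d_m:>6} m of fiber -> {one_way_delay_us(d_m):7.2f} µs one way")
# 400 m is already ~2 µs one way; a building across campus costs far more.
```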
Simon Lesh
So help us understand what that means from a power consumption and a heating and cooling perspective. Because, you talked earlier on about the numbers being very different. The numbers are scary different. And we have to tackle it differently, don't we?
Stephen Callahan
Yeah, yeah. People talk about the power that's consumed by these data centers, the power that goes into them. But the other thing to think about is that all of that power gets converted to heat. So if I look at just the innovations that have to happen on the cooling side, that's representative of: what are we doing with all of this heat? If I look at the previous world, before we got to these levels of concentration, we had a bunch of cooling technologies. One of the things we do in the data center is called hot and cold aisle containment, where your chilled air is on one side, it goes through the servers, picks up heat, is then in a hot aisle, and then it gets exhausted out of the room, processed by the facility, and the cool air comes back in. And that processing could be, for a given data center, maybe there are chillers there, your traditional HVAC sort of system, or maybe it's evaporative cooling. So there's a bunch of technologies that we would have chosen from. But if you look at what the heat load of the data center was, it was CPUs; it was fairly spread out.
Simon Lesh
That was where it was coming from. And, as you said, not necessarily correlated.
Stephen Callahan
Yeah, yeah. Whereas today, what we're looking at is the entire building outputting multiple megawatts of heat. And so this is an opportunity where we said, well... I'm from Ireland, and it's a relatively temperate climate. A couple of times a year the temperature gets up a little bit and maybe we need to do something. But it's not great to size your data center cooling to the hottest day of the year, because in Ireland we get a lot of rain, we get a lot of cool weather. Sizing it for the middle of June when you're in January is not great. So one thing we can do is move to systems called multimodal cooling. That's where we've got different modes available to us at the data center and we're able to balance the systems. Sometimes we're going to use chillers, we're going to augment that maybe with some evaporative cooling, or shut down the chillers. We can move through different modes depending on the needs of the data center at that time. And then when you add in the future machine learning racks, we're now pushing liquid cooling all the way to the GPU or the accelerator itself. So instead of going pure air for the entire building, now either all of the racks are liquid-cooled or maybe only some of the racks are. And similarly, I don't want a data center that's only going to be suitable for one thing. "I have to wait for a liquid-cooled data center before I can launch this rack" is never a statement infrastructure is going to make. So we've had to come up with methods for having some liquid-cooled racks in a typically air-cooled data center. That's the multimodal piece that we're doing to handle this massive amount of heat load.
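To make the multimodal idea concrete, here is a minimal sketch of mode selection; the episode describes the concept, not an algorithm, so the mode names, thresholds, and inputs below are all illustrative assumptions:

```python
# A minimal sketch of multimodal cooling mode selection. All thresholds are
# illustrative placeholders, not AWS operating parameters.
from dataclasses import dataclass
from enum import Enum

class CoolingMode(Enum):
    FREE_AIR = "free air"
    EVAPORATIVE = "evaporative"
    CHILLER = "chiller"

@dataclass
class Conditions:
    outside_temp_c: float
    humidity_pct: float
    heat_load_mw: float

def select_mode(c: Conditions) -> CoolingMode:
    if c.outside_temp_c < 18 and c.heat_load_mw < 20:
        return CoolingMode.FREE_AIR     # cool weather: outside air alone
    if c.outside_temp_c < 28 and c.humidity_pct < 60:
        return CoolingMode.EVAPORATIVE  # warm but dry: augment with evaporation
    return CoolingMode.CHILLER          # hot or humid days: run the chillers

print(select_mode(Conditions(12.0, 80.0, 15.0)))  # an Irish January: free air
print(select_mode(Conditions(33.0, 75.0, 40.0)))  # a Mississippi July: chillers
```

The point of the design is that the building carries several of these modes and moves between them, rather than being sized permanently for the hottest day of the year.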
Simon Lesh
It's interesting, the concept of an adaptable data center, or set of data centers, versus, as you say, building it for one case. I guess the other wrinkle you have, besides dealing with the fiendishly challenging Irish weather, is building data centers in different countries and different climates altogether. I think somewhere like Australia is a little bit different to Europe in terms of its weather envelope.
Stephen Callahan
Yeah. And we look at some of the locations that we're in. There are ML data centers in Stockholm, and that's very different to what we're going to have to do in Mississippi, where it's far more humid, and free-air cooling and swamp cooling are not going to be as effective as chillers. And then there are regions in the Middle East, in the desert, which is the complete opposite again: you've got cold, you've got humid, you've got dry. And so, for every building... one of the things that I think of in overall data center innovation is that we're far more attuned to not just stamping out the same thing everywhere for every use case. We need a number of options in our tool belt, so we can say: this is a machine learning data center, with liquid-cooled racks, of this many megawatts, in this particular climate, and I have access to cold water or I don't, or I have nearby access to this power source. Is there a local wind farm? Are we talking about some form of nuclear plant nearby? All of these things go into the recipe that builds this new, modern data center. And the big thing that I think makes us innovation-forward is having the right number of tools in that tool belt to build the best data center for the scenario that we have.
Simon Lesh
Gotcha, gotcha. Now, we talked about the fact that new technologies and new workloads are changing the demand on power. And clearly sustainability has always been important to us. How are we dealing with that in this new, evolved world?
Stephen Callahan
In a traditional AWS region, we have all of the compute in that location. US East 1 is in Northern Virginia, near Dulles Airport. We have this concept of that's where the compute is, but it's also known as Data Center Alley; there are lots of data centers there. So when we look to extend this, having more and more power in that location is not necessarily going to be the best thing. We do look at places like Mississippi, where there's a solar and a wind farm nearby. There's a site in Pennsylvania that's adjacent to a nuclear power plant. These are cases of looking at the different renewable energy sources that are available and locating the data centers there. Rather than pull the power from the source and drag it all the way to Northern Virginia, we're able to put our data centers in those locations. And that's different for us, right? We've had regions before, and now it's a different approach to where we put them. And then on sustainability, we can get into a lot more things, like embedded carbon in the concrete, analyzing the steel production, going through our supply chains and trying to find ways to decarbonize them. There are a lot of wins we have in that particular area.
Simon Lesh
And I think it's interesting, because we have hit our goal of matching our data centers with 100% renewable energy. That was hit in 2023, but like I said, the team's still diving deep. That's not done; there's more efficiency and economy to be found.
Stephen Callahan
There is. And I think every year for the past few years we've been the world's largest corporate purchaser of renewable energy. So, you know, we're at 100% renewable and we want to keep going on that, and we just keep pushing more and more in that area. I even believe there was an announcement not so long ago about really long-term things, an agreement for a small modular reactor. That's on the order of a decade away by the time we get there. So yes, there's more to be done now, but we're also planning eight, ten years ahead to see where we want to be, and moving in that direction.
Simon Lesh
And I guess, look, there are big lead times with things like this. You can't just magic up a huge set of chillers or a massive building structure. It takes physical time.
Stephen Callahan
It does, it takes time. But it's also a case of: we iterate on this constantly. There's a new building in construction somewhere in the world at any point, all the time. And so we're also using this opportunity to question things like the concrete that we've been using. Like I was mentioning before, in a particular region there are indirect carbon emissions in our supply chain. They're not exactly within our control, but we can push the industry to go along with us. So there's a process called trial batching. When you're coming up with the concrete, what are the different mixes you can do? You've got the water-to-cement ratio, the air content, how much shrinkage there is. And we look at a particular area, because cement is not the same everywhere in the world, as much as we would like to think it is. What you get in Brazil is not going to be the same as what you get in Virginia. So what we've been doing is looking to see: are there substitutes we can put in, while still meeting our performance standards, that replace some of the carbon-heavy pieces of the cement with other materials? If I remember right, there's a supplier we worked with in Virginia that replaced 40% of the cement with slag, which is a byproduct of refining metals. That meant we reduced the amount of carbon in the cement mix by 35%, while also encapsulating a byproduct that normally would have gone to something like landfill.
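The arithmetic behind those two figures hangs together, as this small worked example shows; the relative carbon intensity of slag used below is inferred from the episode's numbers, not a sourced value:

```python
# Worked example using the figures from the episode: replacing 40% of the
# cement with slag cut the mix's embodied carbon by ~35%. Solving
# (1 - 0.4) + 0.4 * x = 0.65 gives x = 0.125, i.e. slag would carry roughly
# one-eighth the embodied carbon of the cement it replaces (inferred).
SLAG_FRACTION = 0.40        # share of cement replaced (from the episode)
SLAG_RELATIVE_CO2 = 0.125   # slag CO2 relative to cement (inferred, see above)

blended = (1 - SLAG_FRACTION) * 1.0 + SLAG_FRACTION * SLAG_RELATIVE_CO2
print(f"carbon reduction vs. pure cement: {(1 - blended) * 100:.0f}%")  # -> 35%
```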
Simon Lesh
And so a double win, it's like an offset benefit.
Stephen Callahan
Yeah, yeah. It's one of these things: if we get a single win, that's great, and there are lots of single wins out there. But it's these double and triple wins, where ideally it's more sustainable and more efficient and we can go faster, that's where the fun is, and that's where I get the energy to keep going on this. It's finding those double and triple wins.
Simon Lesh
So let's unpack that a little bit more, because obviously we're constantly thinking about scalability and reliability and efficiency, and it's all about customers. But you're touching on some of the mental models you're using for how to get these wins, these multiple wins. What's the philosophy here? What's really driving why this happens the way it does?
Stephen Callahan
Well, anyone who's been around Amazon would know there's a set of leadership principles that guide most of our decisions. The one that dictates a lot of what you're talking about is customer obsession. We're trying to obsess about where the customers are going, and we're looking to anticipate their needs of the future. The way we approach this is what I call extreme ownership. We're asking questions of ourselves and of our suppliers that we think will meet the needs of these customers. We've already mentioned that nothing is uninteresting at scale, and that a decision made two years ago may not necessarily be the right decision for now. So we always go through these trade-offs. We want those win-wins: reliable and efficient and readily available and cost-efficient and highly performant. Anytime a scale effect changes, like we're building a new data center type or we're in a new region, we want to relook at some of these things and quickly go through the trade-offs to see: are we obsessing about the right things? Is there anything that was a win-win that's now a win-lose? Is there anything we should do differently? One thing that I'll bring up is some of the sensors we had in the data center, environmental sensors, whether humidity or temperature or vibration; there are lots of different things they can do. The supplier and the equipment we had were giving us telemetry, but not quite at the rate we wanted, or not quite at the fidelity, and we looked at it and went: let's go and do that ourselves. We went so far as to write our own embedded software for these devices, software that not only natively interacts with all of the systems we have ourselves, but gives us the data that we need. Maybe we don't need the temperature sensor, because we have a different set of temperature sensors, or we're looking at IR cameras or something. We're optimizing those things because we want a level of ownership where we can go down the stack as deep as we like. We can then find those win-win scenarios on something like a sensor, which means it's more efficient, it's potentially cheaper because we don't need the more advanced version, and we get greater control over it. That's fodder for Amazon people.
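To illustrate what owning the sensor firmware buys you, here is a minimal sketch; it is hypothetical, not AWS's embedded stack, and the channel names, sampling rate, and transport are all assumptions:

```python
# A minimal sketch of owned sensor firmware: emit exactly the channels you
# need, at the sampling rate you choose, over your own transport, instead of
# whatever the vendor's firmware defaults to. Entirely illustrative.
import json
import random
import time

SAMPLE_HZ = 10                        # fidelity knob the vendor didn't expose
CHANNELS = ("humidity", "vibration")  # temperature dropped: covered elsewhere

def read_channel(name: str) -> float:
    # Placeholder for a real hardware read (e.g., over I2C or Modbus).
    return round(random.random(), 4)

def publish(sample: dict) -> None:
    # Placeholder for the site's native telemetry transport.
    print(json.dumps(sample))

for _ in range(3):  # a real device would loop forever
    publish({"ts": time.time(), **{c: read_channel(c) for c in CHANNELS}})
    time.sleep(1 / SAMPLE_HZ)
```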
Simon Lesh
And it's interesting, because being able to take the time to dive deep into sensor telemetry and those types of things, even if it's third party rather than your own, is, as you say, something you can't necessarily justify as a one-off, but at scale it actually makes sense to do. Tell us more about some of the other electrical and mechanical controls that you've made changes to, how they work, and the benefits you're seeing, because it's fascinating to hear about these little pathways down into things that most practitioners wouldn't have the time to look at, and you've gone deep into them. What are some of the results?
Stephen Callahan
Yeah. I suppose, if I look at a data center design, the standard one is going to have an electrical room, a battery backup room, generators, switchgear. There's a set of components or modules that are necessary to build a data center. And in a world where every area is its own little island, everyone will make a decision that's best for them and then pass it on to the next team. But one thing we can do is look at the entire package. The complexity of the electrical distribution system is always going to be top of mind; it's seen as lots of opportunities for us to improve. So we've done things like look at the number of connections we have from the transformer to the rack, examine each one of them, and ask the question: is this providing the value for the complexity that it brings us? And we went and reduced it, I think from 7 to 5; we took two of them out. We were also able to look at the switchgear within the data center and analyze that in the same way, and I think we took out 50% of it, because the value was not there for the complexity it was bringing. And I'm not talking about dollar value, but the business, the operational, the...
Simon Lesh
The effectiveness, or the impact it has here.
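One way to see why cutting connections from 7 to 5 helps is a simple series-availability model; this is an illustrative sketch, not AWS's actual analysis, and the per-connection availability is a made-up number:

```python
# Illustrative series-availability model (not AWS's analysis): if each
# connection between transformer and rack is independently available with
# probability p, the whole chain is only as available as p ** links.
p = 0.9999  # hypothetical per-connection availability

for links in (7, 5):
    availability = p ** links
    downtime_min_per_year = (1 - availability) * 365 * 24 * 60
    print(f"{links} links: availability {availability:.6f}, "
          f"~{downtime_min_per_year:.0f} min/yr unavailable")
# Two fewer series elements cuts the chain's expected downtime by ~2/7.
```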
Stephen Callahan
And then even things like UPSes and redundancy. A traditional data center will have a UPS room with these batteries, and we look at that as a very large effect radius, or blast radius. If we have a problem in that location, all of these racks get impaired by it. Similarly, if we want to take it offline, then all of these racks no longer have a battery backup, and maybe we need a redundant battery backup room, and everything just stacks in terms of complexity. So we brought those power systems much closer to the rack, and now a lot of racks actually have battery backups in the compute racks themselves. That means we don't have this central, large blast-radius point that's going to affect a lot of compute or a lot of servers if we have problems. It means the system has higher efficiency, because we're able to push all of this down, and we see a lower likelihood of failure; and if there is a failure, it's a faster time to recover. It took a long time and a lot of effort to basically get rid of these rooms, but in doing so, we got three or four wins. And that's what makes it worth the effort.
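The blast-radius argument is easiest to see as a worst-case comparison; the rack count below is an illustrative assumption, not AWS data:

```python
# A minimal sketch of the blast-radius argument. A shared UPS room is one
# failure domain covering every rack it feeds; per-rack batteries shrink
# each failure domain to a single rack. Numbers are illustrative.
RACKS_PER_ROOM = 1000

central_ups_worst_case = RACKS_PER_ROOM  # one UPS fault can impair all racks
per_rack_battery_worst_case = 1          # one battery fault impairs one rack

print(f"racks impaired by a single fault: central UPS = "
      f"{central_ups_worst_case}, per-rack battery = {per_rack_battery_worst_case}")
```

The expected number of failures may be similar either way; what changes is the correlation, so no single fault can take out the whole room's backup at once.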
Simon Lesh
Really interesting. And you touched earlier on cement as well. It's funny, talking about data centers: for most people, and I'm not going to overgeneralize, but for most people thinking about data centers, their mind goes to the racks and the servers, the compute, the battery backup and stuff like that. But there's a lot to be done with the concrete and the building elements. You touched a little bit on the use of slag to replace some of the cement, but there's some other stuff you guys have been doing that's really interesting around the building side of things. Can you unpack that for us?
Stephen Callahan
Yeah. I would say it's one of the trigger phrases: we're always asking questions, and if someone asks, "why do we do that?" and the answer is, "well, it's the way we've always done it," that's just a red flag. Let's go have a conversation about this. And so it's things like: when we're building the frame of the building, the mezzanine will have a steel floor, steel beams that make up the mezzanine. And then we cover that with concrete to give a cement floor on top. Why do we cover it with concrete? The steel is pretty strong. Does that need to be covered in concrete? What benefit is that bringing us? And it turns out that we can actually just not do that. Per data center, someone did the math, and it works out at 115 metric tons of CO2 that we're able to not create by simply not putting a concrete topping on the mezzanine floor. There's no practical effect on the customer in doing this, and again, it's a win-win: it costs less, it causes less CO2, it's one step fewer in our building process, which means it goes faster, and it's one other thing that could go wrong that we no longer have to do. We're avoiding all of these things by asking questions all the way down to: does that concrete need to be there?
Simon Lesh
Yep, yep. That's a pretty fundamental first question that, as you say, is often missed because we just always did it that way. It's interesting hearing some of this stuff, because as regular listeners to the show will know, I always say the phrase "undifferentiated heavy lifting", and in fact, if you make that a drinking game, you're going to be in trouble. But undifferentiated heavy lifting doesn't mean it's not important; that's the heavy lifting part. And it doesn't mean it can't be better. And I guess that's the message here: there's a whole team, many teams, at Amazon working hard each and every day for our customers, to whom this stuff should be completely opaque. They don't know about it, they don't need to know about it, but it's kind of nice to know it's happening, isn't it? And I guess from your perspective, you spend time with customers, hearing what they want from a macro perspective, and you're able to execute on that at a macro level. It's a great story of the concept of ownership and what it means to provide something for our customers.
Stephen Callahan
Yeah, yeah. I know we're talking about the physical data center piece today, but I spend a lot of my time in the network space, and when I talk to those customers in depth about something as trivial as updating the firmware on a transceiver in a network switch, that's way down in the stack, and customers don't think about it. It's not something that's done in the industry. But I'm able to say that because we do this at such a rate across the entire data center, your workloads are interrupted less, and we have fewer failures because of it. At some point, most customers have had a network switch somewhere and have had to deal with optics or something, so when I can equate it to something like that, that's the depth we can go to: we can update all of the transceivers in a network switch at the same time. And I can apply that to anything in the data center: how we manage the transfer switch, how we do the cold plate cooling, all of these aspects that affect more of the data center and the servers. We're able to go to that depth of detail, doing this thing that seems inconsequential on the network switch but actually provides a huge amount of value.
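For readers curious what "doing this at such a rate" without interrupting workloads might look like, here is a hypothetical sketch of a wave-based rollout; the episode describes the practice, not the tooling, so the function names, wave sizes, and device naming are all assumptions:

```python
# Hypothetical sketch of a low-disruption fleet firmware rollout: update in
# growing waves and halt if a wave fails health checks, so a bad image never
# reaches the whole fleet. Illustrative only; not AWS's tooling.
def flash(device: str) -> None:
    print(f"flashing {device}")  # placeholder for the real firmware update

def healthy(device: str) -> bool:
    return True  # placeholder for post-update link and error-rate checks

def update_in_waves(devices: list[str], wave_sizes=(1, 10, 100)) -> None:
    remaining = list(devices)
    for size in wave_sizes:
        wave, remaining = remaining[:size], remaining[size:]
        for d in wave:
            flash(d)
        if not all(healthy(d) for d in wave):
            raise RuntimeError("halting rollout: wave failed health checks")
    for d in remaining:  # all canary waves passed; update the rest
        flash(d)

update_in_waves([f"switch-{i}/xcvr-{j}" for i in range(3) for j in range(8)])
```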
Simon Lesh
Fantastic. Stephen, thanks so much for coming on and joining us today. The message I want our listeners to take away is that there are teams of people like Stephen who are obsessed with this stuff, like super obsessed. And Stephen, it's been great to hear some of the results of that obsession.
Stephen Callahan
Yes. Well, Simon, thank you very much for having me on. I'm glad I found somewhere that lets me get to this level of obsession. And the thing is, it's fun, it's engaging, it's interesting, because there are very few places, very few areas of the data center, that I can go where someone's going to say, no, don't look here.
Simon Lesh
That they've guaranteed a long term engagement.
Stephen Callahan
That's what's going to happen. So yeah, that level of going all the way down the stack is something I absolutely love.
Simon Lesh
Fantastic. Thanks, Stephen. And thanks, everyone, for listening. We do love to get your feedback; aws.amazon.com/podcast is the place to do it. And until next time, keep on building.
AWS Podcast Episode #716: Concrete, Cooling, and Compute: Reinventing Data Centers for the AI Age
Release Date: April 14, 2025
In Episode #716 of the AWS Podcast, hosted by Simon Lesh and featuring guest Stephen Callahan, Senior Principal Engineer at Amazon, the discussion delves deep into the transformative changes AWS is implementing in its data center infrastructure to meet the burgeoning demands of the AI era. The episode offers insights into scaling challenges, innovative cooling solutions, sustainability initiatives, and the underlying philosophies driving these advancements.
Simon Lesh opens the episode by highlighting the often-overlooked yet critical role of data centers in AWS's operations. While AWS typically emphasizes its services accessible via APIs, the physical infrastructure supporting these services remains a cornerstone of their reliability and scalability.
Simon Lesh [00:30]: "Servers have to run somewhere... Technology is made of hardware and software, and those things have to come together."
Stephen Callahan reflects on his 15-year tenure at Amazon, emphasizing the rapid evolution of data center technologies and the continuous innovations required to stay ahead.
Stephen Callahan [00:47]: "It's been a roller coaster... feels like we've been through three or four different variations since then."
The conversation underscores the exponential growth AWS experiences, necessitating a shift in scaling strategies. Callahan humorously notes the progression from gigabits to terabits and now petabits, illustrating the relentless pace of technological advancement.
Stephen Callahan [02:11]: "It's just a different letter, you know. It'll always evolve."
Lesh emphasizes that such scaling efforts benefit customers by ensuring AWS can deliver robust and scalable solutions.
Simon Lesh [00:47]: "Every year is like three to five over here. But that's a good thing for our customers..."
A significant portion of the episode examines how Generative AI (Gen AI) workloads are reshaping data center architectures. Unlike traditional, diverse workloads, Gen AI demands high-density, power-intensive compute clusters with minimal latency. This shift necessitates:
Concentrated Compute Clusters: Entire racks or clusters of racks designed to handle large-scale AI models.
Stephen Callahan [06:00]: "We're talking entire racks are now systems... having that concentration helps these models."
High-Speed Networking: Ensuring low latency and high bandwidth between vast numbers of GPUs.
Stephen Callahan [06:56]: "We're able to get things closer to each other... under a couple of microseconds latency."
As AI workloads escalate power consumption and heat output, AWS is pioneering advanced cooling methodologies to maintain efficiency and sustainability.
AWS has transitioned from traditional cooling methods to multimodal cooling, which dynamically adjusts based on real-time data center needs. This flexibility allows AWS to:
Optimize Energy Use: Combining chillers with evaporative cooling or shutting down chillers when not needed.
Stephen Callahan [09:43]: "We've been moving to systems called multimodal cooling... depending on the needs of the data center at that time."
Integrate Liquid Cooling: Directly cooling GPUs and accelerators to manage concentrated heat loads effectively.
Stephen Callahan [10:00]: "We're now pushing liquid cooling all the way to the GPU or the accelerator itself."
AWS designs data centers tailored to diverse climates, ensuring optimal cooling efficiency worldwide—from temperate regions like Ireland to humid areas like Mississippi and arid deserts in the Middle East.
Stephen Callahan [11:58]: "Each building... has to be the best data center for the scenario that we have."
Sustainability remains a top priority as AWS scales its data centers. The company not only achieved its goal of matching its data centers with 100% renewable energy in 2023 but continues to explore deeper efficiencies.
AWS strategically locates data centers near renewable energy sources—such as solar and wind farms in Mississippi or nuclear plants in Pennsylvania—to minimize carbon footprint and optimize energy sourcing.
Stephen Callahan [13:41]: "Instead of pulling the power from the source and dragging it all the way to Northern Virginia, we're able to put our data centers in those locations."
AWS actively seeks ways to reduce carbon emissions through sustainable building practices. This includes:
Concrete Mix Optimization: Replacing a portion of cement with slag, a byproduct of metal refining, to lower CO₂ emissions.
Stephen Callahan [17:44]: "We reduced the amount of carbon in the cement mix by 35% by substituting 40% of the cement with slag."
Eliminating Unnecessary Concrete: Removing concrete toppings from mezzanine floors to save on CO₂ without compromising structural integrity.
Stephen Callahan [25:12]: "We saved 115 metric tons of CO₂ per data center by not putting a concrete topping on the mezzanine floor."
To further optimize data center operations, AWS has reimagined traditional electrical and mechanical systems:
Electrical Distribution Simplification: Reducing connections from transformers to racks from seven to five to decrease complexity and enhance reliability.
Stephen Callahan [22:50]: "We reduced the number of connections from the transformer to the rack from 7 to 5."
Decentralized Battery Backups: Moving from centralized UPS rooms to integrating battery backups within individual racks, thereby minimizing blast radius and improving recovery times.
Stephen Callahan [23:20]: "Now a lot of racks have battery backups in the compute racks themselves... lower likelihood of failure."
Underlying these technical advancements is AWS's core philosophy centered on customer obsession and extreme ownership. This mindset drives teams to continuously reassess and optimize every facet of data center operations, ensuring they anticipate and meet evolving customer needs.
Stephen Callahan [18:31]: "We're trying to obsess about where the customers are going and we're looking to anticipate their needs of the future."
This approach encourages:
Deep Technical Engagement: AWS engineers, like Callahan, delve into granular details, such as sensor telemetry, to enhance data center performance.
Stephen Callahan [21:12]: "We have the right number of tools in that tool belt to build the best data center for the scenario that we have."
Continuous Improvement: Regularly reevaluating past decisions to align with current scale and technological advancements.
Stephen Callahan [04:14]: "Maybe you'll reevaluate a decision you made in the past, because now suddenly we're 10 or 50 times larger."
The innovations discussed translate into tangible benefits for AWS and its customers:
Enhanced Reliability: Decentralized systems and optimized electrical controls reduce points of failure and improve uptime.
Increased Efficiency: Advanced cooling methods and sustainable materials lower operational costs and environmental impact.
Scalability: Adaptable data center designs facilitate the rapid deployment of AI-focused infrastructures to meet demand spikes.
Stephen Callahan [24:36]: "We got three or four wins... that's what makes it worth the effort."
Episode #716 of the AWS Podcast offers a comprehensive look into the intricate and innovative efforts behind AWS's data centers. Stephen Callahan's expertise illuminates the complexities of scaling infrastructure for AI workloads, the imperative of sustainability, and the relentless pursuit of operational excellence. Through customer obsession and extreme ownership, AWS continues to redefine what modern data centers can achieve, ensuring they remain at the forefront of technology and environmental stewardship.
Simon Lesh [27:44]: "There are teams of people like Stephen who are obsessed with this stuff... It's been great to hear some of the results of that obsession."
Key Takeaways:
Adaptability is Crucial: AWS continuously evolves its data center designs to accommodate emerging technologies like Generative AI.
Sustainability is Integral: Strategic location choices and innovative building materials underpin AWS's commitment to renewable energy and reduced carbon emissions.
Deep Technical Ownership Drives Success: By owning and optimizing every layer of the data center stack, AWS ensures unparalleled reliability and efficiency for its customers.
Customer-Centric Philosophy: AWS's decision-making is deeply rooted in anticipating and meeting the future needs of its vast customer base.
For more insights and updates, visit aws.amazon.com/podcast.