Host
AWS S3 is the world's largest cloud storage service, but just how big is it, and how is it engineered to be as reliable as it is at such a massive scale? Mylan is the VP of Data and Analytics at AWS and has been running S3 for 13 years. Today we discuss the sheer scale of S3 in the data stored and the number of servers it runs on; how, seemingly overnight, AWS went from an eventually consistent data store to a strongly consistent one, and the massive engineering complexity behind this move; what correlated failure, crash consistency and failure allowances are, and why engineers on S3 live and breathe these concepts; the importance of formal methods to ensure correctness at S3 scale; and much more. A lot of these topics are ones that AWS engineering rarely talks about in public. I hope you enjoy these rare details shared. If you're interested in how one of the largest systems in the world is built and keeps evolving, this episode is for you. This episode is presented by Statsig, the unified platform for flags, analytics, experiments and more. Check out the show notes to learn more about them and our other season sponsors.
Interviewer (Gregory)
So Mylan, welcome to the podcast.
Mylan (VP of Data and Analytics at AWS)
Thanks for having me.
Interviewer (Gregory)
To kick things off, can you tell me the scale of S3 today?
Mylan (VP of Data and Analytics at AWS)
Well, if you want to take a step back and just think about S3, it is a place where you put an incredible amount of data. And so right now, S3 holds over 500 trillion objects, we have hundreds of exabytes of data, and we serve hundreds of millions of transactions per second worldwide. And if you want another fun stat, we process over a quadrillion requests every single year. And what's under the hood of all that is also pretty amazing scale if you think about what's underneath the hood of S3. Fundamentally, we're disks and servers, which sit in racks, and those sit in buildings. And if you try to think about all of the scale of what is under the hood, we manage tens of millions of hard drives across millions of servers. And that is in 120 availability zones across 38 regions, which is pretty amazing if you think about it.
Interviewer (Gregory)
So deep down, it all starts with hard drives sitting inside servers sitting inside racks. And then you have a bunch of these racks and then rows of them, buildings of them. Right? And that's what you said. So there's tens of millions of hard drives deep down at the bottom of this.
Mylan (VP of Data and Analytics at AWS)
That's right. In fact, if you think about the scale of this, if you imagine stacking all of our drives one on top of another, it would go all the way to the International Space Station and just about back. I mean, it's kind of a fun visual to have for us who work on the service. But fundamentally, it's really hard to get your brain around the scale of S3. And so a lot of our customers, they don't. They assume the scale is there. They assume that all of the drives are always there, and they just focus on what S3 is to them, which is: it just works. It just works for any type of data and all of your data.
Interviewer (Gregory)
Yeah, I mean, even for me, the scale. When you talk about exabytes, I actually had to look up exabytes, because I know a petabyte, which is already massive. If a company has one or two or three petabytes of data, it's tons. And an exabyte is, yes, a thousand petabytes. And you told me that you're thinking at that level. It's just hard to fathom.
Mylan (VP of Data and Analytics at AWS)
Yeah, I mean, we have individual customers that have exabytes of data in what they call a data lake. Although last week I heard a great term. We had the Sony Group CEO talk about what Sony is doing with data, and they refer to it as a data ocean: not a data lake, but a data ocean. And so, if you have exabytes of data in your data lake, it is in fact a data ocean. And that ocean is kind of fundamentally S3.
Interviewer (Gregory)
Can you tell me how S3 started? I did some research and there was a story about a distinguished engineer sitting in a pub in Seattle. Who knows if it was true or not, but the story I read was that he was a bit frustrated with engineers at Amazon building a lot of infrastructure again and again.
Mylan (VP of Data and Analytics at AWS)
Yeah, if you think back, S3 development really started in 2005 and we launched as the first AWS service in 2006. And if you think about the technical problems of 2006, a lot of customers were building things like e-commerce websites, right, like Amazon.com. And so the engineers at Amazon knew that they had a lot of data that at the time was very unstructured data. It was PDFs, it was images and it was backups. And they needed a place where they could store that at an economic price point that let them not think about the growth of storage. And so they built S3 and they really built it for a certain type of storage. And so the original design of S3 in 2006 was really anchored around eventual consistency. And the idea of eventual consistency is that when you put data in storage for S3, we're not going to give you an ack back on your put unless we actually have your data. So we have your data. But the eventual consistency part is that if you were to list your data, it might not show up because it's being eventually consistent. It's there, but it might not show up on a list. And we built that consistency model at the time because we were really optimizing for things like durability and availability. And it worked like a champ for e-commerce sites and things like that. Because when a human was interacting with an e-commerce site and an image happened to not show up exactly at the moment where you put the data into storage, it was okay because a human would just refresh. And so when we launched in 2006, here's a fun fact for you: 2006 is actually when Apache Hadoop first began as a community as well. And so we had a set of what I think of as frontier data customers like Netflix and Pinterest who took a look at things like Hadoop and they put it together with the economics and the attributes of S3, which is, you know, unlimited storage with pretty good performance at a great price point.
And they decided to build what we first began to call data lakes. At the time they decided to extend the idea of unstructured storage and include things like tabular data. And so the first wave of frontier data customers were adopting quote-unquote data lakes in about 2013 to 2015. Those were the frontier data customers born in the cloud. And around 2015 to, I would say, 2020, we started to see all the enterprises take that same data pattern of how can I use S3, the home of all the unstructured data on the planet, and extend it to tabular data. And that's when, about five years ago, in 2020, I started to see a ton of exabytes of, basically, Parquet files. And I have worked on S3 for a minute. I started working on S3 in early, I guess it was 2013. I'd been at AWS since 2010, so kind of a while. And the rise of Parquet was really interesting because what people did is they said, oh, okay, I like the traits and the attributes of S3 and I want to apply it to a table. And so I am going to run my own Parquet data in S3. And then around, I would say, 2019, 2020, we started to see basically the rise of Iceberg. And Iceberg at this point is an incredibly popular OTF (open table format), and it gives the table attributes to the underlying Parquet data. And customers started to do it in many of my largest data lakes across different industries and different customers. And so one of the things that we did in 2024 is we introduced S3 Tables.
Interviewer (Gregory)
Just for those who don't know what Iceberg is: it's an open source data format for massive analytic workflows, right?
Mylan (VP of Data and Analytics at AWS)
That's right. If I ask our customers of these data oceans why they care so much about Iceberg, it's because they want to be able to have what a lot of customers are calling a decentralized analytics architecture, where they can have lines of business or different teams within their company that pick what type of analytics to use as long as it's Iceberg compliant. And so if Iceberg is the common metaphor for tabular data, then you have flexibility and choice for what type of analytics engines you use in a decentralized analytics architecture. And so I think that's one of the reasons why Iceberg has just taken off: it makes it easy to use data at scale. But it also gives a business owner, the chief data officers or the CTOs of the world, future proofing for analytics. They can replace their analytics, they can change it out, they can adopt new types of analytics and AI, because you have this Iceberg at the bottom turtle of S3. We launched S3 Tables in December 2024. This year we've had over 15 new features that we've added to S3 Tables. And then this year, of course, we launched the preview of S3 Vectors in July, and then last week we were generally available. And so, you know, the story of S3, it's like a story that our customers have written for data. But it's been super fun to work on all these different evolving attributes.
Interviewer (Gregory)
As an engineer, what is the kind of basic architecture and the basic terminology I should know about when I'm starting to work with S3?
Mylan (VP of Data and Analytics at AWS)
When we first launched in 2006, the whole goal for S3 was to provide a very simple developer experience. And we've really tried to stick with that. In fact, when the engineers and I are sitting around talking about what we build next, we always go back to that idea of how do you make it really simple to use S3. And so fundamentally, we have a lot of different capabilities now, but S3 is really about the put and the get: the put of the storage in and the get of the storage out, and where we can do that really well at scale. That is kind of the heart of S3. Now we have a ton of extra capabilities that we've launched over time. But fundamentally, when customers think about using S3, they think about the put and the get.
Interviewer (Gregory)
Yeah, so like put data, get data, and I guess some of the other like operations. It's a bit like HTTP, right. There's also delete, list, copy, a few kind of other like, I guess primitives.
Mylan (VP of Data and Analytics at AWS)
There is. And if I think about where we have gone over time, we've added capabilities on top of that, just based on what developers are trying to do. Okay, let's just take put. We recently added a set of conditionals to the put capability. Last year we did put-if-absent and put-if-match. This year we did a copy-if-absent or a put-if-match, and we did delete-if-match. And the core thing for us with conditionals is that we can give developers the capability of doing things like the put, but based on the behaviors of their application.
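Editor's note (illustrative sketch, not part of the conversation)
The conditional operations mentioned above can be pictured with a small in-memory model. This is only a sketch of the semantics as described in the conversation: `ToyBucket`, `PreconditionFailed`, and the `if_absent` / `if_match` parameter names are hypothetical and are not the S3 API. The ETag here is assumed to be a content hash purely for illustration.

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when the condition attached to a write does not hold."""


class ToyBucket:
    """Hypothetical in-memory stand-in for a bucket (not the real S3 API)."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    @staticmethod
    def _etag(body: bytes) -> str:
        # Assumption for this sketch: the ETag is a hash of the content.
        return hashlib.sha256(body).hexdigest()

    def get(self, key):
        return self._objects.get(key)

    def put(self, key, body, if_absent=False, if_match=None):
        current = self._objects.get(key)
        # put-if-absent: only succeed when nothing is stored under the key.
        if if_absent and current is not None:
            raise PreconditionFailed(f"{key!r} already exists")
        # put-if-match: only succeed when the stored ETag is the one we read.
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(f"{key!r} changed since it was read")
        etag = self._etag(body)
        self._objects[key] = (etag, body)
        return etag

    def delete(self, key, if_match=None):
        current = self._objects.get(key)
        # delete-if-match: refuse to delete a version we have not seen.
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(f"{key!r} changed since it was read")
        self._objects.pop(key, None)


bucket = ToyBucket()
etag = bucket.put("leader", b"node-a", if_absent=True)      # first writer wins
etag = bucket.put("leader", b"node-a-renewed", if_match=etag)  # safe update
```

The point of the conditions is that two writers racing on the same key cannot silently overwrite each other: the loser gets an explicit failure and can re-read and retry.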
Interviewer (Gregory)
Outside of the get and put, the basic operations. I guess the base terminology that you should just know about is the buckets, objects and keys, right? That's how we think about our data.
Mylan (VP of Data and Analytics at AWS)
Yeah. And now it's not just objects. If you think about the two latest primitives or building blocks we've introduced as native to S3, one of them is the Iceberg table with our S3 Tables and the other one is vectors. And under the hood of an S3 table is a set of Parquet files that we're managing on your behalf. But that's not the case for vectors. A vector is just basically a long string of numbers, and that is a new data structure for us. And it's sitting in S3 just like your objects.
Host
Mylan was talking about the building blocks of S3, like put, get, tables and vectors. Speaking of primitives for building applications leads nicely to our season sponsor, WorkOS. WorkOS is a set of primitives to make your application enterprise ready: primitives like single sign-on authentication, directory sync, MCP authentication, and many others. One feature does not make an app enterprise ready. Rather, it's the combination of primitives all together that solves enterprise needs. When your product grows in scale, you can always reach for new building blocks or infrastructure from places like AWS or similar. Similarly, when you need to go upmarket and sell to larger enterprises, WorkOS provides the application-level building blocks that you need for this. WorkOS has seen the edge cases and the enterprise complexity and solved this for you, so you can focus on your core product. One example of such a building block is adding authentication to your MCP server. This is a typical screen when you are about to authenticate with an MCP server. If you had to build it from scratch, it gets pretty complex to set up the OAuth flows behind the scenes. But with WorkOS, it's a few simple steps. Add the AuthKit component to your project, configure it via the UI, then you just direct clients of your MCP server to authorize via AuthKit, verify the response you get via some code, and that's pretty much it. This is the power of well-built primitives. To learn more, head to workos.com. And with this, let's get back to S3 and how it all started.
Interviewer (Gregory)
So I'd like to go back to the beginning of S3. When it was launched, it was pretty shocking for the broader community, because S3 launched with a pricing of 15 cents per gigabyte per month, which was about a third to a fifth of the price of anything else. The going rate at the time was something like 50 cents or 75 cents. And I read that on the first day something like 12,000 developers signed up immediately. A lot of companies immediately or very quickly moved over. And then the surprising thing was that S3 kept cutting prices. It was unheard of before. You were there in the 2010s when some large price cuts happened. Can you tell me what the thinking was inside the S3 team on this unusual pricing? It seemed customers would have been willing to pay more. And also the cutting of prices continuously, even today. I think today it's something like 2 or 3 cents for the same storage that was 15 cents at launch.
Mylan (VP of Data and Analytics at AWS)
Yeah, you know, I think part of this goes back to what the goal is for S3. Okay. And so the mission of S3 is to provide the best storage service on the planet. Okay. And our goal too is that if you think about the growth of data, IDC says that data is growing at a rate of 27% year over year. But I have to tell you, we have so many customers that are growing so much faster than that.
Interviewer (Gregory)
Yeah, I was about to say, it sounds pretty low.
Mylan (VP of Data and Analytics at AWS)
I know. But that's on average across everything. We have a lot of customers that grow at twice or three times that rate. But if you think about that, okay, you think about all the data that's being generated from sensors, from applications, from AI, from all these different.
Interviewer (Gregory)
From just taking photos, I mean, every day, right?
Mylan (VP of Data and Analytics at AWS)
Photos. That's right. And if you think about your phone too, think about the resolution and how the resolution of the cameras on the phone has grown. You just have this, kind of what Sony talked about with the data ocean. Okay. And in order to have all that data and to grow it, you have to be able to grow it economically. You have to be able to grow it at a price point where you don't really think, okay, what data am I going to delete now because I'm running out of space. You don't have that conversation with S3 customers because of two things. One is, we do lower the price of either storage or the capabilities of what we're doing. Like, for example, we lowered the cost of compaction for S3 Tables pretty dramatically within a year after launching S3 Tables. But it's not just that. It's the overall total cost of ownership of your storage. We give you the ability to tier into archive storage, right? We give you the ability to do something called Intelligent-Tiering, which is, if you don't touch your data for a month, we'll give you an automatic discount on that data. Because we're watching your storage, if you don't touch it much, we'll give you up to a 40% discount on that storage. And it's like dynamic discounting, so you don't even have to think about it. And so our whole goal is that you can grow the data that you need to grow. Because we know that's being used to pre-train models, we know it's being used to fine-tune and do any type of post-training of AI. We know you're using it for analytics, we know you're using it for all these different things, either now or in the future. And so our goal is that you can keep your data and you can use it in a way that advances whatever the thing is that you're doing, whether it's life sciences or you're an enterprise in manufacturing. Right? Whatever you need, the data should be there and you should be able to grow it and keep it and use it any way you want.
Interviewer (Gregory)
Yeah, I did want to ask you about this part. So there's Intelligent-Tiering, which was launched in 2018, so 12 years after S3 was launched. One thing that really got my attention is Amazon Glacier, which launched in 2012, so a long time ago. You can store data that you don't need immediate access to. You're okay waiting for some time to get access to it, I think maybe even hours. When it launched, it was only $0.01 per gigabyte per month, and back then the going rate for storage was something like $0.15. So almost 10 times cheaper. How do you do that? What is the architecture and thinking behind how you're able to have this trade-off of, look, if you don't need your data quickly, we can do it a lot cheaper? How could I imagine the kind of trade-offs that you and the engineering team were thinking of making?
Mylan (VP of Data and Analytics at AWS)
Well, as you know, I mean, you're an engineer yourself, engineering is about constraints, right? And that is the fun part about working on S3: when you think about the constraints that we have for availability, and the constraints that we have around the cost of storage, we start to get really, really creative. Okay? And in S3, because we build all the way down to the metal of the drives and the capabilities that we have in our hardware, we're able to drive efficiencies at every single part of our stack. And so our engineers, when they get together and they talk about the constraints, they talk about the design goals, we'll do something like, we'll set a target for the cost of a byte and we'll drive for that, and we'll drive for it at every single part of the process. And the part of the process that we are also including is the data center: how are our data center technicians able to operate the service of S3 from a hardware and a data center perspective? Like the physical buildings, just like we do the same thing for the software and the layers of S3 itself. And when you have that ability to run across the whole stack all the way down to the physical buildings, and we're thinking so deeply about the cost and the lifetime of every byte, you're able to do things like Glacier.
Interviewer (Gregory)
You mentioned something really interesting, that when S3 started it was eventually consistent, which means that data eventually arrives. It might not be there and you might be behind, and there's a lot of things that you can do with this, and it gives you some constraints. But you mentioned that the reason the team launched this way was because durability and availability were more important, and I assume, of course, cost as well. But during those initial phases, while S3 was eventually consistent, what kind of benefits does it give to have eventual consistency? Is it a cost constraint? Is it just easier to do high availability systems from an engineering perspective?
Mylan (VP of Data and Analytics at AWS)
Well, from an engineering perspective, the main optimization was availability. It was not necessarily durability, but it was availability. Okay. So if you take a step back and look at the original design of S3, we were really focused very hard on availability. So let's take a step back. Okay. So when you talk about consistency, it's the property where the object retrieval, the object get, reflects the most recent put to that same object. Okay. And so if you think about what parts of the system of S3 that really hits, a lot of it just kind of starts with our indexing subsystem. The indexing subsystem in S3 holds all of your object metadata, so that's like its name, its tags, its creation time. And our index is accessed on every single get or put or list or head or delete, any API call like that. And so every single data plane request where you go back into our storage system to go get an object goes through our index. And if you think about it, more requests go through our index than our storage system because, for example, it's serving things like head requests and list requests that don't actually end up going back into our storage system at all. Those are metadata or index requests. So if you think about our indexing system, we have a storage system in there, okay. And that is a really central concept: a storage system in the middle of our index system.
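Editor's note (illustrative sketch, not part of the conversation)
The split between the index and the storage system described above can be sketched with a toy store that counts which subsystem each request touches. Everything here (`ToyStore`, `Metadata`, the counters) is a hypothetical model for illustration, not how S3 is implemented; it only shows why head and list requests load the index without reaching object storage.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Metadata:
    size: int
    created: float
    tags: dict = field(default_factory=dict)


class ToyStore:
    """Hypothetical object store with a separate metadata index."""

    def __init__(self):
        self.index = {}   # key -> Metadata (name, tags, creation time)
        self.blobs = {}   # key -> bytes (the object data itself)
        self.index_requests = 0
        self.storage_requests = 0

    def put(self, key, body, tags=None):
        # A put writes the bytes and records the metadata.
        self.index_requests += 1
        self.storage_requests += 1
        self.blobs[key] = body
        self.index[key] = Metadata(len(body), time.time(), dict(tags or {}))

    def get(self, key):
        # A get consults the index first, then fetches the bytes.
        self.index_requests += 1
        if key not in self.index:
            raise KeyError(key)
        self.storage_requests += 1
        return self.blobs[key]

    def head(self, key):
        # Metadata only: never touches the blob storage.
        self.index_requests += 1
        return self.index[key]

    def list(self, prefix=""):
        # Metadata only: never touches the blob storage.
        self.index_requests += 1
        return sorted(k for k in self.index if k.startswith(prefix))


store = ToyStore()
store.put("photos/cat.jpg", b"\xff\xd8", tags={"kind": "image"})
store.head("photos/cat.jpg")
store.list("photos/")
store.get("photos/cat.jpg")
```

In this toy run the index sees every one of the four requests, while the blob storage sees only the put and the get.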
Interviewer (Gregory)
So you need a storage system for your index system, right?
Mylan (VP of Data and Analytics at AWS)
That's right. And so we have to configure and size the system to deliver on our design promise for both availability and durability. Okay. And so the data in our index system is stored across a set of replicas, and it uses basically a quorum-based algorithm. Okay. And a quorum-based algorithm tends to be very forgiving to failures. And so if you think about how we implemented quorum in our index system, we start first from servers that are running in separate availability zones. And the reason we do that is that it lets us avoid correlation on a single fault domain. Okay. And so the failure of, like, a single disk, a server rack, a zone only affects a subset of data. It never affects all of the data for a single object or even a majority of the data for a single object, which we have sharded across a wide spread of servers. So this core of availability for us is this idea that we spread everything. And so when a read comes in, it's coming into the S3 front end, and we just heavily cache objects across our system. When a read comes in, it could route at random and you could create a situation where you're creating an inconsistent read. And so when we have quorum at the index storage layer, we can see reads and writes overlap, but in the cache they don't because we're optimizing for availability.
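Editor's note (illustrative sketch, not part of the conversation)
The idea that quorum reads and writes overlap can be made concrete with a textbook quorum scheme. This is a generic sketch, not S3's actual algorithm: three replicas (one per hypothetical zone), a write quorum of 2 and a read quorum of 2, with a version number per key. The class names and zone names are made up for the example.

```python
class Replica:
    def __init__(self, zone):
        self.zone = zone
        self.up = True
        self.data = {}  # key -> (version, value)


class QuorumIndex:
    """Textbook quorum replication: any read set overlaps any write set."""

    def __init__(self, zones=("az-1", "az-2", "az-3"), write_quorum=2, read_quorum=2):
        self.replicas = [Replica(z) for z in zones]
        n = len(self.replicas)
        # R + W > N makes every read set intersect every write set;
        # 2W > N makes successive write sets intersect as well.
        assert read_quorum + write_quorum > n and 2 * write_quorum > n
        self.write_quorum = write_quorum
        self.read_quorum = read_quorum

    def _live(self):
        return [r for r in self.replicas if r.up]

    def write(self, key, value):
        live = self._live()
        if len(live) < self.write_quorum:
            raise RuntimeError("write quorum unavailable")
        # Next version is one past the highest version any live replica holds.
        version = 1 + max((r.data.get(key, (0, None))[0] for r in live), default=0)
        for replica in live:
            replica.data[key] = (version, value)
        return version

    def read(self, key):
        live = self._live()
        if len(live) < self.read_quorum:
            raise RuntimeError("read quorum unavailable")
        answers = [r.data[key] for r in live[: self.read_quorum] if key in r.data]
        if not answers:
            return None
        # The highest version among the replies is the most recent write.
        return max(answers, key=lambda entry: entry[0])[1]


index = QuorumIndex()
index.write("bucket/key", "v1")
index.replicas[0].up = False       # one zone fails
index.write("bucket/key", "v2")    # still succeeds with two replicas
index.replicas[0].up = True        # the stale replica comes back
index.replicas[2].up = False       # a different zone fails
latest = index.read("bucket/key")  # the overlap still surfaces "v2"
```

A single zone failure never blocks reads or writes here, and the stale replica cannot win a read because at least one member of every read set saw the latest write.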
Interviewer (Gregory)
So just so I understand the first part, the eventual consistency, correct me if I'm wrong: it means that you can just write to all these distributed nodes and you ask one of them, and if it doesn't have it, no problem, because it will be eventually consistent. You now have high availability because you don't need to worry about all of them being in sync.
Mylan (VP of Data and Analytics at AWS)
That is correct.
Interviewer (Gregory)
And that's phase one of AWS, and it gives you availability. And now you're explaining how you're able to, behind the scenes, turn this into a strongly consistent system. But strong consistency means that it's guaranteed to have the whole system's state, which is hard to do because you could have distributed failures, et cetera.
Mylan (VP of Data and Analytics at AWS)
And this replicated journal, it took us a while to build, I won't lie. We don't talk about this stuff very much, okay? Because this is kind of the secret sauce of S3. But again, our engineers who are in the room, they were thinking about how do you deliver strong consistency without compromising availability. So I go back to constraints, okay? In that case, we were not trading off consistency and availability anymore. And so the engineers had to come up with a new data structure. We did the same in S3 Vectors, which is basically a new data structure that we came up with as well. But what we had to invent for strong consistency at S3 scale, without relaxing the constraint of availability, is this replicated journal, okay? And the replicated journal is basically a distributed data structure where we're chaining nodes together so that when a write is coming into the system, it's flowing through the nodes sequentially, okay? And so a read or write in a strongly consistent system for S3 flows through these storage nodes in the journal sequentially. And so every node is forwarding to the next node. And when the storage nodes get written to, they learn the sequence number of the value along with the value itself. And therefore, on a subsequent read, like through our cache, the sequence number can be retrieved and stored. And so now you have this strongly consistent and highly available capability in S3, and the heart of that is actually this replicated journal.
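Editor's note (illustrative sketch, not part of the conversation)
The description above (nodes chained together, each forwarding to the next, each learning the sequence number along with the value) resembles a chain of replicas. The sketch below is a minimal single-process model of that idea only; `JournalNode` and `ReplicatedJournal` are hypothetical names, and nothing here reflects the real implementation, failure handling, or networking.

```python
class JournalNode:
    """One link in the chain: stores the entry, then forwards it."""

    def __init__(self, name):
        self.name = name
        self.next_node = None
        self.log = []     # ordered list of (seq, key, value)
        self.latest = {}  # key -> (seq, value)

    def append(self, seq, key, value):
        # The node learns the sequence number together with the value.
        self.log.append((seq, key, value))
        self.latest[key] = (seq, value)
        if self.next_node is not None:
            return self.next_node.append(seq, key, value)
        return seq  # the last node in the chain acknowledges the write


class ReplicatedJournal:
    def __init__(self, length=3):
        self.nodes = [JournalNode(f"node-{i}") for i in range(length)]
        for node, following in zip(self.nodes, self.nodes[1:]):
            node.next_node = following
        self._next_seq = 0

    def write(self, key, value):
        # Writes enter at the head and flow through every node in order,
        # so all nodes see the same entries in the same sequence.
        self._next_seq += 1
        return self.nodes[0].append(self._next_seq, key, value)

    def read(self, key):
        # Read from the tail: anything visible there has reached every node.
        return self.nodes[-1].latest.get(key)


journal = ReplicatedJournal()
journal.write("bucket/key", "v1")
journal.write("bucket/key", "v2")
seq, value = journal.read("bucket/key")
```

Because every write passes through the nodes in one order, the sequence number gives readers a simple way to tell which of two values is newer.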
Interviewer (Gregory)
Okay, but what's the catch? Because there's always something with trade-offs. So one, you obviously have more complicated business logic. And then I guess the second obvious question is, what about failures? Because in the case of eventual consistency, you don't worry too much about one failure. But in this case, what if a node in the sequence fails, either the first time or later? How does the system monitor this and recover? Because I guess that's going to be the tricky part, right?
Mylan (VP of Data and Analytics at AWS)
There's another piece to this puzzle that we implemented, which is, you know, it's basically a cache coherency protocol. And the idea is that this is where we built what we think of as a failure allowance, where in this mode we needed to retain the property that like multiple servers can receive requests and some are allowed to fail. And so it's kind of this combination of this replicated journal as a new data structure, plus we implemented this new cache coherency protocol that gave us a failure allowance. And those two things working in concert gave us this strong consistency. I will say too, this does come at some actual cost.
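Editor's note (illustrative sketch, not part of the conversation)
The conversation does not spell out the cache coherency protocol, so the following is only one plausible reading, stated as an assumption: each cache server keeps the sequence number next to the cached value and checks it against the journal's latest sequence number before answering, and a request can be served by any cache server that is still up. `Journal`, `CacheServer`, and `read_any` are hypothetical names.

```python
class Journal:
    """Minimal stand-in for a journal that hands out sequence numbers."""

    def __init__(self):
        self._seq = 0
        self._latest = {}  # key -> (seq, value)

    def write(self, key, value):
        self._seq += 1
        self._latest[key] = (self._seq, value)
        return self._seq

    def latest_seq(self, key):
        entry = self._latest.get(key)
        return entry[0] if entry else 0

    def read(self, key):
        return self._latest.get(key)


class CacheServer:
    def __init__(self, journal):
        self.journal = journal
        self.entries = {}  # key -> (seq, value)
        self.up = True
        self.hits = 0

    def get(self, key):
        if not self.up:
            raise ConnectionError("cache server is down")
        cached = self.entries.get(key)
        # Assumed coherency check: serve from cache only if the cached
        # sequence number is still the latest one the journal knows about.
        if cached is not None and cached[0] == self.journal.latest_seq(key):
            self.hits += 1
            return cached[1]
        entry = self.journal.read(key)
        if entry is None:
            self.entries.pop(key, None)
            return None
        self.entries[key] = entry
        return entry[1]


def read_any(servers, key):
    """Failure allowance: any live server may answer the request."""
    for server in servers:
        try:
            return server.get(key)
        except ConnectionError:
            continue
    raise RuntimeError("no cache server available")


journal = Journal()
a, b = CacheServer(journal), CacheServer(journal)
journal.write("k", "v1")
a.get("k")                 # fills the cache on server a
journal.write("k", "v2")   # server a's entry is now stale
a.up = False               # and server a fails
value = read_any([a, b], "k")
```

In this model a stale entry is never served, because the sequence check fails and forces a refresh, and losing one cache server only shifts the request to another.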
Interviewer (Gregory)
I was about to say, like, nothing is free in engineering, right?
Mylan (VP of Data and Analytics at AWS)
There's hardware cost in this because you can imagine we've done some more engineering behind the scenes, but I remember sitting in the room with our engineers on S3 and we did a debate on this. We debated it, we said, there's costs, there's actual costs to the underlying hardware for this and do we pass it along to customers or not? And we made that explicit decision not to. We said, really? Yeah, we said that when we launch this, we should launch strong consistency. We should make it free of charge to customers and it should just work for any request that comes into S3. We shouldn't sort of say it's only available on this bucket type or what have you. This should be true for every request made to S3. And part of that mindset for S3 is like, how can we provide these type of capabilities and how can we make it something that becomes a building block, like part of the building block of S3? And you shouldn't have to think about the cost of it.
Interviewer (Gregory)
This was the very surprising thing of this launch, by the way, that suddenly AWS said, okay, everything is strongly consistent. It does not cost you more. Latency-wise, your latencies shouldn't have changed significantly. I mean, I'm sure when you roll out initially you do your measurements, et cetera, but that was the promise. And that was why I couldn't really believe it when I reread the history, because it typically doesn't happen. Typically strong consistency does add latency or it increases costs. There's always these trade-offs. And I mean, it sounds like you either swallowed the costs or costs caught up, but it's very unusual.
Mylan (VP of Data and Analytics at AWS)
So if I think about that, one of the things that was also very important for us, and we haven't really talked about this as much, but we think about it a lot on the S3 team is correctness. So it's one thing to say that you're strongly consistent on every request, it's another thing to know it. And so when we built this strong consistency, you know, I talked about our new caching protocol, I talked about this replicated journal as a new data structure, you know, that took a little bit of time to do and to get right. But at S3 scale, we could not say that we were strongly consistent unless we actually knew we were strongly consistent. Okay, and so what does that mean? How do you do that at S3 scale when everybody is using it for every last workload? In fact, one of the reasons why people use it is because our scale is such that we're de correlating workloads and you can run absolutely anything on S3. But how do you know?
Host
Mylan just talked about how strong consistency made it so much easier to trust S3. Trust is something that is just as important when writing code, especially when, with AI, we write more code than before. And this is a good time to talk about our season sponsor, Sonar. What is the impact that AI is having on developers? Let's look at some data. A new report from Sonar, the State of Developer Survey report, found that 82% of developers believe they can code faster with AI. But here's what's interesting. In this same survey, 96% of developers said they do not highly trust the accuracy of AI code. This checks out for me as well. While I write code faster with AI agents, I don't exactly trust the code they produce. This really becomes a problem at the code review stage, where all this AI-generated code must be rigorously verified for security, reliability and maintainability. SonarQube is built precisely to solve this code verification issue. Sonar has been a leader in the automated code analysis business for over 17 years, analyzing 750 billion lines of code daily. That's over 8 million lines of code per second. I actually first came across Sonar 13 years ago, in 2013, when I was working at Microsoft and a bunch of teams already used SonarQube to improve the quality of their code. I've been a fan since. Sonar provides an essential and independent verification layer. It's the automated guardrail that analyzes all code, whether it's developer or AI generated, ensuring it meets your quality and security standards before it ever reaches production. To get started for free, head to sonarsource.com/pragmatic. And with this, let's get back to the importance of strong consistency at AWS.
Mylan (VP of Data and Analytics at AWS)
How do you know that you're strongly consistent? And that is why we used automated reasoning.
Interviewer (Gregory)
What is automated reasoning? For those of us who are not as familiar with this, which will be most people outside of very few domains, like S3.
Mylan (VP of Data and Analytics at AWS)
Yeah, I mean, S3 uses automated reasoning all over the place. Okay. And automated reasoning is a specialized form of computer science. Okay. And Gregory, if you kind of think about it, if computer science and math got married and had kids, right, it would be automated reasoning.
Interviewer (Gregory)
Is it formal methods or based on formal methods?
Mylan (VP of Data and Analytics at AWS)
That's exactly it.
Interviewer (Gregory)
Oh yeah. I mean, I studied computer science, so yeah, that's fun. So it's actually proper formal methods that you're using.
Mylan (VP of Data and Analytics at AWS)
That is right. And we use formal methods in many different places in S3. But one of the first places that we adopted them was for us to feel good that we actually had delivered strong consistency across every request. So what we did is we proved it, right? We basically built a proof for it, and then we incorporated our proof on check-ins into this index area that I talked about, right, where you have your caching and then you have your storage sublayers of the index capabilities. And so when anybody is working on our index subsystem now and they're checking in code into the code paths that are being used for consistency, we are proving through formal methods that we haven't regressed our consistency model.
Interviewer (Gregory)
And can you just give us a rough idea? Because the formal methods that I have studied were pretty abstract: things like designing languages, how to have different operators, and of course there is some math involved as well. But what are the primitives, like servers, network, et cetera, and the models being built, the data flows? How can I imagine a simple proof of something inside S3, roughly, at a really high level?
Mylan (VP of Data and Analytics at AWS)
Yeah, I mean, if you go back to the fundamental notion of a proof, you are proving something to be correct. Okay. And so the places that we use these proofs: we use them in consistency, where we built a proof across all the different combinatorics to make sure that the consistency model is correct. We use them in cross-region replication to prove that a replication of data from one region to another arrived. And we use them in different places within S3 to prove the correctness of an API. In all of these cases, you know, we talk about durability, we talk about availability, we talk about cost. But just as strong a design principle for us across S3 is correctness. It's the correctness of a thing, an API request, an operation, as it were. And the key thing for us too is that you don't want to just prove it once. You want to prove it on every single check-in and you want to prove it on every single request, so you can validate and verify that you are in fact doing what you say you do. And I think for us, at a certain scale, math has to save you, right? Because at a certain scale you can't do all the combinatorics of every single edge case. But math can save you and help you on this at S3 scale. And so we use formal methods in many different places of S3. We have some research papers too. I can send you some links to some research papers where we talk about this.
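To make "a proof across all the different combinatorics" a little more concrete, here is a toy sketch in Python. It is not S3's proof, model, or tooling, and it is exhaustive enumeration of a tiny model rather than a formal proof. It assumes a hypothetical cache in front of storage, one writer, and one reader handling a cache miss, and it checks a single property over every interleaving of their steps: once a write is acknowledged, the visible value must be the new one. All names (`w_store`, `r_fill_checked`, and so on) are made up for illustration.

```python
from itertools import combinations

OLD, NEW = ("old", 0), ("new", 1)  # (value, version) pairs


def initial_state():
    return {"storage": OLD, "cache": None, "tmp": None, "acked": False}


# Writer: store the new value, invalidate the cache, then acknowledge.
def w_store(s):
    s["storage"] = NEW


def w_invalidate(s):
    s["cache"] = None


def w_ack(s):
    s["acked"] = True


# Reader handling a cache miss: load from storage, then fill the cache.
def r_load(s):
    s["tmp"] = s["storage"]


def r_fill_naive(s):
    s["cache"] = s["tmp"]  # may install a value that is already stale


def r_fill_checked(s):
    # Only fill if storage has not moved on since the load.
    if s["storage"][1] == s["tmp"][1]:
        s["cache"] = s["tmp"]


def visible_value(s):
    """What a fresh read would return: the cache entry if present, else storage."""
    entry = s["cache"] if s["cache"] is not None else s["storage"]
    return entry[0]


def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        ai, bi = iter(a), iter(b)
        yield [next(ai) if i in slots else next(bi) for i in range(total)]


def find_violations(reader_steps):
    """Return the schedules in which an acknowledged write is not visible."""
    writer_steps = [w_store, w_invalidate, w_ack]
    violations = []
    for schedule in interleavings(writer_steps, reader_steps):
        s = initial_state()
        for step in schedule:
            step(s)
            if s["acked"] and visible_value(s) != "new":
                violations.append([f.__name__ for f in schedule])
                break
    return violations


print(len(find_violations([r_load, r_fill_naive])))    # 2
print(len(find_violations([r_load, r_fill_checked])))  # 0
```

With only ten interleavings, brute force is enough here. The point Mylan makes is that at real scale the state space is far too large to enumerate by hand, which is where formal tools come in, and rerunning such a check on every check-in is what guards against regressions.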
Interviewer (Gregory)
Yeah, please do. And we will put it in the show notes below so anyone can check it out, because I think it's really interesting. I feel formal methods are not really a thing yet in a lot of startups, even infrastructure startups. But it sounds very reassuring to me to actually have an ongoing proof of that. And speaking of which, I want to ask about one thing that is related to this: durability. Amazon S3 has very, very high durability promises. I think it's 11 nines, which I had to double check, because in backend systems, whenever you say three nines, it's like, eh. When you say four nines of availability (availability, we're not talking durability), four nines is already hard to achieve, and beyond that it just gets very expensive. And I have never heard of eleven nines of durability. Now, this is durability and not availability. One question that I got when I shared the statistics publicly, one thing people were asking, and I was also thinking, is: how can you prove that, not just in a formal way? You're now storing, as you said, 500 trillion objects, which is now large enough that just by this durability promise you might be losing some of them. Do you actually validate it on the actual data as well, outside of the proof? Because I assume in the proof you will have assumptions on hardware failure rates, which might or might not be true. So my question is: at Amazon S3 level, when you look at whether you are living up to, for example, the durability promise, how do you go about that, and what are your findings?
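The arithmetic behind that question can be sketched in a few lines of Python. This is a naive reading, not a measured loss rate: it assumes that "eleven nines" means each object independently has a 1 in 10^11 chance of being lost in a given year, and it uses the "over 500 trillion objects" figure from earlier in the conversation. The function name is made up for illustration.

```python
from fractions import Fraction


def expected_annual_losses(object_count: int, nines: int) -> Fraction:
    """Expected objects lost per year if each object independently has a
    10**-nines chance of being lost in a year (an illustrative assumption)."""
    annual_loss_probability = Fraction(1, 10 ** nines)
    return object_count * annual_loss_probability


objects = 500 * 10 ** 12  # "over 500 trillion objects", from the conversation
print(expected_annual_losses(objects, 11))  # 5000
```

So under this simple model the design target alone would allow on the order of thousands of objects per year at this scale, which is why the question of measuring actual durability, rather than only deriving it, is a fair one.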
Mylan (VP of Data and Analytics at AWS)
Yeah, so we just spent a lot of time talking about our index subsystem because that is the subsystem that is related to consistency. But when you think about durability, I mean you think about it all at different levels of the S3 stack, but we really think about it in the storage layer. And so if you think about it in the storage layer, you have this design, this promise of the design here. And underneath that is a combination of things. It's software, but it's also the physical layout of where our data is across everything that we have in S3. And one of the things that I talked about is that we have disks and servers which sit in racks, which sit in buildings, and we have tens of millions of these hard drives. We have millions of servers and we have 120 availability zones across 38 regions.
Interviewer (Gregory)
Yeah, and two availability zones are two physically separate locations.
Mylan (VP of Data and Analytics at AWS)
Just to be clear, they're physically separate, and sometimes they're a ways away from each other. And in some of our regions we have more than three availability zones. I mean, the availability zones give us a different domain, a fault domain. If I were to think about durability, I think the most important thing for us is our auditors. So you think about a distributed system. We talked about the PUT and the GET. We have many, many, many microservices that are all doing one or two things very well in the background. Okay. And so we have many different varieties of health checks, but we also have repair systems and we have auditor systems. And our auditor systems go and they inspect every single byte across our whole fleet. And if there are signs that there is repair needed, you know, another repair system will come into place. And in the world of distributed systems, these are all microservices working together, loosely coupled but communicating through well-known interfaces. And so that collection of systems, which is over 200 microservices now, all sits behind one S3 regional endpoint. And a fair number of those subsystems, those microservices, are all dedicated to the notion of durability.
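As a rough picture of the auditor-and-repair idea, here is a minimal Python sketch. It is an assumption-laden toy, not how S3's auditors work: it pretends an object is stored as full copies in a few named locations, with a SHA-256 checksum recorded at write time. The location names and function names are hypothetical, and the conversation does not describe S3's actual encoding or repair protocol.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def audit(replicas: dict[str, bytes], expected: str) -> list[str]:
    """Return the locations whose bytes no longer match the recorded checksum."""
    return [loc for loc, data in replicas.items() if checksum(data) != expected]


def repair(replicas: dict[str, bytes], expected: str) -> list[str]:
    """Overwrite damaged copies from a healthy one; return what was repaired."""
    damaged = audit(replicas, expected)
    healthy = [data for data in replicas.values() if checksum(data) == expected]
    if damaged and not healthy:
        raise RuntimeError("no healthy copy left to repair from")
    for loc in damaged:
        replicas[loc] = healthy[0]
    return damaged


# Hypothetical object with three copies, one of which silently rots.
original = b"customer bytes"
recorded = checksum(original)
replicas = {"disk-a": original, "disk-b": original, "disk-c": original}
replicas["disk-b"] = b"customer bytez"

print(audit(replicas, recorded))   # ['disk-b']
print(repair(replicas, recorded))  # ['disk-b']
print(audit(replicas, recorded))   # []
```

Keeping `audit` and `repair` as separate functions mirrors the description above of separate services that each do one thing and communicate through well-known interfaces.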
Interviewer (Gregory)
So. So they will go and check and log and report back. So do I understand correctly that in any given time frame at S3, someone or some people or some systems can actually answer the question of what is our durability? The past week, month, year and so on?
Mylan (VP of Data and Analytics at AWS)
Yes.
Interviewer (Gregory)
Okay, great. So you can actually verify your durability promise and check if the math is mathing.
Mylan (VP of Data and Analytics at AWS)
Yes. And you know, part of our design is that at any given moment in this conversation that you and I have had just today, we're having servers fail, because servers fail. And so what we've built in S3 is built on an assumption that servers fail. And so a lot of our systems are always, you know, first of all, checking to see where any failure might hit an individual node, how it affects a certain byte, and what repair needs to automatically kick into place. And so this system is constantly moving behind the scenes, if you will. And that is a completely separate thing from the GET and the PUT. The GET and the PUT are what the customer sees. There's this whole universe under the hood of how we manage the business of bytes at scale.
Interviewer (Gregory)
I'm just thinking, because for a lot of us engineers who are building moderately sized systems, I'll say, compared to S3 (they can already be big), a failure is a big deal. Like, you know, a machine going down. I have a small side project, and my storage filled up and it started to give errors. And this is a big deal because it rarely happens to me. This is the first time it happened in three years. But I understand that in your business, when you work at S3 scale, this is just every day. And the question is not when, it's just how often. How do you deal with it? I guess it's a different world.
Mylan (VP of Data and Analytics at AWS)
It is a different world. And the trick is to really think about correlated failure. Okay, so if you're thinking about availability at any scale, it's the correlated failure that'll get you.
Interviewer (Gregory)
And what is a correlated failure?
Mylan (VP of Data and Analytics at AWS)
Okay, so that's super interesting. So if you think about what I talked about with, you know, eventual consistency, we talked about quorum, okay? And with quorum it is okay for one node to fail, but if all of the nodes go south, for example, and they're in the same availability zone or on the same rack, then you're really going to be messing with the availability of the underlying storage. Okay, you just lost your failure allowance that I talked about with the cache, because they all fail together. And so a correlated failure is an incredibly important thing to think about when you're thinking about availability. And so when we're designing around correlated failures, the thing that we have to think about is: how are those workloads exposed to different levels of failure? So when you upload an object to S3 with a PUT, we replicate that object. Okay? We don't just store one copy of it, we store it many times. And that replication is important. It's important for durability. But what's interesting about it is that it's also important for availability, because if any of those correlated failure domains fail, like if a whole AZ fails, there's still a copy somewhere else and the data is still available somewhere, even though an availability zone has failed or a rack has failed or a server has failed and so forth. Okay, and so that idea of how you manage and design around correlated failures, with both our physical infrastructure as well as our logical infrastructure, is super important for S3, for both availability and durability. We also do things like think about something called crash consistency. I mean, Gregory, you can tell I can go on and on about this, so you just have to stop me.
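The difference between independent and correlated failure can be shown with a very small Python sketch. It assumes a simplified picture in which an object is stored as whole copies, each copy lives in one fault domain (for example, an availability zone), and a read needs some minimum number of surviving copies. The domain names and function names are hypothetical, and S3's real placement logic is not described in the conversation.

```python
def surviving_copies(placement, failed_domain):
    """Count the copies that are not in the failed fault domain."""
    return sum(1 for domain in placement if domain != failed_domain)


def tolerates_any_single_failure(placement, required_copies):
    """True if losing any one whole fault domain still leaves enough copies."""
    return all(
        surviving_copies(placement, failed) >= required_copies
        for failed in set(placement)
    )


stacked = ["az-1", "az-1", "az-1"]  # three copies, one fault domain
lopsided = ["az-1", "az-1", "az-2"]  # two copies share a domain
spread = ["az-1", "az-2", "az-3"]   # one copy per domain

# With a quorum of 2 out of 3 copies:
print(tolerates_any_single_failure(stacked, 2))   # False
print(tolerates_any_single_failure(lopsided, 2))  # False
print(tolerates_any_single_failure(spread, 2))    # True
```

All three placements store the same number of copies. Only the one that avoids putting copies in the same fault domain keeps its quorum when a whole domain goes down, which is the sense in which "it's the correlated failure that'll get you."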
Interviewer (Gregory)
No, but this is the interesting stuff.
Mylan (VP of Data and Analytics at AWS)
All right, so the whole idea of crash consistency is that any system that you build should always return to a consistent state after a fail-stop failure. And if you can do things like reason about the set of states that a system can reach in the presence of failure, and you just always assume the presence of failure, then you also assume the presence of consistency and availability, and then you just design all of these different microservices to all work together in an underlying capability like S3. That's what our engineers do. They think about crash consistency, they think about correlated failures. You know, they think about failure allowances and caches. Right? And it's all that deep distributed systems work that our engineers come in every day to work on.
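"Reasoning about the set of states a system can reach in the presence of failure" can be illustrated with a toy Python sketch. It assumes a hypothetical update that must change two fields together (the invariant is `a == b`), simulates a fail-stop crash after every possible prefix of the update's steps, runs recovery, and reports which crash points leave the state inconsistent. The intent log used here is a generic write-ahead-style technique chosen for illustration; the conversation does not say how S3 itself achieves crash consistency.

```python
def fresh_state():
    return {"a": 0, "b": 0, "log": None}


# Steps of the update. Each one is assumed to be atomic and durable.
def log_intent(s):
    s["log"] = 1  # record the value we are about to write


def set_a(s):
    s["a"] = 1


def set_b(s):
    s["b"] = 1


def clear_log(s):
    s["log"] = None


def recover_nothing(s):
    pass  # no recovery information is available


def recover_from_log(s):
    # Roll forward: finish any update whose intent was logged.
    if s["log"] is not None:
        s["a"] = s["b"] = s["log"]
        s["log"] = None


def consistent(s):
    return s["a"] == s["b"]


def broken_crash_points(steps, recover):
    """Crash after each prefix of steps, recover, and list the bad crash points."""
    bad = []
    for crash_after in range(len(steps) + 1):
        s = fresh_state()
        for step in steps[:crash_after]:
            step(s)
        recover(s)
        if not consistent(s):
            bad.append(crash_after)
    return bad


print(broken_crash_points([set_a, set_b], recover_nothing))  # [1]
print(broken_crash_points([log_intent, set_a, set_b, clear_log], recover_from_log))  # []
```

The unlogged update has exactly one bad crash point, between the two writes. The logged version has none, because recovery can always finish what was started.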
Interviewer (Gregory)
Can we talk about how you think about failure allowances? Because again, there is a concept of error budgets outside in other companies as well. I feel it's a bit like loosely handled, whereas I feel this is kind of your bread and butter. So what is a failure allowance and how do you measure it and what do you do if you overstep it or overspend it.
Mylan (VP of Data and Analytics at AWS)
Yeah, I mean, I think that the idea of a failure allowance is that you want to have it. Like, you have to have it. If you assume that you'll never have a failure, you'll actually have a very bad day for your customer. And so we account for failure allowances. But the most important thing is, let's just talk about the failure allowance in our cache. So how do we manage that? Well, we manage it in such a way that you'll never experience it, because we size it right. And you're sizing the cache and you're making sure that the underlying capabilities and the hardware are always there. And, like I talked about, those distributed subsystems, those microservices that are all interoperating under the hood, we have a ton of them that do nothing but track metrics. Right. And, you know, the sizing of our cache is all related to the metrics and the size of our underlying system.
Interviewer (Gregory)
All the metrics.
Mylan (VP of Data and Analytics at AWS)
Yeah, yeah, that's right. And so one of the really big benefits of running on S3 is because our system is so huge, you have these massive, you know, layers, right? And the massive layers are all managing things like correlated failures and failure allowances. And because they are so huge at the scale of S3, any application that's sitting on top of S3 gets the benefit of it.
Host
Let's take a minute's break from S3 to talk about a one-of-a-kind event I'm organizing for the first time: the Pragmatic Summit, in partnership with Statsig. Have you ever wanted to meet standout guests from the Pragmatic Engineer podcast, plus folks from cutting-edge tech companies, and learn about what works and what doesn't in building software in this new age of AI? Come join me on the 11th of February in San Francisco for a very special one-day event. The Pragmatic Summit features industry legends and past podcast guests like Laura Tacho, Kent Beck, Simon Willison, Chip Huyen, Martin Fowler and many others. We'll also have insider stories on how engineering teams like Cursor, Linear, OpenAI, Ramp and others built cutting-edge products. We'll also have roundtables and a carefully curated audience where everyone is interesting to meet and chat with, something I'm hoping will make this event extra special. Seats are limited, and you can apply to attend at pragmaticsummit.com. Talks will be recorded and shared, and paid subscribers will get early access afterwards as well, as a thank you for your additional support. I hope to meet many of you there, and I am so excited about this event. And now let's jump back to S3 and the massive scale of the service.
Interviewer (Gregory)
To get a sense of what the reality is like working as an engineer, or an engineering leader, inside an organization like this: I read a quote from a distinguished engineer, Andy Warfield. I'm just quoting what he said: "Early in my career, I had this sort of naive view that what it meant to build large scale commercial software, that it was basically just code. The thing I realized very quickly working on S3 was that the code was inseparable from the organizational memory and the operational practices and, you know, the scale and the scale of the system." Since you've now been more than a decade in S3, how do you think of this beast, this really complex system, with hundreds of microservices and data that is hard to fathom unless you think of the hard drives stacking all the way to the space station? And how do, you know, engineers kind of wrangle this? Because it does feel a bit intimidating, I'm not going to lie.
Mylan (VP of Data and Analytics at AWS)
Well, I think so much of this just comes back to the culture and the commitment on the team. I've worked on S3 for a very long time now, and I have such deep respect for the engineering community on S3. And you know, honestly, I mean, this is true for all of the services in our data and analytics stack. But we have engineers in S3 and they come in every single day with this deep commitment to the durability and availability and the consistency of your bytes. And so the type of conversations that we have are so interesting, because we have people who are early out of school, we have engineers who've been working on S3 for 15 years, and everything in between. With the creativity and the invention of S3, you have this tension. On one side, you have to be very conservative with S3, and we have this principal engineering tenet called Respect What Came Before. And that's an Amazon engineering tenet, which is, if it has worked for many, many years, you have to respect that. But then there's also this other tenet. These two tenets are a little bit in tension with each other, which is kind of what makes it so fun. That Amazon engineering tenet is called Be Technically Fearless. And I believe that the S3 engineers are just amazing at this, at respecting what came before. Because if we build new capabilities in S3, we have to maintain the properties, the traits of S3, which is that it just works and you get that durability, availability, et cetera. But at the same time, we have to be technically fearless, because our ability to go into the world of conditionals, our ability to go into the world of native support for Iceberg or for vectors, means that we are extending this foundation of storage in a way that helps customers build whatever application they need now and in the future.
And so that combination of the two things, that is sort of when I think about our S3 engineering team, I think they come in every day and they embody that.
Interviewer (Gregory)
Now, going back to the evolution of S3 from unstructured to structured data, you were mentioning how Hadoop, the data warehouse, was a big use case that customers started to run on top of S3. And then at S3, you noticed what a lot of customers, or some of your biggest customers, were doing, and then you kind of built it yourself with more structured data. And then S3 Tables came along, and then Vectors. Would you mind sharing a little bit more on how you evolve S3? Because this was another question that came up when I asked people what they'd like to know about S3. One of the questions was, like, is it done, is it finished, or is it still evolving? Because there is this notion that S3 can store anything already, right? Any object, any blob. What new thing is there? And yet we have a lot of new things.
Mylan (VP of Data and Analytics at AWS)
Yeah. And if you kind of go back in time a little bit and you think about, you know, the rise of Parquet, okay, so the rise of Parquet data in S3 started about 2020, and we started to see more and more people store their tabular data in S3. And if you think about what Iceberg provided, it provided a replacement for Hive. Okay, so if you think about Hive and Hadoop, Hive was basically giving you file system access into S3 unstructured storage. Iceberg is giving you that tabular access, including, you know, the compaction and all the table maintenance that goes along with it, into your Parquet data. And I actually think that the world's tabular data is going to live in S3 in the future. And if you just think about the launch that, for example, Supabase did last week: Supabase announced that their Postgres database is now just going to do secondary writes directly into an S3 table, just like their Postgres extension for vector is going to integrate directly with S3 Vectors. And so if the world of databases, if the world's data at the source, if you will, goes directly into an Iceberg S3 table, what does that mean for the world's data? Okay, so SQL, as we know, is a lingua franca of data. And the world's LLMs have all been trained on decades of SQL and therefore.
Interviewer (Gregory)
And Python, SQL and Python and the.
Mylan (VP of Data and Analytics at AWS)
Stuff that's already out there. And so if you think about this, you know, we have many, many AWS customers who know the S3 API pretty darn well by this point. It's a pretty simple API, but now you have the ability to interact with data in S3 through SQL. And what that means is that you don't have to be somebody who's building cloud applications or who knows S3, you just need to know SQL.
Interviewer (Gregory)
And this is with S3 tables, right?
Mylan (VP of Data and Analytics at AWS)
With S3 tables. And so you can just write SQL into an S3 table. And whether you're an AI agent or a human, right, you're introducing the lingua franca of data as a native property of S3 with S3 Tables. And I think you're just going to see that take off in the upcoming years.
Interviewer (Gregory)
And your latest launch is S3 Vectors. Can you share a little bit about what it takes to build a new data primitive like vectors, just behind the scenes: how long it takes, how the teams come together, and maybe what are some engineering challenges of launching something like this? And again, we're talking about vectors, right? So, like, you can use embeddings. Whenever you have LLMs, you create an embedding. It's a vector. You want to store that somewhere, you will need to do search on it. There are specialized vector databases, there are specialized vector additions, etc. So I'm assuming this is the function that S3 Vectors supports very nicely.
Mylan (VP of Data and Analytics at AWS)
Yeah. And you know, I mean, today a lot of customers use vector databases. Just like back in the day, a lot of people put their, you know, tabular data in databases. Okay. And they just used the structure of the database in order to take advantage of being able to query their data. But they didn't really need to use a database, they just put it in a database. And then S3 came along, and we introduced this way, with the help of open formats like Apache Parquet, of being able to store that structured data in S3. That's kind of what we're doing with vectors right now. And if you think about vectors, vectors are basically a bespoke data type. A vector, at the end of the day, is a very, very long list of numbers. And vectors have been around for a long time, and they've been in vector databases for a while, but they really kind of took off in people's, you know, data worlds in the last couple of years with the rise of, as you said, the embedding models. Okay. And so if you take a step back and you think about one of the great ironies of data, it is that you have to know your data to know your data, right? You have to know what your schema is, you have to know what the data types are, you have to know where it is. And as these data lakes become data oceans, you have this situation where it gets harder and harder to know what's in your data, right? And the beautiful thing about embeddings is that embedding models will understand your data so that you don't have to understand your data. And the format in which these embedding models put this semantic understanding of your data is, in fact, a vector. And so when we talk to customers, they're so excited about how these embedding models are getting better and better, and they want to apply more and more semantic understanding to the underlying data, whether it's unstructured or structured, that they have in storage. And so they kind of want to store billions of vectors.
Interviewer (Gregory)
But just to say, when you say they want to understand, correct me if I'm wrong, but hypothetically, you have a bunch of text data or maybe some image data, and you're saying that a lot of people, customers, teams, would like to write queries to say, like, hey, can you find an image that looks like a puppy? Or can you find an article that contains this or this? And embeddings, as we know, are great for that, but then you need to actually create the embedding, build a system, et cetera. Right?
Mylan (VP of Data and Analytics at AWS)
Yeah. And, like, exactly what you're saying. I mean, if you think about what vectors can do, if you think about all the data that a given company has, you know, your knowledge across your business or your knowledge across your life isn't organized into rows and columns like a database. It's in PDFs, it's in your phone, right? It's in audio customer care recordings, which capture the sentiment of how a customer actually feels about their interaction with you. It's whiteboards. By the end of this day, this whiteboard is totally filled up with ideas. And it's in documents across dozens of systems. And so it's not that you don't have data. You have tons of data. But understanding what data you have across all of those different formats is a real problem. And it's one that AI models can help you with. And so the capabilities of those AI models have gotten so much better in the last 18 to 24 months. But we needed a place to put billions of vectors, billions of, you know, the semantic understanding of relationships. And that's what we built S3 Vectors for. The state-of-the-art embedding models combined with the ability to have vectors across S3 is a really important part. And it's not a database. I mean, it's the cost structure and scale of just S3, but it's for vector storage.
Interviewer (Gregory)
And then, do I understand that right? Did you need to build new primitives to store this, like going down to the metal and figuring out exactly where you do this? Or did you build it on top of your existing primitives as well, like blob storage, etc.?
Mylan (VP of Data and Analytics at AWS)
It's actually a new primitive. And so, you know, we had talked about S3 Tables. S3 Tables is building on objects, because those individual Parquet files, at the end of the day, are objects. Vectors are totally different. So with vectors, we built a new data structure, a new data type. And, you know, it turns out that when you're searching for the closest vector in a very high-dimensional space, which is basically vector space, it's often really hard to find the nearest neighbor. And so, in a database, you essentially have to compare against every vector in the database. And that's often super expensive. And so in S3, because we aren't storing all of our vectors in memory, we're storing them on our fleet of S3, a very large fleet, we still need to provide super low latency. And in our launch last week, you were getting about 100 milliseconds or less for a warm query to our vector space, which is actually pretty fast. It's not database fast, but it's pretty fast. And the way that we do that is we precompute a bunch of, think of them as vector neighborhoods. Okay. And so it's basically a cluster, a bunch of vectors that are clustered together in similarity, like a type of dog, as an example. These vector neighborhoods, if you will, are computed ahead of time, offline and asynchronously, so that when you're doing your query, it's not going to impact your query performance. And then every time a new vector is inserted into S3, the vector gets added to one or more of these vector neighborhoods based on where it's located. And so when you are executing a query on S3 Vectors, there's a much smaller search that's done to find the nearest neighborhoods. And it's just the vectors in those vector neighborhoods that are loaded from S3 into fast memory. That's where we apply the nearest neighbor algorithm. And it can result in really good sub-100-millisecond query times.
And so, you know, if you think about the scale, S3 will give you up to 2 billion vectors per index. You think about the scale of an S3 vector bucket, which is up to 20 trillion vectors, and you think about that combined with 100 milliseconds or less for warm query performance, and that just opens up what you can do with creating a semantic understanding of your data and how you can query it.
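The "vector neighborhoods" idea resembles a well-known family of approximate nearest-neighbor techniques in which vectors are grouped around precomputed centroids and a query scans only the closest groups. The Python sketch below illustrates that general technique only. It is not the S3 Vectors implementation or API: the class name, the hand-picked centroids, the `n_probe` parameter, and the choice to put each vector in exactly one neighborhood (the conversation says "one or more") are all assumptions made for brevity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NeighborhoodIndex:
    def __init__(self, centroids):
        # Centroids stand in for neighborhoods computed ahead of time, offline.
        self.centroids = centroids
        self.neighborhoods = [[] for _ in centroids]

    def _ranked(self, vector):
        """Neighborhood ids, closest centroid first."""
        return sorted(
            range(len(self.centroids)),
            key=lambda i: sq_dist(vector, self.centroids[i]),
        )

    def insert(self, key, vector):
        # Simplification: each vector joins only its single closest neighborhood.
        self.neighborhoods[self._ranked(vector)[0]].append((key, vector))

    def query(self, vector, k=1, n_probe=1):
        # Load only the n_probe closest neighborhoods, then search within them.
        candidates = []
        for i in self._ranked(vector)[:n_probe]:
            candidates.extend(self.neighborhoods[i])
        candidates.sort(key=lambda item: sq_dist(vector, item[1]))
        return [key for key, _ in candidates[:k]]


index = NeighborhoodIndex(centroids=[(0.0, 0.0), (10.0, 10.0)])
index.insert("beagle", (1.0, 1.0))
index.insert("poodle", (0.0, 2.0))
index.insert("truck", (9.0, 10.0))
index.insert("bus", (11.0, 9.0))

print(index.query((1.0, 1.5), k=2))  # ['beagle', 'poodle']
```

The trade-off in this family of techniques is that a query can miss a true nearest neighbor that sits just across a neighborhood boundary. Probing more neighborhoods (a larger `n_probe` here) recovers accuracy at the cost of loading and comparing more vectors.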
Interviewer (Gregory)
It sounds very interesting and also challenging because you have to build this for scale from day one. I guess that's one of the, I guess, benefits and curses of working at S3, that everything that you launch you need to prepare for what will be extreme data elsewhere. But here it's just Monday.
Mylan (VP of Data and Analytics at AWS)
We have S3 service tenets as well. And one of the tenets, and one phrase that I use all the time, and our engineers do too, is "scale is to your advantage." So if you are an engineer and you think about that, and one of your tenets for anything you build is that scale must be to your advantage, it just changes how you design. It means that you can't actually build something where the bigger you get, the worse your performance gets, or the worse some attribute gets. It has to be constructed so that the bigger you get, the better your performance gets. The bigger S3 gets, the more decorrelated the workloads are that run on S3. That is a great example of scale being to your advantage. And so when we build vectors, just like we build everything in S3, we ask ourselves: how can we build this such that scale is to our advantage? How can we build this such that 100 milliseconds or less is just the start of the performance that we're going after? And how can we make sure that the more vectors we have in storage, the better the traits of S3 for vectors?
Interviewer (Gregory)
I have a different question, about the limitations of S3. I read that the largest object you can store in S3 is 50 terabytes. Why is there a limit on the largest object? I mean, I think we can imagine this will be across multiple hard drives and so on. But why did you decide to have a limit? I'm just interested more in the thought process of how the team comes up with, like, okay, this will be the limit, and this is why.
Mylan (VP of Data and Analytics at AWS)
I mean, I think, first of all, that limit of 50 terabytes is 10 times greater than what we launched with. We launched with 5 terabytes and now we're at 50 terabytes. And sometimes we sit and tell customers that and they go, what am I going to store that's going to be 50 terabytes? And we're like, high-resolution video, right? And so, you know, if you think about size limits, generally speaking, we do try to optimize for certain patterns. And when you raise the size of an object by 10 times, like we did, we're just optimizing for the performance and scale of the underlying systems. It's like we increased the scale of our batch operations by 10 times last week too. And the idea behind that is that, with the underlying systems, we're just optimizing for distributions of work that are the new norm for how people are doing things. And we'll just keep on changing. We don't have too many limits, to be honest, but we'll just keep on, you know, looking at what customers are doing across the distribution of workloads and seeing if there's something that needs to be changed. The big thing for us, you know, again, is that we did have a lot of conversations with customers and they're like, really? Like, I don't have that many individual objects that are that big. But with the increase of cameras and phones and things like that, we are seeing more and larger objects, and we just wanted them to be able to grow unfettered in S3.
Interviewer (Gregory)
And so how does S3 evolve, and how has the roadmap changed? Because so far, what I picked up from everything that you told me is: well, our customers were doing this or that. And obviously, here, you live and breathe data. So you see the patterns, you see stats, you see the objects, you also talk with customers. Is it only you talking with customers, seeing what's happening, what they're struggling with, what they're using more of, and then deciding to improve that? May that be the limits, may that be figuring out that we need a new data type because they're now building their own data types on top of it? Or is there also some kind of, all right, here's a vision, here's a roadmap of what we'll do?
Mylan (VP of Data and Analytics at AWS)
It's a great question. And in fact, one of the things that we talk about all the time is the coherency of S3, right? And so there are certain things that people always expect from S3. It's the traits of S3. It's the durability and availability attributes that we talked about. And so a fair amount of engineering goes on under the hood for that. And it's a set of capabilities that we may or may not have talked about today. In fact, if I think back to 2020, I think we've launched over a thousand new capabilities since 2020 in S3. And some of them are what we think of as the 90% of the roadmap, which is what people ask for explicitly. And so, for example, some of our media customers want the bigger object size, and so we delivered that. We have other customers that do a lot with batch operations. But then we have some things that we invent because we look at what customers are doing with the data and we ask ourselves, how can we build that? Vector kind of falls into that category. For vector, when we looked at S3 and how S3 is evolving, we told ourselves, look, we can continue to make S3 the best repository for data on the planet, and we will. We have engineers that come in every day working to make that so. But there's this other element of how do you make sure that the data that you have is in fact usable? And how do you make sure that it's usable in a way that's industry standard, like that Iceberg layer on top of our tabular data? But it's also usable because AI models have now gotten so good at embeddings that you can have AI give you a semantic understanding of your data, if only you had the cost point of putting billions of vectors into storage so you could actually understand and use your data in a different way.
And so for us, a lot of it is kind of taking a step back and looking not just at what customers ask us for, but we want to remove the constraint of the cost of data, which is what we do in S3. And we want to remove the constraint of working with your data, which is what we do in S3 too. And when we can do both of those things, if we can make it possible that your data grows as your business needs it, and you can tap into all the capabilities that you're getting with AI and how the world is changing for data, then we have a shape. We call it a product shape. Then we have a product shape.
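Editor's sketch: the idea of getting a "semantic understanding" of data from embeddings comes down to storing one vector per object and ranking objects by how close their vectors are to a query vector. The snippet below is a minimal illustration of that idea only, not how S3 implements vector storage. The object keys, the three-dimensional vectors, and the function names are all hypothetical; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat as unrelated
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k keys whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical toy index: object key -> embedding vector (made-up values).
index = {
    "videos/beach.mp4": [0.9, 0.1, 0.0],
    "photos/sunset.jpg": [0.8, 0.3, 0.1],
    "docs/invoice.pdf": [0.0, 0.1, 0.9],
}

# Hypothetical query embedding, e.g. for the text "seaside holiday".
query = [1.0, 0.2, 0.0]
print(top_k(query, index))
```

This brute-force scan compares the query against every stored vector, which is fine for a toy index; at billions of vectors a service would presumably rely on an approximate nearest-neighbor index instead, which is outside what the conversation covers.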
Interviewer (Gregory)
Product shape. What's a product shape?
Mylan (VP of Data and Analytics at AWS)
It's sort of like an emerging... like, when I think about S3, I think of it as almost like this living, breathing organism where the shape of the product is evolving, but it's evolving with coherency around what you expect for the traits of S3. And it's evolving in a way that lets you steer into how you want to use data. And how do you want to use data? Not just now, but in the future. And we will continue to evolve the product shape of S3 based on what you want to do with data. And so in a lot of ways, we're sort of transcending the boundaries of what object storage was or what a database traditionally was, because now we have tabular formats, we have conditionals, and we're evolving into this new shape. And it is ultimately uniquely S3.
Interviewer (Gregory)
It kind of sounds like you have all these microservices. It's kind of evolving almost like a plant or a living organism.
Mylan (VP of Data and Analytics at AWS)
No, yes. I am in fact a former Peace Corps volunteer in forestry. And so a lot of times I will go back to the natural world for my metaphors. And yeah, I mean, S3 is this living, breathing repository of data that lets people do things with data that they never thought possible.
Interviewer (Gregory)
It's just interesting because I think as engineers, we don't often think to relate the systems that we build to a living organism, when in fact, I mean, obviously there's code, but as you said, there's people, there's servers, there's failures that now happen at a cadence. You can probably predict how many hard drives are failing today, in fact, at your scale. Do you think it's because of the scale? When things become large enough, do they start to have these characteristics? Because what I find fascinating talking to you is that the way engineering works inside of S3 feels very different from how it works inside a smaller organization, your kind of startup, which again does terabytes of data or maybe even a few petabytes, but that's kind of it. And you've seen some of these organizations. What changes at this large scale? What do you think makes it feel so different, the world that you and the S3 teams work in?
Mylan (VP of Data and Analytics at AWS)
It does. But in order for us to sustain the traits of S3 and to evolve it over time, we have to constantly go back to simplification. We have a very complex system with all of our different microservices. But I kind of go back to this: those microservices have to do one or two things really well, and we have to stay true to that. Otherwise the complexification of a distributed system becomes unmaintainable over time. And for S3, there's this concept of, okay, there's a "simple" in S3, and the "simple" in S3 is a couple of things. One, it's the simplicity of the user model, where not only do you have a simple API, but now you have the simplicity of using SQL with S3, or you have the simplicity of being able to leverage these AI embedding models, which makes semantic understanding of your data so much easier than having to annotate a whole metadata layer. And so that concept of simplicity is in the user model of S3. But under the hood, if you sit in on any of our engineering meetings, you will hear our engineers talk about how we make sure that we implement this capability with the greatest simplicity that we possibly can.
Interviewer (Gregory)
Speaking of which, what type of engineers do you typically hire to work at S3 in terms of what kind of traits, potentially past experience do you look for?
Mylan (VP of Data and Analytics at AWS)
Well, we hire all kinds of engineers. We have a lot of engineers on S3 who are early career: they're straight out of school, or they're at an undergrad or graduate school. And like I said, we have a ton of engineers who have been on S3 for a long time, and everything in between. I think there's a really strong element of ownership in our teams that work on data. People feel this personal sense of commitment. I feel it every day I come in: a personal sense of commitment to your byte, to the preservation of your byte, to the usefulness of your byte, to the ability for you to think about what your application does next and not the types of storage that you need or how you grow it. And that deep sense of ownership and that deep sense of commitment is a very, very common thread across our data teams, because we know that at the end of the day, every modern business is a data business. And everything that people are trying to do with traditional systems, AI, whatever, is based on your data shaping the core of your application experience. And so that data is our responsibility and we feel it very deeply.
Interviewer (Gregory)
And what would your advice be to, let's say, a mid-career software engineer, someone who has a few years of experience working at different places, who, after listening to this, gets really enthusiastic and decides, one day I'd love to work on a deep, strong infrastructure team like S3? For more experienced folks, what are experiences or activities that you might look for that might help you consider these folks more?
Mylan (VP of Data and Analytics at AWS)
There's a strong value in relentless curiosity. I talked a little bit about coloring within the lines and how, when you work on S3 or a large-scale distributed system which continues to reinvent what storage means, you're not really coloring within the lines. You're taking a step back and you're saying, I will draw what the lines are today, and I will know that I might have to rub those out and draw new lines in the future for wherever things go. And so, you know, I have three kids: two in university and one in grad school. And that is one thing that I think is really important: to always take a step back and take a look at the latest research. And some of the papers that I'll share with you are around how we either took formal methods and brought them into storage systems, or thought about failure in a different way. That creativity, that relentless curiosity and that creativity with engineering, I don't think you can go wrong with that. I think the next generation of software, no matter if it's built in S3 or elsewhere, is all driven by the creativity of the engineering mind, and it is in all of us. We just have to kind of unlock it and unleash it, and we will build amazing things like S3.
Interviewer (Gregory)
And I also love that with S3, not only has S3 created something that did not exist, and I think like it just was unimaginable because it didn't exist. But now I'm hearing startups that are building on top of S3. I think TurboPuffer is a good example. You know, they're a bit of innovation because now they have a base layer and I feel there's different levels of innovation. You decide where you want to innovate, at the very lowest level, one level higher, and so on. And you just use the right primitives. Right. In your case, this is just doing hardware and storage better than anyone. In the other layers, it'll be using the right primitives better than anyone.
Mylan (VP of Data and Analytics at AWS)
Yeah, it's very exciting for us to see so many different types of infrastructure built on S3 now.
Interviewer (Gregory)
And as closing, what is a book or a paper that you would recommend reading, that you enjoyed and why?
Mylan (VP of Data and Analytics at AWS)
I read a lot of different papers. I am fascinated by how quickly embedding models are evolving now. And in particular, a field of science that I'm quite interested in is the multimodal embedding model. Because as you know, the world that we experience is multimodal, and therefore the understanding that we have of data should be multimodal as well. And so there's this whole field of science that's emerging quite rapidly around multimodal embedding models. And so that is something that I encourage people who are working in the field of data to look at, because I think that is what's next for data. If you think about the next world of data lakes, I think it's actually going to be about metadata. It's going to be about the semantic understanding of our data, and understanding how that is created through vectors and how it's being searched across multiple modalities is, I think, an important area of both research and advancement. And so that's what I would encourage people to look at in the world of data. I think vector is going to be quite big, particularly at the price point that we've introduced for S3 storage for vectors. And I'm excited about it. I think we're just getting started with data and an understanding of our data, and I can't wait to see what comes next.
Interviewer (Gregory)
Amazing. And do you have any book recommendations?
Mylan (VP of Data and Analytics at AWS)
I will give you a book recommendation just in case your readers are interested. It won't be in the field of computer science. It will be about the evolution of the ecology around us and supporting the bees, the native bees and insects around us. So a tiny bit farther afield. But I'll give you a book recommendation and if your readers are interested, they can take a look at how to support the bees of the planet.
Interviewer (Gregory)
Well, Mylan, thank you very much. This was fascinating, and it was very interesting to get a peek into this massive world of scale of data, of respecting the byte and making sure that it's durable.
Mylan (VP of Data and Analytics at AWS)
It was great talking to you, and thank you both to yourself, I know you're a fan of S3, and to all of your listeners who use S3. We quite literally wouldn't be able to do what we do without the feedback and the encouragement from everybody who uses S3 today. So thank you for that.
Host
Just wow. I always suspected that there's a lot of complexity behind a system like S3, but I just did not realize the scale of it. Whenever I worked on systems with even hundreds of virtual machines, failure of one machine was a rare event and not something that we really counted on. During my conversation with Mylan, she casually mentioned that several machines had failed during our conversation, which is something that the S3 team knows and prepares for and treats like an everyday event. I personally really liked how AWS has two conflicting tenets, heavily used on the S3 team: respect what came before, and be technically fearless. For such a massive system, it would be easy to say let's move conservatively because of how many companies depend on us, but if they did so, S3 would fall behind. Finally, I'm still in awe that AWS put strong consistency in place, rolled it out to all customers, and neither increased pricing nor increased latency. At S3 scale, this is an absolutely next-level engineering achievement. In fact, it was probably one of the lesser-known engineering feats of the decade. I hope you found the episode as fascinating as I did. If you'd like to learn more about Amazon and AWS, check out the exclusive deep dive I did with AWS's Incident Management Team on how they handle outages, linked in the show notes below. In The Pragmatic Engineer, I also did other deep dives about Amazon and AWS. They are also linked in the show notes. If you enjoy this podcast, please do subscribe on your favorite podcast platform and on YouTube. A special thank you if you also leave a rating on the show. Thanks, and see you in the next one.
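Editor's sketch: the remark that failures happen during a single conversation follows from simple arithmetic once a fleet is large enough. The estimate below is a back-of-the-envelope illustration only. The drive count uses the low end of the "tens of millions of hard drives" figure quoted earlier in the episode, and the 1% annual failure rate is a purely hypothetical placeholder, not a number AWS shared.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_daily_failures(drive_count, annual_failure_rate):
    """Expected drive failures per day, assuming failures are spread evenly over the year."""
    return drive_count * annual_failure_rate / 365


def expected_failures_during(drive_count, annual_failure_rate, minutes):
    """Expected drive failures within a time window of the given length in minutes."""
    per_minute = drive_count * annual_failure_rate / MINUTES_PER_YEAR
    return per_minute * minutes


# Assumptions (hypothetical): 10 million drives, 1% annual failure rate.
drives = 10_000_000
afr = 0.01

print(f"per day: {expected_daily_failures(drives, afr):.0f}")
print(f"per 60-minute conversation: {expected_failures_during(drives, afr, 60):.1f}")
```

Under these assumed inputs the fleet would see on the order of hundreds of drive failures a day and roughly a dozen within an hour, which is why failure at this scale is treated as routine rather than exceptional.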
Host: Gergely Orosz
Guest: Mylan, VP of Data and Analytics at AWS
Release Date: January 21, 2026
This episode is a rare, behind-the-scenes exploration of Amazon Web Services Simple Storage Service (AWS S3)—one of the world's largest and most critical cloud storage platforms. Host Gergely Orosz is joined by Mylan, VP of Data and Analytics at AWS and long-time leader of S3, to discuss how S3 is engineered for reliability, durability, and innovation at a mind-boggling scale. Software engineers and tech leaders will find deep dives into distributed systems, strong and eventual consistency, formal verification, emerging use cases (like vector storage), and S3's unique engineering culture.
Put, Get, Delete, List, Copy (akin to HTTP verbs).
This episode offers an enlightening, deeply technical look into the engineering that powers AWS S3. From how the world's data is reliably stored and accessed at cosmic scale, to the advanced methods for ensuring consistency and durability, and the cultural values that drive the team, listeners get both practical and philosophical insights. For anyone building large-scale, distributed systems or interested in the cutting edge of cloud infrastructure, this conversation is invaluable.