
Think you know Amazon S3? Think again. Discover game-changing new features like S3 Metadata and S3 E
Simon Huberthy
This is episode 710 of the AWS Podcast, released on March 3rd, 2025. Hello, everyone, and welcome back to the AWS Podcast. Simon Huberthy here. Great to have you back. We're doing a deep dive into S3, and the only person I could reach out to who could help me with this was, of course, Wally Akbari, who is a Principal Solutions Architect and storage specialist at AWS. G'day, Wally. How you doing?
Wally Akbari
Hey, Simon. I'm excited to be here on your podcast as a first timer, so please take it easy on me. And look, obviously love talking about data and storage, so let's dive deep into S3 and anything else that you want to talk about.
Simon Huberthy
Absolutely. Now, we did an S3 deep dive a few years ago, in fact, one of our most popular subjects, because S3 is kind of in there in many, many things. It is in many ways a magical thing when it comes to storing data at scale. It's one of our old services, not the oldest, but it is one of the oldest and certainly probably the one that I'd say most people had their first experience of AWS with. So let's do the very high level. You know, what is S3, what was it built for and what does it serve? What purpose does it serve?
Wally Akbari
All right, awesome. Well, if we want to start at the very basic level: Amazon S3 stands for Amazon Simple Storage Service. It's a highly available and durable object store that's designed for cost, performance and scale. When data is stored in S3, it's stored in an S3 bucket as objects, and they're based on unique key-value pairs. It's a little bit different to the file world, where you have a tree and a file, so to speak. Put simply, S3 buckets are where you upload and download your data to and from. Now, I know it says Simple Storage Service, and it really is a simple AWS service to use, especially given the rich features that our customers use to store their data for a wide variety of use cases. We've got customers storing data on S3 ranging from machine learning, where they're storing their training data sets and their inference models, to data for their data lakes and analytics, all the way to durable copies of their backup and archive assets. Think about all those important backups, Simon, that customers want to store in a durable, available place that's secure. So that's really what S3 is at a glance, and what some of our customers use it for. But how do our customers actually use Amazon S3? I've talked about how it's a great place to store different data types because it's so versatile. Customers can integrate natively with S3 using the S3 APIs or the command line, or they can do it through application integrations, since some apps have S3 connectors. But I really want to touch on some of the traditional applications and architectures that our customers have. They want to integrate and use cloud technology, but their apps don't talk the native S3 API. So I'll give you an example.
Imagine your application uses the SMB or NFS file protocol for access, but you want it to access data in S3. One method is to spin up an Amazon S3 File Gateway instance. This provides a file interface that applications can access, and that file interface is a connection into S3. So basically your application has an SMB or NFS share mounted on it, the backend is S3, and you access those objects through the share. Pretty simple. And I'll finish off with one other scenario. Simon, when was the last time you used the SFTP protocol?
Simon Huberthy
It's been a while.
Wally Akbari
Yeah, right. But SFTP is a robust protocol. It's been around for decades and it's still widely used in our customers' architectures. So how does an app that uses SFTP integrate with Amazon S3, whether that's to access data or to upload data? This is where you can stand up one of our fully managed services. It's called AWS Transfer for SFTP. Basically, it's an SFTP endpoint that you spin up in a few minutes and present to your SFTP client, but that endpoint is actually backed by S3. So all your clients that use SFTP for data transfer are now actually uploading to and accessing data from S3. Pretty cool, right? It's really as simple as that to modernize your traditional architectures with Amazon S3.
Simon Huberthy
And the thing that is amazing to me about S3, because I worked in storage for a long time, is that it takes away the undifferentiated heavy lifting of storage. Storage is difficult because it's typically on some sort of recordable media, be it hard disks or SSDs or whatever comes next. The bad news is you have to refresh those pieces of hardware every year, or every three years, or every five years, which is a real pain in the whatever. You've also got to manage capacity, and you always have to pay for more capacity than you need, et cetera. And so suddenly there's this model where you can store as much as you want. Exabytes? Have fun, no problem. You can store as many objects as you want, you can store objects up to five terabytes in size each, and you only pay for what you use as you're using it. You don't have to worry about capacity management or migration. It's a huge thing that in many ways we forget. But what we're going to talk about today is some of the new things that have come out recently through re:Invent, but also some of the more advanced and detailed capabilities that maybe folks haven't really had a chance to poke at or have a go at. So what are some of the things that really popped up at re:Invent 2024, Wally, that we can share?
Wally Akbari
Sure, there were a fair few things, but I'll try to keep it top level. First, Amazon S3 increased the default quota for how many buckets you can have from 100 buckets to 10,000 buckets per AWS account. Now, I just want to point out that just because you can create 10,000 S3 buckets per account doesn't mean you need to. There's a saying, keep it simple: have the least number of buckets you need, and at the same time don't have one bucket to rule them all, right? And by the way, customers can actually request a quota increase beyond 10,000 buckets per account. That's a lot of buckets, right?
Another really cool capability that we released is called Amazon S3 Tables. Basically it provides fully managed Apache Iceberg tables in S3. We keep talking about S3 buckets, and now there's a new type of bucket: S3 Tables introduces a table bucket. It's purpose-built for storing and managing tabular data at scale using the Apache Iceberg standard, and it's really optimized for performance. So you're probably thinking, how can S3 Tables help customers with analytics applications, where they may be using Amazon Athena, Amazon Redshift or Amazon EMR? Well, it makes it easier for customers who are currently using a self-managed approach with their tabular data. Because S3 Tables is fully managed, the underlying storage has been tuned for maximum performance, and it does things like automatic compaction, snapshot management and unreferenced file removal. You talked about the undifferentiated heavy lifting: you don't have to worry about the storage and the scale anymore, and you don't have to think too hard about how to run the compaction, the snapshot management and the unreferenced file removal so that you can query your analytics data at maximum performance. That's really what it's about. It was introduced to help those customers, and we talk publicly about the fact that customers using S3 Tables could see their queries accelerate by up to three times versus self-managed Apache Iceberg tables in general purpose S3 buckets.
And there's another one. You may not think this is a big change, but I feel it is. We released S3 Metadata, which automatically captures metadata, because when you upload an object there's lots of metadata. We make that metadata queryable via a read-only table, and you can access it in near real time, within minutes. So what does that mean?
Simon Huberthy
This is very cool. Very cool.
Wally Akbari
Yeah. Think about it. If you have hundreds of millions of objects, or even billions, and I've got customers with billions and billions of objects, imagine you wanted to query that data. Even a list operation... don't do a list operation at that scale. Don't do it. Well, you could use an S3 Inventory report, which gives you a view as of up to 24 hours ago, because it's generated every 24 hours. You could look at what type of data you had and the names of the files, and use Amazon Athena to query that. But what if you wanted something closer to a real-time reflection of your data assets? That's where you would use S3 Metadata. And you guessed it, or maybe not, but this S3 Metadata is actually stored in Amazon S3 Tables. So if you're looking...
Simon Huberthy
It all ties together.
Wally Akbari
Yeah, it all ties together. If you're looking to query your metadata in near real time, an S3 Inventory report didn't do it for you, and you didn't want to run a list operation, this is a win.
Simon Huberthy
Yeah, absolutely. Now let's talk about cost because, again, in the storage world, storage always grows; it never shrinks. And so people get very wrapped up about cost per gig or cost per terabyte or what have you, as they should. One of the things that's always been battled for in the storage world is tiering of storage. I'll spend more money on storing a piece of data for frequent, quick access if it's worthwhile from a business standpoint, but if I don't need that, I don't want to pay top dollar. Managing the tiering was always a challenge. Talk to us about what the state of the art is today.
Wally Akbari
Oh yeah, cost and performance are top of mind. On cost, look, I don't remember the last time I deleted my photos. The photos on my phone just keep growing and growing, so to speak. We tend to archive that data in a cost-optimized manner, because you never know when you'll need to access it. And in reality, when we talk to our enterprise customers, there's a lot of requirement to keep data for a long period of time, with different retention policies. So how do you actually store that data in a durable and cost-effective manner and not have to worry about, like you said, tiering, and what you need to tier? We released S3 Intelligent-Tiering a few years ago. S3 Intelligent-Tiering is a storage class that automatically tiers your data between its frequent, infrequent and archive tiers based on access patterns.
Now, when you think about the object world, when you upload an object it has a created date, and we use algorithms to understand the access patterns to that data. To put it simply, you could upload 10 terabytes of data, for example, Simon, and out of that 10 terabytes, one file is very hot. You upload these 10 terabytes on January 1st in a particular year and that one file is hot all the way to December. S3 Intelligent-Tiering knows that file is hot, and if the rest of the data is, let's say, not accessed, it'll tier all the other files and data accordingly and leave that one hot file in the frequent access tier. The way Intelligent-Tiering works is it assesses your access patterns. In the first 30 days, everything's in the frequent access tier. If the data hasn't been accessed in that time, it moves into the infrequent access tier, where it sits for another 60 days. And if it hasn't been accessed again, it automatically tiers into the archive instant access tier. These are all online tiers, by the way, so millisecond access. How's that?
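As a rough illustration of the schedule described above, here is a minimal Python sketch. It is a toy model rather than how S3 is implemented: the function name `expected_tier` and the idea of keying everything off a single last-access date are assumptions made for the example, and the thresholds simply restate the 30-day and further 60-day windows mentioned in the conversation.

```python
from datetime import date

# Thresholds as described in the episode, counted in consecutive days
# without access. These constants are illustrative, not an AWS API.
INFREQUENT_AFTER_DAYS = 30
ARCHIVE_INSTANT_AFTER_DAYS = 90  # 30 days plus a further 60 days


def expected_tier(last_access: date, today: date) -> str:
    """Return the tier an object would be expected to sit in (toy model)."""
    idle_days = (today - last_access).days
    if idle_days < 0:
        raise ValueError("last_access is in the future")
    if idle_days < INFREQUENT_AFTER_DAYS:
        return "Frequent Access"
    if idle_days < ARCHIVE_INSTANT_AFTER_DAYS:
        return "Infrequent Access"
    return "Archive Instant Access"


if __name__ == "__main__":
    today = date(2025, 12, 1)
    # The one hot file, read yesterday, stays in the frequent access tier.
    print(expected_tier(date(2025, 11, 30), today))
    # Data untouched since the January upload has moved all the way down.
    print(expected_tier(date(2025, 1, 1), today))
```

The point of the sketch is that the decision is made per object from its own access history, which is why one hot file can stay in the frequent access tier while the rest of the same upload moves down.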
You don't have to worry about tiering. It automatically moves it down and saves you on cost.
Simon Huberthy
Because the access semantics for the object don't change wherever it is. The application is not aware, and doesn't need to be aware, of the tier.
Wally Akbari
That's correct. And for those who are familiar with our S3 Lifecycle policies, you could create those when you understand your object access patterns. S3 Lifecycle policies work on object creation date, so you create a rule saying, I want you to tier my data that's older than 180 days to this storage class. Fantastic. But S3 Intelligent-Tiering is designed for situations where customers don't know their access patterns. It's really hard sometimes to understand how your users interact with a vast array of data, so it takes away that guessing game for you. It automatically tiers your data down to save you on cost.
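To make the "older than 180 days" rule concrete, the sketch below builds a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name `transition_rule`, the rule ID, the bucket name in the comment and the choice of `GLACIER_IR` as the target storage class are all assumptions for illustration; the episode does not name a specific target class.

```python
import json


def transition_rule(days: int, storage_class: str, prefix: str = "") -> dict:
    """Build a one-rule lifecycle configuration (hypothetical helper).

    The rule transitions objects to `storage_class` once they are `days`
    days past their creation date. An empty prefix matches every object.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": f"transition-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


if __name__ == "__main__":
    config = transition_rule(180, "GLACIER_IR")
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured, this could be applied
    # to a (placeholder) bucket with:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="amzn-s3-demo-bucket", LifecycleConfiguration=config)
```

Note the contrast with Intelligent-Tiering: this rule fires on object age alone, so it only makes sense when the access pattern is already understood.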
Simon Huberthy
And I think, if you think about it, my recommendation for customers is to just use Intelligent-Tiering as your default option. That's where you start. Lifecycle policies really come into play if you know the nature of the data. So let's say you've got data that's for backups and you know this data has to be kept for 30 days in this tier, then it's got to be kept for 120 days in that tier and can't be deleted, and then it has to be deleted. That's when lifecycle policies come into play: that type of, as you said, really well understood data classification. But I'd argue that for probably 80% of data, no one knows what the classification is in terms of use. So just use Intelligent-Tiering and you don't have to worry about it, but you get the saving.
Wally Akbari
Yeah, and that's it. We have S3 Standard, which our customers use to start their data journey on. But have a look at S3 Intelligent-Tiering. If you get automatic tiering and the access patterns are worked out for you, it sounds like a win.
Simon Huberthy
Yeah, exactly. It just happens. Now, we touched on data management and observability of data. Let's go a bit deeper, because there's a lot in there once you start picking away: what do I know about an object and what can I do with it? There's a lot, and there are a few different ways to approach it. So maybe give us an "if you want to do this, this is how you do it" type of view of how to use these different capabilities.
Wally Akbari
Yeah, look, with Amazon S3 we're talking about different scales of data. Some customers have terabytes, some have petabytes of data and billions of objects. It's one thing to have data, but observability and data management are key. The way I talk to our customers is: have a way to look at it at the macro level, like the organization-wide or top level, and also have a way to look at it at the micro level. So let's look at the micro level really quickly. If you wanted to understand your data at the prefix, name or tag level, that's where you'd use S3 Inventory reports. I can create an S3 Inventory report for a bucket or a specific prefix and check the data there when I need to inspect it, really at the micro level. But at the macro level, what if you had hundreds of buckets? How frequently are some buckets or some prefixes accessed? That also falls into security. Your security team may want to understand, for example, why is this bucket being accessed all of a sudden with hundreds of thousands of requests? We don't expect this. This is normally a dormant bucket, dormant data, so to speak. This is why we released S3 Storage Lens. It gives you observability at the macro level, starting at the org level. You can get a view of all your buckets, how much data and how many objects you're storing, at a glance, which is amazing by the way, and you can drill all the way down to the account level, the bucket level and the prefix level. And on top of giving you that visibility, it actually gives you recommendations, some really intelligent recommendations.
Firstly, when you use S3 Storage Lens, it gives you outliers, which are calculated using statistical analysis of the last 30-day trend. For example: we saw 5 million requests to this bucket or prefix in the last three days, so you may want to check this out. Secondly, it gives you cost efficiency recommendations. Thirdly, it gives you data protection recommendations around best practices, like encryption and replicating your storage. So that's really the macro level. One of my favorite features of S3 Storage Lens, apart from the fact that there's a free dashboard configured for customers, and they can optionally pay for the advanced metrics, which give you prefix-level capability and longer historical reports, is the top N dashboard. For those who get to play with Storage Lens, it gives you a glance at the top N, and you can pick whatever that N metric is. It could be top requests to a bucket, or biggest bucket in terms of size, and things like that. It is really, really cool, that top N dashboard. Now, we talked about macro and micro level observability, but then there's also data management.
Customers sometimes want to create copies of data. How do I do that without having to write scripts and code? That's where we have S3 Replication. You can use S3 Replication to replicate data that's newly created in an S3 bucket to a different bucket, in a different account, in a different region, with a different storage class. So many options there. But what we also provide now is that if you have an S3 bucket and you turn on S3 Replication, it'll prompt you saying, hey, Simon, you've got existing data in your S3 bucket. Do you want to replicate the existing data as well, on top of the newly created data? You can choose yes or no. So that's really an awesome capability to manage your data and replicate it for different reasons.
Simon Huberthy
If you're activating it much later in the life cycle of the bucket, because your requirement has changed, in the past you had to do some fancy coding to manage it yourself, and now you don't have to.
Wally Akbari
Oh, definitely, definitely.
Simon Huberthy
And that replication also you can sort of control or visualize the lag between one location and the other as well. So you know that data is being replicated within a good timeframe too.
Wally Akbari
Yeah. And if you had really important application data and you wanted to have it replicated ASAP, we have a capability that you can enable called S3 Replication Time Control, or RTC. That means the majority of your data will be replicated in under 15 minutes. That's pretty awesome when you think about cross-region. It's one thing to replicate within the same region, but this applies cross-region too, so that gives customers added flexibility. They can say: this is really high-priority data in this bucket, and for this rule I want this data to be replicated ASAP.
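For listeners who want to see what "enable RTC for this rule" might look like in practice, here is a hedged sketch that builds a replication configuration in the shape accepted by boto3's `put_bucket_replication`. The helper name `replication_config`, the rule ID and the example ARNs are placeholders invented for illustration; the 15-minute values correspond to the Replication Time Control window mentioned above.

```python
def replication_config(role_arn: str, dest_bucket_arn: str, prefix: str = "") -> dict:
    """Build a single-rule replication configuration with RTC enabled.

    Hypothetical helper: all names and ARNs passed in are placeholders.
    Versioning must already be enabled on both buckets for replication.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-with-rtc",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    # Replication Time Control: 15-minute replication window.
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    # Replication metrics are configured alongside RTC.
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    }


if __name__ == "__main__":
    config = replication_config(
        "arn:aws:iam::111122223333:role/example-replication-role",
        "arn:aws:s3:::amzn-s3-demo-destination-bucket",
        prefix="critical/",
    )
    print(config["Rules"][0]["Destination"]["ReplicationTime"])
    # With boto3 and credentials available, this could be applied with:
    #   boto3.client("s3").put_bucket_replication(
    #       Bucket="amzn-s3-demo-source-bucket", ReplicationConfiguration=config)
```

Scoping the rule with a prefix, as in the usage example, is one way to apply RTC only to the high-priority data rather than to the whole bucket.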
Simon Huberthy
Let's dive right down to the object level. That's the smallest component that we have within the S3 construct. Talk to us about versioning and tagging, because they're two very important concepts. I think it's easy to forget they exist, but they're actually really handy.
Wally Akbari
When you turn S3 Replication on, it'll ask you to turn on versioning. That's something to remember. But look, versioning gives our customers peace of mind, and they use it for different reasons. Apart from keeping multiple versions, which makes sense, since it's called object versioning, they use it almost as a failsafe mechanism. It's not really a backup, but when you delete the current version of an object, it doesn't actually delete it. It's hidden from view until you go and permanently delete it. So if you've got an application that generates lots of data and it's always writing to the same file name, versioning gives you the peace of mind that you can go back to any particular version of the file, depending on how many versions you want to keep.
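The delete behaviour described above can be sketched with a small in-memory model. This is a toy written for illustration, not S3's implementation: the class `ToyVersionedBucket`, its method names and the integer version IDs are all assumptions (real S3 version IDs are opaque strings). It shows a simple delete adding a delete marker that hides the object while older versions remain retrievable.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import count


@dataclass
class Version:
    version_id: int
    body: bytes | None  # None means this entry is a delete marker


class ToyVersionedBucket:
    """In-memory sketch of a versioning-enabled bucket (illustrative only)."""

    def __init__(self) -> None:
        self._versions: dict[str, list[Version]] = {}
        self._ids = count(1)

    def put(self, key: str, body: bytes) -> int:
        """Writing to an existing key adds a new version; nothing is overwritten."""
        version = Version(next(self._ids), body)
        self._versions.setdefault(key, []).append(version)
        return version.version_id

    def delete(self, key: str) -> int:
        """Simple delete: add a delete marker and keep every older version."""
        marker = Version(next(self._ids), None)
        self._versions.setdefault(key, []).append(marker)
        return marker.version_id

    def get(self, key: str, version_id: int | None = None) -> bytes | None:
        versions = self._versions.get(key, [])
        if version_id is None:
            # The latest entry decides what a plain GET sees.
            return versions[-1].body if versions else None
        for version in versions:
            if version.version_id == version_id:
                return version.body
        return None

    def delete_version(self, key: str, version_id: int) -> None:
        """Permanently remove one specific version or delete marker."""
        self._versions[key] = [
            v for v in self._versions.get(key, []) if v.version_id != version_id
        ]


if __name__ == "__main__":
    bucket = ToyVersionedBucket()
    v1 = bucket.put("report.csv", b"first")
    v2 = bucket.put("report.csv", b"second")
    marker = bucket.delete("report.csv")
    print(bucket.get("report.csv"))      # hidden behind the delete marker
    print(bucket.get("report.csv", v1))  # older version is still there
    bucket.delete_version("report.csv", marker)
    print(bucket.get("report.csv"))      # removing the marker restores v2
```

Removing the delete marker is the rollback path: the most recent real version becomes visible again without any data having been copied.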
Simon Huberthy
But I always turn it on because I know I'm going to make a mistake at some point.
Wally Akbari
Yeah, and it's not a backup, but it gives you that peace of mind.
Simon Huberthy
But really important rollback capability.
Wally Akbari
Yeah, exactly, like a rollback capability. But I tell our customers, with versioning, just make sure you have the guardrails. Don't let it keep an unlimited number of versions. That's one thing. And the beauty of S3 Storage Lens is that it will tell you how many noncurrent versions you have in the data. So if someone's accidentally put the wrong object versioning policy on, you can use Storage Lens to point you in that direction, along with other things for cost savings. Now, tags. Tags are great. You don't want to search for data in S3 by object name or prefix, right? So tags are a great way to do cost attribution and observability of data. But the one thing to remember is to create a framework for what type of tags you want to use and then apply it. Having too few tags may not give you the data that you need later on, but you can always add tags, and you can always remove tags as well. But you also don't want to have too many tags, right? You don't want to have 20 tags for every object. It doesn't make sense.
Simon Huberthy
No, no, it's just the right amount. And again, I think it'll be interesting to see how the use of tags evolves now that we have the metadata query capability too. So again, it's our classic answer: it depends. It all depends on what you're trying to do.
Wally Akbari
And that's the beauty of Amazon S3. Our customers have unique requirements, even when it comes to observability. Some may use Inventory reports, some may use S3 Storage Lens, some may use S3 Metadata and capture that data and build their own capability. Some may want to leverage tags in a particular way, and it really helps them customize to meet their requirements.
Simon Huberthy
Absolutely. Now we talked a bit about replication, which is more sort of a continuous replication. What about if I just need to move some data around on a one off thing or maybe just periodically?
Wally Akbari
Well, how good are your scripting skills?
Simon Huberthy
It depends if I'm using Amazon Q Developer or not.
Wally Akbari
Well, look, the good news for you, Simon, and for some of our other customers who aren't scripting gurus, and neither am I, is that we've got an S3 feature called S3 Batch Operations. It basically does what it says: it performs batch operations, and our customers use it at scale. To keep it simple, you give Batch Operations an input file list, like an Inventory report that you've generated or your own custom one, and say, these are the objects I want you to perform this next action on. It could be a copy, or it could be changing the tags on them. There are a few operations that you can apply, and it makes it simple. You give it an input file, you tell it what action to perform and where to write the output to, maybe a different bucket, and away you go. So if we go back to that replication question: I've got an existing S3 bucket. I turn on S3 Replication for new objects. I've got a lot of data there, but I just want to copy some of the existing data, not everything. I don't want S3 Replication to replicate everything. So you could give Batch Operations that list, or the prefix list, and say, hey, just copy this to this other location. Now, another capability, because there are a lot of integrations by other services into Amazon S3, is our AWS DataSync service, which really is an at-scale data movement service. You can create a DataSync task to move data from S3 to, for example, EFS (Elastic File System), or from EFS to S3, or even between S3 buckets. So it depends again on your use case, but there are different ways to move and create copies of data.
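As an illustration of the "your own custom" input list for Batch Operations, the sketch below writes a CSV manifest with one `bucket,key` row per object. The helper name `build_manifest` and the example bucket and keys are hypothetical. It assumes the two-column CSV layout with URL-encoded object keys, and it assumes that leaving `/` unescaped is acceptable; check both against the current Batch Operations documentation before relying on them.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(bucket: str, keys: list[str]) -> str:
    """Return CSV manifest text with one 'bucket,key' row per object.

    Keys are URL-encoded so that spaces and commas cannot break the CSV.
    Keeping '/' unescaped is an assumption made for readability.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, quote(key, safe="/")])
    return buffer.getvalue()


if __name__ == "__main__":
    manifest = build_manifest(
        "amzn-s3-demo-bucket",
        ["logs/2025/app log.txt", "exports/a,b.csv"],
    )
    print(manifest, end="")
```

The resulting text would be uploaded to S3 and referenced when creating the job, together with the operation to perform (for example a copy) and the location for the completion report.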
Simon Huberthy
Yeah, absolutely. Now, one thing that's interesting about S3, I think, is that often we don't have to think about performance. It performs pretty well and just does its thing really happily, and you sort of set and forget, which is great. But there are times, as you mentioned, where customers say, no, I need the absolute maximum performance possible. What should they be reaching for?
Wally Akbari
Well, they could reach for our new S3 Express One Zone storage class. It's high-performance, single-zone storage. Now, those who are familiar with Amazon S3 will say, Wally, did you just say single zone? Yes, I did. Traditionally, storage classes like S3 Standard and S3 Glacier are regional; they're multi-AZ. Customers have said, hey, we want consistent single-digit millisecond performance from an S3 storage class for our machine learning workloads, for analytics workloads and so forth. S3 Express One Zone gives our customers that capability. It's single-zone storage. When our customers are deploying their high-performance compute stack, for example, they will deploy in a particular Availability Zone. They can then select the S3 Express One Zone storage class to also be in that same Availability Zone. So we're talking about minimizing latency between the storage layer and the access point to that storage, if that makes sense, Simon.
Simon Huberthy
And we're trading durability for speed in this case. Cause you know, typically an S3 bucket is across all the Availability zones in a region. There's multiple copies of the data. There's lots of overhead that takes place to do that. And what we're doing here is kind of shrinking it down, saying, well this is not for data that you're trying to keep for the next hundred years. This is for data you're processing very quickly.
Wally Akbari
On durability, it's still designed for 11 nines. S3 Standard is designed for 11 nines across a region and its Availability Zones, let's say three AZs. But you can still have storage designed for 11 nines of durability within a single Availability Zone. So how do you get the lowest latency? It's by localized access. So firstly, it's a zonal bucket, and secondly, it's a special type of bucket called a directory bucket. So we now have general purpose buckets, which is what a lot of our customers are used to, we have table buckets for Amazon S3 Tables, and we have directory buckets for S3 Express One Zone.
Simon Huberthy
Now let's talk a bit more about performance. So I mentioned, you know, as a, as an S3 user, you don't kind of worry about performance per se, it just happens. But let's look under the covers a little bit. How is S3 actually managing performance? You know, if I'm accessing particular data heavily or how does it manage the access requirements?
Wally Akbari
Well, that's the beauty. We do all that work for you, Simon, so you don't have to worry about it. But I'll go back to finish off something I just remembered about S3 Express One Zone. Along with the consistent single-digit millisecond latency, it's also designed for hundreds of thousands of access requests to the directory bucket. Now, if we go to our general purpose buckets and the performance that you're talking about, they're designed for 3,500 PUTs and 5,500 GETs per second, per prefix, per bucket. That's a long one there. And in the past, many, many years ago, probably in the previous podcast, we always talked to our customers about partitioning your bucket.
Simon Huberthy
And naming the objects and all that sort of stuff.
Wally Akbari
Having unique prefixes, right. But guess what? We've had this capability for a while: it's auto-partitioning. Amazon S3 assesses which partitions or prefixes are hot, using access patterns, and automatically does its magic on the back end. Oh, this data is hot. It's been hot for a while. You know what, it needs more performance. And it makes the adjustments accordingly on the back end.
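The per-prefix figures quoted for general purpose buckets lend themselves to some simple arithmetic. The sketch below is only a back-of-the-envelope calculator: the function names are invented for the example, it treats the quoted numbers as a flat rate per prefix, and it ignores the fact that S3 scales partitions gradually in response to sustained load.

```python
import math

# Request rates per second, per prefix, as quoted in the conversation.
PUTS_PER_PREFIX = 3_500
GETS_PER_PREFIX = 5_500


def aggregate_rates(prefix_count: int) -> dict[str, int]:
    """Rough aggregate request rates if load is spread evenly over prefixes."""
    if prefix_count < 1:
        raise ValueError("prefix_count must be at least 1")
    return {
        "puts_per_second": prefix_count * PUTS_PER_PREFIX,
        "gets_per_second": prefix_count * GETS_PER_PREFIX,
    }


def prefixes_for_gets(target_gets_per_second: int) -> int:
    """Smallest number of evenly loaded prefixes that covers a GET target."""
    if target_gets_per_second < 0:
        raise ValueError("target must be non-negative")
    return max(1, math.ceil(target_gets_per_second / GETS_PER_PREFIX))


if __name__ == "__main__":
    print(aggregate_rates(10))
    print(prefixes_for_gets(55_000))
```

This is the calculation people used to do by hand when designing key names; with automatic partitioning, most workloads no longer need it.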
So you don't have to.
Simon Huberthy
You don't have to do the fancy object naming dance anymore.
Wally Akbari
Oh, you can if you want. It's always fun, right, having those unique names.
Simon Huberthy
We're not stopping you.
Wally Akbari
But the vast majority of our customers don't have to go through that. Like you said, the undifferentiated heavy lifting is taken away from them. They just go, okay, I need to create X buckets, I need them for this purpose, I need this framework around the bucket. And they're not really worried about performance, cost or scale.
Simon Huberthy
Let's talk about what is job zero, our first priority, which is security. There is no excuse for an S3 bucket to be public these days, because even if you want the data in your S3 bucket publicly accessible for some reason, web serving, et cetera, it should be behind a CloudFront distribution anyway. But let's go back to the start. Let's talk about Block Public Access and how that works, and also some of the other things related to security for S3 that people may not be aware of.
Wally Akbari
Yeah, and look, Simon, security is job zero for us. Spot on. What we did a few years ago is we released S3 Block Public Access, which is enabled on all new S3 buckets by default. So if you go to create an S3 bucket right now, that setting will be checked. You actually have to manually go in and uncheck it. So that's a failsafe right there, a guardrail. Secondly, from a security point of view, S3 also encrypts all new objects by default using server-side encryption, or you can pick the encryption type that you want. So we stop public access directly to the bucket by default and we encrypt all new objects. But there's a lot more to securing your data. You would need to have AWS IAM (Identity and Access Management) policies created and also leverage S3 bucket policies. So we're talking about securing your bucket at the access layer and at the data layer as well. There are a few elements there. And it's not just, okay, you've stopped public access and you've got your IAM policies and your bucket policies. Fantastic. But what about malicious intent? What if it's something from within the organization, like people that have access? Well, we have a capability called S3 Object Lock, which you can enable in compliance or governance mode. What that gives you is data immutability for your S3 objects. You could have a bucket policy that prevents all deletes, but maybe an admin's got access. You never know. So you could enable Object Lock for that really critical data, and that further enhances your security posture. So you've looked at the access angle, you've looked at the permissions angle, and you've looked at the data angle to ensure your security posture is where it needs to be.
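The three layers mentioned here (Block Public Access, a bucket policy that denies deletes, and Object Lock) can each be expressed as a small configuration document. The sketch below builds them as plain Python dictionaries in the shapes accepted by boto3's `put_public_access_block`, `put_bucket_policy` and `put_object_lock_configuration`. The helper names, the statement ID, the bucket name and the 30-day retention period are assumptions chosen for illustration, and a real deny-all-deletes policy would normally carve out exceptions that are omitted here.

```python
import json

# All four Block Public Access settings switched on.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def deny_delete_policy(bucket: str) -> dict:
    """Bucket policy that denies object deletes for every principal (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyObjectDeletes",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def default_retention(mode: str, days: int) -> dict:
    """Object Lock configuration with a default retention period."""
    if mode not in {"GOVERNANCE", "COMPLIANCE"}:
        raise ValueError("mode must be GOVERNANCE or COMPLIANCE")
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


if __name__ == "__main__":
    bucket = "amzn-s3-demo-bucket"  # placeholder name
    print(json.dumps(deny_delete_policy(bucket), indent=2))
    print(default_retention("COMPLIANCE", 30))
    # With boto3 and credentials available, these could be applied with:
    #   s3 = boto3.client("s3")
    #   s3.put_public_access_block(
    #       Bucket=bucket, PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK)
    #   s3.put_bucket_policy(
    #       Bucket=bucket, Policy=json.dumps(deny_delete_policy(bucket)))
    #   s3.put_object_lock_configuration(
    #       Bucket=bucket,
    #       ObjectLockConfiguration=default_retention("COMPLIANCE", 30))
```

The difference between the last two layers is the one drawn in the conversation: a policy can be edited by someone with enough permissions, whereas Object Lock in compliance mode is intended to hold even against an administrator.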
Simon Huberthy
Yeah, absolutely. And I think the other thing is that the ability to also track what's gone on at the API level through CloudTrail means that you can see what's going on internally, what those usage patterns are as well.
Wally Akbari
Yeah, definitely. And look, you can go through CloudTrail, you can go through server access logs as well, and they give you that logging capability. So you know, you can go and triage that data. But again, I want to go back to S3 Storage Lens. Well, how do you know, right, how many of your buckets are encrypted or your objects are encrypted, and what's your security posture then? How much of your data is replicated and should it be replicated? And you know, which buckets all of a sudden have hundreds of thousands of requests coming in. You know, you have the different mechanisms, from the top level, the macro level, down, to understand, oh, our security posture should be better for this bucket. And then at the CloudTrail level you're inspecting, hey, what's going on at the very granular level. So you've got both ends of the spectrum, Simon, which is really awesome to have.
Simon Huberthy
Exactly. One other thing I want to talk about in terms of accessing buckets is this service, as you mentioned, has been around for a very long time and the team have iterated on it continuously for customers. And so if at the start you were used to, you know, ACLs and bucket policies and stuff like that, that's not the way to do it anymore. So S3 Access Points really give you a far better, cleaner and more scalable approach to accessing your data. So let's touch on that briefly because I think, again, I'm guilty of this. Some of the longer term, I was going to say older users, longer term users, we're sort of in the habit of doing it the original way, but this is a much better way to do things.
Wally Akbari
Yeah. So you know, we've got bucket policies which are, you know, obviously at the bucket level, they're very top level.
C
Right.
Wally Akbari
And S3 Access Points are a way to give, you know, more fine-grained access to that data for different applications and teams. You know, for example, you've got a bucket, it's got different data types, and you want to have maybe your machine learning team and your analytics team access the bucket, but at different folders, if that makes sense. So it gives you that more granular access to the data, whereas the bucket policy gives you a more top-level approach to managing access to the data.
Simon Huberthy
It's really useful, I think also if you have multiple defined customers or customer groups, internal or external, that need to access the data, it gives you a much more scalable way to manage that.
Wally Akbari
And look, the name kind of says itself. It's an access point.
C
Right?
Wally Akbari
You know, for those who are familiar with Amazon Elastic File System (EFS) and its access points, it's kind of similar to that, right? It's the same location, but you get a different access point, almost. And obviously the policy attached to the access point means you have a particular access. So we could be accessing the same bucket, Simon, right, but I'll use one access point, you use another, and we get different views of the data.
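The "same bucket, different views" idea can be sketched as two access point policies, each scoped to its own prefix. This is only an illustration: the account ID, region, role names, access point names, and prefixes below are all made up, and the helper `access_point_policy` is hypothetical.

```python
import json


def access_point_policy(region, account_id, access_point, principal_arn, prefix):
    """Build a read-only policy limited to one prefix behind one access point."""
    resource = (
        f"arn:aws:s3:{region}:{account_id}:accesspoint/"
        f"{access_point}/object/{prefix}/*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                "Resource": resource,
            }
        ],
    }


# Two teams, one bucket, two access points (all identifiers are placeholders).
ml_policy = access_point_policy(
    "us-east-1", "111122223333", "ml-team",
    "arn:aws:iam::111122223333:role/MlTeamRole", "training",
)
analytics_policy = access_point_policy(
    "us-east-1", "111122223333", "analytics-team",
    "arn:aws:iam::111122223333:role/AnalyticsRole", "reports",
)
print(json.dumps(ml_policy, indent=2))
```

Because each team has its own policy document, removing one team's access means deleting or changing one access point rather than editing a shared bucket policy.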
Simon Huberthy
Exactly. And it also means if you need to rotate credentials or turn off, let's say I can't have access to that data anymore, but you still do, then my access point gets disabled, but yours stays maintained. So there's no sort of, you know, big disruption versus if you have a single policy to control everyone, then it's like, oh my goodness, I've got to redo it for everyone.
Wally Akbari
Policies can get very long and very complex if you try doing that.
C
Right.
Wally Akbari
And then at the same time, you don't always want to be mucking around with bucket policies for a single user.
C
Right?
Wally Akbari
Because then if someone accidentally makes a mistake, it could have a much broader radius impact.
Simon Huberthy
So customers have been asking for the ability to use S3 as a file system, and there's been many ups and downs around that. And obviously the semantics aren't the same and there's costs involved in using gets and puts in a file system structure. But we do have something now which is specifically designed for customers to do that, called Mountpoint for Amazon S3.
Wally Akbari
Yeah, and Mountpoint for Amazon S3 has been a long ask from our customers. It's an awesome capability. So to put it simply, if you wanted a file view of what's stored in the Amazon S3 bucket from your local Linux client, then you would install the Mountpoint package on your Linux client and then you would just mount it like a normal, let's say, NFS mount point, similar to that.
C
Right.
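Once a bucket is mounted (Mountpoint's command-line tool is `mount-s3`, given a bucket name and a local directory), an application reads it with ordinary file APIs. The sketch below assumes nothing about S3 itself: it uses a temporary directory as a stand-in for a real mount directory such as a hypothetical `/mnt/s3-data`, so it runs anywhere. The function name `list_objects_as_files` is made up for this example.

```python
import tempfile
from pathlib import Path


def list_objects_as_files(mount_dir):
    """Return the relative paths of all files under mount_dir, sorted."""
    root = Path(mount_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Stand-in for a mounted bucket: a temporary directory with one "object".
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "training").mkdir()
    (Path(demo_dir) / "training" / "part-0001.csv").write_text("a,b\n1,2\n")
    listing = list_objects_as_files(demo_dir)

print(listing)
```

The point of the sketch is that the application code contains no S3 API calls at all; whether that is appropriate depends on the workload, since the mounted view does not make S3 behave like a full file system.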
Wally Akbari
So you could go into the directory, do an ls, and you see all your S3 data as files. Now, really important, right? This doesn't convert Amazon S3 as an object store into a file system. Okay. They're two different things. You shouldn't treat object stores as file systems because they have different characteristics. So for our customers who use S3 and want to read data and write new data into S3 using a file protocol, this is a win. To give an example, you've got a few servers, Amazon EC2 instances, and you want to give them read access to the S3 data using a file interface because maybe that's what the application needs. It doesn't talk the S3 API. Then this is a great way. All these applications can use Mountpoint, access the data through a local folder, and they have their own little cache, a Mountpoint cache, as well. So there's some tuning parameters there. So that's really awesome.
But you know, if we look at the other end of the spectrum, again, your applications need data in Amazon S3, and it could be for generative AI inference, it could be for machine learning and training data sets, it could be for high performance compute, right? And you've got lots of Amazon EC2 instances or Amazon EKS containers and pods. Well, you know, we have a service called Amazon FSx for Lustre and it has native integrations with Amazon S3. So FSx for Lustre is pretty self-explanatory. It's a high-performance parallel file system, right? Designed for performance, right. We're talking about tens of gigabytes, hundreds of gigabytes a second of performance, big performance. But you know, that's a file-based protocol, Simon. If you've got machine learning assets in the S3 bucket, well, how do I get my app that uses Lustre to access this S3 data? I don't want to write scripts.
So when you spin up an FSx for Lustre file system, you can tick a box and say, I want you to import data from this S3 bucket and also export data back to this S3 bucket automatically. That's for new data, changed data, deleted data. When you spin up the FSx for Lustre file system instance, you see your petabytes, for example, of S3 data through the view of the Lustre file system which is mounted on your compute host. Now if you have a petabyte of data in S3, you don't need a petabyte of FSx for Lustre. It's a high-performance cache. It could be a few terabytes in size. And you go in there and basically any file that you touch, it'll pull it in from S3 the first time into the file system. Subsequent access is sub-millisecond, super fast. But the beauty is, what if you then create new files on that file system? You've done some work on the raw data, you've got export enabled, you write it back to the file system, and it automatically writes it to Amazon S3 for you, exports it very quickly.
Simon Huberthy
So it just happens in the background.
Wally Akbari
Yeah. And now you tie in S3 replication too, Simon. What if you need to share data assets between different teams in different regions?
C
Right.
Wally Akbari
So then you could have S3 replication created, and when that data hits the S3 bucket and you've got a rule to replicate new objects, it's replicated to a different bucket. Same region, different region, for example. It could be for backup, for DR, for data sharing. So you can see how S3 becomes a centerpiece and apps integrate with it in different ways.
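A rule to "replicate new objects" might look roughly like the configuration below. This is a sketch under assumptions: the role ARN, bucket ARN, and rule ID are placeholders, and the helper `replication_config` is hypothetical. Replication also requires versioning to be enabled on both the source and destination buckets.

```python
def replication_config(role_arn, destination_bucket_arn, prefix=""):
    """Build a one-rule replication configuration for objects under prefix."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-new-objects",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every key
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Placeholder identifiers for illustration only.
config = replication_config(
    "arn:aws:iam::111122223333:role/ReplicationRole",
    "arn:aws:s3:::example-destination-bucket",
)
print(config["Rules"][0]["Destination"])
```

With boto3, a dictionary of this shape is passed to `put_bucket_replication` as `ReplicationConfiguration`; the destination bucket can be in the same region or a different one.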
Simon Huberthy
So we've covered a lot of ground and there's even more we could cover. But we're not going to go too long. But let's touch again on monitoring. We talked a bit about CloudWatch and CloudTrail and you talked a lot about Storage Lens. But how do I capture, long term, the data or the metadata about my S3 storage?
Wally Akbari
Fantastic question. So, you know, customers have used Amazon CloudWatch for a very long time. You know, there's S3 metrics in there. You can even create custom dashboards using CloudWatch, which I love doing. Right. You can create a custom dashboard and then share it with other folk who don't even have AWS console access.
C
Right.
Wally Akbari
Which is amazing, right. So you know, it creates a username and password for them and uses our AWS authentication mechanisms. And this could be app owners, right? So they can then see their S3 metrics, for example, gets, puts and other bits. But that's for self-service monitoring. If they wanted longer term, again, we also publish Storage Lens CloudWatch metrics, if that makes sense. They're additional metrics that are published into CloudWatch which customers can leverage as part of their CloudWatch dashboards. And for historical trending, again, I lean customers towards looking at Storage Lens, and I've said that so many times, we should say bingo, Simon. And with the advanced metrics, which are paid metrics on top of the base ones, you get up to 15 months' worth of reporting, right? So you could go in there and look at the capacity, the total storage, right, or total requests for a bucket, for example, and it'll show you from X months ago till now, you know, and you get that graph which, you know, we love to see. What does this look like over time? Or how has this looked over time?
C
Right?
Wally Akbari
So you know, that's like capacity trends and performance, like access trends at the.
Simon Huberthy
Level, how is it being used? Like, what is going on?
Wally Akbari
Yeah, yeah, that's right. And yeah, we've got a lot of things. We've got storage access analyzer, and there's too much to obviously discuss in.
Simon Huberthy
The time. You can see as much as you want about your storage, basically. That's the end result. Whereas at the start, if we think about, again, let's go back 15 years, at the start you had list objects and that was it, that's what you got. And then figure it out yourself. Well, in the subsequent years the team has worked very hard to give you all kinds of visibility. So you don't need to worry about that from that perspective. One other thing I want to mention, and then I'm going to come back with one last question for you, Wally, is that, again, folks who have used things for a long time will think of S3 as being eventually consistent. And in fact some of the certification exams used to talk about, you know, what is the effect of eventual consistency. But now S3 is strongly consistent for gets, puts and lists as well as operations that change tags and ACLs or metadata. So this means you don't have to worry about that. That is my short message there. So, yeah, good to bear in mind.
Wally Akbari
That is an awesome capability. You think about strong read-after-write consistency for all, you know, applications without impacting performance or availability. So that's the extra win there, right? So you know, for new objects, deleted objects, you know, subsequent reads and lists are consistent, which is amazing. But I just want to talk about one other re:Invent release that popped up, real quick, which is that we now also support conditional writes. Now you may think, what are conditional writes? Well, you know, customers have asked, how do we check if an object exists, you know, without writing code, so we don't overwrite the object, and we don't want to turn versioning on, for example? Well, conditional writes now allow you to actually check if an object exists before you upload it. So this is awesome, right, on top of, you know, the strong read-after-write consistency. So it helps our customers actually continually optimize their applications.
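The behaviour described here can be modelled with a small in-memory stand-in. This is a toy, not the S3 API: `ToyBucket` and `PreconditionFailed` are hypothetical names, and the only thing the sketch borrows from S3 is the rule that a write carrying the condition `If-None-Match: *` is rejected (S3 answers with HTTP 412) when the key already exists.

```python
class PreconditionFailed(Exception):
    """Stands in for the HTTP 412 response returned when the condition fails."""


class ToyBucket:
    """In-memory model of a bucket that understands 'write only if absent'."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body, if_none_match=None):
        # "*" means: only succeed if no object with this key exists yet.
        if if_none_match == "*" and key in self._objects:
            raise PreconditionFailed(key)
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]


bucket = ToyBucket()
bucket.put("report.csv", b"v1", if_none_match="*")  # first write succeeds

try:
    bucket.put("report.csv", b"v2", if_none_match="*")  # second is rejected
    overwritten = True
except PreconditionFailed:
    overwritten = False

print(overwritten, bucket.get("report.csv"))
```

The design benefit is that the existence check and the write happen as one request, so two writers racing on the same key cannot both succeed, which a separate "check, then upload" sequence in application code cannot guarantee.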
Simon Huberthy
It's a nice little one. And again, if you're used to using DynamoDB, which has, you know, conditional updates, et cetera, it's a similar sort of concept, very handy, and it fundamentally reduces the amount of code you have to create. Now Wally, I've got a question without notice for you. Besides using Storage Lens, which I get is important to use, what's the one tip that, when you're talking to customers, you keep finding yourself giving them? What's the one tip that keeps popping up as a really common thing for folks to do to get the most value?
Wally Akbari
Wow, you saved the hardest question till last, Simon.
Simon Huberthy
And that's the way we roll.
Wally Akbari
That's the way. To be honest, talking to so many customers, they all have unique requirements and, you know, there's tips on, you know, cost optimization. I'll go with that, right? I say look at S3 Intelligent Tiering, right? If you've got all your data on Amazon S3 Standard right now, have a look at S3 Intelligent Tiering. Because S3 Standard doesn't give you the tiering. This does, effectively, right. And it's a two-way door, right? So you can always go back to S3 Standard. And for example, S3 Intelligent Tiering will not tier objects smaller than 128 kilobytes in size, okay. They'll stay in the Frequent Access tier. So if you're using S3 Standard and you've got lots of small files and lots of large files as well, well, the rest of your data will tier down.
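One way to act on this tip is a lifecycle rule that transitions objects into the Intelligent Tiering storage class. The sketch below shows such a rule as a plain dictionary, plus a tiny predicate for the 128 KB threshold mentioned above. The rule ID and the names `tiers_automatically` and `MIN_AUTO_TIER_BYTES` are assumptions made for this example.

```python
# Objects below this size stay in the Frequent Access tier.
MIN_AUTO_TIER_BYTES = 128 * 1024


def tiers_automatically(size_bytes: int) -> bool:
    """Return True if an object is large enough to be tiered automatically."""
    return size_bytes >= MIN_AUTO_TIER_BYTES


# Lifecycle rule moving every object in the bucket to Intelligent Tiering.
LIFECYCLE_TO_INTELLIGENT_TIERING = {
    "Rules": [
        {
            "ID": "move-to-intelligent-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix matches every key
            "Transitions": [
                {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
            ],
        }
    ]
}

print(tiers_automatically(4 * 1024), tiers_automatically(5 * 1024 * 1024))
```

With boto3, a dictionary of this shape is passed to `put_bucket_lifecycle_configuration` as `LifecycleConfiguration`. No application change is involved, since objects keep the same keys and are read through the same API.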
Simon Huberthy
So clearly Intelligent Tiering is the thing to look at because it gives you a few things. It gives you cost control, it gives you potentially performance benefits as well. And I like your reference to a two-way door. So for folks that aren't familiar, a two-way door is a decision that you can undo easily. And the beauty here is, to emphasize this, we're not talking about an application-level change. Like, the interface to S3 doesn't have to change. This is a backend cloud architect change, isn't it?
Wally Akbari
Yeah, basically it's another lifecycle policy or a copy or an S3 Batch Operations job. There's a lot of ways.
C
Right.
Wally Akbari
To move the data back.
C
Right.
Wally Akbari
And that's why we make it sound like it's a two way door.
C
Right?
Wally Akbari
So it's designed to save you on time, cost and money.
Simon Huberthy
Exactly, exactly. Wally, thanks so much for sharing your insight into all the wonderful ways you can use S3.
Wally Akbari
Thank you for having me, Simon. I hope the folk listening are as excited about data and storage and Amazon S3 as I am. And yeah, we've certainly learned a lot.
Simon Huberthy
And we do love to get your feedback at awspodcast@amazon.com. And if you're wondering, yes, the podcast files are stored on S3 and they are served via CloudFront, and the RSS feed is also on S3 and it is also served via CloudFront. So just saying, we use what we talk about. And until next time, keep on building.
Release Date: March 3, 2025
Host: Simon Huberthy
Guest: Wally Akbari, Principal Solution Architect and Storage Specialist at AWS
In Episode #710 of the AWS Podcast, host Simon Huberthy engages in an in-depth conversation with Wally Akbari, AWS's Principal Solution Architect and Storage Specialist. The discussion centers around Amazon S3 (Simple Storage Service), exploring its evolution from a straightforward storage solution to a sophisticated, scalable platform equipped with advanced features for modern data management needs.
Simon Huberthy opens the discussion by highlighting Amazon S3's pivotal role in AWS's ecosystem, noting its ubiquity across various applications and services. He remarks:
"It's one of our old services, not the oldest, but it is one of the oldest and certainly probably the one that I'd say most people had their first experience of AWS with."
(00:41)
Wally Akbari elaborates on S3's foundational aspects:
"Amazon S3 stands for Amazon Simple Storage Service. It's a highly available and durable object store that's really designed for cost, performance, and scale."
(01:18)
Simon and Wally delve into the latest advancements introduced at AWS re:Invent 2024, showcasing S3's continuous innovation.
Wally Akbari announces a significant update:
"Amazon S3 increased the default quota for how many buckets you can have from 100 buckets to 10,000 buckets by default per AWS account."
(06:10)
He emphasizes best practices in bucket management despite the increased quota.
Wally introduces Amazon S3 Tables, a new bucket type optimized for managing tabular data using the Apache Iceberg standard:
"S3 Tables is a new type of bucket called a table bucket... it's fully managed, the underlying storage has been tuned for maximized performance."
(07:13)
This feature simplifies analytics workloads by automating tasks like compaction and snapshot management, potentially accelerating query performance by up to three times compared to self-managed solutions.
To address the need for real-time data visibility, S3 Metadata was introduced:
"Amazon S3 Metadata automatically captures metadata when you upload an object... you can access that metadata in near real time in terms of minutes."
(08:42)
This enhancement allows users to query metadata swiftly without relying solely on S3 inventory reports.
A significant focus is on S3 Intelligent Tiering, which automates data tiering based on access patterns to optimize costs:
"S3 Intelligent Tiering automatically tiers your data between its frequent, infrequent, and archive tiers based on access patterns."
(11:06)
Simon advocates for using Intelligent Tiering as a default storage class to manage unpredictable access patterns efficiently.
Advanced replication features like S3 Replication Time Control (RTC) ensure rapid data replication across regions:
"S3 Replication Time Control RTC means that your data will be replicated... in under 15 minutes."
(19:14)
This is crucial for scenarios requiring swift data availability across different geographical locations.
Addressing high-performance needs, S3 Express One Zone offers single-Availability-Zone storage with single-digit millisecond latency:
"S3 Express One Zone gives our customers consistent single-digit millisecond latency... designed for high-performance workloads."
(25:28)
This option trades off multi-AZ durability for enhanced speed, suitable for transient or rapidly accessed data.
Effective data management and observability are paramount for large-scale S3 deployments.
S3 Storage Lens provides comprehensive visibility into storage usage and patterns:
"S3 Storage Lens gives you observability at the macro level... you can drill all the way down from a glance at all your buckets to the prefix level."
(16:47)
Features include outlier detection, cost efficiency recommendations, and data protection best practices, aiding in informed decision-making.
For bulk data actions, S3 Batch Operations simplifies mass modifications:
"S3 Batch Operations performs batch operations... you give it a list of objects and specify actions like copy or tag changes."
(23:03)
This tool eliminates the need for extensive scripting, streamlining large-scale data management tasks.
Integration with Amazon CloudWatch and CloudTrail enhances monitoring and auditing capabilities:
"With Amazon S3, you can leverage CloudWatch for metrics and CloudTrail for logging API activities."
(41:09)
These integrations facilitate long-term tracking of storage metrics and access patterns, ensuring operational transparency.
S3's performance is engineered to handle vast scales without user intervention.
Auto Partitioning automates data distribution to maintain optimal performance:
"Amazon S3 assesses what partitions are hot and automatically adjusts on the backend to ensure data performance."
(28:52)
This feature removes the historical need for manual bucket partitioning based on object naming conventions.
Transitioning from eventual to strong consistency enhances reliability:
"S3 is strongly consistent for gets, puts, and lists as well as operations that change tags and ACLs or metadata."
(44:05)
This guarantees immediate consistency across all operations, simplifying application development and data integrity.
Security remains a top priority for S3, with multiple layers of protection.
By default, S3 now enforces block public access policies and encrypts all new objects:
"We released a by-default block public access policy which is enabled on all S3 buckets by default."
(30:28)
This ensures that data remains secure unless explicitly configured otherwise.
S3 Object Lock provides data immutability in compliance or governance modes:
"With S3 Object Lock, you can enable data immutability... enhancing your security posture."
(31:03)
This feature prevents accidental or malicious deletions, safeguarding critical data.
S3 Access Points offer scalable and granular access management:
"S3 Access Points provide a more granular access control mechanism, resembling the access points in Amazon EFS."
(34:15)
They enable multiple access policies for different applications or teams within the same bucket, enhancing security and manageability.
S3's flexibility is further extended through integrations with other AWS services.
Mountpoint for Amazon S3 gives Linux clients a file view of the data in an S3 bucket, without turning S3 itself into a file system:
"You can install the Mountpoint package on your Linux client and mount it like a normal NFS mount, viewing your S3 data as files."
(36:34)
This facilitates applications that require file-based interfaces without altering their core logic.
Integration with Amazon FSx for Lustre supports high-performance computing needs:
"FSx for Lustre is a high-performance parallel file system that integrates natively with Amazon S3, enabling rapid data access and movement."
(39:12)
This synergy allows seamless data handling for machine learning and analytics workloads.
Towards the end of the episode, Wally shares actionable advice for maximizing S3's value.
Wally recommends:
"Look at S3 Intelligent Tiering. If you've got all your data on Amazon S3 Standard, have a look at S3 Intelligent Tiering. It effectively optimizes cost by automatically tiering your data based on access patterns."
(46:32)
This approach minimizes costs while maintaining performance without requiring deep insights into data usage.
Highlighting reversible actions:
"S3 Intelligent Tiering is a two-way door. You can easily revert to S3 Standard if needed without disrupting your applications."
(47:01)
This flexibility allows organizations to experiment and adjust storage strategies with confidence.
Simon and Wally wrap up the episode by reaffirming Amazon S3's integral role in modern data architectures. From its robust security measures and performance optimizations to its intelligent cost management and seamless integrations, S3 continues to evolve, empowering developers and IT professionals to build scalable, efficient, and secure cloud solutions.
Simon encourages listeners to explore S3's latest features and provides a nod to the podcast's infrastructure:
"You'll find that the podcast files are stored on S3 and served via CloudFront, demonstrating the practical applications of the technologies discussed today."
(47:38)
To send feedback on the episode, write to awspodcast@amazon.com.