#696: [The Frugal Architect w/Werner Vogels] WeTransfer's journey to cut costs, not corners - AWS Podcast

Summary

AWS Podcast Episode #696: "The Frugal Architect w/Werner Vogels – WeTransfer's Journey to Cut Costs, Not Corners"

Release Date: November 25, 2024

Introduction

In this insightful episode of the AWS Podcast, host Alicia teams up with Amazon’s Chief Technology Officer, Werner Vogels, to explore the concept of frugality in architectural design. Joining them is Dan Conte, the former CTO at WeTransfer, who shares his experiences and strategies in building cost-effective, sustainable, and scalable systems without compromising on quality.

Setting the Stage: Frugal Architecture

Werner Vogels introduces the episode by emphasizing the importance of learning from those who have navigated the challenges of building resilient architectures both within and outside the cloud era.

Werner Vogels [01:26]: “Frugal doesn't mean cheap. Frugal means absolute value for your money... thinking about cost as a proxy for sustainability.”

He elaborates on how cloud computing initially removed many constraints, fostering rapid innovation. However, this freedom sometimes led teams to overlook cost efficiency, a trade-off that many companies have been reconsidering in recent years.

WeTransfer’s Origin Story

Alicia prompts Dan to recount the inception of WeTransfer, highlighting its seamless solution for transferring large files—a common challenge for creative professionals.

Dan Conte [03:47]: “WeTransfer definitely pre-dates my time at the company, it started about 2009... It was designed as a utility for creatives to easily upload and distribute large files.”

Dan explains that WeTransfer originated within a creative studio in Amsterdam, aiming to simplify file sharing without relying on physical media like drives or couriers. The platform's inherent simplicity and utility quickly garnered viral growth, transitioning WeTransfer from an internal tool to a widely-used product.

Transitioning to a Technology Company

Initially driven by creatives rather than technologists, WeTransfer faced scalability issues as its user base expanded. Dan details how the company had to pivot towards building robust engineering solutions to support its growing infrastructure needs.

Dan Conte [05:23]: “As you start to scale up, you start to get more users... we began investing heavily in engineering to build a scalable web application on the cloud.”

By 2014-2015, significant investments in engineering fostered a culture focused on scalable and reliable systems, laying the groundwork for future growth.

Embracing Frugality in Architecture

The discussion delves into frugal architecture, where Dan shares how his background in resource-constrained environments influenced his approach at WeTransfer.

Dan Conte [07:08]: “My first jobs were all in embedded systems... This forced me to think creatively within constraints.”

He stresses that frugality is not merely about cutting costs but optimizing resource usage to deliver maximum value. This mindset became crucial as WeTransfer navigated scaling challenges and accrued technical debt.

Challenges Faced: Security, Reliability, and Cost Overruns

Upon Dan’s promotion to a management position in 2022, WeTransfer confronted severe security and reliability issues alongside escalating costs. Dan recounts how these problems underscored the need for a frugal approach.

Dan Conte [13:54]: “We had a major security issue, significant reliability issues, and a cost overrun occurring simultaneously... These revealed our technical debt.”

A critical incident involved a storage design whose background cleanup jobs failed silently, leaving roughly 2 billion leaked blocks in S3 and exposing the need for better observability.

Enhancing Observability and Managing Technical Debt

Dan describes the shortcomings in WeTransfer’s observability practices, which hampered their ability to detect and address issues promptly.

Dan Conte [18:59]: “We had a whole observability stack in 2017, but adoption within teams was low... We established a reliability team in 2022 to champion observability.”

Efforts included setting minimum observability standards, educating teams, and demonstrating the value through exemplary projects. This shift empowered engineers to identify and eliminate waste effectively.

Cultural Shift: Embracing a No-Blame Environment

A significant cultural transformation was necessary to foster an environment where discovering and addressing waste was encouraged rather than criticized.

Werner Vogels [37:02]: “Mistakes are okay, they happen... It’s important how you learn from your mistakes.”

Dan echoes this sentiment, highlighting how open discussions about cost inefficiencies built trust with the finance department and motivated engineers to contribute to sustainability goals.

Dan Conte [38:17]: “The finance department was excited that we cared about costs... It became empowering for engineers to actively reduce waste.”

Balancing Short-term Fixes with Long-term Solutions

Dan explains the strategy of balancing major architectural overhauls with smaller, incremental improvements to manage both immediate and future challenges.

Dan Conte [42:54]: “We aimed for one or two major investments each year... supplemented by smaller, impactful changes.”

This approach ensured that while foundational improvements were addressed, ongoing optimization opportunities were consistently seized without overwhelming the engineering teams.

Lessons Learned and Best Practices

Both Werner Vogels and Dan Conte share valuable insights on maintaining a frugal yet innovative architecture:

Iterative Improvement: Continuously refine systems based on real-world usage patterns and feedback.
Comprehensive Observability: Ensure robust monitoring and data collection across all services to detect inefficiencies early.
Empowered Teams: Foster a culture where engineers are motivated and equipped to identify and eliminate waste.
No-Blame Culture: Encourage open discussions about failures and inefficiencies to promote learning and growth.

Werner Vogels [35:25]: “Understanding the evolution of systems... it's more important to get things in the hands of your customers quickly than to pre-optimize.”

Conclusion: The Path Forward

The episode concludes with an emphasis on the importance of frugality in sustainable and scalable architecture. By balancing major projects with incremental optimizations and fostering a supportive culture, organizations can achieve cost efficiency without sacrificing innovation.

Werner Vogels [48:35]: “Great stories... making sure that you don’t waste anything.”

Alicia wraps up by encouraging listeners to apply these principles in their own organizations, leveraging cloud elasticity to maintain both growth and sustainability.

Key Takeaways

Frugality Equals Value: Focus on maximizing the return on every dollar spent, not merely cutting costs.
Scalable Solutions Require Investment: Transitioning from a creative to a technology-driven company necessitates significant engineering investments.
Observability Is Crucial: Robust monitoring systems are essential for identifying and addressing inefficiencies.
Culture Matters: A no-blame, learning-oriented culture empowers teams to continuously improve and innovate.
Balance is Key: Successfully managing both large-scale overhauls and small optimizations ensures sustainable growth and cost efficiency.

This episode provides a comprehensive look into WeTransfer's journey towards cost-effective and sustainable architecture, offering valuable lessons for developers and IT professionals aiming to build resilient systems without compromising on quality or innovation.


Transcript

[00:00]
Werner Vogels
This is episode 696 of the AWS podcast released on November 25th, 2024.
[00:09]
Alicia
G'day, everyone. Welcome to the official AWS podcast. I'm Alicia here and I'm happy to bring you a special series with Werner Vogels. Here's Werner to tell you all about it.
[00:18]
Werner Vogels
Thanks, Simon. Welcome to the Frugal Architect podcast, where we dive into the journeys of technology leaders building cost-aware, sustainable, and modern architectures. These are longer-form conversations where we explore these topics in depth, and I hope you enjoy them.
[00:35]
Alicia
And coming to us from a very glamorous remote island location that he may or may not reveal during the episode, Dan Conte, who's the former CTO at WeTransfer. G'day, Dan. How you going?
[00:46]
Dan Conte
Howdy, Simon. Thanks for having me.
[00:48]
Alicia
Thank you for joining us. Now we are talking about frugality, one of the topics near and dear to all Amazonians' hearts as one of our leadership principles. But more importantly, something really important to architects. And Werner, of course you introduced this concept at re:Invent back in 2023, and we continue to dive deep on this topic. The goal of this series is to talk with some of our most amazing customers who have done this and want to share the lessons. As we go into this conversation with Dan, what's the mental model that you want listeners to have in their minds as they're thinking about what we're going to be talking about today?
[01:26]
Werner Vogels
I think most important is that we hear the stories of people that have the scars. I mean, before cloud, and I think Dan can also talk about that, constraints bred a lot of creativity. We always had to live within constraints. Cloud was amazing in that it removed all those constraints, and suddenly moving fast became way more important, and innovating really fast and doing all the things that you couldn't do when you were constrained. But that sort of left the art of being creative within constraints off the table. And I think in the past years, since 2020, 2022, we clearly saw that a lot of our customers started to scratch their heads and say, is this really the amount of money that I should be paying for this architecture? Are we architected in the right way? And that's why we took a step back and tried to lay out a bunch of principles around what we then called the Frugal Architect. Now, frugal doesn't mean cheap. Frugal means absolute value for your money. And I think that's really starting to think also as architects that we should have the idea of cost, and with cost, I also think, as a proxy for sustainability, in mind. So that was what set this all off. And one of the documents that we highlighted on stage was WeTransfer. And, you know, being one of the oldest European companies on AWS, they had a long history and great stories to tell.
[03:10]
Alicia
Absolutely. And that's a great call out about the, I guess, longevity of WeTransfer. I was thinking myself back to, you know, when was the first time I ever used WeTransfer. It's one of those services where I think all of us in IT have been like, darn it, I've got to send a really large file to someone. I have no way to do it. What's available? And then this WeTransfer thing exists and you sort of look at it and go, wow, this is alarmingly easy for me to use. And I can just do it and it's okay. Dan, tell us the origin story of WeTransfer. For me, the origin story is I found it on the web and I used it many years ago. But there's a lot more to it than that. Tell us a bit about it. Give us the context of where WeTransfer came from.
[03:47]
Dan Conte
Yeah, for certain. So WeTransfer definitely predates my time at the company; it started about 2009. And back then, if you were a creative studio and you had large files you needed to send to another studio or to a client, the easiest thing to do was to put it on a drive and send it by courier or send it next-day FedEx. It was fast, reliable, everything got there. But certainly for a lot of the studios at the time in Amsterdam, there was an idea that maybe there should be a way to just, you know, upload files and have people download at a later date. And so this particular studio built it as sort of a utility for themselves first. Hey, can we just distribute files to our clients? Can we make it easy to upload? Can we make it easy for them to download at a later time? And the natural virality of all of these customers downloading kind of led to this growth over time very quickly. Like, people were trying it out, using WeTransfer as a utility to move files. So from that early beginning, a little bit of press happened, a little bit of explosion of virality, and suddenly it goes from a simple utility that they were using internally to kind of a full company and a product. And because of the way it was designed, where it was so simple to actually distribute a file and upload or download, so much of the canvas of the product is actually left for advertising. It made it a very attractive product for advertisers too. So free, easy to use, beautiful, high-quality ads, and just a really steady growth rate starting from 2009 all the way forward.
[05:13]
Werner Vogels
So in the early days it was also really driven by the creatives much more than by the technologists. How did it become a technology company?
[05:23]
Dan Conte
Yeah, actually it's a great call out, because it started within a creative studio. I think the first developer that worked on it was the part-time IT person, building something in Flash. It actually became more of a technology company out of necessity. So as you start to scale up, you start to get more users. Some of the choices you make initially just to get something working, the scaling choices, break down. One of the ones that was made was, hey, there's never going to be more than 4 billion files on this platform, so we'll use integer indices and that's all going to be fine. And then after a few years things start taking off. There's no database migration strategy. How do you update the schema? No one's ever dealt with that. So this kind of situation, where a creative company has built this tool that's actually growing and getting a lot of use, sort of necessitated some engineering internally to go build up tooling and systems to scale. So in 2014, 2015, the investment in engineering really started taking off, and we started to build a culture of how to build a scalable web application on the cloud.
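The "4 billion files" ceiling mentioned above is consistent with the range of a 32-bit unsigned integer key. The transcript only says "integer indices", so the column width here is an assumption, and the helper name ids_remaining is hypothetical. A minimal sketch of the arithmetic:

```python
# Assumption: the "integer indices" were 32-bit values; the transcript
# does not state the exact column type used by the original schema.
UINT32_MAX = 2**32 - 1  # 4,294,967,295: largest unsigned 32-bit value
INT32_MAX = 2**31 - 1   # 2,147,483,647: largest signed 32-bit value


def ids_remaining(next_id: int, max_id: int = UINT32_MAX) -> int:
    """Return how many more IDs can be issued before the key space runs out."""
    return max(0, max_id - next_id + 1)


if __name__ == "__main__":
    # Hypothetical counter position, used only to show the headroom check.
    print(f"{ids_remaining(4_000_000_000):,} IDs left before overflow")
```

Tracking the remaining headroom as a metric is one way a team could notice such a limit approaching before it becomes an incident.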
[06:26]
Alicia
How hard could it be is always the question. But as you say, you make certain assumptions at the start, and whilst we all want our applications to be successful or useful, we can't always predict the level of that success. And I think what's interesting is that you came into WeTransfer with a very distinct frame of reference in your own career that I'd love us to unpack a little bit. Because you came from that most constrained of environments. Werner talked about the fact that with cloud we kind of unconstrained developers, but there is a world where you have very finite amounts of memory and processor, et cetera, and that's what you started in. So maybe tell us a bit about that, because I think that helped inform the mental models that you were able to use at WeTransfer.
[07:08]
Dan Conte
Yeah, for sure. My first jobs straight out of college were all in embedded systems. And this was, I'm going to date myself here, this is '99, 2000. I'm dealing with systems where there's like, you know, 48K of RAM, no virtual memory. You know, you have single-thread, single-core processors with no adjustable clock rates. So everything is very fixed. If you go over the budget you have for memory or CPU, there's no safety net. That's it. And so that forced, and I think Werner talked about this a little bit before, it forces this mindset of how do you creatively solve within the constraints you have. There were products I shipped where I knew every byte of memory and how it was being used, because I had to use every byte to try and get performance over here and make those trade-offs between compute and space. So that was kind of the very first work that I did out of school, for four years or so. And then even after that, working on desktop operating systems and working on Windows application development, you still have constraints and trade-offs. You can allocate as much memory as you want, but then at some point you're going to get into the page pool and the system slows down. You can run a more intense workload, but if you're on a laptop, it's going to get hot and hurt battery life. So there's always these trade-offs and constraints that force you to think about how you're using the resources that you have.
[08:25]
Alicia
And that clearly changes the way you approach the problem domain. And I guess from a frugal architect perspective really that it starts with the thought process, not so much the tools, doesn't it? Like I think, I think Dan's story is a really great example of how constrained you can be but still build stuff.
[08:42]
Werner Vogels
And I also think that constraints can be, let's say, too constraining and be a negative on you. But, you know, to be able to drive creativity, we can also impose on ourselves sort of virtual constraints, just as an exercise for your mind. And I think also as engineers, one of the bigger things is that we have this tension between what the business wants and what we can deliver as technologists. And so we need to expose these kinds of topics as constraints, or whatever we want to give to the business, to make sure that they understand the decisions that they make, instead of always just accepting, oh, the business wants this new feature, let's go get it out as quickly as possible, and not thinking about the cost, especially when things start to scale. The business needs to understand that decisions that are being made, whether it's with respect to performance or with respect to reliability, actually come back as a cost. And so I think it's not just a matter of us as engineers being more constrained and enjoying building things as cheaply or as efficiently as possible. It is also making sure that we give the business the tools to actually make the decisions for us. We as engineers shouldn't make the decisions; the business should be the one making the decisions. We need to make clear what the constraints are for the business.
[10:22]
Alicia
Dan, I was going to come to you on that one: how do you explain to the business what the constraints are? I don't think in your case they were really thinking about the meaningfulness, one way or another, of the integer choice for a counter, which, for example, has a business outcome.
[10:37]
Werner Vogels
Well, let's start with this: the business actually is pretty good when we think about regulatory constraints, the kinds of things you're allowed to do and what you can't do. And many of those things need to find a way back into the domain-specific systems that you build. And so the business knows about constraints. You know, if you think about a web page on Amazon, does the best seller list on the website absolutely need to be four nines available? Because if it needs to be four nines available, it comes at a certain cost. And so those kinds of trade-offs, where reliability is probably one of the easiest ones to think about: does everything need to be close to 100% available? If you put that to the business the first time, they'll say yes, of course, everything needs to be 100% available always, under every circumstance. And then you have to tell them, and that's going to cost you this much under these conditions. And so then maybe if you decompose and create smaller building blocks and reason differently about those building blocks, then suddenly the business can actually make decisions: under these failure conditions, maybe that one can have a bit lower priority and I'm willing to spend less on it.
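To make the "nines" trade-off concrete, the downtime budget implied by an availability target can be computed directly. This is a minimal sketch assuming a 365-day year; the function name allowed_downtime_minutes is hypothetical and nothing here comes from the episode beyond the "four nines" phrasing.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    Four nines means 99.99% availability, i.e. an unavailable
    fraction of 10**-4 of the year.
    """
    unavailable_fraction = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Four nines leaves about 52.6 minutes of downtime per year, while three nines leaves about 8.8 hours. That gap is the kind of number the business can weigh against cost for a component such as a best seller list.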
[12:02]
Alicia
What about from your perspective?
[12:03]
Dan Conte
Yeah, I was going to say I think there's also this view that the ability to spend to go solve the problem has become the norm in cloud development in some ways, rather than a tool that you can exercise should you need to. So to go back to your earlier point on frugality, there are certainly situations that we reached where it was an option: do we need to go spend more to buy more capacity or go up an instance size, or can we make an investment on the engineering side to be more efficient or to reduce this digital waste, this unused capacity that we were consuming in some way? We hit this in situations where the easiest thing would be to consume more S3 storage. But as we were looking at that, we also found places where, in sort of maybe like an OPEX versus COGS discussion, we were looking and saying, hey, we could spend a little bit of time over here and clean up some of these cases where we've leaked storage over the years. That's two petabytes, four petabytes, and just kind of remove some of that waste. And so I do think that there's definitely an aspect where it's with the business, like, hey, do we need to go invest more money to get the outcomes that we want? But for me, I found that those discussions got so much easier when we were also demonstrating that we were pursuing a reduction of waste, when we were trying to be frugal about the times that we use money and how we use money. Because people knew when I went to go ask for money, it was because, hey, we actually really need it for this problem, and we've pursued these other avenues to go clean up waste and reduce costs.
[13:30]
Alicia
What did you find? You've come into WeTransfer, you've got fresh eyes and, as in all roles, you only get one chance to have fresh eyes, then you're part of it. But you've come from this embedded system background into a world of highly distributed, high capacity, using lots of stuff. What did you discover from an architectural standpoint? Tell us a little bit about it. And how did you start to apply some of that reasoning to it?
[13:54]
Dan Conte
Yeah, for sure. I think one thing that's key to note: I joined in February of 2021 and I actually joined as an IC. My intention was to come in and just build for a year. I just wanted the experience of meeting everyone, building with a team, and learning the customers and the products. That gave me a sense of just who created the systems and how, and how engineering was working. But it also put me in the team at a time where everything was scaling up, because by February 2021 usage of WeTransfer had gone up dramatically during 2020 in the pandemic. And so the company was growing, there was a lot of hiring going on, the business was growing. And so it's sort of a time of change. To your point, what I learned, and I think this is very common across any company that's gone through scale, is that the problems that really had to be solved had been worked on and solved. And then all of the other problems that hadn't become big enough problems yet were kind of left. And it's intentional. We always focus and prioritize on the things that we have to go address, and we can leave this kind of trail of other things that hopefully don't catch up to us later on. But because the company had run in such a lean way for so long, because it was always in this place where it was really like a creative and a brand place and the technology was always kind of coming in second, it had run very lean and there were certain areas where the investment just hadn't caught up with where it needed to be. And especially as adoption grew in the product and a lot of new customers were coming in and there was a lot of pressure to build new products and build new features, we really had accrued some debt that was actually causing some challenges. And so when I moved into the management position in 2022, there was a period of time over two months where it was a major security issue and significant reliability issues, where the site was going down for hours and hours at a time.
We also had a cost overrun that was going on, where we were just leaking storage, you know, on a month-over-month basis and couldn't trace it down. So a lot of the debt of like, hey, where did we not have observability, where did we not have tests, where didn't we have the reliability culture, kind of all came due at the same time. I think the other part that happened as part of that is there was always this part of the company that was focused on what is our role in society, what is our role with people, like, how do we impact the planet. The company became a B Corporation in 2020. And so there was a lot of focus on people, planet, and how to do that and still be a profitable company. And so there was always an undertone of how can we be frugal in what we're doing. But the challenge is that when you go look at an architecture and you build a system and you're thinking from this perspective of how do I do this in the most efficient way, you have to go test the theory and see how it actually works in practice. You see how customers use it, you see how your theories and your ideas and your assumptions play out, and then you have to iterate and move from there. And so as we were going through this growth phase and we were rolling out systems like how we manage storage, we started to learn which things didn't work the way we expected and which systems, where we thought we were doing the most efficient, frugal thing, turned out to be actually generating a lot more complexity and a lot more problems. And so certainly I kind of came into this in a time of transition. And I think that opened the door for us to go figure out what's the mindset we want to have at the company moving forward, and how we want to approach architecture moving forward.
[17:04]
Werner Vogels
How did you prioritize what to address first in the security, reliability, cost overrun discussion?
[17:08]
Dan Conte
Unfortunately, it was security, because the issue that we had was existential to the business at that time. So we had to take that one first, and then we had to do reliability, and then we had to do cost, which feels very icky, especially as we're having a discussion about frugality. But it was one of those where, again, the money one is one that could kind of slush for four, six weeks if we needed it to, while we were sorting out these more existential problems. As we got to why costs were growing on storage so much, it really highlighted just places where we didn't have enough data, enough observability. We didn't understand the patterns of how the storage was being consumed and the interrelationships between systems and what the failure points were. Relatively simple issues. Like, the idea would be we need to go delete a set of files after seven days, because transfers expire, and so the files you upload, after seven days they kind of self-destruct so that they're no longer available. Just the amount of data that we had to delete was far beyond what any of the systems were designed to handle. So we were running out of memory querying databases to go delete blocks and files. Those types of issues came up with just sort of inopportune timing. But yeah, it was definitely security, then reliability, and then cost at that time.
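The failure mode described above, running out of memory while querying for everything that must be deleted, is commonly avoided by streaming keys and deleting them in bounded batches. The following is a minimal sketch of that general technique, not WeTransfer's actual fix. The names in_batches, delete_expired, and delete_batch are hypothetical, and the default batch size of 1000 is chosen to match the per-request key limit of the S3 DeleteObjects API.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def in_batches(keys: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` keys without materializing all keys."""
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def delete_expired(
    keys: Iterable[str],
    delete_batch: Callable[[List[str]], None],
    batch_size: int = 1000,
) -> int:
    """Delete keys batch by batch and return the number of keys processed.

    `keys` is expected to be a lazy stream (for example a database cursor),
    and `delete_batch` is a caller-supplied function that removes one batch.
    Memory use is bounded by `batch_size`, not by the total backlog.
    """
    deleted = 0
    for batch in in_batches(keys, batch_size):
        delete_batch(batch)
        deleted += len(batch)
    return deleted


if __name__ == "__main__":
    expired = (f"block-{i}" for i in range(2500))  # lazy stream of fake keys
    total = delete_expired(expired, lambda batch: None)
    print(f"deleted {total} keys")
```

Emitting the returned count as a metric on every run would also surface the silent failures the conversation goes on to describe.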
[18:25]
Alicia
And that's an interesting one where you talked about the fact that there were practices and processes in place, but in the case of sort of the cleaning up type situation, it was kind of failing silently. And this kind of ties into that world of observability and understanding all the assumptions you make help us unpack that a little bit more. Because I think the thing that you guys have that's fascinating is just the sheer scale you're operating at and the fact that this stuff just has to work and running very lean. What that meant from an operational standpoint and then how you went to diagnose and solve that problem.
[19:00]
Dan Conte
Yeah, for sure. And so certainly as things scale up, you start to see new usage patterns, new behaviors emerging. We also had kind of a confluence of other factors. So we had implemented a new solution for how we manage storage pretty recently. As the company was hiring, certain people were seeing the company change, and so there was a fair amount of attrition. And we ended up in this place in May, June of 2022, where we had a relatively new storage solution in place. Everyone who had actually worked on building that storage service on top of S3 had left the company. So we had 100% turnover. And then we were seeing on a month-over-month basis that our costs were starting to increase at a pretty steady clip. And, you know, first month we looked at it and it's like, maybe that's some seasonal stuff, maybe the usage is changing a little bit. Second month, hey, that's a real problem. And we started to dig in. And by the time we got to the third month, we were talking real money in terms of the impact of this. What had happened is that in the original design of our new storage service, there was this intention around being as efficient as possible with our use of S3. And so as you upload files in a transfer, so you might be uploading a thousand photographs that you've taken, we recognized that some of those may then be sent in another transfer to someone else, and then another transfer to someone else. So we'd end up with three, four, or five different copies of the same file being stored. So we implemented a deduplication checking layer where we would do reference counts to files and only store them once. In addition, there was an idea that because when you upload a file, you basically don't upload it as a single stream, you would split it into these 10 megabyte chunks and try and run them in parallel, we could potentially just store these chunks directly on S3 and then deduplicate across those.
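The file-level deduplication described above, storing each distinct file once and counting references to it, can be sketched in a few lines. This is an in-memory illustration only: the class name DedupStore is hypothetical, and identifying duplicates by SHA-256 content hash is an assumption, since the conversation does not say how duplicates were detected.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Toy content-addressed store with reference counting.

    Assumption: duplicates are detected by SHA-256 of the content.
    A real service would keep blobs in object storage and counts in a database.
    """

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._refs: Dict[str, int] = {}

    def put(self, data: bytes) -> str:
        """Store `data` once and add one reference; return its content key."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = data
        self._refs[key] = self._refs.get(key, 0) + 1
        return key

    def release(self, key: str) -> None:
        """Drop one reference; delete the blob when no references remain."""
        self._refs[key] -= 1
        if self._refs[key] == 0:
            del self._refs[key]
            del self._blobs[key]

    def stored_objects(self) -> int:
        """Number of distinct blobs currently stored."""
        return len(self._blobs)


if __name__ == "__main__":
    store = DedupStore()
    first = store.put(b"photo bytes")
    second = store.put(b"photo bytes")  # same file sent in a second transfer
    print(first == second, store.stored_objects())
```

Every upload and every expiry has to update a count, so the bookkeeping grows with the number of stored units. Applying the same scheme to 10 megabyte chunks multiplies the number of counters per file, which fits the overhead the conversation goes on to describe.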
So that original assumption and intention was very much from this mindset about how we can be as efficient as possible with storage. What turned out to happen is that as we were scaling up and seeing more usage, the reference counting overhead of managing all of that became very intense. And just the volume of 10 megabyte chunks, as we also saw people sending larger and larger files, became a huge challenge. And so the systems that we built were trying to stay on top of all of these new creations of files as people upload, and then deleting them as a transfer expired after seven days, and literally getting to a place where we would run out of memory querying for all of the chunks of files that we needed to go delete. And this was just a pattern that we'd get stuck in, and then periodically system load would drop and we'd be able to delete some of them. But after three months, we ended up in a place where, it's always a little bit embarrassing, but it was something like 2 billion blocks that we had leaked in storage that we thought should be deleted, that we were still storing and paying for in S3 and effectively wasted. And, you know, imagine a new team that's just coming in, trying to understand how the system works and how the thinking was, and, you know, having to go push like a big red delete button on all of the storage.
[22:01]
Alicia
Because everyone's super calm when it comes to deleting customer data. Like, it's really not a problem, you know, totally.
[22:07]
Dan Conte
I mean, what could go wrong, Right? But there's one particular dev after four months on the job, like, this was his task and, you know, sweating, right. It takes like three days to delete everything and kind of clean up databases. But so we had situations like that. And again, it wasn't that there was the wrong idea in place, like maybe a little bit more iteration on it. We could have seen where that would or wouldn't work. But it was certainly a case where we didn't have the observability to see that these tasks were crashing in the background, running out of memory, failing to delete. We didn't have the data to tell us were these assumptions that were made in the original architecture for the storage solution. Right? Like, are we actually saving? Like we measured later on and found that we saved 7% of our storage costs due to the file deduplication and something like 0.03% of the storage cost due to the block to duplication. And it made it really clear like, hey, this is a lot of overhead for something that's not saving a lot of actual storage and probably costing us more in terms of database because we had to bump up instance sizes and.
[23:06]
Alicia
Those types of things, and just dev time and complexity and being able to understand the system. The juice wasn't worth the squeeze.
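The failure mode Dan describes, a cleanup job that queries for every expired chunk at once and runs out of memory, can be contrasted with a job that streams keys and deletes them in bounded batches. The following is only a minimal sketch of that idea, not WeTransfer's implementation: `batched`, `delete_expired`, and the in-memory `store` are hypothetical names, and the `delete_batch` callback stands in for whatever the storage backend offers. The default batch size of 1000 matches the per-request key limit of S3's multi-object delete, but any bound would do.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` keys without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def delete_expired(
    expired_keys: Iterable[str],
    delete_batch: Callable[[List[str]], None],
    batch_size: int = 1000,
) -> int:
    """Delete keys one batch at a time; return how many keys were submitted."""
    submitted = 0
    for batch in batched(expired_keys, batch_size):
        delete_batch(batch)  # hypothetical backend call, one request per batch
        submitted += len(batch)
    return submitted


if __name__ == "__main__":
    # Stand-in for a storage backend: a set of chunk keys (assumption).
    store = {f"chunk-{i}" for i in range(2500)}
    # Expired keys are produced lazily, as a paginated database query would be.
    expired = (f"chunk-{i}" for i in range(2500) if i % 2 == 0)
    count = delete_expired(expired, store.difference_update)
    print(count, len(store))
```

Memory use here is bounded by the batch size rather than by the total number of expired chunks, and counting submitted keys per run gives the job a number that can be put on a dashboard.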
[23:13]
Dan Conte
That's right. And it's one of those things where we kind of felt it both on COGS and OpEx. So I guess to bubble up from that, for me, in that moment of time it just became really clear, like, you know, yeah, there's the intention around how do we build the most efficient system we can, but then there's the maintenance: observing and learning from how that works in practice, seeing what the patterns are, and then seeing how you go iterate on that to adapt to what's coming in. I think about it in the context of S3 too. When S3 launched, it was relatively straightforward, right? But now you look at it, there's tiered storage, you have caching. Because as you look at the usage patterns built on top of this, you understand which capabilities you need to go meet those.
[23:53]
Werner Vogels
Well, I mean, the biggest example, I think, probably is when we launched. We launched with eventual consistency for a long.
[23:59]
Alicia
Time, a long time.
[24:01]
Werner Vogels
And we saw many of our customers building strongly consistent layers on top of it, just because creating a bucket and then, from another thread, trying to store in it while the bucket wasn't there yet was sort of not a model that many of our customers could really deal with. And yeah, at some moment we decided to drop that inside S3, but there was a major overhaul of all the different pieces in it, because it's not just the fact that you need to add the functionality; you need to prove to yourself that you have covered all the edge cases. And if by that time you're at 250 microservices, there's a lot of edge cases to cover.
[24:42]
Alicia
Oh yeah, absolutely, absolutely. I think one of the interesting things here too is that we talk about frugality, but the flip side of that is the speed and the rapidity that you can operate in when you're not applying the frugal lens at the wrong time. So you can deploy something, see if it has value, use, et cetera. And then as it starts to grow, you can start looking at how to do that better. And I think, as software engineers are often told, don't optimize too early. It's so tempting to do it. And I think your discussion about the block deduplication is a great example. It feels like the right thing to do, it feels like the right design decision, et cetera. But the data tells us a different story. And I think what's interesting as well here is that the cost that you were seeing not match your business growth was a proxy for some degree of waste or lack of attentiveness to what was going on there. Which is again really useful, because in the old world you didn't know. Like, someone bought you a storage array, you just happened to fill it up quicker than you thought you were going to. So you just thought, well, my capacity planning was wrong, not, maybe I'm not cleaning up what I should be cleaning up. There's also assumptions that are well placed, and I think this is also relevant in terms of sort of the evolution of our own services that we go through, where you design something, something is the case, and then the case changes. And I think one of the great things about the work that you and the team did, Dan, is you dove deep into the code, like, into the lines that make the difference. And you found stuff that clearly made sense at the time but didn't hold true anymore. Tell us a little bit about that. There's a story I think you share about these janitor-type jobs and how long they run for, et cetera, but also when they get run and what impact that had on the cost side of things.
[26:26]
Dan Conte
Yeah, for sure. I think certainly some of the things that you do that impact costs are, like, major, large architectural decisions, and some of them are subtle, smaller things. And over time the costs kind of add up and catch up with you. Early in the days of WeTransfer, there was this strong concern about losing customer data. Like, you never want to lose someone's files accidentally, and we would get customer complaints: my transfer expired just before I was going to download it. How can you get these files back for me? Those things actually happened. And certainly as people were learning, hey, when you get the email that there's a transfer there, that starts a seven-day timer, you really need to download in that window. Those types of situations happened. And so early in the days of WeTransfer there was this idea of, like, well, maybe we'll hold on to the files a little bit longer than seven days. And I had heard some rumor of this and I was like, well, that's interesting, but, you know, whatever.
[27:19]
Alicia
It's almost like an apocryphal story.
[27:21]
Dan Conte
Yeah, yeah, it's like, you know, it's one of those things you're learning, like, where did the name come from? Or, like, what it was back before. So as we're going and building more observability and just understanding how the systems work, and a new team is ramping up, one of the engineering managers working on storage started to track, like, how long does it take to actually go delete a transfer after it's been sent. And what we found is, for most transfers, the expiration is set for seven days. It should delete after seven days. And he was finding that most files lived on the platform eight and a half to nine days. It didn't make sense on the surface, like, hey, this should be gone after seven days. Where does it go? But it turns out that legend of holding onto files afterwards had made it down into a line of code to basically hold files for 36 hours after they expired in the storage layer. And this was independent of the products, buried way down deep. No one really knew about it. The team had changed. And when we did the math on it, at the scale that we were operating at, this is something like 15% of our storage costs. And this is actually pretty substantial, just because instead of something being on S3 for seven days, it was now there eight and a half, nine days, stretching out. And the other thing that had happened is that no one in any of the product teams was aware that we were doing this. No one in customer support knew that. So even if a customer had an issue, they were still being told, like, oh, I'm sorry, your files are gone, even though they were still somehow there, maybe recoverable. And so it's one of those things where every engineering team has had this, where you kind of get to the root cause of something and you kind of, like, scratch your head and look at your shoes and everyone's like, well, and then you go fix it and move on. But we really did have that moment where it was a substantial change in terms of our overall costs and the steady-state storage that we had. That really just came down to, you know, an idea or an assumption or a scenario that was, you know, started long ago and had been lost track of, no longer used.
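The arithmetic behind that retention finding can be sketched roughly. The function below is a back-of-the-envelope model, not the calculation the WeTransfer team ran: `extra_storage_share` is a hypothetical name, and the model assumes every stored byte belongs to a seven-day transfer and that storage cost is proportional to byte-days, which ignores any longer-lived storage, request charges, and storage tiers.

```python
def extra_storage_share(intended_days: float, actual_days: float) -> float:
    """Fraction of stored byte-days that comes from keeping data past its intended lifetime.

    Assumes (simplification) that cost scales linearly with how long each byte is kept.
    """
    if intended_days <= 0 or actual_days < intended_days:
        raise ValueError("expected 0 < intended_days <= actual_days")
    return (actual_days - intended_days) / actual_days


if __name__ == "__main__":
    # Observed lifetimes quoted in the episode: eight and a half to nine days.
    for actual in (8.5, 9.0):
        share = extra_storage_share(7.0, actual)
        print(f"{actual} days on disk: {share:.1%} of byte-days are past expiry")
```

Under these assumptions the share works out to roughly 18% to 22% of byte-days, the same order of magnitude as the "something like 15%" of storage costs Dan mentions; a real bill includes storage that does not follow the seven-day pattern, so the two figures need not match.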
[29:18]
Alicia
And it was well intentioned. I mean, you've got one line of code that did a whole lot of good for the business, for customers, and no doubt was the right call at the time, but it had to be revisited.
[29:31]
Dan Conte
Yeah, and I think that's really an important part. I think that as engineers, we're all wired to try and do, like, the right thing and build the best things. And so I look at all these decisions and none of it's incriminating. Basically the team was doing the best possible thing at the time to, you know, solve the customer problem, to enable the business to grow, to work through issues that were happening. And it's really about how does the landscape change around that over the years? What happens as, you know, things start to scale up? We talked about not optimizing too early, and part of that is, like, you know, maybe you make the wrong choices; you need to learn which things actually need that energy and effort and where to go focus that. But it's also just the reality that as the problem changes over time, you may find certain things phase in, certain things phase out. You need to adapt. So in this case, the understanding of WeTransfer as a product had changed. No one really hit this issue of, like, I need my files back. The value of that had deteriorated, had gone away, but no one was aware that it was still there.
[30:30]
Alicia
And do you find that, because obviously a big part of this journey for yourself and for WeTransfer was that observability piece. The classic question of, if you did it over again, or if you're building something new, how would you think about instrumentation? How would you think about, again, not over-instrumenting, but what's the quote, unquote right amount to go with so that you can detect some of these things maybe a little bit earlier, or just as part of day-to-day operation?
[30:56]
Dan Conte
Yeah. So the fascinating part with that is we had a whole observability stack in 2017, 2018. There was a whole solution in place. It's just that the adoption within the teams was relatively low. And I think it comes down to how much do you need to be watching, how much needs to be in place, what's the minimal level of sufficiency. And among the set of things that you don't solve as a business because you're so focused on whatever your highest priority is, there's some minimum level of observability that you have to have in place. Otherwise you'll lose track of just how your service operates. And I think that we had that minimum level at a platform level, but I think in individual service teams we were lacking that. And so for me, one of the big pushes that we made over the past few years was just to establish what that minimum bar is. If you have a service in production, you're building a new feature, of course we have to be able to see this type of data. We have to have these metrics in place. We need to have this level of observability. We went and we found a few scenarios where we'd go make what's the best-in-class example. We were having issues where, with super large file transfers, sometimes the reliability would drop off. And we realized that's such a critical part of our customer base that we were going to go get an A on observability for that. We were going to go all in. And the value of that is now we had a system where the minimum bar was clear for everyone, but then we had a very clear understanding of what it looks like to get an A in observability for a customer scenario or problem space or a set of services. I think those two powers, you know, the minimum and then the exceptional, and being able to bring those in from the start, I think that's important. And like I said, I think engineers are all wired to want to do the right things. 
If you have an environment where you say here's the requirement and here's what like, you know, kind of the A looks like, I think people will gravitate to what the right balance is between those for the scenarios they're working on.
[32:53]
Werner Vogels
It's interesting to see how you can do this, let's say for new services, new products that you're building. But how did you go about, let's say refactoring or let's say adding observability to the existing code base?
[33:08]
Dan Conte
Yeah, I think about this one a lot, because certainly if you're starting something brand new, starting a new company, new product, it's a completely different landscape than it was 10, 15 years ago. For us, we started with centralizing on a single tooling solution. So we had kind of a makeshift observability stack. We moved over to data. We built reliability functions. So the company didn't have an SRE team until 2022. So we established a reliability team that owned the tooling, that would be champions, that would do sit-withs with different teams, that would help them ramp up. We identified a few different teams to go demonstrate how to build high-quality SLIs and SLOs and how to build dashboards. It was a fairly substantial effort, but it really came down to a combination of a single tooling solution that everyone could go deep on and that felt like it was durable value to really understand, a center of excellence within the organization that would do sit-withs, education, training, ramping up, and then the one to two halo example teams that would get that best in class to demonstrate what was achievable and what the value of that would be. I've been in three different situations where, just kind of broadly, I had engineering teams that weren't deep on the use of data to inform how they operated. And in all three cases, the most important thing was having the tooling, the education, and then that halo effect where someone was like, wow, I could have that value in my system. Oh my goodness, I need to go make this investment. I could be seeing this type of information or understanding what failures are happening. But that was really the shift that we started in kind of mid-2022.
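For readers less familiar with the SLIs and SLOs Dan mentions, here is a minimal sketch of how the two relate. It is not how WeTransfer defined theirs: the `Slo` class, the `large-transfer-success` name, the 99% target, and the request counts are all made-up illustrations. An SLI is the measured fraction of good events, an SLO is the target for that fraction, and the error budget is how much failure the target still allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.99 means 99% of events should be good


def sli(good: int, total: int) -> float:
    """Service level indicator: fraction of events that were good."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Share of the allowed failures not yet used up (negative when over budget)."""
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical SLO and counts, for illustration only.
    slo = Slo(name="large-transfer-success", target=0.99)
    good, total = 9_950, 10_000
    print(f"SLI: {sli(good, total):.2%}")
    print(f"Error budget remaining: {error_budget_remaining(slo, good, total):.0%}")
```

In this sketch a minimum bar could be as simple as every production service declaring one such SLO, with the best-in-class example teams going further.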
[34:47]
Alicia
It's interesting, that mindset shift, and maybe, Werner, can you talk a bit more about, I guess, culture and developer culture? You know, it's easy to be unkind as developers and go, well, why are you wasting that? Or what a boneheaded decision it was to have this set this way, et cetera. And certainly younger Simon behaved that way, because younger Simon wasn't as nice as today Simon, because younger Simon hadn't learned the hard lessons that others had gone through. But, you know, there's a sense of, I guess, humility and celebration of optimization and learning, et cetera. How do people think about that? Finding and discovering and returning waste is not bad. It's actually good. Help us decode how to think about that, how folks should be thinking about that.
[35:26]
Werner Vogels
We all understand the evolution of systems, and you mentioned premature optimization. There's often the case also that you have no idea how your customers are going to use your product or your feature. And until you know that, it's very hard to build a cycle around that. You have a first version that may be extremely inefficient, for example because you're experimenting with a new UI and you're using Ruby on Rails for that, which is not necessarily the most efficient or the most scalable platform, but it is a great platform to experiment on. But after a while, when you know how your customers are going to use your service, you may start to transition to a more scalable or more efficient second approach. But, you know, not until you know how your customers are using your systems. There is no fault for anyone. And often it is actually, I think, more important to get things in the hands of your customers really quickly than to have already optimized to the minimum upfront, because often you have made the wrong decisions there. I do think also there needs to be this culture of, you know, mistakes are okay, they happen. You know, it's more important: how do you learn from your mistakes? Do mistakes become a badge of honor? Or is it going to hamper your career within the company? Now, if it's going to hamper your career within the company, you're probably going to shut up about it.
[37:03]
Alicia
Yeah, yeah.
[37:04]
Werner Vogels
But if it's a badge of honor, because there's a learning associated with it that actually everybody else can learn from, that, I think, is the way to go. Actually, one of our customers in Italy, Enel, has an internal TV station and they have this program there that is called My Biggest Failure.
[37:26]
Alicia
That's awesome.
[37:26]
Werner Vogels
And basically that's where engineers and program managers come on to talk about the things that went wrong, whether it's at the product level or at the engineering level. And everybody wants to go on that program, because everybody has a story to tell that everybody can learn from. So, you know, having a no-blame culture is crucial, I think, in any engineering organization. After all, you know, we're building new products, we're not doing the same thing over and over again. So we're highly creative people as engineers and as such, you know, sometimes, especially when you build new things, you don't know how your customers are going to use them.
[38:06]
Alicia
So true. Dan, how did you find that pervading through WeTransfer, in terms of that concept of: yes, there was waste. It's okay, we've done good here.
[38:17]
Dan Conte
Yeah, I think the way that I look at it, it's very similar in terms of thinking about no-blame culture. And I really think about how a healthy reliability function works. If you're operating at scale, you have reliability teams, you have postmortems. Everyone's trying to learn from the issues that happened and figure out how to prevent those in the future. And you get this mindset that's really not about blame or finger pointing, but it's really about how do we learn and go improve as engineers based on that. And I think for costs, initially, I kind of joke about it, but people were a little like, oh my goodness, I can't believe we wasted this much money. And certainly the company was profitable. It wasn't disaster-making or anything like that, but it was definitely something that people were conscious about. What I realized were really two things. Back in the very beginning, we talked about how adoption of the cloud is more of a business discussion, and it comes to a conversation with your finance department. It turns out my finance department was really excited that we cared about costs and that we were making investments in how to reduce and manage those. So as much as individual engineers would be like, oh my goodness, we did this thing and it cost 100k over the past three months or whatever, the fact that we found it and then we talked about it and we fixed it and we were open about that actually bought a lot of credibility and trust in those relationships. So I think that was the first part of it. The second part of it is within the engineering team, especially because frugality is not about, like, you know, setting your priorities for the year. It's about what's the culture you want to create. As these things were found and we made progress, people realized, like, hey, we're saving these costs, but we're also consuming less, and when we consume less, that reduces our, you know, environmental impact. 
And improves our sustainability. It actually became this very empowering thing where engineers would say, like, hey, I could do this thing that actually directly impacts this goal that I care about. A lot of the engineering team came to WeTransfer, as a B Corporation, specifically because they loved the role of the company in the world and the things that we did for creators and that we were conscious of the environment. And all of a sudden some of the efforts we were making made that very tractable and directly impactful for them in their day-to-day job as an engineer. We really got quickly out of the, oh my goodness, this was a mistake. I think this happened, like, in the first six months of finding these issues, and then it got into a place where people were like, hey, I think I found some waste over here where we're storing a little bit that we don't need to, or hey, I think I can make this system a little bit more efficient. And in fact, if you think about kind of the spectrum of, you know, on the one end you have kind of growth at all costs and, like, you know, don't worry about it, and on the other you have sort of deep cost optimization, like, we're not growing and we're just going to try and drive everything down. We really found ourselves just trying to strike that balance in the middle of how do we reduce waste and where do those boundaries exist so we don't slide into one camp or the other. I found that, you know, we got past that concern about how this is going to be perceived very quickly and it got into a very empowering place.
[41:23]
Werner Vogels
So actually, just another interesting story: one of our larger enterprise customers, after the whole Frugal Architect thing, actually has not bug hunt weekends but cost hunt weekends, where basically gamification is being used for: let's go through our code base and see where we're wasting money. And not from the idea that that was bad in the past. I mean, the systems all work and do their job perfectly fine. But can we look at it with a new, fresh set of eyes, where there is a set of prizes at the end for who saved the most money. It's just a realization also that we've gone through a phase where growth and innovation was more important, actually, than really keeping the purse closed. And it's not about closing the purse, it's just making sure that you don't waste anything. And I think that's sort of the realization: that we've done some of that in the past for different reasons, mostly for speed of execution, and now we're just taking a step back and taking a look at it.
[42:32]
Alicia
That's the beauty of the elasticity of the cloud. And I guess as we come to the end of our time here, Dan, when you're thinking about sort of the balance between short term fixes and long term solutions, when thinking about sort of architecture from a frugal lens, how do you balance? Do you have a percentage ratio? Is it time of the business? Like how do you like to sort of rationalise out there?
[42:54]
Dan Conte
I think that's such an interesting challenge. And in the climate that we were in, I think that we were heavily focused on building new products and features. It made the pace at which we could go undertake kind of large architectural investments a little bit slower than I would have liked, just to be candid. But we were trying to take on kind of one to two, like, major investments a year. In 2022 we moved to tiered storage and we started using Cluster Autoscaler, and those were huge lifts in terms of reducing waste and our consumption. This year we were working to deprecate image previewing services, moving over to kind of on-demand previewing through Lambda, and we were also completely removing all of kind of the last traces of block storage and reference counting that we were doing, and simplifying the storage solution. So I think that we were trying to do, like, one or two major bets, just because we didn't want to exhaust all the dependent teams and kind of go through these major re-architectures all the time. And then to pepper that, and I really love kind of that cost hunting exercise, to pepper that with opportunities we found here and there that were kind of small ones, not huge investments of time, not huge investments of energy. But hey, one person, one week can go make a difference that's meaningful in some way. With some of those small ones, someone would spend a week on something and it was enough to cover hiring another developer. It was substantial enough that we could kind of ground it in those terms. That balance, where for any given year it's kind of one big push, maybe two big pushes if we had the capacity, and then a lot of smaller ones, we found was starting to work really well for us.
[44:29]
Alicia
That's really interesting. Werner. Are you seeing that as a trend amongst customers? That ratio of big to little, how do you see it play out? Or again, is it a factor of the life cycle of the organization?
[44:42]
Werner Vogels
Well, I think this works for every other company, differently of course. But indeed, especially for a younger business or a smaller business like WeTransfer, your engineering resources are limited, and where you apply them, their energy is limited as well. Indeed, going through major changes in the background that do not have anything to do with building new features or new products. We've seen the same thing in the early days when I joined Amazon. Observability in the early days of Amazon wasn't that great either. And so we established a whole new culture around how to measure and what measurement means, also in the understanding of not only engineers, but everybody that's looking at the numbers. The 50th percentile of latency of a web page doesn't mean anything. It means 50% of the customers are getting a worse experience somewhere. Just the whole culture around it. We also did a whole year on removing all single points of failure. But that was next to the teams also working on, you know, building new features and building new services and things like that. But we didn't do all of those things at the same time. We did also do a whole year on efficiency and that went nowhere. Mostly because Amazon engineers are very much, let's say, customer focused. Yeah, they love building things that are customer facing. And efficiency, definitely in those days, was much more sort of a bottom line kind of thing. I mean, after all, this was retail margins. Any impact on the bottom line that we had with capacity was sort of immediately hurting the business. But nobody could get really enthusiastic about working on bottom line kind of things. And I think it wasn't until later on that we actually got to a point where we're much better at that and much more thinking about decomposition, smaller building blocks, tiers, what costs what, and things like that. 
But it wasn't until we got the visibility into that, and the architecture to go with it, that we could do these kinds of things. But yeah, we would have a search service. We had 32 different search services, something like that, you know, where some of them were twice the cost of another service.
[47:05]
Dan Conte
Yeah, yeah.
[47:06]
Werner Vogels
But nobody really knowing exactly why. Oh yeah, that was actually still a 32-bit box laying around somewhere. And then moving it to a 64-bit box actually reduced the cost by, was it, 50% or more? Things like that. Simple things. And there's a lot of these kinds of simple things. And next to that, yeah, there are big projects. But you can't overload everybody with big projects, because you also need to make sure that you complete them to the end. I mean, starting on removing all single points of failure doesn't help anything if you stop halfway through.
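Werner's earlier remark about latency numbers, that a median on its own says little because half of all requests are slower than it, can be illustrated with a small sketch. The `percentile` helper below uses the nearest-rank method, and the latency samples are invented for illustration; nothing here reflects Amazon's actual tooling or data.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented page latencies in milliseconds: mostly fast, with a slow tail.
    latencies = [120] * 90 + [900] * 9 + [4000]
    for p in (50, 99, 100):
        print(f"p{p}: {percentile(latencies, p)} ms")
```

With these made-up samples the median looks healthy while the tail is many times slower, which is the part of the experience a median hides.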
[47:42]
Alicia
Yeah, we've removed half the single points of failure.
[47:47]
Werner Vogels
For example, we introduced game days where we would take out, pretend to take out, a data center. But, yeah, those things are big events and big projects that cannot necessarily always be in the way of, you know, new feature development and things like that. But it's crucial enough to the business to get all noses pointing in the same direction.
[48:09]
Alicia
Absolutely. Dan, thanks so much for spending so much time with us, really diving under the covers and giving us a different perspective on frugality and how it can be applied. Really appreciate it.
[48:20]
Dan Conte
Thank you. That's great.
[48:22]
Alicia
And Werner, it's always a pleasure to sit together and hear from our customers. It's a fun thing. We have customers all around the world, so we get to experience different perspectives.
[48:31]
Werner Vogels
Thank you. Thanks, Dan, for talking to us. Great stories.
[48:35]
Alicia
And of course, we do love to get your feedback. That's our own version of observability. awspodcast@amazon.com is the place to do it. And until next time, keep on building.