Summary (7 min read)

AWS Podcast #735: The Frugal Architect w/ Werner Vogels — Zillow's Chief Architect on Why Cheap ≠ Frugal

Release Date: September 1, 2025
Guests: Craig Link (Chief Cloud Architect, Zillow), Werner Vogels (CTO, Amazon)
Hosts: Simon Elisha, Hawn Nguyen-Loughren

Overview

This episode dives into the nuanced difference between being cheap and being frugal in cloud architecture and engineering, emphasizing how informed, intentional choices drive effective innovation and business value. Craig Link shares a narrative arc from formative childhood lessons on frugality to real-world scenarios optimizing large-scale, cost-effective cloud infrastructure at Zillow and previous ventures. The panel, with Werner Vogels’ perspective, explores concrete technical strategies, hard-won lessons, and their broader implications for teams balancing innovation, resilience, and cost.

Key Discussion Points & Insights

1. Origin Story: Frugality vs. Cheapness

⏩ 01:14 – 03:20

Craig Link recounts family road trips where his father's approach prioritized spending on experiences, not logistics, and measured efficiency (like miles per gallon):

“We'd minimize how many stops we'd have [and] really maximize…the things that we could do at our destination…A certain frugality of how you're choosing to spend your money and your time in the places you want it to be.” (Craig, 02:12)
The experience instilled a mindset of focusing resources on what truly matters — an early form of cost observability.
Werner Vogels distinguishes being frugal (intentional, value-driven) from being cheap (mindless cuts):

"Cheap and frugal are two very different things. Where frugal is a conscious decision to spend your money on those things that really matter to you for sure. Where cheap is for whatever reason." (Werner, 03:20)

2. Innovation Born of Constraints

⏩ 04:26 – 09:13

Early Microsoft Gaming Zone:
- Craig describes technical constraints of 90s dial-up gaming—every byte counted.
- Innovations included virtual LAN drivers and bit-packing to minimize network load:
  
  “Bytes were almost too big at that point…you really bitpack things…how do we really optimize the amount of traffic because the band was slow…” (Craig, 05:05)
- Pooling and reduction strategies reduced bandwidth while improving user experience.
- Werner draws parallels to the Kindle’s networking design, where Amazon ate connectivity costs but had to engineer carefully to avoid business-impacting overruns.
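As a rough illustration of the bit-packing Craig describes, the sketch below squeezes three small fields into two bytes. The field names and bit widths are hypothetical; the episode does not describe the actual wire format, and Python is used here only for readability.

```python
# Hypothetical field widths (not from the episode): 4 + 6 + 6 = 16 bits.
PLAYER_BITS, X_BITS, Y_BITS = 4, 6, 6


def pack_update(player: int, x: int, y: int) -> bytes:
    """Pack three small non-negative integers into 2 bytes (big-endian)."""
    in_range = (
        0 <= player < (1 << PLAYER_BITS)
        and 0 <= x < (1 << X_BITS)
        and 0 <= y < (1 << Y_BITS)
    )
    if not in_range:
        raise ValueError("field out of range for its bit width")
    word = (player << (X_BITS + Y_BITS)) | (x << Y_BITS) | y
    return word.to_bytes(2, "big")


def unpack_update(data: bytes) -> tuple[int, int, int]:
    """Inverse of pack_update: recover (player, x, y) from 2 bytes."""
    word = int.from_bytes(data, "big")
    y = word & ((1 << Y_BITS) - 1)
    x = (word >> Y_BITS) & ((1 << X_BITS) - 1)
    player = word >> (X_BITS + Y_BITS)
    return player, x, y
```

Sending each field as its own byte would cost three bytes per update; packing them costs two, which is the kind of saving that mattered on a dial-up link.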

3. Scaling: Elasticity & Pragmatic Overprovisioning

⏩ 09:13 – 12:23

FigurePrints Startup:
- Craig’s team brought 3D-printed World of Warcraft figurines to market, facing sporadic bursts of massive traffic.
- Used early AWS EC2 instances to scale rendering during blog promotions, and learned the priority of "throwing capacity" at critical problems before optimally resizing:
  
  “…Sometimes it makes sense to over provision…It's better to solve it, reduce the customer impact…get things stable and then right size it appropriately.” (Craig, 11:14)

4. Deep Technical Optimization at Glimpse

⏩ 12:51 – 16:19

As a location-sharing startup, Glimpse needed maximum efficiency per AWS instance.
Profiling revealed JSON serialization was a bottleneck, prompting a custom, highly-optimized serializer:

“We wrote our own custom JSON serializer…set up to know…where could we reduce memory copies…measured to be about 2.8 times faster than any other open source or built in JSON serializer…” (Craig, 15:08)
Key takeaway: Identify and solve for the bottleneck that has the biggest business impact.
Werner adds that Amazon moved its services off a very large SSL library to a minimal open-source implementation, for significant savings: “We wrote an open source version … purely focusing on performance and minimizing bytes on the wire … that saves us a significant amount of money…” (Werner, 16:19)
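A minimal sketch of the idea behind the custom serializer, not Glimpse's actual code (which is not shown in the episode): when the shape of the payload is known, size the output buffer once up front and write fixed-precision numbers straight into it, rather than growing and copying a string. The function name, the six-decimal precision, and the `[[lat,lon],...]` layout are assumptions made for illustration.

```python
import json

# Assumed worst case per coordinate at 6 decimals: "-180.000000" is 11 bytes.
_MAX_NUM = 11


def serialize_points(points: list[tuple[float, float]]) -> bytes:
    """Serialize [(lat, lon), ...] as a JSON array of pairs using one preallocated buffer.

    Assumes finite coordinates within normal latitude/longitude ranges.
    """
    # Upper bound: outer brackets, plus "[lat,lon]" and a separator per point.
    buf = bytearray(2 + len(points) * (2 * _MAX_NUM + 4))
    pos = 0

    def put(chunk: bytes) -> None:
        nonlocal pos
        buf[pos:pos + len(chunk)] = chunk  # same-length slice write, no resize
        pos += len(chunk)

    put(b"[")
    for i, (lat, lon) in enumerate(points):
        if i:
            put(b",")
        put(b"[%.6f,%.6f]" % (lat, lon))
    put(b"]")
    return bytes(buf[:pos])


if __name__ == "__main__":
    payload = serialize_points([(47.606209, -122.332069)])
    # The output is ordinary JSON, so any standard parser can read it back.
    assert json.loads(payload) == [[47.606209, -122.332069]]
```

Whether this beats a built-in serializer depends entirely on the runtime and workload, which is the point Craig makes about profiling first.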

5. Zillow’s Cloud Evolution: Visibility & Automation

⏩ 16:58 – 23:51

About Zillow:
- Real-estate marketplace focused in North America; highly variable and regional workloads.
- Craig led the on-prem-to-cloud migration, emphasizing repeatable infrastructure and cost visibility.
FinOps, Tagging, and Automation:
- Tagging is critical for cost allocation, accountability, and clarity — but spelling errors and lack of uniformity present challenges.
  
  “You need to understand…not just the account basis or the whole bill…we do slice and dice that based on these kind of business lines and even down to the team level.” (Craig, 22:55)
- Zillow built internal tools (service catalog, ETL pipelines) to create, normalize, and maintain high-fidelity tagging even in legacy/’lift-and-shift’ environments.
- “Guardrails” system (inspired by Amazon’s own approaches) flags best practice, security, and cost issues, balancing autonomy and compliance.
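The tag clean-up step described above can be sketched as follows. This is an illustrative approach, not Zillow's ETL: the canonical names and alias table are made up, and a real pipeline would run against cost-and-usage data in a warehouse rather than an in-memory list.

```python
import re

# Hypothetical alias table mapping squashed spellings to one canonical value.
CANONICAL = {
    "datascience": "data-science",
    "datascientist": "data-science",
    "datasciense": "data-science",
    "bigdata": "big-data",
}


def normalize_tag(value: str) -> str:
    """Collapse case, spacing, and punctuation variants to one canonical tag value."""
    key = re.sub(r"[^a-z0-9]", "", value.lower())
    return CANONICAL.get(key, key)


def spend_by_team(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Sum cost per normalized team tag; rows with no usable tag go to 'untagged'."""
    totals: dict[str, float] = {}
    for raw_tag, cost in rows:
        team = normalize_tag(raw_tag) or "untagged"
        totals[team] = totals.get(team, 0.0) + cost
    return totals
```

With this in place, "Data Science", "data-scientist", and "DataScience" all roll up to a single line item instead of three.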

6. Real-time Cost Feedback & Guardrails

⏩ 23:51 – 27:29

Engineers get real-time dashboards and JIRA tickets for policy violations or spend anomalies.

“With this internal service catalog…you can actually see kind of what your spend is based on those tags…” (Craig, 25:03)
“Guardrails not gates”—the system alerts and suggests, rather than blocks, allowing teams to move fast but correct course when necessary.
Automation is evolving: integration with IaC and CI/CD is planned, and trade-offs between business features and infrastructure improvements are ongoing.
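One way to picture "guardrails, not gates" in code: a check returns advisory findings that can feed a dashboard or ticket, and never raises or blocks. The policy below (two required tags) and all names are hypothetical, not taken from Zillow's system.

```python
from dataclasses import dataclass

REQUIRED_TAGS = ("service", "team")  # hypothetical policy


@dataclass
class Finding:
    resource: str
    message: str


def check_resource(name: str, tags: dict[str, str]) -> list[Finding]:
    """Return advisory findings for a resource; an empty list means no issues."""
    return [
        Finding(name, f"missing required tag '{tag}'")
        for tag in REQUIRED_TAGS
        if not tags.get(tag)
    ]
```

A gate would raise an exception here and stop the deployment; a guardrail reports and lets the team decide when to correct course.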

7. Balancing Optimization, Experimentation & Practical Lessons

⏩ 28:00 – 39:18

Premature optimization can backfire (Knuth’s law), as with NAT optimization for Kubernetes clusters that inadvertently raised costs:

“…It basically saved us around $15K a month…very simple change to get there of something that we thought we’d need that we ended up really not needing.” (Craig, 31:00)
Observability:
- Make systems observable across metrics: cost, logs, traffic, resource utilization.
- Rapid iteration and pivoting is now viable—cloud as code.
- Cost spikes are “the canary in the coal mine”—often the first sign of deeper inefficiency.
Werner: Historical example—Amazon reduced search cost 4x by switching from 32-bit to 64-bit instances after benchmarking.

8. Aligning Cost with Business Activity

⏩ 33:30 – 35:07

Monitor cost relative to usage; increases driven by growth are good, unexplained increases are red flags:

“If cost is going up and the number of transactions, users…what have you is going up as well…that's a happy problem.” (Craig, 33:48)
Always ask why—is the cost justified, or is it a sign of waste? Cultural curiosity is key.

“If nobody asks the question why, you won’t catch that.” (Craig, 34:27)
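The "cost relative to usage" check is the cloud version of the miles-per-gallon habit from the origin story. A minimal sketch, assuming each period is summarized as a (total cost, activity units) pair; the function names and the 10% tolerance are illustrative choices, not figures from the episode.

```python
def unit_cost(cost: float, units: float) -> float:
    """Cost per unit of business activity (for example, dollars per transaction)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return cost / units


def needs_a_why(
    prev: tuple[float, float],
    curr: tuple[float, float],
    tolerance: float = 0.10,
) -> bool:
    """True when unit cost rose by more than `tolerance` between two periods.

    Each period is (total_cost, activity_units).
    """
    before, after = unit_cost(*prev), unit_cost(*curr)
    return after > before * (1 + tolerance)
```

If spend doubles while transactions also double, unit cost is flat and the check stays quiet (the "happy problem"); if spend rises with flat activity, it flags the period for someone to ask why.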

9. Empowering Engineers: “Think Big, Move Fast”—Without Waste

⏩ 35:41 – 38:59

Zillow promotes rapid prototyping and experimentation, but with conscious awareness of cost and value.

“We definitely encourage people to experiment…think big. And move fast…but also want to empower our engineers…It’s that shared ownership…being aware of your spending.” (Craig, 36:09)
With generative AI and new services, trade-offs (like choosing between heavyweight and lightweight Foundation Models) can be significant.

“Trading off quality versus cost there is important, I think.” (Werner, 37:16)
Guardrails provide safety, rapid feedback, and course correction—empowerment, not restriction.

10. Final Reflections and Advice

⏩ 39:18 – End

Craig Link’s Advice:

“Make sure that you really democratize the data and get it to any and everybody’s hands…You want everybody working on it and kind of being a shared ownership model.” (Craig, 39:39)
Lower barriers to observability, empower all engineering teams with actionable data & accountability.
Constraint breeds creativity:

“Constraints can breed creativity. I mean, it forces our human brains to live. I mean, AI can’t fix this. This is something that only we as humans can do.” (Werner, 41:09)

Notable Quotes

“Frugal is a conscious decision to spend your money on those things that really matter to you…where cheap is for whatever reason.”
— Werner Vogels, 03:20
“Bytes were almost too big at that point…how do we really optimize the amount of traffic because the band was slow…”
— Craig Link, 05:05
“Sometimes it makes sense to over provision…It’s better to solve it, reduce the customer impact…and then right size it appropriately.”
— Craig Link, 11:14
“We wrote our own custom JSON serializer…measured to be about 2.8 times faster than any other … out there.”
— Craig Link, 15:08
“Cost is the canary in the coal mine…If you see true spikes in your cost that are not related to your business activity, you have a benchmark to put this against.”
— Werner Vogels, 32:12
“If nobody asks the question why, you won’t catch that.”
— Craig Link, 34:27
“Constraints can breed creativity. I mean, it forces our human brains to live. I mean, AI can’t fix this.”
— Werner Vogels, 41:09

Key Takeaways

Frugality ≠ Cheapness: Strategic investment and intentional constraints empower greater creativity and ultimate business value.
Embrace Observability: Democratize access to cost and performance data across engineering teams—let those closest to the work make informed decisions.
Guardrails, Not Gates: Enable autonomy while enforcing organizational objectives through transparent, automated feedback systems.
Technical Optimization: Bottlenecks are context-dependent—profile, measure, and surgically optimize.
Business Context: Link technical metrics (especially cost) tightly with business outcomes—never view in isolation.
Iterative Mindset: Be prepared to pivot, learn from anomalies, and promote a culture of curiosity and shared responsibility.

For further feedback and to engage with the show, visit awspodcast.com.


Transcript

[00:00]
A
This is episode 735 of the AWS Podcast, released on September 1st, 2025.
[00:09]
B
Hello, everyone, and welcome back to the AWS Podcast. Simon Elisha here with you. Great to have you back for another very special episode of the Frugal Architect. And of course, I'm joined by Werner Vogels, our Vice President and Chief Technology Officer at Amazon. G'day, Werner. How you doing?
[00:22]
C
Doing very well, thank you.
[00:24]
B
Excellent. And we are joined by a special guest today. We're joined by Craig Link, who is Chief Cloud Architect at Zillow. G'day, Craig. Welcome to the pod.
[00:32]
A
Hello both of you. Thanks for having me on.
[00:34]
B
It's really exciting to have you here because we've got some stories to tell. This is going to be a good one, Werner. I think Craig brings a really interesting background to the table here.
[00:44]
C
Absolutely, I think. Craig, first of all, thank you for writing the blog post. What I really liked about the blog post is that, I won't say it shows your age, but it shows how, from one job to another job, you have taken your previous learnings and applied them in the next situation that you're in. And it is great storytelling. So I urge everyone to read the blog.
[01:13]
A
Thank you.
[01:14]
B
It's a definite must do because it also helps, I think as with most superheroes, you need the origin story. It helps with some origin stories. So maybe, Craig, you talk in your blog post about your father's frugality during road trips and how that helped, I think, form your views. Can you tell us a little bit about that? Take us back there.
[01:36]
A
Yeah. I was growing up in the Midwest. Often during the summer we'd do a two-week-long family vacation somewhere, often out to the mountains in Wyoming, or down to Florida to Disney World, or some location that involved a fair amount of travel. Back then, it usually meant climbing into a pretty small car with me and my two younger sisters and my mom and driving several hours to get there. And part of that would be that we'd minimize how many stops we'd have, and hotel stays as well. So it wouldn't be uncommon for them to wake us up at 4 or 5 in the morning to get in the car, have, you know, some snacks or what have you. We'd have pillows and we'd sleep in the car, and they'd have several hours under the belt before we'd wake up, and, you know, drive 10, 12 hours that day and then be at our destination sometime the following day, and really minimize what we were having to spend on the trip to get there, and really maximize the time and money and things that we could do at our destination, where we were going. Which basically, you know, if you look back on that, is a certain frugality of how you're choosing to spend your money and your time in the places that you want it to be. And so I think, you know, just across my upbringing in life, that kind of just resonated with me, kind of built into who I was. Similarly, you know, he did a lot of the car maintenance himself. He's a very handy man. And one of the things too was always watching the mileage that the car was getting, which would be indicative also of how well it's running, other things going on, perhaps engine problems. So he was always kind of computing: when you fill up the tank, fill it all the way up, know how many gallons of gas you used, and by looking at the odometer compute what you're getting as your miles per gallon. 
So it's, you know, almost a unit cost metric in a way that you could leverage to understand some health.
[03:20]
C
Yeah. The important part there, of course, is to realize that cheap and frugal are two very different things. Where frugal is a conscious decision to spend your money on those things that really matter to you for sure. Where cheap is for whatever reason.
[03:37]
A
Yeah. Wasn't always obvious as a young child at that point, but it definitely becomes more obvious as you grow and learn and kind of understand.
[03:46]
B
Isn't it interesting that perspective matters as well as the experience? And maybe I think that's another call out, that helping folks understand why you're doing things when you're doing them could also be useful. Now, obviously it's different when you're a kid, but working with colleagues or if...
[04:06]
C
You're a young startup, where speed and innovation matter more than the exact bucks that are getting out of your pocket. But at some moment you have to realize that the technical debt that you created, you have to pay it off. As with any other debt, it will come back to haunt you.
[04:27]
B
And there's no right answer, but you know it when you see it. But it's interesting. Again, this comes back to this concept of living with constraints and the reality of constraints that we have. And you dealt with some really interesting constraints when you were working at Microsoft's Gaming Zone. And I love these stories because, you know, kids these days, they don't understand what it was like to game in this environment. But you were working in a really interesting, interesting world where, I think as you mentioned, it wasn't that the bytes mattered, it was that bits mattered. So tell us about what was going on there, because there's some really interesting innovation you did there around that.
[05:06]
A
Yeah, so back in kind of the early and mid-90s, you know, not everybody had high speed Internet. There was still a lot of dial-up via AOL or CompuServe, the different things. And we had a business where it's really getting people together to have a community to play games online with each other. There was kind of your card and board games, but we were also getting into the matchmaking for some games like Age of Empires, and Quake was a big one that was back there. Before that you could play LAN games, but it didn't really play on the Internet. And so we had actually created a device driver that allowed kind of a virtual LAN to be created over the Internet, over kind of these dial-up connections, and move kind of the IPX packets back and forth between them. But as you mentioned, bytes were almost too big at that point. Like, we were looking at where we could really bit-pack things and how do we really optimize the amount of traffic because the band was slow. You know, a 9600 baud modem was pretty commonplace at that point. You know, when you got to 14.4, that was a huge upgrade. And so one of the things we did with that being that matchmaking service was really trying to find other people who were closer to you in latency so that you would have kind of, you know, what people refer to as a low ping time. And so doing that, you know, you're basically reaching out, sending network packets to the other person, seeing how long it takes to respond, kind of measuring that. But when you're in, say, a game lobby that has hundreds of people in it, if you're doing that to everybody there, you're saturating your line as well. So you have to be kind of smart about how you could actually move those packets around, leverage data that's going back. So some of the things we had done was basically realize that when I'm sending a packet to you and you're sending back, we don't both have to send two packets. 
We could actually eliminate one of the packets by leveraging one of the packets that you had sent to basically be a response, so that we only had to send three packets. So we've basically saved some network bandwidth there, and then also scheduling out how frequently we send those packets across the lobby in kind of a measured way gives a much more consistent measurement. So we don't have kind of a bursting nature where packets are stomping on each other or causing congestion unexpectedly, et cetera.
[06:58]
C
Well, it's not that different from when we built the first Kindles. The first Kindles had free networking inside. You didn't have to pay for it, or you paid a little bit more for the modem that sat in it if you wanted that. But basically we ate the cost of transfer for you to ping back to the central system to see whether there were new books for you, or transferring the books to you. It cost some money. Everything cost money, and you didn't want to have a surcharge on the book, but, you know, you still had to architect for a hidden cost.
[07:39]
B
And that's, it's interesting because that's where paying attention to the deep details becomes important. And I think this is a challenge for a lot of folks because, I mean, in technology there's so much to cover and there's so much to think about. It seems almost over the top to dive so deep, you know, in this case into the packet level. But the ramifications from a pure business perspective were really strong here, and from a user experience perspective were fundamental. So you kind of had to do it. It's worth your time to do this, even though it seems really, really, really deep in there.
[08:15]
A
Yeah, and I mean, you know, similar to these things, when you're at a small startup scale, you know, you're not using that much network traffic on the Internet perhaps, but when you start to go to kind of worldwide scale, Amazon scale, or, you know, kind of massive, every byte or two that you're sending on a URL link or any of that stuff adds up over time really fast. And so, you know, being conscious of what you're sending makes a huge difference and can affect the bottom line.
[08:37]
C
Well then think about something like Fortnite, you know, or when a next version or a next sort of feature comes out. A million customers will show up on your doorstep at that particular moment. And no matter how much bandwidth you have, and no matter how much capacity you have laying around, it still matters a lot in those situations. Even today with massive broadband everywhere.
[09:02]
A
Yeah, especially when you have those kind of large scale events that happen at a similar time. The pipes are only so wide, only so much fits through at a certain point. And so, you know, you need to be conscious of that. You do, you do.
[09:14]
B
Well, related to that, you had a really interesting experience at an organization or company called FigurePrints, which brings physical and virtual together in a really interesting way. And I think it was also your first taste of elasticity and what it can do.
[09:28]
A
Yeah, yes, it was a trending thing about 2004, 2005. So AWS was also just kind of getting off the ground. It pretty much just had EC2, S3, and SQS as kind of its core services. FigurePrints was a company that we had created that used color 3D printers to create people's real-life images or figurines of their World of Warcraft characters. So it's, you know, a keepsake you could put onto your desk and that. And so what we had done is created a website where we could get people's character information. They could then go pose it, kind of spin it around, look at it, change the armor, change kind of the poses they had with the armor, and then basically say print. And it would send it off to our machines and we'd process it. Part of the deal we had was with Blizzard at the time, and they would feature us on the World of Warcraft blog page website every quarter or so. At the time, I think they had 6, 7 million worldwide subscribers playing the game. And so, as we were talking about, you know, on one given day, all of a sudden our traffic would go from, you know, 10 to a few hundred to millions. And people would hit the servers and try to render their character. And then, you know, we'd fade off the blog page and their main site and the traffic would dwindle down for, you know, several weeks of the quarter, then, you know, next quarter it would spike up. So we really didn't have this huge need for all this capacity to do all this rendering on a regular basis. But we wanted to be able to have a responsive website when customers came to us. And so that's when we started to use the elasticity of AWS in the cloud. And so they had come out with some of their first kind of compute-type instances, which were the C1s early on, versus some of the more generic ones, and we had spun those up. 
And we basically had created a set of renderer servers that were using some open source 3D rendering software to basically do kind of the renderings online, generate those images and send them back to us. And so as we were doing that, we'd be featured on this page, traffic would go there, we had it all planned out. We assumed it was going to work well. And then the first time we had all this traffic, it was much more than we had expected and the servers fell over. We couldn't keep up with it. And so I added a few more. It caught up for a brief period and fell over again. And it was really finally I'm like, you know, I'll add, I can't remember, it was like 10 or 12, and we provisioned enough. We were fine. We got up there and then I was slowly able to pare back, and that would be a lesson that I'd have to kind of learn a little bit over and over time. Sometimes it makes sense to over provision, like especially if you're having an incident or something like that. Instead of trying to guess what is the right amount to be there and not quite getting there and not solving the issue, it's better to solve it. Reduce the customer impact, perhaps reduce any, you know, revenue impact to your business, get things stable and then right size it appropriately. So yeah, that was my first dabbling with AWS, and it was really from that day on I kind of got hooked and, you know, it kind of led to my career in the cloud and where I'm at today.
[12:11]
C
Yeah, throw capacity at it is a good first line of fire in solving problems.
[12:19]
A
Especially if you remember to turn it down as well.
[12:21]
B
Well that's right.
[12:23]
C
Or maybe you go home at night.
[12:25]
B
But it's interesting too when you think about it. We're already sort of, we're talking at two very different extremes here, because we just talked about sort of diving deep to the packet level and the bit, and then we were just talking about throwing servers, which at the time, the C1 servers, were pretty significant. To be able to even get that sort of hardware on demand in the...
[12:46]
A
Early century, you couldn't easily do that. There weren't very many other cloud platforms out there, right? It was really revolutionary at the time.
[12:52]
B
It was. And so on one hand you're going deep, then you're going broad again, then you're at Glimpse, and again this optimization bug continued to get at you and you sort of noticed something. And again, this was about invention, which is really interesting to me. It wasn't just sort of tweaking. This was causing such an issue that you had to really dive deep and make something new. Tell us about that, because really it's a fascinating experience of optimization.
[13:19]
A
Yeah. So Glimpse is a real-time location sharing startup where, basically similar, you know, to Uber and different apps now, you get a map in kind of real time, but you're able to kind of share yourself with your friends or whoever you choose. And we kind of update that location in real time. But also being a small startup, our budget was really, really limited, and so we were trying to get everything we could out of, you know, every instance that we had running in AWS. So it's how do we maximize the requests per instance and, you know, throughput, how fast we're responding? Because, you know, the longer you take for requests, the more system resources you're holding on to for that given request, which starts to limit things. So it's really how fast can you process it? What's your throughput, how many resources is each of those requests taking while they're on the instance running? So it got to a point where, taking from a lot of my game development days, it was using profiling, so really running profilers and understanding what are the hotspots and flame graphs for kind of your code and where maybe there's opportunity to optimize. And one of the things that stood out at the time was, you know, back in the day it was very RESTful development. Everybody was kind of making their APIs, sending down REST. So it was typically JSON-type blobs coming back, which is not a very compact protocol. So, you know, it's verbose. But not only that, it actually took a lot of time to serialize those JSON responses. And a lot of our responses would be a lot of just numerical GPS locations and kind of arrays and strings and that. And so that time to serialize decimal numbers into a string and pack them out was taking a fairly large amount of processing time. 
And often when you're kind of using the string serializers in kind of most native languages, and this happened to be in C at the time, as you hit the size of a buffer, it basically has to allocate a larger buffer. It then does a memory copy, copies those bytes. And so when you're doing that, you often hit locks in your memory as well that prevent other threads from grabbing perhaps memory off the heap and different things like that. So we were having some contention there. So, seeing that was a hotspot, we wrote our own custom JSON serializer, and it was basically set up to know, hey, where could we reduce memory copies? So if we knew how long this was going to be, we could allocate that larger buffer size up front. It was also smarter about how we do kind of serializing the decimal numbers to strings, and kind of really focused on how do we reduce those memory copies, which would also reduce the heap contention and get the data out faster. And also then, you know, we'd have less memory fragmentation and other issues that would potentially go on. So at the time, I believe it was measured to be about 2.8 times faster than any other open source or built-in JSON serializer that we had out there. It wasn't something that anybody could take and drop in. It was definitely tailored for what we were doing. Slowly, yeah, we started to evolve it over time to where it became fairly generic. And it was very much a serializer, not a deserializer as well.
[16:06]
B
So you're solving for the problem that you saw that was causing the biggest effect.
[16:10]
A
Yeah, and it helped us basically serve more requests per instance, which would keep our cost down for our business, which meant we could stay alive as a startup longer.
[16:19]
C
Well, there are different things that we have gone through as well. I mean, originally most of our services were using libssl as their interface, but that one has about 2 million lines of code. And we wrote an open source version of it that is actually sort of minimalized, purely focusing on performance and minimizing bytes on the wire and the serialization and deserialization to be done. So that saves us a significant amount of money just by not carrying the other one and a half million lines of code.
[16:59]
B
Yeah, less can be more. So, Craig, now you're at Zillow. So firstly tell us a bit about Zillow, and then there's some really interesting work you've done around monitoring and automation, because those things go together very much. But for our listeners, because not everyone's in the U.S., tell us a bit about Zillow and what you do there.
[17:16]
A
Yeah, so Zillow is a company that basically focuses on the real estate space and encompasses everything, really trying to meet our customers on what is called home and really their next journey, whatever that may be, whether it may be buying their first home, renting, perhaps downsizing, maybe buying a vacation home, or maybe you're also on the real estate side of it and you're an agent or a broker and you're interacting with buyers. So it's really connecting all those different people and helping them smooth out that process that in the past has often been fairly opaque and unclear what's kind of going on there. We're focused 100% in North America, so the US and Canada, and part of that is that the real estate industry is so varied across the world, it's hard to generically understand and follow all the rules. And there's still a big opportunity where we're at now. So we're really kind of focused on continuing to grow there.
[18:11]
B
Fantastic. And you use AWS a lot for delivering that. And again, this is one of these sort of, you know, unpredictable workloads, different regions having different activity depending on what's going on in local markets, et cetera. And you sort of step back and thought about how things were being monitored and managed and you evolved that thinking. Help us tie those strands together and understand what your thinking was and what you got to do.
[18:39]
A
Yeah. So, you know, from my previous positions at the other companies, I had been using AWS. I was aware of that. I came into Zillow to help move them out of some on-prem data centers into the cloud and really take advantage of AWS. And we started to put processes in place. That started with infrastructure as code and kind of getting a repeatable way to build that code. But having had experience trying to run a pretty lean ship at my previous startups, I knew that we also would have to be aware of cost and where that spend is and be able to identify it. And much like the previous podcast you had with Tom Lehman from Warner Brothers, Disney, it was really starting to think about how do you tag things so that you can get that visibility. So, you know, you can have coarse granularity, which is kind of, say, the AWS account boundary. And, you know, we've increased the number of AWS accounts we have over the years. I think we're at close to 300 at this point, where they're divvied up based on teams and different organizational structures and production, non-production. But even within those, you need finer granularity, where you may want to know at a given service level, or maybe there's six or seven development teams that may be sharing an account, or you may have some legacy account that has hundreds of teams doing different services or things that may have existed because of the forklift nature of that. And so as spend evolves over time, you need to understand where there may be a spike or a decrease or something's going on or an opportunity. And just saying, hey, it's EC2, is not going to be good enough. You need to be able to understand who to reach out to and empower to make those changes. And so having some kind of tagging taxonomy really kind of helps with that.
[20:09]
C
So often the taxonomy comes to life when you have the luxury of greenfield and building piece by piece and you know which path you're going. But I understand your first, your first phase of what you did at Zillow was lift and shift.
[20:24]
A
Yeah.
[20:25]
C
So how did you go from lift and shift, let's say the old architecture to get to a new point. How did you do that?
[20:36]
A
So fortunately, I would say from our lift and shift we had a pretty strong naming scheme for most of our existing services that were lifted, and even for the hosts they were running on. So we had a general idea of what was going on there, and then we really leveraged the AWS Cost and Usage Report files and pulled those into Redshift so we had all the data in there, and then leveraged Tableau and some other things to slice and dice that. But to address that point where not everything was tagged during that lift and shift, we used some ETL processes to augment that data. So it wasn't just the raw CUR data. We'd run a set of procedures on top of that that would annotate it, update service names, and perhaps add tags that weren't there. And one of the challenges with tags in AWS is that they get recorded as they are at the time of the usage. If you change a tag on a resource, it doesn't back-propagate for the X number of months that maybe it was untagged or mistagged. And often there were spelling errors: you know, the number of different ways you can say data scientist or big data, changed a little bit in spaces, punctuation, all that. So, you know, having a process to clean up and normalize those tags was also important. So we created that, and over time we've actually evolved. We have our own service catalog and developer portal within Zillow Group, and that is the place where somebody's going to create a new service. They define it there, they click on it, and we drive a set of tags off of that. And we have a Terraform module that they can use when they're provisioning their resources that reaches back to Zodiac, knows the name of that service, and then propagates all the other tags, from team to business line to a bunch of those things, so they don't have to type that in or mistype it. And we can make that data driven and adjust that over time.
And then on the back side we have a different set of ETL rules. You know, as two-pizza teams get too big, they split. All of a sudden, a service that was owned by one team is now owned by another team. How do you manage those tags? We're able to do that in a constructive fashion to make sure there's always a team ownership, so services don't get orphaned. We can split where the budget and money is flowing to as organizations change and deal with budget shifts, et cetera.
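The tag clean-up described in this answer (normalizing spelling variants, then back-filling months where a resource was untagged) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Zillow's actual ETL: the row layout, the `CANONICAL_TEAMS` mapping, and the `unowned` fallback are all hypothetical.

```python
import re

# Hypothetical mapping from squashed spellings to one canonical team name.
CANONICAL_TEAMS = {
    "datascience": "data-science",
    "datascientist": "data-science",
    "datascientists": "data-science",
    "bigdata": "big-data",
}


def normalize_tag(value):
    """Collapse case, spaces, and punctuation, then map to a canonical name."""
    if not value:
        return None
    key = re.sub(r"[^a-z0-9]", "", value.lower())
    return CANONICAL_TEAMS.get(key, key) or None


def backfill_team(rows):
    """Apply each resource's most recent non-empty team tag to all of its rows.

    rows: dicts with 'resource_id', 'month' (YYYY-MM), 'team', and 'cost'.
    Resources that were never tagged are assigned to 'unowned' (an assumption).
    """
    latest = {}
    for row in sorted(rows, key=lambda r: r["month"]):
        team = normalize_tag(row.get("team"))
        if team:
            latest[row["resource_id"]] = team
    return [
        {**row, "team": latest.get(row["resource_id"], "unowned")}
        for row in rows
    ]


if __name__ == "__main__":
    billing = [
        {"resource_id": "i-1", "month": "2024-01", "team": "", "cost": 10.0},
        {"resource_id": "i-1", "month": "2024-02", "team": "Data Scientist", "cost": 12.0},
        {"resource_id": "i-2", "month": "2024-01", "team": None, "cost": 5.0},
    ]
    for row in backfill_team(billing):
        print(row["resource_id"], row["month"], row["team"])
```

Using the latest tag for every historical row is one possible policy; a reorganization that moves a service between teams would need effective dates instead.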
[22:40]
C
So you, you mentioned business lines. So you not only use your tags for, let's say, technology pieces or for services that you've built, but also for who you are internally charging back to.
[22:55]
A
Yeah, and it's almost less of a chargeback. I'd say it's actually about who owns this, and so they're the ones provisioning it. Sometimes there may be somebody running something for somebody else. We do do a bit of chargeback for some networking and shared resources, but for the most part, it's a team that's provisioned those resources and tagged them, so that dollar amount is flowing to them. And being a large organization with all of our AWS bill going to a single place, we break that up. It's not just on an account basis or the whole bill. So, yes, we do slice and dice that based on these business lines and even down to the team level.
[23:31]
B
And this comes down to that mental model piece, which is if you're not giving the folks who make the design decisions and the implementation decisions access to the data, when you're filling up the petrol tank, if you're not seeing the mileage on the engine, how are you expected to make good decisions, or how are you expected to understand the ramifications of a decision you may make in good faith? This is tightening that feedback loop really, really well.
[23:55]
A
Yeah, and I'd say that's one of the bigger challenges and mental model shifts for people to make when you're moving, say, from on-prem to the cloud. Often on-prem, you don't see those costs. It's a fixed cost, the hardware's been built. Your engineering teams don't see any of that. They're, you know, provisioning things. Maybe somebody's aware that, oh, I can be more efficient, or it matters how many instances or VMs they're using, et cetera. But that comes right into your face once you actually move to the cloud and you're paying for those per second, per hour, and your bill is spiking, et cetera. And so, to your point, how do you get that back in front of the engineers who can fix it and are aware of it? It's not just sitting at a high level in your finance department; you actually want to make it visible to the engineers provisioning the resources so they're aware of it, they understand how that's impacting the bottom line, and you empower them to be aware of how they could potentially improve it.
[24:47]
C
The cost metrics, do they get in front of all the designers and all the builders? I mean, do they have a sort of real-time view of what their piece of the world that they're working on is actually costing at that moment?
[25:04]
A
We do. It took us a while to get there, but with this internal service catalog that we talked about, where you provision the services, you actually have a team view and a per-service view. And with that you can actually see what your spend is based on those tags we're able to pull up. Additionally, something that was inspired by how Amazon.com runs their AWS accounts: we created a system called Guardrails. It's a set of rules, and instead of providing gates (the re:Invent talk that inspired it was about guardrails, not gates), it asks how you allow builders to create things but catch when maybe something's off, and give them those guardrails to be able to fix it. And so we're also able to surface those guardrail tickets back in this service catalog and developer portal. They get assigned out to people, and they range across a whole set of categories, from best practices to security to cost savings to even perhaps maintenance, where an EC2 instance is going to be rebooted. And those show up in that developer portal as well. We also leverage Jira for that, so they also get assigned to an individual to be able to track that. And then we have a number of reports that we report back on and track. And depending on the severity, I or somebody on a different team will reach out to individuals to perhaps remediate them quickly.
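A guardrails-style checker of the kind described here, one that lets builders create resources and then raises tickets when a rule is crossed, might look roughly like the sketch below. The rule names, resource fields, and `Ticket` shape are hypothetical and are not taken from Zillow's system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    category: str  # e.g. "cost", "security", "best-practice"
    violates: Callable[[Dict], bool]


@dataclass
class Ticket:
    rule: str
    category: str
    resource_id: str
    owner: str


# Hypothetical rules for illustration only.
RULES = [
    Rule(
        "missing-service-tag",
        "best-practice",
        lambda r: not r.get("tags", {}).get("service"),
    ),
    Rule(
        "unattached-volume",
        "cost",
        lambda r: r.get("type") == "ebs_volume" and not r.get("attached_to"),
    ),
]


def evaluate(resources: List[Dict], rules: List[Rule] = RULES) -> List[Ticket]:
    """Check every resource after the fact and emit one ticket per violation."""
    tickets = []
    for res in resources:
        owner = res.get("tags", {}).get("team", "unowned")
        for rule in rules:
            if rule.violates(res):
                tickets.append(Ticket(rule.name, rule.category, res["id"], owner))
    return tickets


if __name__ == "__main__":
    inventory = [
        {"id": "vol-1", "type": "ebs_volume", "attached_to": None,
         "tags": {"service": "search", "team": "search-infra"}},
        {"id": "i-1", "type": "ec2_instance", "tags": {}},
    ]
    for ticket in evaluate(inventory):
        print(ticket)
```

Routing each ticket by the team tag is what ties this back to the tagging taxonomy: without an owner on the resource, there is nobody to assign the ticket to.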
[26:18]
C
Oh, so that's mostly a manual process.
[26:23]
A
It's fully automated. I'm just saying if you happen to see something that maybe is a larger dollar amount or not, you may want to reach out to make sure that somebody's eyes are on it and the right priority is being given.
[26:35]
C
Yeah, the guardrail idea is really cool. Does it integrate into your Terraform as well?
[26:42]
A
So it does not currently, which is something we've talked about, perhaps leveraging Open Policy Agent, where it's basically how do you take those rules there too? So, you know, the best case is for it to be prevented in the first place, or at least to be able to catch it. Tying that into some of our CI/CD process and different things, we haven't quite gotten there. It's definitely something we've talked about, but it's, you know, a trade-off of where we're engineering and spending our resources, based on customer-facing features versus some infrastructure and different things, and what's doing well. And there is that learning too: as your engineers embrace the cloud and start working in a cloud-native environment, if you're doing that well, they're also picking up what those best practices are, and so those incidents become less and less as well. So we're in a pretty good state that way, but it's definitely something we continue to talk about.
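As a rough illustration of the "prevent it in the first place" idea, a CI step could inspect a Terraform plan before it is applied. The sketch below is written in plain Python rather than Open Policy Agent and is not something the speakers describe having built. It assumes the plan has already been exported with `terraform show -json` and parsed into a dict; the `REQUIRED_TAGS` set is hypothetical.

```python
# Hypothetical policy: every taggable resource must carry these tags.
REQUIRED_TAGS = {"service", "team"}


def missing_tags(plan):
    """Return {resource address: [missing tag names]} for created or updated resources.

    `plan` is the parsed JSON plan. Only the `resource_changes` list and each
    entry's `change.actions` and `change.after` fields are read.
    """
    problems = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if not {"create", "update"} & set(change.get("actions", [])):
            continue  # deletes and no-ops need no tag check
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # this resource type exposes no tags attribute
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            problems[rc["address"]] = sorted(missing)
    return problems


if __name__ == "__main__":
    sample_plan = {
        "resource_changes": [
            {"address": "aws_instance.web",
             "change": {"actions": ["create"],
                        "after": {"tags": {"service": "web"}}}},
            {"address": "aws_instance.old",
             "change": {"actions": ["delete"], "after": None}},
        ]
    }
    print(missing_tags(sample_plan))
```

One limitation of this sketch: tag values that are only known after apply do not appear under `after`, so a real check would also need to consult the plan's unknown-value information.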
[27:30]
C
Are your teams in one location, or do you have remote collaboration as well?
[27:37]
A
Zillow is fully remote, so we moved out of the offices at the beginning of COVID and have never gone back. We embraced what we call Cloud HQ, which really opened up our ability to hire across the entire country instead of just the few states and cities that we were based in. So it's, you know, been a great opportunity for us, and we're planning to stay that way, unlike several other tech companies out there perhaps.
[28:01]
B
Now it's interesting you talked about obviously the optimizations and the improvements, et cetera. And as engineers, once we find something that works well, we think, oh, we can really go hard on this. And this is not new. I mean, Donald Knuth said years ago that premature optimization is the root of all evil. And I think you've got a great example where you were trying to do the right thing from a networking cost management perspective and it sort of backfired a little bit, and I think it's a good lesson about how to balance this a little bit. Tell us a bit more about what you thought you were doing, what happened, and where the reality landed.
[28:39]
A
Yeah. A few years ago, when Zillow was starting to really embrace Kubernetes and we were starting to leverage AWS EKS to provision our Kubernetes clusters, we were creating a set of new accounts that would just run the clusters and be able to role-switch into other places for other AWS resources, and deciding what size of VPCs were needed for those and how we might route traffic between them. We also knew that Kubernetes was very IP heavy and likes to consume a lot of them. So we basically provisioned a lot of VPCs with a 10.x/16 address range for basic peering and cross-routing, and then leveraged the 100.64 IP space that Amazon encourages you to leverage for the AWS CNI and some of that. And so we provisioned that and decided that we would not route that anywhere, and started using NATs to route the 100.64 addresses to the 10 space. And then, you know, we had set up a handful of VPCs and our clusters up there and started to leverage those. It was working fine, connectivity was working. But over time we started to see our bandwidth and NAT costs really start to increase. And what had happened was all the pods were getting provisioned in that 100.64 space. And our initial thought was that they'd typically just be talking to other services within the cluster. It'd be Istio-based traffic. You know, it'd stay within a similar AZ and different things. But that was just a misassumption. There's something off on databases, or, oh, I need to call an API over here. Also, legacy traffic, with things moving from other legacy EC2 systems to the cluster, takes a while. So there are load balancers outside the cluster that people are calling, and vice versa. And so that traffic through those NATs really started to escalate. And it was something like, why are we paying for this?
And the thing we had realized: we had done this because we thought we'd consume all these IP addresses and we wouldn't have enough. But looking at what we were actually using within the clusters in the 10.x/16 space, we had plenty sitting there. And so it was a really simple change to the allocation strategy of our node pools to say, hey, use the 10.x/16 space instead of the 100.64. All the pods, as they recycled, came up in there and started avoiding that NAT. And it basically saved us around 15K a month in network and NAT charges by avoiding that processing. So a very simple change to get there, for something that we thought we'd need that we ended up really not needing. So to your point, that's kind of a premature optimization, not knowing what our traffic patterns would really look like.
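The two checks behind this story, how much of the routable range is actually in use and which pod addresses would have to cross a NAT, can be sketched with Python's standard `ipaddress` module. The CIDR blocks, pod counts, and per-GB price below are placeholders chosen for illustration, not figures from the episode; only the 100.64.0.0/10 shared address range itself is standard.

```python
import ipaddress

# Shared address space commonly used as a secondary, non-routed pod range.
SECONDARY_RANGE = ipaddress.ip_network("100.64.0.0/10")


def utilization(cidr, addresses_in_use):
    """Fraction of a CIDR block that is currently allocated."""
    return addresses_in_use / ipaddress.ip_network(cidr).num_addresses


def crosses_nat(pod_ip, routable_cidrs):
    """True if a pod address is outside every routable range and so needs NAT."""
    addr = ipaddress.ip_address(pod_ip)
    return not any(addr in ipaddress.ip_network(c) for c in routable_cidrs)


def monthly_nat_processing_cost(gb_per_month, price_per_gb):
    """Data-processing charge only; the price is supplied by the caller."""
    return gb_per_month * price_per_gb


if __name__ == "__main__":
    routable = ["10.20.0.0/16"]  # hypothetical VPC primary range
    print(f"primary range in use: {utilization('10.20.0.0/16', 12000):.1%}")
    print("100.64.3.7 crosses NAT:", crosses_nat("100.64.3.7", routable))
    print("10.20.4.9 crosses NAT:", crosses_nat("10.20.4.9", routable))
    # Assumed price per GB; check current pricing for your region.
    print("NAT processing for 100 TB:", monthly_nat_processing_cost(100_000, 0.045))
```

A /16 holds 65,536 addresses, so a utilization figure well below 100% is the kind of evidence that would justify moving pods back into the routable range.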
[31:08]
B
But I think that's part of working in the cloud. And again, the difference from being in an on-premises situation is you can make the change. Like, if you get it quote-unquote wrong, or I want to do it a different way, it's code, you make the change. You don't say, well, next time when I build this in three years' time, I'll do it this way. It's nice to have that exit hatch or that escape hatch to be able to make those changes, and being alive to that idea of, hey, I should be paying attention to this because I could make a change.
[31:36]
A
Yeah. And as you were kind of hinting there, it's also important to pay attention to that and make sure you have your systems observable and understand what's going on with them. So you're looking at not just cost, but maybe also logs of network traffic, bandwidth, how many IPs are being used there. So you want to have all that information available to you so you can make those kinds of educated choices and be able to do quick pivots and iterate on your infrastructure. You know, being in the cloud, like you said, is basically almost like writing code. It's super dynamic, you can iterate on it. Nothing's constant and fixed. It's not a hard-set asset that you purchase. If something's not working or needs to be improved, you can quickly pivot, so take advantage of that.
[32:13]
C
Yeah. And often cost is the canary in the coal mine if you see anything changing radically. I mean, it's often harder to see whether I used up all my memory on this instance before I failed over to the other one. Often it's just if you see true spikes in your cost that are not related to your business activity, or comparing to things which you intuitively think should be in the same ballpark. I remember, and this is a very old story, in the earlier days of AWS, or actually at Amazon retail, I think we ran 12 different search services for retail, whether it was for books or for other categories and things like that. And some of them were almost four or five times the cost of the others. Well, it turned out the ones that were more costly were actually running on 32 bits, and just moving them to 64 opened up enough memory to reduce the number of instances and all these kinds of things. But until you have a benchmark that you can put this against, you're still flying blind.
[33:30]
B
I think it's also interesting too, you both sort of touched on the point here, which is it's not just about monitoring the cost, it's the cost in relation to the business activity. So if, if, if cost is going up and the number of transactions users, what have you is going up as well.
[33:49]
A
Yes, that's a happy problem.
[33:51]
B
As designed. As designed. Because this is looking at the whole balance sheet. Whereas if cost is not tracking to growth or not on that same trajectory, then you'd ask questions. Again, you can't just assume, but it's a trigger to look at things. But it shows you've got to be cognizant of what the business is doing. This is the whole thing: IT functions that tend to be not viewed well or not executed well are typically far removed from the business stakeholders, whereas IT is the business. We're all the same company doing the same stuff. We should be deeply aware of what's going on in our marketplace.
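The point that cost should be read against business activity can be expressed as a unit-cost check: rising spend is "as designed" when cost per transaction holds steady, and a trigger to ask questions when it does not. The sketch below is a minimal version of that idea; the 10% tolerance and the sample figures are arbitrary assumptions.

```python
def unit_cost(cost, activity):
    """Cost per unit of business activity (transactions, users, requests)."""
    if activity <= 0:
        raise ValueError("activity must be positive")
    return cost / activity


def needs_review(previous, current, tolerance=0.10):
    """Flag a period whose unit cost rose by more than `tolerance`.

    previous, current: (cost, activity) tuples for two comparable periods.
    """
    before = unit_cost(*previous)
    after = unit_cost(*current)
    return (after - before) / before > tolerance


if __name__ == "__main__":
    # Spend doubled, but so did transactions: the happy problem.
    print(needs_review((1000.0, 10_000), (2000.0, 20_000)))
    # Spend rose 50% on flat activity: worth asking why.
    print(needs_review((1000.0, 10_000), (1500.0, 10_000)))
```

The flag is only a prompt to investigate, in line with the caution in the conversation that a divergence is a trigger to look rather than proof of a problem.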
[34:27]
A
I definitely encourage people to be curious and ask why. Like, you know, we're talking about, you see this spike: why is there an extra instance running here? Why did this spike happen? Or, you know, you have kind of a ballpark of what you think a server should cost or how much spend on S3 there should be in a given account. You're like, why is that four times more than I think? Ask why and dig into it a little bit, and you often can find that answer and understand, oh, that's justified because of this, and now you've learned something more about that service and why it should be that way. Or you're like, oh, somebody just misconfigured that. Or maybe they don't have a retention policy and we've been accumulating, you know, log data for 10 years that we don't ever need, and we can just put a lifecycle policy in to delete it. Right? But if nobody asks the question why, you won't catch that.
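For the log-retention case mentioned here, an S3 lifecycle rule that expires objects after a fixed number of days is the usual mechanism. The sketch below only builds the configuration document; the rule ID, the `logs/` prefix, and the 90-day window are assumptions for illustration, and nothing is sent to AWS.

```python
import json


def expiration_rule(rule_id, prefix, days):
    """Build one lifecycle rule that deletes objects under `prefix` after `days`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical values; pick a retention window that matches your own policy.
lifecycle = {"Rules": [expiration_rule("expire-old-logs", "logs/", 90)]}

if __name__ == "__main__":
    # This dict has the shape that boto3's put_bucket_lifecycle_configuration
    # accepts as its LifecycleConfiguration argument; the call is omitted here.
    print(json.dumps(lifecycle, indent=2))
```

Expiration is irreversible once it runs, so the retention window is a business decision as much as a cost one.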
[35:08]
B
So yeah.
[35:09]
C
Or, one of the first podcasts we did on this topic was with WeTransfer, and someone in the early days had put in a retention period: after someone had called delete, we still keep it around for seven days, just to be sure. Yeah, but it was hard-coded in the code and nobody ever challenged why that was there, you know, and just moving it to two days, no customer ever complained, but, you know, it saved them, what, about 20% of their storage cost?
[35:38]
A
Yeah, I remember reading that.
[35:41]
C
Yeah.
[35:42]
B
Now when we talk about cost and management of things, often the immediate reaction is a sort of a chilling effect on people thinking, oh, I can't do anything, I can't make a change. You know, you're stifling me. You're ruining my creativity, dude. But you have a value at Zillow called Think big, Move fast, which seems to be completely counter to what we're talking about. So how do we marry the concept of these things together?
[36:09]
A
We definitely encourage people to experiment and get features out there and, like, you know, think big and move fast, as one of our core values says. We also want to empower our engineers and teams and believe they're going to do the right thing. And so it's that shared ownership in that. So it's like being aware of your spending. So we often have conversations when somebody's thinking about leveraging a new AWS service for a new product that we're perhaps doing, or especially now as we're getting into LLMs and generative AI. Nowadays there's a lot of spend for spinning up things on Bedrock or perhaps using Transcribe, and trying to understand what those costs are. So, you know, the conversations we have are about thinking about what that bill may be at the end of the day, based on how many requests or responses we're going to be sending through it, and understanding that, but also then, you know, mapping that back to what the business is trying to accomplish: is it a POC, to what scale, and how do we have the right guardrails on it so that it's not an open-ended checkbook, but that it's, you know, a value add to the company that we're willing to spend on. So we definitely don't, you know, put those constraints on something, like, you can't spend that. But we want people to be aware that there is a cost associated with it, and make sure they're observing it.
[37:17]
C
Well, then especially, I think, with Bedrock or with many different models, it's really good to realize what the costs are for each. I mean, the biggest heavyweight model, I think, of Claude is $15 per million tokens, where the smallest model is $0.15 per million tokens. Is the quality that much different for the particular task that you want to achieve? Of course the $15 one will give you better and more extensive results. But is that really what you need at that moment? So trading off quality versus cost there is important, I think.
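The trade-off can be made concrete with a back-of-the-envelope estimate. The sketch below uses the two per-million-token prices quoted in the conversation; the request volume and tokens per request are made-up inputs, and real pricing typically distinguishes input from output tokens, which this simplification ignores.

```python
def monthly_model_cost(requests_per_month, tokens_per_request, price_per_million_tokens):
    """Estimated monthly spend for one model at a flat per-token price."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    requests = 1_000_000       # hypothetical monthly request volume
    tokens = 2_000             # hypothetical tokens per request
    for label, price in [("large model", 15.00), ("small model", 0.15)]:
        cost = monthly_model_cost(requests, tokens, price)
        print(f"{label}: ${cost:,.2f} per month")
```

With these inputs the two prices work out to roughly $30,000 versus $300 a month, a hundredfold gap, which is why the question of whether the task needs the larger model is worth asking before scaling up.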
[37:55]
A
Yeah, it's having that context exactly of what you need and how you're going to use it. And maybe you experiment with Claude or something for the bigger one. But then how do you fine-tune? And so it's that iterative approach and reevaluating over time, like, does that work? And you know, especially with generative AI right now, that whole ecosystem is changing so fast. The decision that you maybe even made last week may be outdated tomorrow. So be willing to go back and reevaluate and think about that, but also don't get stuck in a dilemma of not being able to make decisions, like, oh, I have to constantly check in, is this the best engine? You also want to produce and get your features out there.
[38:34]
C
Coming back to your think big and move fast principle, I think you were also helped there, of course, by your guardrails, because guardrails are sort of a post check. What has happened, or what is happening, or how has this been built? And then being checked against the guardrails allows you to intervene if things are truly getting out of hand, for sure.
[38:59]
A
Yeah. And our guardrail system, it runs hourly, so it's, you know, depending on what it is, it can catch things pretty quick. Some tickets or rules that we have may take, you know, a few days based on kind of our policy and how they evaluate what's changed over time. But it is one of those safety valves that we're able to kind of like, oh, has something crossed it? What do we do?
[39:19]
B
So, so, Craig, as we, as we come to the end of the episode and we think about, I guess, the journey you've taken us through, how has your own mental model shifted in terms of observability and implementation and what would you recommend others do on their own career journey and how they should think about this. What's been useful to you that you want to share?
[39:40]
A
I'd say the biggest thing is making sure that you really democratize the data and get it into any and everybody's hands. Your cost data doesn't have to be a tight secret within your company; share it with all your engineering teams and make it so they can easily get into it. I would say with some of the tools that we used initially to look at the data, there's a bit of a steep learning curve for people to use those: how do you slice the data, find my data for my service? I may have had that skill set, or some other people, but you want to really keep that barrier to entry low, so people are able to understand their data and identify where they might be able to make changes and improve it. Obviously, you know, we built this guardrail system, which is able to generate various tickets and assign things out, so it's very clear what they need to do. But sometimes you don't have that and you're just like, hey, this is my spend, this is how I'm slicing it. How do you empower people to make those decisions so that you don't have a single bottleneck of a single engineer or somebody like myself? You really want everybody working on it and it being a shared ownership model.
[40:41]
B
I think it's very sensible advice and it holds true. Craig, thanks so much for sharing your journey with us. It's been really fascinating.
[40:49]
A
Yeah, thank you for taking the time. It's great meeting both of you.
[40:52]
C
No, absolutely great storytelling and I think everybody can learn a lot from this.
[40:58]
B
Absolutely.
[40:58]
A
Thanks very much.
[40:59]
B
Yet another great story to share. This is certainly a series I think, that a lot of folks are learning from, which is great. Lots of frugality to be understood.
[41:10]
C
We still have the most creative jobs, and I think constraints can breed creativity. I mean, it forces us, our human brains, to live. I mean, AI can't fix this. This is something that only we as humans can do. It's sort of where you're being pushed into a corner, but you always find a way to box yourself out of it again.
[41:33]
B
For sure. It's a beautiful thing. Of course we do. Speaking of observability, we do love to get your feedback. AWspodcast.com is the place to do it. And until next time, keep on building.