Summary

The Pragmatic Engineer – “What is a Principal Engineer at Amazon?” with Steve Huynh

Host: Gergely Orosz
Guest: Steve Huynh (17-year Amazon veteran, former Principal Engineer)
Date: July 9, 2025

Episode Overview

This episode provides a deep dive into the principal engineer role at Amazon, one of the toughest and most unique engineering positions in Big Tech. Steve Huynh, who rose from support engineer to principal over 17 years at Amazon, shares inside stories about Amazon’s scale, culture, technical challenges, and what it means to become (and thrive as) a principal engineer. The conversation is especially relevant for software engineers and leaders aiming to understand technical career ladders, large-scale engineering, and the reality behind Amazon’s internal reputation.

Key Discussion Points & Insights

Steve Huynh’s Amazon Journey

Career arc: Started as a support engineer, transitioned into software development, worked on projects including Search Inside the Book, Kindle’s launch, Prime Video precursor, Amazon Local, Restaurants, Tickets, and live sports streaming on Prime Video. (01:28)
Tenure: 17.5 years at Amazon, left about a year before the recording. (01:16)

Internal Mobility & Company Structure

Movement between teams:
- Early on, movement was restricted; VP/Director could block transfers, causing high attrition on “bad” teams.
- Policy changed to allow “freedom of movement”; as long as not on a performance plan, engineers could transfer more freely. (04:40–07:30)
Internal hiring preference:
- Internal hires are lower risk, familiar with company culture/process. (07:57)
- “Most of our hires have been internal... It’s a low risk hire.” (07:58)

Scale & Engineering Challenges at Amazon

Massive scale examples:
- Each request to Prime Video’s gateway page or the retail homepage fans out into “hundreds” of internal service calls, so individual microservices receive “tens or hundreds of thousands of requests per second.” (10:01–11:36)
- Minor changes can inadvertently DDoS internal services.
  - “If you change a caching configuration... you’ve just browned out a critical service.” (11:36)
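To make the caching point concrete, here is a rough back-of-the-envelope sketch. The traffic figure, fan-out, and hit rates are illustrative assumptions, not numbers from the episode, and `downstream_rps` is a hypothetical helper:

```python
def downstream_rps(page_rps: float, fan_out: int, cache_hit_rate: float) -> float:
    """Requests per second that reach backing services after caching.

    Only cache misses travel downstream, so load scales with (1 - hit rate).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return page_rps * fan_out * (1.0 - cache_hit_rate)


# Hypothetical numbers, chosen only for illustration.
before = downstream_rps(page_rps=1_000, fan_out=100, cache_hit_rate=0.95)
after = downstream_rps(page_rps=1_000, fan_out=100, cache_hit_rate=0.80)

print(f"before: {before:,.0f} rps, after: {after:,.0f} rps")
print(f"load multiplier: {after / before:.1f}x")
```

Under these assumptions, dropping the hit rate from 95% to 80% quadruples downstream traffic even though user traffic is unchanged, which is how a small configuration change can brown out a dependency.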
Brownouts vs. blackouts:
- Brownout: Service is reachable but only partially functioning; timeouts, partial/bad results, random 500 errors. (11:57)
- Recovery requires managing load after a dependency returns to avoid repeated outages. (14:42)
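The episode does not say which mechanism Amazon uses to protect a recovering dependency. One widely used general technique is retrying with exponential backoff and random jitter, sketched below. The function name `backoff_delays` and the default `base` and `cap` values are assumptions made for illustration:

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized wait (in seconds) per retry attempt.

    Uses "full jitter": each wait is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed together
    do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Seeded generator so the example is reproducible from run to run.
for attempt, delay in enumerate(backoff_delays(5, rng=random.Random(42))):
    print(f"retry {attempt}: wait {delay:.3f}s")
```

Spreading retries out in time gives the dependency room to warm up instead of receiving the full backlog at once. Real systems often combine this with load shedding or circuit breakers, which this sketch leaves out.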

“You own that piece of software... you cannot write the software, hand it over to the testing team and then throw it over to the SRE team after you’re done.”
— Steve Huynh (16:06)

Performance, Latency, and Amazon’s Monolith Origins (17:36–26:35)

Latency directly impacts revenue:
- Amazon invested in logs/telemetry; discovered faster page loads directly correlated with increased gross revenue.
- “If you’re faster, you just make more money. It’s a pretty clear correlation. I think you would even go as far as to say it’s causation.” (17:36)
- Led to a culture of “why not 1ms?” performance targets.
Monolith evolution:
- Amazon started as a single huge C-based monolith (“vertical scaling”).
- Hit the 4 GB size limit for a binary on its 32-bit architecture, then moved to a service-oriented/microservices architecture.
Microservices tradeoffs:
- Microservices enable team autonomy and scalability but add complexity and latency.
- Ongoing challenge: Optimizing blocking calls, reducing dependencies, and gracefully degrading under load.

“In a world where you have to... the best performance that you can actually get is always going to be bounded by the number of web requests that you end up making.”
— Steve Huynh (21:43)
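A simplified way to see why the number of blocking calls bounds performance: calls made one after another add their latencies, while independent calls issued concurrently cost roughly as much as the slowest one. The sketch below is a toy model with made-up latencies. It ignores queuing, retries, and serialization overhead, and the function names are hypothetical:

```python
from typing import Sequence


def sequential_latency_ms(call_latencies_ms: Sequence[float]) -> float:
    """Blocking calls made one after another: latencies add up."""
    return sum(call_latencies_ms)


def parallel_latency_ms(call_latencies_ms: Sequence[float]) -> float:
    """Independent calls issued concurrently: the slowest one dominates."""
    return max(call_latencies_ms, default=0.0)


# Hypothetical per-call latencies for three downstream services.
calls = [12.0, 30.0, 8.0]

print(sequential_latency_ms(calls))  # 50.0
print(parallel_latency_ms(calls))    # 30.0
```

In this model the page can never be faster than its longest chain of dependent calls, so removing or parallelizing blocking calls is the main lever.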

Advice for startups:
- Start as a monolith, break up only when it becomes unwieldy with developer headcount. (26:09)

The Principal Engineer Role at Amazon

Promotion Structure & Career Path

Career ladder:
- Junior → Mid → Senior (L6) → Principal (L7), with no “staff” level to bridge the gap.
Hardest promotion:
- “You have to do like 2½ [levels]” to jump from Senior to Principal.
- External brain drain occurs because many strong seniors leave for companies with more sane progression (Meta, etc.). (27:16–30:17)

“Principal is L7... at Amazon, that jump is so big because there’s no staff level in between.”
— Steve Huynh (27:40)

The Principal Engineering Community

Community features:
- Tight-knit, highly curated, based on “overly high standard.”
- In-person offsites (pre-pandemic), active principal Slack, presentations (“Principles of Amazon” series, internal and recorded for 20 years).
- “Everyone that was able to achieve that... there’s something exceptional about them.” (30:52–33:59)
Notable quote:
- “You could just scoop out five people and then put them into a room and the conversation is just... amazing, right?” (32:09)

Knowledge Sharing & Postmortems (Correction of Errors – COEs)

Blameless, open culture internally, hundreds of detailed internal COEs.
Learning from past outages is “part of the secret sauce.”
“You have this stream of disasters... and you just... learn so much from that.” (37:23)

Common Paradoxes & Realities of the Role

Bhavik Kothari’s (current principal) list of challenges:

Paradox of Belonging
- Part of all teams, yet of none; act as floating advisor, not embedded. (39:08–41:56)
Paradox of Freedom & Responsibility
- Given total autonomy (“assigned a direction, not a problem”), but accountable for enormous impact.
- “My manager was a VP... and he didn’t assign me work. He just set a direction.” (42:11–44:25)
- Principal can solve needs via code, architecture, process, or buying software—total menu of tools.
Bandwidth & Presence
- Overbooked with meetings; “My day looked like most people’s week... It looked like... a Tetris factory blew up.” (46:59)
- Must ruthlessly prioritize, cut noise, learn to say no; otherwise, burnout inevitable.
- “If I just went to all the meetings... I’d literally have no time to do the work.” (47:22)
Breadth & Impostor Syndrome
- Expected to be expert on everything—tech, AI/LLMs, policies, etc.—but risk assuming more expertise than reality warrants.
- “There’s this trap... you speak as an authority, even though you haven’t had the requisite time to ramp up on something.” (54:51)
(Bonus) Performance Reviews
- Principals pulled into calibration/performance reviews for large orgs, similar to managers but without direct reports. (50:46)

Amazon’s Engineering & Corporate Culture

Leadership Principles & “Secret Sauce”

Principled thinking > the content of the principles:
- “The meta-level thing is... these guys have principles that they won’t budge on.”
- “What does it actually mean to be principled and not bend when it could be really easy to do so? That’s the secret sauce.” (55:14–58:38)
Core principles felt most: Customer Obsession, Bias for Action, Ownership.
- “We’ll just burn money to delight a customer.” (56:37)
- “Just get stuff done; stop asking for permission.” (56:40)
- “You own your software; you do the operations, you own the bug count.” (56:42)
Writing culture:
- The six-page memo tradition (“six-pagers,” PR/FAQs) frames business and tech proposals.
- Study-hall meetings to read docs together, then discussion.
- “I spent on the order of one to four hours every day reading, while I was a principal engineer.” (59:06)
- Culture enables rapid onboarding and deep institutional knowledge.

Patents & Technical Achievements

Patent system:
- Principals often hand their key designs/writings to lawyers, leading to many software patents.
- Story of building one of the world’s fastest ticket sale platforms at Amazon Tickets by leveraging CPU cache and bit manipulation—real-world systems/applications of computer science. (61:25–66:37)

Notable Quotes & Memorable Moments

On Principal Engineer Impact:

“You’re assigned not a problem, not even a problem space. You’re assigned a direction.”
— Steve Huynh (44:25)
On Unfairness of Promotion:

“Some of the best engineers that I’d ever worked with were having such problems getting to principal engineer that they ended up moving... to other places where the progression was just sane.”
—Steve Huynh (28:46)
On Leadership Principles:

“What does it actually mean to be principled and to not bend when it could be really easy to do so. So that’s an amazing secret sauce of Amazon’s... It’s principled thinking.”
—Steve Huynh (58:38)
On Learning and Meta-Skills:

“How can I quickly learn skills that makes you... recession proof?... It’s essentially meta learning.”
—Steve Huynh (68:02)

Patent War Story

Ticket sales optimization:
- “What if you loaded all of that inventory into L2 cache on a CPU?... do bit manipulation to really quickly get contiguous seats.”
  (65:44–66:37)
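The source does not describe the actual Amazon Tickets implementation, so the following is only a sketch of the general bit-manipulation idea. It assumes one row of seats is stored as an integer bitmask in which bit `i` is 1 when seat `i` is free. The function names and the sample row are hypothetical, and Python integers stand in for the cache-resident machine words mentioned in the quote:

```python
def find_contiguous_seats(free_mask: int, k: int) -> int:
    """Return the lowest seat index that starts a run of k free seats, or -1.

    Bit i of free_mask is 1 when seat i is free.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    runs = free_mask
    for _ in range(k - 1):
        # Bit i stays set only if bit i+1 was also set, so each pass
        # extends the required run of free seats by one.
        runs &= runs >> 1
    if runs == 0:
        return -1
    # Isolate the lowest set bit and convert it to an index.
    return (runs & -runs).bit_length() - 1


def reserve(free_mask: int, start: int, k: int) -> int:
    """Clear k bits starting at `start`, marking those seats as taken."""
    return free_mask & ~(((1 << k) - 1) << start)


row = 0b0111_0110  # seats 1, 2, 4, 5, 6 are free
start = find_contiguous_seats(row, 3)
print(start)                        # 4
print(bin(reserve(row, start, 3)))  # 0b110
```

The search takes `k - 1` shift-and-AND steps per row regardless of row width, which is why this style of lookup suits data small enough to stay in CPU cache.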

Timestamps for Key Segments

| Timestamp | Topic |
|-----------|-------|
| 00:00 | Episode opening & approach to performance at Amazon |
| 01:16 | Steve’s Amazon tenure and high-level job history |
| 04:40 | Internal transfers, policy change, freedom of movement |
| 10:00 | The scale of microservices and personalization |
| 11:36 | Brownouts and system-wide consequences |
| 17:36 | Latency’s effect on revenue and roots of Amazon’s architecture |
| 26:09 | Monolith vs Microservices tradeoffs, advice for startups |
| 27:16 | The uniquely tough jump to principal engineer at Amazon |
| 30:52 | The principal engineering community & professional network |
| 37:23 | Internal “correction of errors” (COE) and learning culture |
| 39:08 | Principal engineer paradoxes (belonging, accountability) |
| 44:25 | Autonomy and expectation of resounding impact |
| 46:59 | Bandwidth challenge and overbooked schedules |
| 54:51 | Breadth, authority, and humility (LLMs & tech trends) |
| 55:14 | Amazon’s leadership principles and “principled thinking” |
| 59:06 | The writing and reading culture (six-pager memos) |
| 61:25 | Patents, defensive IP, and the Amazon Tickets system story |
| 68:02 | Steve’s top career advice: meta-learning |
| 69:33 | Favorite programming languages (Perl, Rust, Java) |
| 71:19 | Recommended reading: Cal Newport’s “So Good They Can’t Ignore You”; DDIA, etc. |

Recommendations & Resources

Books:
- So Good They Can’t Ignore You by Cal Newport (career capital & skill-building)
- Designing Data-Intensive Applications by Martin Kleppmann (DDIA)
- AI Engineering by Chip Huyen (cutting-edge technical reference)
Steve Huynh online:
- YouTube channel and newsletter (links in show notes)
For deeper company engineering insights:
- Subscribe to The Pragmatic Engineer newsletter.

Takeaways

Amazon’s principal engineer role is hard to get, high in status and impact, but comes with paradoxes—high autonomy, expectation of large impact, and persistent bandwidth/focus stress.
Internal culture thrives on principled decision-making, writing and sharing knowledge, and blameless learning from failure (COEs).
Breadth of expertise and comfort with ambiguity are essential to thrive; mentorship, networking, and resilient systems thinking are paramount.
Amazon’s technical architecture and org structure have been shaped by scale-driven needs, and despite debates, starting as a monolith still makes sense for most startups.
Meta-learning—building the capacity to swiftly acquire new skills—trumps learning any one language or toolset, and is the best defense against career stagnation.

This summary reflects the conversational, transparent, and sometimes self-deprecating tone of both the host and Steve Huynh, aiming to inform and inspire ambitious engineers and tech leaders about the true nature of technical leadership at scale.


Transcript

[00:00]
Steve Huynh
If you're going to optimize for performance saying, why can't we be at 1 millisecond? Or why can't we be at 10 milliseconds and start from there? Instead of sort of saying, hey, let's try to decrease latencies by 50% or 25%, let's just start from what is the conceptually fastest thing that we could do. And that's actually how Amazon was created.
[00:17]
Podcast Host (Pragmatic Engineer)
Amazon's principal engineering level is unique in many ways across Big Tech. Steve Huynh was a software engineer at Amazon for 17 years and worked the last four years as a principal engineer. Today we talk about the ins and outs of this role: why being promoted from senior to principal is so hard, even though Amazon usually has hundreds of principal engineering openings and thousands of seniors trying to get into these positions; the Amazon principal engineering community, the in-person events, the Slack group, and the Principles of Amazon internal presentation series; engineering concepts at Amazon around reliability, such as brownouts and COE (correction of errors); and many more topics. If you're interested in understanding one of the hardest engineering levels to get into across Big Tech, together with stories of how Steve thrived in this position, this episode is for you. Subscribing on YouTube and on your favorite podcast player greatly helps more people discover this show. If you enjoy it, thanks for doing so. So Steve, welcome to the podcast.
[01:13]
Steve Huynh
Thanks for having me.
[01:14]
Podcast Co-host / Interviewer
How long were you at Amazon? 17 years.
[01:17]
Steve Huynh
Yeah, I was there for 17 and a half years and yeah, I just quit last year. So I've been basically a year doing other things now.
[01:26]
Podcast Co-host / Interviewer
And what were the things that you worked on while you were there?
[01:29]
Steve Huynh
You know, people always talk about my long tenure there, but you know, I feel like I've had like five or six jobs over that time period. I started off on, you know, a project called Search Inside the Book. I worked on the first Kindle launch. Wow. I worked on the precursor to Prime Video. I sort of like worked there at the beginning part of my career and then I sort of ended my career there for the last five years of my time there. I worked in payments. I worked in Amazon Local, which was sort of our Groupon project when that type of business was looking like it was going to take over. I worked on Amazon Restaurants, I worked on Amazon Tickets, which is our Ticketmaster clone, and then my last five years was working on live sports streaming on Prime Video.
[02:19]
Podcast Host (Pragmatic Engineer)
If you want to build a great product, you have to ship quickly. But how do you know what works? More importantly, how do you avoid shipping things that don't work? The answer: Statsig. Statsig is a unified platform for flags, analytics, experiments and more, combining five products into a single platform with a unified set of data. Here's how it works. First, Statsig helps you ship a feature via feature flag or config. Then it measures how it's working, from alerts and errors to replays of people using that feature to measurement of top-line impact. Then you get your analytics, user account metrics and dashboards to track your progress over time, all linked to the stuff you ship. Even better, Statsig is incredibly affordable, with a super generous free tier, a startup program with $50,000 of free credits, and custom plans to help you consolidate your existing spend on flags, analytics, or A/B testing tools. To get started, go to statsig.com/pragmatic. That is S-T-A-T-S-I-G dot com slash pragmatic. Happy building.
[03:21]
Podcast Host (Pragmatic Engineer)
This episode is brought to you by Graphite, the developer productivity platform that helps developers create, review and merge smaller code changes, stay unblocked and ship faster. Code review is a huge time sink for engineering teams. Most developers spend about a day per week or more reviewing code or blocked waiting for a review. It doesn't have to be this way. Graphite brings stacked pull requests, the workflow at the heart of the best-in-class internal code review tools at companies like Meta and Google, to every software company on GitHub. Graphite also leverages high-signal, codebase-aware AI to give developers immediate, actionable feedback on their pull requests, allowing teams to cut down on review cycles. Tens of thousands of developers at top companies like Asana, Ramp, Tecton and Vercel rely on Graphite every day. Start stacking with Graphite today for free and reduce your time to merge from days to hours. Get started at gt.dev/pragmatic. That is G for Graphite, T for technology, dot dev slash pragmatic.
[04:20]
Podcast Co-host / Interviewer
So that's a lot of different teams. How did you work on so many teams? Is it just that there are a lot of internal transfers? Did you get bored? Was it just that you followed your manager? How does it work inside Amazon? Because when people think about companies, people who have not worked at Amazon would kind of assume you go, you work there, you're on a team for, you know, four, five, six years. Clearly not the case.
[04:40]
Steve Huynh
You know, it depends a little bit on like corporate policy and then where you are with your career. I started as a support engineer, so sort of like operationally focused person. And then, you know, I was basically like, I want to be a software developer. And so, you know, I think getting into the company was pretty difficult, but once I was there, I sort of set that target and changed roles. And when I changed the role, you know, it was a natural time to move to another team. There's also some internal policy. So basically at Amazon, it used to be that you had to stay on a team for at least a year before you transferred. And if you wanted to transfer, like a senior manager or director or whoever up top could block your transfer. And what that ended up meaning was that like certain teams that were just terrible to work on, those teams actually had more than 100% attrition over the course of a year. Because you measured attrition with a year-long time unit. Amazon did something actually smart at the corporate level. They basically said, okay, well, you have freedom of movement now. This sort of happened, I don't know, probably like 13 years ago, 10, 13 years ago. And so they said, you have freedom of movement now. A VP or a director can't block you. They can say, okay, well we need another month to get like a transition plan going. But essentially you have freedom of movement as long as you're not on a performance improvement plan. Which meant that certain teams were sources of high quality engineering talent and certain teams were sinks of high quality engineering talent. And it sort of created an internal marketplace for different roles. Now what that ended up meaning was that certain teams, they basically didn't want you to know what the policy was. They wanted you to sort of think that you were kind of stuck. But despite that sort of like local gamesmanship that was going.
[06:41]
Podcast Co-host / Interviewer
Yeah, like basically some managers didn't want.
[06:43]
Steve Huynh
Their best people to leave.
[06:43]
Podcast Co-host / Interviewer
Right, exactly. Let's just say it how it is.
[06:45]
Steve Huynh
But ultimately, I think it's a great strategy because if there was a team that was difficult to staff, the problem was on the management. It wasn't something that had to be, you know, borne by the employee themselves. And so getting back to my own career journey at a very large company like Amazon, there are so many awesome things that are going on and I decided to just kind of go where my curiosity took me. Now, there were some times where there were reorgs or a line of business got spun down, but ultimately I think freedom of movement was one of the smartest things that Amazon did.
[07:31]
Podcast Co-host / Interviewer
And I think this is something that people don't really appreciate about some large companies. You know, not all companies are like Amazon and every company changes, Right. Like today, I'm assuming it will be hard to move as many teams within Amazon depending on where you are. You know, if you're in a satellite office where there's two teams, you can probably move on to the other team at max. But I think this is one of the underrated things of large companies. Like once you are in, it's almost always easier to get that job at another team from the inside.
[07:58]
Steve Huynh
Yes.
[07:58]
Podcast Co-host / Interviewer
Especially because you can talk to them. You know, I talked with the Reddit mobile team and I asked like, oh, how can you become a platform engineer on the mobile team? And they said like, well, you know, most of our hires have been internal. They just helped us out on hackathons. They come around, they commit stuff, we know them. It's a low risk hire. I think it's just nice to remember that when you think of like a big company like Amazon or Meta or Microsoft, it's just so many small teams and once you're in, you actually have almost priority access to those teams if you play your cards right.
[08:28]
Steve Huynh
Absolutely. And you know, you might interview for that team, but it's, it's such lower stakes than an external interview. And you know, just all things being equal, would you rather take somebody that's, you know, internal and knows the culture, they know how software is developed within a particular context, or somebody that's just as good but doesn't, you know, hasn't been onboarded and I think ultimately you're going to pick the person that's internal, all things being equal.
[08:55]
Podcast Co-host / Interviewer
Yeah, it's just kind of like business rationality for the most part. So one thing about Amazon and about large companies like Amazon is people talk about externally about the scale and it's hard to imagine. But can you give us a sense of the scale that you've seen or like some tough engineering challenges that you worked on that would have been just really hard to work at a smaller startup?
[09:14]
Steve Huynh
Yeah, I think that's the thing that you just, you will not see at most other places is the scale of things. I'll give you a couple of examples. So, you know, Prime is the exclusive club that everybody is a member of and in the US the shipping benefit is probably the most popular. But globally, Prime Video, it's the thing that people use the most with their subscription. And so if you think about our service oriented architecture and just loading up the app, the gateway page is the place where all of our requests come in. Right. And so it's just like Netflix, it's this infinite scroll of carousels.
[10:02]
Podcast Co-host / Interviewer
So the gateway page, is it the Amazon Prime landing page?
[10:05]
Steve Huynh
Yeah, it's the landing page there. And so you're like, okay, cool, if let's say 95, 99% of all of your requests are coming from that page and that page needs to be personalized and you have a service oriented architecture with a bunch of microservices, one request to that page turns into, let's just say hundreds of downstream requests to different services. It might even be more than that. It's actually kind of hard to count. Yeah.
[10:34]
Podcast Co-host / Interviewer
And is this page right? Like all the stuff flowing, all personalized stuff.
[10:38]
Steve Huynh
So that's the retail one, but I was talking about the Prime Video one.
[10:41]
Podcast Co-host / Interviewer
The Prime Video.
[10:42]
Steve Huynh
But essentially it's the same thing. Yeah. And so same thing for the retail website as well. And so if you have one request sort of spidering out into two orders of magnitude more requests internally, you start to see really, really large scale for these microservices. So a microservice will have like a reverse proxy or a load balancer in front of it and you are sort of unironically talking about things like tens of thousands of requests per second or hundreds of thousands of requests per second coming into your service.
[11:13]
Podcast Co-host / Interviewer
So like the services that are like behind, you know, like there's Prime, there's all the things loading, they're spidering out, like making, you know, to render that one recommendation, for example, for, I don't know, the video that you would like, it'll make a lot of requests to different services. And then so when you're operating a smaller service inside of Amazon, suddenly you're going to be hit with what you just said, 10k, 100k requests per second, that kind of scale.
[11:37]
Steve Huynh
Exactly. And you will essentially be DDoSing yourself. You're just like, okay, cool, let's change a caching configuration on some item details and turns out you've just browned out a critical service.
[11:56]
Podcast Co-host / Interviewer
What does brown out mean?
[11:58]
Steve Huynh
Oh, sorry, I'm using some jargon. So if you want to talk about availability, suppose you are DDoSing a service or sending a lot of requests over to them. You can, you know, you can just take them down. That would be like a blackout. Yeah. And so like you send a request and, oh, "you can't establish a connection" immediately comes back. But there's a type of outage where they brown out. So basically they're reachable, they might accept the connection, but you know, they'll essentially time out, or they might return partial results or bad results. Or the only thing that they do return is a, you know, 500 for some percentage or proportion of.
[12:37]
Podcast Co-host / Interviewer
After we waited a bunch of time for that. Yeah.
[12:39]
Steve Huynh
And so now we start talking about availability and resilience in the face of all of this DDoSing that you're doing to yourself. And so the thing on top of scale that is going to really complicate things is your dependency chain. Right. And so your service is a dependency of some of the process that's going on. It depends on maybe AWS, it may depend on another service. You know, suppose there's a failure for a primary dependency and that dependency comes back up, how do you make sure you don't just like inundate it with a bunch of requests as it's trying to recover? Yeah. And so you have all of these sort of like odd dynamics that occur. I used a brownout as something that is a perennial problem that we have, right. Where there's maybe a dependency on a base service like S3 or DynamoDB or whatever it is, there might be some increased latency that may cause a chain reaction of a dependency going down. And then one of these sort of middle tier services would brown out. So, you know, you're an owner of the services for your team and so then it's like, okay, what do we do in those situations? How do we know that they're browning out? What do we do in the face of a dependency outage? And then critically, if there is an outage and then the service comes back up, how do we make sure that we give it enough space so that it can breathe? So that as they're trying to recover from some sort of outage, we don't just take them down immediately again.
[14:18]
Podcast Co-host / Interviewer
And I guess for most of us who are not working right now on these services, these sound pretty cool in theory. But you're saying this was actually like, this is not theory. This actually was like, oh, this service is going down. We are literally having 100k requests per second and we're pushing that on to other three services with the same. Because we need to invoke three other services. One of them has browned out. What do we do now? How do we fix it?
[14:43]
Steve Huynh
Yeah, and I think for certain other large tech companies you can do best effort. Right. Which is basically like, hey, we're temporarily down. But you know, you have some sort of degraded service. That makes sense. But if you're on say a website that does purchases, now we're talking about transactions, or if you're in the Prime Video, like live video, streaming use case. Now we're talking about a football game that you're unable to see. And then when we recover, the game might be over. And so it's much higher stakes. And so I think the scale with transactional semantics. Right. That's actually the challenge that you're not going to see unless you sort of work for a payment processor or something like that.
[15:35]
Podcast Co-host / Interviewer
Yeah, I guess that real world pressure challenge, like you are losing money. I'm starting to understand why. I have noticed that startups love to hire from certain companies. Usually startups love to hire from other startups because it's a similar environment. From large tech companies, it's a bit of a... maybe I'm generalizing. Obviously this will not be true 100% of the time. But for example, hiring from Google, a lot of startups are not as happy because the people coming from Google are used to having this amazing team around them, internal tools. But most startups love hiring from Amazon. And I'm starting to get a sense of why this actually is.
[16:06]
Steve Huynh
Yeah, I think that's part of the culture. You know, you, you get hired as a software developer and they hand you a pager. And before phone apps and things like that, it was like this pager from the 90s. It's really great because you have to operate the software that you write. If you actually, you cannot write the software, hand it over to the testing team and then throw it over to the SRE team after you're done. You own that piece of software?
[16:37]
Podcast Co-host / Interviewer
Yeah, yeah. At every team. Right. One interesting thing that we talked about yesterday over dinner with Casey Muratori is you said something interesting on how Amazon measured how on their retail website, I think it was retail, maybe Amazon Prime, the lower the latency of something loading, like a page loading, like a purchase tag or a purchase button loading, the more revenue they got. And they started to measure and there was a linear correlation: the faster it was, the more people converted, and it seemed it had no end. And the question Casey asked is like, okay, if this is the case, what would stop Amazon? Because you have the best technologies in the world, you have AWS, you can build whatever you want to get the latency of the website down to let's say like 10 milliseconds or even 1 millisecond because if this goes up, you would maximize revenue. So can you tell me about how that thing, like this measurement, actually happened and why is Amazon's website still maybe not the fastest in the world even though it would generate so many more billions?
[17:36]
Steve Huynh
Right, yeah, well, there are a couple of questions embedded in there, but we'll start with the latency to gross revenue measurement. So essentially somebody way back when, because we invest in logs and telemetry, started tracking how much gross revenue we would make based off of the latency for detail pages, based off the latency of Gateway, based off of latency of the checkout pages, and noticed this dynamic where it's like if you're faster, you just make more money. It's a pretty clear correlation. I think you would even go as far as to say it's causation. And so there was this really big focus on latencies. I love the idea that if you're going to optimize for performance saying why can't we be at 1 millisecond or why can't we be at 10 milliseconds and start from there Instead of sort of saying like, hey, let's try to decrease latencies by 50% or 25%, let's just start from what is the conceptually fastest thing that we could do? And I think in a vacuum, the conceptually fastest thing that we could do is sort of like a monolith, which is how Amazon started, where you have a web server with all of your catalog information, so all of the items that are there and then transaction processing on the host. That would be the fastest way to run.
[19:01]
Podcast Co-host / Interviewer
And basically like a web request would be it opens the HTTP or HTTPS handshake, it hits the server. The server in an ideal world has everything cached or calculated, it sends it back. So the total latency would be the time for this request, the time to transfer that data, based on your Internet speed, and that's it. That is the absolute. You cannot be faster than that.
[19:21]
Steve Huynh
I don't think so. Maybe there's some exotic sort of thing that maybe you can do.
[19:24]
Podcast Co-host / Interviewer
Some exotic protocol that, I don't know, predicts the future and sends it with UDP. But yeah, this is your baseline.
[19:30]
Steve Huynh
I guess the optimal would be like zero click instead of like a one click checkout. Right? So we just send you stuff before you know you want it. That would be, I guess, the theoretical maximum. But, you know, if there's some sort of web request, right, so some HTTP request and then some sort of buy button, that would be the fastest. Right. And that's actually how Amazon was created. You know, the opposite of horizontal scaling is vertical scaling. We bought these big Sun boxes and, you know, we hacked up our own web server in C and, you know, to scale up, we bought bigger hardware. And then when that didn't work, you know, we bought like six of these big boxes and that ran Amazon. And we ran that way up until the early 2000s. And then we ran into a wall, which was that, you know, when you built the C binary, the binary could only be 4 gigabytes. And that was a hard limit based off of the 32-bit architecture that we were running on. We could not get above 4 gigabytes. And so these product managers would come and just be like, well, just make a change for me, right, to the devs. And then they would just be like, I don't think you understand that this is a hard constraint. And so we, the size of the.
[20:45]
Podcast Co-host / Interviewer
Code or the binary code, the compiled one, it was there. And you had so much business logic by then that it just filled out four gigabytes.
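To put a number on the 4-gigabyte wall described above, here is a minimal sketch of the arithmetic, assuming byte-addressable memory and a 32-bit pointer. This is the theoretical ceiling only; the usable amount on a real system is often lower.

```python
# A 32-bit pointer can name 2**32 distinct byte addresses.
POINTER_BITS = 32
GIB = 2 ** 30  # bytes in one gibibyte

addressable_bytes = 2 ** POINTER_BITS
limit_gib = addressable_bytes / GIB

print(f"{addressable_bytes:,} bytes = {limit_gib:.0f} GiB")
```

No amount of bigger hardware moves that ceiling while the pointer width stays the same, which is why it reads as a hard constraint rather than a tuning problem.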
[20:52]
Steve Huynh
Yeah, yeah. And you know, we had a distributed C build. So, you know, it would take many, many hours for it to compile and so we would distribute it across desktops. And it was this whole big thing. But we ran into that wall. And so what we decided to do, and I think this was super smart, was to lean into service-oriented architectures, right, and microservices. And when you break it down, a web service call is essentially a remote procedure call, right? So you have this execution pointer and then you're like, okay, well I need to do some computation or I need to gather some data. I'm going to in turn make an HTTP request downstream to another service. And then you can sort of chain those things together. And so getting back to the original thing about performance, in a world where you have thousands and thousands of developers building this stuff, and the fact that you cannot have a monolith as big as Amazon retail past something that's sort of like circa 2002 Amazon size, you have to lean into remote procedure calls. You have to say that there's a web service, and the best performance that you can actually get is always going to be bounded by the number of web requests that you end up making. Whether it's the first order calls to say, go get the item details, but then also any blocking call that happens downstream. By blocking call, we mean.
[22:16]
Podcast Co-host / Interviewer
Like you need to wait for this to finish to get your data. If it's a service that returns, I don't know, your top five most likely to buy things, it might need to make, let's say, five requests or just one request. It needs to wait for that before it can return.
[22:28]
Steve Huynh
Exactly. And you can do this telemetry stuff, you can do this observability stuff to figure out within that service call chain what the blocking call is. And you can get some, some, you know, some amount of visualization on it. And so then you can get down to the point where it's like, okay, if we're going to start from first principles, what's the least amount of latency that you can get for say like a web request or a checkout page call? You're going to run into like the absolute minimum. Right. And it's going to be based off of like, what are the required operations, you know, evaluation or transactions or whatever for that particular request. Yeah.
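The idea of a latency floor set by the blocking calls in a service call chain can be sketched as follows. This is a minimal illustration, not Amazon's tooling: the `Call` type, the service names, and every millisecond figure are hypothetical. It assumes sequential downstream calls add up, while calls issued in parallel cost only as much as the slowest one.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One service call in a hypothetical request tree."""
    name: str
    own_ms: float  # time spent inside this service itself
    # Downstream calls awaited one after another (each blocks the next).
    sequential: list["Call"] = field(default_factory=list)
    # Downstream calls issued together; the caller waits for the slowest.
    parallel: list["Call"] = field(default_factory=list)


def critical_path(call: Call) -> tuple[float, list[str]]:
    """Return (end-to-end latency in ms, names of calls on the blocking chain)."""
    total = call.own_ms
    path = [call.name]
    for child in call.sequential:
        ms, names = critical_path(child)
        total += ms
        path += names
    if call.parallel:
        # Only the slowest parallel branch contributes to latency.
        ms, names = max((critical_path(c) for c in call.parallel), key=lambda r: r[0])
        total += ms
        path += names
    return total, path


page = Call(
    "detail_page", 5,
    sequential=[Call("auth", 10)],
    parallel=[
        Call("item_details", 30, sequential=[Call("pricing", 20)]),
        Call("recommendations", 40),
        Call("reviews", 25),
    ],
)

latency_ms, blocking_chain = critical_path(page)
print(latency_ms, blocking_chain)
```

In this made-up tree the services do 130 ms of work in total, but the page takes 65 ms because only the chain through `auth`, `item_details`, and `pricing` blocks the response. Speeding up `reviews` would change nothing, which is the kind of conclusion the telemetry is meant to surface.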
[23:06]
Podcast Co-host / Interviewer
And then basically, so as I understand, like, as it became a microservice, like more microservices and services, this was great for maintainability. And also, well, you first just solved the issue of the monolith size. And as we know from history, of course now teams could be more autonomous, they're not as dependent, they could do the APIs, but it was a trade-off for latency. And now you had to go back and figure out the blocking calls, how to speed those up, how to do, I guess, trade-off things like caching. You can have things fast, but it might not be as correct on the first one, or like just tricky UI where you don't show the data just yet, but it's coming and the user has a sense of, like, progress, those kind of things.
[23:48]
Steve Huynh
And it also, I think, forces teams, and product, to really say, okay, like, what is the strictly necessary processing that happens on this page? Some of the work that I was doing before I left Prime Video was basically like, you have these really, really big, heavy gateway page, you know, or landing page requests. And you know, if you're in a situation with high load, can you preemptively reduce the amount of, say, personalization that's going on to sort of speed up that page or, you know, to increase the amount of throughput that you're able to have, so as to serve more customers? Can you do that in a smart way? Right. That sort of anticipates load that's coming onto that page. Say if there's a football game coming up or something like that.
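One way to picture preemptively reducing personalization ahead of anticipated load is a small policy function like the one below. It is only a sketch under stated assumptions: the three levels, the 60% and 90% thresholds, and the function name are hypothetical and are not how Prime Video does it.

```python
from enum import Enum


class Personalization(Enum):
    FULL = "full"        # all personalized rows and recommendations
    REDUCED = "reduced"  # fewer personalized rows, more cached content
    NONE = "none"        # fully cached, non-personalized page


def choose_personalization(current_rps: float,
                           expected_rps: float,
                           capacity_rps: float) -> Personalization:
    """Pick a level from the larger of current and anticipated load."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    # Planning against the forecast is what makes this preemptive:
    # the page gets lighter before the traffic arrives, not after.
    utilization = max(current_rps, expected_rps) / capacity_rps
    if utilization < 0.6:   # hypothetical threshold
        return Personalization.FULL
    if utilization < 0.9:   # hypothetical threshold
        return Personalization.REDUCED
    return Personalization.NONE


# Quiet right now, but a big game is forecast to bring 8,000 requests per second.
level = choose_personalization(current_rps=1_000, expected_rps=8_000, capacity_rps=10_000)
print(level)
```

The design choice worth noting is the `max` over current and expected load: a purely reactive policy would keep full personalization until the spike had already arrived.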
[24:36]
Podcast Co-host / Interviewer
Yeah, it sounds like these are just, they seem just hard to solve, but now you have to solve them. So it sounds like this kept you busy, and everyone else busy at Amazon, to this day. Right. Do you think this is an ongoing engineering challenge for Amazon? Because, you know, I would imagine the tricky thing here is like, okay, you can optimize whatever you have, you can find the critical paths, but Amazon keeps growing. Right? Like there's new teams, new services, new everything coming on. So this thing will change all the time. It's an ongoing puzzle to solve.
[25:09]
Steve Huynh
Yeah, absolutely. Yeah, I think they definitely have a ton of work in front of them. Also, it's part of their ethos to really launch new lines of businesses really quickly. And so the ability for a team to go from zero to launched product within the confines and the context of a large corporate entity, I think that's part of the DNA that's there. So as long as they're planting seeds, as the sort of internal terminology is, I think that software developers will be in demand for quite some time.
[25:43]
Podcast Co-host / Interviewer
Yeah, I guess it's a good reminder that every now and then we have the monoliths versus microservices debate that it sounds like kind of just makes sense for a startup to start with a monolith. Like you can always do what Amazon did and you have the benefits of latency. Everything is in one place. Like, I'm sure there might be reasons to start with microservices to start with, but if you're a small team, I mean, even today, I don't think that argument changes. Right. Like Amazon got really big wins by starting with a monolith back in the day.
[26:10]
Steve Huynh
Yeah, absolutely. I think it just makes a ton of sense to start with a monolith, wait till it breaks, and then the part where it breaks is when you have like 50 developers working on the same piece of code. Once that sort of breaking point occurs, then you start to try to figure out how you can sort of break things up. But starting with a microservice architecture, especially when you're small, what a waste of time and energy.
[26:35]
Podcast Co-host / Interviewer
Totally.
[26:36]
Podcast Host (Pragmatic Engineer)
So you were a principal engineer at Amazon, and apparently I learned that most companies have different levels. And again, this principal engineer, some companies have staff level, but it's usually entry level, mid level, senior, and then you have staff, or in the case of Amazon, it's principal. I've learned that Amazon's principal level is both really hard to get into compared to a lot of other companies, and it's pretty special in some ways. So we'll talk about that. But can you tell me, like, how is the career kind of development? Because most people imagine, like, oh, it should be pretty straightforward. I spent like, I don't know, two years as a junior, two years as a mid, roughly, and two years as senior. Then I get to principal. How does it actually work at Amazon?
[27:17]
Steve Huynh
I think it's linear up until you hit principal. Right. So you join, you're a junior developer, you get promoted to mid. At mid, you're starting to influence the team, but then you get to senior and so now your expected impact is at the team level. And then there's this jump that you get to principal.
[27:37]
Podcast Co-host / Interviewer
Principal is.
[27:38]
Steve Huynh
Senior is L6, principal is L7.
[27:40]
Podcast Co-host / Interviewer
L7, yes.
[27:41]
Steve Huynh
Yeah. And so I think you really have to start with why is that jump so big? Because I think at pretty much any other company it's just a linear progression. There's nothing necessarily special about staff. You can just sort of go to that level, senior staff and then principal. But for some reason Amazon decided that they weren't going to have a staff level and I think they sort of couched it around having high standards. Basically, to get from senior to principal, you have to do like 2 1/2.
[28:13]
Podcast Co-host / Interviewer
Level jump from L6 to L7. Technically it sounds like one level, but at some other companies this might be like L8, L9 or L8 and a half.
[28:23]
Steve Huynh
Yeah. And so the hand wavy argument is like, hey, we have high standards and it means something to get to that level. It's like, fine. But I noticed that some of the best engineers that I'd ever worked with were having such problems getting to principal engineer that they ended up moving to Facebook or to Meta or to all these other places where the progression was just sane.
[28:46]
Podcast Co-host / Interviewer
Now they're staff or senior staff.
[28:48]
Steve Huynh
Now they're senior staff and principal and distinguished engineer at other companies. And so because we had high standards, we actually had this brain drain. And it wasn't a brain drain at lower levels, it was that the brain drain at sort of like the higher levels. And it's just an example of something where it's just like, why did you do that to yourself? And so that's the context for being a principal at Amazon.
[29:13]
Podcast Co-host / Interviewer
It's safe to say it's wicked hard, I guess, internally. Right.
[29:16]
Steve Huynh
So I'm colleagues with Ethan Evans and so we talk about what's the hardest promotion at Amazon. And I had made the argument that it was senior engineer to principal. And he's like, yeah, that's hard. Actually the hardest one, Steve, is VP to senior VP because there's only eight spots or 10 spots for that and maybe 300 VPs that are all trying to get this. That's more of a supply and demand thing. I will say that at Amazon there is gigantic demand for principal engineers. And so there are roles that have been open for years. I think it takes something on the order of like 13 months or 17 months or something like that to get an external hire to join as a principal engineer. But that metric is only calculated when the role is filled. And so probably there are hundreds of principal engineer openings at Amazon and there are thousands of senior engineers who desperately.
[30:15]
Podcast Co-host / Interviewer
Want to get there that would love to be putting in the work.
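The point that a time-to-fill metric "is only calculated when the role is filled" can be made concrete with a small sketch. All numbers here are made up for illustration and are not Amazon data; the list of roles and the `NOW` month are hypothetical.

```python
from statistics import mean

# Each role: (month opened, month filled), with None for roles still open.
roles = [(0, 6), (0, 10), (2, 15), (1, None), (3, None)]
NOW = 36  # hypothetical current month

# The reported metric only sees roles that were eventually filled.
filled_durations = [filled - opened for opened, filled in roles if filled is not None]
reported_time_to_fill = mean(filled_durations)

# Roles still open have already been waiting at least this long.
open_ages = [NOW - opened for opened, filled in roles if filled is None]
lower_bound_with_open = mean(filled_durations + open_ages)

print(f"reported: {reported_time_to_fill:.1f} months")
print(f"including still-open roles (lower bound): {lower_bound_with_open:.1f} months")
```

Because the longest-open roles never enter the average until someone is hired, the reported figure understates how long these positions actually sit open.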
[30:18]
Steve Huynh
And so there's this sort of like, there's this tension, right? And I don't think you see that at the lower levels. I don't think that that's happening at senior or mid or junior. And so like that incongruity I think is super interesting.
[30:32]
Podcast Co-host / Interviewer
But once you do get to principal engineer, one thing that I've never heard any other company have is there is apparently a principal engineering community, which, I've heard again from other people, is tightly knit, it's actually special, it's actually just a really nice organization. Can you talk about that? So like, you know, once you got in there, somehow, I don't know, was it blood, sweat and tears?
[30:52]
Steve Huynh
That promotion. There is a community. I think it's actually really great. My own history: I went from support engineer to senior engineer in four years at Amazon. But then from senior to principal, it took me eight years. And I got promoted in Q1 of 2020. Turns out to be a consequential year for the industry, for the world.
[31:16]
Podcast Co-host / Interviewer
That was when forced remote work started.
[31:18]
Steve Huynh
And so I got promoted. And everybody's like, congratulations. They used to have like a principal engineer off-site where they just flew everybody into Seattle or nearby to sort of, you know, mingle and to talk to other folks. That stopped during the pandemic. And then by the time the pandemic restrictions started lifting, the population of principal engineers had essentially doubled. That's still to say, like there are still hundreds and hundreds of openings for principal engineer. But then the sort of off-site community shifted over to the senior principals, which I didn't have access to. But at the moment, the manifestation of the principal engineering community is essentially through the Slack channel, which is absolutely awesome. And then we had principal off-sites for our local organization. So like Amazon Music, Prime Video, Twitch, that sort of thing. Those meetups were amazing. And the reason they were is because of this high standard that Amazon had created. What it meant is that for everybody that was able to achieve that overly high standard, there's something exceptional about them. They're super deep in a particular technology or they were associated with the growth of a really large line of business either within Amazon or externally. They were essentially leaders within the industry. And you could literally just scoop out five people and put them into a room and the conversation is just amazing, right? And I would sort of be like, I don't even belong here. Like, look at this guy, you know, he wrote a book on a particular topic, and this guy, you know, he was a luminary in a particular field. And then this person is just an amazing code machine and can just write an entire application over a weekend, and then you're like, what am I doing here?
[33:20]
Podcast Co-host / Interviewer
You know, I do wonder if that community might be coming back. Now I know you've left, but Amazon is now back in person, because it sounds like a lot of the benefit was the in-person part as well. Because this is what I never heard, again, even before the pandemic, from other companies. For example, at Uber, I've heard that the senior staff engineers do get together every now and then, but it was very, like, grassroots. So it was bottom-up. But my understanding is that Amazon actually invested. Not just, you know, some principal engineers saying, hey, let's get together, but also just kind of, you know, making sure that that group really had something. I think it's smart, I think more companies should do it, but I'm just not seeing it.
[33:59]
Steve Huynh
The investment was also in terms of headcount. So there are program managers and product managers essentially that are bringing the folks together.
[34:13]
Podcast Co-host / Interviewer
Awesome.
[34:14]
Steve Huynh
There's a wonderful series, it's called the Principals of Amazon series, where principal engineers will just, they'll do a presentation and it's recorded. That's been happening for 20 years. And you know, we record everything that's there, but it takes work to actually.
[34:29]
Podcast Co-host / Interviewer
But that's an internal series that. And is that open to like everyone at Amazon or it's for the principals?
[34:36]
Steve Huynh
Oh, it's open for everybody at Amazon to consume.
[34:39]
Podcast Co-host / Interviewer
To consume.
[34:39]
Steve Huynh
And then, you know, there might be some senior engineers and stuff like that who would make a presentation that's part of their promotion packet. It would be being able to make an Amazon-wide presentation on a particular thing. My point was though that that stuff doesn't just happen on its own. Like you need a program manager or multiple folks to sort of herd the cats and to schedule the off-sites and to make sure that, you know, the Slack channel doesn't go off the rails, right? And that it's still useful. And it's just not going to happen, like, grassroots with just throwing a bunch of people into a room.
[35:14]
Podcast Host (Pragmatic Engineer)
This episode is brought to you by Augment Code. You're a professional software engineer. Vibes will not cut it. Augment Code is the AI assistant built for real engineering teams. It ingests your entire repo, millions of lines, tens of thousands of files, so every suggestion lands in context and keeps you in flow. With Augment's new remote agent, queue up parallel tasks like bug fixes, features and refactors, close your laptop and return to ready-for-review pull requests. Where other tools stall, Augment Code sprints. Augment Code never trains on or sells your code, so your team's intellectual property stays yours, and you don't have to switch tooling. Keep using VS Code, JetBrains, Android Studio or even Vim. Don't hire an AI for vibes. Get the agent that knows you and your code base best. Start your 14-day free trial at augmentcode.com/pragmatic.
[36:04]
Podcast Co-host / Interviewer
I think these are the things, I mean we're now exposing a few of these things here and there, but with some of these companies, and Amazon is a great example, there's more than meets the eye. So once you're inside Amazon, for example, as an engineer, even if not a principal engineer, you now have access to the whole 20 years of principal presentations. Like when I joined Uber, I was amazed at how we had the RFCs available. Like I could read all the historic ones. So I think every company has its own. Of course once you're in there you have access to this, like, knowledge base, which will just never be published. It cannot because it has, you know, business sensitive things, etc. So I think as an engineer you can just really be a sponge when you join, especially at one of the companies that is known to be a bit more open internally. Amazon is a really interesting one because externally it's very closed, is my sense. They're very careful about what they share. For example, very few of the postmortems for AWS are published externally, but internally they're all there. As I understand, as an engineer there, you can access them, you can learn from them. Like, really cool, real-world learnings.
[37:08]
Steve Huynh
Absolutely. You know, it is an open place internally and we are so selective about what we, I say we as though I still work there, but what they publish externally. And, you know, the postmortems, we call them COEs.
[37:22]
Podcast Co-host / Interviewer
It's a COE, sounds right.
[37:24]
Steve Huynh
It's a correction of error. Yeah, it's, you know, it's this idea that, you know, you have, like, holes in Swiss cheese, and a failure requires that there's a hole across layers. That's the best reading. Like, I would just subscribe to the email list where they were published internally. So you have this, like, stream of disasters that are going on within the company and you just, you know, you grab some popcorn and you pop open one of these COEs, and you learn so much from that. And I think that that's part of the secret sauce. The idea, and I don't know if it's like this for 100% of them, is that it's a blameless culture sort of thing. And so to really screw up requires that multiple people drop the ball. Yeah. And you learn so much from that sort of stuff. You know, the brownouts, you know, these lessons that you would learn from, you know, trying to recover from really large dependencies. Those things are immortalized inside some of these COEs. So there's some very famous outages that happened within Amazon, and, you know, they were an egg on our face. And we really, really learned those lessons through those postmortems. They're absolutely wonderful.
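The Swiss cheese picture, where a failure needs a hole in every layer at once, can be sketched as a product of probabilities. This is a simplified illustration that assumes the layers fail independently, which real systems often do not; the layer names and miss rates are hypothetical.

```python
import math

# Hypothetical chance that each defensive layer misses a given bad change.
layer_miss_rates = {
    "code_review": 0.10,
    "automated_tests": 0.05,
    "canary_deploy": 0.20,
    "alarms": 0.10,
}

# An incident needs the holes to line up, so (assuming independence)
# the per-layer miss rates multiply.
p_incident = math.prod(layer_miss_rates.values())

# Removing one layer shows how much each slice contributes.
without_canary = math.prod(
    rate for name, rate in layer_miss_rates.items() if name != "canary_deploy"
)

print(f"all layers: {p_incident:.6f}")
print(f"no canary:  {without_canary:.6f}")
```

This also fits the blameless framing in the conversation: if every layer has to miss for an outage to happen, no single person's slip is the whole story.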
[38:35]
Podcast Co-host / Interviewer
As a principal engineer so far, we kind of glamorized the role, saying, you know, it is hard to get into, but once you're there, you have the community, you do this really impactful work. But one of the principal engineers at Amazon, who's still there, called Bhavik Kothari, he collected some things that are maybe not as glamorous or more challenging about principal engineering. He had five of these things, or five or six. I just want to go through with you and your take on them. So first he wrote, there is this paradox of belonging that you're part of all teams, yet you're part of none. What does that mean?
[39:08]
Steve Huynh
Yeah, no, so I. Bhavik was actually a peer of mine. We worked in Prime Video together. So he's an awesome dude. Yeah, there are all of these paradoxes and this paradox of belonging is a really interesting one. You work for the organization, you're working cross teams. So the senior engineer, you're embedded on a team and you own the team's architecture, the operations, the software development lifecycle and the design. But when you get to that next level where you're working across teams, you kind of operate in this weird layer where you're not on pager duty for a particular team, you have visibility across all of these teams that are there, you're helping to guide and make decisions, but you're literally not on the ground floor anymore. And so when you work with a particular team, you might call the senior engineers or the mid level engineers in and be like, hey, let's whiteboard some stuff. Let's try to figure out what's going on. You're not on the team, you're kind of this advisor that's sort of coming in, right? But then maybe a director or a VP would call you in and say, like, hey, what do I own? What's going on? Explain to me this outage, or tell me why we can't build this thing. And then you're trying to whiteboard the architecture and the system and you're trying to say like, hey, this is what's going on on the ground floor. But you weren't part of that team. So you're just sort of operating in this sort of strata where you don't really belong on a team. I'm an immigrant. I think you are as well. And my parents came from Asia. I'm not Asian, right. So when I go back to Asia, I'm definitely from the US and then growing up in this country is just like, I'm not quite an American. And so you sort of operate in this sort of area in the gaps where your identity is really defined by not being squarely in one of these predefined categories. So it's very similar to that. 
As a principal engineer, you're not on the ground floor, you're not checking in, you will check in code, but you're not necessarily part of that team, embedded on that.
[41:28]
Podcast Co-host / Interviewer
And even if you are, it's usually for a short time. And like tomorrow the director could call you up and say, like, hey, Steve, we need you on this other team. They're in trouble. Move over. Like, yeah.
[41:38]
Steve Huynh
And you parachute in. And then, you know, then they're like, oh, who's this guy? You know, and then your, your director is like, what's going on? What happened during this outage? Why is, you know, why is the, why is the press writing about us? And then you're like, well, you know, here's what's happening on the ground floor, but you're not really embedded on that team.
[41:57]
Podcast Co-host / Interviewer
Which leads us to the next paradox that Bhavik lists, which is freedom and responsibility. He writes that you enjoy significant autonomy in being able to choose what you work on. However, there's an implicit expectation and accountability for resounding impact.
[42:12]
Steve Huynh
Yeah, so I reported to a VP right before I left the company.
[42:19]
Podcast Co-host / Interviewer
So they were your manager, basically.
[42:20]
Steve Huynh
Yeah, my manager was a VP.
[42:22]
Podcast Co-host / Interviewer
Oh wow, that's. I don't hear many companies having engineers report into VPs. Yeah, that doesn't seem very standard, you know.
[42:32]
Steve Huynh
And so the org that he owned, you know, I considered myself the tech advisor for that organization, was about 450 people, 450 software developers. And what did our one on ones consist of? Right? Like when I would have our one on one, it wasn't like, you know, he didn't assign me work. He wasn't like, hey, I need you to build this thing, I need you to design this thing. The context that he set was basically like, here's a direction, right, that you need to go. And the way to achieve that type of impact was up to me, right? So he might say something like, hey, availability is so important for live sports. We just signed billion dollar contracts with these sports leagues and so we need to increase our availability posture. And then I would be like, okay. And then I would go away and I would come back and I would be like, here's what I'm working on, right? That type of dynamic, where you're basically telling your boss what's happening, does not exist at the senior engineer level and below.
[43:42]
Podcast Co-host / Interviewer
I was about to say that when you said my, my manager one on ones, he didn't tell me what to do. I'm like, most engineers would be like, sign me up. Like, I don't want, you know, we all hate micromanagement. But now when you're telling me like he would say like, oh, so we just signed a billion dollar contract, availability is important. And then stops talking. I'm like, that sounds uncomfortable. And basically like you're kind of expected a little bit to like understand what he's expecting even though he doesn't know. And then, and I'm assuming, you know, there's two ways of going, right? You go back on the next one on one and you say something and he was like, like Steve, like you're a principal engineer, this is not what I expect of you and you don't want that. Whereas this, you know, if you bring back the right things, it sounds like you really need to up level in like understanding how like these people think.
[44:25]
Steve Huynh
Absolutely. And so he's, you know, he's accountable to his boss as well. And you know, don't get me wrong, I didn't, you know, I had a, I owned aspects of availability. You know, there's a multi thousand person organization at Prime Video doing this stuff. But we own the live sports aspect of this. And you know, there are playback teams, there are, you know, recommendation teams, there are, you know, there's so many different teams that are there that had to really step up and make sure that availability was good. But he would say something like, hey, you know, what is our availability posture for certain aspects? And I would have to go and figure it out, like, what are we measuring, what are we not measuring? There's a deadline for, you know, the start of a season where we're expecting, you know, millions and millions of concurrent to come in. What can we do between now and then? Right. And then if we do write some software, like what, what is the highest leverage piece of software that we could create that would increase our availability posture. And so the way that I, I sort of describe it to people is you are assigned not a problem, not even a problem space. You're assigned a direction. You can solve the problem with code, you can solve the problem with system design and architecture, but you could also solve the problem say by, you know, I don't know, hey, maybe there's some off the shelf software we should purchase. Yeah, maybe there's a dev team that we should start to spin up right now whose job it is to do this particular thing. Maybe we've identified a piece of software and it's already been scoped that this team needs to go and build, but it's not a priority for them. Now we need to go and figure out like, you know, how we can get them to do it. Can we shuffle around resources, that sort of thing. And so the way I describe it is like there's so many more things on the menu that you can use to solve the problem. And I don't think people recognize that. 
They think that it's just, oh, when you're a principal, like you just like code a lot and it's just really complicated or do more meetings, you know.
[46:24]
Podcast Co-host / Interviewer
That sort of happens.
[46:25]
Steve Huynh
I mean, at the end of the day, don't get me wrong. There's a ton of meetings that go on.
[46:28]
Podcast Co-host / Interviewer
Yeah, yeah. But this is. I think it's good to shine light because I also feel like once. It sounds like a big change, but I also kind of feel if you get good at this, you might not really want to go back to having a manager who's like, all right, here's a project we need to solve. Scope it out and which you can do. Right. That's cool. And now the next challenge that Bhavik said was, this all sounds great, but there's apparently bandwidth challenge. So it's. It's easy to become this, like, social resource where people just pull you into everything and you're breathing.
[47:00]
Steve Huynh
No. You know, I think I wish I had taken a screenshot, but, you know, I have my Outlook calendar. Right. So it's my schedule. My day looked like most people's week. It looked like somebody had just, like, blown up a Tetris factory. Like, I would be triple or quadruple booked on a Monday all through the day.
[47:19]
Podcast Co-host / Interviewer
So you would have the manager calendar as an ic. Yeah.
[47:23]
Steve Huynh
And it's absolutely crazy because for that large org that I was supporting, everybody just added me as optional. Or they might try to say, like, no, you're actually required for all of these meetings. But when you have a triple booked calendar and you're required for this stuff, you just learn that you're going to have to disappoint a lot of people. And so it's this sort of like, you know, this thing where it's like, it's almost easier to say no now that you're obscenely overbooked versus when you're a senior engineer. You're like, I don't have time to write code, but there's just barely enough time in between the cracks.
[47:59]
Podcast Co-host / Interviewer
Yeah.
[48:00]
Steve Huynh
And so I think that it's almost like when your schedule breaks, that's when you are finally freed because you know that you can sort of say no to stuff. But ultimately, if I just went to all of the meetings that everybody said that I would have to go to, I would be a professional meeting attender and I would literally have no time to do the work.
[48:17]
Podcast Co-host / Interviewer
And then Bhavik follows up on this next challenge, which is being truly present, and he writes, I think it's almost like, you know, he was sitting next to you. You find yourself physically present in one meeting while your mind is already racing against the next three.
[48:30]
Steve Huynh
You know, it's a really big challenge. You know, I pride myself on being a good communicator and being present. And when there are 20 things that are going on in the air or 100 things that are going on, it's just really, really difficult to stay single-threaded. And what I ended up having to do is to sort of say like, okay, I could do all of these things and they would be really impactful. But I just had to aggressively prioritize and say, you know, for the availability, I'm just looking at availability. There's all these other fires that are going on, which is disappointing because there's so many things that, you know, you could be focusing on. It's super difficult. And so I work with a lot of people to try to get them to the next level and they say, Steve, well, I'm completely overwhelmed. There are like 20 things that are going on. And I tell them, do you think it gets easier when you get higher level? There's just going to be more and more things on your plate. Why wait until you burn out or you break? You can just start implementing these things now. So every high-level tech IC I know, and managers included, they have a wonderful system in order to isolate signal and then cut out the noise. And if you don't have that, you literally won't survive. But at the principal level and above, it's just amplified that much more.
[49:49]
Podcast Co-host / Interviewer
I'm getting the sense that a lot of the work you do as a principal engineer, I mean, there's huge amounts of software engineering and you need to be just really good at building resilient systems, learning about new technologies. For example, today I'm assuming whoever's a principal engineer at Amazon, they're experts who just know everything about LLMs, trade-offs, characteristics, et cetera. But you also need to just do the skills that managers have, which is managing your time, changing context, figuring out how to get that focus time. Contrary to popular belief, managers actually need focus time. So I will always try to carve out some time. But you're now doing it while your title is not manager. But actually it feels like you combine a lot of manager responsibilities and a lot of experienced engineer and boom, you get the principal engineer role. Oh. The only upside is you don't need to do performance reviews for people. Congratulations, you've saved a little bit of that time.
[50:47]
Steve Huynh
Well, actually, during performance review season they pull the principal engineers in, because if you're stack ranking people, okay, cool, well, we'll need to take a look at their performance. So I reported to a VP, you know, one of my peers was a director, and he was basically like, hey, Steve, I would like you to show up to my performance review for my entire org of a hundred-something people. And I'm like, I can't do that for you and for everybody else.
[51:13]
Podcast Co-host / Interviewer
Okay, so now, so now it makes sense why as a principal engineer, your compensation package will be similar to, like, is it a senior engineering manager or something like that?
[51:22]
Steve Huynh
Around that, around that.
[51:23]
Podcast Co-host / Interviewer
But basically, like, the job has a lot of overlaps. Okay. The benefit is you're not the one delivering the performance review to the direct
[51:31]
Podcast Host (Pragmatic Engineer)
Report, but you're doing all.
[51:33]
Podcast Co-host / Interviewer
Almost everything else or in terms of the effort I'm talking about. Yeah.
[51:37]
Steve Huynh
Okay.
[51:38]
Podcast Co-host / Interviewer
So having been a principal engineer for four years, what are the good things that you really, really liked about Amazon, specifically Amazon's principal engineer role? And what are some of the, you know, not so good or it could have been better things.
[51:51]
Steve Huynh
I mean, the, the great parts are you get visibility that you just couldn't possibly have at the team level. You know, within a large organization like Prime Video or wherever you're at, there are many thousands of people that are working within that organization doing so many things right. And typically the performance of these people is really high. There's so many different directions that are going on. And so to survive, you kind of have to look inward and you say, okay, well, here's my service boundary, here's all the software I own. I'm going to own everything within the sphere of ownership. Because you've built this wall up, you tend not to be able to see that broader picture. And so as a principal engineer, I think it's really awesome to be able to sort of like spelunk and be able to go to different teams and sort of see that broader picture. And I just don't. I don't see a way that you would be able to get that type of visibility. That's super interesting at a lower level. You know, I think the other thing is like, you know, whether it's warranted or not, you do get some amount of status. When you go to a meeting, people just listen to you. They listen to your harebrained ideas. And it's kind of nice because you don't necessarily have to prove yourself over and over again.
[53:03]
Podcast Co-host / Interviewer
It's a bit less professional. Not fights, but just establishing that you know what you're talking about.
[53:10]
Steve Huynh
Yeah, yeah. Now the bad things are, you know, there's a lot of folks that are really good in tech and really effective as a principal engineer, but then they also, you know, myself included, they're like, okay, cool. Well, that sort of makes me an expert in pretty much everything. And so you would get these principal engineers together. We had a weekly meeting. And so it'd be like, okay, if you wanted to talk about, like, establishing a constitution for a small island nation, all of a sudden they would just be like, well, like, here are the main considerations. Nobody has a background in government policy, but all of a sudden, like, just because you're sort of trained to do so, you start to, like, pitch in. You're like, well, actually, you know, maybe we should have two branches of government or three branches of government. And it just sounds like we would know what we're doing, but we don't. And so there's this trap, and again, I've fallen into it many times, where you actually think you're an expert in one thing, but you're actually not. Right. And so take LLMs. There's a ton of folks that understand AI. I left before it was sort of, like, allowed to be used internally, but I think you can use it now. I'm not an expert in LLMs at all, but I do think that the expectation would be that you understand how they work. But then the expectation is also like, hey, what should our policy be? How should we be thinking about this stuff? And I think that's fine for mature technologies. Potentially you can ramp yourself up for it, but as that particular landscape is changing so quickly, I think there's this sort of trap where you speak as an authority, even though you haven't had the requisite time to ramp up on something.
[54:52]
Podcast Co-host / Interviewer
And you've been there for 17 years at Amazon. What are your favorite parts of the culture? There's a lot of things that there's a values that we all know, like the frugality, customer obsession. What were the things that you found to be the most interesting or the ones that had a lasting impact?
[55:10]
Podcast Host (Pragmatic Engineer)
And how did they change?
[55:11]
Podcast Co-host / Interviewer
How did Amazon change over 17 years? They must have changed.
[55:15]
Steve Huynh
No, I think the thing I miss the most is the secret sauce. Yeah, the leadership principles are good, but I think the actual secret sauce there is principled thinking, right? Yeah. So there's invent and simplify and bias for action and all of this stuff. But ultimately, the thing that is amazing about those leadership principles isn't the specific stances that they took. So they decided that customer obsession is a big deal. They decided that bias for action is a big deal. All of these things. But really, if you looked at a meta level you'd be like, oh, these guys have principles that they won't budge on. I sort of think about it in terms of math and axioms. Like, you just take certain things to be true. You know, two lines that are parallel, if you extend them out to infinity, won't touch each other.
[56:06]
Podcast Co-host / Interviewer
Yeah. You assume that's true.
[56:07]
Steve Huynh
Yeah. You don't prove that; it's an axiom. And then based off of that, you're able to build a system of mathematics. Right. And so it's the same thing with the corporate leadership principles at Amazon. They basically said, okay, we are going to fix these things to be true. There are 16 or 12 or, I don't know, they just sort of bolt.
[56:26]
Podcast Co-host / Interviewer
16 and now they're 16.
[56:30]
Steve Huynh
But there are like four or five that are just really core to Amazon and we just fix those things to be true.
[56:37]
Podcast Co-host / Interviewer
Which ones were the ones that you felt were the most present?
[56:41]
Steve Huynh
Customer obsession. We are absolutely customer obsessed. We'll just burn money to delight a customer. You can be in a meeting with a VP as an intern and you say, hey, that's a bad customer experience. It would be like a needle coming off a record. It would just be like, what? What are you talking about? Like, immediately. Right. You know, bias for action. So, like, just get some stuff done. Stop asking for permission. Just like, go and do it. Right. Ownership. It's just like you own your software, you run the, you know, you do the operations, you know, you own the bug count, all of this stuff. Right. So those are the ones that are like, those are fixed. And then you start layering things on top of it. And I think it's really great. But, you know, you could take Amazon and you could have like the, you know, evil goatee version of Amazon, which is just sort of the opposite of those things, and that would still be a really valid and awesome company. So you could say, okay, well, what's the opposite of customer obsession? It's not customer obsession or like not being customer obsessed.
[57:36]
Podcast Co-host / Interviewer
I think it's, you know, like being about your staff.
[57:39]
Steve Huynh
Yeah.
[57:40]
Podcast Co-host / Interviewer
Which is Google.
[57:42]
Steve Huynh
It could be like, hey, we really care about our people above everything else. Or it could be, you know, let's not mince around it, we care about top line or bottom line revenue. Yeah, that's totally valid. Right. And then you could just fix that. You can't prove that, you know, being staff focused is a bad thing. You just build that and then, you know, a certain set of things will happen. Like, great things are going to happen and then, like, not so great things are going to happen. Those not great things that happen, you can try to mitigate them, but you can't fix them because you have started with this principled approach to things.
[58:15]
Podcast Co-host / Interviewer
Yeah, yeah, it all goes like everything has. Yeah, I see what you mean, but I think what you're saying is like, it might be less about what the specific principles are. I mean, Amazon has theirs and we know about them, but it's just sticking to them and not keeping wiggling. Because if you keep wiggling, it's like, what was the point? Right Then you're going to have a really mediocre, truly not standout company, whatever you do.
[58:38]
Steve Huynh
What does it actually mean to be principled and to not bend when it could be really easy to do so? So that's an amazing secret sauce of Amazon's. People look at the leadership principles. I'm like, no, it's principled thinking.
[58:49]
Podcast Co-host / Interviewer
Another thing, a lot of this, honestly, from what I understand talking to you earlier and some other people, a lot of it probably comes from Jeff Bezos, from the top down, being very principled on not giving in, on saying, we will do this, whatever it takes. Sounds like it was customer obsession initially and then some other things.
[59:06]
Steve Huynh
Yeah, yeah, absolutely. And he was an absolute genius when it came through. So I'm a Jeff Bezos fanboy for sure. It just worked. Another thing that's Amazon's secret sauce is just the writing culture. And so I spent on the order of one to four hours every day reading while I was a principal engineer, and we had a standard format: it was a six-page memo. And that would be our business strategy, that would be a system design, that would be what we called the PRFAQ. So a press release and frequently asked questions for a new line of business or a new initiative. And everybody was sort of constrained to this six-page format, and everybody just produces documents in that format for whatever they need to do. And so when I would try to get up to speed on a particular thing, I would just be like, give me your six-pagers, give me all your documents. And I just got really, really good at just reading these documents to get up to speed, which was a self-fulfilling and virtuous cycle, which is just like, okay, well, now I need to express myself. And so I will write a six-pager and that will set the context for whatever we're working on. We'd go to a meeting, you would read the six-pager, and it was just super great to actually just have people do study hall at the beginning part of a meeting, where everybody just gets fast-forwarded, and then you have a really great discussion at the end. What an amazing culture that I think almost every other company should replicate if they could. But I think the difficulty would be, like, you actually have to be disciplined and principled, then have a reading culture, and then actually value writing.
[60:55]
Podcast Co-host / Interviewer
Yeah. I almost wonder if, unless it comes from the top, some of these things might just be really, really hard to do.
[61:00]
Steve Huynh
Yeah.
[61:01]
Podcast Co-host / Interviewer
One thing that I figured is we're in your studio right now and you have a lot of these blocks and I asked them what they are. Are they for promotions or projects or whatever?
[61:11]
Podcast Host (Pragmatic Engineer)
They're for patents.
[61:12]
Steve Huynh
Yeah.
[61:14]
Podcast Co-host / Interviewer
And this is for patent number 10,000. 10,000,824. 964. Can you tell me about why you have these, how they come about, what you needed to do for them?
[61:25]
Steve Huynh
So the highest order bit is like, for better or for worse, there are software patents that exist. Amazon, they'll say that basically the reason they have them is defensive, because other people will assert that, hey, you're in violation of our patents or our IP, and then we'll use them reactively. Okay, fine, but you're also in violation of these other things. And so there is a culture of trying to make sure that we protect ourselves in that way. But there's the other part of software patents, which is basically like, hey, can you really patent math or whatever. And so what I learned over time is that I'm just a really bad IP lawyer, even though as a principal engineer, I might cosplay as somebody that really understands software patents. Right. At the end of the day, what we would do is we would take our important six-pagers and we would hand them over to the legal team, and then they would just be like, oh, this stuff is really interesting, let's explore that. And so it turned into this awesome thing where we just had ready inputs to go into that particular system.
[62:32]
Podcast Co-host / Interviewer
A writing culture, turns out, has a bunch of benefits.
[62:35]
Steve Huynh
Exactly. And there's this concept called, like, the curse of knowledge, which is essentially like, if you understand something, you discount how long, like how easy that concept is. And so it's just like you don't get it, you don't get it, you don't get it, and then you get it. And then you're like, oh, that's trivial. Right. Even though it could actually be novel or it could actually be interesting. And so what ends up happening is that you would just throw these documents over to the lawyers and then they would basically be like, oh, this stuff is great. And you would just be like, well, that's just regular software development. Or that's just the context and domain that we were living in. It turns out that there's some interesting stuff. This particular patent I'm proud of. So there's a system design interview question that seems to be popular right now, which is like, design Ticketmaster. Right. And so I worked on Amazon Tickets, and we ended up shuttering that business, but we ended up building one of the fastest ticket selling systems in the world. We could do many, many orders per second. So the use case is basically at T0 for a really big ticket on-sale. That's when the maximum amount of demand and requests are coming in, and you want to sell out all of your ticket supply as quickly as possible. The problem, I think, is one where you have seated concerts. Most of the time with the system design stuff, it'll be like general admission, or it won't be a big ticket on-sale, you know, like one with a bunch of demand. But when you purchase a ticket for a seated concert, you have to find contiguous seats.
[64:12]
Podcast Co-host / Interviewer
Yeah. So the really, really quick. Next to each other.
[64:15]
Steve Huynh
Yes, exactly. And so, you know, it's actually really hard. Like, suppose it was a SQL database as your backing store. Like, how do you come up with a SQL query that's just like, hey, give me the best four tickets within this particular price range that are sitting next to each other?
[64:34]
Podcast Co-host / Interviewer
Yeah. Now you're thinking, so this is a real world thing where you want to be as efficient as possible in terms of resource usage. May that be maybe you want to minimize your CPU or memory, depending on what you have, I assume. And you need to do this quick, as rapidly as possible to give this to people. Okay, so now we're talking about a problem that is. Seems like pretty novel in some ways, right?
[64:57]
Steve Huynh
Yeah. And so I did this patent with a senior principal. I was a senior engineer at the time. But the idea is what is the theoretical maximum speed by which we could show this inventory to people? And it turns out that even if you have a high ticket on sale, you only have thousands of tickets at the end of the day. So instead of making a request to like a backend that would conduct some sort of search across the space, what if you actually inverted it and then you basically had each of the individual hosts have like Some view on the entire arena or venue that was there. And you loaded up all of that availability and inventory into like L2 cache on a CPU.
[65:45]
Podcast Co-host / Interviewer
Yeah.
[65:45]
Steve Huynh
Because it's actually not that many. So if you have this compact replication.
[65:48]
Podcast Co-host / Interviewer
Yeah, well, the cache was pretty big.
[65:49]
Steve Huynh
Yeah. Then what you can do is you can do bit manipulation to like really, really quickly get contiguous seats that are there. And then what you do is you can like send in that particular request and try to like reserve those particular seats.
[66:05]
Podcast Co-host / Interviewer
Now there's a locking problem, which is
[66:07]
Steve Huynh
Much more tractable than like, hey, there's, you know, 2 million people that have just hit your on sale.
[66:16]
Podcast Co-host / Interviewer
You got to search for each of them.
[66:18]
Steve Huynh
Yes. So the inversion of that ordering process, by which you actually send out the inventory to the individual nodes, and then load it up into CPU cache, and then just do bit manipulation, and then try to lock that resource from the individual nodes. That was the basis of this particular patent. Awesome.
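[Editor's sketch] The bit-manipulation step described above can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the patented system: it assumes each seating row fits in one integer bitmask (bit i set means seat i is free), and the names `find_contiguous` and `try_reserve` are hypothetical. The reservation function only models the check-and-clear locally; in a real system that step would presumably be an atomic operation against the authoritative inventory.

```python
from typing import Optional, Tuple


def find_contiguous(row_mask: int, k: int) -> Optional[int]:
    """Return the lowest seat index that starts a run of k free seats, or None.

    Assumption: bit i of row_mask is 1 when seat i in the row is still free.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    run = row_mask
    # After the loop, bit i of run is set only if seats i..i+k-1 are all free.
    for shift in range(1, k):
        run &= row_mask >> shift
    if run == 0:
        return None
    # Isolate the lowest set bit and convert it to a seat index.
    return (run & -run).bit_length() - 1


def try_reserve(row_mask: int, start: int, k: int) -> Tuple[bool, int]:
    """Clear k seat bits beginning at start if all of them are still free.

    Returns (success, new_mask). This stands in for the lock attempt a node
    would send to the authoritative store (hypothetical, local-only here).
    """
    wanted = ((1 << k) - 1) << start
    if row_mask & wanted != wanted:
        return False, row_mask  # someone else took at least one seat
    return True, row_mask & ~wanted


# Seats 0, 2, 3, 6, 7, 8, 9 are free in this example row.
row = 0b1111001101
start = find_contiguous(row, 4)
ok, row_after = try_reserve(row, start, 4)
```

The search costs a handful of shifts and ANDs per row regardless of how many buyers are waiting, which is why holding the whole venue in a compact in-memory form on each host is attractive; the only contended step left is the reservation itself.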
[66:37]
Podcast Host (Pragmatic Engineer)
That's clever.
[66:38]
Podcast Co-host / Interviewer
And that sounds like, some people are always asking, like, oh, on my job I don't use the algorithm stuff or any of the formal methods. Sounds like there are some uses of it. Especially when you're trying to figure out, just taking away from the patent, having a problem like this and saying, what is the theoretical limit of what we can do, what is the fastest possible? To answer that, you probably want to have access to these tools. So it's not always the time and effort to actually get into these things.
[67:10]
Podcast Host (Pragmatic Engineer)
So what are you up to now.
[67:11]
Podcast Co-host / Interviewer
That you've left Amazon a year ago after like 17, 18 very long years.
[67:17]
Steve Huynh
You know, I'm just, you know, I'm just making content. I'm just sort of living the dream there, you know, making YouTube videos. Started up a newsletter. I have a discord community and yeah, just.
[67:28]
Podcast Co-host / Interviewer
Yeah, and we're going to link all of those below. I actually like got to first know you before we started talking. This was probably a few years ago from your YouTube videos which are, you know, you shared a lot about like Amazon things, software engineering things and just like your general thinking. But yeah, your newsletter is a new one, so we'll link it in the show notes below. It's always a good way to keep in touch and also, you know, like on your YouTube channel. Awesome. So as closing, I have some, some rapid questions, so I'll just ask and you just shoot what comes to mind. What is career advice that greatly helped you in your path?
[68:02]
Steve Huynh
Yeah, I mean, you know, I talk a lot about this. It's kind of like, oh, what's your favorite food or your favorite movie? It's just like there's so much there and it's hard to pick one. What I would say is, instead of saying, like, hey, what's the technology that I should learn that's really going to, you know, make my career solid, instead sort of flip it around and say, like, how can I quickly learn skills that make me sort of, like, recession proof, right? That sort of make me valuable. It's essentially meta learning. It's like, how can I learn something faster and faster? If that's your focus, then you'll never have a problem finding a job and you'll never have a problem progressing in your career. Now, some of the skills may be difficult to find resources on online, but, you know, I think if you just sort of think about, like, what's a valuable skill that, if I knew it right now, would make my job search easier or would make me perform better on the job, and then just sort of think about acquiring that skill as quickly as possible.
[69:08]
Podcast Co-host / Interviewer
And do it now, like, don't wait.
[69:09]
Steve Huynh
Yeah, people tend to postpone themselves. They'll be like, oh, well, I'll start when everything is lined up. But to begin, you just need to begin. When you start something, only then will you know what you need to do, instead of saying, oh, I need to get everything that I need to do first before I start.
[69:28]
Podcast Co-host / Interviewer
You've used a lot of programming languages. Which one's your favorite and why, and which one do you dislike most?
[69:34]
Steve Huynh
Yeah, I have like, obviously there's no perfect programming language. What I would say is like, I really enjoyed Perl and nobody would ever give that answer. But I just like this concept of like there's just so many different ways to do it. It's like it's a write only language. Like you can't read anybody else's Perl and it's actually one of the languages that uses up the most power. It's like the least efficient. It's interpreted. It's just like terrible.
[70:04]
Podcast Co-host / Interviewer
Most of Booking.com still runs on it, or some of it.
[70:07]
Steve Huynh
Yeah, Amazon's back end was, you know, for a long time. It still might be, you know, sort of like Perl Mason, which is sort of like web technology bolted onto Perl. But I just kind of like it. I just feel like I can express myself, and however you'd like to express yourself, you can. It also looked like an ASCII factory blew up sometimes. And so, you know, now that it's on a podcast, you know, I wouldn't really advertise that fact. The best programming languages right now, I think Rust is pretty interesting, so I might, you know, pick that up. At the end of the day, I really love the boring languages. So Java, for all of its stuff, its verbosity, I think it's just a great language, like a JVM-based language that has essentially great library support and a bunch of stuff written for it, but it's just super boring. Maybe it's just because I'm from Amazon and we do this enterprise stuff, like, it's a fine language.
[71:07]
Podcast Co-host / Interviewer
And then I see you have a large bookshelf here. You also read a lot, especially at Amazon, although most internal documents. What is a book that you would recommend? Something around software engineering that you enjoyed and it cannot be that book, it.
[71:20]
Steve Huynh
Can't be your fucking. What I would say is, having just given the advice about meta learning and career growth, I think that most software developers should read a book by Cal Newport. It's called So Good They Can't Ignore You. And so the concept there is around career capital. So like, what are the skills that are in the most demand? And if you can just, like, learn those skills, then you become in demand, and then, you know, from there you can choose what type of lifestyle you'd like. You know, you can also sort of lean into some of the science of meta learning. So deliberate practice, spaced repetition, that sort of thing. In terms of, like, tech books, I think the new AI Engineering book by Chip Huyen is amazing. I think DDIA, so Designing Data-Intensive Applications.
[72:10]
Podcast Co-host / Interviewer
So good a new version is coming the end of the year actually.
[72:13]
Steve Huynh
I'm excited about that. I think that'll be pretty good. But at the end of the day, you don't want one book on your bookshelf, you want 50 books on your bookshelf. And so I think within a particular subgenre of tech books, you know, I'd have recommendations there.
[72:29]
Podcast Co-host / Interviewer
But Steve, this was great, awesome. Really enjoyed it.
[72:33]
Steve Huynh
Yeah, great. Thanks so much for having me.
[72:35]
Podcast Host (Pragmatic Engineer)
Thanks a lot to Steve for sharing all these details. Although Amazon's principal engineering level feels surprisingly difficult to get promoted to, I have yet to hear of a principal engineering community as strong as the one Amazon builds and keeps investing in. This community itself could be reason
[72:49]
Podcast Co-host / Interviewer
enough to consider the company at the
[72:51]
Podcast Host (Pragmatic Engineer)
principal-plus level, should you have the opportunity to do so. For a deep dive into Amazon's engineering culture, including the details on compensation, career ladders, performance reviews, and engineering processes, check out the Pragmatic Engineer Deep Dive linked in the show notes below. If you've enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. This helps more people discover the podcast, and a special thank you if you leave a rating. Thanks and see you in the next one.