Lenny's Podcast Summary
Episode: AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)
Date: October 23, 2025
Host: Lenny Rachitsky
Guest: Chip Huyen
Overview
This episode dives deep into the nuts and bolts of building real-world AI products, led by Chip Huyen—veteran AI engineer, educator, author of AI Engineering, and hands-on contributor at Nvidia, Netflix, and Stanford. The conversation aims to demystify the essential concepts and practicalities of AI engineering, debunk common myths, and share battle-tested advice for product leaders and engineers seeking to create robust, impactful AI-driven applications.
Topics covered: the difference between pre-training and post-training, fine-tuning, RLHF (reinforcement learning from human feedback), retrieval-augmented generation (RAG), evals (evaluation frameworks), organizational AI strategy, and the evolving role of engineers in the GenAI era.
Key Discussion Points & Insights
1. Misconceptions about Building Great AI Products
[04:39–06:48]
- Many believe the key to improving AI apps is “staying up to date” with the latest models, frameworks, or database tech.
- Chip’s viral LinkedIn chart:
- What people think improves AI apps: new tech, news, fine-tuning, model comparisons
- What actually improves AI apps: talking to users, better data, reliability, optimizing workflows, better prompts.
- Chip:
“Why do you need to keep up with the latest AI news? There’s so much news out there... If switching from one unproven technology to another is costly and doesn’t move the needle, why bother? ... Talk to users.” ([05:28])
2. Essential AI Training Concepts
[07:34–12:49]
- Pre-training: The foundational model is trained on massive datasets (e.g., all internet text) to predict the next word/token.
- Fine-Tuning/Post-Training: Adjusting the model for specific use cases with targeted data (often more important now than pre-training since most models are already strong).
- Language Modeling Analogy:
“It’s all about encoding statistical information about language. Like if I say 'my favorite color is…', 'blue' is more likely than 'end of table.'” ([09:35])
- Why “tokens” matter: Tokens are more granular than words but more meaningful than characters, balancing vocabulary size and meaning.
- Sampling strategy:
“Sampling strategy is extremely important—how you pick the next token affects both creativity and correctness of output.” ([11:54])
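The sampling point can be made concrete. Below is a minimal, self-contained sketch of temperature-based next-token sampling in Python; the vocabulary and logits are toy values invented for the "my favorite color is..." example, not anything from the episode.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Sample one token from a {token: logit} mapping using temperature scaling.

    Lower temperature sharpens the distribution (more deterministic,
    'correct'-leaning); higher temperature flattens it (more 'creative').
    """
    rng = random.Random(seed)
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    r, cumulative = rng.random(), 0.0
    for tok, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return tok, probs
    return tok, probs  # guard against floating-point round-off

# Toy next-token logits for the prompt "my favorite color is ..."
logits = {"blue": 4.0, "green": 2.5, "end of table": -1.0}
token, probs = sample_next_token(logits, temperature=0.7, seed=0)
```

Raising the temperature toward 1.5 or 2.0 gives "end of table" a real chance of being sampled; lowering it toward 0 makes "blue" essentially certain, which is the creativity/correctness trade-off Chip describes.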
3. Supervised vs. Reinforcement Learning (RLHF)
[15:20–19:15]
- Supervised learning: Training models on labeled data (spam/not spam, good/bad answer).
- Reinforcement learning from human feedback (RLHF): Comparing model outputs, with feedback (preferences) provided by humans or AI, used to “reward” better outputs.
- Chip:
“It’s easier for humans to give comparisons than absolute scores... RLHF is about using this feedback as a signal to nudge models towards more desirable behavior.” ([17:07])
- Industry trend: Data labeling is a big business but risky—startups depend on a handful of customers (frontier labs), which creates uncertain economics.
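The "comparisons, not absolute scores" idea has a standard mathematical form. Here is a toy sketch of the Bradley-Terry pairwise loss commonly used to train RLHF reward models (a standard formulation assumed here; the episode does not go into the math): the loss shrinks as the reward model scores the preferred response above the rejected one.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    A human (or AI) labeler only says which of two outputs is better;
    training on this loss nudges the reward model to give the chosen
    output a higher score than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ranked pairs incur less loss than mis-ranked ones.
loss_good = preference_loss(2.0, 0.5)  # chosen scored higher: small loss
loss_bad = preference_loss(0.5, 2.0)   # chosen scored lower: large loss
```

The reward model trained this way then supplies the scalar "reward" signal used to fine-tune the language model's behavior.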
4. Evals: Evaluating AI Product Quality
[22:24–31:52]
- Evals (evaluations):
- For app builders: Are my LLM-powered features good enough?
- For model builders: Is my model improving at specific tasks?
- Do you need evals? Chip’s pragmatic take:
“To win, you just need to be good enough and consistent—not perfect. Sometimes engineers want to invest in evals to improve from 80% to 82%—but two engineers could instead launch a new feature and move the needle more.” ([24:39])
- When evals are essential:
- At scale, or for business/value-critical uses (failures can cause catastrophe)
- When the product’s competitive edge is its quality or performance
- How many evals? Focus on the core use case (“main path”), not every tiny feature. The number varies greatly based on product breadth and risk.
- Eval design is creative: several levels of checks—input queries, content breadth, overlap, depth, quality, relevance, etc.
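As a concrete illustration of an app-level eval along the "main path", here is a toy harness; the stub model, queries, and check predicates are all invented for illustration, not from the episode.

```python
def stub_model(query: str) -> str:
    """Stand-in for a real LLM-powered feature."""
    canned = {
        "What is the capital of France?": "Paris is the capital of France.",
    }
    return canned.get(query, "I don't know.")

# (query, predicate) pairs covering the product's core use case.
EVAL_CASES = [
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is the capital of Atlantis?", lambda out: "don't know" in out.lower()),  # should abstain
]

def run_evals(model, cases):
    """Run every case through the model and report a pass rate."""
    results = [(query, bool(check(model(query)))) for query, check in cases]
    pass_rate = sum(ok for _, ok in results) / len(results)
    return pass_rate, results

pass_rate, results = run_evals(stub_model, EVAL_CASES)
```

Tracking this pass rate over time is what lets a team see whether a prompt or pipeline change actually moved the product, rather than guessing from anecdotes.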
5. RAG (Retrieval-Augmented Generation)
[31:54–37:46]
- What is RAG?
- Supplementing models with relevant external data at inference time.
- Originated when adding Wikipedia context improved question-answering performance.
- Keys to RAG success:
- Data preparation beats tech choice:
“In a lot of companies, the biggest improvement in RAG comes from better data preparation, not agonizing over which vector database to use.” ([35:07])
- Chunking & metadata: how you split documents, add summaries/metadata, and create hypothetical questions for better context retrieval.
- Docs written for humans often need augmentation for AIs: add annotations, clarify scales or references that humans infer intuitively.
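To make the RAG pipeline shape concrete, here is a deliberately naive sketch: chunk a document, rank chunks by word overlap with the query, and prepend the winner to the prompt. The document text is made up, and real systems would use embeddings plus a vector store, but per Chip's point the data-preparation step (how you chunk and annotate) is where most of the leverage is.

```python
def chunk(text, size=8):
    """Split a document into fixed-size word chunks. (Naive: chunking by
    sections or paragraphs, plus added summaries/metadata, usually works
    far better, which is exactly the data-preparation point above.)"""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks, k=1):
    """Rank chunks by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]

doc = ("The reimbursement limit for travel meals is 50 dollars per day. "
       "Hotel bookings must be approved by a manager in advance.")
chunks = chunk(doc)
context = retrieve("what is the meal reimbursement limit", chunks)
prompt = f"Context: {context[0]}\n\nQuestion: What is the meal reimbursement limit?"
```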
6. Organizational & Productivity Pitfalls in AI Adoption
[39:30–43:32]
- Types of GenAI tools:
- Internal productivity (e.g., coding agents, knowledge chatbots)
- Customer/partner-facing (e.g., sales chatbots—easy to measure ROI)
- AI adoption struggle:
- “We buy tools for everyone, but few use them much.”
- It’s hard to measure productivity gains, especially coding tools.
- Different organizations see different impacts, e.g.:
- Some report top-performing engineers get the biggest AI productivity boost ([47:39])
- Others find senior engineers most resistant to AI tools.
7. The Changing Role of Engineers
[49:39–55:04]
- Senior engineers become more valuable as reviewers/system-thinkers and in defining good practices, not just code generators.
- Companies are already shifting org structure:
“We’re preparing for an era where a small group of strong engineers create processes and review code, and AI/junior engineers generate much of the code.”
- But concern: how will junior engineers develop “senior” understanding if entry-level work is automated away?
- System thinking and debugging:
- AI excels at contained, well-defined tasks, but debugging cross-component issues or systemwide reasoning still requires human know-how.
“Coding is just a means to an end—CS is really about system thinking, using code to solve actual problems... AI can automate tasks, but knowing how to tie those skills into solutions is hard.” ([51:33])
8. ML Engineer vs. AI Engineer
[56:05–57:04]
- ML Engineer: Builds/trains models.
- AI Engineer: Integrates and leverages existing models as services to build products.
- Entry barriers to “AI engineering” are dropping—possibilities for applications have exploded.
9. Predictions for the Next Few Years
[57:40–66:23]
- Org structures will blur:
- Product, engineering, and even marketing will converge, as evaluations, user understanding, and system design grow ever closer.
- Automation and job shifts:
- Companies will question what should or shouldn’t be automated; team roles will shift accordingly.
- Separation between junior and senior engineering value may widen.
- Post-training will matter more:
- Major improvements likely to come from fine-tuning, evaluation, and application-layer innovation rather than fundamentals of base models.
- Multimodality is next:
- Text models are “solved”; audio, video, and especially voice are still hard, with new challenges (e.g., latency, interruption management, regulatory needs).
“Voice is an entirely different beast. We need to sound natural, manage latency, handle interruptions—much harder than you’d think.” ([65:36])
- Test-time compute: Sometimes running the model longer or generating multiple answers at inference delivers better performance—without changing the base model.
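One common flavor of test-time compute is majority voting over several samples (often called self-consistency). This toy sketch uses a scripted stand-in "model" (invented here, not from the episode) to show how extra inference-time compute can improve answers without touching the base model:

```python
from collections import Counter
from itertools import cycle

def majority_vote(sample_fn, n=5):
    """Draw n answers from the same stochastic model and return the most
    common one: more inference-time compute, same base model."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Scripted stand-in for a model that answers correctly 3 times out of 5.
_script = cycle(["42", "17", "42", "42", "17"])
def noisy_model():
    return next(_script)

one_shot = noisy_model()                 # a single sample can be wrong
voted = majority_vote(noisy_model, n=5)  # voting surfaces the majority answer
```

The cost is linear in the number of samples, which is why this trade is most attractive when quality matters more than latency or price.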
Memorable Quotes
- On why user focus matters:
“If you talk to users and understand what they want or don’t want... you can actually improve the application way, way, way more.” (Chip, [00:00])
- On overengineering:
“If you adopt a new technology... you would be stuck with it forever. Maybe you want to think twice about over-committing to new tech that hasn’t been tested.” (Chip, [05:28])
- On the goal of evaluation:
“The goal of eval is to guide product development... it helps you uncover where products are doing well and where they’re not.” (Chip, [27:54])
- On being pragmatic with evals:
“You don’t have to be absolutely perfect. To win, you just need to be good enough and be consistent about it.” (Chip, [24:39])
- On the value of system thinking:
“Coding is just a means to an end. CS is about system thinking—using code to solve actual problems. AI can automate stuff, but knowing how to tie these skills together to solve a problem is hard.” (Chip, [51:33])
- On data labeling company risks:
“It’s very lopsided—a small number of frontier labs need a ton of data, with many companies racing to supply it. I’m not bearish, but the economics are uncertain.” (Chip, [21:03])
- On building with GenAI:
“We’re in an idea crisis now. We have all these cool tools that can do everything from scratch... yet people are stuck, they don’t know what to build.” (Chip, [68:34])
- On discovering AI product ideas:
“Spend a week paying attention to what frustrates you. That’s where your best ideas come from.” (Chip, [70:46])
Notable Sections & Timestamps
- [04:39] The Viral Chart: What actually improves AI apps
- [07:34] Pre-training vs. Post-training & Fine-Tuning
- [15:20] Supervised vs. Reinforcement Learning (RLHF)
- [22:24] Evals: How and When to Evaluate AI Features
- [31:54] What is RAG and why data prep matters most
- [39:30] AI Adoption in Companies: Productivity vs. Hype
- [47:39] Different Responses to AI Coding Tools
- [51:33/55:04] The Need for System Thinking and Debugging
- [56:05] ML Engineer vs. AI Engineer: A New Role
- [57:40/64:16] Predictions: The Next Few Years for AI and Product Teams
Tone & Style
Chip brings a technical yet grounded, pragmatic approach, often challenging hype cycles and reminding builders to focus on data, workflow, and real user needs. The episode balances approachable analogies (Sherlock Holmes, favorite colors), concrete organizational war stories, and hard technical analysis.
Key Takeaways
- Don’t get sucked into tech hype—solve real user problems first.
- Fine-tune and evaluate with purpose—don’t over-optimize what doesn’t matter.
- Organizational success in AI requires cross-functional mindset, not siloed teams.
- Future differentiation comes from application layer, not just model size or base performance.
- System thinking and problem-solving trump rote coding in the GenAI era.
- Keep generating “microtools” that address specific, real frustrations for real users.
