Podcast Summary: "Why AI Evals Are the Hottest New Skill for Product Builders"
Lenny's Podcast: Product | Career | Growth
Host: Lenny Rachitsky
Guests: Hamel Husain & Shreya Shankar
Date: September 25, 2025
Episode Overview
In this deeply tactical episode, Lenny interviews Hamel Husain and Shreya Shankar, creators of the #1 AI evals course. Together, they demystify the emerging and crucial discipline of "AI evals"—the art and science of evaluating AI applications to systematically improve product performance. Drawing on their experience training thousands of product managers and engineers, they walk through practical frameworks, concrete examples, best practices, and common pitfalls. The episode is a hands-on primer designed to get any product builder up to speed on why and how to do evals, with a tone both friendly and deeply data-driven.
What Are Evals? The Big Picture
[05:48] Hamel Husain:
"Evals is a way to systematically measure and improve an AI application. ... It really is, at its core, data analytics on your LLM application, and where necessary, creating metrics around things so you can measure what’s happening and then you can iterate and do experiments and improve."
- Evals = systematic measurement of AI application quality
- Not just tests; includes exploratory data analysis, error classification, prioritization, and ongoing improvement
- "Unit test" is only a tiny part—most is understanding user experience, ambiguities, and real-world application breakdowns
[08:29] Lenny:
"Is it like unit tests for code?"
[08:35] Shreya Shankar:
"Unit tests are a very small part of that very big puzzle."
Key Concepts and Process
1. The Real-World Example: Nurture Boss (AI Assistant for Property Managers)
[10:06–22:45]
- Hamel demos analyzing logs/traces from an actual AI product, "Nurture Boss."
- Traces: Detailed logs of user–AI interactions, including system prompts, tool calls, and agent responses.
- Error Analysis: Manually inspect logs and write “open codes”—simple notes marking what went wrong.
- Examples: the AI missed an opportunity to hand a customer off to a human, produced janky conversation flows, and hallucinated a virtual tour.
[19:19] Hamel Husain:
"You just write a quick note... Should have handed off to a human."
[21:56]
- Keep note-taking quick and informal—don't try to perfectly classify everything immediately.
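A minimal sketch of what a trace with an open-code annotation might look like. The `Trace` fields and `annotate` helper are illustrative, not from the episode or any specific tool:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One logged user-AI interaction: prompt, messages, tool calls."""
    trace_id: str
    system_prompt: str
    messages: list                          # alternating user/assistant turns
    tool_calls: list = field(default_factory=list)
    open_code: str = ""                     # quick, informal note on what went wrong

def annotate(trace: Trace, note: str) -> Trace:
    """Attach a freeform open code; keep it fast, refine categories later."""
    trace.open_code = note
    return trace

# Example: the handoff failure Hamel describes (contents are made up)
t = Trace("t-001",
          "You are a leasing assistant...",
          ["Can I talk to a person?", "I can help with that!"])
annotate(t, "should have handed off to a human")
```

The point of keeping `open_code` a plain string is exactly the advice above: write the quick note first, classify later.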
2. Open Coding, Axial Coding, and Categorization
[33:54] Shreya Shankar:
"The purpose axial code basically is just a failure mode. ... Our goal is to get to these clusters of failure modes and figure out what is the most prevalent, so then you can go and attack that problem."
- Open Codes: Freeform, first-pass notes (“janky conversation,” “hallucinated answer”).
- Axial Codes: Synthesized categories (e.g., handoff issues, process violations, output formatting errors).
- Use LLMs (e.g., Claude, ChatGPT, Gemini) to help cluster and categorize after initial labeling.
[42:22] Shreya Shankar:
"...open codes have to be detailed. Right. You can't just say janky because if the AI is reading janky, it's not going to be able to categorize it."
3. Quantifying and Prioritizing Errors
[44:40–48:31]
- Create a pivot table ("dumb and simple" is often best) to count error types.
- Prioritization: Not all error types are equally important—fix the most business-critical and prevalent ones.
- Some issues can be fixed simply (e.g., clarify a prompt), others warrant ongoing monitoring and evaluation.
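The "dumb and simple" pivot table can be as little as a `Counter` over axial codes. This sketch assumes one axial code per reviewed trace (data is made up):

```python
from collections import Counter

# One axial code per reviewed trace
axial_codes = ["handoff issues", "hallucination", "handoff issues",
               "conversation flow", "handoff issues", "hallucination"]

counts = Counter(axial_codes)          # the "dumb and simple" pivot table
for failure_mode, n in counts.most_common():
    print(f"{failure_mode}: {n}")
# Most prevalent failure mode prints first -> candidate to attack first,
# weighed against business criticality.
```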
4. From Analysis to Automated Evals: Code-based and LLM-as-Judge
[48:46–52:09]
- Code-based evaluator: For simple, obvious checks ("is response valid JSON?").
- LLM as Judge: For nuanced, subjective, or fuzzy errors (e.g., “should this have been escalated to a human?”). Here, an LLM is prompted with specific criteria to pass/fail a trace.
[52:16] Hamel Husain:
"What we have is an LLM-as-judge prompt for this one specific failure ... you want to make it binary. Because we want to simplify things. ... Is this good enough or not? Yes or no."
- Best Practice: Avoid Likert scales or subjective scores; go for strict binary passes/fails to avoid ambiguity.
[60:56] Shreya Shankar:
"If you're a product manager and the person who's building the LLM judge eval has not done this, they're saying, like, oh, it agrees 75% of the time... go and ask them to go fix that."
Core Best Practices
- Start with error analysis, not tests ([44:40] onward): Don’t rush into writing evals or buying tools. Look at your data.
- Be the “benevolent dictator” ([25:12, 26:40]): Designate a domain expert (often a product manager) to own initial labeling and classification.
- Use AI for synthesis, not for subjective judgment ([24:04]): LLMs can cluster notes, but can’t replace human context in spotting subtle UX/product errors.
- Prioritize based on “theoretical saturation” ([30:29]): Keep reviewing traces until you stop finding new failure types.
- Automate where possible, but stay in the loop ([32:02]): Use LLMs and code to scale, but never fully hand off judgment.
- Iterate on your categories and prompts ([43:15]): Refine as you learn; let the open/axial codes evolve.
Common Misconceptions and Debates
Top misconceptions ([84:30]):
- You can "just automate evals" with AI:
"Can't the AI just eval it? That's the most common misconception. And people want that so much that people do sell it, but it doesn't work." – Hamel Hussain [84:30]
- Thinking evals are just pre-defined unit tests: In reality, much of the value comes from human-in-the-loop analysis.
- Evals replace product requirements documents (PRDs): In reality, evals are an iterative, data-grounded complement to PRDs, not a substitute.
Controversy:
- Some AI leaders claim “vibes” (just using the product a lot) is enough—especially for tools where the builder and power user are the same (e.g., code agents).
- Shreya and Hamel argue this works in narrow domains, but is impractical or even dangerous in complex, user-facing applications.
- A/B tests vs. evals ([76:24]): A/B tests are another form of evals—both require systematic metrics. But without prior error analysis, A/B tests can miss or mis-prioritize key issues.
Concrete Advice for Getting Started
Steps to Build Effective AI Evals
- Sample and Review Traces
- Do manual analysis on roughly 40–100 traces and label observed failure modes.
- Synthesize Failure Modes (Axial Codes)
- Count and Prioritize Errors
- Design Automated Evals
- Use code for objective errors
- Use LLMs as binary judges for subjective errors
- Validate Your Evals
- Check human–AI agreement; refine until alignment is high, especially on rare edge cases.
- Integrate and Monitor
- Add evals to your CI pipeline and production monitoring. Track on dashboards.
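One way to implement the validation step above: raw agreement can look fine while the judge misses every true failure (the "it agrees 75% of the time" trap Shreya warns about), so split the agreement rate by human label. This implementation is a sketch, not from the episode:

```python
def agreement_report(labels):
    """labels: list of (human, judge) booleans, True = pass.

    Reports overall agreement plus agreement restricted to traces the
    human passed and traces the human failed, so rare failure cases
    can't hide behind a high overall number."""
    def rate(pairs):
        return sum(h == j for h, j in pairs) / len(pairs) if pairs else None

    passes = [(h, j) for h, j in labels if h]
    fails  = [(h, j) for h, j in labels if not h]
    return {"overall": rate(labels),
            "on_pass": rate(passes),
            "on_fail": rate(fails)}

# Made-up labels: the judge rubber-stamps most traces as passing
report = agreement_report([(True, True), (True, True), (False, True),
                           (False, False), (True, True), (False, True)])
```

Here overall agreement is 4/6, but agreement on human-failed traces is only 1/3—exactly the kind of judge you would send back for refinement before trusting it in CI or production dashboards.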
Tips & Tricks ([86:37], [87:44]):
- Don’t be afraid of the data; “You’re going to find ways of actionable improvement.”
- Use LLMs to help organize and synthesize, but not to replace your own judgment.
- Iterate; error analysis and eval design are ongoing processes.
- Build custom tools if needed: It's easier than ever to build dashboards or annotation interfaces (using LLMs themselves!) to reduce friction.
- “If you see something wrong, go fix it”—don't fetishize the eval suite ([90:15]).
Time Investment ([90:44]):
- 3–4 days for initial analysis/labeling and setup; after that, about 30 minutes a week for ongoing updates.
Notable Quotes & Memorable Moments
- [00:03] Hamel Husain:
"It's the highest ROI activity you can engage in."
- [24:55] Shreya Shankar:
"Number one pitfall right here is people are like, let me automate this with an LLM."
- [53:02] Shreya Shankar:
"Expert curated content on the Internet ... here's your LLM judge evaluator prompt. Here's a one to seven scale. ... Oh no, now we have to fight the misinformation again."
- [61:45] Lenny Rachitsky:
"This is like the purest sense of what a product requirements document should be. Is this eval judge that's telling you exactly what it should be."
- [62:58] Shreya Shankar:
"You're never going to know what the failure modes are going to be upfront, and you're always going to uncover new vibes that you think your product should have."
- [70:19] Shreya Shankar:
"I think everyone is on the same side. I think the misconception is that people have rigid definitions of what evals is."
- [81:59] Shreya Shankar:
"...they don't correlate with math problem solving, sorry to say."
Lightning Round (“Fun” Section Highlights)
- Hamel: "Keep learning and think like a beginner."
- Shreya: "Always try to think about the other side's argument. ... We're all much stronger together than if we start picking fights."
- Top Product Discovery: Both love Claude Code (despite, or because of, the “built on vibes” meme!)
- Favorite process: Shreya—error analysis; Hamel—removing friction to look at data; both see it as energizing and fun.
Final Words: Where to Learn More
- Hamel: hamel.dev
- Shreya: Google “Shreya Shankar” for her website and contact; “AI evals for engineers and product managers” for the course
- Course perks: 160-page book, 10 months of AI “coursebot” access, active Discord
How to be helpful:
- Ask questions, share real-world successes, write about your learnings so others can benefit (they encourage more teachers in the field!).
Summary Takeaways
- Great AI products require systematic, human-in-the-loop "evals"—the top new skill for product builders.
- Evals go far beyond "unit tests" and require direct engagement with user data and product experience.
- AI is a powerful assistant in synthesizing and automating evals, but cannot replace human judgment.
- The process: manual trace analysis ➔ categorization ➔ error quantification ➔ targeted automation ➔ ongoing monitoring.
- Avoid magic bullets and “just buy a tool” mentality—start with data, not dogma.
- Evals are not only for debugging—they drive real, actionable improvements and product success.
For anyone building AI products—this episode is definitive listening.
Key Timestamps
| Segment | Timestamp |
|---------------------------------------|-------------|
| Introduction & controversy | 00:00–01:09 |
| What are evals? Definition & context | 05:07–08:29 |
| Real-world eval walkthrough | 09:56–22:45 |
| Coding, categorizing, and iterating | 25:12–44:40 |
| Types of evals (code vs. LLM-Judge) | 48:31–53:02 |
| Validating evals and PM's role | 57:38–62:58 |
| Controversies and misconceptions | 69:57–80:01 |
| Tips, tricks, and getting started | 86:37–91:55 |
| Lightning round & closing thoughts | 98:04–END |
