Podcast Summary
Episode Overview
Podcast: How I AI
Host: Claire Vo
Episode: "Evals, error analysis, and better prompts: A systematic approach to improving your AI products"
Guest: Hamel Husain (ML engineer)
Date: October 13, 2025
In this episode, Claire Vo and Hamel Husain dive deep into the practical, systematic approaches for debugging, evaluating, and improving AI-powered products. The conversation explores foundational concepts for PMs and AI builders, actionable workflows for error analysis, writing impactful evals, and iteratively optimizing prompts and system instructions. Hamel also shares a behind-the-scenes look at his personal AI-enabled workflow for running his business with tools like Claude and GitHub repos.
Key Discussion Points & Insights
1. The Foundation: Data-Driven Quality in AI Products
- Direct User Data is Key:
Both Claire and Hamel emphasize the importance of examining real user interactions with AI products. Synthetic inputs and internal tests often miss the edge cases and ambiguity found in live data.
- "When I'm testing my own AI, I ask it good questions and I spell correctly... But when you see a real user input like this, you actually look at what users are prompting your AI with, you realize it's very vague." – Claire (10:37)
- The Concept of Traces:
Hamel introduces "traces"—detailed logs of multi-turn AI interactions, including system prompts, user messages, tool calls, and responses.
(05:33–10:30)
- Example: In a leasing assistant, users submit jumbled queries like "hello there, what's up? two four month rent," and AI responses don't always align with intent.
2. Manual Review: The Hidden Superpower
- Start with Open Coding (Error Analysis):
Hamel stresses manually reading and annotating a sample of traces, focusing on the most upstream (root) error in each interaction, a process called "open coding."
- "It's dumb, but it's accessible to everybody and it works." – Hamel (14:30)
- "Assistant should have asked follow up questions about the question—what's up with four month rent? Because it's unclear user intent." – Hamel, annotating a trace (15:51)
- Prioritization via Counting:
After annotating 100+ samples, categorize and count recurring errors to discover systemic issues.
- "Counting remains powerful. And so you can count these issues... Now we know what we should be working on." – Hamel (17:20)
3. Systematic Error Analysis Workflow
- Steps:
- Collect and log traces from real user sessions.
- Randomly sample and manually annotate traces, noting the first “upstream” error.
- Categorize error notes into themes (often assisted by LLMs).
- Tally counts to discover and prioritize key problem areas.
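The tally step above can be sketched in a few lines of Python. The category names here are hypothetical examples of how free-text annotations might be bucketed, not labels taken from the episode:

```python
from collections import Counter

# Hypothetical per-trace annotations, each reduced to one error category.
annotations = [
    "transfer_handoff_confusion",
    "unclear_user_intent",
    "transfer_handoff_confusion",
    "tour_scheduling_error",
    "transfer_handoff_confusion",
    "unclear_user_intent",
]

# Count recurring categories and rank by frequency to prioritize work.
priorities = Counter(annotations).most_common()
for category, count in priorities:
    print(f"{category}: {count}")
```

With 100+ annotated traces, the top of this list becomes the prioritized backlog Hamel describes.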
- Example Outcomes:
For Nurture Boss, main issues included transfer/handoff confusion and faulty tour scheduling. (20:10)
4. Writing High-Value Evals
- Identify Key Evals from Error Analysis:
Write evals focused on actual, high-impact problems (e.g., transfer failures or tour scheduling), not generic metrics.
- "Before you get into all that stuff, you need to have some grounding in what eval you should even write because there's infinite eval." – Hamel (23:26)
- Reference-Based vs. Subjective Evals:
- Reference-based: Use code/tests if you can check for objective criteria (e.g., user IDs leaking in output).
- Subjective: Use LLM-as-judge, but only for narrow, binary, task-specific criteria.
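A reference-based eval of the kind mentioned above can be plain code. This minimal sketch checks for the episode's example of user IDs leaking into output; the `usr_` ID format is a hypothetical assumption, not something specified in the episode:

```python
import re

# Hypothetical ID format: internal user IDs look like "usr_" + 8 hex chars.
USER_ID_PATTERN = re.compile(r"\busr_[0-9a-f]{8}\b")

def eval_no_id_leak(response: str) -> bool:
    """Reference-based binary eval: pass if no internal user ID appears."""
    return USER_ID_PATTERN.search(response) is None

# Binary pass/fail, no LLM judge needed for this objective criterion.
print(eval_no_id_leak("Your tour is booked for Tuesday at 3pm."))
print(eval_no_id_leak("Booked for usr_1a2b3c4d on Tuesday."))
```

The point is that objective criteria deserve deterministic checks; LLM-as-judge is reserved for the narrow subjective cases.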
- Validation and Trust:
Always validate LLM-judged evals against a hand-labeled sample to avoid metric "drift" and loss of trust.
- "The worst thing you do as a product manager is start showing people evals, and then at some point the people's perception of the product...doesn't match the evals." – Hamel (30:20)
- See paper: "Who Validates the Validators?"
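The validation step can start as simply as measuring agreement between the LLM judge and human labels on a hand-labeled sample. A minimal sketch, with made-up verdict data:

```python
# Hypothetical paired verdicts on a hand-labeled sample: True = "pass".
# Each pair is (human_label, llm_judge_label) for one trace.
pairs = [
    (True, True), (True, True), (False, False),
    (True, False), (False, False),
]

# Raw agreement rate between the LLM judge and the human labels.
agreement = sum(h == j for h, j in pairs) / len(pairs)
print(f"judge/human agreement: {agreement:.0%}")
```

If agreement is low, fix the judge prompt (or the criterion) before reporting the metric to anyone; that is how the trust problem Hamel warns about is avoided.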
5. Building Better Prompts and System Instructions
- Iterative Prompt Engineering:
Use errors surfaced by evals to iteratively change, test, and optimize prompts.
- Simple fixes can yield big wins: e.g., including today's date in prompts to enable "tomorrow" queries.
- "Most people shouldn't get into fine tuning. But if you do all this eval stuff, fine tuning is basically free...those difficult examples where your AI is not getting it right, that's exactly the stuff you want to fine tune on." – Hamel (34:39)
- No Magic, Just Systematic Work:
There’s no secret prompt formula—improvement comes from structured, repeated, data-driven tweaks.
6. Analytics & Agentic System Mapping
- Analyze Agent Hand-offs:
Build tools like transition matrices to track agent/tool interactions and spot breakdowns.
- Handy for both error diagnosis and feature discovery/roadmapping. (38:15–40:34)
7. Hamel’s AI Business Stack: Claude, Gemini, and Repos
- Workflow Examples:
- Dedicated Claude projects for copywriting, legal, client proposals, FAQ generation, course content.
- Gemini models preferred for extracting information from YouTube/video content.
- Personal mono-repo on GitHub holding all prompts, notes, code, and resources for use with AI tools.
- "Obviously it should go in a repo…and prompts and tools to actually do something with that." – Claire (47:00)
Notable Quotes & Memorable Moments
- On error analysis:
"I'm sure our listeners expect some, like, magical system that does this automatically. And you're like, no, man. Just spend three hours of your afternoon, go through, read some of these chats, look at some of them with your human eyes, put one sentence notes on all of them..." – Claire (10:33 & 23:18)
- On building trust in metrics:
"The worst thing you do as a product manager is start showing people evals, and then at some point the people's perception of the product...doesn't match the evals. They're like, hey, it's broken, but the evals are showing that it's good. And that's the moment people lose trust in you." – Hamel (30:20)
- On writing prompts:
"There's no magic prompt engineering tricks. It's really like… there's a lot of experimentation you should engage in." – Hamel (34:39)
- On maintaining control:
"I don't want to be locked in, right, to any one provider. And so this [GitHub repo approach] is how I do that, essentially." – Hamel (47:52)
- On division of labor:
"A lot of times the product manager is the subject matter expert... the more you can do, the better. At some point you probably need a data scientist when it gets advanced. But those three roles—AI engineer, AI product manager, data scientist—are still operating on this problem, especially as you scale." – Hamel (48:27)
Timestamps for Key Segments
| Timestamp | Segment |
|-----------|---------|
| 04:29 | Foundations: Looking at Data for AI Product Quality |
| 05:33 | How to Analyze AI Traces—Real Example Demo |
| 10:37 | Importance of Real User Data vs. Synthetic or "Happy Path" |
| 14:30 | Systematic Error Analysis: Manual Coding & Categorization |
| 17:20 | Counting & Prioritizing Issues |
| 20:10 | Example Results from Nurture Boss Trace Analysis |
| 23:26 | Impact of Manual Error Review—Clients Are Delighted |
| 24:33 | Moving to Writing Domain-Specific Evals |
| 27:45 | Automated (Code-Based) Evals versus LLM-as-Judge Evals |
| 30:20 | Dangers of Untrustworthy Evals and the Need for Validation |
| 33:14 | Research: Who Validates the Validators? |
| 34:39 | How to Actually Improve System Prompts |
| 38:15 | Analytics for Agentic Systems, Transition Matrices |
| 41:34 | Hamel's Workflow: Claude Projects, Gemini for Video, GitHub Repo |
| 47:00 | "Second Brain" Concept Applied to AI Work via GitHub |
| 48:27 | Who Should Be Doing Annotations? Division of Labor |
| 51:25 | Back Pocket: Practical Prompting Tips (Especially for Writing) |
Actionable Takeaways
- Manual trace review, annotation, and categorization are the highest-leverage activities for improving AI product quality.
- Transform error notes into specific, prioritized work via counts and focus on the top issues.
- Write binary, task-specific evals—validate LLM judges against human ratings before trusting at scale.
- Expect to iterate: prompt improvement is trial and error, aided by good error analytics—not "magic."
- Maintain your own "second brain"—a repo of prompts, notes, and examples that integrate with your toolchain.
- Engage PMs, engineers, and SMEs directly in annotation and quality work; invite operational experts, too.
Resources & Where to Find Hamel
- Personal site: hamel.dev
- Twitter/X: @HamelHusain
- Maven course: AI evals and error analysis (featured on Lenny's List)
- Blog posts referenced in episode (see hamel.dev for links)
This episode is a must-listen (or must-read) for anyone shipping AI-driven products and looking for a reality-based approach to quality, reliability, and prompt iteration—free from hype and grounded in hands-on, systematic practice.
