Podcast Summary: The Growth Podcast
Episode: The PM’s Role in AI Evals: Step-by-Step
Host: Aakash Gupta
Guests: Hamel Hussain & Shreya Shankar (AI evaluation experts)
Date: July 11, 2025
Episode Overview
This episode explores the pivotal role of AI evaluations (“evals”) in building reliable, high-quality AI products. Host Aakash Gupta interviews industry experts Hamel Hussain and Shreya Shankar, whose evaluation frameworks are trusted by companies like OpenAI and Arise. The conversation demystifies AI evals, discusses their strategic value for Product Managers (PMs), presents step-by-step practical guidance, and shares hard-won lessons from real-world AI products—including GitHub Copilot, RAG systems, and more.
Key Discussion Points & Insights
Why AI Evals Matter for PMs
- Injecting Product Taste: Evals let PMs “inject their taste and judgment directly into the critical path of the AI product” ([00:02], [02:09] - Hussain).
- Iteration: Evals provide structured ways to rapidly gather feedback and improve products.
- Scalability: Well-designed evals can systematize PM judgment across products and teams.
- Quote:
“When you build that foundation of evals, you have immense leverage... It's a really quick way to exert lots of influence over the process and in a good way.”
—Hamel Hussain [03:50]
What Are Evals?
- Definition:
“An eval is some systematic measurement of some aspect of quality... what varies in an eval is the criterion (e.g., conciseness, accuracy) and how you measure it.”
—Shreya Shankar [05:29]
- Products typically require 3–10 different evals; no single metric suffices (a minimal sketch follows this list).
- Evals codify “vibe checks”—making internal PM taste explicit and scalable.
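To make the definition concrete, here is a minimal Python sketch (not from the episode) of one way to represent an eval as a named criterion plus a binary measurement; the criteria and thresholds below are illustrative assumptions.

```python
# Minimal sketch (not from the episode): an eval as a named criterion plus a
# binary pass/fail check. Criteria and thresholds are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    criterion: str                  # e.g., "conciseness", "accuracy"
    check: Callable[[str], bool]    # the binary measurement

# A real product typically carries several of these, one per quality aspect.
concise = Eval("conciseness", lambda out: len(out.split()) <= 120)
cites_source = Eval("cites a source", lambda out: "http" in out)

outputs = ["Refunds are accepted within 30 days. Details: http://example.com/policy"]
for e in (concise, cites_source):
    passes = sum(e.check(o) for o in outputs)
    print(f"{e.criterion}: {passes}/{len(outputs)} pass")
```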
The Case for Binary Criteria
- Why Binary?
Binary pass/fail rubrics outperform 1–5 scoring for both humans and LLMs (a hedged judge-prompt sketch follows this section).
- Calibrating numerical scales is difficult; binary criteria force clarity.
- LLMs perform reliably with binary judgment, less so with nuanced ratings.
- Quote:
“Binary judgments force you to make a pass/fail decision. And for the vast majority of people, that's the right choice.”
—Hamel Hussain [09:00]
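A hedged sketch of what a binary judge rubric might look like in practice; the prompt wording and the criterion are illustrative assumptions, not the guests' exact rubric.

```python
# Sketch of a binary judge rubric: the judge is asked for PASS or FAIL only,
# never a 1-5 rating. Wording and criterion are illustrative assumptions.
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Criterion: the answer directly addresses the user's question without filler.
Reply with exactly one word: PASS or FAIL.

Question: {question}
Answer: {answer}
"""

def to_label(judge_reply: str) -> bool:
    # Force a pass/fail decision; anything other than PASS counts as a failure.
    return judge_reply.strip().upper().startswith("PASS")

print(to_label("PASS"), to_label("Maybe a 4/5?"))   # True False
```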
Human Evals vs. Automated Evals
- Human “vibe checking” does not scale—evals operationalize subjective judgment so it can be shared and automated.
- LLMs as judges are feasible if criteria are concrete and binary; complex, subjective, or vague rubrics make automation unreliable.
- Quote:
“Your vibe checks are very important, but they don’t scale... Evals let you translate those checks into something concrete.”
—Shreya Shankar [07:47]
The Scientific Method in Evals
- Skepticism is vital: Be “skeptical of everything and do lots of experiments” ([15:00] - Hussain).
- Always validate LLM judge performance against your labeled ground truth (see the agreement sketch after this section).
- Measure and iterate on both the evaluator (model) and your own interpretive rubric.
- Memorable Analogy:
“It’s like playing whack-a-mole without evals... you keep hammering problems but don’t make progress.”
—Hamel Hussain [15:00]
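One way to validate a judge against labeled ground truth is to compare human pass/fail labels with the judge's decisions on the same examples and report agreement plus per-class rates; the sketch below assumes you already have both sets of labels.

```python
# Sketch: measure how well an LLM judge tracks human ground-truth labels.
def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    tp = sum(h and j for h, j in zip(human, judge))            # both say pass
    tn = sum((not h) and (not j) for h, j in zip(human, judge))  # both say fail
    fp = sum((not h) and j for h, j in zip(human, judge))      # judge too lenient
    fn = sum(h and (not j) for h, j in zip(human, judge))      # judge too strict
    return {
        "agreement": (tp + tn) / len(human),
        "true_positive_rate": tp / max(tp + fn, 1),
        "true_negative_rate": tn / max(tn + fp, 1),
    }

print(judge_agreement(human=[True, True, False, False],
                      judge=[True, False, False, False]))
```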
Error Analysis: The PM's Superpower
- The critical skill for PMs in AI evals is “error analysis”—systematically reviewing outputs, quantifying failure modes, and turning learnings into ongoing product improvements ([00:56], [56:11]).
- Inspired by social science research (“grounded theory,” “open coding”); see the tally sketch after this list:
- Start with freeform notes
- Cluster insights into failure modes
- Prioritize by frequency/impact
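A minimal sketch of the prioritization step, assuming freeform notes have already been tagged with failure modes (the notes and tags below are invented for illustration); it does not cluster anything automatically, it simply counts and ranks.

```python
# Illustrative tally of the "open coding" output: freeform notes already tagged
# with failure modes are counted and ranked. Notes and tags here are invented.
from collections import Counter

annotated_notes = [
    {"note": "answer ignored the user's date range",  "modes": ["ignored constraint"]},
    {"note": "cited a document that does not exist",  "modes": ["hallucinated source"]},
    {"note": "correct answer buried in 5 paragraphs", "modes": ["verbosity"]},
    {"note": "made up a URL again",                   "modes": ["hallucinated source"]},
]

failure_modes = Counter(mode for note in annotated_notes for mode in note["modes"])
for mode, count in failure_modes.most_common():   # prioritize by frequency
    print(f"{mode}: {count}")
```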
Domain-Specific Evals: No One-Size-Fits-All
- Off-the-shelf tools and generic metrics (e.g., “hallucination score”) rarely suffice for real products.
- PMs must define what “good” looks like for their own users and context.
- Quote:
“The foundation model labs... are very much focused on the general purpose benchmarks... but they’re at the same level as everyone else when it comes to domain-specific evals.”
—Hamel Hussain [42:51]
- This domain-specificity creates a moat for startups that “operationalize taste.”
Evals as the True Moat
- “Evals are the moat for AI products... Truly nothing else.”
—Shreya Shankar [46:20]
- Well-implemented evaluation pipelines enable rapid model swaps, easy fine-tuning, and defensible product quality.
- Your eval suite—with tightly aligned LLM judges and actionable metrics—forms the deepest competitive advantage.
When to Use Prompting, RAG, and Fine-Tuning
- Prompting: First line of improvement; communicate explicit requirements to LLMs.
- RAG (Retrieval-Augmented Generation): Use when LLM needs external/contextual information.
- Fine-Tuning: Only after exhausting prompting/RAG; expensive and requires ongoing maintenance.
- Framework: The “Three Gulfs” model explains when each method is appropriate ([53:07]; a routing sketch follows this list):
- Gulf of Specification: Is the desired behavior clearly described? Prompting helps here.
- Gulf of Generalization: Does the model lack capacity or context? Use RAG/fine-tuning.
- Gulf of Evaluation: Can you measure/assess if the LLM met your goals?
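As a rough illustration only, the Three Gulfs framing can be read as a routing rule from the gulf you believe you are facing to the first intervention to try; the mapping below paraphrases the list above and is an assumption, not the guests' code.

```python
# Rough paraphrase of the Three Gulfs as a routing rule (illustrative only):
# pick the first intervention based on which gulf you think you are facing.
INTERVENTIONS = {
    "specification":  "Rewrite the prompt: state the desired behavior explicitly.",
    "generalization": "Add RAG for missing context; reach for fine-tuning only if that fails.",
    "evaluation":     "Build or tighten evals so you can measure whether the goal was met.",
}

def next_step(gulf: str) -> str:
    return INTERVENTIONS[gulf.strip().lower()]

print(next_step("Specification"))
```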
Overfitting & Safe Practices
- Danger: Overfitting by designing evals/prompting based on the same test data.
- Solution: Always have a hidden test set for evaluation only ([35:15]).
- Suspiciously high accuracy (near 100%) is usually a red flag ([36:21]).
- Differentiate between regression (must-pass) and aspirational evals (show headroom for improvement).
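A minimal sketch of the two safeguards above: carving out a hidden test set that is never consulted while iterating on prompts, and treating regression evals as hard gates while aspirational evals only report headroom. The split ratio and reporting format are assumptions.

```python
# Sketch: hidden test set split plus regression-vs-aspirational reporting.
import random

def split_cases(cases: list, holdout_fraction: float = 0.3, seed: int = 0):
    # Iterate on prompts with the dev set only; the hidden set is for final checks.
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    return shuffled[cut:], shuffled[:cut]   # (dev set, hidden test set)

def report(name: str, results: list, must_pass: bool) -> bool:
    rate = sum(results) / len(results)
    tag = "gate" if must_pass else "aspirational"
    print(f"{name} [{tag}]: {rate:.0%}")
    return rate == 1.0 or not must_pass     # regression evals must be 100% to pass

dev_cases, hidden_cases = split_cases(list(range(20)))
report("regression: refund-policy answers", [True, True, True], must_pass=True)
report("aspirational: multi-step planning", [True, False, False], must_pass=False)
```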
Case Studies and Real-World Examples
- GitHub Copilot: Success required upfront investment in eval systems and automated “test harnesses” ([25:39]).
- Airbnb (pre-LLMs): Evals in ML carried over directly; robust evaluation essential for stochastic systems ([19:49], [21:27]).
- Search vs. LLMs: LLMs attend to context and salience differently than humans do, which necessitates different retrieval and evaluation strategies ([22:36]–[24:47]).
Lessons from the Field and the Importance of Interfaces
- Best teams custom-build labeling/annotation interfaces for high-quality feedback ([38:22]).
- The practice of “benevolent dictators”: Assign a single accountable evaluator to prevent committee paralysis in binary labeling ([64:23]).
- Interface design is a bottleneck (see chapters 10–11 of their course/book for in-depth guidance).
- Quote:
“PMs are like, vital in building AI products… We’re not going to have successful AI products across different domains unless we have good AI PMs.”
—Shreya Shankar [63:42]
Roadmap for Mastering Evals
(as taught in their course/book—see detailed breakdown at [56:11] onward)
- What is evaluation?
- Understand LLM strengths/weaknesses
- Error analysis (grounded theory, open/axial coding)
- Designing & validating LLM judges
- UI/interfaces for efficient labeling
- Multi-turn and RAG evaluation strategies
- Productionizing evals (CI/CD, automation); see the CI sketch after this list
- Cost optimization
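To illustrate the “productionizing evals” item, regression evals can be run as ordinary tests in CI; the sketch below uses pytest, and the file name, test cases, and generate() placeholder are assumptions standing in for a real pipeline.

```python
# test_regression_evals.py -- hypothetical file name; run with `pytest`
import pytest

REGRESSION_CASES = [
    ("What is your refund window?", "30 days"),
    ("Do you ship internationally?", "yes"),
]

def generate(question: str) -> str:
    # Placeholder: swap in your real model or pipeline call here.
    return "Our refund window is 30 days, and yes, we ship internationally."

@pytest.mark.parametrize("question,required", REGRESSION_CASES)
def test_answer_contains_required_fact(question, required):
    # Regression evals are must-pass: a failing case should block the release.
    assert required.lower() in generate(question).lower()
```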
Meta-Lessons: Creating Popular AI Courses (and Why They're Ending Theirs)
- Evals are a niche but high-value topic; smaller, focused, and well-structured cohorts are more effective ([86:27], [87:09]).
- The course’s value comes from addressing immediate real-world PM/pain points, not “timeless content” ([83:45]).
- Marketing, guest speakers, constant student feedback, and relentless iteration all contributed to their success ([90:09], [90:55]).
- Quote:
"It's like telling people to eat their vegetables. It's not really that popular... much easier to talk about agents. But at some point I just didn't care. We need to create the category."
—Hamel Hussain [87:09]
Notable Quotes & Memorable Moments
- Binary vs. Ratings for LLMs:
“[1–5 scaling is] a smell of intellectual laziness... Binary forces you to be clear about what you want.”
—Hamel Hussain [11:36]
- Error Analysis Value:
“A sizable portion of my clients... do the error analysis part and they're like, great, we're done. This is so much value.”
—Hamel Hussain [64:23]
- PMs as Leverage:
“PMs are like, vital in building AI products... We need this to really realize the vision of AI products changing people's lives.”
—Shreya Shankar [63:42]
- Evaluations as Moat:
“Evals are the moat for AI products. Truly nothing else.”
—Shreya Shankar [46:20]
- Hill Climbing (and Overfitting):
“If you're getting 100% accuracy in your evals, it's likely your evals are worthless because they're providing no signal.”
—Hamel Hussain [36:21]
Timestamps for Core Segments
- Evals’ strategic value for PMs: [00:02], [02:09]
- What is an eval? [05:29]
- Why binary matters: [09:00]
- LLMs as judges & challenges: [09:59], [14:14]
- Error analysis as skill: [56:11], [59:29]
- Avoiding overfitting in evals: [35:11], [36:21]
- Domain-specificity & moats: [42:51], [46:20]
- Prompt/RAG/fine-tuning framework: [53:07]
- Designing effective interfaces: [73:33]
- Course philosophy & business model: [77:35], [86:27], [87:09]
- PMs’ critical role: [63:42]
Where to Find the Experts
- Both are active on X (formerly Twitter) and publish blogs.
- Shreya Shankar: Email at shreyashankar@berkeley.edu
- Hamel Hussain: DMs open on X; details via Google search
Final Takeaways
- Evals are the backbone of iterative, reliable, and differentiated AI products.
- PMs should take the lead in defining, analyzing, and operationalizing deeply aligned evals.
- Do not rely on generic tools or metrics; your product’s success (and moat) depends on bespoke, well-crafted evaluations.
- Error analysis is not a one-time phase but an ongoing exercise that builds intuition for every AI PM.
- Systematic, credible evaluation is what turns “vibe checks” into high-velocity, scalable, world-class AI products.
For more resources and the full frameworks, see the guests’ newsletters, blogs, or course reader.
