Summary

Podcast Summary: "Why AI Evals Are the Hottest New Skill for Product Builders"

Lenny's Podcast: Product | Career | Growth
Host: Lenny Rachitsky
Guests: Hamel Husain & Shreya Shankar
Date: September 25, 2025

Episode Overview

In this deeply tactical episode, Lenny interviews Hamel Husain and Shreya Shankar, creators of the #1 AI evals course on Maven. Together, they demystify the emerging discipline of "AI evals": the practice of systematically measuring an AI application's behavior in order to improve it. Drawing on their experience teaching over 2,000 product managers and engineers, they walk through practical frameworks, concrete examples, best practices, and common pitfalls. The episode is a hands-on primer on why and how to do evals, aimed at any product builder.

What Are Evals? The Big Picture

[05:49] Hamel Husain:

"Evals is a way to systematically measure and improve an AI application. ... It really is, at its core, data analytics on your LLM application, and where necessary, creating metrics around things so you can measure what’s happening and then you can iterate and do experiments and improve."

Evals = systematic measurement of AI application quality
Not just tests; includes exploratory data analysis, error classification, prioritization, and ongoing improvement
"Unit test" is only a tiny part—most is understanding user experience, ambiguities, and real-world application breakdowns

[08:29] Lenny:

"Is it like unit tests for code?"

[08:35] Shreya Shankar:

"Unit tests are a very small part of that very big puzzle."

Key Concepts and Process

1. The Real-World Example: Nurture Boss (AI Assistant for Property Managers)

[10:06–22:45]

Hamel demos analyzing logs/traces from an actual AI product, "Nurture Boss."
Traces: Detailed logs of user–AI interactions, including system prompts, tool calls, and agent responses.
Error Analysis: Manually inspect logs and write “open codes”—simple notes marking what went wrong.
- Example: AI missed a customer opportunity for human handoff, produced janky conversation flows, hallucinated a virtual tour.

[18:32] Hamel Husain:

"You just write a quick note... Should have handed off to a human."

[21:56]

Keep note-taking quick and informal—don't try to perfectly classify everything immediately.
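
A minimal sketch of the sampling and open-coding step, assuming traces have already been exported as plain Python objects. The `Trace` fields, the `sample_traces` and `annotate` helpers, and the sample data are all hypothetical; they are not taken from Nurture Boss or from any observability tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical, simplified trace: one logged user-AI interaction."""
    trace_id: str
    system_prompt: str
    messages: list       # (role, text) tuples, including tool calls
    open_code: str = ""  # freeform note; empty means nothing noted yet


def sample_traces(traces, k, seed=0):
    """Draw a reproducible random sample of traces for manual review."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def annotate(trace, note):
    """Record the first (most upstream) problem seen in the trace."""
    trace.open_code = note
    return trace


# Made-up data: 200 placeholder traces, 50 sampled for review.
traces = [
    Trace(f"t{i}", "You are a leasing assistant...", [("user", "hi")])
    for i in range(200)
]
batch = sample_traces(traces, k=50)
annotate(batch[0], "Should have handed off to a human")
```

The note is deliberately a free-text string: at this stage nothing is classified, and one quick note per trace is enough.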

2. Open Coding, Axial Coding, and Categorization

[33:54] Shreya Shankar:

"An axial code basically is just a failure mode. ... Our goal is to get to these clusters of failure modes and figure out what is the most prevalent, so then you can go and attack that problem."

Open Codes: Freeform, first-pass notes (“janky conversation,” “hallucinated answer”).
Axial Codes: Synthesized categories (e.g., handoff issues, process violations, output formatting errors).
Use LLMs (e.g., Claude, ChatGPT, Gemini) to help cluster and categorize after initial labeling.
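
One way to picture the hand-off from open codes to axial codes is sketched below. It only builds the clustering prompt and applies a category mapping; the LLM call itself is left out. The function names, the prompt wording, and the `assignments` dictionary (a stand-in for LLM output that a human has reviewed) are assumptions made for illustration.

```python
def build_axial_prompt(open_codes):
    """Assemble a prompt asking an LLM to group open codes into failure modes."""
    numbered = "\n".join(
        f"{i}. {note}" for i, note in enumerate(open_codes, start=1)
    )
    return (
        "Below are freeform notes from reviewing AI assistant traces.\n"
        "Group them into a small set of failure-mode categories and "
        "assign each note to exactly one category.\n\n" + numbered
    )


def apply_axial_codes(open_codes, assignments):
    """Pair each note with its category; unassigned notes are flagged."""
    return [(note, assignments.get(note, "UNCATEGORIZED")) for note in open_codes]


open_codes = [
    "Should have handed off to a human",
    "Conversation flow is janky because of text message",
    "Offered virtual tour that does not exist",
]
prompt = build_axial_prompt(open_codes)

# Stand-in for categories proposed by an LLM and then checked by a person.
assignments = {
    open_codes[0]: "handoff issues",
    open_codes[2]: "hallucinated capability",
}
labeled = apply_axial_codes(open_codes, assignments)
```

Anything left as `UNCATEGORIZED` goes back to the reviewer, which keeps the human in charge of the final categories.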

[42:22] Shreya Shankar:

"...open codes have to be detailed. Right. You can't just say janky because if the AI is reading janky, it's not going to be able to categorize it."

3. Quantifying and Prioritizing Errors

[44:40–48:31]

Create a pivot table ("dumb and simple" is often best) to count error types.
Prioritization: Not all error types are equally important—fix the most business-critical and prevalent ones.
Some issues can be fixed simply (e.g., clarify a prompt), others warrant ongoing monitoring and evaluation.
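
The counting step can be as plain as the sketch below, which tallies axial codes with the standard library instead of a spreadsheet pivot table. The category names and counts are made-up sample data.

```python
from collections import Counter

# Made-up axial codes, one per reviewed trace that had a problem.
axial_codes = [
    "handoff issue",
    "conversation flow",
    "handoff issue",
    "hallucinated feature",
    "handoff issue",
    "conversation flow",
]

counts = Counter(axial_codes)
total = sum(counts.values())

# Most frequent failure modes first, with their share of all noted failures.
for mode, n in counts.most_common():
    print(f"{mode:<22} {n:>3} {n / total:6.1%}")
```

Frequency is only one input to prioritization; a rare failure mode can still be the most business-critical one.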

4. From Analysis to Automated Evals: Code-based and LLM-as-Judge

[48:46–52:09]

Code-based evaluator: For simple, obvious checks ("is response valid JSON?").
LLM as Judge: For nuanced, subjective, or fuzzy errors (e.g., “should this have been escalated to a human?”). Here, an LLM is prompted with specific criteria to pass/fail a trace.
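
A code-based evaluator for the "is the response valid JSON?" check might look like this minimal sketch. The function name and the sample responses are illustrative assumptions.

```python
import json


def is_valid_json(response_text):
    """Return True if the response parses as JSON, False otherwise."""
    try:
        json.loads(response_text)
        return True
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed text; TypeError: not a string at all.
        return False


responses = ['{"unit": "1BR", "available": true}', "{unit: 1BR}", None]
verdicts = [is_valid_json(r) for r in responses]
pass_rate = sum(verdicts) / len(verdicts)
```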

[52:16] Hamel Husain:

"What we have is a LLM-as-judge prompt for this one specific failure ... you want to make it binary. Because we want to simplify things. ... Is this good enough or not? Yes or no."

Best Practice: Avoid Likert scales or subjective scores; go for strict binary passes/fails to avoid ambiguity.
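
A binary LLM-as-judge can be sketched as a prompt for one specific failure mode plus a strict parser that accepts only PASS or FAIL. The prompt wording is an assumption, not the prompt shown in the episode, and `call_llm` is a hypothetical callable (prompt string in, response string out) standing in for whichever model client is used.

```python
JUDGE_PROMPT = """You are reviewing a conversation between a leasing assistant and a customer.
Failure mode under review: the assistant should have handed off to a human but did not.
Answer with exactly one word: PASS if the handoff behavior was acceptable, FAIL otherwise.

Conversation:
{conversation}
"""


def parse_verdict(raw):
    """Map judge output to True (pass) or False (fail); reject anything else."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return verdict == "PASS"


def judge_trace(conversation, call_llm):
    """Run the judge on one conversation using the supplied LLM callable."""
    return parse_verdict(call_llm(JUDGE_PROMPT.format(conversation=conversation)))


# Stub in place of a real model call, for illustration only.
def fake_llm(prompt):
    return "FAIL\n"


passed = judge_trace("User: Can you let me know when one is available?", fake_llm)
```

Raising on anything other than PASS or FAIL keeps score-like or hedged answers from slipping into the results.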

[60:56] Shreya Shankar:

"If you're a product manager and the person who's building the LLM judge eval has not done this, they're saying, like, oh, it agrees 75% of the time... go and ask them to go fix that."

Core Best Practices

Start with error analysis, not tests ([44:40] onward): Don’t rush into writing evals or buying tools. Look at your data.
Be the “benevolent dictator” ([25:12, 26:40]): Designate a domain expert (often a product manager) to own initial labeling and classification.
Use AI for synthesis, not for subjective judgment ([24:04]): LLMs can cluster notes, but can’t replace human context in spotting subtle UX/product errors.
Stop at “theoretical saturation” ([30:29]): Keep reviewing traces until you stop finding new failure types.
Automate where possible, but stay in the loop ([32:02]): Use LLMs and code to scale, but never fully hand off judgment.
Iterate on your categories and prompts ([43:15]): Refine as you learn; let the open/axial codes evolve.
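
The "keep reviewing until you stop finding new failure types" rule can be turned into a simple stopping check, sketched below. The function name and the window of 20 traces are arbitrary assumptions, not figures from the episode.

```python
def reached_saturation(codes_in_review_order, window=20):
    """True if the last `window` reviewed traces added no new failure type.

    `codes_in_review_order` holds one category label per reviewed trace,
    in the order the traces were reviewed.
    """
    if len(codes_in_review_order) <= window:
        return False  # not enough history to judge yet
    seen_before = set(codes_in_review_order[:-window])
    recent = set(codes_in_review_order[-window:])
    return recent <= seen_before


# Made-up review histories.
steady = ["handoff", "hallucination"] + ["handoff"] * 20
still_learning = ["handoff"] * 20 + ["garbled text messages"]

print(reached_saturation(steady))
print(reached_saturation(still_learning))
```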

Common Misconceptions and Debates

Top misconceptions ([84:30]):

You can "just automate evals" with AI:

"Can't the AI just eval it? That's the most common misconception. And people want that so much that people do sell it, but it doesn't work." – Hamel Husain [84:30]
Thinking evals are just pre-defined unit tests: In reality, much of the value comes from human-in-the-loop analysis.
Evals replace product management (PRD) processes: Evals are iterative, data-grounded complements, not a substitute.

Controversy:

Some AI leaders claim “vibes” (just using the product a lot) is enough—especially for tools where the builder and power user are the same (e.g., code agents).
Shreya and Hamel argue this works in narrow domains, but is impractical or even dangerous in complex, user-facing applications.
A/B tests vs. evals ([76:24]): A/B tests are another form of evals—both require systematic metrics. But without prior error analysis, A/B tests can miss or mis-prioritize key issues.

Concrete Advice for Getting Started

Steps to Build Effective AI Evals

1. Sample and Review Traces
- Do manual analysis (at least 40–100 traces) and label observed failure modes.
2. Synthesize Failure Modes (Axial Codes)
3. Count and Prioritize Errors
4. Design Automated Evals
- Use code for objective errors
- Use LLMs as binary judges for subjective errors
5. Validate Your Evals
- Check human–AI agreement; refine until alignment is high, especially on rare edge cases.
6. Integrate and Monitor
- Add evals to your CI pipeline and production monitoring. Track on dashboards.
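
For the validation step, a single agreement percentage can hide a judge that never catches failures: when failures are rare, a judge that always says "pass" still agrees with humans most of the time. The sketch below reports agreement alongside how often the judge matches humans on passes and on fails separately. The function name, the metric names, and the sample labels are assumptions for illustration.

```python
def judge_alignment(human, judge):
    """Compare judge verdicts with human labels (True = pass, False = fail)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    agreement = sum(h == j for h, j in pairs) / len(pairs)

    judged_on_passes = [j for h, j in pairs if h]
    judged_on_fails = [j for h, j in pairs if not h]

    # Share of human-passed traces that the judge also passed.
    pass_recall = (
        sum(judged_on_passes) / len(judged_on_passes) if judged_on_passes else None
    )
    # Share of human-failed traces that the judge also failed.
    fail_recall = (
        sum(not j for j in judged_on_fails) / len(judged_on_fails)
        if judged_on_fails
        else None
    )
    return {
        "agreement": agreement,
        "pass_recall": pass_recall,
        "fail_recall": fail_recall,
    }


# Made-up labels: nine good traces, one bad one, and a judge that always passes.
human_labels = [True] * 9 + [False]
judge_labels = [True] * 10
report = judge_alignment(human_labels, judge_labels)
```

In this made-up case the agreement is 0.9 while `fail_recall` is 0.0, so the judge misses the only failure a human flagged.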

Tips & Tricks ([86:37], [87:44]):

Don’t be afraid of the data; “You’re going to find ways of actionable improvement.”
Use LLMs to help organize and synthesize, but not to replace your own judgment.
Iterate; error analysis and eval design are ongoing processes.
Build custom tools if needed: It's easier than ever to build dashboards or annotation interfaces (using LLMs themselves!) to reduce friction.
“If you see something wrong, go fix it”—don't fetishize the eval suite ([90:15]).

Time Investment ([90:44]):

3–4 days for initial analysis/labelling and setup; after that, 30 minutes a week for ongoing updates.

Notable Quotes & Memorable Moments

[00:03] Hamel Husain:

"It's the highest ROI activity you can engage in."

[24:55] Shreya Shankar:

"Number one pitfall right here is people are like, let me automate this with an LLM."

[53:02] Shreya Shankar:

"Expert curated content on the Internet ... here's your LLM judge evaluator prompt. Here's a one to seven scale. ... Oh no, now we have to fight the misinformation again."

[61:45] Lenny Rachitsky:

"This is like the purest sense of what a product requirements document should be. Is this eval judge that's telling you exactly what it should be."

[62:58] Shreya Shankar:

"You're never going to know what the failure modes are going to be upfront, and you're always going to uncover new vibes that you think your product should have."

[70:19] Shreya Shankar:

"I think everyone is on the same side. I think the misconception is that people have rigid definitions of what evals is."

[81:59] Shreya Shankar:

"...they don't correlate with math problem solving, sorry to say."

Lightning Round (“Fun” Section Highlights)

Hamel: "Keep learning and think like a beginner."
Shreya: "Always try to think about the other side's argument. ... We're all much stronger together than if we start picking fights."
Top Product Discovery: Both love Claude Code (despite, or because of, the “built on vibes” meme!)
Favorite process: Shreya—error analysis; Hamel—removing friction to look at data; both see it as energizing and fun.

Final Words: Where to Learn More

Hamel: hamel.dev
Shreya: Google “Shreya Shankar” for her website and contact; “AI evals for engineers and product managers” for the course
Course perks: 160-page book, 10 months of AI “coursebot” access, active Discord

How to be helpful:

Ask questions, share real-world successes, write about your learnings so others can benefit (they encourage more teachers in the field!).

Summary Takeaways

Great AI products require systematic, human-in-the-loop "evals"—the top new skill for product builders.
Evals go far beyond "unit tests" and require direct engagement with user data and product experience.
AI is a powerful assistant in synthesizing and automating evals, but cannot replace human judgment.
The process: manual trace analysis ➔ categorization ➔ error quantification ➔ targeted automation ➔ ongoing monitoring.
Avoid magic bullets and “just buy a tool” mentality—start with data, not dogma.
Evals are not only for debugging—they drive real, actionable improvements and product success.

For anyone building AI products—this episode is definitive listening.

Key Timestamps

| Segment | Timestamp |
|---|---|
| Introduction & controversy | 00:00–01:09 |
| What are evals? Definition & context | 05:07–08:29 |
| Real-world eval walkthrough | 09:56–22:45 |
| Coding, categorizing, and iterating | 25:12–44:40 |
| Types of evals (code vs. LLM-Judge) | 48:31–53:02 |
| Validating evals and PM’s role | 57:38–62:58 |
| Controversies and misconceptions | 69:57–80:01 |
| Tips, tricks, and getting started | 86:37–91:55 |
| Lightning round & closing thoughts | 98:04–END |


Transcript

[00:00]
Lenny Rachitsky
To build great AI products, you need to be really good at building evals.
[00:03]
Hamel Husain
It's the highest ROI activity you can engage in. This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you're building an AI application, you just learn a lot.
[00:13]
Lenny Rachitsky
What's cool about this is you don't need to do this many, many times. For most products, you do this process once and then you build on it.
[00:18]
Shreya Shankar
The goal is not to do evals perfectly. It's to actionably improve your product.
[00:23]
Lenny Rachitsky
I did not realize how much controversy and drama there is around evals. There's a lot of people with very strong opinions.
[00:28]
Shreya Shankar
People have been burned by evals in the past. People have done evals badly, then they didn't trust it anymore, and then they're like, oh, I'm anti evals.
[00:36]
Lenny Rachitsky
What are a couple of the most common misconceptions people have with evals?
[00:39]
Hamel Husain
The top one is, we live in the age of AI. Can't the AI just eval it? But it doesn't work.
[00:45]
Lenny Rachitsky
A term that you used in your post that I love is this idea of a benevolent dictator.
[00:49]
Hamel Husain
When you're doing this open coding, a lot of teams get bogged down in having a committee do this. For a lot of situations, that's wholly unnecessary. You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste that you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.
[01:10]
Lenny Rachitsky
Today, my guests are Hamel Husain and Shreya Shankar. One of the most trending topics on this podcast over the past year has been the rise of evals. Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders. And since then, this has been a recurring theme across many of the top AI builders I've had on. Two years ago, I had never heard the term evals. Now it's coming up constantly. When was the last time that a new skill emerged that product builders had to get good at to be successful? Hamel and Shreya have played a major role in shifting evals from being an obscure, mysterious subject to one of the most necessary skills for AI product builders. They teach the definitive online course on evals, which happens to be the number one course on Maven. They've now taught over 2,000 PMs and engineers across 500 companies, including large swaths of the OpenAI and Anthropic teams, along with every other major AI lab. In this conversation we do a lot of show versus tell. We walk through the process of developing an effective eval, explain what the heck evals are and what they look like, address many of the major misconceptions with evals, give you the first few steps you can take to start building evals for your product and also share just a ton of best practices that Hamel and Shreya have developed over the past few years. This episode is the deepest yet most understandable primer you will find on the world of evals and honestly got me excited to write evals even though I have nothing to write evals for. I think you'll feel the same way as you watch this. If this conversation gets you excited, definitely check out Hamel and Shreya's course on Maven. We'll link to it in the show notes. If you use the code Lenny's List when you purchase the course, you'll get 35% off the price of the course. With that, I bring you Hamel Husain and Shreya Shankar.
This episode is brought to you by Fin, the #1 AI agent for customer service. If your customer support tickets are piling up, then you need Fin. Fin is the highest performing AI agent on the market with a 65% average resolution rate. Fin resolves even the most complex customer queries. No other AI agent performs better. In head-to-head bake-offs with competitors, Fin wins every time. Yes, switching to a new tool can be scary, but Fin works on any help desk with no migration needed, which means you don't have to overhaul your current system or deal with delays in service for your customers. And Fin is trusted by over 5,000 customer service leaders and top AI companies like Anthropic and Synthesia. And because Fin is powered by the Fin AI Engine, which is a continuously improving system that allows you to analyze, train, test and deploy with ease, Fin can continuously improve your results too. So if you're ready to transform your customer service and scale your support, give Fin a try for only 99 cents per resolution. Plus, Fin comes with a 90-day money-back guarantee. Find out how Fin can work for your team at fin.ai/lenny. That's fin.ai/lenny.
This episode is brought to you by dscout. Design teams today are expected to move fast, but also to get it right. That's where dscout comes in. dscout is the all-in-one research platform built for modern product and design teams. Whether you're running usability tests, interviews, surveys or in-the-wild field work, dscout makes it easy to connect with real users and get real insights fast. You can even test your Figma prototypes directly inside the platform. No juggling tools, no chasing ghost participants. And with the industry's most trusted panel plus AI-powered analysis, your team gets clarity and confidence to build better without slowing down. So if you're ready to streamline your research, speed up decisions and design with impact, head to dscout.com to learn more. That's d-s-c-o-u-t dot com, the answers you need to move confidently.
Hamel and Shreya, thank you so much for being here and welcome to the podcast.
[05:05]
Hamel Husain
Thank you for having us.
[05:05]
Shreya Shankar
Yeah, super excited.
[05:07]
Lenny Rachitsky
I'm even more excited. Okay, so a couple years ago I had never heard the term evals. Now it's one of the most trending topics on my podcast. Essentially that to build great AI products, you need to be really good at building evals. Also, turns out some of the fastest growing companies in the world are basically building and selling and creating evals for AI labs. I just had the CEO of Mercor on the podcast, so there's something really big happening here. I want to use this conversation to basically help people understand this space deeply. But let's start with the basics. Just what the heck are evals for folks that have no idea what we're talking about? Give us just a quick understanding of what an eval is. And let's start with Hamel.
[05:49]
Hamel Husain
Sure. Evals is a way to systematically measure and improve an AI application. And it really doesn't have to be scary or unapproachable at all. It really is, at its core, data analytics on your LLM application in a systematic way of looking at that data and where necessary, creating metrics around things so you can measure what's happening and then so you can iterate and do experiments and improve.
[06:22]
Lenny Rachitsky
So that's a, that's a really good broad way of thinking about it. And if you go one level deeper, just to give people a very even more concrete way of imagining and visualizing what we're talking about, even if you have a example to show, would be even better. What's a. What's an even deeper way of understanding what an eval is?
[06:36]
Hamel Husain
Let's say you have a real estate assistant, you know, application, and it's, it's not working the way you want, it's not writing emails to customers the way you want, or it's not, you know, calling the right tools or any number of errors. And before evals, you would be left with guessing. You would Maybe fix a prompt and hope that you're not breaking anything else with that prompt. And you might rely on vibe checks, which is totally fine. And vibe checks are good and you should do vibe checks initially. But it can become very unmanageable very fast because as your application grows, it's really hard to rely on vibe checks. You just feel lost. And so evals help you create metrics that you can use to measure how your application is doing and kind of give you a way to improve your application with confidence. You have a feedback signal in which to iterate against.
[07:45]
Lenny Rachitsky
So just to make it very real. So imagining this real estate agent, maybe they're helping you book a listing or go see an open house. The idea here is you have this agent talking to people. It's answering questions, pointing them to things. As a builder of that agent, how do you know if it's giving them good advice, good answers? Is it telling them things that are completely wrong? So the idea of eval is essentially is to build a set of tests that tell you is how often is this agent doing something wrong that you don't want it to do. And there's a bunch of ways wrong. You could define wrong. It could be just making up stuff, it could be just answering in a really strange way. The way I think about eval is and tell me if this is wrong just simply is like unit tests for code. And then you're smiling, you're like, no, you idiot.
[08:30]
Shreya Shankar
Oh, that's not what I was thinking.
[08:32]
Lenny Rachitsky
Okay, okay, tell me, tell me, how does that feel as a metaphor?
[08:35]
Shreya Shankar
So, okay, I like what you said first, which is we had a very broad definition. Evals is a big spectrum of ways to measure application quality. Now unit tests are one way of doing this. Maybe there are some non-negotiable functionalities that you want your AI assistant to have and unit tests are going to be able to check that. Now maybe you also, because these AI assistants are doing such open-ended tasks, you kind of also want to measure how good are they at very vague or ambiguous things like responding to new types of user requests or figuring out if there's new distributions of data, new users are coming and using your real estate agent that you didn't even know would use your product. And then all of a sudden you think like, oh, there's a different way. You want to kind of accommodate this new group of people. So evals could also be a way of looking at your data regularly to find these new cohorts of people. Evals could also be like metrics that you just want to track over time. Like you want to track people saying, yes, thumbs up. I liked your message. You want to very, very basic things that are not necessarily AI related, but can go back into this flywheel of improving your product. So I would say in the end, overall, unit tests are a very small part of that very big puzzle.
[09:56]
Lenny Rachitsky
Awesome. You guys actually brought an example of an eval just to show us exactly what the hell we're talking about. We're talking in these big ideas. So how about let's pull one up and show people. Here's what an eval is.
[10:07]
Hamel Husain
Yeah, let me just set the stage for it a little bit. So to echo what Shreya said, it's really important that we don't think of evals as just tests. There's a common trap that a lot of people fall into because they jump straight to the test, like, let me write some tests. And usually that's not what you want to do. You should start with some kind of data analysis to ground what you should even test. And that's a little bit different than software engineering, where you have a lot more expectations of how the system is going to work. With LLMs, it's a lot more surface area. It's very stochastic. So we kind of have a different flavor here. And so the example I'm going to show you today, it's actually a real estate example. It's a different kind of real estate example. It's from a company called Nurture Boss. I can share my screen to show you their website just to help you understand this use case a little bit. So let me share my screen. So this is a company that I worked with called Nurture Boss, and it is a AI assistant for property managers who are managing apartments. And it helps with various tasks such as inbound leads, customer service, booking appointments, so on and so forth. Like all the different sort of operations you might be doing as a property manager, it helps you with that. And so you can see kind of what they do. It's a very good example because it has a lot of the complexities of a modern AI application.
[11:40]
Hamel Husain
So there's lots of different channels that you can interact through the AI with, like chat, text, voice, but also there's tool calls, lots of tool calls for, like, booking appointments, getting information about availability, so on and so forth. There's also rag retrieval, getting information about customers and properties and things like that.
[12:06]
Hamel Husain
So it's pretty fully fledged in terms of an AI application. And so they have been really generous with me in allowing me to use their data as a teaching example. And so we have anonymized it. But what I'm going to walk through today is, okay, let's create. Let's do the first part of how we would say, start to build evals for Nurture Boss. Like, why would we even want to do that? So let's go through the very beginning stage, what we call error analysis, which is let's look at the data of their application and first start with what's going wrong. So I'm going to jump to that next and I'm going to open an observability tool. And you can use whatever you want here. I just happen to have this data loaded in a tool called Braintrust, but you can load it in anything. You know, it's not. We don't have a favorite tool or anything. In the blog post that we wrote with you, we had the same example, but in Arize Phoenix. And I think Aman, on your blog post, used Arize Phoenix as well. And there's also LangSmith. So these are kind of like different tools that you can use. So what you see here on the screen, this is logs from the application. And let me just show you how it looks. So what you see here is. And let me make it full screen. So this is one particular interaction that a customer had with the Nurture Boss application. And what it is is a detailed log of everything that happened. So it's called a trace, and it's just the engineering term for logs of a sequence of events. The concept of a trace has been around for a really long time, but it's especially really important when it comes to AI applications. So we have all the different components and pieces of information that the AI needs to do its job, and we have logged all of it, and we're looking at the view of that. And so you see here a system prompt. The system prompt says you are an AI assistant working as a leasing team member at Retreat at Acme Apartments.
Remember, I said this is anonymized, so that's why the name is Acme Apartments. Your primary role is to respond to text messages from both residents and prospective, both current residents and prospective residents. Your goal is to provide accurate, helpful information, yada, yada, yada. And then there's a lot of detail around guidelines of how we want this thing to behave.
[14:56]
Lenny Rachitsky
Is this their actual system prompt, by the way? For this company?
[14:59]
Hamel Husain
It is, yes. It's a real system prompt.
[15:01]
Lenny Rachitsky
That's amazing because that's really... it's rare you see an actual company product system prompt. That's like their crown jewels a lot of times. So this is actually very cool on its own.
[15:08]
Hamel Husain
Yeah, yeah, it's really cool. And you know, you see all these different sort of features that they want, or different use cases. So things about tour scheduling, handling applications, guidance on how to talk to different personas, so on and so forth. And you can see the user just kind of jumps in. Here it says, ask, okay, do you have a one bedroom with study available? I saw it on virtual tours. And then you can see that the LLM calls some tools, it calls this get individuals information tool. And it pulls back that person's information and then it gets the community's availability. So it's, you know, it's querying a database with the availability for that apartment complex. And then finally the AI responds, hey, we have several one bedroom apartments available, but none specifically listed with the study. Here are a few options. And then it says, can you let me know when one with the study is available? Then it says, I currently don't have specific information on the availability of a one bedroom apartment. User says, thank you. And the AI says, you're welcome. If you have any more questions, feel free to reach out. Now this is an example of a trace. And this is we're looking at one specific data point. And so one thing that's really important to do when you're doing data analysis of your LLM application is to look at data. Now you might wonder, there's a lot of these logs, it's kind of messy, there's a lot of things going on here. How in the hell are you supposed to look at this data? Do you want to just drown in this data? How do you even analyze this data? So it turns out there is a way to do it that is completely manageable and it's not something that we invented. It's been around in machine learning and data science for a really long time. And it's called error analysis. And what you do is, is the first step in conquering data like this is just to write notes. Okay?
So you got to put your product hat on, which is why we're talking to you, because product people have to be in the room and they have to be involved in sort of doing this. You know, usually a developer is not suited to do this, especially if it's not a coding application.
[17:47]
Lenny Rachitsky
And just to mirror back why I think you're saying that: it's because this is the user experience of your product. People talking to this agent is the entire product, essentially. And so it makes sense for the product person to be involved, super involved, in this.
[17:59]
Hamel Husain
Yeah. So let's, let's reflect on this conversation. Okay. A user asked about availability. The AI said, oh, we don't really have that. Have a nice day. Now, for a product that is helping you with lead management, is that good? Like, do you feel like this is the way we want it to go?
[18:30]
Lenny Rachitsky
Not ideal.
[18:32]
Hamel Husain
Yes, not ideal. And I'm glad you said that. A lot of people would say, oh, it's great. Like, the AI did the right thing. It said we don't. It looked, said we didn't have available, and it's not available. But with your product hat on, you know that's not correct. And so what you would do is you would just write a quick note here. You would say, okay, you might pop in here, let me just. And you can write a note. So every observability application has ability to write notes. And you wouldn't try to figure out if something is wrong in this application. In this case, it's kind of not doing the right thing. But you just write a quick note. Should, you know, should have handed off to a human.
[19:19]
Lenny Rachitsky
And as we watch this happening, it's like you mentioned this and you'll explain more. You're doing this. This feels very manual and unscalable. But as you said, this is just one step of the process. And there's a system to this. And it's just the first part.
[19:31]
Hamel Husain
And you don't have to do it for all of your data. You sample your data and just take a look. And it's surprising how much you learn when you do this. Everyone that does this immediately gets addicted to it. And they say this is the greatest thing that you can do. When you're building an AI application, you just learn a lot. You're like, this is not how I want it to work. Okay. And so that's just an example. So you write this note and then we can go on to the next trace. So this is the next trace. I just pushed a hotkey on my keyboard. Let me go back to looking at it.
[20:09]
Lenny Rachitsky
And these tools make it easy to go through a bunch and add these notes quickly.
[20:13]
Hamel Husain
Yes. And so this is another one, similar system prompt. We don't need to go through all of it again. We'll just jump right into the user question. Okay, I've been texting you all day. Maybe it's funny. And the user says, please. Okay, yeah, this one is. This one is just like an error in the application. Where, you know, this is a text message application. And so, you know, it's a tech. The, sorry, the channel through which the customer is communicating is through text message. And you're just getting like really garbled. And you can see here that it kind of doesn't make sense, you know, like the words are being cut off, like in the meantime. And then the system doesn't know how to respond because you know how people text message, they like write short phrases, they, you know, split, split their sentence across four or five different turns. So in this case, what do you.
[21:16]
Lenny Rachitsky
Do with something like that?
[21:16]
Hamel Husain
Yeah, so this is a, this is a different kind of error. This is more of, hey, we're not handling this interaction correctly. This is more of a technical problem rather than, hey, the AI is not doing exactly what we want.
[21:30]
Lenny Rachitsky
So we would write like, it's amazing. You're catching that too here. Otherwise you'd have no idea this was happening.
[21:35]
Hamel Husain
Yeah, you might not know this is happening. Right. And so you would just say, okay, you would write a note like, conversation flow is janky because of text message.
[21:51]
Lenny Rachitsky
And I like, yeah, I like that. I like that you're using the word janky. It shows you just how informal this can be at this stage.
[21:56]
Hamil Hussain
Yeah, it's supposed to be chill. Just don't overthink it. And there's a way to do this. The question always comes up: how do you do this? Do you try to find all the different problems in this trace? What do you write a note about? And the answer is, just write down the first thing that you see that's wrong, the most upstream error. Don't worry about all the errors. Just capture the first thing that you see that's wrong, stop, and move on. And you can get really good at this. The first two or three can be very painful, but after that you can do a bunch of them really fast. So here's another one, and let's skip the system prompt again. The user asks, "Hey, I'm looking for a two to three bedroom with either one or two baths. Do you provide virtual tours?" A bunch of tools are called, and it says, "Hi Sarah, currently we have a three bedroom, two and a half bathroom apartment available for $2,175. Unfortunately, we don't have any two bedroom options at the moment. We do offer virtual tours. You can schedule a tour," blah, blah. It just so happens that there's no virtual tour, right? So it is hallucinating something that doesn't exist. And you kind of have to bring your context as an engineer, or even product context, and say, hey, this is kind of weird. We shouldn't be telling a person about a virtual tour when it's not offered. So you would say, okay, "offered virtual tour," and you just write the note. You can see there's a diversity of different kinds of errors that we're seeing, and we're actually learning a lot about the application in a very short amount of time.
[23:55]
Shreya Shankar
One common question that we get from people at this stage is, okay, I understand what's going on. Can I ask an LLM to do this process for me?
[24:04]
Lenny Rachitsky
Great question.
[24:05]
Shreya Shankar
And I loved Hamel's most recent example, because what we usually find when we try to ask an LLM to do this error analysis is that it just says the trace looks good, because it doesn't have the context needed to understand whether something might be a bad product smell or not. For example, the hallucination about scheduling the virtual tour. I can guarantee you, I would bet money on this: if I put that into ChatGPT and asked, "Is there an error?" it would say, "No, it did a great job." But Hamel had the context of knowing, oh, we don't actually have this virtual tour functionality. So I think in these cases, it's so important to make sure you are manually doing this yourself. And we can talk a little bit more about when to use LLMs in the process later. But the number one pitfall right here is people saying, let me automate this with an LLM.
[24:55]
Lenny Rachitsky
Do you think we'll get to a place where an agent can do this?
[24:58]
Shreya Shankar
Oh, no, no, no, no. Sorry. There are parts of error analysis that an LLM is suited for, which we can talk about later in this podcast. But right now, this stage of freeform note-taking is not the place for an LLM.
[25:12]
Lenny Rachitsky
And this is something you call open coding this step.
[25:15]
Shreya Shankar
Absolutely.
[25:16]
Lenny Rachitsky
Another term that you used in your post that I love and that fits into this step is this idea of a benevolent dictator. Maybe just talk about what that is and maybe Shreya cover that.
[25:27]
Shreya Shankar
Yeah. So Hamel actually came up with this term.
[25:30]
Lenny Rachitsky
Okay. Maybe Hamel will cover the answer.
[25:33]
Hamil Hussain
No problem. And we'll actually show the LLM automation in this example, because we're going to take this example and go all the way through.
[25:40]
Lenny Rachitsky
Amazing.
[25:41]
Hamil Hussain
And so benevolent dictator is just a catchy term for the fact that when you're doing this open coding, a lot of teams get bogged down in having a committee do this. And for a lot of situations that's wholly unnecessary. People get really uncomfortable with, okay, we want everybody on board, we want everybody involved, so on and so forth. You need to cut through the noise. In a lot of organizations, if you look really deeply, especially small and medium-sized companies, you can appoint one person whose taste you trust. You can do this with a small number of people, and often one person. And it's really important to make this tractable. You don't want to make this process so expensive that you can't do it. You're going to lose out. So that's the idea behind benevolent dictator: you need to simplify this across as many dimensions as you can. Another thing that we'll talk about later is that when you go to building an LLM as a judge, you need a binary score. You don't want to think about, is this a 1, 2, 3, 4, or 5, and assign a score to it. That's going to slow it down.
[26:59]
Lenny Rachitsky
Just to make sure this benevolent dictator point is really clear, basically this is the person that does this note taking and ideally they're the expert on the stuff. So if it's law stuff, maybe there's like a legal person that owns this. It could be a product manager. Give us advice on who this person should be.
[27:16]
Hamil Hussain
Yeah, it should be the person with domain expertise. So in this case it would be the person who understands the business of leasing, apartment leasing and has context to understand if this makes sense. It's always a domain expert like you said. Okay. For legal it would be a law person. For mental health it would be the mental health expert. Whether that's like a psychiatrist or someone else.
[27:41]
Lenny Rachitsky
Cool.
[27:42]
Hamil Hussain
Oftentimes it is the product manager. Cool.
[27:44]
Lenny Rachitsky
So the advice here: pick that person. It may not feel super fair that they're the one in charge and they're the dictator, but they're benevolent. It's going to be okay.
[27:53]
Hamil Hussain
Yeah, it's going to be okay. You're just trying to, it's not perfection. You're just trying to make progress and get signal quickly so you have an idea of what to work on. Because it can become infinitely expensive if you're not careful.
[28:07]
Shreya Shankar
Yeah.
[28:07]
Lenny Rachitsky
Okay, cool. Let's go back to your examples.
[28:10]
Hamil Hussain
Yeah, no problem. So this is another example where we have someone saying, "Okay, do you have any specials?" And the assistant, or the AI, responds, "Hey, we have a 5% military discount." The user responds and switches the subject: "Can you tell me how many floors there are? Do you have any one bedrooms available, or one bedrooms on the first floor?" And the AI responds, "Yeah, okay, we have several one bedroom apartments available." And then the user wants to confirm whether any of those are on the first floor and how much the one bedrooms are. And then also, it's a current resident, so they're also saying, "I need a maintenance request." You can see the messiness of the real world in here. And the assistant just calls a tool that says transfer call, but it doesn't say anything. It just abruptly does the transfer call. So it's pretty jank, I would say. Another kind of jank, a different kind of jank. But when you write the open note, you don't want to just say "jank," because when we look at the notes later on, we want to understand what happened. So you just want to say, "did not confirm call transfer with user." It doesn't have to be perfect. You just have to have a general idea of what's going on.
[29:39]
Lenny Rachitsky
Cool.
[29:40]
Hamil Hussain
So, okay, let's say we do a bunch of these. Shreya and I recommend doing at least 100 of these. The question is always, how many of these do you do? There's not a magic number. The reason we say 100 is that we know that once you do 20 of these, you will automatically find it so useful that you will continue doing it. So we just say 100 to mentally unblock you, so it's not intimidating: don't worry, you're only going to do 100. And there is a term for this. The right answer is to keep looking at traces until you feel like you're not learning anything new. Maybe Shreya should talk about that.
[30:30]
Shreya Shankar
Yeah, so there's actually a term in data analysis and qualitative analysis called theoretical saturation. What this means is: when you do all of these processes of looking at your data, when do you stop? It's when you are theoretically saturated, that is, you're not uncovering any new types of notes, new types of concepts, or anything that will materially change the next part of your process. And this takes a little bit of intuition to develop. So typically people don't really know when they've reached theoretical saturation yet. That's totally fine. When you do two or three examples or rounds of this, you will develop the intuition. A lot of people realize, oh, okay, I only need to do 40. I only need to do 60. Actually, I only need to do, like, 15. It depends on the application, and it depends on how savvy you are with error analysis, for sure.
[31:25]
Lenny Rachitsky
And your point about, you probably want to. You're going to want to do a bunch. I imagine it's because you're just like, oh, I'm discovering all these problems. I got to see what else is going on here.
[31:34]
Shreya Shankar
Exactly. And I promise, at some point you're not going to discover new types of problems.
[31:39]
Lenny Rachitsky
Yeah. Awesome. So let's say you did a hundred of these. What's the next step?
[31:42]
Hamil Hussain
Yeah, okay, so you did a hundred of these. Now you have all these notes. So this is where you can start using AI to help you. So the part where you looked at this data is important, like we discussed. You don't want to automate this part too much.
[31:59]
Lenny Rachitsky
Humans will still have jobs. This is a takeaway here. That's great.
[32:02]
Hamil Hussain
Yes.
[32:03]
Lenny Rachitsky
Just reviewing traces. At least there's one job left for now.
[32:06]
Hamil Hussain
Yeah, exactly. And so, okay, you have all these notes. Now, to turn this into something useful, you can do basic counting. Basic counting is the most powerful analytical technique in data science because it's so simple, and it's kind of undervalued in many cases. So it's very approachable for people. The first thing you want to do is take these notes and categorize them with an LLM. There's a lot of different ways to do that. Right before this podcast, I took three different coding agents, or AI tools, and had them categorize these notes. So one is, I uploaded a CSV of these notes into a Claude project. I just exported them directly from this interface. There's a lot of different ways to do this, but I'm showing you the simple, stupid way, the most basic way of doing things. I dumped a CSV in here, and I said, please analyze the following CSV file. I told it there's a metadata field that has a note in it. But I used the term open codes and said, hey, I have different open codes. That's a term of art. LLMs know what open codes are, and they know what axial codes are, because it is a concept that's been around for a really long time. So those words help me shortcut what I'm trying to do.
[33:46]
Lenny Rachitsky
That's awesome. And the end of the prompt is telling it to create axial codes.
[33:50]
Hamil Hussain
Yes, creating axial codes. So what it does is.
[33:54]
Shreya Shankar
So maybe it's worth talking about what axial codes are, or what the point is here. You have a mess of open codes, and you don't have 100 distinct problems. Actually, many of them are repeats, but you phrased them differently, right? And you shouldn't have tried to create your taxonomy of failures as you were open coding. You just want to get down what's wrong and then organize: okay, what's the most common failure mode? So an axial code basically is just a failure mode. It's the label or category. And our goal is to get to these clusters of failure modes and figure out what is the most prevalent, so then you can go and attack that problem.
[34:36]
Lenny Rachitsky
That is really helpful. Basically just synthesizing all these categories and themes. Super cool. And we'll include this prompt in our show notes for folks so they don't have to sit there and screenshot it and try to type it out themselves.
[34:49]
Hamil Hussain
Yeah, great idea. And so Claude went ahead and analyzed the CSV file, decided how to parse it, blah blah, we don't need to worry about all that stuff. But it came up with a bunch of axial codes. Basically, axial codes are categories, like Shreya said. So one is, okay, capability limitations, misrepresentation process, protocol violations, human handoff issues, communication quality. It created these categories. Now, do I like all the categories? Not really. I like some of them. It's a good first stab at it. I would probably rename them a little bit, because some of them are a bit too generic. Like, what is capability limitations? That seems a little bit too broad. It's not actionable. I want to get a little bit more actionable with it, so that if I do decide it's a problem, I know what to do with it. But we'll discuss that in a little bit. So you can do this with anything. And this is the dumbest way to do it. But dumb sometimes is a good way to get started.
[35:50]
Lenny Rachitsky
And this is what LLMs are really good at. Taking a bunch of information, synthesizing.
[35:53]
Shreya Shankar
Absolutely, synthesizing for us to make sense of. Note that it's not automatically proposing fixes or anything. That's our job. But now we can wade through this mess of open codes a lot easier. Another thing that's interesting here, in this prompt to generate the axial codes, is that you can be very detailed if you want. You can say, I want each axial code to actually be some actionable failure mode, and maybe the LLM will understand that and propose it. Or, I want you to group these open codes by what stage of the user story they're in. So this is where you can be creative, or do what's best for you as a product manager or engineer working on this, and that will help you do the improvement later.
[36:40]
Lenny Rachitsky
So there's no definitive prompt of here's the one way to do it. You're saying you can iterate and see what works for you.
[36:46]
Shreya Shankar
Absolutely.
[36:47]
Lenny Rachitsky
It's interesting. The tools don't do this. Or do they try and they just don't do a great job?
[36:50]
Shreya Shankar
No, I don't think they do it. We've been screaming from the rooftops, please, please do this. I do think it's a little bit hard, right? Part of this whole experience with the evals course Hamel and I are teaching is that a lot of people don't actually know this. So maybe it's that people don't know this, and they don't know how to build tools for it. And hopefully we can demystify some of this magic.
[37:13]
Lenny Rachitsky
And just to double click on this point, this is not a thing everyone does or knows. This is something you two developed based on your experience doing data analysis and data science at other companies.
[37:24]
Shreya Shankar
Well, I want to caveat: we didn't invent error analysis. We don't actually want to invent things. That's a bad signal. If somebody is coming to you with a way to do something that's entirely new and not grounded in hundreds of years of theory and literature, then you should, I don't know, be a little bit wary of that. But what we tried to do was distill, okay, what are the new tools and techniques that you need to make sense of LLM error analysis? And then we created a curriculum, or structured way, of doing this. So this is all very tailored to LLMs, but the terms open coding and axial coding are grounded in social science.
[38:04]
Lenny Rachitsky
Amazing. Okay, what's funny about you guys doing this is I just want to go do this somewhere. I don't have an AI product to do this on, but it's just like, oh, this would be so fun. Just sit there and find all the problems I'm running into, categorize them, and then try to fix them. Delightful. I love that Hamel pulled up a video. What do you got going on here?
[38:22]
Hamil Hussain
Yeah, so I pulled up a video just to drive home Shreya's point. We are not inventing anything. What you see on the screen here is Andrew Ng, one of the famous machine learning researchers in the world, who has taught a lot of people machine learning, frankly. And you can see this is an 8 year old video, and he's talking about error analysis. So this is a technique that's been used to analyze stochastic systems for ages. We're just using the same machine learning ideas and principles and bringing them into here, because again, these are stochastic systems.
[39:01]
Lenny Rachitsky
Awesome. Well, one thing we're working on is getting Andrew on the podcast. We're chatting, so that'll be really fun too. I love that. My other podcast episode, which just came out today, is in your feed there, and it's standing out really well in that feed. So I'm really happy about that thumbnail.
[39:13]
Hamil Hussain
Very nice. Yeah, the recommendation algorithm is.
[39:15]
Lenny Rachitsky
Yes, here we go. I hope you click on that. Don't screw my algorithm. Okay, cool. So we've done some synthesis. I know we're not going to go through the entire process. You have a whole course that takes many days to learn this whole process. What else do you want to share about how to go about this process? Okay.
[39:31]
Hamil Hussain
So you can do this through anything. I've used the same thing, and it works just fine in ChatGPT, the same exact prompt. You can see it made axial codes. I really like using Julius AI. It's one of my favorite tools. Julius is a third-party tool that uses notebooks. I personally like Jupyter notebooks a lot. It's more of a data science thing, but a lot of product managers are learning notebooks nowadays, and it's kind of cool. It's like a fun playground where you can write code and look at data. We don't have to go deeply into that. I just wanted to mention you can use a lot of tools. AI is really good at this. So let's go to the fun part.
[40:12]
Lenny Rachitsky
Here we go.
[40:13]
Hamil Hussain
So now we have these axial codes. So the first thing I like to do: I have these open codes, and I have the axial codes that, let's say, we assigned from the Claude project or ChatGPT. What I do is I collect them first and I take a look. Do these axial codes make sense? I look at the correspondence between the different axial codes and the open codes, and I go through an exercise and I say, do I like these codes? Can I make them better? Can I refine them? Can I make them more specific? Instead of being generic, I make them very specific and actionable. So you see the ones that I came up with here are tour scheduling or rescheduling issues, human handoff or transfer issues, formatting errors with an output, conversational flow (we saw the conversational flow issue with the text messages), and follow-up promises not kept. And so what you can do now is, you have these axial codes, and I just collect them into a list. This is an Excel formula that just collects these codes into a list, and now we have a comma-separated list of these codes. Then what you can simply do is take your notes, those open codes, and tell an AI (and this is using Gemini AI, just for simplicity; again, we're trying to keep it simple): categorize the following note into one of the following categories.
[41:56]
Lenny Rachitsky
By the way, for folks watching, I like all these different prompts and formulas you're sharing. This is like the Google Sheets AI prompt.
[42:04]
Shreya Shankar
Huge fan.
[42:05]
Hamil Hussain
Yeah. And so basically what you can do is you can then have, you can categorize your traces into one of the buckets. And that's what we have here. We have categorized all those problems that we encountered into one of these things.
[42:22]
Shreya Shankar
And this is automatic, which is very exciting. I mean the AI is doing it. So this also drives home the point that your open codes have to be detailed. Right. You can't just say janky because if the AI is reading janky, it's not going to be able to categorize it. Even a human wouldn't. Right. It would have to go. And remember why you said janky. So it's important to be somewhat detailed in your open code.
[42:45]
Lenny Rachitsky
Okay. So avoid the word janky is a good rule of thumb.
[42:51]
Hamil Hussain
I was being funny.
[42:52]
Lenny Rachitsky
Yeah. Okay. What are some of those other words just that people often use that you think are not good?
[42:57]
Shreya Shankar
I don't think it's specific words. I think it's just people are not detailed enough in the open code so it's hard to do the categorization.
[43:04]
Lenny Rachitsky
Great. And by the way, the reason you have to map them back is because, say, Claude or ChatGPT gave you suggestions, and you changed them and iterated on them. So you can't just go back and say, cool, what's in each bucket?
[43:16]
Hamil Hussain
Yeah, yeah, great. That's a really good question, actually. It's good to iterate and think about it a little bit. Like, do I like these open codes? Do these actually make sense to me? Just like anything that AI does, it's really good to kind of put yourself in the middle, just a little bit.
[43:32]
Lenny Rachitsky
Of a loop, still space for us. Great.
[43:34]
Hamil Hussain
Yeah.
[43:34]
Shreya Shankar
One of the things that I like to do in this step, if I'm trying to use AI to do this labeling, is also have a new category called none of the above. So an AI can actually say none of the above in the axial code. And that informs me, okay, my axial codes are not complete. Like, let's go look at those open codes. Let's figure out what some new categories are. Or figure out how to reword my other axial code.
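A minimal sketch of the labeling step described above, with the "none of the above" escape hatch. The prompt wording follows the one quoted in the episode; the category strings are paraphrased from the axial codes Hamel lists, and the function names and the reply-cleaning rules are assumptions made for illustration. No particular LLM API is assumed: the sketch only builds the prompt and validates whatever text comes back.

```python
# Axial codes paraphrased from the episode; replace with your own refined list.
AXIAL_CODES = [
    "tour scheduling or rescheduling issue",
    "human handoff or transfer issue",
    "formatting error with output",
    "conversational flow issue",
    "follow-up promise not kept",
]
NONE_LABEL = "none of the above"


def build_categorization_prompt(note: str, categories: list[str]) -> str:
    """Build the prompt for one open code (hypothetical helper)."""
    options = ", ".join(categories + [NONE_LABEL])
    return (
        "Categorize the following note into one of the following categories: "
        f"{options}.\n"
        "Reply with the category name only.\n\n"
        f"Note: {note}"
    )


def normalize_label(raw_reply: str, categories: list[str]) -> str:
    """Map a model reply onto a known category, else 'none of the above'."""
    cleaned = raw_reply.strip().strip(".").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    # Anything unrecognized is treated as a signal that the codes are incomplete.
    return NONE_LABEL
```

Notes that come back as "none of the above" are the ones to reread by hand, since they suggest a missing or poorly worded axial code.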
[44:00]
Lenny Rachitsky
Awesome. And what's cool about this is you don't need to do this many, many times. Like, for most products, you do this process once, and then you build on it, I imagine, and you just tweak it over time.
[44:09]
Shreya Shankar
Absolutely. And it gets so fast. Like, people. People do this, like, once a week. And you can do all of this in, like, 30 minutes. And like, suddenly your product is, like, so much better than if you were never aware of any of these problems.
[44:23]
Lenny Rachitsky
Yeah. It's absurd to feel like you don't. You wouldn't know this is happening. Like, watching this happening. I'm like, how could you not do this to your product?
[44:30]
Shreya Shankar
A lot of people have no idea.
[44:32]
Hamil Hussain
Most people.
[44:32]
Lenny Rachitsky
Yeah, we'll talk about that. There's a whole debate around this stuff that we want to talk about. Okay, cool. So you have this. You have the sheet. What comes next?
[44:40]
Hamil Hussain
Okay, so here's the big unveil. This is the magic moment right now. So we have all these codes, the ones that we like, that we applied to our traces. Now you can do the ta-da: you can count them. So here's a pivot table. We can just do a pivot table on those and count how many times those different things occurred. So what do we find on these traces that we categorized? We found 17 conversational flow issues. And I really like pivot tables because you can do cool things. You can double click on these. You can say, oh, okay, let me take a look at those. But that's going into an aside about pivot tables and how cool they are. Now we have just a nice rough cut of what our problems are. And now we have gone from chaos to some kind of thinking around, oh, you know what? These are my biggest problems. I need to fix conversational flow issues. Or maybe these human handoff issues; the count is not necessarily the most important thing. That might be something that's just really bad, and you want to fix that. But, okay, now you have some way of looking at your problem, and now you can think about whether you need evals for some of these. There might be some of these things that are just dumb engineering errors that you don't need to write an eval for, because it's very obvious how to fix them. Take the formatting error with output: maybe you just forgot to tell the LLM how you want it to be formatted, and you didn't even say that in the prompt. So just go ahead and fix the prompt. And then we can decide, okay, do you want to write an eval for that? You might still want to write an eval for that, because you might be able to test that with just code. You could just test the string: does it have the right formatting, potentially without running an LLM? So there's a cost benefit trade off to evals. You don't want to get carried away with it, but you usually want to ground yourself in your actual errors. You don't want to skip this step. And the reason I'm spending so much time on this is that this is where people get lost. They go straight into evals: let me just write some tests. And that is where things go off the rails. Okay, so let's say we want to tackle one of these things. For example, let's say we want to tackle this human handoff issue, and we're like, I'm not really sure how to fix this. That's a kind of subjective judgment call on whether we should be handing off to a human, and I don't know immediately how to fix it. It's not super obvious, per se. Yeah, I can change my prompt, but I'm not 100% sure. Well, that might be an interesting thing for an LLM as a judge, for example. So there's different kinds of evals. One is code based, which you should try to do if you can, because they're cheaper. LLM as a judge is something like a meta eval: you have to eval that eval to make sure the LLM that's judging is doing the right thing, which we'll talk about in a second. So, okay, LLM as a judge, that's one thing. How do you build an LLM as a judge?
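The counting step from this turn can be sketched in a few lines of Python as a stand-in for the spreadsheet pivot table. The trace IDs and labels below are made-up illustration data, not the numbers from the episode, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical (trace_id, axial_code) pairs; real ones would come from
# the labeled sheet described above.
labeled_traces = [
    ("t1", "conversational flow issue"),
    ("t2", "human handoff or transfer issue"),
    ("t3", "conversational flow issue"),
    ("t4", "formatting error with output"),
    ("t5", "conversational flow issue"),
    ("t6", "human handoff or transfer issue"),
]


def count_failure_modes(rows):
    """Return (axial_code, count) pairs, most frequent first."""
    return Counter(code for _, code in rows).most_common()


def traces_for(rows, axial_code):
    """Rough analogue of double-clicking a pivot row: the traces behind a count."""
    return [trace_id for trace_id, code in rows if code == axial_code]
```

As noted in the conversation, the count is only a rough cut for prioritizing; a rare failure mode can still be the one most worth fixing.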
[48:31]
Lenny Rachitsky
Before we get into that, actually, just to make sure people know exactly what you're describing there these two types of evals. One is, you said it's code based, one is LLM as judge. Maybe Shreya, just help us understand what a code based eval even is. It's like essentially a unit test. Is that a simple way to think about it?
[48:47]
Shreya Shankar
Maybe eval is not the right term here, but think automated evaluator. When we find these failure modes, one of the things we want is, okay, can we now go check the prevalence of that failure mode in an automated way, without me manually labeling and doing all the coding and the grouping? I want to run it on thousands and thousands of traces. I want to run it every week. Then, okay, you should probably build an automated evaluator to check for that failure mode. Now, when we're saying code based versus LLM based, we're saying, okay, maybe I could write a Python function or a piece of code to check whether that failure mode is present in a trace or not. And that's possible to do for certain things, like checking that the output is JSON, or checking that it's markdown, or checking that it's short. These are all things you can capture in code, or approximately capture in code. When we're talking about an LLM judge here, we're saying that this is a complex failure mode and we don't know how to evaluate it in code. So maybe we will try to use an LLM to evaluate this very, very narrow, specific failure mode of handoffs.
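As a rough illustration of the code-based evaluators mentioned here, the two checks below test whether an output parses as JSON and whether it is short. The function names and the 60-word default are assumptions chosen for the sketch; a real threshold would come from your own product requirements.

```python
import json


def output_is_json(output: str) -> bool:
    """Pass if the model output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def output_is_short(output: str, max_words: int = 60) -> bool:
    """Pass if the output has at most max_words whitespace-separated words.

    The default limit is an arbitrary placeholder.
    """
    return len(output.split()) <= max_words
```

Each check returns a plain pass or fail for a single narrow property, which keeps it cheap to run over many traces without calling an LLM.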
[49:56]
Lenny Rachitsky
So just to try to mirror back what you're describing: you want to test what your, say, agent or AI product is doing. You ask it a question, and it gets back with something. One way to test whether it's giving you the right answer, if it's consistently doing the same thing, is that you could write code to tell you this is true or false. For example, will it ever say there's a virtual tour? So you could ask it, do you provide virtual tours? It says yes or no. And then you could write code to tell you if it's correct based on that specific answer. But if you're asking about something more complicated, and it's not binary, then in one world you need a human to tell you this is correct. The solution to avoid humans having to review all this every time is LLMs replacing human judgment. And you'd call it LLM as judge: the LLM being the judge of whether this is correct or not.
[50:47]
Shreya Shankar
Absolutely. You nailed it. So people always think, like, oh, this is at least as hard as my problem of creating the original agent. And it's not because you're asking the judge to do one thing, evaluate one failure mode. So the scope of the problem is very small, and the output of this LLM judge is like, pass or fail. So it is a very, very tightly scoped thing that LLM judges are very capable of doing very reliably.
[51:18]
Lenny Rachitsky
And the goal here is just to have a suite of tests that run before you ship to production that tell you things are going the way you want them to, the way your agent is interacting.
[51:28]
Shreya Shankar
The beautiful thing about LLM judges is that you can use them in unit tests or CI, sure. But you could also use them online for monitoring. I can sample, like, a thousand traces every day, run my LLM judge on real production traces, and see what the failure rate is there. This is not a unit test, right? But still, now we get an extremely specific measure of application quality.
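The monitoring idea can be sketched as sampling traces and computing the share that a binary judge flags. Here `judge` is a hypothetical callable standing in for an LLM judge (or a code-based check) that returns True when the failure mode is present; the function names and the fixed seed are assumptions for the sketch.

```python
import random
from typing import Callable, Sequence


def sample_traces(traces: Sequence[str], k: int, seed: int = 0) -> list[str]:
    """Draw up to k traces at random (seeded so a run is reproducible)."""
    rng = random.Random(seed)
    return rng.sample(list(traces), min(k, len(traces)))


def failure_rate(traces: Sequence[str], judge: Callable[[str], bool]) -> float:
    """Fraction of traces for which the judge reports the failure mode.

    `judge` is a placeholder for whatever binary evaluator you have validated.
    """
    if not traces:
        return 0.0
    failures = sum(1 for trace in traces if judge(trace))
    return failures / len(traces)
```

Because the judge is binary, the result reads directly as "this failure mode showed up in X% of sampled traces," which is easier to interpret than an averaged score.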
[51:53]
Lenny Rachitsky
Cool. That's a really great point, because a lot of people dismiss evals as being this not-real-life thing, a thing that you test before it's actually in the real world, and not what's actually happening in the real world. You're saying you could, and should, actually do exactly that: test your real thing running in production. And it's like a daily, hourly sort of thing you could be running.
[52:09]
Hamil Hussain
Totally awesome.
[52:11]
Lenny Rachitsky
Okay, Hamel's got an example of an actual LLM as judge eval here, so let's take a look.
[52:16]
Hamil Hussain
I love how Shreya really teed it up for me, so thank you so much. So what we have is an LLM as judge prompt for this one specific failure. Like Shreya said, you would want to do one specific failure, and you want to make it binary, because we want to simplify things. We don't want, hey, score this on a rating of 1 to 5, how good is it? In most cases, that's a weasel way of not making a decision. No, you need to make a decision. Is this good enough or not? Yes or no. It can be painful to think about what that is, but you should absolutely do it. Otherwise this thing becomes very intractable. And then when you report these metrics, no one knows what 3.2 versus 3.7 means.
[53:03]
Shreya Shankar
Yeah, we see this all the time, even with expert-curated content on the Internet, where it's like, oh, here's your LLM judge evaluator prompt, here's a one to seven scale. And I always text Hamel, like, oh no, now we have to fight the misinformation again, because we know somebody is going to try it out and then come back to us and say, oh, I have a 4.2 average. And we're going to be like, okay.
[53:31]
Lenny Rachitsky
It's wild how much drama there is in the evals space. We're going to get to that. Oh man.
[54:46]
Hamil Hussain
Okay, so this is your judge prompt. There's no one way to do it. It's okay to use an LLM to help you create it, but again, put yourself in the loop. Don't just blindly accept what the LLM does. In all of these cases, that's what we did; like with the axial codes, we kind of iterated on this. You can use an LLM to help you create this prompt, but make sure you read it, make sure you edit it, whatever. This is not necessarily the perfect prompt. This is just keeping it very, stupidly simple, just to show you the idea. So for this handoff failure, I said, okay, I want you to output true or false. It's a binary judge. That's what we recommend. And then I just go through and say, okay, when should you be doing a handoff? And I just list them out: explicit human request ignored or looped, some policy-mandated transfer, sensitive resident issues, tool data unavailability, same-day walk-in or tour requests (you need to talk to a human for that), and so on and so forth. Right? And so the idea is, now that I know that this is a failure from my data, I'm interested in iterating on it, because I know this is actually happening all the time. And like Shreya said, it would be nice to have a way to evaluate this not only on the data I have, but also on production data, just to get a sense of, well, at what scale is this happening? Let me find more traces. Let me have a way to iterate on this. And so we can take this prompt, and I'm going to use a spreadsheet again. So the first step is, okay, I wrote the judge prompt. Now a lot of people stop there and they say, okay, I have my judge prompt, we're done. Good. Let's just ship it. If the judge says it's wrong, it's wrong. They just accept it as the gospel. Like, okay, the LLM says it's wrong, so it must be wrong.
Don't do that, because that's the fastest way to end up with evals that don't match what's going on. And when people lose trust in your evals, they'll lose trust in you. So it's really important that you don't do that. Before you release your LLM as a judge, you want to make sure it's aligned to the human. So how do you do that? You have those axial codes, and you want to measure your judge against the axial codes and say, hey, does it agree with me? Does my own judge agree with me? Just measure it. And so what we have here is, okay, I say, "Assess this LLM trace." Again, I'm using just spreadsheets here. "Assess this LLM trace according to these rules." And the rules are just the prompt that I just showed you, and I ask it, okay, is there a handoff error? True or false? So then, let me just zoom in a bit: in column H I have the judge's answer, did this error occur? And column G is whether I thought the error occurred or not.
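A rough sketch of what a binary judge prompt along these lines could look like in code. The handoff criteria are the ones listed in the conversation; the exact prompt wording, the function names, and the strict one-word output format are assumptions made for illustration, not the prompt shown on screen. The actual LLM call is left out on purpose.

```python
# Handoff situations named in the episode; the wording is lightly normalized.
HANDOFF_CRITERIA = [
    "Explicit human request ignored or looped",
    "Policy-mandated transfer",
    "Sensitive resident issues",
    "Tool data unavailability",
    "Same-day walk-in or tour requests",
]

def build_judge_prompt(trace: str) -> str:
    """Assemble a single-failure, binary judge prompt (hypothetical wording)."""
    rules = "\n".join(f"- {c}" for c in HANDOFF_CRITERIA)
    return (
        "Assess this LLM trace according to these rules.\n"
        "The assistant should hand off to a human in these situations:\n"
        f"{rules}\n\n"
        "Is there a handoff error? Answer with exactly one word: true or false.\n\n"
        f"Trace:\n{trace}"
    )

def parse_verdict(raw: str) -> bool:
    """Accept only a binary answer; anything else (such as a 1-5 score) is rejected."""
    answer = raw.strip().lower()
    if answer not in {"true", "false"}:
        raise ValueError(f"judge did not return a binary verdict: {raw!r}")
    return answer == "true"
```

The parsed verdict is what would sit next to the human label for each trace, like the two spreadsheet columns described above.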
[57:53]
Lenny Rachitsky
You can see you're going through it manually. You do that.
[57:55]
Hamil Hussain
Yeah, which we already did. We already went through it manually, so it's not like we have to do it again, because we kind of have that cheat code from the axial coding. You might have to go through it again if you need more data. And there are a lot of details on how to do this correctly. You want to split your data and do all these things so that you're not cheating. But I just want to show you the concept. Basically, what you can do is measure the agreement. Now, one thing you should know as a product manager is that a lot of people go straight to this agreement number. They say, okay, my judge agrees with the human some percentage of the time. That sounds appealing, but it's a very dangerous metric to use, because a lot of times errors only happen on the long tail, and they don't happen as frequently. So if you only have the error 10% of the time, then you can easily get 90% agreement by just having a judge say it passes all the time. Does that make sense? So 90% agreement might look good on paper, but it can be misleading because the error is rare. So as a product manager, even if you're not doing this calculation yourself, if someone ever reports agreement to you, you should immediately ask, okay, tell me more. You need to look into it. To give you more intuition, here is a matrix for this specific judge in the Google Sheet. This is again a pivot table, just keeping it dumb and simple. On the rows I have: what did the human think? What did I think? Did it have an error, true or false? And then: did my judge say it had an error, true or false?
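The arithmetic behind that warning can be checked in a few lines. This is a small sketch with made-up labels (10 errors in 100 traces, matching the 10% example); the function and variable names are illustrative.

```python
def agreement(human_labels, judge_labels):
    """Fraction of traces where the judge's verdict matches the human label."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Hypothetical labels: True means "handoff error occurred".
human = [True] * 10 + [False] * 90   # the error shows up in 10% of traces
lazy_judge = [False] * 100           # a judge that always says "pass"

print(agreement(human, lazy_judge))  # 0.9

# The same judge catches none of the ten real errors.
errors_caught = sum(h and j for h, j in zip(human, lazy_judge))
print(errors_caught)  # 0
```

A judge that never flags anything still scores 90% agreement here, which is why the raw agreement number needs the follow-up question.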
[59:56]
Shreya Shankar
The intuition here is exactly what Hamel said, right? You need to look at each type of error: when the human said false but the judge said true, or vice versa. So those are the non-green, off-diagonal cells here, and if they're too large, then go iterate on your prompt. Make it clearer to the LLM judge so that you can reduce that misalignment. You're going to have some misalignment. That's okay. We also talk in our course about how to correct that misalignment. But at this stage, if you're a product manager and the person who's building the LLM judge eval has not done this, if they're saying, "Oh, it agrees 75% of the time, we're good," and they don't have this matrix and they haven't iterated to make sure that these two types of errors have gone down to zero, then it's a bad smell. Go and ask them to go fix that.
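A sketch of the pivot-table view in code, assuming human and judge labels are booleans where True means the error occurred. The labels below are invented for illustration, and summarizing the two off-diagonal cells as true-positive and true-negative rates is one common way to read the matrix, not necessarily the exact layout in the spreadsheet.

```python
from collections import Counter

def confusion_counts(human_labels, judge_labels):
    """Count (human, judge) label pairs, like a 2x2 pivot table."""
    return Counter(zip(human_labels, judge_labels))

def judge_rates(human_labels, judge_labels):
    """Summarize how often the judge agrees with the human on each class."""
    c = confusion_counts(human_labels, judge_labels)
    tp, fn = c[(True, True)], c[(True, False)]    # human said error
    fp, tn = c[(False, True)], c[(False, False)]  # human said no error
    return {
        "missed_errors": fn,   # human: error, judge: pass
        "false_alarms": fp,    # human: pass, judge: error
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }

# Hypothetical labels: 10 real errors in 100 traces.
human = [True] * 10 + [False] * 90
judge = [True] * 8 + [False] * 2 + [True] * 5 + [False] * 85

print(judge_rates(human, judge))
```

Here overall agreement is 93%, yet the judge misses 2 of the 10 real errors and raises 5 false alarms; those two cells are the ones to drive down by iterating on the prompt.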
[60:52]
Lenny Rachitsky
Awesome. That's a really good tip: what to look for when someone's doing this wrong.
[60:56]
Shreya Shankar
Yeah.
[60:57]
Lenny Rachitsky
Actually, can you take us back to the LLM-as-judge prompt? I just want to highlight something really interesting here. I've had some guests on the podcast recently who've been saying evals are the new PRDs. And if you look at this, this is exactly what this is. Product managers, product teams write: here's what the product should be, here are all the requirements, here's how it should work. They build the thing, and then they test it, often manually. What's cool about this is that it's exactly that same thing, and it's running constantly. It's telling you how this agent should respond in very specific ways. If it's this, this, this, do that. If it's this, this, that, do that. And so it's exactly what I've been hearing again and again. You can see right here, this is the purest sense of what a product requirements document should be: this eval judge that's telling you exactly what it should be. And it's automatic and running constantly.
[61:46]
Shreya Shankar
Yeah, absolutely. And it's kind of derived from our own data. So of course it's a product manager's expectations. What I find that a lot of people miss is they just put in what their expectations are before looking at their data. But as we look at our data, we uncover more expectations that we couldn't have dreamed up in the first place, and that ends up going into this prompt.
[62:06]
Lenny Rachitsky
So that is interesting. So your advice is not to skip straight to evals and LLM-as-judge prompts before you build the product. Still write traditional one-pagers and PRDs to tell your team what we're doing, why we're doing it, what success looks like. But then at the end, you could probably pull from that and even improve that original PRD if you're evolving the product using this process.
[62:27]
Shreya Shankar
I would go even further and say you're going to improve your... it's going to change. You're never going to know what the failure modes are going to be up front, and you're always going to uncover new vibes that you think your product should have. You don't really know what you want until you see it with these LLMs. So you've got to be kind of flexible. You have to look at your data. PRDs are a great abstraction for thinking about this, but they're not the end-all, be-all. It's going to change.
[62:58]
Lenny Rachitsky
I love that. And Hamel's pulling up some cool research report. What's this about?
[63:03]
Hamil Hussain
Oh, this is one of the coolest research reports you can possibly read if you want to know about Evals. So it was authored by someone named Shreya Shankar.
[63:13]
Shreya Shankar
Oh, my God.
[63:15]
Hamil Hussain
And her collaborators. And so it's called "Who Validates the Validators?"
[63:20]
Lenny Rachitsky
That is the best name for a research paper.
[63:22]
Shreya Shankar
Thank you so much.
[63:24]
Hamil Hussain
So I should let Shreya talk about this. I think one of the most important things to pay attention to in this paper is the criteria drift and what she found.
[63:35]
Shreya Shankar
So we did this super fun study when we were doing user studies with people who were trying to write LLM judges or just validate their own LLM outputs. And we were. This was. I think this was before Evals was, like, extremely popular, I feel like on the Internet. This was. We did this project, like, late 2023 was when we started it. But then the thing that really was burning in my mind as a researcher was like, why is this problem so hard? We've been having machine learning and AI for so long, it's not new, but suddenly this time around, everything is really difficult. So we just did this user study with a bunch of developers, and we realized, okay, what's new here is that you can't figure out your rubrics up front. People's opinions of good and bad change as they review more outputs. They think of failure modes only after seeing 10 outputs they would never have dreamed of in the first place. And these are experts, right? These are people who have built many LLM pipelines and now agents before, and just. You can't ever dream up everything in the first place. And I think that's so key in today's world of AI development.
[64:50]
Lenny Rachitsky
Okay, that is a really good point. That's very much reinforcing what we were just talking about. And that's why Hamel pulled this up, is just...
[64:56]
Shreya Shankar
Okay, the research behind it.
[64:58]
Lenny Rachitsky
Yeah, okay, great. You've still got to do product the same way, but now you have this really powerful tool that helps you make sure what you've built is correct. It's not going to replace the PRD process. Cool. How many of these evals, how many LLM-as-judge prompts, say, do you usually end up with? Obviously it depends on the complexity of the product, but what's a number in your experience?
[65:19]
Shreya Shankar
For me, like between four and seven.
[65:22]
Hamil Hussain
Oh, that's it.
[65:23]
Shreya Shankar
It's not that many, because a lot of the failure modes, as Hamel said earlier, can be fixed by just fixing your prompt. You just didn't think to put it in your prompt, and now you've put it in your prompt. You shouldn't do an eval like this for everything. Just the pesky ones, where you've described your ideal behavior in your agent prompt but it's still failing.
[65:43]
Lenny Rachitsky
Got it. So say you found a problem and you fixed it. In traditional software development, you'd write a unit test to make sure it doesn't happen again. Is your insight here: don't even bother writing an eval around that if it's just gone?
[65:54]
Shreya Shankar
I think you can if you want to, but the whole game here is about prioritizing. You have finite resources and finite time. You can't write an eval for everything. So prioritize the ones that are the more pesky areas and probably the ones.
[66:08]
Lenny Rachitsky
That are most risky to your business. If they say something like MechaHitler, Grok. Cool. Okay, so that's very relieving, because this prompt was a lot of work to really think through all these details.
[66:21]
Shreya Shankar
But a lot of it is a one-time cost, right? Now, forever, you can run this on your application.
[66:27]
Lenny Rachitsky
Right?
[66:29]
Hamil Hussain
And I want to say, okay, data analysis is super powerful and is going to drive lots of improvements to your application very quickly. We showed the most basic kind of data analysis, which is counting, which is accessible to everyone. You can get more sophisticated with the data analysis. There are lots of different ways to sample and look at data. We kind of made it look easy in a sense, but there's a lot of skill to doing it well: building an intuition and a nose for how to sort through this data. For example, let's say I find these conversational flow issues. If I was trying to chase down this problem further, I would think about ways to find other conversational flow issues that I didn't code. I would maybe dig through the data in several ways, and there are different ways to go about this. It's very similar, if not almost exactly the same, as the traditional analytics techniques that you would do on any product.
[67:41]
Lenny Rachitsky
Give us just a quick sense of what comes next. And then let's talk about the debate around evals and a couple more things.
[67:48]
Shreya Shankar
So what comes next after you've built your LLM judge? Well, we find that people just try to use that everywhere they can. So they'll put the LLM judge in unit tests, and they will know, like, oh, here are some example traces where we saw that failure, because we labeled it. Now we're going to make those part of unit tests and make sure that every time we push a change to our code, these tests are going to pass. They also use it for online monitoring. People are making dashboards on this, and I think that's incredible. The products that are doing this right have a very sharp sense of how well their application is performing. And people don't talk about it because this is their moat, right? People are not going to go and share all of these things, which makes sense, right? If you are an email writing assistant and you're doing this and you're doing it well, you don't want somebody else to go and build an email writing assistant and then put you out of business. So I really want to stress the point: try to use these artifacts that you're building wherever possible, online, repeatedly. Use them to drive improvements to your product. Oftentimes, Hamel and I will tell people how to do this up to this very point, and it clicks for people, and then they never come back again. So either they have, I don't know, quit their jobs and they're not doing AI development anymore, or they know what to do from here on out. I think it's the latter, but I think it's very powerful.
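One way the "labeled failures become unit tests" idea could be wired up, as a sketch only. Everything here is hypothetical: `stub_judge` stands in for an aligned LLM judge, `fixed_app` and `broken_app` stand in for two versions of the application, and the inputs are invented examples of messages from traces a human labeled as handoff failures.

```python
# Hypothetical user messages taken from traces a human labeled as handoff failures.
LABELED_FAILURE_INPUTS = [
    "I want to talk to a real person.",
    "Can I come by for a tour today?",
]

def stub_judge(trace: str) -> bool:
    """Stand-in for the aligned LLM judge: True means a handoff failure."""
    return "handing you off" not in trace.lower()

def fixed_app(message: str) -> str:
    """Stand-in for the application after the prompt fix."""
    return f"User: {message}\nAssistant: Handing you off to the leasing team now."

def broken_app(message: str) -> str:
    """Stand-in for the application before the fix."""
    return f"User: {message}\nAssistant: I can help with that! What else do you need?"

def still_failing(app, judge, inputs):
    """Replay known-bad inputs through the app and return the ones the judge still flags."""
    return [message for message in inputs if judge(app(message))]

print(still_failing(fixed_app, stub_judge, LABELED_FAILURE_INPUTS))   # []
print(still_failing(broken_app, stub_judge, LABELED_FAILURE_INPUTS))  # both inputs
```

In CI, the check would be that `still_failing` returns an empty list for the current build; the same judge can then be pointed at sampled production traces for the dashboard use case.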
[69:15]
Lenny Rachitsky
Just watching you do this really opened my eyes to what this is and how systematic the process is. I always imagined you just sit at a computer and ask, okay, what are the things I need to make sure work correctly? And what you're showing us here is a very simple, step-by-step process based on real things that are happening in your product: how to catch them, identify them, prioritize them, and then catch them if they happen again and fix them.
[69:39]
Shreya Shankar
Yeah, it's not magic. Anyone can do this. You're going to have to practice the skill. Like any new skill, you have to practice, but you can do it. And I think what's very empowering now is that product managers are doing this and can do this and can really build very, very profitable products with this skill set.
[69:58]
Lenny Rachitsky
Okay, great segue to a debate that we kind of got pulled into that was happening on X the other day. I did not realize how much controversy and drama there is around evals. There are a lot of people with very strong opinions. Shreya, give us just a sense of the two sides of the debate around the importance and value of evals, and then give us your perspective.
[70:20]
Shreya Shankar
Yeah, so, all right, I'll be a little bit placating and say I think everyone is on the same side. I think the misconception is that people have very rigid definitions of what evals are. For example, they might think that evals are just unit tests, or they might think that evals are just the data analysis part, with no online monitoring and no monitoring of product-specific metrics, like the number of chats engaged in or whatnot. So I think everyone has a different mindset of evals going in. The other thing I will say is that people have been burned by evals in the past. I think people have done evals badly. One concrete example of this is they've tried to do an LLM judge, but it was not aligned with their expectations. They only uncovered this later on, and then they didn't trust it anymore. And then they're like, "Oh, I'm anti-evals." And I 100% empathize with that, because you should be anti-Likert-scale LLM judge. I absolutely agree with you. We are anti that as well. So a lot of the misconception stems from two things, right? People having a narrow definition of evals, and people not doing it well, getting burned, and then wanting to keep other people from making that mistake. And then, unfortunately, X, or Twitter, is a medium where people are misinterpreting what everybody is saying all the time. You just get all these strong opinions like, "Don't do evals. It's bad. We tried it. It doesn't work. We're Claude Code," or whatever other famous product, "and we don't do evals." And there's just so much nuance behind all of it, because a lot of these applications are standing on the shoulders of evals. Coding agents are a great example of that. Claude Code, right? They are standing on the shoulders of the Claude models (not base models, but the fine-tuned Claude models), which have been evaluated on many coding benchmarks. You can't argue against that.
[72:23]
Lenny Rachitsky
And just to make clear exactly what you're talking about there: one of the heads, I think maybe the head engineer, of Claude Code went on a podcast and he's like, "Oh, we don't do evals. We just vibe. We just look at vibes." And vibes meaning they just use it and feel if it's right or wrong.
[72:37]
Shreya Shankar
And I think that kind of works. So there's two things to that, right? One is they're standing on the shoulders of the evals that their colleagues are.
[72:44]
Hamil Hussain
Doing for coding, of the Claude...
[72:46]
Lenny Rachitsky
Foundational model.
[72:47]
Shreya Shankar
Absolutely right. We know that they report those numbers because we see the benchmarks; we know who's doing well on those. The other thing is they are actually probably very systematic about the error analysis to some extent. I bet you that they are monitoring who is using Claude, how many people are using Claude, how many chats are being created, how long these chats are. They're also probably monitoring within their internal team. They're dogfooding. Anytime something is off, they maybe have a queue, or they send it to the person developing Claude Code, and this person is implicitly doing some form of the error analysis that Hamel talked about. All of this is evals. There's no world in which they're just being like, "I made Claude Code. I'm never looking at anything." And unfortunately, when you don't think about that or talk about that, it sends the wrong message, because most of the community is beginners or people who don't know about evals and want to learn about them. Now, I don't know what Claude Code is doing, obviously, but I would be willing to bet money that they're doing something in the form of evals.
[73:53]
Hamil Hussain
I'll also say that coding agents are fundamentally very different from other AI products, because the developer is the domain expert, so you can short-circuit a lot of things, and also the developer is using it all day long. So there's a type of dogfooding and a type of domain expertise where you can collapse the activities. You don't need as much data, you don't need as much feedback or exploration, because you know. So your eval process should look different, because you're...
[74:32]
Lenny Rachitsky
Seeing the code. Like, you see the code it's generating. You can tell: this is great, this is terrible.
[74:36]
Hamil Hussain
Yeah, yeah. And so I think a lot of people have generalized from coding agents, because coding agents are the first AI product released into the wild. And I think it's a mistake to try to generalize that at large.
[74:52]
Shreya Shankar
The other thing is, yeah, engineers have a dogfooding personality. There are plenty of applications where people are trying to build AI in certain domains and they don't have dogfooding. Doctors, for example, are not out there trying to get all the most incorrect advice from AI and being tolerant and receptive to that. So it's very important, I think, to keep these nuances in mind.
[75:16]
Lenny Rachitsky
So what I'm hearing from you, Shreya, interestingly, is that if humans on the team are doing very close data analysis and error analysis, and dogfooding like crazy, then essentially they are the human evals. And you're describing that as within the umbrella of evals. So you could do it that way if you have the time and motivation to do that. Or you could set these things up to be automatic.
[75:41]
Shreya Shankar
Absolutely. It's also about the skills. Right. People who work at Anthropic are very, very highly skilled. They've been trained in data analysis or software engineering or AI and whatnot. Right. And you know, you can get there, anyone can get there, of course, by like learning the concepts. But most people don't have that skill right now.
[76:03]
Hamil Hussain
Dogfooding is a dangerous one, only because a lot of people will say they're dogfooding, like, "Yeah, we dogfood it," but are they really? A lot of people aren't really dogfooding at the visceral level you would need in order to close that feedback loop. So that's the only caveat I would add.
[76:25]
Lenny Rachitsky
There's also this, what kind of feels like a strawman, argument of evals versus A/B tests. Talk about your thoughts there, because that feels like a big part of this debate people are having. Do you need evals if you have A/B tests that are testing production-level metrics?
[76:37]
Shreya Shankar
So A/B tests are, again, another form of evals, I imagine, right? When you're doing an A/B test, you have two different experimental conditions, and then you have a metric that quantifies the success of something, and you're comparing the metric. And again, an eval in our mind is systematic measurement of quality, of some metric. You can't really do an A/B test without the eval to compare. So maybe we just have a different, weird take on it.
[77:07]
Lenny Rachitsky
Yeah. Okay, so what I'm hearing is that you consider A/B tests part of the suite of evals that you do. I think when people think A/B test, it's like: we're changing something in the product, and we're going to see if this improves some metric we care about. Is that enough? Why do we need to test every little feature? Like, if it's impacting a metric we care about as a business, we have a bunch of A/B tests that are just constantly running.
[77:28]
Shreya Shankar
This is a great point. So I think a lot of people prematurely do A/B tests, because they've never done any error analysis in the first place. They have just hypothetically come up with their product requirements, and they believe that they should test these things. But it turns out, when you get into the data, as Hamel showed, that the errors you're seeing are not what you thought the errors might be. They were these weird handoff issues or, I don't know, the text message thing was strange. So I would say that if you're going to do A/B tests and they are powered by actual error analysis, as we've shown today, then that's great, go do it. But if you're just going to do them based on what you hypothetically think is important, which we find that people try to do, then I would encourage people to go and rethink that and ground your hypotheses.
[78:23]
Lenny Rachitsky
Do you have thoughts on what Statsig's going to do at OpenAI? Is there anything there that's interesting? That was a big deal, a huge acquisition of an A/B testing company. People are like, "Oh, A/B tests are the future." Thoughts?
[78:34]
Hamil Hussain
You know, just to add to the previous question a little bit: why is there this debate of A/B testing versus evals? I think fundamentally, with evals, people are trying to wrap their heads around how to improve their applications. And fundamentally you need to do data science. Data science is useful in products: looking at data, doing data analytics. There's a whole suite of different tools, and you don't need to invent anything new.
[79:08]
Lenny Rachitsky
Sure.
[79:08]
Hamil Hussain
You don't necessarily need the whole breadth of data science. And it looks slightly different, just slightly, with LLMs. Your tactics might be different. So really what it is, is using analytic tools to understand your product. Now people say the word evals, trying to kind of carve out this new thing, and say, "Oh, evals, and then A/B testing." But if you zoom out, it's the same data science as before. And I think that's what's causing the confusion. We need data science thinking in AI products; it's helpful to have that thinking in AI products like it is in any product. That's my take on that.
[79:49]
Lenny Rachitsky
So, yeah, that's a really good take. I think just the word evals triggers people now. If you just call it, "We're just doing error analysis, doing data science to understand where our product breaks, and just setting up tests to make sure we know," that's boring.
[80:01]
Shreya Shankar
It sounds boring. No, no, no. We need a mysterious term like evals to really get the momentum going. Your question about Statsig: I think it's very exciting. To be honest, I don't know much about it, because I just imagine that it's a tool that many people use, and maybe it just so happened that OpenAI acquired them. I'm sure they'd been using them in the past. I'm sure OpenAI's competitors are using Statsig as well. So maybe there is something strategic in that acquisition. I have no idea. I don't know anything there. But I think those are really the bigger questions for me than, you know, is this fundamentally changing A/B testing or making evals more of a priority? I think they've always been a priority. I think OpenAI has always been doing some form of them. And OpenAI has gone so far, historically speaking, as to go and look at all the Twitter sentiment and try to do some sort of retrospective on that and then tie that back to their products. They're certainly doing some amount of evals before they ship their new foundation models, but they're going so far beyond that, like, okay, let's find all the tweets that are complaining about it, all the Reddit threads that are complaining about it, and go try to figure out what's going on. So it goes to show that evals are very, very important. No one has really figured it out yet. People are using all the available sources of signal that they can to improve their products.
[81:26]
Hamil Hussain
What I will say is I'm really hopeful that it might shift the focus, or create a focus, within OpenAI. Up until now, a lot of the big labs have understandably focused on general benchmarks like MMLU score, HumanEval, things like that, which are very important for foundation models, and those are not very related to product-specific evals like the ones we talked about today, like handoff and stuff like that. They tend not to correlate.
[82:00]
Shreya Shankar
Yeah, they don't correlate with math problem solving, sorry to say.
[82:06]
Hamil Hussain
Exactly. And so, you know, if you look at the eval products that some of the big labs have had, at least up until recently, they don't have error analysis. They have a suite of generic tools: cosine similarity, hallucination score, whatever. And that doesn't work. It's a good first stab at it. It's okay; at least you're doing something. Maybe it gets people to look at data. But eventually what we hope to see is a bit more data science thinking in this eval process. Hopefully the tools will get to...
[82:45]
Shreya Shankar
Hamel and I should not be the only two people on the planet promoting a structured way of thinking about application-specific evals. It's mind-boggling to me. Why are we the only two people in the whole world doing this? What's wrong? So I hope that we're not the only people and that more people catch on.
[83:03]
Lenny Rachitsky
Well, the fact that your course is the number one highest-grossing course on Maven shows clearly there's demand and interest, and I think there are more people on your side. Interestingly, here's an example you've been sharing on Twitter that I think is informative. Everyone's been saying how Claude Code doesn't care about evals, they're all about vibes, and they're the best coding agent out there, so clearly this is right. More recently, there's all this talk about Codex, OpenAI Codex, being better, and everyone's switching, and they're so pro-evals.
[83:33]
Shreya Shankar
I know.
[83:35]
Lenny Rachitsky
Yeah.
[83:35]
Hamil Hussain
So gets me every time.
[83:38]
Shreya Shankar
The Internet's so inconsistent. My favorite thing was like, yesterday, I believe, like, a couple of lab mates and I were out getting, like, dessert or something and somebody said like, oh, do you like Codex or Claude better or whatever? And the other person said, oh, I like Claude. And then someone else said, but the new version of Codex is better. And then the first person said, oh, but last I checked was two days ago. So maybe my thoughts, maybe I'm not up to date. And I was like, oh, my God, so true.
[84:10]
Lenny Rachitsky
This is the world we live in. Oh, my God. Okay, so I want to ask about just the top misconceptions people have with evals and the top tips and tricks for being successful. So maybe just share one or two of each. Let me start with misconceptions, and maybe I'll go to Hamel first. What are a couple of the most common misconceptions people have with evals?
[84:30]
Hamel Hussain
Still, the top one is, hey, I can just buy a tool, plug it in, and it'll do the eval for me. Why do I have to worry about this? We live in the age of AI. Can't the AI just eval it? That's the most common misconception. And people want that so much that people do sell it, but it doesn't work. So that's the first one.
[84:56]
Lenny Rachitsky
Shoot. We need humans still. Great. I think that's great news.
[85:00]
Hamel Hussain
The second one that, you know, I see a lot is just not looking at the data, you know. So in my consulting, people come to me with problems all the time, and the first thing I'll say is, let's go look at your traces. And you can kind of see their eyes pop open. They'll be like, what do you mean? Like, yeah, let's look at it right now. And they're surprised that I'm going to go look at individual traces. And we always, 100% of the time, learn a lot and figure out what the problem is. And so I think people just don't know how powerful looking at the data is, like we showed on this podcast.
[85:49]
Shreya Shankar
I would agree with that.
[85:51]
Lenny Rachitsky
Those are the top two. Okay. Is there anything else, or are those the ones, like, solve those problems?
[85:55]
Shreya Shankar
Oh, those are definitely. And then I guess the one I would add is there's no one correct way to do evals. There are many incorrect ways of doing evals, but there are also many correct ways of doing it. And you've got to think about where you are at with your product, how many resources you have, and figure out the plan that works best for you. It'll always involve some form of error analysis, as we showed today, but how you operationalize those metrics is going to change based on where you're at.
[86:28]
Lenny Rachitsky
Amazing. Okay, what are a couple of tips and tricks you want to leave people with as they start on their eval journey or just try to get better at something they're already doing?
[86:37]
Shreya Shankar
So tip number one is just don't be alarmed or don't be scared of looking at your data. The process, we try to make it as structured as possible. There are inevitably questions that are going to come up. That's totally fine. You might feel like you're not doing it perfectly. That's also fine. The goal is not to do evals perfectly. It's to actionably improve your product. And we guarantee you, no matter what you do, if you're doing parts of these processes, you're going to find ways of actionable improvement, and then you're going to iterate on your own process from there. The other tip that I would say is we are very pro-AI. Use LLMs to help you organize any thoughts that you have throughout this entire process. So this could be everything ranging from, like, initial product requirements, right? Figure out how to organize them for yourself. Figure out how to improve on that product requirements doc based on the open codes that you've created, right? Like, don't be afraid to use AI in ways that, you know, present information better for you.
[87:44]
Lenny Rachitsky
Sweet. So don't be scared. Use LLMs as much as you can.
[87:48]
Shreya Shankar
Throughout the process, but not to replace yourself.
[87:51]
Lenny Rachitsky
Right. Okay, great. Still jobs. Great. Hamel?
[87:55]
Hamel Hussain
Yeah. Let me actually share my screen so I can show something. So to piggyback off what Shreya said, if you heard any phrase in this podcast, you've probably heard "look at your data" more than anything else. And it's so important that we teach that you should create your own tools to make it as easy as possible. So I showed you some tools when we were going through the live example of, like, how to annotate data. Most of the people I work with, they realize how important this is and they vibe code their own tools. Or they... we shouldn't say vibe code.
[88:33]
Lenny Rachitsky
We.
[88:34]
Hamel Hussain
They just make their own tools, and it's cheaper than ever before because you have AI that can help you. And AI is really good at creating simple web applications that can show you data, that, you know, can write to a database. It's very simple. And so for the Nurture Boss use case, we wanted to remove all the friction of looking at data. And so what you see here is just some screenshots of what the application that they created looks like. It's just, okay, they have the different channels: voice, email, text. They have the different threads. They hid the system prompt by default. Little quality-of-life improvements. And then they actually have this axial coding part here where you can see in red the count of different errors. They automated that part in a nice way, and they created this within a few hours. And so it's really hard to have a one-size-fits-all thing for looking at your data. You don't have to go here immediately, but something to think about is make it as easy as possible because, again, it's the most powerful activity that you can engage in. It's the highest ROI activity you can engage in with AI. Yeah, just remove all the friction.
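To make the "count of different errors" idea concrete, here is a minimal sketch of how axial-code counts could be tallied from annotated traces. It is not the Nurture Boss tool: the record fields (trace_id, channel, axial_codes) and the category labels are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical annotations: one record per reviewed trace. The field names
# ("channel", "axial_codes") and the category labels are assumptions.
annotations = [
    {"trace_id": "t1", "channel": "text", "axial_codes": ["handoff_failure"]},
    {"trace_id": "t2", "channel": "voice", "axial_codes": []},
    {"trace_id": "t3", "channel": "email", "axial_codes": ["handoff_failure", "formatting"]},
    {"trace_id": "t4", "channel": "text", "axial_codes": ["formatting"]},
    {"trace_id": "t5", "channel": "text", "axial_codes": ["handoff_failure"]},
]


def count_axial_codes(records):
    """Count how many traces carry each axial code, most frequent first."""
    counts = Counter()
    for record in records:
        # A set guards against the same code being recorded twice on one trace.
        counts.update(set(record["axial_codes"]))
    return counts.most_common()


error_counts = count_axial_codes(annotations)
for code, n in error_counts:
    print(f"{code}: {n}")
```

Sorting by frequency puts the most common failure category at the top, which is usually the first thing to look at when deciding what to fix.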
[89:56]
Lenny Rachitsky
That's amazing. And again, I think the ROI piece is so important. We haven't even touched on this enough. The goal here is to make your product better, which will make your business more successful. This isn't just a little exercise to catch bugs and things like that. This is the way to make AI products better. Because the experience is how users interact with your AI.
[90:16]
Hamel Hussain
Absolutely. We teach our students, hey, when you're doing these evals, if you see something that's wrong, just go fix it. Like, the whole point is not to have a beautiful eval suite that you can point at and say, oh, look at my evals. No, just fix your application, make it better. If it's obvious, do it. So totally agree with you.
[90:38]
Lenny Rachitsky
Amazing. A question I didn't ask, but this is, I think, something people are thinking about. How long do you spend on this? How long does this usually take to do?
[90:44]
Shreya Shankar
The first time I can answer for myself. For applications that I work with, usually I'll spend three to four days really working with whoever to do initial rounds of error analysis, like a lot of labeling, until we feel like we're in a good place to create the spreadsheet that Hamel had, and everyone's kind of on board and convinced, and even, like, a few LLM judge evaluators. But this is a one-time cost. Once I've figured out how to integrate that in unit tests, or I have, like, a script that automatically runs it on samples, I will create a cron job to just do this every week. I would say it's like, I don't know, I find myself probably spending more time looking at data because I'm just data hungry like that. I'm so curious. I've gained so much from this process, and it's, like, put me above and beyond in any of my, you know, collaborations with folks. So I want to keep doing it, but I don't have to. I would say, like, maybe 30 minutes a week after that.
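A script that "automatically runs it on samples" might look roughly like the sketch below. Everything here is an assumption for illustration: the trace shape, the function names, and especially toy_judge, which is a stand-in for a real LLM judge evaluator.

```python
import random


def sample_traces(traces, k, seed=None):
    """Pick up to k traces at random for this week's review."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def failure_rate(traces, judge):
    """Fraction of traces the judge flags as failing (0.0 for empty input)."""
    if not traces:
        return 0.0
    failures = sum(1 for trace in traces if not judge(trace))
    return failures / len(traces)


def toy_judge(trace):
    """Placeholder for an LLM judge: passes when the reply is non-empty."""
    return bool(trace["response"].strip())


# Made-up traces: every fifth one has an empty response.
traces = [{"id": i, "response": "" if i % 5 == 0 else "ok"} for i in range(100)]
weekly_sample = sample_traces(traces, k=20, seed=0)
rate = failure_rate(weekly_sample, toy_judge)
print(f"reviewed {len(weekly_sample)} traces, failure rate {rate:.0%}")
```

If this lived in a file such as weekly_evals.py (a hypothetical name), a crontab entry like `0 9 * * 1 python weekly_evals.py` would run it every Monday at 09:00.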
[91:41]
Lenny Rachitsky
So it's a week, essentially, upfront, and then like 30 minutes to keep improving and adding to your suite.
[91:47]
Shreya Shankar
Yeah, it's really not that much time. I think people just get overwhelmed by how much time they spend up front and then thinking that they have to keep doing this all the time.
[91:56]
Lenny Rachitsky
Amazing. Is there anything else that you wanted to share or leave listeners with? Anything else you wanted to kind of double down as a point before we get to our very exciting lightning round?
[92:06]
Hamel Hussain
So I would say this process is a lot of fun, actually. So it's like, okay, you're looking at data. Oh, it sounds like you're annotating things. Okay, actually, like, so I was just looking at a client's data yesterday. The same exact process. It's an application that sends emails, recruiting emails, to try to get candidates to apply for a job. And we decided to start looking at traces. We jumped right into it. Hey, let's look at your traces. We looked at a trace. The first thing I saw was this, like, email that is worded like, "Given your background, blah, blah, blah, blah." So I asked the person right away, and this is where putting your product hat on and just being critical comes in. And this is where the fun part is. I said, you know what? I hate this email. Like, do you like the email? When I receive a message that says "Given your background," comma, I just delete that. So I'm like, what is this? "Given your background with machine learning and blah, blah." I'm like, this is a generic thing. So I asked the person, like, hey, you know, can we do better than this? Like, this sounds like generic recruiting. And they're like, oh, yeah, maybe. Because they were proud of it. They're like, the AI is doing the right thing, it's sending this email with the right information, with the right link, with the right name, everything. And so that's where the fun part is: put your product hat on and get into, like, is this really good?
[93:39]
Lenny Rachitsky
Something I want to make sure we cover before we get to a very exciting lightning round is this is just scratching the surface of all the things you need to know to do this well. I think this is the best primer I've ever seen on how to do this well. Nice. But I think we did it. But you guys teach a course that goes much, much deeper for people that really want to get good at this and take this really seriously. Share what else you teach in the course that we didn't cover and what else you get as a student being part of the course you teach on Maven.
[94:07]
Shreya Shankar
Yeah, I can talk about the syllabus a little bit, and then Hamel can talk about all the perks. So we go through a life cycle of error analysis, then automated evaluators, then how to improve your application. Like, how do you create that flywheel for yourself? We also have a few special topics that we find, like, pretty much no one has ever heard of or taught before, which is exciting. One is how do you build your own interfaces for error analysis? So we kind of go through actual interfaces that we've built, and we also live code them on the spot for new data, and we show kind of how we use Claude Code, Cursor, whatever we're feeling in the moment that day to build these interfaces. And we also talk about kind of broadly cost optimization as well. A couple of people that I've worked with, they got to a point where their evals are very good, their product is very good, but it's all very expensive because they're using state-of-the-art models. So how can we kind of replace certain uses of the most expensive GPT-5 models with 5 nano, 4 mini, whatnot, and save a lot of money but still maintain the same quality? So we also give some tips for that. Hamel, you want? We also have many perks.
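One way to picture the cost-optimization question is as a selection rule over eval results: keep the cheapest model whose pass rate stays close to the expensive baseline. The sketch below is only an illustration of that rule, not the course's method. The model names, pass rates, costs, and the max_drop tolerance are all made-up assumptions.

```python
def pick_cheapest_model(results, baseline, max_drop=0.02):
    """Return the cheapest model whose pass rate is within max_drop of baseline."""
    floor = results[baseline]["pass_rate"] - max_drop
    eligible = [name for name, r in results.items() if r["pass_rate"] >= floor]
    return min(eligible, key=lambda name: results[name]["cost_per_1k_calls"])


# Illustrative numbers only: pass rates would come from running the same
# eval suite against each candidate model on the same set of traces.
results = {
    "large": {"pass_rate": 0.94, "cost_per_1k_calls": 10.0},
    "small": {"pass_rate": 0.93, "cost_per_1k_calls": 1.0},
    "tiny": {"pass_rate": 0.81, "cost_per_1k_calls": 0.2},
}

choice = pick_cheapest_model(results, baseline="large")
print(f"cheapest acceptable model: {choice}")
```

The baseline model always qualifies, so the function never returns nothing; loosening max_drop trades quality for cost.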
[95:23]
Lenny Rachitsky
Yeah, talk about the perks.
[95:24]
Hamel Hussain
Okay, the perks. So my favorite perk is there's a 160-page book that's meticulously written that we've created that walks through the entire process in detail of how to do evals and supplements the course, so you don't have to sit there and take all these notes. We've done all the hard work for you, and we have documented it in detail and organized things. So that is really useful. Another really interesting thing, and something that I got the idea for from you, Lenny, is, okay, this is an AI course. Education shouldn't be this thing where you're only watching lectures and doing homework assignments. So students should have access to an AI that also helps them. So what we have done is we've, you know, just like there's the Lenny bot that you have.
[96:19]
Lenny Rachitsky
Dot com.
[96:20]
Hamel Hussain
Yeah, lennybot.com. We have made the same thing with the same software that you're using, and we have put everything we've ever said about evals into that. So every single lesson, every office hours, every Discord chat, any blogs, papers, anything that we've ever said publicly and within our course, we've put it in there, and we've tested it with a bunch of students and they've said it's helpful. So we're giving all students 10 months free unlimited access to that alongside the course.
[96:57]
Lenny Rachitsky
Amazing. And then you'll charge for that later down the road.
[97:00]
Hamel Hussain
I have no idea. I just take one month at a time. I don't know what we're going to do.
[97:03]
Lenny Rachitsky
Eight months and then we'll have to figure it out. I was thinking this whole interview should have just been our bots talking to each other.
[97:09]
Shreya Shankar
That's amazing. I would watch that only for like 10 minutes. Then I don't know what they're talking about.
[97:15]
Lenny Rachitsky
Yeah, maybe, maybe 30 seconds. Do you guys train it on the voice mode? By the way, that's my favorite feature of this, of Delphi's product. And if not you should do that.
[97:22]
Hamel Hussain
Oh, I think I'm. I can't remember.
[97:25]
Lenny Rachitsky
Okay.
[97:25]
Hamel Hussain
I should actually look at it.
[97:27]
Lenny Rachitsky
Definitely should. Now that we have this podcast episode, you could use this content to train it. It's ElevenLabs-powered. It's so good. Okay, so how do they get to that? I guess they get to that once they enter your course. So there's no URL.
[97:39]
Shreya Shankar
Yeah, sign up for the course and then you'll get a bunch of emails. Everything will be clear, hopefully.
[97:44]
Lenny Rachitsky
Amazing.
[97:44]
Shreya Shankar
Oh, we also have a Discord of all the students who have ever taken the class. And that Discord is so active. I can't go on vacation without getting notified on the plane or.
[97:55]
Lenny Rachitsky
Bittersweet. Bittersweet. Incredible. Okay, with that, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?
[98:04]
Shreya Shankar
Yes. Let's go.
[98:06]
Lenny Rachitsky
Let's do it. Okay, so I'm going to bounce between you two. Share something if you want. You can pass if you want. First question, Shreya. What are two or three books that you find yourself recommending most to other people?
[98:17]
Shreya Shankar
So I like to recommend a fiction book because life is about more than evals. So recently I read Pachinko by Min Jin Lee. Really great book. And then I also am currently reading Apple in China, which the name of the author is slipping my mind. But this is kind of more of an exposition written by journalists on how Apple did a lot of manufacturing processes in Asia over the last several decades. Very eye-opening.
[98:49]
Lenny Rachitsky
Amazing. Hamel.
[98:52]
Hamel Hussain
Yeah, I have them right here. So I'm the nerd. Okay, so I'm not as cool as Shreya is. So I actually have, like, textbooks, which are, like, my favorite. So this one is a very classic one: Machine Learning by Mitchell. Now, it's kind of theoretical, but the thing I like about it is it really drives home the fact that Occam's razor is prevalent not only in, like, science, but also in machine learning and AI, and also engineering. So, like, a lot of times the simpler approach generalizes better. And so that's the thing I kind of internalized deeply from that book. And I also really like this one. So, another textbook. I told you I'm a nerd. This is, like, also a very old one. And this is, like, you know, Norvig algorithms. And I really like it because it's just human ingenuity, and it's, like, lots of clever, useful things.
[99:49]
Shreya Shankar
They're down the street. I'm at Berkeley.
[99:54]
Lenny Rachitsky
The people that did that research.
[99:55]
Shreya Shankar
Yeah, yeah. Textbook authors.
[99:58]
Lenny Rachitsky
Super cool. Oh, man. Nerds.
[100:00]
Hamel Hussain
I love it.
[100:01]
Lenny Rachitsky
Okay, next question. Favorite recent movie or TV show? I'll jump to Hamel first.
[100:06]
Hamel Hussain
Okay, so I'm a dad of two parents. I have two. Oh, sorry. Two kids. So, yeah, I'm a dad of two kids, and I don't really get the time to watch any TV or movies, so I watch whatever my kids are watching. So I've watched Frozen, like, three times in the last week.
[100:25]
Lenny Rachitsky
Only three? Oh, okay. In the last week. Okay. Yeah.
[100:28]
Hamel Hussain
So that's my.
[100:29]
Lenny Rachitsky
Great. That's Hamel: Frozen. I love it.
[100:31]
Hamel Hussain
Okay, sure.
[100:32]
Shreya Shankar
Yeah, yeah. I don't have kids, so I can give all these amazing answers, actually. So my husband and I have been watching The Wire recently. We never actually saw it growing up, so we started watching it, and it's great.
[100:46]
Lenny Rachitsky
I feel like everyone goes through that eventually in their life. They decide, I will watch The Wire. So you are in that year of your life. It's great. It's such a great show. Oh, man. But it's so many episodes, and every one's an hour long.
[100:58]
Shreya Shankar
I know, I know. We get through, like, two or three a week, so we're very slow.
[101:04]
Lenny Rachitsky
It's worth it. Okay, next question. Do you have a favorite product you recently discovered that you really love? And we'll start with Shreya.
[101:10]
Shreya Shankar
Yeah, I really like using Cursor. Honestly, now Claude Code, I'll say.
[101:17]
Hamel Hussain
Why?
[101:17]
Shreya Shankar
So I think I'm a researcher more so than anything else. I write papers, I write code, I build systems, everything. And I find that I'm so bullish on AI assisted coding because I have to wear a lot of hats all the time. And now I can be more ambitious with the things that I build and write papers about. So I'm super excited about those. Cursor was my entry point into this, but I'm starting to find myself trying, always trying to keep up with all these AI assisted coding tools.
[101:48]
Lenny Rachitsky
Hamel? Yeah.
[101:50]
Hamel Hussain
I really like Claude Code, and I like it because I feel like the UX is outstanding. There's a lot of love that went into that. It's just really impressive for a terminal application to be that nice.
[102:04]
Lenny Rachitsky
How ironic that you two both love Claude Code when it's just built on vibes.
[102:09]
Shreya Shankar
I think it's false. It's not just built on vibes.
[102:13]
Lenny Rachitsky
There we go. Okay, two more questions. Hamel, do you have a favorite life motto that you find yourself using and coming back to in work or in life?
[102:22]
Hamel Hussain
Keep learning and think like a beginner.
[102:26]
Lenny Rachitsky
Beautiful. Shreya?
[102:27]
Shreya Shankar
I like that. For me, it's to always try to think about the other side's argument. I find myself sometimes just encountering arguments on the Internet, like these recent evals debates, and really think, okay, put myself in their shoes. There's probably a generous take, a generous interpretation, and I think we're all much stronger together than if we start picking fights. My vision for evals is not that Hamel and I become billionaires. It is that everyone can build AI products and we're all on the same page.
[102:59]
Lenny Rachitsky
Slash everyone becomes billionaires.
[103:02]
Shreya Shankar
Yes.
[103:02]
Hamel Hussain
Yes.
[103:04]
Lenny Rachitsky
Amazing. Final question. When I have two guests on, I always like to ask this question, and I'll start with Hamel. What's something about Shreya that you like most? What do you like most about Shreya? And I'm going to ask her the same question in reverse.
[103:18]
Hamel Hussain
Yeah. Shreya is one of the wisest people that I know, especially for being so young relative to me. I feel like she's, like, much wiser than I am, honestly. Seriously, she's very grounded and has, like, a very even perspective on things. And so I'm just really impressed by that all the time.
[103:42]
Lenny Rachitsky
Shreya?
[103:42]
Hamel Hussain
Yeah.
[103:43]
Shreya Shankar
Yeah. My favorite thing about Hamel is his energy. I don't know anybody who consistently maintains momentum and energy like Hamel does. I often think that, like, I would start caring much less about evals if not for Hamel. And everyone needs a Hamel in their life for sure.
[104:06]
Lenny Rachitsky
Well, we all have a Hamel in our life now. This was incredible. This was everything I'd hoped it'd be. I feel like this is the most interesting, in-depth, consumable primer on evals that I've ever seen. I'm really thankful you two made time for this. Two final questions. Where can folks find you? Where can they find the course? And how can listeners be useful to you? I'll start with Shreya.
[104:29]
Shreya Shankar
Yeah, you can reach me via email. It's on my website. If you Google my name, that is the easiest way to get to my website. You can find the course if you Google "AI evals for engineers and product managers" or just "AI evals course." You'll find it. We'll send some links hopefully after this, so it's easy. And how to be helpful: two things always for me. One is ask me questions when you have them. I will try to respond as soon as I can. The other one is tell us your successes. One of the things that keeps us going is when somebody tells us, like, what they implemented or what they did. A real case study. And Hamel and I get so excited from these, and it really keeps us going. So please share.
[105:16]
Hamel Hussain
Yeah, it's pretty easy to find me. My website is hamel.dev. I'll give you the link. You can find me on social media, LinkedIn, Twitter. The thing that's most helpful is, to echo what Shreya said, we would be delighted if we're not the only people teaching evals. We would love other people to teach evals. And so any kind of blog posts or writing that you want to share as you go through this and learn this, we would be delighted to help reshare that or amplify that.
[105:54]
Lenny Rachitsky
Amazing. Very generous. Thank you two so much for being here. I really appreciate it. And you guys have a lot going on. So thank you.
[106:01]
Shreya Shankar
Thanks, Lenny for having us and for all the compliments.
[106:05]
Lenny Rachitsky
My pleasure. Bye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.