Summary (7 min read)

Podcast Summary: The Growth Podcast

Episode: The PM’s Role in AI Evals: Step-by-Step

Host: Aakash Gupta
Guests: Hamel Husain & Shreya Shankar (AI evaluation experts)
Date: July 11, 2025

Episode Overview

This episode explores the pivotal role of AI evaluations (“evals”) in building reliable, high-quality AI products. Host Aakash Gupta interviews Hamel Husain and Shreya Shankar, whose evaluation frameworks are trusted by companies like OpenAI and Arize. The conversation demystifies AI evals, discusses their strategic value for Product Managers (PMs), presents step-by-step practical guidance, and shares hard-won lessons from real-world AI products, including GitHub Copilot and RAG systems.

Key Discussion Points & Insights

Why AI Evals Matter for PMs

Injecting Product Taste: Evals let PMs “inject their taste and judgment directly into the critical path of the AI product” ([00:02], [02:09] - Husain).
Iteration: Evals provide structured ways to rapidly gather feedback and improve products.
Scalability: Well-designed evals can systematize PM judgment across products and teams.
Quote:

“When you build that foundation of evals, you have immense leverage... It's a really quick way to exert lots of influence over the process and in a good way.”
—Hamel Husain [03:50]

What Are Evals?

Definition:

“An eval is some systematic measurement of some aspect of quality... what varies in an eval is the criterion (e.g., conciseness, accuracy) and how you measure it.”
—Shreya Shankar [05:29]
Products typically need a suite of evals: three to five, sometimes up to ten. No single metric suffices.
Evals codify “vibe checks”—making internal PM taste explicit and scalable.
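Shankar's definition (a criterion plus a way of measuring it) can be sketched in a few lines. This is only an illustration: the `Eval` class, the `run_suite` helper, and the 50-word threshold are assumptions made for this example, not anything shown in the episode.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Eval:
    name: str                       # the criterion being measured
    measure: Callable[[str], bool]  # how it is measured: True means pass


def is_concise(response: str, max_words: int = 50) -> bool:
    """Pass if the response stays within a word budget (threshold is illustrative)."""
    return len(response.split()) <= max_words


def is_non_empty(response: str) -> bool:
    """Pass if the response contains any non-whitespace text."""
    return bool(response.strip())


def run_suite(response: str, suite: list[Eval]) -> dict[str, bool]:
    """Run every eval in the suite against one response."""
    return {e.name: e.measure(response) for e in suite}


SUITE = [Eval("concise", is_concise), Eval("non_empty", is_non_empty)]
print(run_suite("Thanks for your time today. Here is the summary you asked for.", SUITE))
```

A real suite would mix cheap code checks like these with human or LLM judgments for the more subjective criteria.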

The Case for Binary Criteria

Why Binary?
Binary pass/fail rubrics outperform 1–5 scoring for both humans and LLMs.
- Calibrating numerical scales is difficult; binary forces clarity.
- LLMs perform reliably with binary judgment, less so with nuanced ratings.
Quote:

“Binary judgments force you to make a pass/fail decision. And for the vast majority of people, that's the right choice.”
—Hamel Husain [09:00]
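One way to picture a binary criterion is as a precise definition plus a few pass and fail examples, assembled into a judge prompt. The sketch below is an assumption-laden illustration: `BinaryCriterion`, `build_judge_prompt`, and `parse_verdict` are hypothetical names, the prompt wording is invented, and no real LLM API is called.

```python
from dataclasses import dataclass, field


@dataclass
class BinaryCriterion:
    name: str
    definition: str  # precise statement of what counts as a pass
    pass_examples: list[str] = field(default_factory=list)
    fail_examples: list[str] = field(default_factory=list)


def build_judge_prompt(criterion: BinaryCriterion, output: str) -> str:
    """Assemble a plain-text prompt asking a judge model for PASS or FAIL."""
    lines = [f"Criterion: {criterion.name}", f"Definition: {criterion.definition}", ""]
    lines += [f"PASS example: {ex}" for ex in criterion.pass_examples]
    lines += [f"FAIL example: {ex}" for ex in criterion.fail_examples]
    lines += ["", f"Output to judge: {output}", "Answer with exactly one word: PASS or FAIL."]
    return "\n".join(lines)


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to True (pass) or False (fail); reject anything else."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Expected PASS or FAIL, got: {raw!r}")
    return verdict == "PASS"


# Hypothetical criterion for a sales-email assistant.
grounded = BinaryCriterion(
    name="grounded_in_crm",
    definition="Every fact about the recipient appears in the provided CRM record.",
    pass_examples=["Hi Dana, congrats on the new VP role at Acme."],
    fail_examples=["Hi Dana, great meeting you at the conference last week."],
)
print(build_judge_prompt(grounded, "Hi Dana, loved your recent keynote."))
```

Rejecting any reply other than PASS or FAIL keeps the judge from drifting back into graded answers.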

Human Evals vs. Automated Evals

Human “vibe checking” does not scale—evals operationalize subjective judgment so it can be shared and automated.
LLMs as judges are feasible if criteria are concrete and binary; complex, subjective, or vague rubrics make automation unreliable.
Quote:

“Your vibe checks are very important, but they don’t scale... Evals let you translate those checks into something concrete.”
—Shreya Shankar [07:47]

The Scientific Method in Evals

Skepticism is vital: Be “skeptical of everything and do lots of experiments” ([15:00] - Husain).
Always validate LLM judge performance against your labeled ground truth.
Measure and iterate on both the evaluator (model) and your own interpretive rubric.
Memorable Analogy:

“It’s like playing whack-a-mole without evals... you keep hammering problems but don’t make progress.”
—Hamel Husain [15:00]
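Validating a judge against labeled ground truth can be as simple as comparing its pass/fail verdicts with human labels on the same outputs. The sketch below assumes both are lists of booleans (True means pass). The function name and the choice of metrics (overall agreement, plus agreement on the human-pass and human-fail subsets) are illustrative, not a method prescribed in the episode.

```python
import math


def judge_alignment(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels given in the same order."""
    if not human or len(human) != len(judge):
        raise ValueError("Need two non-empty label lists of equal length.")
    true_pass = sum(h and j for h, j in zip(human, judge))
    true_fail = sum((not h) and (not j) for h, j in zip(human, judge))
    n_pass = sum(human)
    n_fail = len(human) - n_pass
    return {
        "agreement": (true_pass + true_fail) / len(human),
        # Of the outputs the human passed, how many did the judge also pass?
        "true_pass_rate": true_pass / n_pass if n_pass else math.nan,
        # Of the outputs the human failed, how many did the judge also fail?
        "true_fail_rate": true_fail / n_fail if n_fail else math.nan,
    }


human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
print(judge_alignment(human_labels, judge_labels))
```

Reporting the two rates separately matters when failures are rare: a judge that passes everything can still show high overall agreement.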

Error Analysis: The PM's Superpower

The critical skill for PMs in AI evals is “error analysis”—systematically reviewing outputs, quantifying failure modes, and turning learnings into ongoing product improvements ([00:56], [56:11]).
Inspired by social science research (“grounded theory,” “open coding”):
- Start with freeform notes
- Cluster insights into failure modes
- Prioritize by frequency/impact
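The three steps above can be sketched with a simple tally. The annotations, failure-mode names, and `rank_failure_modes` helper below are invented for illustration. In practice the clustering step is a human judgment call, and impact would be weighed alongside raw frequency.

```python
from collections import Counter

# Hypothetical annotations from reviewing traces. "note" is the freeform
# open-coding comment; "failure_mode" is the cluster it was later assigned to
# (None means the trace looked fine).
ANNOTATIONS = [
    {"trace_id": 1, "note": "mentions a meeting that never happened", "failure_mode": "fabricated detail"},
    {"trace_id": 2, "note": "reads fine", "failure_mode": None},
    {"trace_id": 3, "note": "wrong job title for recipient", "failure_mode": "fabricated detail"},
    {"trace_id": 4, "note": "way too casual for a first email", "failure_mode": "wrong tone"},
    {"trace_id": 5, "note": "invented a product feature", "failure_mode": "fabricated detail"},
    {"trace_id": 6, "note": "four paragraphs, nobody will read this", "failure_mode": "too long"},
    {"trace_id": 7, "note": "overly pushy close", "failure_mode": "wrong tone"},
    {"trace_id": 8, "note": "reads fine", "failure_mode": None},
]


def rank_failure_modes(annotations):
    """Return (failure_mode, count, share_of_all_traces), most frequent first."""
    counts = Counter(
        a["failure_mode"] for a in annotations if a["failure_mode"] is not None
    )
    total = len(annotations)
    return [(mode, n, n / total) for mode, n in counts.most_common()]


for mode, n, share in rank_failure_modes(ANNOTATIONS):
    print(f"{mode}: {n} traces ({share:.0%})")
```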

Domain-Specific Evals: No One-Size-Fits-All

Off-the-shelf tools and generic metrics (e.g., “hallucination score”) rarely suffice for real products.
PMs must define what “good” looks like for their own users and context.
Quote:

“The foundation model labs... are very much focused on the general purpose benchmarks... but they’re at the same level as everyone else when it comes to domain-specific evals.”
—Hamel Husain [42:51]
This domain-specificity creates a moat for startups that “operationalize taste.”

Evals as the True Moat

“Evals are the moat for AI products... Truly nothing else.”
—Shreya Shankar [46:20]
Well-implemented evaluation pipelines enable rapid model swaps, easy fine-tuning, and defensible product quality.
Your eval suite—with tightly aligned LLM judges and actionable metrics—forms the deepest competitive advantage.

When to Use Prompting, RAG, and Fine-Tuning

Prompting: First line of improvement; communicate explicit requirements to LLMs.
RAG (Retrieval-Augmented Generation): Use when LLM needs external/contextual information.
Fine-Tuning: Only after exhausting prompting/RAG; expensive and requires ongoing maintenance.
Framework: The “Three Gulfs” model explains when each method is appropriate ([53:07]):
1. Gulf of Specification: Is the desired behavior clearly described? Prompting helps here.
2. Gulf of Generalization: Does the model lack capacity or context? Use RAG/fine-tuning.
3. Gulf of Evaluation: Can you measure/assess if the LLM met your goals?

Overfitting & Safe Practices

Danger: Overfitting by designing evals/prompting based on the same test data.
Solution: Always have a hidden test set for evaluation only ([35:15]).
Suspiciously high accuracy (near 100%) is usually a red flag ([36:21]).
Differentiate between regression (must-pass) and aspirational evals (show headroom for improvement).
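A minimal sketch of the hidden-test-set idea, assuming labeled examples are held in a plain Python sequence: iterate on prompts and judges against the dev split only, and look at the held-out split sparingly. The function names, the 30% split, and the 0.99 "suspicious" threshold are assumptions for this example, not numbers from the episode.

```python
import random


def split_dev_test(examples, test_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed and hold out a test set that is never iterated on."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (dev, test)


def pass_rate(results):
    """Fraction of binary eval results that passed."""
    return sum(results) / len(results)


def is_suspicious(rate, threshold=0.99):
    """Flag near-perfect scores, which often mean the eval gives no signal."""
    return rate >= threshold


dev, test = split_dev_test(range(10))
print(len(dev), len(test))
print(is_suspicious(pass_rate([True] * 20)))
```

An aspirational eval is expected to sit well below the threshold and show headroom. A regression eval sitting at or near 100% is doing its job as a must-pass gate rather than tracking progress.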

Case Studies and Real-World Examples

GitHub Copilot: Success required upfront investment in eval systems and automated “test harnesses” ([25:39]).
Airbnb (pre-LLMs): Evals in ML carried over directly; robust evaluation essential for stochastic systems ([19:49], [21:27]).
Search vs. LLMs: LLMs have different context/salience patterns than humans; necessitates different retrieval and evaluation strategies ([22:36]–[24:47]).
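The retrieval half of a RAG system can be scored with standard search metrics such as recall at k. As a rough illustration of the point about tolerances, the sketch below computes recall at a small and a larger cutoff. The document IDs are made up, and `recall_at_k` is a plain textbook definition rather than code from the guests.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("Need at least one relevant document.")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking returned by a search system for one query.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=5))
```

For a human reader, the relevant document ranked fifth is close to lost. A long-context model that consumes the whole list can still use it, which is why a much larger k can be acceptable when the LLM is the consumer.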

Lessons from the Field and the Importance of Interfaces

Best teams custom-build labeling/annotation interfaces for high-quality feedback ([38:22]).
The practice of “benevolent dictators”: Assign a single accountable evaluator to prevent committee paralysis in binary labeling ([64:23]).
Interface design is a bottleneck (see chapters 10–11 of their course/book for in-depth guidance).
Quote:

“PMs are like, vital in building AI products… We’re not going to have successful AI products across different domains unless we have good AI PMs.”
—Shreya Shankar [63:42]

Roadmap for Mastering Evals

(as taught in their course/book—see detailed breakdown at [56:11] onward)

What is evaluation?
Understand LLM strengths/weaknesses
Error analysis (grounded theory, open/axial coding)
Designing & validating LLM judges
UI/interfaces for efficient labeling
Multi-turn and RAG evaluation strategies
Productionizing evals (CI/CD, automation)
Cost optimization

Meta-Lessons: Creating Popular AI Courses (and Why They're Ending Theirs)

Evals are a niche but high-value topic; smaller, focused, and well-structured cohorts are more effective ([86:27], [87:09]).
The course’s value comes from addressing immediate, real-world PM pain points, not “timeless content” ([83:45]).
Marketing, guest speakers, constant student feedback, and relentless iteration all contributed to their success ([90:09], [90:55]).
Quote:

"It's like telling people to eat their vegetables. It's not really that popular... much easier to talk about agents. But at some point I just didn't care. We need to create the category."
—Hamel Husain [87:09]

Notable Quotes & Memorable Moments

Binary vs. Ratings for LLMs:

"[1–5 scaling is] a smell of intellectual laziness... Binary forces you to be clear about what you want."
—Hamel Husain [11:36]
Error Analysis Value:

“A sizable portion of my clients... do the error analysis part and they're like, great, we're done. This is so much value.”
—Hamel Husain [64:23]
PMs as Leverage:

“PMs are like, vital in building AI products... We need this to really realize the vision of AI products changing people's lives.”
—Shreya Shankar [63:42]
Evaluations as Moat:

“Evals are the moat for AI products. Truly nothing else.”
—Shreya Shankar [46:20]
Hill Climbing (and Overfitting):

“If you're getting 100% accuracy in your evals, it's likely your evals are worthless because they're providing no signal.”
—Hamel Husain [36:21]

Timestamps for Core Segments

Evals’ strategic value for PMs: [00:02], [02:09]
What is an eval? [05:29]
Why binary matters: [09:00]
LLMs as judges & challenges: [09:59], [14:14]
Error analysis as skill: [56:11], [59:29]
Avoiding overfitting in evals: [35:11], [36:21]
Domain-specificity & moats: [42:51], [46:20]
Prompt/RAG/fine-tuning framework: [53:07]
Designing effective interfaces: [73:33]
Course philosophy & business model: [77:35], [86:27], [87:09]
PMs’ critical role: [63:42]

Where to Find the Experts

Both are active on X (formerly Twitter) and publish blogs.
Shreya Shankar: Email at shreyashankar@berkeley.edu
Hamel Husain: DMs open on X; details via Google search

Final Takeaways

Evals are the backbone of iterative, reliable, and differentiated AI products.
PMs should take the lead in defining, analyzing, and operationalizing deeply aligned evals.
Do not rely on generic tools or metrics; your product’s success (and moat) depends on bespoke, well-crafted evaluations.
Error analysis is not just a phase; it is an ongoing exercise that builds every AI PM's intuition.
Systematic, credible evaluation is what turns “vibe checks” into high-velocity, scalable, world-class AI products.

For more resources and actual frameworks, visit the guests’ newsletters, blogs, or course reader.

Transcript (109 lines)

[00:00]
A
Why do PMs need to be good at AI evals?
[00:03]
B
Okay, so there's three things that are really important. One, evals give you a way, as a PM, to inject your taste and your judgment directly into the critical path of the AI product being developed. The second thing is, like, evals are really important in helping you iterate. The most effective way to do that is using evals, specifically looking at data in a very structured way. And then the third thing is scale. By mastering evals, what you can do is you can make sure that you can scale your taste, judgment, so on and so forth, user requirements across all the AI workloads that are running.
[00:37]
A
When it comes to AI evals, Hamel Husain and Shreya Shankar are known as the worldwide leading experts. Companies like OpenAI and Arize go to them, and today we're going to learn everything you need to know about evals from them. What is the most critical skill for PMs who want to build AI features to develop?
[00:57]
C
Hands down, error analysis: the ability to look at your outputs and systematically figure out what makes for a bad output, quantify how many of these failure modes you see in a big batch of traces for your system, and then figure out how to turn that measurement into a continuous flywheel of improving your product.
[01:19]
A
If you guys had to build a roadmap for people who wanted to get really deep on AI evals, what topics should they learn really quickly? I think a crazy stat is that more than 50% of you listening are not subscribed. If you can, subscribe on YouTube, follow on Apple or Spotify podcasts. My commitment to you is that we'll continue to make this content better and better. And now on to today's episode. Hamel Husain and Shreya Shankar are the people who the experts go to for evals. OpenAI, Arize AI: those people are going to them for evals, and we have them on the podcast today. Welcome, Shreya and Hamel.
[02:03]
B
Hey, thank you. Nice to be here.
[02:06]
A
Why do PMs need to be good at AI evals?
[02:10]
B
Okay, so there's three things that are really important. One, evals give you a way as a PM to, you know, inject your taste and your judgment directly into the critical path of the AI product being developed. So, like, you know, as we all know, PMs spend a lot of time gaining context from customers, user feedback, so on and so forth. Writing PRDs, they're, you know, trying to give context to engineers, and they're hoping that engineers are faithfully carrying out their vision. Now, what evals give you is, you know, you can directly make sure that your taste and all of that context, if done correctly, is now on the critical path when your engineering team is developing those AI products. The second thing is, like, evals are really important in helping you iterate. So, you know, nothing is, like, set in stone. You have to constantly, like, change your requirements, you're learning more about your customers, so on and so forth. The most effective way to do that is using evals, specifically looking at data in a very structured way. This is one of the things that Shreya and I teach that allows you to refine and have really fast feedback loops and really fast cycles of feedback. Then the third thing is scale. By mastering evals, what you can do is you can make sure that you can scale your taste, judgment, so on and so forth, user requirements across, you know, all the AI workloads that are running in a way that you just couldn't before. Because ultimately, you know, you can bake in a lot of these evals. They're using AI themselves. You just have to make sure that you do it correctly. So you have to make sure that you align the AI with yourself as a PM through a process that we teach. And as long as you do that correctly and you do it in such a way that you develop trust in the AI that is doing the eval, and there's a way to do that alignment, then you can really scale yourself.
So a lot of times, PMs, or not just PMs, but people at large, they kind of view evals as a very monotonous task that, you know, you just want someone else to do. It's like, oh, like, I have to look at data, I have to annotate data. You know, who's going to do this? You don't want to give up that leverage, because when you build that foundation of evals, you have immense leverage and, you know, it's a really quick way to kind of exert lots of influence over the process, and in a good way. And so this is why I would encourage PMs to really pay attention to this.
[05:25]
A
Can you guys precisely define evals?
[05:30]
C
Yeah, I can take this one. An eval is some systematic measurement of some aspect of quality. So what varies in an eval is what that criterion is, for example, maybe it's conciseness of a response, and then how you want to measure it. So maybe that is, I'm gonna define it by, you know, word length. I'm gonna define it by sentence length. Maybe it is some very, very complex bespoke human judgment or something that's more subjective. But those two things make up an eval. And oftentimes products actually have a suite of evals. I've never seen just one eval doing the job. I see three to five, sometimes even up to 10 evals that are really important for a product.
[06:24]
A
People say that if you get evals right, you've gotten the hardest part of the AI product solved. Is that accurate?
[06:33]
C
I think it's accurate now. What do you think?
[06:35]
B
I think it's totally accurate. Just like anything else, it's the process of creating the evals that provides all the value. It's not necessarily the eval itself; it's the journey that creates all the value. Once you've done all of that work, you've looked at all of your data, you've iterated on your system, you've thought very carefully, oftentimes scientifically, about how to improve your system, you've already got 99% of the way there.
[07:02]
C
The way that I like to think about it is if you ever want your product to make it past one iteration, you need evals. I've never seen somebody make it through multiple iterations of their product without any evals. But once you have good evals in place, then evals are not necessarily the bottleneck for you. But that's a good thing. That's how it should be. You should be able to focus on building out other aspects of the product, making things faster, making things feel better, more intuitive, everything beyond that.
[07:37]
A
Why can't you just rely on, like, human evals? Like, the PM looks at the feature, the engineers look at the feature, and they feel like those outputs are good enough.
[07:47]
C
Oh, I love this question on the vibe checks and why. So Hamel and I teach our course and pitch it in a way that we are helping you codify, operationalize, and scale up your vibe checks. Your vibe checks are very important, but they don't scale, right, because they involve you, the human. It's very hard to onboard other people to do the vibe checks in the same way as you are. So, like, I would have to observe you do this thousands of times, look at outputs, try to build my own rubric or mental model of what you're doing, and then I have no good way of teaching other people how to do this. So being able to do evals just means taking your vibe checks and translating them to something concrete. In our course, we define that as a rubric of binary criteria. Every criterion can be complex, that's fine. Can be subjective, that's fine. But you better have a very precise definition for pass/fail, have some examples of pass, have some examples of fail. And we also teach people ways to measure alignment on those results. That's really what this whole process is about.
[08:53]
A
I think the critical phrase there is binary criteria. Why binary?
[09:00]
B
Yeah. So binary really is kind of a heuristic, in a sense, a simplification that works for most people. And the thing is, a lot of people try to assign scores, say on a rating scale of 1 to 5. That's usually a really bad idea because no one knows what that means. If you have an average score of 3.2 versus an average score of 3.7, what does that really mean? And you know, that can be very hard to calibrate, and you have to work incredibly hard to make sense of that. So binary judgments force you to kind of make a pass/fail decision. And that tends to also correlate with the fact that you are going to have to ship this product. Do you want to ship it or not? It really distills that decision making down into the annotation. And for the vast majority of people, that's the right choice.
[09:59]
C
Yeah. And to provide a little bit of extra context on the background of LLM-as-judge and why people have a lot of variance in whether they want it to be binary or a rating-based scale: LLM-as-judge has been around, you know, before these foundation models, even with just regular language models, fine-tuning models to serve as judges. And in those cases people (a) had a lot of preference data of what is good and bad, and maybe even a fine-grained scale, and (b) could fine-tune models to be aligned with that preference data. Today's world of LLM judges is very different. We don't see people fine-tuning judge models as much. We see people trying to use off-the-shelf models who still want to align them with their complex subjective criteria. And now the alignment problem is much harder, right? You can't steer the LLM in the way that you could before. And for that reason we say limit yourself to binary, because that is what the LLM can do very well. All you have to do is provide examples of pass, provide examples of fail, and have very simple or good rubrics. People find that much easier to do than, say, rating on a scale of 1 to 5. Okay, now you need to provide examples for 1, for 2, for 3, for 4, for 5. You need to have descriptions of what makes a 1 different from a 2. All of these things, the pairwise interactions between all these ratings, just explode in complexity. And we never see people successfully able to operationalize that at the rate at which they can do binary evals.
[11:37]
B
And a lot of times non-binary evals, like ratings of 1 to 5, that is a smell of intellectual laziness. The work hasn't been done to actually, you know, make a call of, like, what is good enough and what's not good enough. And it's kind of like, oh, we don't really know. Let's just capture these, like, rough things, you know, in this, like, score, because we're going to lose something. And, you know, the binary scale really forces you to be very clear about what you want.
[12:11]
A
AI evals are one of the most important skills for PMs. And I know you know they matter. The question is, are you doing them right? Most teams are winging it with basic metrics and hoping for the best. Meanwhile, the teams that actually ship reliable AI have cracked the code on systematic evaluation. Today's episode is brought to you by the AI Evals for Engineers and PMs course by Hamel Husain and Shreya Shankar. This live Maven course will teach you the battle-tested frameworks from Hamel and Shreya, who are the engineers behind GitHub Copilot's evaluation system and 25-plus production AI implementations. Four weeks of live instruction. The next cohort starts July 21st. Start shipping AI that actually works. Enroll at maven.com with my code ag product growth for over $800 off. That's A-G-P-R-O-D-U-C-T-G-R-O-W-T-H.

Today's episode is brought to you by Jira Product Discovery. If you're like most product managers, you're probably in Jira tracking tickets and managing the backlog. But what about everything that happens before delivery? Jira Product Discovery helps you move your discovery, prioritization, and even roadmapping work out of spreadsheets and into a purpose-built tool designed for product teams. Capture insights, prioritize what matters, and create roadmaps you can easily tailor for any audience. And because it's built to work with Jira, everything stays connected from idea to delivery. Used by product teams at Canva, Deliveroo, and even The Economist. Check out why and try it for free today at atlassian.com/product-discovery. That's A-T-L-A-S-S-I-A-N dot com slash product-discovery. Jira Product Discovery: build the right thing.

I've heard that LLMs are also not very good at 1 to 5 ratings. Is that true?
[14:14]
C
They're good at what they're trained on. Somewhere out there in the world, I am sure there is a task with very clear or simple 1 to 5 ratings, and the LLM is good for that. But to make such a blanket statement for all products and all use cases is very hard to do. That's the thing. That's the message we want to hammer home to every single product manager who takes the course. Look, you think the LLM might be able to do something. You saw an instance of an LLM being able to do the task for some other domain. That doesn't mean it's going to translate to your domain or your use case. You still have to put in this work, and just don't trust anything. Hamel has a great way of saying this. Maybe he should talk about it. But he always tells people, never trust it. Always put on your detective hat. Hamel, you want to talk about that?
[15:01]
B
Yeah. What underlies the entire process of evals is the scientific method, something that we've all learned in high school, but it's really applied, you know, in this context. And what you have to do is be very skeptical of everything, do lots of experiments, and prove to yourself that the thing that you're trying to achieve, or some new complexity you want to add, whatever it is, is actually working. Try to do it in the simplest way and build intuition by doing lots of experiments. But the point is to, like, measure those and, you know, record those and go through it in a structured way rather than those vibe checks. You asked about vibe checks earlier. The analogy that I like to use, and I can give this to you if you ask me later: there's a little video of my friend Greg Ceccarelli playing whack-a-mole. And it's my favorite meme to use when telling people about the need for evals. It's, you know, it's like you're playing whack-a-mole without evals. You know, so you see a problem, okay, hammer it with some tool or a prompt change. Then another problem comes up. You hammer that and you keep going. You don't really make any progress. It's really with evals that you can systematically try to solve the problem without going in circles.
[16:26]
A
I want to talk about some stories. So one of the features I implemented in my last job: I was VP of Product at Apollo.io. It's a unicorn startup that does sales technology. So what do salespeople need to do? Right, they need to write emails. So as soon as, I think it was ChatGPT 3.5, came out, we're like, okay, we're going to use GPT-3.5 to write people's emails. But the very first thing we found was that it was hallucinating all sorts of crazy details. And ultimately we had to set up a bunch of evals instead of vibe checks. Can you guys give a little bit more context into why evals helped solve that hallucination problem for us?
[17:07]
B
Yeah. So with any kind of problem that you see, you know, hallucination or whatever it is, the first thing to know about evals, where people go off the rails, is: do not reach for generic metrics. So the industry is full of tools and vendors wanting to sell you, like, a magic pill to solve your evals problem. Like, hey, don't worry, just buy our tool, plug it in, we'll show you a dashboard of all of these metrics and things like that. The problem is it doesn't work, because those off-the-shelf evals, a hallucination score, are not going to work. What you need to do is take a look at, in your case, those emails and understand, like, what exactly is the failure mode, and if you do observe a hallucination, what is the hallucination? And kind of give more life to the domain specificity of the hallucination so that you can then start crafting an LLM-as-a-judge that is prompted in a way that is very specific to the types of hallucination that you are seeing. And then you go through an iterative process, so you kind of hand-label when hallucinations happen. You know how they're happening. You're building an LLM-as-a-judge and you're measuring that judge. You're being skeptical again through the scientific process and you're saying, okay, I have this judge. Can I get it to agree with me? Let's say if it's you, Aakash, making this judge, this email hallucination thing, you know, how do I make this judge a proxy of me and how do I trust it? Almost like an employee. And so the only way to do that is to check it. And you can do that iteratively through this process. And when you do that, then not only do you have something that scales, it's an automated way of checking that problem, but it's also something that you trust. And that second part, that it's something that you trust, is key, because the last thing you want is a whole bunch of evals that you put up on a dashboard. And then people stop looking at them because they're like, oh, we have these evals. 
But you know what, the product doesn't really work. No, that's the death of your AI product, because then no one's going to ever look at evals again. And then you don't have any leverage.
[19:34]
A
That's exactly what we experienced, and you weren't even there. I want to talk a little bit about your experiences. I know you worked at Airbnb on these products. Can you tell us a little bit more about that? And what was the most difficult part of building evals back then?
[19:50]
B
So I didn't work on LLMs at Airbnb because it was prior to, you know, way before ChatGPT. But what I did work on my entire career is machine learning and, you know, building predictive models. And a lot of the same machinery of evals comes directly from machine learning. The reason that is is because machine learning systems are systems that produce stochastic outputs. You know, they'll give you predictions of various kinds or like classified things. And they're, they're non deterministic and you have to evaluate them and like, you know, they're giving you different outputs every time and they have noise. And so how do you do that? That's something that's very well established in machine learning that a lot of people haven't been exposed to. So when it comes to AI more generally now, you have really very similar thing. It's like, hey, you have like a stochastic system. You know, it's non deterministic, it can output anything like these emails. It's like, how do you go about measuring that? And yeah, and so basically you can kind of bring that over. And so what we teach is instead of going through the entire data science machine learning curriculum, how do you have a very focused way of learning that that is contextualized to LLMs?
[21:21]
A
That's actually fascinating. How was Airbnb using machine learning models? I'm sure people want to know. Under the hood.
[21:28]
B
Yeah, Airbnb was using it for a lot of things, such as detecting fraud in payments. Also, the biggest use case at Airbnb was search ranking. So if you're searching for a listing, let me show you other listings that you might be interested in based upon what you've been looking at and what you're searching for. So basic recommendation systems, search ranking, things like that, a lot of, like, growth marketing initiatives, like trying to figure out the lifetime value of a specific guest so you can allocate marketing to them appropriately. That's what I worked on, you know, and there's many other use cases, but those are some of their bigger ones. And now, you know, Airbnb is doing generative AI stuff as well, just like any other company.
[22:19]
A
Search ranking is something that we've probably been dealing with, right, for like 30-plus years, I guess, ever since, you know, these search engines came out. So what can we learn from how people are evaluating search and apply to how we're evaluating our LLMs?
[22:37]
C
The number one difference now is, we definitely use search systems to try to get context to improve our LLMs, sure, but the consumer of the search results is now the LLM. In the past, the consumer of search has been the human. And humans and LLMs are good at different things. LLMs are good at finding a needle in a haystack in very, very long, complex windows. Humans have very short attention spans. They'll read through things, but after a few paragraphs they're done. So some of the metrics, like how high up a relevant result was ranked, are much more important for humans than they are for LLMs. You know, you could try to retrieve like 500 results, and even if the relevant one is ranked at like 150 or 200, the LLM will pick it out and figure out how to give that result to the user. So I think the bottom-line differences are, you know, we still want to use the same metrics, but our tolerance has changed slightly. We still want to prioritize recalling the right information. But now that we have long context windows, it's okay for it to kind of be at the end, as long as we carefully, you know, leverage LLMs to go and iteratively refine those search results, pick out the bottom results, and then go and show that back to the human.
[23:57]
B
One thing I'll add on to this is people often wonder, okay, how do I evaluate RAG systems? So RAG is a big thing. And so what is RAG? Retrieval-augmented generation. There's the retrieval part and then there's the generation part. The retrieval part, you evaluate that pretty much the exact same way that you would evaluate any search system. So all those classic search systems and information retrieval, like, that entire scientific discipline can be applied to the retrieval part. So, like, optimizing that, making sure you're getting the right documents, the right context, so on and so forth, a lot of it applies. And Shreya is right, there is some nuance in terms of different tolerances and things like that.
[24:47]
C
Yeah. The number one thing I see that's different is instead of measuring like recall at 10, like we can measure recall at 500 and have a long context model like Gemini, be able to consume those results. At the end of the day, maybe you want to measure precision for the result of the LLM call that feeds in the result to the human. Sure. Whatever humans consume, they should all be measured the same way. But if LLMs are there refining search results, then we can use that to our advantage to be able to re rank outputs from a search engine.
[25:22]
A
So that's search. One of the other sort of really big things in the AI space that I think, Hamel, you worked on a little bit was the precursor to GitHub Copilot, right? LLMs for code generation. What was the biggest problem you guys faced with evals for that?
[25:40]
B
So, okay, things like code generation are actually really interesting. So in evals, and in AI products more generally, you want your domain expert inside the inner loop. So what you don't want to do is give your developer the task of annotating your data and, you know, writing evals, necessarily, because they don't know enough, they don't have enough context. And that really bottlenecks a lot of teams because they don't get that right. But there's an exception. The exception is developer tools. That is one case where the domain expert is the developer. And so that's why we saw developer tools as the first, you know, AI products, because that was the sweet spot, that was where it was easiest to develop. That's one property of, you know, developer tools. The second property is that it's a highly verifiable domain, so there's some structure to code. And it turns out that on GitHub, you know, you have all this code and you also have lots of tests defined against that code. So a lot of time was spent in that verifiable domain. And what's excellent when you have a verifiable domain is to develop kind of a test harness. And the test harness was very impressive. Basically what it did is it took all of the code at scale, not all of it, but a select, like filtered quantity of it, and basically recreated the environment that code is going to run in and ran the tests at scale. And basically it did things like asking the LLM to fix certain things or complete certain code, and it would run all the tests. It's a very impressive engineering effort because you're talking about running all this random software from everywhere with all kinds of dependencies at scale all the time. So the details don't matter. What I want to say is, like, the evals really mattered, because a lot of upfront work went into constructing the eval system. And it's really after that that the team was able to iterate really fast.
Like, when GitHub Copilot was first released internally, it didn't really work. You know, it was something like, you know, 20% or so acceptance rates of suggestions, things like that. And then after the evals, the team was able to iterate really fast and climb that, you know, to where it did work. And so, all right, let me step back. It was really like the key that unlocked, you know, progress on that. You know, I didn't work directly on GitHub Copilot in that phase. Before that, I was actually working on some research that led to some of the benchmarks. So basically, one thing I did at GitHub is, you know, I worked on this project called CodeSearchNet, which is semantic search of code. Basically you type in what code you're looking for and it semantically tries to find it. And it leveraged the fact that a lot of code has comments in it, documentation embedded in that code. So it was just a really large data set. We opened a large data set with benchmarks of code retrieval that was used by OpenAI in their very first iteration of Codex, which is like a very old model. It's not the current Codex. Same name, but a different thing. But it was an eval that was used back in the day. So you can say I've been working on LLM evals for a really long time.
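A toy version of the test-harness idea described above, as a hedged Python sketch. The harness Hamel describes recreated each project's environment and ran its tests at scale; this example only runs short, trusted strings in-process with `exec`, which is an assumption made for brevity and is not safe for untrusted code. The candidate completions and tests are invented.

```python
def passes_tests(candidate_source, test_source):
    """Run a candidate completion, then its tests, in a fresh namespace.

    Any exception (syntax error, failed assertion, runtime error)
    counts as a failure.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True


def pass_rate(candidates, test_source):
    """Share of candidate completions that pass every test."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    results = [passes_tests(c, test_source) for c in candidates]
    return sum(results) / len(results)


# Hypothetical unit tests that already exist in a repository.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

# Hypothetical model completions for the same function.
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # wrong logic
    "def add(a, b)\n    return a + b\n",   # syntax error
]

print(f"pass rate: {pass_rate(candidates, tests):.2f}")
```

Because existing tests act as the verifier, a number like this can be tracked across model or prompt changes without a human labeling each completion.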
[29:52]
A
You mentioned hill climbing, and I think that's a really interesting concept, because I was reading Daniel McKinnon's piece on evals.
He is a PM on Meta's Llama team, and he talks about how it's so important for the PM to define the evals so that the AI engineers and research teams that he works with can hill climb. Can you guys explain that?
[32:04]
C
Engineers are really good at hill climbing, especially ML, AI, data science, stats people, people who are trained to go and look at, you know, ML metrics, figure out why they're bad and how to improve them. They were good at that in structured data and traditional machine learning because those metrics are very well defined. Like accuracy is well defined, loss is well defined. If you're doing a binary classifier and you already have labeled data, it's all there for you. You're just doing your job and trying to improve on those metrics. Now we're in a world in which none of the metrics are defined, and so there's no way to go and exercise all the skills that the AI engineers have to improve those metrics. I think it is so spot on that this article talks about how, you know, PMs need to set these metrics because they have the best context to set them. And once you set them, once you provide the definition or, you know, examples of good and bad, then engineers will figure out a way to encode this into the product and improve it, even get to self-improving products. That is exactly what I see in my experience as well.
[33:14]
B
One thing you have to be careful with, with hill climbing, and when that term is said it does tickle my brain, is to make sure that people don't overfit. Because, you know, engineers sometimes use that term and they are overfitting in some sense. You know, machine learning people know, data science people know. But it is a thing.
[33:37]
A
And what does overfitting look like in practice?
[33:42]
B
Yeah, overfitting means that you have hill climbed against some data or something that doesn't generalize. So the key phrase is "does not generalize." And you know, maybe you have some evals. So one very trivial way to mess this up, and this is from a real client, is you have a test data set of, you know, cases that you want to do well on in your evals, and you use the same data as few-shot examples in your prompt. So that is a hilariously straightforward way of overfitting, because you have given the LLM the answer directly in the prompt. And, you know, we can relax that a little bit. There are more subtle ways that you might do that. Essentially, you know, you might over-engineer your prompt with those specific cases in mind, with details of those cases that are a little bit too specific. You know, even though you don't have the exact information there, that might lead to overfitting. And there are many ways you can overfit. And what Shreya and I teach is, well, how do you know that you overfit? How can you guard against it? Turns out you can use a lot of the same techniques from machine learning to kind of give yourself an early warning system to say, hey, you overfit.
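The few-shot leak described above can be caught with a simple overlap check. This is a minimal Python sketch under stated assumptions: the example inputs are invented, and matching is exact after lowercasing and whitespace normalization, so it will not catch the subtler, paraphrased forms of leakage mentioned in the answer.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hide a leak."""
    return " ".join(text.lower().split())


def find_leaked_examples(few_shot_inputs, test_inputs):
    """Return the few-shot inputs that also appear in the held-out test set."""
    test_set = {normalize(t) for t in test_inputs}
    return [ex for ex in few_shot_inputs if normalize(ex) in test_set]


# Hypothetical data: one few-shot example was copied from the test set.
few_shot_inputs = [
    "Cancel my subscription and refund last month",
    "What is your return policy?",
]
test_inputs = [
    "cancel my subscription  and refund last month",
    "Where is my order?",
]

leaked = find_leaked_examples(few_shot_inputs, test_inputs)
print(leaked)  # ['Cancel my subscription and refund last month']
```

Running a check like this before every eval run is one cheap early-warning signal; a non-empty result means the reported score cannot be trusted.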
[35:12]
A
What are some of those techniques?
[35:15]
C
Number one thing is collect labeled data and reserve some of it for testing. As Hamel kind of alluded to, you want to have some test cases that you want to try out your product on to see if it performs well. Reserve that set and never, never look at it when you're developing your product, when you're developing your prompts, even when you're kind of trying to test out your prompts to figure out how to improve them. Just make sure that the test data is in a sandbox that you never, never see. That's the number one strategy that I tell people. Another thing that I tell people, and PMs should also know this, is anytime you see a metric that is suspiciously high, like "I achieved 99% alignment," or 95 or 100 or whatever, that's an immediate smell, a warning flag that something is off. Maybe we're leaking some data into our prompts from the test set. Maybe our test set doesn't have enough examples. So go back and question the process if you see any numbers that are too high.
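Both pieces of advice above can be sketched in a few lines of Python. The 20% test fraction, the 0.95 score threshold, and the 30-example minimum are illustrative assumptions, not numbers from the episode, and the trace IDs are made up.

```python
import random


def split_dev_test(examples, test_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and reserve a held-out test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def warning_flags(score, n_test_examples, max_plausible=0.95, min_examples=30):
    """Return reasons to distrust a reported eval score (thresholds are assumptions)."""
    flags = []
    if score >= max_plausible:
        flags.append("score is suspiciously high; check for test data leaking into prompts")
    if n_test_examples < min_examples:
        flags.append("test set may be too small to trust")
    return flags


# Hypothetical labeled traces.
examples = [f"trace-{i}" for i in range(100)]
dev, test = split_dev_test(examples)
print(len(dev), len(test))  # 80 20

for flag in warning_flags(score=0.99, n_test_examples=len(test)):
    print("WARNING:", flag)
```

The fixed seed keeps the split stable across runs, so the same examples stay hidden while prompts are being iterated on against the dev portion only.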
[36:21]
B
That's absolutely true. One of my favorite examples is there's an interview that's been going around for a while now. It's from a company called Casetext, which, you know, was bought by, I believe, Thomson Reuters or maybe LexisNexis, one of the two big legal companies. And the CEO went on, you know, and said, hey, like, you need to keep iterating on your evals until you get to 100% accuracy. And so, you know, if you're getting 100% accuracy in your evals, it's likely your evals are worthless because they are providing you with no signal. Just think about anything else. If every student in your class is getting a perfect score, is it a good test? You know, any kind of test, if everybody is passing it and everything is passing, should you be celebrating? No. It means the test is not differentiating good and bad, and it probably is not really worth it. And so, you know, you have to keep that in mind.
[37:31]
C
You want a bit of nuance to it. You want some tests that are, you know, basic regression tests or functionality tests that you absolutely want to pass, because otherwise you're shipping a broken product. But I think the broader point Hamel and I want to make is you also should have some aspirational evals. And those, you know, don't need to be at 0% or, like, designed adversarially, but they should be some benchmark that helps you know whether, you know, the general AI or intelligence in your system is getting better.
[38:08]
A
In talking to so many AI companies about evals, what's one AI company that sticks out to you as doing it particularly well, and why?
[38:22]
C
Everyone's struggling. Okay, I'll talk about some design principles that I see people converging on that are solving parts of their problems. So I don't think anybody is doing it perfectly. I mean, otherwise you would have like a rocket ship AI startup that's already perfect and has gone public and everything works. We just don't have that, right? It's not there. So, let's be honest. Some of the things that I've been seeing that really seem to give teams leverage are, one, building custom data labeling and annotation interfaces, or providing ways for domain experts or other people that they trust to be able to give feedback on traces. A lot of the observability tools for LLMs have started to build in these features. They didn't have them even two or three months ago. So it's going to take some time until we can really see the effect of that. Another thing that I've been seeing is these really well-scoped LLM judges that are being deployed as part of products. Anthropic just had a new post on their research agent about how they really break down the complex task into a multi-agent system. And then, critically, they have a bunch of LLM judges that are very, very well scoped to each task, much like we teach in our course. And I'm sure they're making a bunch of revenue off of this product. And then of course, you know, Cursor famously, very dutifully measures next token prediction. They have their own models for this. I think next token prediction for coding is interesting because that metric is very well defined. People know that it's useful, and it directly correlates with product success. Not all products have a metric like that. PMs need to really go figure that out. But these kinds of things that I've been seeing I think are broadly helpful across the board. And I think we're pretty far out from the product that's perfect right now.
[40:31]
A
What is next token prediction?
[40:35]
C
When you're writing code, it's the next snippet that you're going to write in your code. If the AI model predicted that correctly, then the metric goes up; if not, the metric goes down.
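One simple way to turn that description into a number is exact-match accuracy over predicted snippets. This Python sketch is an assumption about how such a metric could be computed, not a description of how Cursor measures it; the snippets are invented.

```python
def next_snippet_accuracy(predictions, actuals):
    """Share of predictions that exactly match what the developer typed next.

    Leading and trailing whitespace is ignored; everything else must match.
    """
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if not actuals:
        raise ValueError("need at least one example")
    matches = sum(1 for p, a in zip(predictions, actuals) if p.strip() == a.strip())
    return matches / len(actuals)


# Hypothetical logged suggestions and what the developer actually wrote.
predictions = ["return a + b", "for i in range(n):", "print(x)"]
actuals = ["return a + b", "for i in range(n):", "print(y)"]

print(f"{next_snippet_accuracy(predictions, actuals):.2f}")  # 0.67
```

The metric is well defined because the ground truth arrives for free: whatever the developer types next is the label.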
[40:47]
A
Kind of like an advanced autocomplete. Yeah, why haven't we seen better autocomplete in like regular products like email and text message and things like that?
[40:59]
B
It's incredibly frustrating, especially if you're writing code. If you're a programmer using Cursor, and Shreya is using Cursor, I know, like six hours a day, you know, the developer tools are very far ahead of the other tools. And it's incredibly frustrating if you try to use AI in Gmail or in Google Docs or in PowerPoint or anything. And it's even more frustrating because then you open LinkedIn or whatever and it's like, "We have AI in everything." And we're like, no, you don't. Like, you barely do. And yeah, I'm not really sure why.
[41:38]
C
I have some hypotheses. Yeah. So it goes back to Hamel's comment on verifiable domains. Code is a verifiable domain to some extent. You can just make sure the code runs. That's already great. The other thing is there's a bunch of data on how people write code that's already there and available on the Internet, in ways that we don't have for emails. Emails are not very verifiable. Nobody is telling you, like, this is ground truth, this is correct, or this is not correct. PMs have to do this job of figuring out how to design a system around good and bad emails. And that's where it becomes really, really bespoke. And I think it also goes back to Hamel's comment around code, where developers knew what the evals were, so they were able to implement that. This is why we need AI PMs. We need people to be able to help developers craft these verifiable or even loosely verifiable signals.
[42:42]
A
I know you guys have spent some time with OpenAI. How is OpenAI approaching evals and what can we learn from them?
[42:52]
B
So one thing I'll say is there's a really big difference here. There are foundation model benchmarks, so MMLU score, HumanEval score; these are general-purpose benchmarks which try to assess the general capability of models. And then there's another kind of eval, which is your domain-specific evals for your business. And they are very, very different. And it's important that people know that. Now, rightfully so, the foundation model labs like OpenAI are very much focused on the former, the general-purpose benchmarks, because that is, you know, relevant to their product. But I think, you know, they're kind of on the same level playing field as everybody else when it comes to domain-specific evals and, like, how companies should be defining their evals.
[43:51]
C
The other thing is they're not going to do it, because your definition of a good email is different from my definition of a good email. And say we're both building email assistant companies. They're going to be different products, and they should be, because they should reflect our taste and our company's vision. OpenAI just cannot solve both of them with one model. That doesn't make sense. And they're not in that business. They're trying to make the model that you want to hire, for lack of a better term, to do the job. But you need to train it, not really train in terms of model parameters, but you need to figure out how to bake an environment, how to elicit the right signals, how to reject emails that are bad according to your definitions. Like, that's all the stuff that you've got to build that's domain specific, that no foundation model company is going to do for you.
[44:39]
B
Definitely. One example that is top of mind right now is Shreya just wrote this excellent blog post, Writing in the Age of LLMs. And you know, the problem is foundation models, you know, they're trained with sets of labelers, and basically they're by definition trained on the average taste of some people. And, you know, if you're writing a lot, like I know you are as well, Aakash, you know what comes out of the LLM. Almost everybody I know that writes a lot kind of hates it. Like, why can't it be better? Why is it using so many words? Why does it keep using em dashes everywhere? Why is it using so many emojis? Like, stop it. That's because, you know, whoever's labeling the data kind of biases towards longer explanations and overusing bullet points and all this stuff. And so, you know, if you want to have a thing that writers love, then you have to get someone, you know, with a perspective like Shreya has, and iterate on your product over time and make it do those things that incorporate the taste that you have. And so that's kind of where it diverges from the foundation models.
[46:12]
A
So do those wrappers, where you're putting your own taste into the evals, actually have some sort of moat, such that people should consider building those startups?
[46:21]
C
Oh, absolutely. Maybe I'm crazy, but I think evals are the moat for AI products, and truly nothing else. Like, tomorrow you can use a different model, you can have a different stack for serving, whatever it is. The secret sauce is your evals and how you're able to operationalize or scale that out as quickly as possible. So that means having good LLM judges, for example, that are very well aligned with your preferences, that you can just automatically run on everything, and building that flywheel for yourself to then go look at what failed and improve your product and so forth.
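"Well aligned with your preferences" can be checked by comparing a judge's pass/fail verdicts with human labels on the same traces. The Python sketch below reports agreement separately on human-passed and human-failed examples; the labels are invented, and splitting agreement this way is one reasonable choice rather than a method stated in the episode.

```python
def judge_agreement(human_labels, judge_labels):
    """Compare binary judge verdicts with human labels (True means pass).

    Returns the share of human-passed traces the judge also passed, and the
    share of human-failed traces the judge also failed.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    n_pass = sum(1 for h, _ in pairs if h)
    n_fail = len(pairs) - n_pass
    both_pass = sum(1 for h, j in pairs if h and j)
    both_fail = sum(1 for h, j in pairs if not h and not j)
    return {
        "true_positive_rate": both_pass / n_pass if n_pass else None,
        "true_negative_rate": both_fail / n_fail if n_fail else None,
    }


# Hypothetical labels for five traces.
human = [True, True, True, False, False]
judge = [True, True, False, False, True]

print(judge_agreement(human, judge))
```

Reporting the two rates separately avoids a judge that passes everything looking good just because most traces pass.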
[46:59]
B
And it's not just the evaluation itself, it's the whole system around the eval, like Shreya is saying. The entire eval pipeline basically should be portable. You should be able to switch models or, you know, switch components and see what the effect of that is. And the key thing is evals open up a whole lot of doors. So they open up the door for easy fine-tuning. You know, once you've done an eval, you've already done 99% of the work of fine-tuning. Fine-tuning is almost just a formality that you can go through after that, to an extent, because you already have all the tools for data curation. You know how to measure things, you know how to measure the effects of the fine-tuning, and so on and so forth. And I suspect that that just accelerates your advantage that much more.
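A portable pipeline in this sense treats the model as a swappable component. In the Python sketch below, the two "models" are stand-in functions and the checks are trivial code-based evaluators, all hypothetical; in practice the callables would wrap real model calls and the checks would encode the team's own criteria.

```python
def run_eval(model, cases, checks):
    """Run every case through `model` and report the pass rate of each check.

    `model` is any callable from input text to output text, so swapping
    models means passing a different callable and nothing else changes.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    passed = {name: 0 for name in checks}
    for case in cases:
        output = model(case)
        for name, check in checks.items():
            if check(case, output):
                passed[name] += 1
    return {name: passed[name] / len(cases) for name in checks}


# Hypothetical stand-ins for two different models.
def model_a(text):
    return text.upper()


def model_b(text):
    return text[:20]


# Hypothetical binary, code-based checks.
checks = {
    "non_empty": lambda _inp, out: len(out.strip()) > 0,
    "at_most_40_chars": lambda _inp, out: len(out) <= 40,
}
cases = ["short note", "a much longer note that goes on for a while"]

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    print(name, run_eval(model, cases, checks))
```

The same `cases` and `checks` are reused for both runs, which is what makes the before-and-after comparison meaningful when a model or component is switched.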
[47:54]
A
It's kind of funny because I see people tend to focus on fine-tuning a lot more, and there's a lot more attention given to that. I think it's because, like, in the OpenAI and Anthropic developer docs, when they give you a model, they're like, yeah, this is how you can fine-tune it and things like that. How does fine-tuning really fit in with the overall life cycle? When should you be putting attention into fine-tuning?
[48:19]
B
We should put it in last. So you should do evals first because like, what are you, you have to know,
[48:29]
C
Sorry, maybe my connection is bad. There's like lag here. Yeah, you need to know this. This goes back to, you want to iterate on your product. You want to know, okay, my product is not good, so I'm going to improve it. You don't have any way of knowing if your product is not good unless you have evaluation. So evals are step zero: quantify your performance. Then step one is, okay, how am I going to improve its weak points? Some low-hanging-fruit ways of improving are just to use a more powerful model. Switch from GPT-4o mini to GPT-4o and see if that works. Because you have good evals, right, you can just make that switch, run it, and see if your numbers go up. There are other, more complicated strategies you can do, like take a complex LLM call, with a complex task described in it, and break it down into multiple LLM calls. So instead of having one LLM call extracting 10 things from this document, I'm going to have 10 LLM calls, each extracting one of those 10 things. That's what we call task decomposition. When you exhaust these strategies, then it makes sense to move into fine-tuning, where it's like, I have no other way of solving my problem. The model is just not there yet. I'm going to collect a bunch of data, or use the data I've already labeled, and fine-tune a model. Fine-tuning has cons, because now you have to make sure that you're continually fine-tuning that model as you get new data and learn something new about the preferences of your customers. Fine-tuned models don't just automatically update if the base model updates, right? GPT-4.0 or 5.0 or whatever comes out, and now you have to redo all your fine-tuning all over again. And then if you don't use a model provider's fine-tuning service, like you're fine-tuning an open-source model, you need to deal with the MLOps complexity of serving that, making sure it has good uptime and latency is low under load. This is all actually pretty complicated.
It costs a lot to maintain this infrastructure, both in personnel and in money. And using an off-the-shelf LLM is pretty much the right way to go for most people's use cases. So for those reasons, we really try not to encourage people to do fine-tuning unless you really, really have a good reason for doing your own fine-tuning.
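Task decomposition, as described in this answer, can be sketched as one call per field instead of one call for everything. In the Python example below, `fake_llm` is a hypothetical stand-in that just parses `key: value` lines so the example runs offline; a real implementation would send each prompt to an actual model. The prompt wording and the document are invented.

```python
def extract_fields(document, fields, call_llm):
    """Issue one narrowly scoped LLM call per field instead of one big call."""
    return {
        field: call_llm(f"Extract the {field} from this document:\n{document}")
        for field in fields
    }


calls = []  # records every prompt so the number of calls is visible


def fake_llm(prompt):
    """Hypothetical stand-in for a model call: looks up a 'key: value' line."""
    calls.append(prompt)
    instruction, document = prompt.split("\n", 1)
    field = instruction.removeprefix("Extract the ").removesuffix(" from this document:")
    for line in document.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == field:
            return value.strip()
    return ""


document = "Vendor: Acme\nTotal: 42 USD\nDue date: 2025-08-01"
result = extract_fields(document, ["vendor", "total", "due date"], fake_llm)

print(result)      # {'vendor': 'Acme', 'total': '42 USD', 'due date': '2025-08-01'}
print(len(calls))  # 3
```

Each call can then get its own narrowly scoped eval, at the price of more calls per document.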
[50:55]
B
One example where I might go to fine-tuning, hilariously, is this writing thing. Because Shreya and I, we talk about it a lot. Like with writing, we bash our heads against the wall. Even right before this call, it's like, "Oh, did you try 4.1?" And it's like, no, you know, you can't prompt it with all these rules. It's not going to follow these rules. So I'm like, okay, this really feels like there's only one avenue left here. You know, we'd have to put a lot of work into it. But it's like, if you wanted ShreyaGPT, that feels like the only way to get it.
[51:27]
A
Fine-tuning is always talked about in the same conversation as prompt engineering and RAG. So when do you really think about using each of those three techniques?
[51:39]
B
Okay, so, yeah, prompt engineering is basically, you know, the basics for everybody. You're using the LLM, you have to communicate what you want to it. You have to specify what you want in whatever way. So you should always be writing prompts. You don't even need to worry about the "engineering" at the end. We are kind of, you know, adding a lot of ceremony. You're just writing, you know, writing English, or in some cases you can write other languages too, but it is mostly English, and refining your thinking and refining your instructions. RAG is, you know, for anytime you need external context, and there's a very narrow set of use cases where you don't need external context. In many applications you need some external context, you know, that's not general knowledge of the world. And so that's when you need RAG, you know, when you're not trying to bake all that external context into your prompt. And then, yeah, fine-tuning is sort of, hey, like if you can't prompt this behavior, if the model is not doing what you want. Shreya has created this very powerful model called the Three Gulfs. Maybe we should bring it up on the screen, actually.
[53:08]
C
Yeah, this is the Three Gulfs question.
[53:12]
B
Yeah, we should dive into the three Gulfs.
[53:17]
C
I think they're in lesson one.
[53:20]
A
Okay.
[53:21]
C
Oh great, you have them. Oh, you made your own gulf. Nice. Yeah, I can take a step. So your question was, you know, you have all these tools available to you for improving AI products. You have prompt engineering, you have RAG, you have fine-tuning. Are they the same? Are they different? When do I use them? Blah, blah, blah. This is a good question that trips people up a lot, because people are just thinking of them all equally as ways to improve their product. They're all complementary strategies. You can do multiple of them, you can do none of them, you can do one of them, whatever it is. Prompting is very good when you have specific requirements that your task follows and you need to communicate that to the LLM. Say you were to hire some human to do some job or contract some job for your company. You might give them a task specification. Say, do this, then do step A, then step B, then step C. Make sure your output follows these requirements. That's prompting. It almost feels like there's no ceiling sometimes to how much value you can get from good prompting. And this is the first thing that you should be doing. And this solves the gulf of specification problem that Hamel and I see a lot in how people build LLM products, which is, you know, they have some latent, hidden criteria, this hidden task that they want the LLM to do, but they just don't know how to specify it clearly and completely so that the LLM fully understands all of those hidden preferences. Now, fine-tuning and RAG strategies are for the gulf of generalization, which targets this problem of: I have a very good specification, but the model is simply not good enough for some reason. Maybe the model is not powerful enough, so I should do fine-tuning. Maybe the model just doesn't have all of the context that it needs to make the decision. That's where RAG comes in. I need to pull in external data sources to improve my product.
[55:35]
A
But yeah, very helpful.
[55:38]
B
And I would submit to you. I know that you ask this question a lot in your podcast because I think it's confusing, rightfully so. Yes, this question, I would say this might be a good framework for answering that question. You know, since it comes up a lot.
[55:56]
A
Yeah, I try to play the role of the watcher. Right. And they keep asking me that question, so I got to ask the experts.
[56:04]
B
Makes sense.
[56:05]
A
If you guys had to build a roadmap for people who wanted to get really deep on AI evals, what topics should they learn?
[56:12]
B
Great question. So Shreya and I have developed this course reader. It's actually a very extensive set of notes. One might even call it a book. It probably will become a book. It's 150 pages. This is the detail that we go through in our course. We arm our students with a lot of information to make sure that they get materials in many different ways, including, you know, live instruction, office hours, but also this very detailed sort of treatise on how you go through evals as a step-by-step process. So we start off with, you know, what is evaluation? We go through the Three Gulfs framework that we just described. You know, we talk about why you need evals, we motivate that. Then we kind of do a little bit of an overview of, okay, the strengths and weaknesses of LLMs, you know, what kinds of things you need to intuitively understand when you're doing evals, before you even get to evals. The third chapter is very important. It's probably where we spend a lot of our time in practice, and a lot of people don't know about this step. It's called error analysis. So what is error analysis? You might hear Shreya and I talk a lot about looking at your data. We keep beating people over the head with this phrase: look at your data, look at your data. What does that mean, look at my data? What data do you look at? Do you look at all your data? Do you just quit your job and look at your data and do nothing else? You don't have time. How do you look at your data? And what do you do when you look at the data? Like, how do you make sense of it? How do you make it tractable and learn something from your data? What if you don't have any data? Like, what data am I talking about? What if you haven't built anything yet? The key thing is, you need to look at some sort of data and go through a structured process. It's not as painful as it sounds, looking at data. It's actually really beneficial.
A lot of my clients, you know, we have a plan to go through this whole process, and they get so much value just out of error analysis that they just. They're like, this is amazing. I'm done. I'm like, wait, what do you mean you're done? Like, I can do all this other stuff for you. Like, no, no, this is great. Like, I. This is like, I'm busy for a while. Error analysis has taught me so much. And, you know, error analysis is, you know, to go to error analysis, we can scroll through. Let me see if I can. There's a diagram here. But, you know, this is. This is like one way to generate synthetic queries, for example. It's hard to dive into the details just without the context. But we go through the process of, like, how to generate, you know, synthetic data, you know, and then also we go through, like, how to look at your data and, you know, so we describe it in a lot of detail. A lot of it is supplemented with videos. Let me see if I can scroll up maybe. Shreya, you can.
[59:28]
C
Yeah, that diagram is pretty good.
[59:30]
B
Yeah, this one. So there's a concept called open coding and axial coding. Shreya, you want to tell everyone where that comes from and the history behind it?
[59:42]
C
Yeah, definitely. So in creating this curriculum, we took a lot of inspiration from social science research, actually, because we were thinking, okay, in what field and domain do people need to look at vast amounts of unstructured free-form text and labels and come up with meaningful, actionable insights out of it? Well, turns out social science researchers have done this for a very long time. There's this process called grounded theory that gives a systematic structure to the process. First, what people do is go through their open-ended data, outputs, or whatever it is. They read them and they write free-form notes, open codes is what they call them, on what's good, what's bad, any themes that emerge, whatnot, on a trace-by-trace basis. Then after going through about 100 of those, they will try to merge similar notes together into clusters. This determines your failure modes. If you find that a bunch of notes around a specific type of hallucination get merged together, well, looks like that's a huge problem that you need to solve in your product. And this kind of loop keeps going until what we call theoretical saturation in qualitative research, which means I'm not learning any new failure modes. I keep looking at data and I just keep adding to my existing failure modes. At that point you can kind of stop, and then you can move on to, okay, how am I going to turn those into automated evaluators so I don't have to do it all the time? Maybe I'll build LLM judges based on my labeled failure modes. Maybe I'll write some code-based evaluators for things that code can check. And we find that we teach people this error analysis diagram, and then suddenly they're like, oh, this is great. Okay, bye. It's like, I mean, I guess, right? Like, it makes sense. This is where most people are actually bottlenecked, right? Because traditional machine learning people didn't have to do this. They didn't.
When they looked at their data, they would just look at all these tables of features, which are all numbers. And the outputs are also numbers. And so there are ways to kind of debug those. LLMs are different because now the inputs are freeform text and the outputs are freeform text, and people are like, I don't know how to do that. It's new technology, it's new skills that you need to be able to make sense of that. But fortunately, once you do this, then you can go and implement automated evaluators as you might with traditional machine learning. And then you can do improvement strategies, you can do fine-tuning, all of these things like a traditional machine learning person would.
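The open coding and axial coding loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trace IDs, notes, and failure-mode names are made up, and the "saturation" check is a rough stand-in for the judgment call a reviewer would actually make.

```python
from collections import Counter

# Hypothetical data: (trace_id, free-form open code, failure mode it was merged into).
coded_traces = [
    ("t1", "invented a feature the product does not have", "hallucination"),
    ("t2", "answer is three paragraphs for a yes/no question", "verbosity"),
    ("t3", "cited a policy that does not exist", "hallucination"),
    ("t4", "ignored the user's requested date", "missed constraint"),
    ("t5", "made up a phone number", "hallucination"),
]


def failure_mode_counts(coded):
    """Count how many traces fall into each failure mode (the axial codes)."""
    return Counter(mode for _, _, mode in coded)


def new_failure_modes(known_modes, batch):
    """Return failure modes in a new batch that were not seen before."""
    return {mode for _, _, mode in batch} - set(known_modes)


counts = failure_mode_counts(coded_traces)
for mode, n in counts.most_common():
    print(f"{mode}: {n}")

# Rough saturation check: a fresh batch of notes that adds no new failure modes.
next_batch = [("t6", "fabricated a refund amount", "hallucination")]
saturated = not new_failure_modes(counts, next_batch)
print("saturated:", saturated)
```

The most frequent failure mode in the tally is the natural first candidate for an automated evaluator.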
[62:28]
B
One thing I want to share that might be useful here, or interesting perhaps for a product manager. We have Teresa taking our class, and just today she was discussing a question: okay, I love the class too; with open and axial coding, are we analyzing opportunities towards an outcome? And she responds, I've been thinking about this a lot, and then she has some comments about generating synthetic data. But she also recognizes, and this is really fun, this is why we have product managers in our class, this kind of tying together of things: okay, identifying opportunities from interviews is based on grounded theory also. And then she's experimenting with methods for capturing annotations directly from the customer. And she also points out that, okay, this is the scientific method applied: look at the data. So I think product managers have a lot to offer here, bringing their customer and user research into the whole process. That's why we think it's really important that they are in the driver's seat of these evals.
[63:42]
C
I will go even further to say that I think that PMs are vital in building AI products. We're not going to have successful AI products across different domains unless we have good AI PMs. I cannot emphasize that enough. Without good AI PMs, we're just going to have failed AI products. So I hope this is a call to action for PMs. This is a skill that you absolutely need to develop. It's going to set you apart, obviously, from a career perspective, but also we need this in order to really realize this vision of AI products changing people's lives.
[64:24]
B
Okay. So to get back to the things to know. Once you do error analysis, now you have a grounded way of knowing what to focus on. Because the question is, what do you even evaluate? You can come up with millions of metrics like hallucination score, toxicity score, conciseness score. You can name these scores all day long and you can get really confused and overwhelmed and say, oh, my goodness, I can't do it. So error analysis is very important. You would be surprised: in our Discord channel, almost every other question is answered with the phrase error analysis. Because one of the main troubles people have with evals is, what do you eval? You can eval anything. There's an infinite number of ideas of things that can go wrong. Instead of just being paranoid and working yourself up about what can go wrong, you should be grounding it in things that are going wrong or the things that will probably go wrong. And that's what error analysis helps you with. It helps you focus on high-value things, because evals are not free. It takes a little bit of effort. And so this entire time that you're doing evals, there has to be a cost-benefit analysis around, like, what do you even eval? Should you be evaling? And error analysis is the answer to all those questions. And error analysis is so valuable that a sizable portion of my clients, they sign up with me to go through this whole evaluation process, and they do the error analysis part and they're like, great, Hamel. We love it. We're done. I'm like, no, wait, I have all this other stuff. All this other stuff in this table of contents I can take you through. They're like, no, this is so much value. We're so happy. And it's true, because error analysis is this thing about looking at your data. 
It actually drives you to develop a very deep intuition about your system and what's going wrong with it. And it makes you develop a nose for everything and kind of gives you a sixth sense of what is even going to break, if you're doing it enough. And so it's extremely powerful, probably the most powerful technique in evals. So I can't highlight it more than that. But, okay, once you move past error analysis, now you know what is wrong and where to focus and what to prioritize. Now, how do you actually go about the process of writing the evals? And so that's where these other chapters come in around, okay, who writes the evals? Who does the annotations? We have some strong guidance in here. We're not afraid to cut right to the chase and have opinions. And so one opinion that we have, for example, of the many opinions, but I'll just highlight it here, is benevolent dictators. People always ask, well, who is going to make the final call of whether or not this is a pass or fail? Or who is going to be the one writing the eval? Who's in charge? I have a whole team. Do I just involve the whole team? The answer is no. You need to have a benevolent dictator in most situations. And there's some reasoning behind that. It's like the binary classification thing. It forces you to make a decision, and it makes the whole process tractable, because it can quickly become intractable if you add various complexity. So going back to the table of contents, let me scroll back up. Feel free to interrupt me anytime. So we talked through making this process automated with LLM as a judge. Now with LLM as a judge, you're writing prompts and you're asking an LLM to grade something. 
Now almost everybody, I would say 80% of people who are using LLM as a judge, are just writing a prompt and just praying that it is doing the right thing. But that's not what you should do. The right way to use an LLM as a judge is to get your labeled data, which you've already done in a previous step, that's what we're teaching, and measure the LLM as a judge against your labeled data to understand, is the judge any good? Number one. But number two, more importantly, you have to iterate on your LLM as a judge. You have to see, where is it wrong? And it's always the case that not only do you adjust the LLM as a judge, but Shreya has a lot of research that shows that the annotator also adjusts their requirements on seeing the LLM output. Shreya, you want to talk a little bit about that?
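Measuring a judge against labeled data, as described above, can be sketched as follows. This is a minimal illustration, not the course's own tooling: the function name, the boolean pass/fail encoding, and the example labels are all assumptions. It reports how often the judge agrees with the human on passes and on fails, and which traces to inspect when iterating on the judge prompt.

```python
def judge_agreement(human, judge):
    """Compare binary judge verdicts to human labels (True = pass).

    Returns the true positive rate (judge passes what humans pass),
    the true negative rate (judge fails what humans fail), and the
    indices where the two disagree.
    """
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    disagreements = [i for i, (h, j) in enumerate(pairs) if h != j]
    return {"tpr": tpr, "tnr": tnr, "disagreements": disagreements}


# Hypothetical labels for six traces.
human_labels = [True, True, False, False, True, False]
judge_labels = [True, False, False, True, True, False]

report = judge_agreement(human_labels, judge_labels)
print(report)
```

Reading the traces listed under `disagreements` is where the iteration happens: either the judge prompt is wrong, or the rubric itself needs to change.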
[69:34]
C
Yeah, I have a paper called Who Validates the Validators, at UIST 2024, if people are interested in the venue. But yeah, one of the things we did was we built an interface for people to go and annotate outputs to train LLM judge evaluators. And then we found that people will keep changing the rubric as they encounter new failure modes when looking at the data. And this underscores how tricky it is to figure out what the rubric is, how you need to go through this iterative process, and how sometimes, when the criteria are a little bit too subjective, maybe you want to have multiple people weigh in on what correctness means. And there are ways to do this, like the inter-annotator agreement metric that we talk about, Cohen's kappa, in this course reader, and a lot of other stuff. Point is that we just provide a lot of frameworks to really systematically solve the problems when you encounter challenges in coming up with good labels and ground truth.
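For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it by hand for pass/fail labels; the annotator data is made up for illustration, and the function name is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a.keys() | freq_b.keys())
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pass (1) / fail (0) labels from two reviewers on eight traces.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the reviewers agree on 7 of 8 traces (p_o = 0.875) while chance agreement is 0.5, so kappa is 0.75. A low kappa is a hint that the rubric is too subjective and needs to be tightened before a judge is built on it.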
[70:38]
B
And so we teach you a lot of things, like also how to correct the LLM-as-a-judge error rate based upon your real error rate, and all kinds of advanced things. We also teach you how to think about multi-turn evaluations. Like, okay, if you have a long conversation, that's multiple turns between an AI and a human. How do you actually evaluate that? Do you evaluate the entire conversation? One piece of advice we have is, when you're doing open coding, for example, you should stop on the first error that you see, because these things have causal relationships. And it's usually the case that the first error is the blocking problem. And so a simplifying heuristic is to anchor on the first error. And there's all kinds of heuristics like this that make the entire process tractable. But also when it comes to evaluating multi-turn conversations, how do you actually have test data sets for that? How do you simulate a multi-turn conversation? Or do you need to simulate a multi-turn conversation? There's a lot of things to dive into there. So we cover that. We also cover how to evaluate retrieval-augmented generation. And so we talk about when to treat certain aspects of the problem like a search problem, what components of the problem are like a generation problem, how to think about those separately, and all the nuances there. And then we talk about other specific architectures and components. So, how to think about tool calling, what you need to know about agents. Agents are systems that can be very dynamic. They might have trajectories that are unpredictable and do various things that you can't anticipate. How do you evaluate that? How do you tame that monster, that spaghetti? It turns out you can use various analytical tools that simplify the problem and allow you to attack it and make sense of agentic systems. 
And we show you things like how to use transition matrices and other analytical tools to help wrangle that problem. And then we talk about how to evaluate specific input data modalities, different types of modalities, not just text. Then we talk about productionizing these things. So like CI/CD, kind of automating these and running these at scale. And then we also talk about probably my second favorite subject, in chapter 10, which is interfaces. And I'll give it to Shreya to talk about the interfaces, actually.
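One way to read the transition-matrix idea is to count, across agent traces, how often each step is followed by each other step, and then look at which steps most often lead to failure. The sketch below assumes each trace is a list of step names ending in either an answer or a hypothetical "FAIL" marker; the step names and traces are invented for illustration and the course may frame this differently.

```python
from collections import Counter, defaultdict

# Hypothetical agent traces: ordered step names, ending in "answer" or "FAIL".
traces = [
    ["plan", "search", "read", "answer"],
    ["plan", "search", "search", "answer"],
    ["plan", "search", "read", "search", "FAIL"],
]


def transition_counts(all_traces):
    """Count how often each step is immediately followed by each other step."""
    counts = defaultdict(Counter)
    for steps in all_traces:
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1
    return counts


def transition_probs(counts):
    """Row-normalize the counts into transition probabilities."""
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: c / total for nxt, c in row.items()}
    return probs


counts = transition_counts(traces)
probs = transition_probs(counts)
for state, row in probs.items():
    print(state, row)
```

Rows with a high probability of transitioning into the failure state point at the step worth inspecting first.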
[73:34]
C
Yeah, chapters 10 and 11, they're the last week of the course and they're kind of like special advanced topics that, I think I can guarantee, you won't find anywhere out there on the Internet. So we wanted to do something special for our students. Chapter 10 is about how to build, perhaps even vibe code your way to, an effective interface that has people labeling things very quickly, maximizing the throughput of how many labels you can get from human reviewers. We talk about principles there, and case studies of good and bad interfaces. And then chapter 11, I think, is about improvement: obvious strategies that you might know about, like decomposing tasks, fine-tuning models, so forth, but also cost optimization. A lot of people actually say, oh, my pipeline is working but it's too expensive, especially if it's working on really long documents. How do we keep the same quality but reduce the cost by an order of magnitude or even more? Well, we talk about some cost optimization techniques there. So that's a very long answer, I think, to your question on what's the roadmap like. Here is a roadmap; we think it's pretty good. There are a lot of places you can dive in; you can dive into each chapter that you're interested in. And chapters 10 and 11 here, I think, are pretty open territory in research to explore more as well.
[75:02]
B
So upon seeing this book, a natural question is, where can I get the book, or are you going to release this book? And so the answer to that question is eventually, yes, maybe early next year. However, if you want that kind of hands-on approach of doing this in a very guided way, then of course take our course.
[75:27]
A
That's actually what I wanted to ask next. Hamel, you were working at Airbnb and GitHub. Why did you transition into consulting and courses?
[75:37]
B
Good question. It kind of happened somewhat organically. After GitHub I worked at a few startups, then decided to take a sabbatical, and a company decided to contract me. The name of the company was Weights & Biases. They convinced me to do some consulting work for them. And I always thought that I would hate consulting, because I did consulting with a large consulting company very early in my career, before tech. And I realized, hey, I really like it when you're doing it on your own, becoming like an independent, indie kind of entrepreneur. It's very enjoyable, it's a lot of freedom, and I found that a lot of people found it very valuable in terms of helping them navigate LLMs and AI. And then I just started really enjoying it, and one thing led to another, and here I am. There's no, let's say, grand plan as such. I just happened to be in this place, and it happened that everybody building with AI, they're always stuck on evals, every single time, on this exact same problem. Almost every single time, I needed to start with error analysis. And this is working with 25, 30 companies. And then I just started writing about it a lot, and it was really apparent to me that there's not that much education on this topic, and it's where everyone's struggling. And consulting is expensive. And so that's why we created this course, to make this way more accessible to a larger audience.
[77:36]
A
And you tell people who want to hire you as a consultant to not reach out if they can't spend $38,000.
[77:44]
B
Yeah. So I think one thing that you have to do as an entrepreneur is a lot of different things. One is, it can be important to have a niche so that your customers know what you can help them with. So in this case, mine is evals. And just from a practical perspective, there's a lot of overhead that comes with being an entrepreneur. You're in charge of your own sales, your own marketing, all the administrative overhead of it. And I don't want to be on sales calls all day. You have to qualify people, and the problem has to be painful enough for them to want to solve it. And so that's just a basic kind of blocking-and-tackling approach to say, okay, how do I not drown in Zoom meetings?
[78:44]
A
And then the course, it costs roughly $2,000 and you have over 600 people in this cohort that me, Teresa, Pavel, other people are participating in. So does that mean you earned over a million dollars on this cohort of the course?
[79:01]
B
No. So we didn't quite cross the million-dollar threshold, because we gave a lot of stuff away for free. A lot of friends, I let all my friends in, basically. Shreya.
[79:14]
C
We're also very generous with discounts for people who have any need. Basically, if your company can't reimburse it and you don't want to pay fully out of pocket, just email us with how much your company can reimburse. We are super flexible. We want people to join the course. We have a steep price point because we find that we want to attract people who are serious about building AI products and evals. It benefits nobody if our course has 10,000 people and everyone's only mildly interested and not serious. We want to focus on the people who are going to go and put these techniques into production, who are going to go build products. And for that, it turns out, you need to have a steep price point.
[79:57]
B
So almost. Yeah, almost a million. I won't dodge the question. It's almost a million: 800k is what it was, first cohort.
[80:08]
A
Which is insane. That's mind blowing, right? Especially for the first cohort of a course. And what's even more insane, I think, is that your next cohort is going to be your last. Why?
[80:22]
B
Yes. So I don't want it to be, you know, how you like, watch a movie and like the first one is good and usually, like, it's really hard to make a good sequel when this is like really impossible. Like to make it like, continue to be good. And honestly, like, you know, I don't want to. You know, I have a lot of different interests. But also there's a lot of things that we can do with the course. So we could, for example, you know, you see this like 150 page. This is just a draft, honestly. I mean, we keep iterating on this, on this book that we have open right now. So one, you know, one sort of motion is to like, okay, focus time on writing a book. Then can, you know, it's currently. It's a $2,300 course. We're going to actually be increasing the price. $2,500 on Friday. You know, can there be a recorded version that doesn't involve us, like live, that doesn't involve any office hours or anything else? You know, that's like lower cost. Okay, that's fine. So we think like, there's different products out there that we can offer or different ways that we can offer this. But also like, you know, this course takes an immense amount of time to deliver. And so one of the things that we like to do is actually build these AI projects. Like, I like to work with customers and build these things. I know Shreya does too. She likes to do research on this topic. Her research is very applied. She works with companies on this. She builds tools in the space. And it's really. Yeah, if we do the course over and over again like this, we wouldn't be able to do that. It would actually dilute the course. So we want to keep it special. We want to, we want people to feel like, okay, they were here for this course, something they could remember it, you know, Last year, I had a course like this. It was not on this. It was on fine tuning. It was. It was provisionally about fine tuning and then became like, basically a conference, but it was basically the same scale as this course. 
It was like also a $900,000 revenue course and one cohort. And, yeah, we decided, like, we don't want to do that again, because how can we possibly recreate that. That subject and that feeling of that time? You know, we don't want to do a second one and just have it be underwhelming because we try to repeat something. So, like, you know, this is different. This is like. That was, you know, kind of a conference in a way. It turned into a conference. It was just like everybody about anything about LLMs. It was really fun. You know, this is like, really focused on evals. We think that, okay, there's a lot more repeatability to this. But, you know, our goal is not necessarily, you know, trying to milk all the money out of it, per se. We're actually like, you know, all this. This money that made. We actually, like, going to invest it in all these things. So, like, writing a book, you know, doing this additional, like, different kinds of courses like that are. Can be delivered in different ways. You know, it's all going to be reinvested into that to make this material, like, accessible, because we really believe we have a lot of conviction in this message and this topic and the impact of learning this.
[83:46]
A
I think there's a lot of lessons there for aspiring and current course creators, but I think one of the most interesting ones is how timely you've made this course. You're addressing the problems that you're seeing with your consulting clients, and that AI engineers and AI PMs are facing with evals today. So you're able to write about it that way. You're not trying to just create timeless content. And I think that makes it really effective.
[84:12]
B
Yeah, I hope so. I mean, people message me, they're like, oh, this is a lot of money for like a month or whatever. But then I remind them, I've been writing about this for two years. I think Shreya has reviewed all of my blog posts throughout the years. I send it to her and she reviews it. I'm really grateful for that. And she's been writing about this for many years as well. And by some stroke of luck, I was able to convince her, nerd-snipe her also, and say, okay, let's do this course together. And I wouldn't have done it if she said no. It was only an, if she does it, then I will do it, kind of situation. Because if you look at Shreya's writing, it's actually really impressive. It's a very good complement, because I'm really focused on consulting and stuff like that. I don't have time to survey the field and bring in this generalized perspective and theory. But there's a really good complement there in what Shreya brings to structuring this material, bringing in a lot of the kind of, I mean, I didn't even know what axial coding was. I was just doing axial coding, open coding. I didn't know it had a history in social sciences. So it was really good to know that there's actually some history of this working. So things like that. I mean, I don't even know if I'm answering the question anymore.
[85:47]
A
There was no question, so that was great. If you were to give advice to somebody who wanted to create an $800,000-900,000 course, what would be your advice to them?
[85:59]
B
Yeah, so one is. Okay, so a lot of people ask me this question. Let me try my best. One is to really find a niche that you are passionate about. I didn't think I was going to create a course around this, honestly. I've been writing about this forever. In fact, evals is one of the most unpopular things you can possibly write about or talk about. It's deeply, deeply unpopular.
[86:28]
C
Yeah, I really want to drive that point home. Had we talked about something very cool and sexy, like, I don't even, there's so many buzzwords out there, like MCP and agents and stuff, right? That is an easy topic to write a course on, because everyone wants to learn how to do it. Evals is something that everybody knows is a problem but doesn't want to go and put in the effort to do. So I think my number one advice would be, if you want to do it the easy way, find a problem that people are excited about going and actually doing the work for, because you could probably 10x what we did for evals if you pick an exciting topic. But Hamel, you should continue.
[87:10]
B
Yeah, I mean, I actually went through several nights or weeks or months or whatever of thinking to myself, man, am I swimming upstream? I talk about evals. I built my consulting around evals. But it's like telling people to eat their vegetables. It's not really that popular. And it's much easier to talk about agents. If you did some background research, if you came at it from the perspective of, hey, I want to build a course, you would never pick evals. You would pick agents. You'd be like, okay, I'm gonna sell an agent course. Everyone wants to learn about that. People are actively googling it. If you do SEO research on what keywords people are searching, almost nobody is searching for evals. There's very little competition for that keyword. Actually, if you search for evals, Shreya, myself, Eugene, basically Shreya, me, and our friends, meaning the friends that Shreya and I have in common, we all come up at the top of the list. It's pretty insane. That's how niche, or quote niche, it is. But at some point I just didn't care. I was like, we need to create the category. So it's not popular; you need to make it popular. I don't know if that's good advice. That was my mindset. But I can tell you the only reason I was able to get into that mindset is because Shreya partnered with me. And so one thing I would say is, find a good partner to partner with you. So, to get into the mechanics: with a lot of course sales, marketing is very important. You need to let people know that a course exists. Otherwise they can't buy the course. They don't know. And they not only need to know it exists, they need to know what it's about. They need to get excited about it. 
They need to know it's good. They need to have some social proof, or some idea that people like it, that their peers get value out of it, all of these things. And they need to be reminded about it constantly. Because I need to be reminded about things constantly, for anything. So all of this is very, very important. And that's a full-time job. And so if you want to sell a course like that, you have to spend a lot of time doing that. And can you outsource that marketing? You can definitely partner with people in marketing, but you kind of have to drive it, because you are the subject matter expert, you're the domain expert. You have to build that authority and that connection with your audience. So, just relentless focus on that.
[90:10]
C
But he's also not talking about all of the work that he spends putting together a lot of really great guest speakers, really thoughtfully figuring out, okay, we're talking about X, Y and Z; so-and-so is the world's leading expert on X; Y has been applied in these three companies, so let's get somebody from there. The guest speakers are really diverse and I think really distinguish our course from a lot of other courses out there. It's not just us monotonously lecturing. We've got other people coming in showing you how these things are implemented in practice at big companies, at small companies, in specific domains or whatnot.
[90:55]
B
Another thing is focus. Yeah, this is like I get up in the morning, all I think about is the course and how can I let more people know about the course or like, you know, make it better for students, how can I bring more perspective into there, so on and so forth. Yes, all this stuff is really important and just constant experimentation. So kind of going back, it's kind of like a, you know, I'm a data scientist kind of thinking person. I'm just constantly doing experiments and looking at the data. And you know, I send Shreya like probably the average of 200 text messages a day or something like that with like different experiments. Like I'm like, oh, I'm trying this. I'm doing this thing. Look at these numbers. Okay, like what do you like this marketing copy? This here, I put it like you put an ad here, this will. This is what happens. Do you have any ideas for like how we can go about this differently? Whatever, you know, hopefully she has do not disturb on I hope but you know, so yeah, I mean this is like constant experimentation. So you're kind of treating it like how I treat evals to be meta is like, you know, this like iterating on, you know, the business of making the course and the thing that's like really good about making money in a course is like you can. I've hired a lot of my friends, you know, and so like, you know, a lot of my friends are data scientists, machine learning engineers and you know, a lot of them are involved in one way or the other either as affiliates or guest speakers or TAs. And I pay all of them. And it's great. I love paying them. And yeah, it's like really great. Like, all my friends win too. And that's like, the good part is, you know, to create that situation where everybody wins.
[92:49]
A
What a great place to end it. If people want to find you guys online, where should they go?
[92:54]
C
Yeah, we're both active on X. You can pretty much just search us up. You'll find us. I have a blog. Hamel has a blog. Email us. Yeah, nothing super fancy, I think.
[93:09]
A
Where do they find your email?
[93:13]
C
The best way to do it is just to Google us, and our email is at the top of the page. But my email is first name, last name: shreyashankar@berkeley.edu.
[93:29]
B
And mine is, just get in touch with me on X. It's easy. You can send me a DM there, you can message me, whatever, and I'll figure it out.
[93:39]
A
Amazing. Thank you guys so much for lending your expertise. We'll see everybody in the next one. So if you want to learn more about how to shift to this way of working, check out our full conversation on Apple or Spotify podcasts. And if you want the actual documents that we showed, the tools and frameworks and public links, be sure to check out my next newsletter post with all of the details. Finally, thank you so much for watching. It would really mean a lot if you could make sure you are subscribed on YouTube, following on Apple or Spotify podcasts, and leave us a review on those platforms that really helps grow the podcast and support our work so that we can do bigger and better productions. I'll see you in the next one.