AI Evals Explained Simply by Ankit Shukla

The Growth Podcast

Thu Feb 19 2026

Summary

The Growth Podcast — Episode Summary

Episode: "AI Evals Explained Simply" with Ankit Shukla

Host: Aakash Gupta
Date: February 19, 2026

Episode Overview

In this “masterclass” episode, Aakash Gupta is joined by Ankit Shukla—AI PM educator and practitioner—to demystify AI evaluations (“evals”) for product managers. The episode breaks down what evals are, why they’re critical to shipping successful AI features, and walks through a rigorous, actionable framework (with a real-world case study) for implementing evals in generative AI products. The show highlights the nuance, strategy, and collaboration needed, moving beyond basic QA to show how evals underpin quality, trust, business impact, and compliance in AI-driven products.

Key Discussion Points & Insights

1. Why AI Evals Matter

AI models are not deterministic: Unlike traditional software, the same input in a GenAI system may produce different outputs, requiring constant assessment of behavior, quality, safety, and business alignment.
- "It is almost like a lion in a circus... as a ringmaster, you need to make sure that you are able to tame that behavior and show a good product." — Ankit (04:16)
Without evals, products “lie” to you: Shipping AI features without evaluation means flying blind and exposes products/customers to unpredictable behavior.
- "If you are shipping AI features without evaluations, your product is lying to you and you have no idea." — Ankit (00:00)
Evals = the new PM skill: Learning to design and apply evaluations is now essential for AI product management.
- "Most PMs have no idea about how to write evaluations... I'll give you a real case study so you can plan your evals like a pro AIPM." — Ankit (00:44)

2. What Are Evals? Simple to Advanced Forms

Evals as tests for AI systems: Ranging from simple unit tests (e.g., word counts, code checks) to LLM-based judges that emulate human assessment.
- "An eval can be anything...a unit test in the old language...or in the most advanced form, an LLM judge replicating some of that human intuition." — Aakash (10:04)
Case Example: Job Website Enrichment Tool
- Extract job descriptions from major sites, use an LLM to generate summaries, interview questions, skills, guides, and quizzes.
- Evals check for correctness, relevance, length (summary < 300 words), relevance of skills, actionable guides, hallucination, and use code, LLM judges, or humans depending on the attribute.
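
The length criterion above is the kind of attribute that plain code can check. A minimal sketch, assuming the limit is counted in whitespace-separated words; the function name and sample strings are hypothetical and not from the episode:

```python
def summary_within_limit(summary: str, max_words: int = 300) -> bool:
    """Code-based eval: pass only if the summary has fewer than max_words words."""
    return len(summary.split()) < max_words


# Illustrative inputs (made up for this sketch).
short_summary = "Senior PM role focused on growth experiments and analytics."
long_summary = "word " * 350  # deliberately over the limit

print(summary_within_limit(short_summary))
print(summary_within_limit(long_summary))
```

Relevance, actionability, and hallucination are not checkable this way, which is why the notes pair code checks with LLM judges or humans.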

3. Why Do AI Prototypes Fail? And Where Do Evals Help?

(Timestamp: ~12:15–18:00)

Five Key Failure Modes:
1. Data Drift
2. Cost considerations
3. Engineering limitations
4. Lack of guardrails (no feedback loops, fallback logic, legal compliance)
5. Collaboration failures
Evals address:
- Data Drift: By running evals online and offline, you track if the AI's performance erodes over time.
- Cost: Evals let you confidently select cheaper models if they meet your eval criteria (e.g., swap GPT-5.1 with GPT-Nano or open-source alternatives, see 25x price difference discussion at 16:00).
- Guardrails: Use evals to enforce expected/forbidden outputs, legal constraints (e.g., AI can't give financial advice).
- Collaboration: Evals make explicit what success means, promoting clarity between PMs, engineers, and subject matter experts.
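
The cost point can be made concrete with the two output prices quoted in the episode ($10 and $0.40 per million output tokens). The sketch below is only arithmetic; the monthly call volume and tokens per call are assumptions chosen for illustration:

```python
# Output prices quoted in the episode, in USD per million output tokens.
PRICE_LARGE = 10.00  # quoted for GPT-5.1
PRICE_SMALL = 0.40   # quoted for GPT-Nano


def monthly_output_cost(price_per_million: float, calls: int, tokens_per_call: int) -> float:
    """Output-token spend for a month of traffic at a given price."""
    return price_per_million * calls * tokens_per_call / 1_000_000


CALLS = 1_000_000      # assumed monthly call volume
TOKENS_PER_CALL = 200  # assumed average output length

large = monthly_output_cost(PRICE_LARGE, CALLS, TOKENS_PER_CALL)
small = monthly_output_cost(PRICE_SMALL, CALLS, TOKENS_PER_CALL)
print(f"large: ${large:,.2f}  small: ${small:,.2f}  ratio: {large / small:.0f}x")
```

The ratio is independent of the assumed volume, which is where the 25x figure comes from.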
"Evals are not optional. They are the guardrails for all the AI-driven outcomes." — Ankit (60:42)

4. End-to-End Framework for AI Evals

(Timestamp: ~22:10 onwards)

A. Start With Success Criteria & Expected Behavior

Define what “good” looks like in your product and what must not happen. Example from finance: answers must be <150 words, no buy/sell advice, factual, grounded from specific data sources.
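
Two of the finance criteria above (length and no buy/sell advice) can be written down as checks. This is a rough sketch, not the method used in the episode: the phrase pattern is a hypothetical, deliberately narrow keyword match, and a real compliance check would need a reviewed list or an LLM/human judge.

```python
import re

MAX_WORDS = 150

# Hypothetical pattern for direct trade recommendations; far from exhaustive.
ADVICE_PATTERN = re.compile(
    r"\byou should (?:buy|sell)\b|\b(?:buy|sell) (?:this|the|that) stock\b",
    re.IGNORECASE,
)


def check_answer(answer: str) -> dict:
    """Return one pass/fail flag per success criterion."""
    return {
        "within_length": len(answer.split()) < MAX_WORDS,
        "no_trade_advice": ADVICE_PATTERN.search(answer) is None,
    }


print(check_answer("The P/E ratio is 24, which is above the sector median."))
print(check_answer("You should buy this stock now."))
```

Factuality and grounding are not covered here; those criteria need the dataset and judge steps described next.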

B. Map Success Criteria to Evaluable Metrics

Metrics can include:
- Output quality & accuracy
- Factual/citation correctness
- Regulatory compliance
- UX metrics (e.g., latency)
- Specific guardrails

C. Build a Dataset of Inputs & Expected Outputs

Collect real production data, research common user questions, generate synthetic data with LLMs, consult subject matter experts.
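
One way to keep such a dataset organized is to record where each case came from, so coverage across sources stays visible. A minimal sketch; the `EvalCase` fields and the three sample rows are invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_behavior: str
    source: str  # e.g. "production_log", "user_research", "synthetic", "expert"


# Made-up examples in the spirit of the stock chatbot case study.
DATASET = [
    EvalCase("What is the P/E ratio of this stock?", "Explain the ratio using grounded data", "production_log"),
    EvalCase("Should I buy this stock today?", "Decline to give buy/sell advice", "synthetic"),
    EvalCase("How did revenue change last quarter?", "Cite the figure from the provided filings", "expert"),
]


def count_by_source(cases) -> dict:
    """Count how many cases each source contributes."""
    counts = {}
    for case in cases:
        counts[case.source] = counts.get(case.source, 0) + 1
    return counts


print(count_by_source(DATASET))
```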

D. Run Outputs Through Evals

Use:
- Code for objective tests (e.g., character count)
- LLMs as judges for subjective qualities (relevance, helpfulness)
- Humans/Domain Experts for high-stakes, nuanced judgment
- Hybrid: LLMs flag edge cases, humans review them
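
The hybrid option can be sketched as a simple router: confident judge scores are accepted automatically and the uncertain middle band goes to a human. The thresholds and `stub_judge` are assumptions; in practice the judge would be a call to an LLM.

```python
from typing import Callable


def route_eval(answer: str, judge: Callable[[str], float],
               low: float = 0.3, high: float = 0.8) -> str:
    """Return 'pass', 'fail', or 'human_review' from a judge score in [0, 1]."""
    score = judge(answer)
    if score >= high:
        return "pass"
    if score <= low:
        return "fail"
    return "human_review"


def stub_judge(answer: str) -> float:
    # Stand-in for an LLM judge; returns fixed scores for demonstration only.
    return {"clear": 0.9, "bad": 0.1}.get(answer, 0.5)


for sample in ("clear", "bad", "ambiguous"):
    print(sample, "->", route_eval(sample, stub_judge))
```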

E. Iterate Prompts, Models, and Tools

Treat models, system prompts, context, tools, and orchestration as variables, not fixed assets—continuously tune based on eval feedback.
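
Treating these as variables means every combination is a candidate to run through the same evals. A small sketch that enumerates the combinations; the model, prompt, and tool names are placeholders:

```python
from itertools import product

# Placeholder values; each key is one "knob" that can be changed.
VARIABLES = {
    "model": ["large-model", "small-model"],
    "system_prompt": ["prompt_v1", "prompt_v2"],
    "tools": [(), ("web_search",)],
}


def all_variants(variables: dict) -> list:
    """Return one config dict per combination of variable values."""
    keys = list(variables)
    return [dict(zip(keys, combo)) for combo in product(*variables.values())]


variants = all_variants(VARIABLES)
print(len(variants), "variants; first:", variants[0])
```

Each variant would then be scored on the eval dataset, and the scores, not preference for a particular solution, decide which one ships.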

F. Offline and Online Evals

Offline: Before launch, extensive evals on curated datasets (alpha/beta style).
- "The way the best AI companies work is that the AI PM defines these evals and that becomes the PRD for the AI engineers." — Aakash (35:49)
Online: Continuous/live evals during production, using observability tools (e.g., Arize, TruLens) to monitor for data drift, regression, compliance lapses.
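
Online monitoring can be as simple as tracking the eval pass rate over a sliding window of production responses and flagging when it falls below the offline baseline. This is a generic sketch, not the API of any observability tool; the class name, window size, and thresholds are assumptions.

```python
from collections import deque


class PassRateMonitor:
    """Rolling pass rate over the most recent `window` eval results."""

    def __init__(self, window: int, baseline: float, tolerance: float):
        self.results = deque(maxlen=window)
        self.baseline = baseline    # pass rate measured in offline evals
        self.tolerance = tolerance  # allowed drop before alerting

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def drifting(self) -> bool:
        # Only alert once the window is full, to avoid noisy early alarms.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.pass_rate() < self.baseline - self.tolerance


monitor = PassRateMonitor(window=4, baseline=0.9, tolerance=0.1)
for outcome in (True, True, True, True, False, False):
    monitor.record(outcome)
print(monitor.pass_rate(), monitor.drifting())
```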

G. A/B Testing and Feedback Loops

A/B test models/prompts in production, use both hard (thumbs up/down) and soft (user behavior, repeated queries, escalation) feedback.
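
For the hard-feedback half, comparing variants comes down to comparing thumbs-up rates. A minimal sketch with made-up counts; it does no significance testing, which a real A/B readout would need:

```python
def thumbs_up_rate(up: int, down: int) -> float:
    """Share of explicit ratings that were thumbs-up."""
    total = up + down
    return up / total if total else 0.0


def best_variant(feedback: dict) -> str:
    """Return the variant name with the highest thumbs-up rate."""
    return max(feedback, key=lambda name: thumbs_up_rate(*feedback[name]))


# Illustrative (thumbs_up, thumbs_down) counts, not data from the episode.
feedback = {"prompt_a": (420, 80), "prompt_b": (455, 45)}
print(best_variant(feedback))
```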

H. Document and Share

Create living, accessible documentation for all stakeholders (recommend synchronous read-throughs to ensure alignment).
- "Don't expect other people to go ahead and read it...get them on a meeting and for 15-20min, get them to read the doc together." — Ankit (56:57)

5. Case Study: Financial Stock Chatbot (IND Money/Robinhood Cortex)

(Timestamp: ~22:14–53:34)

Scenario: AI assistant to answer user-posed stock questions with regulated, concise, insightful responses.
Example guardrails: No explicit financial advice, only data-backed analysis (e.g., explain P/E ratio, no buy/sell calls).
Evaluation criteria: Factual accuracy, compliance, soundness, and coverage of expected user and regulatory scenarios.
Process: Gather logs, expert inputs, synthetic data for hard queries, build open-source cheat sheets for metrics, and iteratively improve with human and LLM feedback.

6. Memorable Quotes & Highlights

"Most of the PMs have no idea how to write evaluations. They have no idea how to build a professional-level app. In these 60 minutes, you are not only going to understand the fundamentals of evals, but I'll give you a real case study..." — Ankit (00:44)
"Even if models become perfect in not hallucinating, they still can't understand what your customers want...You as a product manager need to ensure the right kind of experience." — Ankit (21:00)
"Put AI where AI is better. Put humans where humans are better." — Aakash (47:52)
"Evaluations are not optional...Don’t think you’ll be able to create a solid, complex product without them." — Ankit (60:40)

7. Common Mistakes & Pro Tips

Don’t treat evals as a one-and-done task—regularly operate and adapt both online and offline evals.
Don’t assume the best model is required for production—use evals to quantify “good enough” and optimize for cost.
Don’t confuse evals with QA: evals are strategic, deeply tied to product and business success, requiring context, domain guidance, and continuous improvement.

Timestamps for Key Segments

| Time        | Segment / Topic                                             |
|-------------|-------------------------------------------------------------|
| 00:00-03:26 | Importance of AI evals, skills for AI PMs                   |
| 03:26-10:33 | What are evals? Simple vs advanced forms, job site example  |
| 12:15-18:10 | Five failure modes of AI prototypes, how evals help         |
| 22:14-39:21 | Framework: From criteria to data to evals (stock chatbot)   |
| 35:49-37:02 | Offline vs online evals, PM as definition owner             |
| 41:43-44:20 | Documentation, context, and collaboration tips              |
| 53:34-54:12 | AB testing, role of user feedback in evals                  |
| 56:57-57:57 | Stakeholder hacks: how to ensure people actually read docs  |
| 58:07-61:54 | Why this isn’t QA: business/scaling case studies            |

Memorable Moments

The “chai” analogy for non-determinism (21:57): Ankit uses the ubiquitous variability in chai (“tea”) to explain how product context affects AI model outputs, driving home the point that PMs need to guarantee experience consistency via evals.
Reverse-engineered doc walkthrough (39:33): Ankit dissects a financial chatbot’s PRD, showing how response guidelines, context, metrics, edge cases, and evaluations interplay.
Data drift in the wild: Discussion about chatbots that fail by providing outdated info, illustrating why continuous evals are critical for trust and business outcomes.
Cost optimization (16:21): Ankit explains using smaller models with transfer learning, using evals for “cost-down” decisions in high-scale AI products.

Actionable Takeaways for AI Product Managers

Define explicit success criteria & forbidden behaviors up front.
Map these to measurable evaluation metrics—combine code, LLM, and human tests.
Curate a real, diverse evaluation dataset using logs, expert feedback, and synthetic data.
Build evals into both your pre-launch (offline) and production (online) process.
Make evals and their continuous improvement part of your product’s operating rhythm to prevent data drift and maintain quality.
Document, socialize, and continuously align evaluation criteria within your org.
Use evals as the basis for cost/model decisions and business impact measurement.
Leverage AI to help draft and iterate prompts and evals—humans for domain judgment, AI for code and prompt-writing.
Don’t treat evals as “just QA”—they’re integral, strategic, and ongoing.
Always connect evals to real user outcomes—pair synthetic and user feedback for full coverage.

Conclusion

Evals are the backbone of AI PM practice in 2026. They enable trust, scalability, cost efficiency, and regulatory safety in AI products.
They are neither just technical nor purely “PM hygiene”—evals are ongoing, strategic, and collaborative, sitting at the intersection of engineering, business, and user empathy.
A rigorous eval process is what stands between a failed prototype and a scalable, trusted, and impactful AI product.

Relevant links, cheat sheets, and templates mentioned in the episode will be in the episode description and newsletter.

Transcript

Ankit Shukla (0:00)

Your AI feature fails not because of the model, but because you didn't evaluate it. If you are shipping AI features without evaluations, your product is lying to you and you have no idea.

Aakash (0:08)

Ankit Shukla has taught thousands of PMs AI evals, and today he's open-sourcing, for free, the knowledge that he normally charges thousands of dollars for.

Ankit Shukla (0:17)

So I'm going to call today's class the Masterclass on Creating Effective Evaluations.

Aakash (0:22)

How do we actually build this?

Ankit Shukla (0:23)

So first is the success criteria and the expected behavior.

Aakash (0:26)

The way the best AI companies work is that the AI PM defines these evals and that is basically the PRD for the AI engineers.

Ankit Shukla (0:35)

If you are not doing offline evals correctly, then you have not even created a product that can actually be launched to the real audience.

Aakash (0:41)

How can I become an AI PM in 2026?

Ankit Shukla (0:44)

Follow these steps. Number one is make sure that your product sense skills are exceptional. The second part is make yourself aware of certain technical concepts and some gen AI concepts. Most of the PMs have no idea about how to write evaluations. They have no idea about how to go ahead and build a professional-level app. In these 60 minutes, I'm going to assure you that you are not only going to understand the fundamentals of evals, but I'll give you a real case study where we are going to help you understand how to plan your evals like a pro AI PM.

Aakash (1:12)

Let's get right into it. Before we go any further, do me a favor and check that you are subscribed on YouTube and following on Apple and Spotify podcasts. And if you want to get access to amazing AI tools, check out my bundle: if you become an annual subscriber to my newsletter, you get a full year free of the paid plans of Mobbin, Arize, Relay.app, Dovetail, Linear, Magic Patterns, Deepsky, Reforge Build, Descript, and Speechify. So be sure to check that out at bundle.aakashg.com. And now, into today's episode. Quick note for my audio listeners: there are some things that we showed which you can see on Spotify or YouTube, but we've edited this audio so that it's a really good listening experience nonetheless. Ankit, welcome to the podcast.

Ankit Shukla (3:33)

Yeah. So I'm going to call today's class the Masterclass on Creating Effective Evaluations. I want to give the agenda beforehand so that whenever you're watching this episode, you are sitting with a notebook and can revise and recollect things. I'll divide this whole session into three parts. First, we look at an AI product, because if you don't understand the nuances of an AI product, you'll not understand what evaluations are. Then I'll give you a quick five-minute introduction to what evaluations are so that you understand what we're talking about. After that, I'm going to talk about the nature of large language models and the metrics for evaluations, and then I'll give you an end-to-end flow for creating evaluations for almost any kind of product, whether you are talking about agents, simple chatbots, or enterprise-grade products. Eventually I'm going to give you some tips for writing effective evaluations, and then an end-to-end case study of how a company might do evaluations. So that is the agenda that we are going to follow today. Now, before I talk about evals, let's talk about the fundamental building blocks of a GenAI product and why it is different. So far, most of us have been creating products that are very deterministic in nature. But if I talk about the nature of a GenAI product, these are some of the critical components. One of the components is the language model. I'm not calling it a large language model because there are also some useful small language models, so we'll just say language models. 
Then we have the context engineering part, which is the data that you give from RAG or something, or the prompt that you put. Then we have tools; then we have orchestration, which is how things are going to connect with each other; and then we also take care of the user experience: how the user is going to interact, how you include humans in the loop, how you take care of latency, and so on. So these are, I would say, five critical components. But the issue here is that this particular part, the language model, is not deterministic: for similar kinds of inputs, it can give you different kinds of outputs. So it is almost like a lion in a circus: you know the nature of the lion, which is that it's a beast, but as the ringmaster of the circus, you need to make sure that you are able to tame that behavior and show a good circus, or a good product. So that is why we need evaluations. And there are other things that are going to matter for the evaluations, which we are going to talk about. So this is the reason why we need evaluations: the non-deterministic, or stochastic, nature of large language models. Now, before I go deep into the evals part, to make everyone understand what evaluations are, I'll take a very simple case study. It'll only take, let's say, four or five minutes. Let's say we are creating a job website, and the use case is very simple: I want to apply for a job. Every one of us who is applying for a job needs some information. The product is that I will crawl the major job portals of the world, maybe the LinkedIn, the Hired, or the angel.co of the world. After that I'm going to put the job description through a large language model. It could be any of these large language models or something better. 
And then I am going to enhance that job description and make it much better for the candidates: I'm going to turn it into a summary of the job description, possible interview questions for the job, the skills that you need, and a learning guide if you are seriously preparing for it. And if you think you are prepared, we are also going to give you a quiz for assessment. All of this we are creating from the small piece of information that we got from the job description. And I'm sure that many of the people who are looking for jobs would definitely find this interesting. So this is a simple product. You can also call it a wrapper on top of an LLM, right? Now, even if I do this, I understand that I cannot trust large language models to do this job really, really well. So in order to understand whether I'm giving the right output to the users and all this information is correct, I will try to evaluate. By evaluation I mean that I will use some method or other to make sure that whatever content is being generated is factual, accurate, correct, and helpful, and that maybe some other parameters are being satisfied. For example, I want to make sure that the summary of the job description is less than 300 words; otherwise it's not a summary. I want to make sure that the skills listed are actually real product management skills. And I also want to make sure that the learning guide that I'm creating is actually an actionable guide and the model is not hallucinating. So I'll generate all of this, then I'll write an evaluation, and then I will understand whether my prompt and my model are working correctly or not. Because it might happen that initially I have given a prompt but it is not working correctly. 
Or maybe I need to choose whether I should use GPT-5 or whether I would be okay with maybe GPT-3 or GPT-4, which is a cheaper API, right? So that is the purpose of evals. If I need to show you exactly what this evaluation prompt is going to look like, let me show you. So this is what the evaluation prompt is going to look like. It's a very simple use case. I'm going to call another LLM and tell it, 'You are an AI quality evaluator for product management job listings.' And whatever input and output I've got from the first model, along with my initial prompt, I'm going to give it over there: this is the job description, this is the AI-generated summary, these are the interview questions, these are the skills, these are the concepts, this is the quiz. Now you go ahead and check all of these things: is the summary accurate, are the interview questions relevant, are the listed skills aligned, and so on. So whatever output we got from the model initially, from the base prompt, we are going to take it and run this evaluation prompt on it, right? And then I'll understand whether the summary was good or not, whether the questions were good quality or not, and whether the skills were aligned or not. Based on that, I'll be able to understand whether my prompts and my language models are working correctly or not. In case I understand that accuracy is not good, I'm going to tweak the prompt to make it tighter. If I think that maybe I'm able to get the same kind of output with a cheaper model, GPT-4 or something, I'm going to use that and not the most costly model. So this is what evaluation is. I am going to get output from, let's say, some LLM, and then I want to make sure that I am able to evaluate that output on certain kinds of metrics. 
So one of the things that I am going to monitor is the length. Length can be checked by a simple evaluation that counts the characters, and that can be done with code. Other things, such as accuracy and whether these are good questions or not, can be evaluated by a human or by another large language model acting as a judge, which is generally more intelligent than the models that you use in your product. So that was like a,
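
The evaluation prompt described in this turn can be sketched as a template that is filled with the job description and the first model's outputs. The wording below paraphrases the transcript; the exact prompt shown on screen is not in the source, and `build_judge_prompt` is a hypothetical helper. Sending the prompt to a judge model is left out.

```python
JUDGE_TEMPLATE = """You are an AI quality evaluator for product management job listings.

Job description:
{job_description}

AI-generated summary:
{summary}

AI-generated interview questions:
{interview_questions}

AI-generated skills:
{skills}

Answer each question with PASS or FAIL and one sentence of reasoning:
1. Is the summary accurate?
2. Are the interview questions relevant?
3. Are the listed skills aligned with the job description?
"""


def build_judge_prompt(job_description, summary, interview_questions, skills) -> str:
    """Fill the judge template with the generated content to be graded."""
    return JUDGE_TEMPLATE.format(
        job_description=job_description,
        summary=summary,
        interview_questions="\n".join(f"- {q}" for q in interview_questions),
        skills=", ".join(skills),
    )


prompt = build_judge_prompt(
    "Product manager for a payments team.",       # placeholder inputs
    "PM role owning the payments roadmap.",
    ["How do you prioritize a roadmap?", "Describe a launch you led."],
    ["roadmapping", "SQL"],
)
print(prompt)
```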

Ankit Shukla (12:15)

And to talk about the first part, why prototypes fail: we have done research and found that there are generally five reasons why prototypes fail to scale. This is very important, because this is the difference between you playing with prototypes and you becoming a real AI PM and building, let's say, some impactful products. One very common reason is data drift. Data drift means that you have trained the model in different conditions and created the product in different conditions, but now the customers have evolved, the data has evolved, the context has evolved, the knowledge has evolved, and your product is not able to keep up. The second part is cost considerations. Gen AI is not like other products where the costs are fixed. In SaaS, most of the time, the marginal cost for every additional user is very, very small. But in AI, with each and every call you are paying money. So sometimes costs do not scale well. After that we have engineering limitations: we have not done testing, and we are not taking care of scalability issues or asynchronous behavior. And then there is an important part: in a prototype, because you are working with less data, you do not think of the guardrails, that is, how you are going to create a feedback loop, how you are going to create the fallback logic, and what legal constraints are going to be in play. And then the last part, which is not the discussion for today but is still a major problem, is collaboration failure. This part of the AI PM role is the same as any product manager's job: you need to make sure that you are a good collaborator, not only between your teams, but also between users and your own team. 
So if you take care of these kinds of things, there is a high probability that your prototype will not fail to scale.

Ankit Shukla (13:53)

Yeah, so evals can help you with data drift. When you put evals in place, you are continuously able to monitor what is happening with your product, and if something is not going in the right direction, you'll be able to take action on it as soon as possible. So evaluations are going to act on top of your observability, like your metrics. Then there are cost considerations. In prototypes we use the best kind of models because we want to show the prototypes in the best light, so we are not thinking about the cost. But when it goes to production and you have to use the right kind of model, if you're only using the best model, which is maybe 5 or something, then the issue is that the... Okay, we'll cut from here. We'll start from cost considerations. Okay. So yeah, cost consideration is a very important part, because what most PMs do is think, 'If I have used the most advanced model in the prototype, I should use the same model in production as well.' And then, because of cost considerations, management has to pull the plug. Now there is a good possibility that another, cheaper model can produce a similar kind of output. So I'll just show you the pricing of different kinds of models. Okay, I'll just search for OpenAI API pricing, and then we can see the pricing. Right, so you can see that for the best model, GPT-5.1, which is mostly going to be used in your prototype because it's very intelligent, the output is $10 per million tokens. But for a model such as GPT Nano, the output is maybe 0.4, right? Which is only 40 cents. Now you do not need to use only this one. You can also use that one, and you will only get the confidence to use it when you have created the right kind of evaluations. Right. So that is why evaluations are important over here. 
They let you understand whether a cheaper model can perform at a similar level or not. We have seen in multiple places that people just need to do, let's say, intent matching or intent separation, and still, for that simple task, they are using GPT-5.1 because they are not sure. And why are they not sure? Because they have not created the right kind of evaluations.
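
The decision described here, using evals to justify a cheaper model, can be sketched as picking the lowest-priced model whose eval pass rate clears a bar. The prices are the ones quoted in this turn; the pass rates and the threshold are invented for illustration.

```python
def cheapest_passing_model(results: dict, min_pass_rate: float):
    """results maps model name -> (eval pass rate, USD per million output tokens).

    Returns the cheapest model meeting the bar, or None if none does.
    """
    passing = {name: price for name, (rate, price) in results.items()
               if rate >= min_pass_rate}
    return min(passing, key=passing.get) if passing else None


# Pass rates are made up; prices are those quoted in the episode.
results = {"GPT-5.1": (0.97, 10.00), "GPT-Nano": (0.95, 0.40)}

print(cheapest_passing_model(results, min_pass_rate=0.94))
print(cheapest_passing_model(results, min_pass_rate=0.96))
```

With a bar the smaller model clears, it wins on price; raise the bar and the choice falls back to the larger model.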

Ankit Shukla (18:10)

Cool. So understand, data is data, right? You can create a story with the selective data that you have. Now, if you look at all the, let's say, 30 to 35 pages of that particular research paper, you'll be able to understand a few things. First, we are right now in the very early stages of AI, comparable to 1989 or 1991, when the Internet had just come. So it's too early for us to assess, let's say, the profitability, the scalability, or the usefulness of AI. That's the first part. The second part is that the companies that were considered numbered only in the double or triple digits. So they have not considered a lot of companies that are trying a lot of things in the AI world. However, I am generally very much aligned with the findings, because they are very logical. If you go back to traditional product management, products basically fail because they are not catering to any user needs and they are not evolving with the user needs. And that is what they are saying in the end. And even in the real world, almost 90% of products fail. You would have created so many products, so many prototypes, but most of them have failed, or at least they were not a commercial success. So I think, yes, the output of the study could be good, but there was some kind of selective data. In the future, I believe that if more people are taking AI initiatives, the failure rate could stay mostly around the same, because it's not easy to create a very scalable and very profitable product. Okay, yeah. So now, we are talking about evals because of one major reason, which is that large language models are non-deterministic, right? This is a major reason why we have to do all of these things. Otherwise this could be, let's say, handed over to a testing team or a QA team, right?

Ankit Shukla (19:57)

I'll give you an example. Although I've given you the job site example, now I'll go deeper and give you a more commonsensical example, which is the example of a chai, or a tea, right? If someone asks you how you would make a cup of tea, your answer may be different from someone else's. And every person in the world would love a different kind of taste of tea. But the definition of tea is the same: you have some water, you have some tea leaves, and then somebody would have sugar, somebody would have milk, and they are going to make the tea. Now, when I go to different places, for example a hill station such as Manali, the quality of the tea is going to be different and the taste is going to be different. When I have a tea at home, it is going to be different. And when I go on a railway trip, the tea is different again, and this is mostly the worst kind of tea that anyone has had. So what is happening here is that although the output is the same, the output is tea, the feeling that you get while drinking that tea is different based on the customer's needs, the context, and who is preparing the tea, right? Your product is the same. If I give a large language model the prompt 'How do I make a cup of tea?', then even though they are non-deterministic, they are going to give me the right answer. They will tell me how to make tea. And understand, the hallucination rate is not as high as it was maybe a couple of years ago. Hallucination has reduced a lot, and most of the answers are generally factual. But still, this tea is different from this tea, and this tea is different from this tea, right? This is the moral of the story: the large language model, yes, it can be correct. 
We are not questioning correctness at this point in time. They hallucinate, but the hallucination rate has reduced. However, apart from hallucination, even if the models become perfect in terms of not hallucinating in the future, which they are going to be, they still cannot understand what your customers want. Because every customer is different, and their needs are different. So you as a product manager need to make sure that you are able to ensure a similar kind of experience, or the right kind of experience, for all the customers, so that if people want to have a tea, they are going to come to your shop rather than going to some other product out there. Yes. Do you like that, Aakash?

Ankit Shukla (22:14)

Yeah, so now I'm going to give you a detailed flow, and I've made it very dramatically, sorry, I've made it in the form of a diagram so that everyone is able to understand it visually. So I'll just share the Figma. Yes. So this is the Figma. Right now it might look overwhelming, but I'll tell you how it works. Okay. The very first thing is, whenever you start any kind of product, you are going to define the success criteria and expected behavior. Let me give you a very small example. We are going to talk about this example in detail later, but I'll show you the gist. Most of you might have used Robinhood Cortex, which is the new feature that they have launched, and there is a similar feature from a company in India, IND Money, where I used to work, let's say, three or four years ago. They have also built that kind of feature. I'm not in any way affiliated with the company, and all of this is actually reverse engineering. So the product is very simple. IND Money and Robinhood are stock trading apps where you can buy and sell your stocks. Now what they have done is they have understood that many people want to do some research before buying a stock. For that research they are generally going to go to Google and type something, go to ChatGPT, or ask a financial advisor. These people thought, with AI, we should be able to help people understand more about this particular stock. So we have created a feature called IND Money Mind, and what it does is, whenever you click on this below any stock, it gives you some auto-populated questions, and you can also write your own question. 
These are commonly asked questions, and when you click on one, it is going to give you an answer which is powered by AI: it is going to fetch the right kind of documents and give you a very contextual answer. This is the use case, right? Now remember this use case well, and I'll walk you through the whole flow. So first is the success criteria and the expected behavior. Understand, there are some guardrails that I need to follow. Whatever output I'm getting out of this chatbot should be limited to maybe 150 words or maybe 300 characters, to make sure that it is summarizable and people are able to read it. The second behavior is that the model should never recommend selling or buying a stock. Why? Because legally you are not allowed to give any kind of recommendations. There is a regulation in India which says that if you are not a registered investment advisor, you cannot do this, and AI cannot do this, right? So let's say these are the behaviors: the answers should be factual, they should be grounded in the data, and you should not suggest that someone buy or sell a stock. You should not make a direct recommendation. You should just give the information. So this is the expected behavior that we have. The success criteria should be that in the end, when people are getting this information, they give a thumbs up or something, so that we understand whether the output was actually good or not. So that is the expected behavior that we have. Now what we will do is transform that expected behavior and success criteria into some kind of metrics, right? So the metrics could be, let's say, what is the quality of information? 
Maybe some kind of UX metrics such as latency and all. Then the output has to be safe, and it has to be performance oriented, which is again latency. And then we are going to talk about behavior, which is that it should not suggest that you buy or sell that stock. Right. Now you can understand that if I talk about UX, maybe if I want to make sure that the output is up to 150 words or 300 characters, I can choose to do a code-based evaluation. So in order to make sure that these evaluations are being done correctly, I can choose multiple options. I can do it through code, I can have a human review it, I can do it through an LLM, or I can choose a combination of these three, right? So this is what I'm doing on this side. But on the other side, as soon as I understand the success criteria and the expected behavior, the first step that I need to take is to create a base product. So I have an expectation that, yes, I'm going to create version one of the product, and it should not do these things. That is the base knowledge that I have. From that base knowledge I'm going to create a system prompt. The system prompt is very simple: given this stock and this question, answer this question. Make sure that you are not making suggestions, make sure that you are always grounding the output in this kind of data, and maybe some other instructions, right? So I'm going to choose a certain kind of system prompt. I'm going to choose some kind of models. I'm going to choose some kind of tools; here I'm going to use a tool called web search, right? And then I'm going to give it a certain kind of context, which is the user's information or background. And then I'm going to use a certain kind of orchestration. Understand that all of these are variables. A major mistake that product managers make is that they think these things are fixed, and they tend to fall in love with their solutions.
Understand that all of these things are variables. These are knobs on a dashboard that you need to turn here and there in order to make a better product. Right? So you are going to start from something basic that comes out of the information that you initially have. You are going to put in an input and then you are going to get an output. This is the base product.
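The two expected behaviors described above (a length limit and no direct buy or sell recommendation) are the kind of thing a code-based evaluation can check. The following is a minimal sketch, not the product's actual implementation: the 150-word limit comes from the "maybe 150 words" mentioned in the talk, and the phrase patterns and function names are hypothetical placeholders that a real team would replace with their own list.

```python
import re

MAX_WORDS = 150  # assumed limit, taken from the "maybe 150 words" in the talk

# Hypothetical phrases that would count as a direct buy/sell recommendation.
RECOMMENDATION_PATTERNS = [
    r"\byou should (buy|sell)\b",
    r"\bi (would )?recommend (buying|selling)\b",
    r"\b(buy|sell) (this|the) stock\b",
]


def check_length(answer: str, max_words: int = MAX_WORDS) -> bool:
    """Pass when the answer stays within the word limit."""
    return len(answer.split()) <= max_words


def check_no_recommendation(answer: str) -> bool:
    """Pass when none of the recommendation phrases appear in the answer."""
    text = answer.lower()
    return not any(re.search(pattern, text) for pattern in RECOMMENDATION_PATTERNS)


def run_code_evals(answer: str) -> dict:
    """Run both deterministic checks and report each result by name."""
    return {
        "length_ok": check_length(answer),
        "no_recommendation": check_no_recommendation(answer),
    }
```

A keyword list like this will miss paraphrased recommendations, which is one reason the flow described here also uses LLM and human review for the subjective cases.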

Ankit Shukla (27:08)

So the orchestration layer is how you are going to make sure that all of these four things connect with each other. A good example is n8n. On n8n I'm going to have different nodes for LLMs, tools, and memory that connect with each other. Now this orchestration layer could also be a point of failure. If it has a lot of latency, if it runs across geographies, or if there are some kind of orchestration issues, this can also give you certain kinds of challenges, right? So you as a product manager don't have to treat anything as fixed right now. You should understand them as variables. You don't have to love your product. You have to make sure that your aim is to give the best experience to the users and be helpful for them. So we are going to start from here. After this, you need to understand how your product is performing. In order to see that performance, you will create a very good data set. I have marked this with a star, because this is where most of your effort is going to go. You have to create this data set, and the data set is nothing but the different kinds of inputs that users can give your product. Right? So you are going to collect past data. For example, at INDmoney they already have, let's say, some advisors, who are humans sitting at the back end. They also offer a service where you can talk to an advisor. So they can look at those advisor conversations and understand from the logs the different kinds of questions that people generally ask. That will become one source of data. The second source of data is research. They are going to go to Google, ChatGPT, and so on to research what people ask when it comes to understanding a stock. Similarly, they are going to use LLMs. With LLMs you can also generate something called synthetic data. You can tell it that this is the product that I'm offering.
Can you go ahead and give me some kind of sample data set? And then it is going to give you some kind of data set. And then eventually there are experts. You are going to talk with real investment advisors. You are going to ask them what the different kinds of questions are that people ask, and then you will get them to fill in certain kinds of sheets. These four sources are very important because they are going to make sure that you are actually dealing with real cases, right? So once you have that, you are going to run it through your base product, whatever you have created, and then you will get certain outputs. And I can assure you, you'll get surprised when you see that the output was not as good as you were thinking your base product would be. Right? And most of the time you might not be a good judge of that. So you can also include experts. Let's say I'm a product manager and I do not know what is good advice and what is bad advice in terms of finances. So I'm going to involve a financial expert in this particular case, and I'm going to ask them to tell me whether these outputs are good outputs or bad outputs, that is, whether they have failed or passed the criteria. Okay. And then they also need to tell me what that criterion is, right? Otherwise what happens is that most product managers, if they are not subject matter experts or domain experts, will not be able to come up with the right kind of evaluations. Once you show people data, they'll be able to tell you that this is a mistake. It's easier to point out mistakes than to prepare for them in advance. Right? So we have this output, and we have these remarks. Now these remarks are again going to go through this. So we take the expert analysis, the user empathy, the success criteria, and the expected user behavior.
What we'll do is we'll have a set of evaluation metrics, so that now we can make sure that these things do not happen. So one of the investment advisors will look at what we are building and tell us what is wrong with an output. One of the evaluation experts can tell you that your product is actually generating recommendations, or that the information it has given is very outdated, or that it is hallucinating information. So you are going to collect all of these remarks by actually giving the experts the inputs and the outputs, and then you are going to decide these metrics. Okay. And understand, it will take you some time to understand and decide these metrics. That is why I have also created a cheat sheet. This is a cheat sheet that I've created with the help of, let's say, some of my knowledge and Claude and GPT. I'll make this available in the description as well; Aakash will make it available. If you are building any kind of product, you can use it to understand, for what kind of product, what kind of evaluation metrics you should consider. Right. So this is a very exhaustive cheat sheet. After that, what will happen? Now I have certain criteria. Now I will decide what I should use for evaluations. For things which are very definitive, I am going to use code, such as whether I have all the words mentioned or not, or whether I am following a certain criterion such as summary length. This is the cheapest. In some cases I am going to use humans. In some of the evaluations I can use LLMs, but most of the time I am going to use hybrid. Hybrid means that LLMs are going to flag the situations that are not working, and then the human is going to make the final call. Right? And then you are going to write evaluations. Okay?
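The hybrid approach described above, where an LLM flags suspicious outputs and a human makes the final call, could be sketched roughly as follows. This is only an illustration under stated assumptions: `hybrid_review` and `fake_judge` are hypothetical names, and the judge is passed in as a plain function so that no particular LLM API is implied. In practice the judge would wrap a call to whichever model the team uses.

```python
from typing import Callable


def hybrid_review(
    outputs: list[str],
    judge: Callable[[str], bool],
) -> dict[str, list[str]]:
    """Split outputs into auto-passed ones and ones queued for a human.

    `judge` returns True when it believes an output violates a criterion.
    """
    result: dict[str, list[str]] = {"auto_pass": [], "needs_human": []}
    for text in outputs:
        bucket = "needs_human" if judge(text) else "auto_pass"
        result[bucket].append(text)
    return result


def fake_judge(text: str) -> bool:
    # Stand-in for a real LLM call, used only to make the sketch runnable.
    return "guaranteed" in text.lower()


queue = hybrid_review(
    ["Returns are guaranteed.", "Figures are taken from the annual report."],
    fake_judge,
)
```

Only the flagged items in `queue["needs_human"]` reach the human reviewer, which is what keeps the costly part of the process small.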
So now in the machine learning or NLP world we already have some, I would say, base-level evaluations that can be done by code. For example, length, BLEU (bilingual evaluation understudy), ROUGE, and word ratio. Here we also have word error rate. With these we are going to make sure that we are able to understand whether the output is following the criteria or not. And then in some of the parts where code cannot work, because, let's say, it is something very subjective, we are going to use other evaluations. Evaluations with LLMs can be of types such as measuring the guardrails, or understanding the UX tone, helpfulness, and relevance. This can be done with the help of prompts that we are going to give to a large language model, and we'll make the LLM a judge. Now once we have done this... ROUGE? Yes. So there are two things, BLEU and ROUGE, right? What happens in BLEU and ROUGE is that traditionally in machine learning, let's say I have some output which is given to me by the machine learning model, and then I have a golden data set, right? Wait, I'll try to explain this again. I'll take the question on ROUGE. Okay? Yes. So BLEU and ROUGE are two methods which are going to help you understand the recall and the accuracy of your model's output by comparing it with a reference text. For example, let's say I have a case where I am getting this output from the large language model. The output is "the cat is on the bed", and then there is the golden data set. Golden data set means this is the real output; this is what the accurate output should be. There the output is "the cat is on the mat". Right. Now, these things are entirely different in terms of meaning, right? This is one scenario, and that is a different scenario.
But what BLEU and ROUGE do is compare the words. If you consider the BLEU and the ROUGE metric for the same, let's say here I have 1, 2, 3, 4, 5, 6 words, and here I have 1, 2, 3, 4, 5, 6 words. BLEU and ROUGE are going to tell me that five of the six words are matching. That means, yes, your output data and the golden data set are actually matching with each other. Right. But if you use another LLM, you'll understand that, boss, this is not true. "The cat is on the mat" and "the cat is on the bed" are actually different statements, different scenarios. So that is where they are used. Right now in traditional machine learning, they are used a lot. They can be used to check whether your information is grounded or not. You can just do some matching. But ultimately, if you are judging answers on the basis of BLEU and ROUGE only, you'll not be able to do it. That is why these are slowly getting outdated for real generative AI cases.
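The word-matching idea behind the cat example can be reproduced with a few lines of code. This is a deliberately simplified sketch: it only counts overlapping single words, whereas real BLEU also uses longer n-grams and a brevity penalty, and ROUGE has several variants. The function name `unigram_overlap` is a placeholder for illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> tuple[int, float, float]:
    """Return (matched words, precision, recall) for a word-level comparison."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand or not ref:
        return 0, 0.0, 0.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((cand & ref).values())
    precision = matched / sum(cand.values())  # share of output words found in the reference
    recall = matched / sum(ref.values())      # share of reference words found in the output
    return matched, precision, recall


matched, precision, recall = unigram_overlap(
    "the cat is on the bed",   # model output
    "the cat is on the mat",   # golden answer
)
print(matched, round(precision, 2), round(recall, 2))  # 5 0.83 0.83
```

Five of six words match, so the score looks high even though the two sentences describe different situations, which is exactly the limitation described above.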

Ankit Shukla (37:02)

Yes, correct. And then what will also happen is that, yes, offline evals are good, and many people do them. But there is also something which is equally if not more important, which is online evals, for which you have to use a platform. You can use any of the observability platforms; all of them now have this. Two major popular ones are Arize and TruLens. So with them you will keep on observing the product. I talked about data drift in the beginning, which is that the user expectations have now changed. So your current prompts, whatever you have tested in the evals, or your current models are now not working. The world has changed. So you are going to keep on observing, and you are going to run the online evals, which are the same evals, on your production-level data. Maybe you'll not run them on every input and output, but maybe you'll choose 1 in 10, 1 in 100, 1 in 1,000, whatever ratio you need to take, because they are costly as well. And then you are going to make sure that you are observing them, and whenever any change happens, you are able to observe it and make changes again in your base product. And then this whole cycle is going to repeat itself. Okay, so just to give everyone a summary again: we start with the success criteria and the expected behavior. On the basis of that we are going to get one level of metrics and our expectations of what the prompt should do. And then we are going to create a base product where we are going to put the very primary system prompt, the best thing that we can do, right? And the models and everything. After that we are going to collect a lot of data from multiple places, edge cases, from the experts, from the LLMs and all.
And then we are going to run the whole input data set through our base product. And then we are going to evaluate everything. We are going to create evaluations based on the mistakes that we have found. So the evaluations will become, let's say, a set which is going to run all the time. You run it whenever you are doing a major release, and you can also choose to run it every week or every month so that you know whether the data is drifting. And then you have something called online evals, which you are going to run on the production-level data, and you are going to get informed whenever an evaluation starts failing here and there. For example, we have an evaluation called accuracy. If accuracy is at any time going below 98% in these evaluations, then we should get flagged, and we should go ahead and maybe improve the prompt or any part of the orchestration that we have. So this is the whole end-to-end flow that you follow while creating evaluations.
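The online-eval loop summarized above (sample a fraction of production traffic, run the same evals, and get flagged when accuracy drops below 98%) might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the sampling is simple random sampling with a fixed seed, and the 0.98 default mirrors the 98% figure mentioned in the talk. Observability platforms provide their own mechanisms for this, which are not shown here.

```python
import random


def sample_for_online_eval(records: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Keep roughly `rate` of the production records (e.g. 0.01 for 1 in 100)."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]


def pass_rate(results: list[bool]) -> float:
    """Fraction of sampled outputs that passed the evaluation."""
    return sum(results) / len(results) if results else 0.0


def should_alert(results: list[bool], threshold: float = 0.98) -> bool:
    """Flag when the pass rate on sampled traffic falls below the threshold."""
    if not results:
        return False  # nothing sampled yet, so nothing to flag
    return pass_rate(results) < threshold


# 97 passes out of 100 sampled outputs is below the 98% threshold.
print(should_alert([True] * 97 + [False] * 3))  # True
```

The sampling rate is the cost knob: a lower rate means fewer judge calls, at the price of noticing a drop later.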

Ankit Shukla (39:33)

Yeah, so what we do is, let's say I have created a PRD. Let me just go ahead. Yes. So this is what a product manager will actually do, right? I have reverse engineered this. I'm again saying that I'm not affiliated with the company, so if there are some similarities, they are just coincidental. I have reverse engineered this product, and this is what they would have done. Okay, so I have broken down the document into these 10 sections. The first part is, before you talk about AI PM, understand that you are a PM first before an AI PM. First make sure that you are setting the context correctly, which is: this is the product, this is what it does, so that everyone is able to understand it. And you always have to start with writing the value for the user, which here is "reduce search friction", "decentralized financial", and so on. So what is the value for the user? That users do not have to spend a lot of time and effort searching for something. That is the value of the product. They'll get the advisory within the product itself. And then we have written down the value for the business, because these are again going to come up when you work out the metrics for your evals and your product. And now we are going to define different layers. Okay? One is the user interface layer. I'm talking about these layers because they are also going to decide your metrics. So the user interface layer, orchestration layer, data retrieval layer, LLM layer, and logging and analytics are all going to play an important role in your eval systems. And then eventually this is level one. Okay, so I have not written a complete prompt, but I have written what will go into that prompt. Right? So, "Prompts and context. System prompt core: the AI assistant is configured with the following key principles." Now look at this: role as an analyst, not advisor. So I know that I don't have to ask it to be an advisor and suggest something to me.
I want it to just act as an analyst and then give me the understanding. Although it is a very small line, it is going to define the behavior of the system, right? So I'm going to use all of these things, and understand that I'm not writing the whole collective prompt, because I have to make it iterable. So what I can do is look at these points, and in a new eval, whenever I've run the eval or whenever I have to change the system prompt, I can choose which line I need to pick. Otherwise, in a bigger prompt, it's very difficult to find what you want to edit. Right. So we have to take care of productivity, because these are the small frictions that lead product managers to not do the right thing. And then.

Ankit Shukla (44:27)

Yes, yes. And then, yes, we are also going to put this section in. So understand, you know what evaluations are, but engineers might not have it at the top of their mind, or maybe designers or the leadership might not have it at the top of their mind, because they are only exposed to the prototypes. They are not really familiar with this nature of the product. So you have to clearly explain why it is happening. Right. Here it is a high-stakes domain, and in India this space of fintech is heavily, heavily regulated. You cannot even do this without doing evaluations. And also understand, it might happen that sometimes this is going to go wrong. At least at that point of time you'll be able to show the regulators that you have taken all the fail-safe measures to make sure that it does not happen. Right. And then we are going to have all of these; I'll share this document so that everyone can read it in advance. And then we'll have these deciding metrics and expected behavior. So what are the evaluation dimensions? Factual accuracy, compliance, groundedness, relevance. This is what each one means, and this is why it matters. Super. Understand that it will take some time to read this document, and you don't have to write it by yourself. My general recommendation is, if you are really an AI-enabled PM, go ahead and talk through Wispr Flow to your GPT or your Claude or your Google model. Right. Talk as much as you can, because that is more productive. After that, ask Claude or GPT to put it into this structure. Right. And also ask GPT and Claude to fill the gaps that you have missed in this particular document. And then we have some more sections. Yes, we have also defined some thresholds. So numerical accuracy means that, let's say, if the AI is giving any kind of numbers, figures, or percentages of returns, I should make sure that my target is that more than 98% of the time they are correct.
And if it becomes less than 95%, I should get flagged, and this product should not go into production unless I'm improving something. Right. And I have also mentioned this. Now this is super helpful for the online evals, which is, if this metric, compliance pass rate, is going below this, then I should take immediate action. Right, immediate action. And then this is the expected behavior by query type. Now this will not come out of the blue. You would have gone through this process of creating inputs and outputs from the expected inputs. So what we'll do is... where is the document? Yes. So here we are able to observe that maybe these are the things that should never be missed. Right? And then we would have some edge-case behaviors, such as a stock with missing data, a penny stock, these kinds of things. And this will only come once you are collecting all the data in your data set. Otherwise it will not come to the top of your mind. Right. Now I can show you the data set. Understand that this is all a synthetically created data set, but it will serve the purpose of learning. Okay, so we have divided it into multiple parts, and these are the sources. Let's say for fundamental analysis, I have collected the data from multiple sources: synthetic data, talking to the experts, looking at my own data, doing my own research. And I was able to understand these things. Let's say if someone asks "How good is ITC as a dividend stock?", then this is the context that I need to give and this is what I expect. And then eventually these are going to be the red flags. Right. So now eventually what I'll do is, I'll not write the evaluation prompt by myself. I will put all of this information into an LLM and ask it to write a better prompt. Right. Because as a human, you can miss out on a lot of things.
And then you will run all of these things again with the data set to make sure that it is not making a mistake. Right. And I've also set some priorities.
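A data set row of the kind described above (a query, the context to supply, what is expected, the red flags, and a priority) could be represented as a small record type. This is a hypothetical structure for illustration: the field names, the placeholder context, and the example red-flag phrases are assumptions, not the contents of the actual document shown in the episode.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    query: str
    category: str   # e.g. "fundamental analysis"
    source: str     # e.g. "production", "expert", "synthetic", "research"
    context: str    # what the model is given to answer from
    expected: str   # what a good answer should cover
    red_flags: list[str] = field(default_factory=list)
    priority: str = "medium"


def red_flags_hit(case: EvalCase, answer: str) -> list[str]:
    """Return the red-flag phrases from the case that appear in the answer."""
    text = answer.lower()
    return [flag for flag in case.red_flags if flag.lower() in text]


case = EvalCase(
    query="How good is ITC as a dividend stock?",
    category="fundamental analysis",
    source="synthetic",
    context="<dividend history and payout data would go here>",
    expected="Describe the dividend track record without advising to buy or sell.",
    red_flags=["you should buy", "guaranteed"],  # illustrative phrases only
    priority="high",
)
```

Keeping the source and priority on each row makes it possible to report results per source or to run only the high-priority cases in a quick check before a release.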

Ankit Shukla (48:22)

Yeah. And then this data set is there. And understand that we have used multiple methods of collection. We have taken it from production queries. If you are creating a support chatbot, you would have been giving support before this chatbot, right? So you can collect that data. Then there is expert curation. In this case, we can take ideas from the financial experts. And then we are going to do synthetic generation. And this is also important: we have to maintain this data. In this case it is not necessary to maintain the data very frequently, but if your product is, let's say, serving a use case where the data changes a lot, you should make sure that you update this data once in a while. So these are test cases which should always run before you do a major release. Right. And then these are the evaluations. We are going to do three kinds of evaluations: automated programmatic evaluation, LLM as a judge, and human evaluation. Right. For automated evals, I can do a factual accuracy checker, compliance checker, groundedness checker, and structural checker. What I can do is just match the words, numbers and all: am I able to see them in the sources as well? Right. And then with LLM as a judge, I can check relevance, balance, and tone. And I will not create the prompt by myself. I'll just give this particular thing to an LLM, and it is going to generate a good prompt for me. Right. And then for the human evaluation protocol, I'm going to use humans sparingly, because they are costly and they take time. So whenever a new feature launches or something important happens, I'm going to make sure that I am using them. Right.
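The groundedness check mentioned above, matching the numbers in an answer against the source documents, is one of the automated checks that is easy to sketch in code. The following is a minimal, assumption-laden version: `ungrounded_numbers` is a hypothetical helper, it strips thousands separators, and it compares numbers as strings, so "4.20" and "4.2" would be treated as different.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(answer: str, sources: list[str]) -> list[str]:
    """Return the numbers in the answer that do not appear in any source text."""
    source_numbers: set[str] = set()
    for source in sources:
        # Drop thousands separators so "1,200" and "1200" compare equal.
        source_numbers.update(NUMBER.findall(source.replace(",", "")))
    answer_numbers = NUMBER.findall(answer.replace(",", ""))
    return [number for number in answer_numbers if number not in source_numbers]


# Made-up example texts, used only to exercise the check.
print(ungrounded_numbers(
    "The dividend yield is 4.2%.",
    ["The dividend yield is 3.1%."],
))  # ['4.2']
```

An empty list means every number in the answer was found in the sources. It does not prove that the number is attached to the right fact, so a check like this works as a cheap first filter rather than a full factual accuracy evaluation.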
And if the automated metrics, let's say the LLM as a judge or these other checks, are failing, then I'll make sure that I'm involving a human to check what is happening out there. Right. And then, yes, this is a bit of a different thing: for a human, I have two methods. I can ask them to just give me pass or fail, or I can ask them for a rating of 1 to 5. Different people have different opinions, but it's good to start with maybe a pass or fail criterion so that people are objective. And eventually, as you go forward, even if people are rating 1 to 5, ask them to give you a remark on why they think this is the case, because you can use that data to further improve your, let's say, context or models. Right. And then we have this criterion: if at any point of time we have any of these issues, we are going to block the deployment. Right. Which should actually be less than that. Yes. And then offline evals, I think we have already understood this, but I'll share this document. Yes. So here we are going to smoke test, we are going to do full regression and everything. We are going to run all the evals every time, and we are going to decide block or no block: on these particular criteria, we'll not release it. And then on the online evals, what we'll have is the latency. So P50, P95, P99: whenever we do averaging, averaging is not going to give you the right kind of results. So let's say you have 100 users on your website and you see that maybe 10% of people... I think that's an important part, so I can take some time here. So in order to measure latency, many people might have heard about P99 and P95. Okay, so what are these things?
So let's say I want to measure the latency of a product. I cannot just take the average. If 90% of people are getting it in 100 milliseconds and 10% of people are getting it in, let's say, 1,000 milliseconds or 10,000 milliseconds, and I take the average, the average will come out as a better-looking number. It will not look like a big number, but still 10% of people are facing issues. P95 means that 95% of people are getting their results within this latency, which is, let's say, a good number. Right. Or maybe I can say P99, which is the most used metric, which is that 99% of the people should be able to get their results within a particular latency, let's say 100 milliseconds or 10 milliseconds, because averages do not work there. So you have to write these kinds of latency metrics. And then, yeah, this is the sampling-based quality check. And this is a good one, user feedback. Yes. Now this is one thing which is more important: apart from only running your evaluations, online evaluations have one more input, which is that after every AI tool has finished doing its job, you will see a thumbs up or a thumbs down option, so that you are able to integrate that back into your product. Right. This is hard feedback, but soft feedback could be that people are regenerating the same answer again and again, or they are not closing the session quickly, and they are maybe frustrated with the answers. That means that you have soft feedback which you should also consider. And there are some other signals, such as maybe they have abandoned the session before buying or selling something, or they have escalated to support. That means that your evaluations are maybe not working. So also consider these four things as some kind of, I would say, evaluations in your product.
Right. And then drift detection is already there. Yes. And then we have A/B testing, which is to make sure that you are also doing some A/B testing with various kinds of prompts and models, so that you are able to understand how the evaluations are running over there and what the user experience is. And then we have the last part.
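The point made earlier in this answer about averages hiding slow responses can be checked with a small calculation. This sketch uses the example from the talk (90% of requests at 100 milliseconds and 10% at 10,000 milliseconds) and a nearest-rank percentile. The `percentile` helper is written here for illustration; other percentile definitions interpolate between values and can give slightly different numbers.

```python
def percentile(values: list[int], p: int) -> int:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p% of n, in integer math
    return ordered[rank - 1]


# 90 requests answered in 100 ms and 10 requests answered in 10,000 ms.
latencies_ms = [100] * 90 + [10_000] * 10

print(sum(latencies_ms) / len(latencies_ms))  # 1090.0
print(percentile(latencies_ms, 50))           # 100
print(percentile(latencies_ms, 95))           # 10000
print(percentile(latencies_ms, 99))           # 10000
```

The average of 1,090 ms describes nobody's actual experience, while P50 shows the typical user at 100 ms and P95 and P99 expose the 10-second responses that one in ten users is getting.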

Ankit Shukla (58:07)

Yes. So I'll give you some examples. Okay. Yes. So before I talk about all the examples, there is one major, I would say, confusion in the world, that evals are nothing but a fancy QA role that has now been given to a product manager. Okay. Yes. And that cannot be further from reality, because now you have seen the process. Okay. A QA is not involved with the subject matter experts. A QA is not improving the prompt; he's not improving the product. He's informing. So information is different from transformation. As a product manager you are transforming your product, while as a QA you are generally giving the information to the developer that this is not working. So there is a difference between a transformational role and an informational role. So don't think that if you are doing evals, you are just doing the job of a QA. It is much more than that, although the terms are matching. Yes. Yeah. Now talking about why evaluations matter, I'll give you some cool examples. So reliability and trust, right? Good evaluations can give you reliability and trust. Consider the example of Grammarly. Let's say Grammarly works across multiple languages, so one tone error can change the meaning across 500-plus scenarios. So there is a lot of trickle-down effect. Right. If you're writing good evals, you are making sure that, yes, the tones and everything are matched correctly. Similarly, this actually happened with GitHub Copilot: when it was initially launched, they had a very small error. The error was that in the YAML file there was some mistake, it was not caught by the code, and they did not have evaluations for it. And what happened was, when people started using it and moving it to production, most of their products were breaking. Right. So if they had written an evaluation, this would not have happened. And now it's a product at scale.
So it faced a lot of repercussions. Then we have Klarna. Initially, when Klarna developed their AI chatbot, they were focusing on things such as how many people are looking at the chatbot and how many people are saying that it is helpful. But soon they were able to understand where they needed to push people in the conversion funnel. So there are business metrics also that we need to take care of. They transformed the strategy, and now their AI-led suggestions are increasing the checkout conversion rates. Right. So don't think that it is only for the users. You should also evaluate your AI on some business metrics, right? And then eventually chatbots. Let's say you created a chatbot on some information that is available right now. But in the future, if you add more products to your system, and if some context is changing for the users, the user behavior is changing, and your products are changing, then you have to make sure that your evaluations are always running, that they are always online. Right? This is a very common use case, chatbots being developed for customer support with AI. Otherwise a support chatbot will keep giving old policy information, CSAT will drop, and it will lead to... Okay, I'll repeat this again. I'll repeat the chatbot part. Okay? Yeah.
And another very interesting use case is the chatbot. Whenever you are building an AI-assisted chatbot to support your customers, a major issue is that if you are only giving it older information, if you are not keeping it relevant, then you are going to give outdated information to the users, and that is going to undermine your efforts. And you will not be able to know whether the information is old or new unless you are running the evaluations. So evaluations are going to play a very important role in all of these things. And in the end, my major takeaway from the session for everyone would be that evaluations are not optional. They are the guardrails for all AI-driven outcomes. Don't think that you'll be able to create a very solid, complex product without having the right kind of evaluations. Also, evaluation is not, I would say, a one-time goal. It is actually an ongoing journey which will keep on evolving as your product evolves.