Ankit Chukla
Your AI feature fails not because of the model, but because you didn't evaluate it. If you are shipping AI features without evaluations, your product is lying to you and you have no idea.
Akash
Ankit Chukla has taught thousands of PMs AI evals, and today he's open sourcing, for free, the knowledge that he normally charges thousands of dollars for.
Ankit Chukla
So I'm going to call today's class the Masterclass on Creating Effective evaluations.
Akash
How do we actually build this?
Ankit Chukla
So first is the success criteria and the expected behavior.
Akash
The way the best AI companies work is that the AI PM defines these evals, and that is basically the PRD for the AI engineers.
Ankit Chukla
If you are not doing offline evals correctly, then you have not even created a product that can actually be launched to a real audience.
Akash
How can I become an AI PM in 2026?
Ankit Chukla
Follow these steps. Number one, make sure that your product sense skills are exceptional. The second part is to make yourself aware of certain technical concepts and some gen AI concepts. Most PMs have no idea how to write evaluations. They have no idea how to go ahead and build a professional-level app. In these 60 minutes, I assure you that you are not only going to understand the fundamentals of evals, but I'll also give you a real case study that will help you understand how to plan your evals like a pro AI PM.
Akash
Let's get right into it. Before we go any further, do me a favor and check that you are subscribed on YouTube and following on Apple and Spotify podcasts. And if you want to get access to amazing AI tools, check out my bundle, where if you become an annual subscriber to my newsletter, you get a full year free of the paid plans of Mobin, Arise, Relayout, Dovetail, Linear, Magic Patterns, Deep Sky, Reforge Build, Descript and Speechify. So be sure to check that out at bundle.akashg.com. And now into today's episode. Quick note for my audio listeners: there are some things that we showed which you can see on Spotify or YouTube, but we've edited this audio so that it's a really good listening experience nonetheless. Ankit, welcome to the podcast.
Ankit Chukla
Thanks a lot for having me again, Akash. I think we did that podcast about three months ago, and to this day I get a couple of messages almost every day on my LinkedIn appreciating the content of that podcast. So thanks a lot for creating that content with me.
Akash
So Evergreen. Like some months it's my top podcast episode, even though I've released other new podcasts, so it continues to grow. True testament to the value we put there. What are we going to do in today's episode that's different from all the other AI evals content out there that people might have seen?
Ankit Chukla
Yeah. So, to tell you the truth, most of AI product management is actually just product management. But there is this one skill that is new and that most PMs are not aware of, and that skill is writing good evaluations for your system. Now, what is happening in the market is that a lot of content is being posted around evals, but most of it is still introductory or intermediate level. We lack real examples, and most of the examples are hypothetical. So I thought, why don't we create this episode so that anyone who aspires to pursue the AI product management career seriously is able to understand with crystal clarity that this is how they should approach evals. Understand that with evals, one thing is not going to fit everyone, but I'm going to tell you the nuances that you need to take care of. And you'll be walking away today with a framework so that you can write strong evals for almost any kind of product if you follow these practices. So that is my promise for today.
Akash
Well, let's get straight into it and maybe we can start with fundamentally justifying why is it important for PMs to learn AI evals?
Ankit Chukla
Yeah. So I'm going to call today's class the masterclass on creating effective evaluations. I want to give the agenda beforehand so that whenever you're watching this episode, you are sitting with a notebook and can go ahead and revise and recollect things. I'll divide this whole session into three parts. First, we look at an AI product, because if you don't understand the nuances of an AI product, you'll not understand what evaluations are. Then I'll give you a quick five-minute introduction to what evaluations are so that you understand what we're talking about. After that, I'm going to talk about the nature of large language models and the metrics for evaluations, and then I'll give you an end-to-end flow of how to create evaluations for almost any kind of product, whether you are talking about agents, simple chatbots, or enterprise-grade products. Finally, I'm going to give you some tips for writing effective evaluations and then an end-to-end case study of how a company might do evaluations. So that is the agenda that we are going to follow today. Now, before I talk about evals, let's talk about the fundamental building blocks of a GenAI product and why it is different. So far, most of us have been creating products which are very deterministic in nature. But if I talk about the nature of a GenAI product, these are some of the critical components. One of the components is the language model. I'm not calling it a large language model because there are also some useful small language models. So we'll just say language models.
Then we have the context engineering part, which is the data that you give from RAG or something, or the prompt that you put in. Then we have tools. Then we have orchestration, which is how things are going to connect with each other. And then we also take care of the user experience: how the user is going to interact, how you include humans in the loop, how you take care of latency, and so on. So these are, I would say, five critical components. But the issue here is that this particular part, the language model, is not deterministic. For similar kinds of inputs, it can give you different kinds of outputs. So it is almost like a lion in a circus: you know the nature of the lion, which is that it's a beast, but as the ringmaster of the circus, you need to make sure that you are able to tame that behavior and put on a good circus, or a good product. So that is why we need evaluations. And there are other things that also matter for evaluations, which we are going to talk about. So this is the reason why we need evaluations: because of the non-deterministic, or stochastic, nature of large language models. Now, before I go deep into the evals part, to make everyone understand what evaluations are, I'll take a very simple case study. It'll only take, let's say, four or five minutes. So let's say we are creating a first job website, and the use case is very simple: let's say I want to apply for a job. Every one of us who is applying for a job needs some information. So the product is that I will crawl the major job portals of the world, maybe the LinkedIns, the Hireds, or the angel.cos of the world. After that, I'm going to put the job description through a large language model. It could be any of these large language models or something better.
And then I am going to enhance that job description and make it much better for the candidates. I'm going to turn it into a summary of the job description, possible interview questions for the job, the skills that you need, a learning guide if you are seriously preparing for it, and, if you think you are prepared, a quiz for assessment. All of this we are creating from the small piece of information that we have got from the job description. And I'm sure that many people who are looking for jobs would definitely find this interesting. So this is a simple product. You can also call it a wrapper on top of an LLM. Now, even if I do this, I understand that I cannot trust large language models to do this job really, really well. So in order to understand whether I'm giving the right output to the users and whether all this information is correct, I will evaluate. By evaluation I mean that I will use some method to make sure that whatever content is being generated is factual, accurate, correct, and helpful, and that maybe some other parameters are being satisfied. For example, I want to make sure that the summary of the job description is less than 300 words, otherwise it's not a summary. I want to make sure that the skills that are needed are actually real product management skills. And I also want to make sure that the learning guide that I'm creating is actually an actionable guide and the model is not hallucinating. So I'll generate all of this, then I'll write an evaluation, and then I will understand whether my prompt and my model are working correctly or not. Because it might happen that initially I have given a prompt but it is not able to work correctly.
Or maybe I need to choose between whether I should use GPT-5 or whether I would be okay with maybe GPT-3 or GPT-4, which is a cheaper API, right? So that is the purpose of evals. Let me show you exactly what this evaluation prompt is going to look like. It's a very simple use case. I'm going to call another LLM and tell it, "You are an AI quality evaluator for product management job listings." Then whatever input and output I've got from the first model, from my initial prompt, I'm going to give it over there: this is the job description, this is the AI-generated summary, these are the interview questions, these are the skills, these are the concepts, this is the quiz. Now you go ahead and check all of these things: is the summary accurate, are the interview questions relevant, are the listed skills aligned, and so on. So whatever output we have got from the model initially, from the base prompt, we are going to take it and run this evaluation prompt on it, right? And then I'll understand whether the summary was good or not, whether the questions were good quality or not, and whether the skills were aligned or not. Based on that, I'll be able to understand whether my prompts and my language models are working correctly or not. In case I find that the accuracy is not good, I'm going to tweak the prompt to tighten it. If I find that I'm able to get the same kind of output with a cheaper model, such as GPT-4 or something, I'm going to use that and not the most costly model. So this is what evaluation is. I am going to get output from, let's say, some of the LLMs. And then I want to make sure that I am able to evaluate that output on certain kinds of metrics.
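As a rough illustration of the evaluation prompt described above, here is a minimal Python sketch that assembles a judge prompt from the base model's output. The class name `GeneratedListing`, the function `build_judge_prompt`, and the exact wording of the checks are assumptions made for illustration, not the prompt shown in the episode. Sending the prompt to a judge model is left out on purpose, since that depends on the provider you use.

```python
from dataclasses import dataclass


@dataclass
class GeneratedListing:
    """Everything the base prompt produced for one job description."""
    job_description: str
    summary: str
    interview_questions: list[str]
    skills: list[str]
    concepts: list[str]
    quiz: list[str]


# Hypothetical judge prompt; the wording of the checks is an assumption.
JUDGE_TEMPLATE = """You are an AI quality evaluator for product management job listings.

Job description:
{job_description}

AI-generated summary:
{summary}

Interview questions:
{interview_questions}

Skills:
{skills}

Concepts:
{concepts}

Quiz:
{quiz}

For each check, answer PASS or FAIL with one sentence of reasoning:
1. Is the summary accurate to the job description?
2. Are the interview questions relevant to the role?
3. Are the listed skills aligned with the job description?
"""


def _bullets(items: list[str]) -> str:
    """Render a list as one bullet per line."""
    return "\n".join(f"- {item}" for item in items)


def build_judge_prompt(listing: GeneratedListing) -> str:
    """Fill the judge template with the base model's output."""
    return JUDGE_TEMPLATE.format(
        job_description=listing.job_description,
        summary=listing.summary,
        interview_questions=_bullets(listing.interview_questions),
        skills=_bullets(listing.skills),
        concepts=_bullets(listing.concepts),
        quiz=_bullets(listing.quiz),
    )
```

The resulting string would be sent to a second, usually stronger, model, and its PASS or FAIL answers become the evaluation result for that listing.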
So one of the things that I am going to monitor is the length. Length can be checked with a simple evaluation that counts the words or characters, which can be done with code. Other things, such as accuracy or whether these are good questions or not, can be evaluated by maybe a human, or by another large language model acting as a judge, which is generally more intelligent than the models that you use in your product. So that was like a...
Akash
Right, the eval doesn't always have to be an LLM judge. We've shown a prompt, but for the 300 words we can use simple code. So an eval can be anything, right? If we were explaining this really simply, it could be any sort of test. It could be a unit test in the old language, it could just be a counting of words here, or in the most advanced form, as we've shown, it can be an LLM judge, which is kind of replicating some of the human intuition that we encoded into that prompt we saw.
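The code-based end of that spectrum can be sketched in a few lines of Python. This is a minimal sketch, assuming words are separated by whitespace; the function names are hypothetical, and the 300-word limit is the one mentioned for the job summary.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def summary_within_limit(summary: str, max_words: int = 300) -> bool:
    """Code-based eval: the summary must be shorter than max_words."""
    return word_count(summary) < max_words
```

A check like this is cheap and deterministic, so it can run on every output, while the LLM judge or a human reviewer is reserved for qualities such as accuracy and relevance.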
Ankit Chukla
Yes. Now, we are going to address two challenges today which are coming up in the AI product world. The one major challenge is that prototypes fail to scale. And this has been verified in a lot of research. One very popular piece of research that is being quoted everywhere is by the NANDA committee from MIT, which says that 95% of all AI initiatives fail, and the major reason for that is the learning gap. When I got into the paper, I was able to understand what they mean by the learning gap. It basically means, first, that the people who are building AI initiatives are not solving the right kind of problems for the companies: they are not aware of the workflows, and they are just trying to implement some or the other kind of gen AI solution. And the second part of the learning gap is that these systems do not evolve: if I'm giving feedback to the AI, "please don't do this next time," they still go ahead and repeat the mistakes.
Akash
AI evals are not the only important new skill for product managers. The second most important skill is AI prototyping. And the problem with most of the AI prototyping tools out there on the market is that they're trying to be low-code, no-code tool alternatives. They're trying to allow you to create an entire startup. Today's episode is brought to you by Reforge Build, which takes a completely different approach to AI prototyping. They are putting the product manager at the center, and they have architected the product that way. So whether it's quickly picking up the design system or generating divergent alternatives, you can do it all with Reforge Build, and you can get a full year free of the paid version if you use Akash's bundle. So whether you sign up for Akash's bundle or not, go check the link in the description and try out Reforge Build. And now back to today's episode with Ankit.
Ankit Chukla
And to talk about the first part, why prototypes fail: we have done research, and we found that there are generally five reasons why prototypes fail to scale. And this is very important, because this is the difference between you playing with prototypes and you becoming a real AI PM and building impactful products. One very common reason is data drift. Data drift means that you have trained the model in different conditions, you have created the product in different conditions, but now the customers have evolved, the data has evolved, the context has evolved, and the knowledge has evolved, and your product is not able to keep up. The second part is cost considerations. Gen AI is not like other products where the costs are fixed. In SaaS, most of the time, the marginal cost for every additional user is very, very small. But in AI, with each and every call you are paying money. So sometimes what happens is that costs do not scale well. After that we have engineering limitations, which is that we have not done testing, or we are not taking care of scalability issues and asynchronous behavior. And then this is an important part: in a prototype, because you are working with less data, you do not think of the guardrails, which is how you are going to create a feedback loop, how you are going to create the fallback logic, and what legal and ethical lines you have to stay within. And then the last part, which is not the discussion for today but is still a major problem, is collaboration failure. This part of the AI PM's job is the same as a product manager's job, which is that you need to make sure that you are a good collaborator. You need to be a good collaborator not only between your teams, but also between users and your own team.
So yes, if you take care of these kinds of things, there is a high probability that your prototype will not fail to scale.
Akash
Right.
Ankit Chukla
So yeah, you wanted to say something?
Akash
So do evals help address all five of these or where do evals really help you?
Ankit Chukla
Yeah, so evals can help you with data drift. When you are putting evals in place, you are continuously able to monitor what is happening with your product. And if something is not going in the right direction, you'll be able to take action on it as soon as possible. So evaluations are going to act on top of your observability, like your metrics. Then, cost considerations. Cost consideration is a very important part, because what most PMs do is think, "If I have used maybe the most advanced model in the prototype, I should use the same model in production as well." And then, because of cost considerations, the management has to go ahead and pull the plug. Now, there is a good possibility that another, cheaper model can produce a similar kind of output. So I'll just show you the pricing of different kinds of models. Okay, so I'll just search for OpenAI API pricing, and then we can see the pricing. Right, so you can see that for the best model, GPT-5.1, which is mostly going to be used in your prototype because it's very intelligent, the output is $10 per million tokens. But for a model such as GPT Nano, the output is maybe 0.4, right? Which is only 40 cents. Now, you do not need to use only the top model. You can also use the cheaper one, and you will only get the confidence to use it when you have created the right kind of evaluations. Right. So that is why evaluations are important over here.
They can let you understand whether a cheaper model can also perform at a similar level or not. And we have seen this in multiple places: people just need to do, let's say, intent matching or intent separation, and for that simple task they are still using GPT-5.1, because they are not sure. And why are they not sure? Because they have not created the right kind of evaluations.
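One way to turn that idea into a decision rule is sketched below in Python: run the same evals against each candidate model, then pick the cheapest one whose pass rate stays close to the baseline. The model names, the pass/fail results, and the 2-point tolerance are made-up assumptions for illustration; only the $10 and $0.40 output prices echo the figures quoted above.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return sum(results) / len(results)


def cheapest_acceptable_model(
    eval_results: dict[str, list[bool]],
    price_per_million: dict[str, float],
    baseline: str,
    tolerance: float = 0.02,
) -> str:
    """Pick the cheapest model whose pass rate is within `tolerance`
    of the baseline model's pass rate on the same eval set."""
    baseline_rate = pass_rate(eval_results[baseline])
    acceptable = [
        model
        for model, results in eval_results.items()
        if pass_rate(results) >= baseline_rate - tolerance
    ]
    return min(acceptable, key=lambda model: price_per_million[model])


# Hypothetical eval outcomes on the same 20 test cases (made-up data).
eval_results = {
    "large-model": [True] * 19 + [False],      # 95% pass
    "small-model": [True] * 19 + [False],      # 95% pass
    "tiny-model": [True] * 14 + [False] * 6,   # 70% pass
}
# Output price in USD per million tokens; the tiny-model price is invented.
prices = {"large-model": 10.00, "small-model": 0.40, "tiny-model": 0.10}

choice = cheapest_acceptable_model(eval_results, prices, baseline="large-model")
```

With this made-up data the rule selects `small-model`: it matches the baseline pass rate at a fraction of the price, while `tiny-model` is cheaper still but falls below the quality bar.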
Akash
And there we saw a 25x price difference. That's really not crazy, because there are also open source models out there that are priced similar to that 40 cents per million tokens. So you might say, "Oh, GPT Nano, it's so much worse than GPT-5.1." But actually there are some competitors to GPT-5.1, whether that's DeepSeek's latest model or Kimi K2, that are pretty good and also pretty cheap.
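The 25x figure follows directly from the two output prices quoted in the conversation ($10 versus $0.40 per million output tokens). A small Python sketch of the arithmetic, where the monthly volume of 50 million output tokens is a hypothetical number chosen only to make the difference concrete:

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of generating `output_tokens` at a given per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd


LARGE_MODEL_PRICE = 10.00  # USD per million output tokens, as quoted above
SMALL_MODEL_PRICE = 0.40   # USD per million output tokens, as quoted above

price_ratio = LARGE_MODEL_PRICE / SMALL_MODEL_PRICE

# Hypothetical monthly volume, not a figure from the episode.
monthly_output_tokens = 50_000_000
large_monthly = output_cost_usd(monthly_output_tokens, LARGE_MODEL_PRICE)
small_monthly = output_cost_usd(monthly_output_tokens, SMALL_MODEL_PRICE)
```

At that assumed volume the same workload costs about $500 a month on the larger model and about $20 on the smaller one, which is why it matters to know whether the smaller model passes the same evals.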
Ankit Chukla
Yeah, and there is also a concept called transfer learning, where you take a small language model like Gemma from Google. Because your use case is very specific, such as customer support for a particular company with a limited set of products, you don't need to use GPT-5.1. You can take a small language model, create synthetic data, do transfer learning and fine-tuning, and then you'll have a model that performs at a similar level and is dedicated to all the use cases that you have.
Akash
So as long as you have a really specialized use case, you can take these smaller models, you can do transfer learning and you can make the cost much cheaper. But how do you know that you're doing it at the same quality? Evals.
Ankit Chukla
Evals. Yes, correct. And then for engineering limitations, I think this is going to be wider in scope. Yes, some of this can be solved with evals, but not all of it. And then guardrails. This is a major part where evals actually shine, which is that you are going to define what the model should do, what the product should do, what it should not do, when it should code, when it should ask for help, and what different kinds of output it should produce. And this is where evals are actually going to shine, which you are going to see in just a moment. And then collaboration failure. Although this is not directly related to evals, when you are writing good evals you are always going to involve the subject matter experts, and that will give you a different kind of empathy with the user. So I think this failure is also indirectly a result of not caring enough to create the right kind of evals.
Akash
100% makes sense.
Ankit Chukla
Yes. Cool.
Akash
Actually, one more question, since we saw that MIT slide: is that BS? It seems overstated. 95% of AI initiatives fail. It feels like maybe that's true in a bank or somewhere where they're really tech backwards, but in tech startups it feels like it's probably the opposite. Like 95% of AI initiatives succeed.
Ankit Chukla
Cool. So understand, data is data, right? You can create a story with the selective data that you have. Now, if you look at all the, let's say, 30 to 35 pages of that particular research paper, you'll be able to understand a few things. First, we are right now in the very early stages of AI, which is comparable to around 1989 to 1991, when the Internet had just come. So it's too early for us to assess the profitability, the scalability, or the usefulness of AI. That's the first part. The second part is that the number of companies considered was only in the double or triple digits. So they have not considered a lot of companies which are trying a lot of things in the AI world. However, I am generally very much aligned with the findings of the study, because they are very logical. If you go back to traditional product management, products basically fail because they are not catering to any user needs and they are not evolving with the user needs. And that is what they are saying in the end. And if you look at it this way, even in the real world almost 90% of products fail. You would have created so many products, so many prototypes, but most of them have failed. At least, they were not commercial successes. So I think yes, the output of the study could be good, but yes, there was some kind of selective data. In the future, I believe that even if more people are taking AI initiatives, the failure rate could stay mostly around the same, because it's not easy to create a very scalable and very profitable product. Okay, yeah. So now, we are talking about evals because of one major reason, which is that large language models are non-deterministic. Right. And this is a major reason why we have to do all of these things. Otherwise this could be handed over to maybe a testing team or a QA team. Right?
Akash
Yeah.
Ankit Chukla
I'll give you an example. Although I've given you the job site example, now I'll go deeper and give you a more commonsensical example, which is the example of a chai, or a tea. Right? If someone asks you how you would make a cup of tea, then maybe your answer would be different from someone else's. And every person in the world would love a different kind of taste in their tea. But the definition of tea is the same: you have some water, you have some tea leaves, somebody would add sugar, somebody would add milk, and then they are going to make the tea. Now, when I go to different places, for example when I go to a hill station such as Manali, the quality of the tea is going to be different and the taste is going to be different. When I have tea at home, it is going to be different. And when I go on a railway trip, the tea is different again, and this is mostly the worst kind of tea that anyone has had. So what is happening here is that although the output is the same, the output is tea, based on the customer needs, based on the context, and based on who is preparing the tea, the feeling that you get while drinking that tea is different. Right. Your product is the same. If I give a large language model the prompt, "How do I make a cup of tea?", it is going to give me a right answer. It will tell me how to make tea. And understand, the hallucination rate is not as high as it was maybe a couple of years ago. Hallucination has reduced a lot, and most of the answers are generally factual. But because these models are non-deterministic, this tea is still different from this tea, and this tea is different from this tea. Right. This is the moral of the story: the large language model, yes, it can be correct.
We are not questioning correctness at this point in time. They hallucinate, but the hallucination rate has reduced. However, apart from hallucination, even if the models became perfect in terms of not hallucinating in the future, which they are going to be, they still cannot understand what your customers want. Because every customer is different, and their needs are different. So you as a product manager need to make sure that you are able to ensure a similar kind of experience, or the right kind of experience, for all the customers. So that if people want to have a tea, they are going to come to your shop rather than going to some other product out there. Yes. Do you like tea, Akash?
Akash
Yes, of course. I drink chai every day.
Ankit Chukla
Yes.
Akash
All right. So I think we have the intuition down now. How do we actually build this?
Ankit Chukla
Yeah, so now I'm going to give you a detailed flow, and I've made it in the form of a diagram so that everyone is able to understand it visually. So I'll just go ahead and share the Figma. Yes. So this is the Figma. Right now it might look overwhelming, but I'll tell you how it works. Okay. So the very first thing is, whenever you start any kind of product, you are going to define the success criteria and the expected behavior. Let me give you a very small example. We are going to talk about this example in detail later, but I'll show you the gist. Most of you might have used Robinhood Cortex, which is the new feature that they have launched, and a company in India, IND Money, where I used to work three or four years ago, has also built a similar kind of feature. I'm not in any way affiliated with the company, and all of this is actually reverse engineering. So the product is very simple. IND Money and Robinhood are stock trading apps where you can buy and sell your stocks. Now, what they have understood is that before buying a stock, many people want to do some research. And for that research they are generally going to go to Google and type something, they are going to go to ChatGPT, or they are going to ask a financial advisor. These people thought, "With AI, we should be able to help people understand more about this particular stock." So they have created a feature called IND Money Mind, and what it does is, whenever you click on it below any stock, it gives you some auto-populated questions, and you can also write your own question.
These are commonly asked questions, and when you click on one, it gives you an answer which is powered by AI: it fetches the right kind of documents and gives you a very contextual answer. This is the use case, right? Now remember this use case well, and I'll walk you through the whole flow. So first is the success criteria and the expected behavior. Understand, there are some guardrails that I need to follow. Whatever output I'm getting out of this chatbot should be limited to maybe 150 words or maybe 300 characters, to make sure that it is summarizable and people are able to go ahead and read it. The second behavior is that the model should never recommend selling or buying a stock. Why? Because legally you are not allowed to give any kind of recommendations. There is a regulation in India which says that if you are not a registered investment advisor, you cannot do this, and so the AI cannot do this either, right? So let's say these are the behaviors: the answers should be factual, they should be grounded in the data, and you should not suggest that someone buy or sell a stock. You should not make a direct recommendation. You should just give the information. So this is the expected behavior that we have. The success criteria should be that in the end, when people are getting this information, they give a thumbs up or something, so that we understand whether the output was actually good or not. So that is the success criteria and expected behavior that we have. Now what we will do is transform that expected behavior and success criteria into some kind of metrics, right? So the metrics could be, let's say, what is the quality of information?
Maybe some kind of UX metrics, such as latency and all. Then the output has to be safe, and it has to be performance oriented, which is again latency. And then we are going to talk about behavior, which is that it should not suggest that you buy or sell that stock. Right. Now, you can understand that if I talk about UX, and I want to make sure that the output is up to maybe 150 or 300 characters, I can choose to do a code-based evaluation. In order to make sure that these evaluations are being done correctly, I can choose from multiple options. I can do it through code, I can have a human review it, I can do it through an LLM, or I can choose a combination of these three, right? So this is what I'm doing on this side. But on the other side, as soon as I understand the success criteria and the expected behavior, the first step that I need to take is to create a base product. I have an expectation that yes, I'm going to create version one of the product, and it should not do these things. So that is the base knowledge that I have. From that base knowledge I'm going to create a system prompt. The system prompt is very simple: given this stock and this question, answer this question, make sure that you are not suggesting a trade, make sure that you are always getting the answer from this kind of data, and maybe some other instructions, right? So I'm going to choose a certain kind of system prompt. I'm going to choose some kind of model. I'm going to choose some kind of tools; here I'm going to use a tool called web search, right? And then I'm going to give it a certain kind of context, which is the user information or background. And then I'm going to use a certain kind of orchestration. Understand, all of these are variables. A major mistake that product managers make is they think that these things are now fixed, and they tend to love their solutions.
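The expected behaviors described above (a length limit and no buy or sell recommendations) can be partly checked with code. The Python sketch below is illustrative only: the 150-word limit is one of the figures mentioned, while the regular expression is a crude, assumed heuristic that will miss indirect advice and may flag harmless sentences, which is exactly why it would be combined with an LLM judge or human review.

```python
import re

# Crude heuristic (an assumption, not a compliance rule): an advice verb
# followed by "buy" or "sell" within the same sentence.
RECOMMENDATION_PATTERN = re.compile(
    r"\b(?:should|recommend\w*|suggest\w*|advis\w*)\b[^.?!]*\b(?:buy|sell)(?:ing)?\b",
    re.IGNORECASE,
)


def check_answer(answer: str, max_words: int = 150) -> dict[str, bool]:
    """Run the code-based checks on one chatbot answer."""
    return {
        "within_word_limit": len(answer.split()) <= max_words,
        "no_trade_recommendation": RECOMMENDATION_PATTERN.search(answer) is None,
    }


def passes_all(answer: str, max_words: int = 150) -> bool:
    """True only if every code-based check passes."""
    return all(check_answer(answer, max_words).values())
```

Each key in the returned dictionary maps to one metric, so failures can be tracked per behavior rather than as a single pass or fail number.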
Understand, all of these things are variables. These are knobs on a dashboard that you need to turn here and there in order to make a better product. Right? So you are going to start from something basic that comes out of the information you initially have. You are going to put in an input and then you are going to get an output. This is the base product.
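A minimal sketch of the "knobs" idea, assuming the base product is described by a small configuration object. The class name `ProductConfig`, the field names, and the model identifiers are hypothetical placeholders, not anything shown in the episode; the point is only that each variable can be changed on its own between eval runs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProductConfig:
    """One version of the base product. Every field is a variable, not a fixed choice."""
    system_prompt: str
    model: str
    tools: tuple[str, ...] = ()
    context_fields: tuple[str, ...] = ()
    orchestration: str = "single_call"  # hypothetical label for how the pieces connect


# Version one of the product, built from the initial expected behavior.
base = ProductConfig(
    system_prompt=(
        "Given this stock and this question, answer the question. "
        "Do not recommend buying or selling."
    ),
    model="model-a",  # placeholder name, not a real model identifier
    tools=("web_search",),
    context_fields=("user_background",),
)

# Turn exactly one knob and keep everything else the same,
# so a change in eval results can be attributed to that knob.
variant = replace(base, model="model-b")
```

Keeping the configuration immutable and deriving variants with `replace` is one way to make sure two eval runs differ in a single variable.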
Akash
For example, what would be an orchestration layer there?
Ankit Chukla
So the orchestration layer is how you make sure that all of these four things connect with each other. A good example is n8n. On n8n I'm going to have different nodes for LLMs, tools, and memory that connect with each other. Now this orchestration layer could also be a point of failure. If it has a lot of latency, if it runs across geographies, or if there are other orchestration issues, this can also give you challenges, right? So you as a product manager don't have to fix anything right now. You should understand them as variables. You don't have to love your product. Your aim is to give the best experience to the users and be helpful for them. So now we are going to start from here. After this, you need to understand how your product is performing. In order to see that performance, you will create a very good data set. I have marked this with a star, because this is where most of your effort is going to go. You have to create this data set, and the data set is nothing but the different kinds of inputs that users can give your product. Right? So first you are going to collect the past data. For example, at INDmoney they already have advisors, humans sitting at the back end, because they also offer a service where you can talk to an advisor. From those logs they can understand the different kinds of questions that people generally ask. That becomes one source of data. The second source of data is research. They are going to go to Google, ChatGPT, and so on to research what people ask when they want to understand a stock. Similarly, they are going to use LLMs. With LLMs you can also generate something called synthetic data. You can tell it that this is a product that I'm offering.
Can you give me a sample data set? And it is going to give you some kind of data set. And then eventually there are experts. You are going to talk with real investment advisors, ask them what different kinds of questions people ask, and get them to fill in some sheets. These four sources are very important because they make sure that you are actually dealing with real cases, right? Once you have that, you are going to run it through your base product, whatever you have created, and you will get some output. And I can assure you that you'll be surprised when you see that this output was not as good as you thought your base product would be. Right? And most of the time you might not be a good judge of that, so you can also include experts. Let's say I'm a product manager and I do not know what is good advice and what is bad advice in terms of finances. So I'm going to involve a financial expert in this particular case and ask them to tell me whether these outputs are good or bad, which is whether they have passed or failed the criteria. Okay. And they also need to tell me what that criteria is, right? Otherwise, if product managers are not subject matter experts or domain experts, they will not be able to come up with the right kind of evaluations. Once you show people data, they'll be able to tell you that this is a mistake. It's easier to point out a mistake than to prepare for it in advance. Right? So we have this output and we have these remarks. Now these remarks are again going to feed into this. So from the expert analysis, from the user empathy, from the success criteria, from the expected user behavior.
What we'll do is we'll have a set of evaluation metrics, the things we should make sure do not happen. For example, one of the investment advisors reviewing the outputs can tell you that your product is actually generating recommendations, or that the information it has given is very outdated, or that it is hallucinating information. So you are going to collect all of these remarks by actually giving the experts the inputs and the outputs, and then you are going to decide these metrics. Okay. And understand, it will take you some time to understand and decide these metrics. That is why I have also created a cheat sheet, with the help of some of my own knowledge plus Claude and GPT. I'll make this available in the description as well; Akash will make it available. There you can see, for each kind of product, what kind of evaluation metrics you should consider. Right. So this is a very exhaustive cheat sheet. What happens after that? Now I have certain criteria, and I will decide what I should use for the evaluations. For things which are very definitive, such as whether all the required words are mentioned or whether I am following a criterion like summary length, I am going to use code. This is the cheapest. For some I am going to use humans. For some of the evaluations I can use LLMs, but most of the time I am going to use a hybrid. Hybrid means that LLMs flag the situations that are not working, and then a human makes the final call. Right? And then you are going to write the evaluations. Okay?
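The definitive checks mentioned above (summary length, required or forbidden wording) can be sketched in a few lines of code. This is a rough illustration, not the checker used in the episode: the 150-word limit comes from the guardrail described earlier, while the function names and the regular expression are assumptions made for the example. A keyword pattern like this only catches blunt phrasing, which is why subtler cases would still go to an LLM judge or a human.

```python
import re

MAX_WORDS = 150  # length guardrail from the expected behavior

# Hypothetical pattern for blunt buy/sell advice; real phrasing varies far more.
RECOMMENDATION_PATTERN = re.compile(
    r"\b(you should|we recommend|i recommend|i suggest)\s+(buy|buying|sell|selling)\b",
    re.IGNORECASE,
)


def check_length(answer: str, max_words: int = MAX_WORDS) -> bool:
    """Pass if the answer stays within the word limit."""
    return len(answer.split()) <= max_words


def check_no_direct_recommendation(answer: str) -> bool:
    """Pass if the answer contains no blunt buy/sell recommendation."""
    return RECOMMENDATION_PATTERN.search(answer) is None


def run_code_checks(answer: str) -> dict[str, bool]:
    """Run every code-based check and report pass/fail per check."""
    return {
        "length": check_length(answer),
        "no_recommendation": check_no_direct_recommendation(answer),
    }
```

For example, `run_code_checks("You should buy ITC now.")` fails the `no_recommendation` check while still passing the length check.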
So now in the machine learning or NLP world we already have some base-level evaluations that can be done by code, for example length, BLEU (bilingual evaluation understudy), ROUGE, and word error rate. Here we are going to check whether the output is following the criteria or not. And in the parts where code cannot work, because it is something very subjective, we are going to use other evaluations. Evaluations with LLMs can be of types such as measuring the guardrails, or understanding the UX tone, helpfulness, and relevance. This can be done with the help of prompts that we give to a large language model, and we'll make the LLM a judge. Now, on the question about ROUGE. Yes. So there are two things, BLEU and ROUGE, right? BLEU and ROUGE are two traditional machine learning methods which help you understand the recall and the precision of your model's output against a golden data set. For example, let's say I am getting this output from the large language model: the cat is on the bed. And then there is the golden data set. Golden data set means this is the real output, what the accurate output should be. There the output is: the cat is on the mat. Right. Now, these things are entirely different in terms of meaning, right? This is one scenario and that is a different scenario.
But what BLEU and ROUGE do is compare the words. If you compute the BLEU and ROUGE metrics for this pair, here I have 1, 2, 3, 4, 5, 6 words, and there I have 1, 2, 3, 4, 5, 6 words. BLEU and ROUGE are going to tell me that five of the six words match. That means, yes, your output and the golden data set are actually matching each other. Right. But if you use another LLM as the judge, you'll understand that, boss, this is not true. The cat on the mat and the cat on the bed are actually different statements, different scenarios. So that is where they are used. In traditional machine learning they are used a lot. They can be used to check whether your information is grounded or not; you can just do some matching. But ultimately, if you are judging answers on the basis of BLEU and ROUGE only, you'll not be able to do it. That is why these are slowly becoming outdated for real generic cases.
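The cat example can be checked with a small word-overlap function. This is a simplified sketch, not a full BLEU or ROUGE implementation: it computes only clipped unigram precision (the building block of BLEU) and unigram recall (the idea behind ROUGE-1), and leaves out higher-order n-grams and BLEU's brevity penalty. The function name is an assumption for the example.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> tuple[int, float, float]:
    """Return (matching words, precision, recall) using clipped unigram counts."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0, 0.0, 0.0
    # Counter intersection keeps the minimum count of each shared word.
    matches = sum((Counter(cand) & Counter(ref)).values())
    precision = matches / len(cand)  # share of output words found in the reference
    recall = matches / len(ref)      # share of reference words found in the output
    return matches, precision, recall


matches, precision, recall = unigram_overlap(
    "the cat is on the bed",  # model output
    "the cat is on the mat",  # golden answer
)
```

Here five of the six words match, so both scores come out at 5/6 even though the two sentences describe different situations. That gap between word overlap and meaning is the limitation described above.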
Akash
Yeah, so these are the generic ones. And I've also heard this as ROUGE. So that's R-O-U-G-E, which is the Recall-Oriented Understudy for Gisting Evaluation. And I think you mentioned these come from traditional ML and NLP. For people who don't know, that's machine learning and natural language processing. So these are the initial ones, but it's these evaluation prompts that you have below the functions. That's where a lot of the meat and the success is going to be driven from, right?
Ankit Chukla
Yes, correct. But it does not mean that you will not use these. Because when you can use a needle, why would you use a sword? We should not repeat the mistake of using only the costly methods because we don't want to engineer things in the best possible way. These costs are going to compound in the future. Yeah. So now, once we have set the guardrails and created the evals, we will run these evaluations. There are two methods of running evaluations. One is the offline evals, which we run on the product before we launch it or before we make a major release, to make sure that whatever changes we have made are actually correct. Right. This is like the alpha and beta testing that we already do. Right. So you are going to define the system prompts, the model selection, and all the other parameters that I have mentioned here.
Akash
And I think that this is a really important point to double click on for people. The way the best AI companies work is that the AI PM defines these evals, and that is basically the PRD for the AI engineers. Then the engineers say, okay, here's how we're performing on the evals. We're at 36%, 42%, 56%, 80%, and 10%. What they're going to try to do is take all those low ones, the 10%, 36%, 42%, and get them to like 80% or 90%, and then you'll actually go ahead and ship something. So there's this hill-climbing mechanism, and that's why evals are so important. And when you keep hearing this term offline evals, don't think offline equals unimportant. Offline is actually the critical pre-development process: evals that are your PRD. Is that right?
Ankit Chukla
Yes, correct. So what happens is, if you're not doing offline evals correctly, then I would say you have not even created a product that can actually be launched to the real audience. Right. I'm not saying that offline evals should always be perfect, but at least whatever you know, you should try to implement before you launch the product.
Akash
Yeah, traditionally we define the edge cases, the corner cases, in the PRD. We're defining those in the form of evals now.
Ankit Chukla
Yes, correct. And then, yes, offline evals are good, and many people do them. But there is also something which is equally if not more important, which is online evals. For that you have to use a platform. You can use any of the observability platforms; all of them now have this. Two major popular ones are Arize and TruLens. With them you keep on observing the product. I talked about data drift in the beginning, which is that the user expectations have now risen, so whatever you tested in the evals, your current prompts or your current models, is now not working. The world has changed. So you are going to keep on observing, and you are going to run the online evals, which are the same evals, on your production-level data. Maybe you'll not run them on every input and output, but you'll choose 1 in 10, 1 in 100, 1 in 1,000, whatever ratio you need, because they are costly as well. And then you are going to make sure that you are observing them, and whenever anything changes, you are able to observe it and make changes again in your base product. And then this whole cycle repeats itself. Okay, so just to give everyone a summary again: we start with the success criteria and the expected behavior. On the basis of that we get one level of metrics and our expectations of what the prompt should do. Then we create a base product, where we put in a very primary system prompt, whatever is the best that we can do, right? And the models and everything. After that we collect a lot of data from multiple places: edge cases, the experts, the LLMs, and so on.
And then we are going to run the whole input data set through our base product. Then we are going to evaluate everything and create evaluations based on the mistakes that we have found. The evaluations become a set which is going to run all the time. You run it whenever you are doing a major release, and you can also choose to run it every week or every month so that you know whether the data is shifting. And then you have the online evals, which you run on the production-level data, and you get informed whenever an evaluation is drifting here and there. For example, we have an evaluation called accuracy. If accuracy at any time goes below 98% in these evaluations, then we should get flagged, and we should maybe improve the prompt or any part of the orchestration that we have. So this is the whole end-to-end flow that you follow while creating evaluations.
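Two pieces of the online loop described above lend themselves to a short sketch: sampling only 1 in N production requests for evaluation, and raising a flag when accuracy drops below 98%. The function names and the hash-based sampling scheme are assumptions for illustration; an observability platform would normally provide its own sampling and alerting.

```python
import hashlib


def should_sample(request_id: str, one_in: int) -> bool:
    """Deterministically pick roughly 1 in `one_in` requests for online evals."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % one_in == 0


def accuracy_alert(results: list[bool], threshold: float = 0.98) -> bool:
    """Return True when the pass rate of sampled evals falls below the threshold."""
    if not results:
        return False  # nothing evaluated yet, so nothing to flag
    accuracy = sum(results) / len(results)
    return accuracy < threshold
```

Hashing the request ID instead of drawing a random number keeps the decision reproducible: the same request is always either in or out of the sample, which helps when debugging a flagged case.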
Akash
Awesome. And it sounds like a lot of the art here, or actually I know a lot of the art here is in writing those LLM judge system prompts, creating those metrics. So how do we see this in action?
Ankit Chukla
Yeah, so let's say I have created a PRD. Let me just pull it up. Yes. So this is what a product manager will actually do, right? I reverse engineered this. I'm again saying that I'm not affiliated with the company, so if there are some similarities, they are just coincidental. I have reverse engineered this product, and this is what they would have done. Okay, so I have broken down the document into these 10 sections. The first part: before you talk about being an AI PM, understand that you are a PM first and an AI PM second. So first make sure that you are setting the context correctly, which is this is the product and this is what it does, so that people are able to understand it. And you always have to start by writing the value for the user, which is reduce search friction, decentralized financial. So what is the value for the user? That the user does not have to spend a lot of time searching for something. They'll get the advisory within the product itself. And then we have written down the value for the business, because these are again going to come up when you work out the metrics for your evals and your product. And now we are going to define different layers. Okay? I'm talking about these layers because they are also going to decide your metrics. So the user interface layer, orchestration layer, data retrieval layer, LLM layer, and logging and analytics: all of these are going to play an important role in your eval systems. And then this is level one. Okay, so I have not written a complete prompt, but I have written what will go into that prompt. Right? So, prompts and context. System prompt core: the AI assistant is configured with the following key principles. Now look at this one: role as an analyst, not an advisor. So I know that I don't have to ask it to be an advisor and suggest something to me.
I want it to just act as an analyst and then give me the understanding. It is a very small line, fine, but it is going to define the behavior of the system, right? So I'm going to use all of these things. And understand that I'm not writing the whole consolidated prompt, because I have to keep it iterable. Working from these points, whenever I've run the evals or whenever I have to change the system prompt, I can choose which line I need to pick. Otherwise, in the bigger prompt, it's very difficult to find what you want to edit. Right. So we have to take care of productivity, because these are the small frictions that lead product managers to not do the right thing. And then.
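One way to keep the prompt iterable, as described above, is to store each principle as its own named line and assemble the system prompt from them. This is a sketch under that assumption; the keys, the wording of each principle, and the helper names are illustrative and are not taken from the document shown in the episode.

```python
# Each principle is a separate, named line so it can be edited on its own.
PRINCIPLES = {
    "role": "Act as an analyst, not an advisor.",
    "grounding": "Answer only from the provided data.",
    "compliance": "Never recommend buying or selling a stock.",
    "length": "Keep the answer under 150 words.",
}


def build_system_prompt(principles: dict[str, str]) -> str:
    """Join the principles into one prompt, one bullet per principle."""
    return "\n".join(f"- {text}" for text in principles.values())


def with_override(principles: dict[str, str], key: str, text: str) -> dict[str, str]:
    """Return a copy with a single principle changed, leaving the original intact."""
    updated = dict(principles)
    updated[key] = text
    return updated


base_prompt = build_system_prompt(PRINCIPLES)
stricter = build_system_prompt(
    with_override(PRINCIPLES, "length", "Keep the answer under 100 words.")
)
```

After an eval run, only the line tied to the failing behavior needs to change, and the difference between two prompt versions stays easy to read.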
Akash
Yeah, I think you bring up an important point, which is that you need to be very sensitive to your organization. You need to go to your development lead, your AI engineers, your head of product, and clearly define where the role of the PM ends and the role of the engineer begins. I think this is a nice line you've shown here, where there's guidance for the system prompt, but you're not necessarily saying this is exactly what the LLM judge and system prompt should be. That's often going to be the case in a larger company, where your AI engineers are still going to be the ones writing the system prompt for your evals, but you've defined at a high level how it should look. At a smaller company, you might actually be writing the system prompt. You might actually be in Arize or something like that. So it's good to have the skill. But it's really important to understand where PMs should be in this. And I think your example here is more for a slightly larger company. Is that fair?
Ankit Chukla
Yeah, I would say it depends upon the autonomy that a PM gets. I have seen that in larger companies, too, some people are able to have more agency, and they are able to do all of these things. But in small companies generally you don't have to do a lot of things as a product manager, because then you are figuring things out as the light becomes more clear.
Akash
Makes sense.
Ankit Chukla
Yeah. Right. So now these are the response guidelines, and these are some context variables that we are going to inject into some queries. And understand, this document has to be created so that everyone is on the same page. Right. So don't just think that you have written an evaluation prompt and your task is done. You have to give all of this context so that people can read it and understand. Because understand, you will not be the person who is eventually coding, and the engineers have to be aware of all of these things. Yes. So now I have also mentioned that for fundamental analysis and technical analysis there are other things I need to make sure I'm giving as context to the AI, so that it does not hallucinate.
Akash
What do some of these acronyms mean here?
Ankit Chukla
For example, if I talk about fundamental analysis, then I need to collect the information. Whenever I have to understand a stock fundamentally, I need to look at a variable called the P/E ratio, which is the price-to-earnings ratio. Right. And I need to make sure that I'm giving the LLM the context of what these terms are, so that it does not make the mistake of hallucinating. Now understand, I could have skipped this and just used GPT-5, but that would be a very lazy decision. Right? Because in the long run the costs are going to compound. What can I do to give more context to the AI, so that maybe I'm okay with using a less capable model? That is what you need to understand as a PM.
Akash
So almost putting on the hat that you're not working with the best model.
Ankit Chukla
Yes, correct. So aim for the best, but design for the worst.
Akash
Makes sense.
Ankit Chukla
Yes. So you understand what evaluations are, but engineers might not have them at the top of their mind, and designers or the leadership might not either, because they are only exposed to the prototypes. They are not familiar with this. So you have to clearly explain why it matters. Right. Here it is a high-stakes domain, and in India this fintech space is heavily regulated. You cannot do this without doing evaluations. And also understand that sometimes this is going to go wrong. At that point you'll at least be able to show the regulators that you have taken all the fail-safe measures to make sure that it does not happen. Right. And then we have all of these; I'll share this document so that everyone can read it in advance. Then we have the deciding metrics and expected behavior. So what are the evaluation dimensions? Factual accuracy, compliance, groundedness, relevance. This is what each one means, and this is why it matters. Sure, it will take some time to read this document, and you don't have to write it all by yourself. My general recommendation, if you are really an AI-enabled PM, is to talk through Wispr Flow to your GPT or your Claude or your Google model. Talk as much as you can, because that is more productive. After that, ask Claude or GPT to put it in this structure, and also ask them to fill the gaps that you have missed in the document. And then we have some more sections. Yes, we have also defined some thresholds. Numerical accuracy means that if the AI is giving any kind of numbers or percentage returns, my target is that more than 98% of the time they should be correct.
And if it drops below 95%, I should get flagged, and this product should not go into production unless I improve something. Right. And I have also mentioned this one, which is super helpful for the online evals: if this metric, the compliance pass rate, goes below this level, then I should take immediate action. And then this is the expected behavior by query type. Now this will not come out of the blue. You would have done this process of creating the inputs and reviewing the outputs from the expected inputs. Yes, so here we are able to observe that these are the things that should never be missed. Right? And then we have some edge-case behaviors, such as a stock with missing data, a penny stock, these kinds of things. These will only come up once you are collecting all the data in your data set. Otherwise they will not come to the top of your mind. Right. Now I can show you the data set. Understand, this is all a synthetically created data set, but it will serve the purpose of learning. Okay, so we have divided it into multiple parts, and these are the sources. Let's say for fundamental analysis, I have collected the data from multiple sources: synthetic data, talking to the experts, looking at my own data, and doing my own research, and I was able to understand these things. Let's say someone asks how good ITC is as a dividend stock. Then this is the context that I need to give, and this is what I expect. And then these are going to be the red flags. Right. So eventually I'll not write the evaluation prompt by myself. I will put all of this information into an LLM and ask it to write a better prompt. Right. Because as a human, you can miss out on a lot of things.
And then you will run all of these things again with the data set to make sure that it is not making a mistake. Right. And I've also set some priorities.
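The numerical accuracy thresholds described above (a target of 98% and a flag below 95% that keeps the product out of production) can be expressed as a small release gate. The class name, the function name, and the intermediate "warn" state are assumptions added for the sketch; the source only specifies the target and the blocking level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    target: float       # where the metric should be
    block_below: float  # below this, the release is blocked


def gate(pass_rate: float, threshold: Threshold) -> str:
    """Classify a metric's pass rate as 'pass', 'warn', or 'block'."""
    if pass_rate < threshold.block_below:
        return "block"
    if pass_rate < threshold.target:
        return "warn"  # assumed middle state: below target but not blocking
    return "pass"


# Values from the numerical accuracy example in the discussion.
NUMERICAL_ACCURACY = Threshold(target=0.98, block_below=0.95)
```

The same `Threshold` shape could hold the compliance pass rate for online evals, with its own levels.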
Akash
So the role of the human is figuring out what are the expected elements, what is the overall guidance for the prompt. Then use AI to create the final prompt. And there are studies that show that AI is better at writing prompts than humans. So put AI where AI is better. Put humans where humans are better.
Ankit Chukla
Yes, correct. And I think collaboration is the best way out there. Rather than competing, you should collaborate with an LLM. And then both of you are going to unlock different kinds of powers.
Akash
Yes, for sure.
Ankit Chukla
Yeah. And then this data set is there. And understand that we have used multiple methods of collection. We have taken it from production queries: if you are creating a support chatbot, you would have been giving support before this chatbot, right? So you can collect that data. Then there is expert curation. In this case, we can take ideas from the financial experts. And then we are going to do synthetic generation. And this is also important: we have to maintain this data. Although at 91, it is not needed to maintain such data very frequently, if your product is for a use case where the data changes a lot, you should make sure that you update this data once in a while. So these are a kind of test cases which should always run before you do a major release. Right. And then these are the evaluations. We are going to do three kinds of evaluations: automated programmatic evaluation, LLM as a judge, and human evaluation. For automated evals, I can do a factual accuracy checker, a compliance checker, a groundedness checker, and a structural checker. What I can do is just match the words, numbers, and so on: am I able to see them in the sources as well? Right. And then with LLM as a judge, I can check relevance, balance, and tone. Again, I will not create the prompt by myself. I'll just give this section to an LLM, and it is going to generate a good prompt for me. Right. And then for the human evaluation protocol, I'm going to use humans selectively, because they are costly and they take time. So whenever a new feature launches or something important happens, I'm going to make sure that they are reviewing it. Right.
And if the automated metrics, let's say the LLM as a judge or these checks, are failing, then I'll make sure that I'm involving a human to check what is happening there. Right. Now this is a slightly different thing: for human review, I have two methods. I can ask them to just give me a pass or fail, or I can ask them for a rating of 1 to 5. Different people have different opinions, but it's good to start with a pass or fail criterion so that people are objective. And as you go forward, even if people are rating 1 to 5, ask them to give you a remark on why they think this is the case, because you can use that data to further improve your context or your models. Right. And then this is the criteria: if at any point we have any of these issues, we are going to block the deployment. Right. It should actually be less than that. Yes. And then offline evals; I think we have already understood this, but I'll share this document. Here we are going to run a smoke test, a full regression, and everything. We are going to run all the evals, and we are going to decide block or no block; if these criteria are not met, we'll not release it. And then on the online evals, we will have the latency. P50, P95, P99 are there because averaging is not going to give you the right kind of result. I think that's an important part, so I can take some time here. Many people might have heard about P99 and P95. Okay, so what are these things?
So let's say I want to measure the latency of a product. Suppose 90% of people are getting a response in 100 milliseconds and 10% of people are getting it in 1,000 or 10,000 milliseconds. If I take the average, it will come out as a number that does not look very big, but still 10% of people are facing issues. P95 means that 95% of people are getting their results within this latency, which is, let's say, a good number. Or I can use P99, which is the most used metric: 99% of people should be able to get their results within a particular latency, let's say 100 milliseconds or 10 milliseconds. Averages do not work there, so you have to write these kinds of latency metrics. And then, yeah, this is the sampling-based quality. And this is user feedback. Yes. Now this is one thing which is very important: apart from running your evaluations, online evaluations have one more input. After every AI tool has finished doing its job, you will see a thumbs up or thumbs down option, so that you are able to integrate that feedback back into your product. Right. This is hard feedback. A soft feedback could be that people are regenerating the same answer again and again, or they are not closing the session quickly and are maybe frustrated with the answers. That is soft feedback which you should also consider. There are other signals as well, such as leaving the session before buying or selling something, or escalating to support. That means that your evaluations are maybe not working. So also consider these four things as some kind of evaluations in your product.
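The latency point above, that an average hides the slow tail, can be seen with a short calculation. This sketch uses the nearest-rank definition of a percentile and made-up latencies that follow the example (90 requests at 100 ms, 10 requests at 10,000 ms); other percentile definitions interpolate and can give slightly different values.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of values at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 90% of requests are fast, 10% are very slow.
latencies_ms = [100.0] * 90 + [10_000.0] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 1090.0
p50_ms = percentile(latencies_ms, 50)            # 100.0
p95_ms = percentile(latencies_ms, 95)            # 10000.0
p99_ms = percentile(latencies_ms, 99)            # 10000.0
```

The mean of 1,090 ms describes nobody's actual experience: the typical user sees 100 ms, and the P95 and P99 values expose the 10,000 ms that the slowest tenth of requests take.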
Right. And then drift detection is already there. Yes. And then we have A/B testing, which means that you also run A/B tests with various prompts and models, so that you are able to understand how the evaluations are running there and what the user experience is. And then we have the last part.
Akash
Can you say a little bit more about that? So are you A/B testing the evals?
Ankit Chukla
Okay, so evals themselves cannot be A/B tested. What we are A/B testing is the prompts and the models, and then we see in production how the evals perform there, along with the hard and soft feedback from the users.
Akash
Okay.
Ankit Chukla
So it might be possible that your evaluations are running correctly, but the user has something else to say. Right. So only this testing will make sure that you are building the right thing for the users and that you are not made short-sighted by your synthetic evaluations.
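One way to picture this comparison is a small summary per prompt or model variant that puts the automated eval result next to the hard and soft user feedback. This is only a sketch: the `Interaction` record, its field names, and the logged rows are hypothetical, and a real A/B test would also need enough traffic and a significance check before drawing conclusions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    variant: str               # prompt/model variant that served the request
    eval_passed: bool          # automated online eval result
    thumbs_up: Optional[bool]  # hard feedback; None when the user gave no rating
    regenerated: bool          # soft feedback: the user asked for the answer again


def rate(flags):
    """Share of True values; None when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def summarize(interactions):
    """Group interactions by variant and compute eval and feedback rates."""
    by_variant = defaultdict(list)
    for item in interactions:
        by_variant[item.variant].append(item)
    return {
        variant: {
            "eval_pass_rate": rate(i.eval_passed for i in items),
            "thumbs_up_rate": rate(i.thumbs_up for i in items if i.thumbs_up is not None),
            "regenerate_rate": rate(i.regenerated for i in items),
        }
        for variant, items in by_variant.items()
    }


# Hypothetical production logs for two prompt variants.
logs = [
    Interaction("A", True, True, False),
    Interaction("A", True, True, False),
    Interaction("A", False, None, True),
    Interaction("A", True, True, False),
    Interaction("B", True, False, True),
    Interaction("B", True, None, True),
    Interaction("B", True, False, False),
    Interaction("B", True, True, False),
]

summary = summarize(logs)
for variant, stats in summary.items():
    print(variant, stats)
```

In this made-up data, variant B passes every automated eval yet gets fewer thumbs up and more regenerations than variant A, which is the situation described above: the evaluations look fine, but the user has something else to say.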
Akash
Yeah, makes sense. You can't rely only on the online evals; you have to actually look at the user data.
Ankit Chukla
Yes, correct. Because understand, your evaluations did not come out of your own mind. You have gone ahead and done a lot of research with subject matter experts. And here also you have to understand that you don't have to lock yourself into a fixed mindset. You are going to be evolving with the user feedback as well, and there is no better method to understand the user than A/B testing. And then this is continuous evaluation improvement, which means online evals and offline evals are going to be in a continuous loop with each other. And then eventually we have this coverage section: what are we testing, what is done through online, what is done through offline, and what is the frequency. This is very important because many teams think of this process as set it and forget it: I have done the evals, the product is done, now what do I need to do? But if you're not observing, if you're not setting a cadence for improving your evals and improving your prompts, then you are going to suffer from data drift. It happens with many products that initially, at launch, it looks very good, but after a certain period of time the expectations of the users increase. That is one thing. But the quality also drops because the data has changed, so the evaluations need to evolve. So this is, end to end, how you do real-life, production-level AI evaluation documentation. As a product manager, I understand you may not need to create documentation this detailed. At least create something basic, take the help of LLMs, and just make sure that before you release it to the team you take a quick look at it yourself.
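A minimal sketch of what a cadence-based drift check could look like, assuming the team records the eval pass rate at launch as a baseline and recomputes it every week. The function name, the 5-point threshold, and the weekly numbers are all hypothetical choices for illustration.

```python
def detect_drift(baseline_pass_rate, weekly_pass_rates, max_drop=0.05):
    """Return the indices of weeks whose pass rate fell more than max_drop below the baseline."""
    return [
        week
        for week, pass_rate in enumerate(weekly_pass_rates)
        if baseline_pass_rate - pass_rate > max_drop
    ]


# Hypothetical numbers: the eval pass rate at launch, then five weekly re-runs.
baseline = 0.92
weekly = [0.93, 0.91, 0.88, 0.84, 0.80]

flagged_weeks = detect_drift(baseline, weekly)
print("weeks needing review:", flagged_weeks)  # weeks 3 and 4 (0-indexed)
```

A check like this only says that something changed. Deciding whether the prompts, the data, or the evals themselves need updating is still the review step described above.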
Akash
Yes, please edit it, and don't just send them some LLM slop or they'll start to lose respect for you.
Ankit Chukla
Yes, and one more piece of advice I would give to people. There is a major issue: I'm sure that while I was scrolling through this document, most of you were looking forward to a download link, wondering when I can share this document with you. But I can assure you that 99% of you are not going to read this document ever. And this also happens with our engineers. They don't have time to read all of this. So generally what I do is a stakeholder management hack which is taught at Amazon: if you really want people to read these documents, get them in a meeting, and for the initial 15 or 20 minutes of that meeting, get them to read the document while you are in the meeting. So first of all, don't write too many documents. You can write them for your own clarity, but don't expect other people to read them. But in some cases like this, if you really have to write a document, make sure that other people are reading it. And the best hack is that if you are planning a 20-minute meeting, make it a 35-minute meeting and dedicate the initial 15 minutes for people to only read that document.
Akash
Makes sense. So we've covered what this all looks like. Can you maybe bring this back for us with some real-life examples of why evals matter?
Ankit Chukla
Yes. So I'll give you some examples. But before I talk about the examples, there is a major confusion in the world that evals are nothing but a fancy QA role that has now been given to the product manager. And that could not be further from reality, because now you have seen the process. A QA is not involved with the subject matter experts. A QA is not improving the prompt, he's not improving the product. He's going ahead and informing. So information is different from transformation. As a product manager you are transforming your product, while as a QA you are generally giving information to the developer that no, this is not working. So there is a difference between a transformational role and an informational role. So don't think that if you are doing evals, you are just doing the job of a QA. It is much more than that, although the terms are similar. Now, talking about why evaluations matter, I'll give you some cool examples. First, reliability and trust. Good evaluations can give you reliability and trust. Consider the example of Grammarly: let's say Grammarly translates across multiple languages. If one tone error can change the meaning across 500-plus scenarios, there is a lot of trickle-down effect. So if you're writing good evals, you are making sure that the tone and everything else is matched correctly. Similarly, this actually happened with GitHub Copilot. When it was initially launched, they had a very small error: there was some mistake in a YAML file, it was not caught, and they did not have evaluations for it. And what happened was that when people started using it and moving it to production, most of their products were breaking. So if they had written an evaluation, this would not have happened. And now it's a product at scale.
So it faced a lot of repercussions. Then we have Klarna. Initially, when Klarna developed their AI chatbot, they were focusing on things such as how many people are looking at the chatbot and how many people are saying that it is helpful. But soon they were able to understand where they need to push people in the conversion funnel. So there are business metrics also that we need to take care of. Then they transformed the strategy, and now their AI-led suggestions are increasing the checkout conversion rates. So don't think that it is only for the users. You should also evaluate your AI on some business metrics. And then eventually chatbots. Let's say you created a chatbot on some information that is available right now. In the future, if you add more products to your system, if some context is changing for the users, the user behavior is changing, your products are changing, then you have to make sure that your evaluations are always running, that they are always online. This is a very common use case, because chatbots are being developed for customer support with AI. Otherwise a support chatbot will keep giving old policy information, and CSAT will drop.
And another very interesting use case is the chatbot. Whenever you are building an AI-assisted chatbot to support your customers, a major issue is that if you are only giving it older information, if you're not keeping it relevant, then you are going to give outdated information to the users, and that is going to defeat the effort. And you will not be able to know whether the information is old or new unless you are running the evaluations. So evaluations are going to play a very important role in all of these things. And in the end, my major takeaway from the session for everyone is that evaluations are not optional. They are the guardrails for all the AI-driven outcomes. Don't think that you'll be able to create a very solid, complex product without having the right kind of evaluation. Also, evaluation is not, I would say, a goal. It is actually an ongoing journey which will keep on evolving as your product evolves.
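To make the outdated-information case concrete, here is a sketch of a simple code-based eval that checks whether a support answer quotes the current policy rather than a retired one. The `POLICY_FACTS` table, the return-window values, and the sample answers are invented for illustration, and plain substring matching is a deliberate simplification of what a production check would need.

```python
# Hypothetical source of truth: the current value of each policy fact,
# plus values that used to be correct and must no longer appear.
POLICY_FACTS = {
    "return_window": {"current": "30 days", "outdated": ["14 days"]},
}


def check_freshness(answer, fact_key, facts=POLICY_FACTS):
    """Pass only if the answer states the current value and none of the outdated ones."""
    fact = facts[fact_key]
    text = answer.lower()
    mentions_current = fact["current"].lower() in text
    mentions_outdated = any(old.lower() in text for old in fact["outdated"])
    return mentions_current and not mentions_outdated


fresh_answer = "You can return any item within 30 days of delivery."
stale_answer = "Returns are accepted within 14 days of delivery."

print(check_freshness(fresh_answer, "return_window"))  # True
print(check_freshness(stale_answer, "return_window"))  # False
```

Running a check like this continuously against live answers, and updating the facts table whenever a policy changes, is one way to notice that the chatbot has started serving old information.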
Akash
So there you have it, folks. We walked you through, if you recall, at the very beginning, the intuition for what evals are. If you're asking for instructions for making chai, even the same LLM is going to give you three different responses. That's non-determinism in action. To check whether those responses are acceptable, whether they are hallucinating, whether they are at quality, whether they are at the length that we want, we would create evals. Some of those would be code-based evals, some of those would be LLM-judge evals, and some of those would be hybrid human evals, where you bring in a domain expert, a subject matter expert, to help you. And so evals are not just QA rebranded; this is a critical skill. It's almost part of the PRD. We saw the long document that Ankit created. We are going to include the links for all the things that we shared in the description below this episode or in the newsletter accompanying and summarizing this episode. So that'll give you all the resources to go build these evals yourself. Don't be scared of adding AI to your features, guys. AI is just another API, and with these evals you will be able to handle the non-determinism of it. Ankit, thank you so much for this masterclass in evals.
Ankit Chukla
Thanks, Aakash. Happy to be here and happy to be helpful.
Akash
See y'all later.
Ankit Chukla
Take care guys.
Host: Aakash Gupta
Date: February 19, 2026
In this “masterclass” episode, Aakash Gupta is joined by Ankit Shukla—AI PM educator and practitioner—to demystify AI evaluations (“evals”) for product managers. The episode breaks down what evals are, why they’re critical to shipping successful AI features, and walks through a rigorous, actionable framework (with a real-world case study) for implementing evals in generative AI products. The show highlights the nuance, strategy, and collaboration needed, moving beyond basic QA to show how evals underpin quality, trust, business impact, and compliance in AI-driven products.
Five key failure modes of AI prototypes and how evals address them (Timestamp: ~12:15–18:00)
"Evals are not optional. They are the guardrails for all the AI-driven outcomes." — Ankit (60:42)
| Time | Segment / Topic |
|---------------|---------------------------------------------------------------|
| 00:00-03:26 | Importance of AI evals, skills for AI PMs |
| 03:26-10:33 | What are evals? Simple vs advanced forms, job site example |
| 12:15-18:10 | Five failure modes of AI prototypes, how evals help |
| 22:14-39:21 | Framework: From criteria to data to evals (stock chatbot) |
| 35:49-37:02 | Offline vs online evals, PM as definition owner |
| 41:43-44:20 | Documentation, context, and collaboration tips |
| 53:34-54:12 | A/B testing, role of user feedback in evals |
| 56:57-57:57 | Stakeholder hacks: how to ensure people actually read docs |
| 58:07-61:54 | Why this isn’t QA: business/scaling case studies |
Relevant links, cheat sheets, and templates mentioned in the episode will be in the episode description and newsletter.