Ankit Chukla (27:08)
So the orchestration layer is about how you make sure that all of these four things connect with each other. A good example is n8n: on n8n I will have different nodes, LLMs, tools, memory, that connect with each other. Now, this orchestration layer can also be a point of failure. If it has a lot of latency, if it runs across geographies, or if there are orchestration issues, it can give you certain kinds of challenges, right? You as a product manager don't have to fix anything right now; you should understand these as variables. You don't have to love your product. Your aim is to give the best experience to the users and be helpful to them.

So now we are going to start from here. Next, you need to understand how your product is performing. In order to see that performance, you will create a very good data set. This is where I have marked a star, because this is where most of your effort is going to go. The data set is nothing but the different kinds of inputs that users can give your product, right? So you are going to collect past data. For example, INDmoney already has human advisors sitting at the back end; they offer a service where you can talk to an advisor. From those logs they can understand the different kinds of questions that people generally ask. That becomes one source of data. The second source is research: going to Google, ChatGPT and so on to research what people ask when it comes to understanding a stock. Similarly, you can use LLMs to generate what is called synthetic data. You tell the model: this is the product I'm offering.
Can you go ahead and give me some kind of sample data set? And it will give you a sample data set. And then, finally, there are the experts: you are going to talk with real investment advisors, ask them what different kinds of questions people ask, and get them to fill in certain sheets. These four sources are very important because they make sure that you are actually dealing with real cases, right?

Once you have that, you are going to run it through your base product, whatever you have created, and you will get certain outputs. And I can assure you, you will be surprised: the output will not be as good as you thought your base product would be. Right? And most of the time you might not be a good judge of it yourself, so you can also include experts. Let's say I'm a product manager and I do not know what good or bad advice looks like in terms of finances. In that case I'm going to involve a financial expert and ask them to tell me whether these outputs are good or bad, which is whether they have passed or failed the criteria. And they also need to tell me what that criteria is, right? Otherwise, what happens is that product managers who are not subject matter experts or domain experts will not be able to come up with the right kind of evaluations. But once you show people data, they will be able to tell you, this is a mistake. It's easier to point at a mistake than to prepare for it in advance, right? So now we have these outputs and we have these remarks, and these remarks are going to feed back into the process: from the expert analysis, from the user empathy, from the success criteria, from the expected user behavior.
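The loop described above, collect inputs from the four sources, run them through the base product, and record an expert's verdict, can be sketched roughly like this. `base_product` and `expert_review` are hypothetical stand-ins for the real product pipeline and the human expert's judgment, and the sample questions are invented:

```python
# Hypothetical sketch: assemble the eval data set from the four sources, run
# each input through the base product, and record an expert pass/fail remark.

sources = {
    "advisor_logs": ["Is this stock overvalued right now?"],
    "research": ["What does a P/E ratio tell me?"],
    "synthetic": ["Should I sell before the earnings call?"],
    "expert_sheets": ["How often should I rebalance my portfolio?"],
}

# One row per input, tagged with where it came from.
dataset = [{"source": src, "input": q} for src, qs in sources.items() for q in qs]

def base_product(question: str) -> str:
    """Placeholder for the real product (LLM + tools + memory + orchestration)."""
    return f"Here is some advice about: {question}"

def expert_review(question: str, output: str) -> dict:
    """Placeholder for the domain expert: pass/fail plus the criterion applied."""
    return {"passed": bool(output.strip()), "remark": "criterion goes here"}

results = []
for row in dataset:
    out = base_product(row["input"])
    results.append({**row, "output": out, **expert_review(row["input"], out)})
```

The point of tagging each row with its source is that later, when outputs fail, you can see which kind of input (logs, research, synthetic, expert sheets) your product struggles with.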
What we'll do is derive a set of evaluation metrics: the things we now need to make sure do not happen. For example, one of the investment advisors will see what we are building and point out problems with the output. An evaluation expert might tell you that your product is generating recommendations, or giving information, that is very outdated, or that it is hallucinating information. So you take all of this, by actually showing the experts the inputs and the outputs, and then you decide these metrics. Okay? And understand, it will take you some time to understand and decide these metrics. That is why I have also created a cheat sheet, with the help of some of my own knowledge plus Claude and GPT, for anyone building this kind of product. I'll make it available in the description as well; Akash will make it available. It helps you understand, for a given kind of product, which evaluation metrics you should consider. It is a very exhaustive cheat sheet.

After that, what happens? Now I have certain criteria, and I will decide what to use for the evaluations. For things that are very definitive, I am going to use code: whether all the required words are mentioned or not, whether I am following a certain criterion such as summary length or not. Code is the cheapest. In some cases I am going to use humans, and in some evaluations I can use LLMs, but most of the time I am going to use a hybrid. Hybrid means that the LLM flags situations that are not working, and then a human gives the final call. Right? And then you are going to write the evaluations. Okay?
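A minimal sketch of those evaluation modes, cheap code checks, an LLM flagging pass, and a hybrid that escalates to a human. The prompt text, check names, and `call_llm` parameter are all illustrative assumptions, not a real API; you would inject whatever LLM client you actually use:

```python
# Illustrative sketch of code-based, LLM-based, and hybrid evaluations.

def code_checks(output: str, required_words, max_words: int = 60) -> dict:
    """Deterministic, cheap checks: keyword presence and summary length."""
    words = output.lower().split()
    return {
        "has_required_words": all(w.lower() in output.lower() for w in required_words),
        "within_length": len(words) <= max_words,
    }

# Hypothetical judge prompt; tune wording and criteria to your domain.
JUDGE_PROMPT = (
    "You are a financial-domain reviewer. Given a user question and the "
    "product's answer, reply FLAG if the answer seems outdated, hallucinated, "
    "or unhelpful; otherwise reply OK.\n\nQuestion: {q}\nAnswer: {a}"
)

def llm_flags(question: str, answer: str, call_llm) -> bool:
    """LLM pass: `call_llm` is whatever client you use, injected as a function."""
    reply = call_llm(JUDGE_PROMPT.format(q=question, a=answer))
    return reply.strip().upper().startswith("FLAG")

def hybrid_eval(question, answer, required_words, call_llm) -> dict:
    """Hybrid: code checks first; the LLM flags the rest; humans review flags."""
    result = code_checks(answer, required_words)
    result["needs_human_review"] = (
        not all(result.values()) or llm_flags(question, answer, call_llm)
    )
    return result

fake_llm = lambda prompt: "OK"  # stand-in for a real LLM call
report = hybrid_eval("Is ACME a buy?", "ACME trades at a P/E of 12.", ["P/E"], fake_llm)
```

The design choice here mirrors the talk: run the cheapest checks first, let the LLM do subjective triage, and spend scarce human-expert time only on the flagged cases.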
So in the machine learning and NLP world we already have some base-level evaluations that can be done by code: for example length checks, BLEU (bilingual evaluation understudy), ROUGE, and word error rate. With these we can understand whether the output is following a given criterion or not. Then, in the parts where code cannot work because the quality is subjective, we are going to use other evaluations. Evaluations with LLMs can measure things such as guardrails, UX tone, helpfulness, and relevance. This is done with prompts that we give to a large language model, making the LLM act as a judge.

Now, about ROUGE. Yes, there are two metrics, BLEU and ROUGE. Traditionally in machine learning, what happens with BLEU and ROUGE is this: let's say I have some output given to me by the model, and I have a golden data set. The golden data set means the real output, the output that should be accurate. BLEU and ROUGE are two methods that help you understand precision and recall for your model. For example, let's say the output I get from the large language model is "the cat is on the bed", and the golden data set says "the cat is on the mat". Now, these two are entirely different in terms of meaning, right? One is a different scenario from the other.
But what BLEU and ROUGE do is compare the words. If you compute the BLEU or ROUGE metric for this pair: I have one, two, three, four, five, six words here, and one, two, three, four, five, six words there, and BLEU and ROUGE will tell me that five of the six words are matching. That means: yes, your output and the golden data set are actually matching each other. Right? But if you go and use another LLM, it will understand that, boss, this is not true: "the cat is on the mat" and "the cat is on the bed" are actually different kinds of statements, different scenarios. So that is where these metrics stand right now. In traditional machine learning they are used a lot. They can be used to check whether your information is grounded or not; you can do some simple matching. But ultimately, if you are judging answers on the basis of BLEU and ROUGE alone, you will not be able to do it. That is why these metrics are slowly becoming outdated for real generative use cases.
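To see concretely why word overlap misleads, here is a minimal unigram-precision score in the spirit of BLEU-1 applied to the example above. This is a simplified sketch, not the full BLEU or ROUGE definitions, which add higher-order n-grams, brevity penalties, and recall-oriented variants:

```python
# Minimal clipped unigram precision (BLEU-1 style) on the cat example.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    # Clip each word's count by how often it appears in the reference.
    matched = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    return matched / len(cand)

score = unigram_precision("the cat is on the bed", "the cat is on the mat")
# Five of the six words match, so the score is high (about 0.83) even though
# "bed" and "mat" describe entirely different scenarios.
```

This is exactly the failure mode from the talk: the surface overlap is 5/6, so the metric says the answers agree, while the meaning does not.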