
A
LLM-powered systems continue to move steadily into production, but this process is presenting teams with challenges that traditional software practices don't commonly encounter. Models and agents are non-deterministic systems, which makes it difficult to test changes, reason about failures, and confidently ship updates. This has created the need for new evaluation tooling designed specifically around the properties of LLMs. Comet is a platform with roots in MLOps that has evolved to support teams building modern LLM-powered applications. The company recently launched Opik, which is an open source platform focused on evaluation, optimization, and observability for LLM agents. Together, the tools aim to bring the rigor of traditional engineering and ML workflows to the rapidly evolving world of agent-based systems by treating prompts, tools, and workflows as optimizable components that can be evaluated and improved over time. Gideon Mendels is the co-founder and CEO of Comet. He previously worked at Google on hate speech and deception detection, and he founded Groupwise, which trained and deployed NLP models processing billions of chats. In this episode, Gideon joins Kevin Ball to discuss how agent development sits between software engineering and ML, why evals are the missing foundation for most AI teams, prompt optimization as a search problem, and the future for continuously improving agents in production. Kevin Ball, or K. Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K. Ball on Twitter or LinkedIn or visit his website K Ball LLC.
B
Gideon, welcome to the show.
C
Yeah Kevin, thanks for having me. I'm a big fan of the podcast, so I was looking forward to this one.
B
Yeah, I'm excited. Well, let's start with you. So can you give a little bit about your background, how you ended up at Comet, and then some of what Comet is about?
C
Absolutely. So originally I started as a software engineer, kind of moved throughout the stack in the first few years, and then about 10, 12 years ago I shifted to working on machine learning. I was a grad student and then I went to Google. Funny enough, I worked on language models. This is, you know, 2016, so they weren't large nor very good. Right. These are like pre-transformer days. Unfortunately LSTMs, if anyone still knows what that is. And you know, as someone coming from a software engineering background where we take a lot of pride in how we build software, obviously a lot of that is changing right now. I'm sure we'll talk about it. But you know, a lot of pride in how we build software, the tools that we use, and then joining an ML team with amazing, very, very smart and talented people. But just seeing how the whole thing is a little bit like the Wild West, it was very, very challenging. You know, we worked on hate speech detection in YouTube comments. If you remember the YouTube comments section back in the day, I think someone called it the worst place on the Internet. So we had a hard time getting these models to work. And from that point I was like, okay, look, we had data, we had compute, we had smart people, we still couldn't do it. What is it? And it's not considered necessarily a hard ML problem. And I realized it's really about the process of how you drive these projects. And I called my co-founder here at Comet, with whom I had worked at another startup building ML models, and I was like, hey, you remember making fun of my ML workflows and how everything is stitched together? So I'm at Google and it's exactly the same thing, just at massive scale. So that's really how we got started. This is 2017, 2018, and we started with specifically what my team and myself needed back then at Google, which was around model experiment tracking.
You train a bunch of these models, there's all these moving pieces, hyperparameters, dataset versioning, all these results, and it's really hard to know that you're making progress and to understand what you're doing next. Collaborating is completely out of the question because no one has access to anything. So we started with that, and then over the years we kind of expanded that side of the platform: dataset versioning, model registries, model monitoring and such. And then about two years ago or so, obviously there was quite a big shift in the industry, and a lot of our customers and users started telling us, hey, for this use case, we're not going to train a model anymore, we're going to try to build it on top of the OpenAI API, but it's still very similar because we're testing all this different stuff and we still want to use Comet for that. How can you help us? And at first we started to add some features and such to help with that, but eventually we realized, okay, there's a lot of similarities, but there's also enough differences not to try to bake it into a slightly different workflow. In September 2024, we launched Opik, which is our open source product focused on teams building agents, any type of LLM-powered applications really, focusing on the end to end from early dev through the deployment process to production, and specifically things around observability and evaluation and optimization of these agents automatically. But yeah, it's been a fun ride. We power some amazing AI teams at Uber, Netflix, Etsy, Shopify, Autodesk. We have roughly 150,000 engineers all over the world using our products. Great adoption on the open source front. So yeah, it's been a fun ride so far.
B
Well, let's dig into a few pieces of that. I definitely remember like MLOps has been a term for a while of like there's just all this operational stuff that is hard and different and as you highlight like now in the LLM world it's everybody's getting to play with this. In fact everybody's having to be in this space and start to deal with non determinism and how do you deal with data flows and all of this different stuff, but it's also a little bit different. So let's maybe. Can we dial in what are the particular characteristics of development with agents that were different enough that you said, okay, this is a whole new product, this is a thing we need to build?
C
Yeah, yeah, that's a great point. I mean, when you're training a model, first of all, you typically have some kind of a training dataset, and the algorithms are mostly a commodity, right? Like everyone's mostly using the same algorithm. So you do spend some time maybe changing some stuff, but it's mostly out of the box. And the majority of the time you spend on figuring out what is the right dataset or how you want to augment it in different ways, the model hyperparameters, how to formulate the business problem as a machine learning problem, which is not trivial in many cases. And then you get these massive binaries on the other side, the model weights, and how do you do retraining? At the LLM level, the majority of our users and customers consume the LLM as an API. It could be that they're deploying it and it's open source, but the majority of them use the commercial ones. So you're no longer in control over the weights whatsoever. You have like three hyperparameters, you know, temperature, stuff like that, that you can play with. But it's mostly static, right? Like sure, you know, OpenAI will release a new version every quarter or so, but it's mostly static. And what you do control as a builder, whether it's a simple LLM workflow or a full-on agent, are slightly different things, right? You control the system prompt, which is the equivalent of the weights if you train the model. You control tool calls, context, vector DBs, all these. But the similarity is that, in reality, you have all these variables: the system prompt, the configuration of your chunking, the tool call descriptions. You've got all these variables and you're trying to find a combination that gives you the best results. So from that perspective, it's quite similar. In the day to day, it tends to look quite different. So we had a lot of experience in this workflow, but the SDKs, the UI, all those things tend to look quite different. So that was kind of the main motivation to separate the two.
B
That makes sense. So looking at now that development problem, right, you have these different pieces that you can coordinate. And definitely one of the things that I see teams struggling to come up to speed with, or a lot of engineers at least, is just grappling with non-determinism. If you're coming from a machine learning background, that may be old hat, right? Machine learning has always been statistical. But if you're coming from a traditional software background, which now more and more people are, and still having to deal with this stuff, non-determinism is weird. That's scary. We used to try to get rid of all of that. So, like, what changes in the process need to happen?
C
You've got kind of multiple levels of non-determinism, maybe some of them not officially non-deterministic. Right. Obviously you've got the LLM. Technically you're supposed to be able to get deterministic output if you, you know, set the temperature to zero. Technically. But you know, there's actually an interesting write-up on why they're not deterministic. And it's actually because of how they're deployed with a mixture of experts. Because at the end of the day it's a combination of matrix multiplications. It should be deterministic if the input is the same. But I'm digressing. So you've got that. I think, to your point, what is a slightly different concept to software engineers and much more familiar for data scientists or people training models is this: in software we write these unit tests, and at some point we have pretty good coverage. Of course, something could happen in production. Another edge case, we'll add a unit test. But you have this level of confidence that this new version that you built is not going to break everything, right, before you deploy. And with these agents, how do you translate that concept? Right? Because first of all, in the small details, you can't do string matching and stuff, because obviously you can have semantically the exact same output that, from a string content perspective, is completely different. Right. So you sort of have to think, how do I compare or do assertions between two of these outputs? Right. So that's a hard problem. But there's also this concept of generalization, right, which is very common in ML: how do I write a test suite that gives me enough coverage and variance so I have confidence that this thing is working? Right. So I think that's a concept that's relatively new for a lot of people who haven't trained models before. And the reality is that building these agents is somewhere in between software engineering and ML.
It's definitely not pure ML, it's definitely not pure software engineering. But there's a lot of learnings from both of these paradigms that you could bring together and actually ship stuff that works. Like that's the reality. Like, it's hard to get this stuff right, but like when you get it right, it's magical. I think we're all experiencing that on a day to day basis.
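To make the point about assertions concrete, here is a minimal sketch, assuming a crude lexical measure (token-overlap F1) as a stand-in for real semantic comparison. The function names, the threshold, and the PTO sentences are invented for illustration; token overlap does not capture true semantic equivalence, it only shows why exact string equality is the wrong assertion.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(predicted: str, expected: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred, exp = tokens(predicted), tokens(expected)
    if not pred or not exp:
        return float(pred == exp)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def assert_close(predicted: str, expected: str, threshold: float = 0.6) -> bool:
    """Fuzzy assertion: pass when the overlap score clears a chosen threshold."""
    return token_f1(predicted, expected) >= threshold


# Hypothetical expected answer and two agent outputs (values are made up).
expected = "Employees in Norway get 25 days of PTO."
reworded = "In Norway, employees get 25 days of PTO"
unrelated = "The office is closed on Sunday"
```

An exact comparison (`reworded == expected`) fails here even though the two sentences say the same thing, while `assert_close(reworded, expected)` passes and `assert_close(unrelated, expected)` does not.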
B
Yeah. So let's talk about some of those primitives and one in particular that comes up a lot and is talked about a lot. But then also when I talk to teams, everybody's like, yeah, we need that, but we're not really sure what it means is evals. So how do you think about evals?
C
Evals, if you put in the work, are extremely powerful and useful, and they give you that level of certainty. Now, I completely agree with you that most teams don't do it right. And I have a very strong thesis on why that's the case. But evals are really about mapping this concept of a test suite to your application. Right. So, okay, in the most basic form I have a list of inputs and a list of expected outputs. Now, it's not a simple assertion; you typically want to have some way to measure the distance between the agent's answer and the expected output. You could do it with deterministic metrics like BLEU score, stuff like that. Or you could do it with LLM as a judge. But at the end you get a score. Hey, this passed, this failed. This is your overall score across the board. So I think they're extremely powerful. From our customer base, the ones that put in the work and invested in adding those are definitely among the more successful ones in getting these agents to work well in production. And I think the reason not that many people do them is because it's extremely hard to construct these datasets. It's painfully hard for various reasons. First of all, often the person that's building the application is not an expert on what's the right answer. If I'm a software engineer and I'm building a tool for the sales team or the HR team, I do not know what is the right PTO policy in Norway for an employee that's been in the company for two and a half years, right? So you kind of have a domain knowledge gap, which is hard. In addition, you know, if you're doing this kind of end-to-end testing, completely agnostic to what happens in the process, which is one way of doing it, that's maybe fine. But if you're trying to test something that does take into account which tool it should have called, constructing these graph traversals and comparing them is really hard work. It's extremely challenging.
And look, there's a lot of ways we and others help with that. We have UIs for subject matter experts to clearly annotate stuff and add things. There are attempts, which I'm not a big believer in, to solve this with synthetic data, but I think that the solution is a product solution, or a product approach, right? So the reality is, whether you built evals or not, you put the stuff in production and at some point someone comes complaining, right? Hey, this user tried to do this, it gave a completely wrong answer, or this blew up, and so on. And then as the person owning this, whether you're an engineer or a PM, you go ahead and you try to fix it, right? Because that's what we do. And what we're spending a lot of time on is, how can I take this activity that you're going to do anyway and use that to bootstrap the eval dataset? So I'll give you a specific example. Let's say it's an HR chatbot, and a user asked about PTO policy in Norway. There's some context injected with their tenure and all those kinds of things. HR comes and says, hey, that's the wrong answer. So you would go in there and you would write it in free text: if the employee is located in this country and has this kind of tenure, you should look at this document, or the answer should be this. So people will do it anyway and might go and change the system prompt. But what we're trying to do is actually build this test suite based on that. So now you have a new test sample with this input and this output, with an assertion that says the answer should be X or it should be greater than a certain number of years. And then the next time you try to iterate on your agent, whatever you're changing, the LLM, the system prompt, the tool calls, you're going to run through this eval suite automatically. So it's a product solution, right? It's not an algorithmic solution to this problem.
From speaking to users and customers, I think that's the right solution. And we're spending a lot of time really nailing that workflow. You know, whether it's UI first or it's like terminal first. And there's all these questions because I do believe in evals. Like, they're very powerful.
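The workflow described here, turning a production complaint into a permanent test sample, can be sketched roughly as follows. This is a minimal illustration, not Opik's API: `EvalCase`, `run_suite`, `toy_agent`, and every PTO figure are hypothetical names and made-up values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    inputs: dict
    check: Callable[[str], bool]  # assertion applied to the agent's answer


def run_suite(agent: Callable[[dict], str], cases: list[EvalCase]):
    """Run every case through the agent and return (overall score, per-case results)."""
    results = {case.name: bool(case.check(agent(case.inputs))) for case in cases}
    score = sum(results.values()) / len(results) if results else 0.0
    return score, results


def toy_agent(inputs: dict) -> str:
    """Stand-in for the real agent; a real one would call an LLM and tools."""
    if inputs["country"] == "Norway":
        return "You have 25 days of PTO."
    return "You have 20 days of PTO."


suite: list[EvalCase] = []

# Regression case added after HR flagged a wrong answer (figures are invented).
suite.append(EvalCase(
    name="pto_norway_2.5y",
    inputs={"question": "How much PTO do I have?", "country": "Norway", "tenure_years": 2.5},
    check=lambda answer: "25 days" in answer,
))

# A second case written by a subject matter expert; the toy agent gets it wrong.
suite.append(EvalCase(
    name="pto_germany_1y",
    inputs={"question": "How much PTO do I have?", "country": "Germany", "tenure_years": 1},
    check=lambda answer: "30 days" in answer,
))

score, results = run_suite(toy_agent, suite)
```

Every later change to the prompt, model, or tools reruns `run_suite`, so each fixed complaint keeps being checked instead of being verified once by hand.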
B
One of the things that I find myself wondering here is, and you said something about like, oh, are you doing this end to end, or are you doing sub pieces and diving in? And what you described in terms of a product solution potentially supports either. But, like, how do you think about the different levels at which an evaluation makes sense? Coming back to the software world, are there relative equivalents to unit tests versus integration tests versus system tests or what have you?
C
Yeah, absolutely. What I typically recommend customers and teams do is start with the end to end, like the system test, right? You're not able to test every single scenario that way because, you know, there's side effects and all those kinds of things. But that's typically the easiest dataset to compile, and it does give you tons of value. Another thing we're seeing a lot of people do is, you know, when you start adding these tools, a very, very common failure mode is it's not calling the right tool. Even if your tool is perfect, if it's not calling the right one, it's not going to work. So we see people build datasets where it's essentially a classification problem. Given a certain context, you know, user context, previous message history, anything that came from previous tool calls, basically the entire graph execution up until this point, did it call the right tool? If you have the right product to help you, you don't have to manually create the graph context. That's relatively easy to generate, and it also provides relatively good results. And then of course, a certain tool could have an LLM call in it, or it could be a complete sub-agent. Often you want to test that in isolation as well. So I think it does kind of correlate to the software engineering testing concepts that we're familiar with, at a high level at least.
B
So far we've talked about using it. I think your example was, oh, this edge case came in, I got a complaint or whatever. My expert looks at this and says you should do this. To me, that maps, and I'm going to keep mapping to software engineering. That's like a regression test. We had a failure, we're introducing a test to make sure that we don't, and then we go and introduce a fix. What other types of use do you see? Is there an equivalent to, for example, test-driven development? Is there an equivalent to other types of process-level utilization of tests for these evals?
C
The approach I talked about, like introducing these, yeah, I agree. Regression tests are mostly a way to try to bootstrap that dataset in the most user-friendly way, without requiring you to do a lot of work. But when you put together a true evaluation dataset with the subject matter expert, not necessarily based on, you know, production regressions, then at that point you can do something very similar to test-driven development, right? You could one-shot the system prompt, just write two sentences, put something together really quickly, which is easy today, right? Test it on a dataset and then, okay, it's failing in these samples and these samples. I need to add another tool, maybe. So you do see that. Transparently, I don't see a lot of it. I think we're going that way. I'm not going to say that I see a lot of teams do that, but this space is moving so fast.
B
It's interesting because it reminds me of. Are you familiar with DSPy?
C
Yeah, of course.
B
So DSPy is a very researcher, developer-centric version of this, of like, I'm going to describe the intent of what this thing should do and go. It's extremely not productized or user friendly, and I think at least the last time I looked it was very optimized for single-inference types of tasks. When you're talking about the agent level and thinking about this, what does that kind of loop look like? Let's maybe walk through an example where you have a particular case. My agent is complex, it has different tools, it's got a sub-agent it can call, it can go down this, and at the very top level my eval is failing. How do I step through and figure out the right places to fix or debug this? Does that also involve evals or tooling? Like, how do you see people doing this?
C
Yeah. So the common workflow is you identify the failure and then you use the UI, like the trace view, which really shows you a very clear breakdown of what happened in every step, every tool call, every LLM call, every function call internally. Right. Typically what we see is people just go through that and then, okay, this is the step where it failed, and why. Right. So, for example, if the output from the RAG database or the vector DB was bad, then clearly you're not going to expect the right answer. Or the output was good, but the LLM still provided a wrong answer. So it really allows you to pinpoint where the issue is. And that's super powerful. Super powerful. But I think where we are as an industry today, I think we're going to look back in a couple, maybe, I don't know, in this space, maybe not a couple of years, maybe a couple of months, but we're going to look back and say, hey, it didn't make sense that we did that so manually. Right. If you go back to the early days of neural nets, people manually tried to put the weights on the neurons. This is before SGD and that kind of stuff. And now you look at it, I mean, obviously they had an intuition that the architecture could work, but the reality is, everything we're talking about is a search problem. You have a system with a bunch of variables and parameters and configurations. And if you have an eval test suite, you have a certain score of how well you're doing against it. And we are searching for, hopefully, the global maximum of this search problem. Right. So let's look at it as a search problem, as an optimization problem. So this is something we're spending a lot of time thinking about. And we shipped a bunch of stuff in the product. And I truly feel that, like, we're not completely there yet, but I truly feel that's how things are going to be in the future, where we're going to stop doing this manual inspection of a trace to try to figure out where it failed.
We'll have this flywheel of an eval suite that continues to grow over time and then a nightly optimization process that tries to find a new global maximum.
B
Let's talk about what that might mean. Right. Because I think a mental model, we're now moving mental models in a lot of ways. Right. We had been taking this mental model. Okay, this is a unit test. What do I do if a unit test is failing? I'm going to debug which is what's kind of a manual process or maybe a coding agent assisted process. But it's still like pinpoint to this thing, a thread going down based on a failure and trying to correct it. And what you're describing now is something quite different from that and much more algorithmic and broad facing. So let's flesh out what do you mean when you say hey, this is an optimization problem?
C
I completely agree. Right. LLMs, kind of because of their fuzzy nature, lend themselves more to this versus unit tests that are binary: did they pass or did they fail? So if we know what the variables are that we control, which in agents is typically the system prompt, tool descriptions, if you have a RAG step, the chunking strategy, all that kind of stuff, how many values you're returning from the vector DB, you get all these values, let's call them hyperparameters, and you have a score at the other side of it. With this tuple of values for these variables, I'm getting 80%, and I'm not getting into how you compute it, but we can get to that in a while. The most naive approach is, let's brute-force search all the possible values to try to get the best result. I mean, putting compute constraints aside, you would eventually find the best result.
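The naive brute-force approach can be sketched as an exhaustive grid search over a small configuration space. Everything below is an assumption made for illustration: the parameter names, the candidate values, and especially `evaluate`, which is a toy scoring function standing in for a full run of the eval suite.

```python
import itertools

# Hypothetical search space: each key is one variable the builder controls.
SEARCH_SPACE = {
    "system_prompt": [
        "Answer briefly.",
        "Answer briefly and cite the policy document.",
    ],
    "chunk_size": [256, 512],
    "top_k": [3, 5, 10],
}


def evaluate(config: dict) -> float:
    """Toy stand-in for running the eval suite; returns a score in [0, 1]."""
    score = 0.5
    if "cite" in config["system_prompt"]:
        score += 0.2
    if config["chunk_size"] == 512:
        score += 0.1
    score += {3: 0.0, 5: 0.1, 10: 0.05}[config["top_k"]]
    return round(score, 2)


def grid_search(space: dict, evaluate_fn):
    """Try every combination of values and keep the best-scoring configuration."""
    keys = list(space)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = grid_search(SEARCH_SPACE, evaluate)
```

This grid has only 12 combinations. The cost grows multiplicatively with every added variable, and a free-text system prompt has no finite list of values at all, which is why exhaustive search stops being practical almost immediately.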
B
The space of possible system prompts is very large.
C
Yeah, yeah, it's not actually feasible, but if you think about it as a problem, you would eventually find that. And then you say, okay, let's walk through what it typically looks like. And then you're like, okay, there's random search. Okay, let's randomly select. The user would define: here are the ranges that I think make sense. Let's randomly sample from those. And then you've got methods from ML like Bayesian search, which are a little bit more informed on how to search that space. But then with these LLMs as part of the search or the optimization algorithms, you can do things that are a lot more robust and a lot more compute efficient than searching through all the possibilities. Especially for things like system prompts, where, you know, even random search is going to be quite challenging. You know, there's obviously a lot of work on this topic. I think the leading lab, I would probably say, is Stanford. There's GEPA, which is a very well known algorithm in this space. We've built a couple of our own as well. But the idea is, okay, so I have a system, I have the configuration, that tuple, I ran through all the inputs from my evaluation suite. I got a result. Now I need to come up with new candidates for these variables. So let's talk about the system prompt. You can use an LLM in that process to look at the ones that failed. So you say, okay, here are the 10 samples that failed in my dataset. Let's look at the trace of what happened, where it failed, what the common failure modes are. And let's suggest a new system prompt candidate. It's a suggestion, it's a candidate that we think might fix it. You put this candidate in, you rerun the whole process. Am I doing better, am I doing worse? And so on and so on. Now these algorithms get more sophisticated. GEPA is doing this evolutionary approach, where it's, you know, suggesting multiple candidates and trying to merge the best ones.
We can get into it if you want, but the reality is if you have a good eval data set, they work extremely well. Like, it's really cool to see. And the best part, that this is my favorite part is, you know, in ML, especially LLMs, but generally ML, you need at least thousands of samples. With these algorithms, like even 20, 30 samples will take you a very, very long way. So it's extremely powerful. Happy to give you. We've ran some kind of public.
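The loop described above (score the current prompt, collect failures, ask a proposer for a new candidate, keep it only if the score improves) can be sketched as a greedy hill climb. This is a deliberately simplified single-candidate version, not GEPA and not Opik's optimizer. `optimize_prompt` and its parameters are hypothetical, and the `propose` step, which would be an LLM call in practice, is replaced by a deterministic stub so the example runs offline.

```python
def optimize_prompt(seed_prompt, dataset, run_agent, metric, propose, max_rounds=5):
    """Greedy prompt search: accept a proposed candidate only if it scores higher."""

    def score_prompt(prompt):
        failures = []
        for sample in dataset:
            output = run_agent(prompt, sample["input"])
            if not metric(output, sample["expected"]):
                failures.append({"sample": sample, "output": output})
        return 1 - len(failures) / len(dataset), failures

    best_prompt = seed_prompt
    best_score, failures = score_prompt(best_prompt)
    for _ in range(max_rounds):
        if not failures:
            break  # every sample passes; nothing left to fix
        candidate = propose(best_prompt, failures)
        cand_score, cand_failures = score_prompt(candidate)
        if cand_score > best_score:
            best_prompt, best_score, failures = candidate, cand_score, cand_failures
    return best_prompt, best_score


# --- Deterministic stubs standing in for the real agent and the LLM proposer ---

def stub_agent(prompt: str, user_input: str) -> str:
    """Pretend agent: follows the uppercase instruction only if the prompt contains it."""
    return user_input.upper() if "uppercase" in prompt else user_input


def stub_propose(prompt: str, failures: list) -> str:
    """Pretend proposer: a real one would show the failures to an LLM."""
    return prompt + " Always answer in uppercase."


dataset = [
    {"input": "hi", "expected": "HI"},
    {"input": "ok", "expected": "OK"},
]

best_prompt, best_score = optimize_prompt(
    seed_prompt="Echo the input.",
    dataset=dataset,
    run_agent=stub_agent,
    metric=lambda output, expected: output == expected,
    propose=stub_propose,
)
```

Evolutionary methods differ mainly in keeping a population of candidates and merging the strongest ones rather than tracking a single best prompt. Each round costs one full pass over the dataset plus one proposer call, so a small dataset keeps the loop cheap.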
B
I would love to, actually. Yeah, let's talk about some examples of that, because my brain immediately goes to: okay, we have an LLM in the loop. We're running LLM evals on all of this. Those token costs are going to rack up. So what's the order of magnitude we're talking about? What's the length of time this tends to take for a loop, or to converge, and all those different pieces?
C
Yeah. So I'll give you one, let's call it a relatively simple but real-life example. Right. Not a toy one. So LangChain, obviously a very popular library, has baked into the library a system prompt, or a prompt, that is used when you're using an LLM that doesn't have structured outputs, which is a lot of the smaller open source ones. And it basically asks the LLM, hey, return your response adhering to a specific JSON schema, because you need to parse it and do stuff afterwards. And we just saw from our user base how often it was failing, so we said, okay, let's try to look at this as an optimization problem. We created a simple dataset of JSON schemas, essentially, and then a certain output with random values that adheres to that schema, and then we ran it through the original system prompt they had in LangChain, and I think we got about a 12% pass rate, where we're just making sure that the schema matches. It's a simple deterministic metric. Then we ran it through the optimizer, and within two iterations we got it from 12% to 96%. It was that powerful. Now, you were talking about LLM cost. It was less than a dollar. It was a frontier model, but it was less than a dollar, very cheap. We opened the PR to the LangChain team, and it got committed the same day. They were like, thank you so much. This was painful. So it works very, very well. Yeah, we have some customers using it as well, also for softer evaluators. But yeah, it truly depends on how much you believe that metric on your evaluation dataset.
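A deterministic schema-adherence metric of the kind mentioned here might look like the sketch below. It is an assumption-laden simplification: it checks only required keys and top-level primitive types, which is a small subset of JSON Schema, and it is not the metric Comet actually used. The schema and the sample outputs are invented for illustration.

```python
import json

# Maps JSON Schema type names to Python types (only a subset is handled).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def matches_schema(raw_output: str, schema: dict) -> bool:
    """Return True if the raw model output parses as JSON and fits the schema subset."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # extra prose around the JSON counts as a failure
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        if expected is None:
            continue  # unknown or missing type: not checked in this sketch
        value = data[key]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            return False
        if not isinstance(value, expected):
            return False
    return True


def pass_rate(outputs: list[str], schema: dict) -> float:
    """Fraction of outputs that satisfy the schema check."""
    return sum(matches_schema(o, schema) for o in outputs) / len(outputs)


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

outputs = [
    '{"name": "Ada", "age": 36}',                          # valid
    'Sure! Here is the JSON: {"name": "Ada", "age": 36}',  # prose wrapper
    '{"name": "Ada", "age": "36"}',                        # wrong type
    '{"name": "Ada"}',                                     # missing required key
]

rate = pass_rate(outputs, schema)
```

Because the check is pure parsing and type comparison, scoring costs nothing in tokens; only the agent runs and the candidate proposals involve model calls. A production version would more likely rely on a full validator library than on this hand-rolled subset.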
B
Fascinating. Yeah. For structured data, your evaluation can be deterministic, you know, very regular. I think LLM as judge is where I start to wonder. It's like, okay, I have an LLM as judge, I have a subject matter expert that's inserted a bunch of things about how they would evaluate, et cetera. One, it's a little fuzzier. Two, it's probably a little more fragile. And three, it's expensive to run a lot of times.
C
Yeah, yeah. I don't disagree. Right. Like in no way I'm saying like this is like perfect. Right. But I'm comparing it to the status quo, which is like.
B
Oh, for sure.
C
Yeah. And I think that's a big reason why we're not seeing more of these agents out there. There's obviously a few that are extremely popular and common. But I would argue you'd expect to see more out there, right? It's because of what the status quo is, which is this more vibe checking, which is essentially: I'll test a few inputs, I'll see how it looks. It looks good, let's put it in production. Right. And one argument I always think about when people talk about cost is, you know, if you look at the cost per token curve over the last three years.
B
It does keep dropping tremendously.
C
Yeah, tremendously. Right. But yeah, it does, you know, incurs additional costs. There's no argument about that.
B
Let's maybe talk a little bit about what is the life cycle around these then. Because I think one off costs or one time costs are very different than recurring every day costs, things like that. So let's say you're going all in, you're doing an optimization, agent optimization. You have your eval suite. One, how frequently are you building, updating that suite? Two, what are the situations in which you're, you're running it? And three, how often do you do this, like optimization, pass over all of your different agents, et cetera.
C
So I think there's where we are today as an industry and where we're going to be in, you know, 6 to 12 months. What we see from our user and customer base is people don't run these optimizations every day or every week, or maybe not even every month. Right. They spend the time to build the evaluation suite and then they do it in dev maybe a couple of times to get to a version that's good enough. But that goes back to the challenge that their evaluation suite is not growing as much as you'd expect. Right. Which is the bottleneck for everything here. I think that's where we are today. I do think that if you fast forward 6, 12, 18 months, what it should look like is a lot more like what ML teams or pipelines used to look like. All of my ML customers retrain their models on a daily, weekly, or monthly basis. Obviously some of these algorithms are completely online, so they update as you use them. So I think it will look something like that. We have an agent in production, we are getting feedback, whether it's from user feedback or from LLM-as-a-judge evaluators that flag, you know, different edge cases, behaviors and so on. You either have or don't have a human in the loop to improve that data. It goes into the evaluation dataset and then, yeah, once a day, once a week, you re-optimize, retrain the agent. Because I completely get the cost side. But, you know, the fundamental equation with ML is more data equals better model. With these agents, you know, you ship this thing, and whether you have one user using it or a billion, it's stale. It would not get better unless you do something about it. So I think if we are able to close that loop where we can learn from production data, I think the sky's the limit. So I'm very excited. I mean, I feel like I'm very fortunate. To me it's one of the most exciting problems out there.
B
So let's dig in then. One, is anybody doing this today in terms of being able to learn, live? Like, I think of this as like in vivo learning. Right. Like it's happening in the flow of what's going on. Are any of your customers doing that? Like, how does that work? Or what are the barriers that are keeping people from doing it?
C
With this type of optimization, no. There's RL stuff that, obviously, people do on-policy and so on. But I don't think any of my customers are at a point where it's completely autonomous, you know, in the loop yet. I would also say, we just launched this open source optimization SDK about five or six months ago. So it's early days for this entire category. And sorry, remind me of the second part of your question.
B
Oh, what are the barriers? I mean, I can think of a few in terms, I mean, immediately I go to like, okay, what about privacy, right? Like, where is this eval suite living if it's using production data? Is there cross contamination? Like, how do you deal with that? But what are the barriers that you're seeing?
C
Yeah, there's a lot of them, right? So first of all, kind of what we talked about: when you add stuff to this evaluation suite and it's the ground truth, it needs to be right. You can't just put a bunch of noise or garbage in there. So is it human in the loop? Do you have a really good way to do that? That's one fundamental issue. The second one is the operations side of it, right? Where do you store these data sets? How do you split between different users, customers, deployments? All these operational problems, which are real problems. And then I think the algorithms still have a lot more room to improve. All of this work is super nascent, super exciting, but it's very early days. And then the other part is, from a deployment model, how do you test this stuff? Right? So for example, one way people do it is they bake the system prompt into their repos, in some configuration, a YAML file, or even in the source code, and they manage it with classic version control. You've got a new prompt you want to test, you redeploy the entire application with the new prompt. That's one way to do it. But if you think about it, we talked about testing with an evaluation suite, but a lot of teams A/B test these things in production, or do canary deployments. And the prompt is typically such a small component. Does it require a full-on redeployment? When you think about A/B testing platforms, you inject the framework into your code: if test this, do that, or display this button, feature flags, all that kind of stuff. So one of the approaches we're looking into is, hey, maybe when your application spins up, it fetches the prod prompt from our platform, for example, right?
And then when you have a new candidate that you want to test, because you ran an eval suite or you manually went and changed something, the next time it makes an inference, it will just get the new prompt. And that suddenly opens up this whole new world of, hey, I can correlate production performance to my eval test suite performance. In production, you can often look at the downstream actions. Did they actually book the flight, for example? So it opens up a lot of interesting use cases. Which is, I think, if you tell a lot of software engineers, hey, your application is dependent on a third party to fetch a critical configuration value...
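The idea of resolving the prompt at inference time and sending a slice of traffic to a candidate can be sketched as below. This is a rough illustration under stated assumptions: `PROMPTS` is a hypothetical in-memory stand-in for a remote prompt platform, and `pick_variant`, `fetch_prompt`, `record_outcome`, and `conversion_rate` are made-up names, not part of any real SDK.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory store standing in for a remote prompt platform.
PROMPTS = {
    "prod": "You are a travel assistant. Be concise.",
    "candidate": "You are a travel assistant. Confirm dates before booking.",
}


def pick_variant(session_id: str, candidate_share: float) -> str:
    """Deterministically route a share of sessions to the candidate prompt."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "prod"


def fetch_prompt(session_id: str, candidate_share: float = 0.1) -> tuple[str, str]:
    """Resolve the prompt at inference time instead of baking it into a deploy."""
    variant = pick_variant(session_id, candidate_share)
    return variant, PROMPTS[variant]


# Downstream outcomes per variant, e.g. "did the user actually book the flight?"
outcomes: dict[str, list[bool]] = defaultdict(list)


def record_outcome(variant: str, booked: bool) -> None:
    """Log the downstream action so it can be compared with eval-suite scores."""
    outcomes[variant].append(booked)


def conversion_rate(variant: str) -> float:
    """Share of recorded sessions for this variant that ended in a booking."""
    results = outcomes[variant]
    return sum(results) / len(results) if results else 0.0


variant, prompt = fetch_prompt("session-123")
print(variant, "->", prompt)
```

Hashing the session ID keeps a given session on the same variant across calls, which makes the per-variant conversion rates comparable. The trade-off raised in the conversation still applies: the application now depends on the prompt store being reachable at inference time.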
B
Well, it's a fascinating question, right? Because prompts in some ways... a mental model I sometimes use is that LLMs are like this big virtual machine, and you're throwing in a combination of code and data that is text. Your system prompt is code. Any user input is code. But it's also kind of data, right? It's also kind of like, hey, we're using this to shape it. Anything coming back from a tool call is some combination of this is data or this is code, this is instructions to this virtual machine to keep going. So how do you manage it? Do you manage it like code in a repository? I can make arguments for that. Version control is important, and this is shaping the core behavior of my system. Or do you manage it like data? Because it's also got a different lifecycle, and maybe you're optimizing it or changing it or tweaking it. It's a really interesting question.
C
Absolutely. And the coolest, the scariest part of this is that often when you confront these problems, there are some teams or companies that have been doing this for a few years. They're so far ahead that you can ask them or read their blog posts and, okay, see what they've seen. But here it's so new that even the bleeding edge teams are figuring stuff out as we go.
B
Totally. I was having this conversation with an engineer. He's like, isn't there an established solution for this? And I'm like, are you kidding? This whole field is a year old. Like, there's no established anything.
C
Absolutely. We've seen a little bit of it in the MLOps world. But again, there we had these early customers that were, you know, doing ML in production for 10 years. So we learned a lot from them. But it's funny, a lot of the solutions came from some weird constraints. I'll give you an example, right? When you think about how you manage model binaries: they have their versions, they're tied to a production version because the inference code needs to match. You've got all these moving pieces. And the biggest limiting factor was Git LFS, which wasn't good or fast enough, right? So model binaries never really made it into repos, even though I would argue they should probably live in a repo.
B
I mean, there's coupling. I mean, this is another thing, right? Like some prompts, you have code that is handling that, that is parsing it, that is not a part of the prompt. So there's coupling between your code and prompt as code.
C
Yeah. And most teams on that front use a model registry, which is like a wrapper around an object store, right? And it does exactly that coupling that you're referring to. But it also allows you to do these hot switches in production. So that's actually quite powerful in the ML world. A lot of people use that. You don't have to redeploy the entire application, unless, of course, the inference code had to change. But it gives a lot of flexibility, because often the teams building the models are not the teams deploying the software. So, yeah, it's really fun.
B
So are you seeing that metaphor working out in the LLM world where you have essentially like a registry for this prompt that has whatever associated code and you say like, okay, this prompt, probably you have an eval that explains like, what is the type of data that can come out of this so that my coupled registry is able to deal with it.
C
The first approach we took, because we weren't sure, is like, yeah, you could do it either way, right? You could take the prompt through CI/CD. You could do it in a lot of ways. You decide what you want to do. We weren't sure what the right approach was, but now, about a year and a half in and seeing a lot of teams doing it, that's definitely the approach we're taking. And it's not just a prompt. What we're building is essentially a configuration manager, which has all the prompts and all the other variables that impact the agent. And then a product manager, who is often the person that owns these agents, can go and change stuff. You need to be able to revert to an older version. You need to have some kind of process so you don't accidentally click a button and break production. You need to make sure that your APIs, as the configuration registry, meet a certain SLA, because other people are dependent on it. But then you can start doing all this cool stuff. You can start doing overnight optimization, you can start doing canary deployments of agent versions and so on. So I do think that's how things are going to shape out. And again, we have all flavors in the product, and it's clear that it's getting there. And I think it's predominantly driven by how involved product managers are, versus just a pure... You're smiling. That makes total sense. But yeah, so engineering builds it and kind of throws it to the PM team, and they need to manage this entire thing.
B
So I'm curious, right? And full disclosure, this is literally a problem that I'm tackling in. My day job right now is like how do you navigate non technical stakeholder, doing prompt optimizations, doing these things and the coupling between all the different pieces. So what you're doing is you've got this prompt configuration system. It has versioning and release gates and all this stuff, but it has within it a prompt, some amount of configuration. Are there dependency graphs like this prompt needs to run, it needs to have access to this type of data injected or these types of tools. Is there a, you know, an SLA for like this is outputting structured data that like, is there an SDK that is matching? Like how does all the pieces fit together there?
C
Yeah, yeah, we're still figuring out some stuff. But generally speaking, you have an SDK where you commit, write, or read prompts, and it supports the templating and all the stuff that you want to do there. So for example, you'd call something like Opik get prompt, with the prompt name and latest. And that will fetch it when that API or function call happens. It works when your agent architecture is somewhat stable. But it could have multiple prompts that are definitely dependent on each other, right? So it's not just this one prompt. We call it a blueprint. It's all together. It might have 10 prompts and a bunch of configuration, like which LLM you're using or all these different things that you want to control. And then, because you're fetching the configuration when someone calls invoke, you can change it in the UI very easily and get a good result, and you can respond to production incidents really quickly. So I think it's a very powerful approach.
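A blueprint, as described here, bundles several prompts and settings into one versioned unit that is fetched on each invoke. The sketch below illustrates that shape only. `Blueprint`, `BlueprintRegistry`, and `invoke` are hypothetical names for this example and are not Opik's actual SDK, and the registry is in-memory rather than a service with an SLA.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Blueprint:
    """All prompts plus the settings that shape one agent, versioned together."""
    prompts: dict[str, str]
    model: str
    temperature: float = 0.0


@dataclass
class BlueprintRegistry:
    """Hypothetical in-memory registry; a real one would sit behind an API."""
    _versions: list[Blueprint] = field(default_factory=list)
    _live: int = -1  # index of the version currently served to production

    def publish(self, blueprint: Blueprint) -> int:
        """Store a new version and make it live; returns its version number."""
        self._versions.append(blueprint)
        self._live = len(self._versions) - 1
        return self._live

    def revert(self, version: int) -> None:
        """Point production back at an older version without a redeploy."""
        if not 0 <= version < len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._live = version

    def live(self) -> Blueprint:
        """Return the blueprint currently served to production."""
        if self._live < 0:
            raise LookupError("no blueprint published yet")
        return self._versions[self._live]


def invoke(registry: BlueprintRegistry, user_input: str) -> str:
    """Fetch the live blueprint on every call, then build the model request."""
    bp = registry.live()
    return f"[{bp.model}] {bp.prompts['system']} | {user_input}"


registry = BlueprintRegistry()
v0 = registry.publish(Blueprint({"system": "Be concise."}, model="model-a"))
v1 = registry.publish(Blueprint({"system": "Be thorough."}, model="model-a"))
registry.revert(v0)  # respond to an incident by rolling back, no redeploy
print(invoke(registry, "hello"))
```

Because `invoke` reads the registry on every call, a change or rollback takes effect on the next request, which is what makes quick incident response possible.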
B
Yeah, that's super cool. Looking forward then what is like the edge that you're working on? What are the things that you see like coming, you know, in the next. As you said earlier, like times feel compressed. Right. I don't try to look out multiple years anymore, but what is coming in the next few weeks or months? Right.
C
And specifically on the Opik front, or just my prediction of... well, let's start with Opik.
B
But I think you have a window into this space, so I'd be curious to also look beyond Opik.
C
So I think on our front, we're spending a lot of effort on a bunch of the stuff we talked about. How do we help teams bootstrap eval regression suites very easily, so that a non-technical person can do it? I think that's the biggest blocker at the moment for everything. So if we figure that out, I think a lot of people will be much more successful with their agents. That's the first block. And then really this concept of blueprints, this concept that you can run an optimization, get a new blueprint or variation, and test it in production on a certain percentage of your traffic. All of that is deep in the pipeline and should be coming soon. It's all going to be open source, like everything we do. So that's that. And then I think once we start getting customers and users into this, what you call in vivo, right, in production, I think we'll probably learn a bunch more. I don't know how to predict what that would look like yet. So that's maybe a few more months in the future. Generally in the industry, I think first of all it's clear that these models are still getting better. I think for everyone who used Claude Code, somewhere in the December time frame something changed, right?
B
Opus 4.5, possibly.
C
Yeah, yeah, probably. There have been versions, but there's something more substantial in how good it is. And Opus is a huge part of it. But I think Claude Code as a harness is extremely impressive. They did such a good job there.
B
Well, and that is one of the things that's really interesting to me here: the models are incredible and they do a lot of things and they're continuing to improve, but there's still tremendous nuance in how to build an effective harness and put those pieces together. And what I'm hearing from you is that the eval needs to be capturing all of those pieces and looking at all of them. You can't look at a prompt and a model in isolation.
C
Yeah, I agree. And that's why the SWE-bench benchmark is not actually a good one, because that usually tests only the model. Right. So there's a lot there. And one of the things: everyone was talking about vector databases a year or two ago, and Claude Code, I think they tested embeddings at some point, but it greps most of the stuff, and it does it in a really smart way so it doesn't blow up the context, and it works so well. Right. So I think the other thing that I'm starting to see more and more is that a lot of teams, when they started hitting challenges with their agent, as engineers, we were like, okay, let's start building some structure around it. We have these user flows, and if you're in this flow, I'll inject this context, and if you're in that flow, I'll inject that context. But I think the trend is you actually want to give the models more freedom versus more constraints, which is interesting.
B
It's really interesting. And I think it's probably problem dependent. Right. Like I see examples where it's like. And it depends on how flexible is the problem you're attacking. Right. I have a well constrained problem where I know what good looks like. Great, Let me lock everything down, get a reproducible pipeline, a bunch of small, tightly scoped steps. Awesome. I have a general purpose problem. This person's coding who the heck knows in what the heck language. I can't lock that down. I need to be able to flow with all those different changes.
C
Yeah. And another thing, you know, this is something I'm asking myself a lot. Right. All these companies are building agents, right? And you go to this product and it pops up in the UI on the right side, chat with agent and all that kind of stuff. And if we're going to a world where everyone runs their Moltbot or whatever it's called, whatever the name is today, like hopefully one...
B
That is less of a self hacking pathway.
C
Yeah, yeah, yeah. I was like, I set it up And I was like, do you want to set up connection to your email? I was like, no, hell no.
B
Let me not only have the supply chain hack of NPM packages, but like this thing is actively going out. I mean we're talking about prompts as code. Let me go out and ask Randos on the Internet what code I should be running on your machine that you've given me all these permissions on.
C
Yeah, I think people who were privacy and security oriented sandboxed it pretty well. But I guess, to the point, it's like, what is going to be the interface? Is everyone just going to expose an MCP, and you're going to just use your agent to call it? And if so, how much of the harness do you control? You know? Right. It starts becoming a different interaction. Or, as a company, you will have your own agent, and people will type in your chat window. And, you know, I'm thinking a lot about what the role of UIs is. Everyone who's been in the software industry, we spent so much effort on UIs and UX and all of those things. But is the future one where every UI is generated on demand by your agent to just show you what you need to see right now? I don't know. Right. I don't have an answer to all of these. But exciting times to be building.
B
There is something there that might be worth digging into when we talk about agent optimization and how to do this. So when we talk about UIs, I think one of the really interesting things that LLMs allow you to do in a UI is you can build kind of more intent based interactions. Right. So traditional software is very imperative. I am going to click on this thing and drag it over here. I'm going to go this way, what have you. And even if you don't have a full on chatbot, which I think we over index on chatbots when we talk about LLMs, but like you have something that can interpret fuzzy direction whether it's voice driven or language or even like I think can still be some amount of gesture or other different things, but it's able to kind of make inferences based on incomplete information and do the imperative pieces for you and that just opens this tremendous opportunity in terms of streamlining people's experiences even if your core product has nothing to do with an LLM.
C
Yeah, I completely agree. I think the question is, okay, I have my application code and I write all these LLM workflows to try to determine that kind of stuff. But when you think about a chatbot session, it doesn't actually have to be exposed to the user. Text is great for some things, horrible for other things. I'm kind of thinking UIs will need to be in sync with your chat session. It's kind of like what you said. The UI keeps changing based on what the context of the session is. And for some things you're going to go and type it. For some things, if we're talking about data tables, I'd really rather have a couple of buttons to filter and sort and look at things. But they have to be in sync. And to your point, then you can start doing all these things that you can infer implicitly about what the user is trying to do. So much better than, oh, that was the user flow we thought about when we did this product meeting.
B
Now, you have an interesting eval question too, right? How do I eval that this UI that got generated on the fly. And I've, I have seen some fascinating experiments with like generating, you know, generative UIs. You can do really interesting things. Some of them are terrible, some of them are great. But like, what does that eval look like?
C
Yeah, I don't have all the answers. Right. The reality is, obviously it's a very different modality, but it'll be similar to what we're seeing today, where you're going to have some form where a human reviewed it and provided feedback, and then you're going to try to align or optimize your LLM as a judge to be as close as possible to the human evaluator. Right. And that's a continuous thing. It's not scalable to have humans review everything. LLM as a judge by itself introduces so much more fuzziness and noise that it's not helpful. So I think we'll have to be somewhere in between to make this stuff work. But yeah, indeed, it gets hard. This is a slightly different use case, but there are all these evaluation data sets for browser use models, which are different but not completely. Yes, the UI is not generated, but the state is determined based on the actions that the web use agent took. So I think they do it on the DOM level, to be honest.
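One way to measure how close an LLM judge is to the human reviewer is a chance-corrected agreement score over the samples both have labelled. The sketch below uses Cohen's kappa as one reasonable choice. The conversation does not name a specific metric, so treat this as an assumption, and the labels are made-up illustrative data.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of samples where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labelled independently at their own rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in human_counts.keys() | judge_counts.keys()
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: human review vs. LLM-as-a-judge on the same four outputs.
human_labels = ["good", "good", "bad", "bad"]
judge_labels = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(human_labels, judge_labels)
print(f"judge/human agreement (kappa): {kappa:.2f}")
```

A kappa near 1 suggests the judge can stand in for the human on similar samples, while a low value suggests the judge prompt needs more alignment work before its scores are trusted at scale.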
B
Well, and you could do something interesting with that. One of the models I've seen for doing this is essentially you have a UI layer that is controlled by React or Redux or something like that, and then you have a JS sandbox and you let your LLM just YOLO code into the sandbox. And all it can do is issue Redux actions, right? But that's enough to run your UI layer. So you've got it sandboxed, it's safe, it's not messing with stuff. But then you end up with React code, which you could test with any of the React testing libraries, or it does boil down to the DOM level and you can do DOM-level stuff. That's interesting.
C
Yeah, I haven't seen that. I can follow up later. I'd love to see that. Like, I've seen some attempts, there's like a library, I forgot the name but some attempts to sync sessions between like some, you know, agent session and a ui. There's a bunch of stuff of like in chat UI components that try to do that. I don't know what the right modality is, but clearly like we have tons of chatbots, tons of UI products and like at the moment it's just like a widget to the right side. Like I feel like it's going to change a lot.
B
So we're getting closer to the end of our time. So let's maybe close with this, right? You're seeing a lot of teams building agents grappling with these problems. Do you have like a set of advice or guidelines or like things you would say, like, hey, you're tackling agents. There is no established, you know, years old best practices. But here's what I'm seeing in the field. Like, you should be doing these things.
C
There's a few learnings I can share, right? The first one is that there are tens, if not hundreds, of agent frameworks out there. 80% of the people I see are not using any of them. You can spend months testing all these different frameworks, and they do add value. I'm not diminishing them. But the reality is some of the most successful agents out there are home-brewed, vanilla built. So I wouldn't spend too much effort on that. My next piece of advice is kind of annoying. It's like, you know, you go to the doctor and they tell you, oh, you have to work out more, or eat healthier, and all those kinds of things. You know the answer, but a lot of people struggle with doing it. But spend some time on building a very small evaluation data set, 20 samples, just 20. It will pay off big time, right? Big time. And then you can YOLO the whole thing, you know, vibe code your agent. You'll be so much more successful. The other thing, and we talked a little bit about it: I don't typically suggest worrying too much about costs early on, mostly because they tend to go down by roughly 90% year over year by design, and it's a little bit of premature optimization. Use the best model, use the frontier model, make it work first, and then you can figure out, can I make this work with a cheaper, smaller model? And then just generally, if you're online, if you're on Twitter, it seems like everyone figured this out, and Fortune 500 CEOs say, we have 10,000 employees that are actually agents. The reality is everyone in the industry is trying to figure this out, and it's hard for everyone, including OpenAI. They just put out a great post on their data agent, and you can see they're struggling with the same challenges as all of us. So don't feel that pressure that everyone figured it out and you didn't. So I would say that's it in a nutshell.
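The "just 20 samples" advice can start as something as small as the sketch below: a list of inputs, each paired with a cheap programmatic check, and a function that reports the pass rate. All names here (`SAMPLES`, `run_evals`, `stub_agent`) are hypothetical, the two samples are invented for illustration, and the stub stands in for a real agent call.

```python
from typing import Callable

# Each sample pairs an input with a cheap programmatic check on the output.
Sample = tuple[str, Callable[[str], bool]]

SAMPLES: list[Sample] = [
    ("What is your refund window?", lambda out: "30 days" in out),
    ("Cancel my order 1234", lambda out: "1234" in out and "cancel" in out.lower()),
    # Grow this toward roughly 20 cases drawn from real usage.
]


def run_evals(agent: Callable[[str], str], samples: list[Sample]) -> float:
    """Run every sample through the agent and return the pass rate."""
    passed = 0
    for user_input, check in samples:
        ok = check(agent(user_input))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {user_input!r}")
    return passed / len(samples) if samples else 0.0


def stub_agent(user_input: str) -> str:
    """Stand-in for a real agent call (assumption for this sketch)."""
    if "refund" in user_input.lower():
        return "Refunds are accepted within 30 days."
    return "I'm not sure."


pass_rate = run_evals(stub_agent, SAMPLES)
print(f"pass rate: {pass_rate:.0%}")
```

Running this before and after every prompt or model change turns "it seems better" into a number, which is the payoff the advice points at.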
Episode: Optimizing Agent Behavior in Production with Gideon Mendels
Date: February 17, 2026
Host: Kevin Ball (K. Ball)
Guest: Gideon Mendels (Co-founder and CEO of Comet)
This episode delves into the evolving landscape of deploying LLM (Large Language Model) agents in production, especially focusing on agent evaluation, optimization, and observability. Gideon Mendels, CEO of Comet, shares the lessons learned from years of building ML tooling, discusses the nuances of agent-based system development, and explores the emergence of new workflows, tools, and best practices for monitoring and continuously improving AI agents in production.
"As someone coming from a software engineer background... just seeing how the whole thing is kind of like a little bit like the Wild West, it was very, very challenging." — Gideon (03:12)
"Building these agents is somewhere in between software engineering and ML. It's definitely not pure ML, it's definitely not pure software engineering. But there's a lot of learnings from both of these paradigms..." — Gideon (10:13)
"A lot of people struggle with doing [evals]... But spend some time on building like a very small evaluation dataset, 20 samples, like just 20, it will pay off big time." — Gideon (50:35)
"The reality is, everything we're talking about is a search problem ... and if you have an eval test suite, you have a certain score of how well you're doing against it. And we are searching for hopefully the global maxima..." — Gideon (19:54)
"Most teams building agents are not using any framework ... some of the most successful ones out there are home-brewed, vanilla built. So I wouldn't spend too much effort on that." — Gideon (50:20)
"Is the future where every UI is on demand, generated by your agent to just show you what you need to see right now? I don't know... But exciting times to be building." — Gideon (45:13)
On ML vs. LLM agent development:
"You have all these variables ... and you're trying to find a combination that gives you the best results. So from that perspective, it's quite similar. In the day to day, it tends to look quite different." — Gideon (06:46)
On the status of agent deployment best practices:
"Are you kidding? This whole field is a year old. Like, there's no established anything." — K. Ball (35:25)
On the reality of production agent feedback loops:
"Whether you built evals or not, you put the stuff in production, and at some point someone comes complaining..." — Gideon (13:34)
On industry hype vs. reality:
"If you're online, if you're on Twitter, it seems like everyone figured this out... The reality is everyone in the industry is trying to figure this out and it's hard for everyone, including OpenAI." — Gideon (50:56)
| Timestamp | Segment Description |
|-----------|---------------------|
| 02:23 | Gideon's background, Comet origins, and why they built Opik |
| 06:22 | Differences between ML and agent/LLM workflows |
| 08:51 | Non-determinism: difficulties in testing/evaluating LLM systems |
| 11:10 | The power and pitfalls of evals; productized approaches to enable them |
| 15:34 | Types/levels of evals: system, integration, unit for agents |
| 19:06 | Debugging failing evals: moving towards search/optimization loops |
| 21:46 | Treating agent improvement as a search/optimization problem |
| 25:27 | Real-world prompt optimization case study (LangChain example) |
| 28:48 | Frequency of optimization & continuous improvement; production process |
| 31:42 | Barriers: privacy, eval set curation, operational/logistical issues |
| 34:05 | Prompt and config management: code vs. data; versioning |
| 40:42 | Roadmap for Opik and near-future developments |
| 45:40 | The coming shift to intent-driven UIs powered by LLM agents |
| 47:55 | Evaluating generative UIs, leveraging DOM-level and human feedback |
| 50:19 | Gideon’s practical advice for teams shipping agents |
Focus on value, evaluation datasets, and using the best-performing models first—optimize later for cost.
A basic suite of 20 high-quality eval samples will make agent development and debugging dramatically more robust and repeatable.
Leverage trace tools, configuration registries, and human-in-the-loop feedback to improve and audit your agents over time.
Build your systems in a way that new real-world data can flow into evals, which then drive agent optimization and redeployment—mirroring ML retraining pipelines.
Everyone—from startups to OpenAI—is learning as they go. Most "real" production agents are homegrown and under constant evolution.