
Evaluations are critical for assessing the quality, performance, and effectiveness of software during development. Common evaluation methods include code reviews and automated testing, and can help identify bugs, ensure compliance with requirements,
Loading summary
Shawn Falconer
Evaluations are critical for assessing the quality, performance and effectiveness of software during development. Common evaluation methods include code reviews and automated testing and can help identify bugs, ensure compliance with requirements, and measure software reliability. However, evaluating LLMs presents unique challenges due to their complexity, versatility, and potential for unpredictable behavior. Ankur Goyal is the CEO and founder of BrainTrust Data, which provides an end to end platform for AI application development and has a focus on making LLM development robust and iterative. Ankur previously founded Empira, which was acquired by Figma, and he later ran the AI team at figma. Ankur joins the show to talk about Braintrust and the unique challenges of developing evaluations in a non deterministic context. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
Ankur Goyal
Ankur, welcome to the show.
Thank you so much for having me.
Yeah, absolutely. Thanks for being here. So let's talk about Bearing Trust. How did you guys get started? What was the original inspiration?
Yeah, so prior to BrainTrust I used to lead the AI team at Figma and before that I started a company called Impera, which Figma acquired. And at Empira we were in like the stone ages of AI pre chatgpt and built a product that helped you extract data from documents. And basically every time we would change something, like whether it was changing our models or when we started using language models like the prompts or even the code that fed stuff into the models or process the stuff that came out, we would break something. Like for example, we had banking customers and we might improve our invoice processing model by like 2% or something but break a banking workflow in the process and obviously that can't happen. So we had to figure out how to avoid that. And we ended up building internal tooling that helped us do evals. And then Figma acquired us and we basically had the same set of problems with LLMs and built roughly the same tooling. So after doing that a couple times I was chatting with Elad Gill, who's one of our investors, and he was like, you know, hey, you've built the same thing twice, maybe other people have this problem too. So we talked to a bunch of companies including the folks at Notion, Zapier, Airtable, Instacart, and a bunch of other companies who are now customers and they were like, yeah, we do have this problem and we need a solution. So we partnered with a bunch of really great companies early in our journey and built a product and we've just been kind of going since.
Why do you think that no one had kind of put a product offering out there for this type of problem? Is it too niche and bespoke or is it the sort of the classic software engineering thing where people were building and rolling their own auth for a really long time before companies came along and offered that as a service?
Yeah, I think a lot of this stuff is timing. So had we not done this, I think someone else would have. Now I think we've kind of set the standard for how to do evals really well as a software engineering team and kind of built the primary workflow that people are using or in some cases copying into their product. But I think the reason that something like this didn't exist is that ChatGPT, GPT3 in particular, represented a fundamental engineering paradigm shift in AI. So prior to GPT3 being available over a REST API and accessible with simple natural language, it was really hard as a software engineer to actually use AI models. I've been trying for a long time. I'm not a stats PhD by background. I am kind of like a traditional old school software engineer. And I struggled through using ML models for a really long time and it just became dramatically easier when that happened. And so I think that paradigm shift was the first time that software engineers were actually able to use AI models in an effective way. And yet AI engineering and ML engineering is a totally different discipline than traditional software engineering. So it's kind of like an old or well known workflow for doing evals, but a new group of people that are trying to do it with different preferences and skills that they bring to the table. And so that vacuum is basically what created the opportunity for BrainTrust.
Yeah, so essentially you took this thing that was probably always a problem, but before was a little bit more narrow in terms of the number of people that were maybe facing that problem. And now because of, as you mentioned, you can essentially talk to a large language model or some other generative AI model through an API endpoint, suddenly the scale of that issue becomes much, much larger. So why is the problem of running something like an eval test more challenging when you're talking about interacting with the model versus sort of traditional ways that we might do this for software engineering?
Yeah, I think the biggest difference is that it's non deterministic. So in traditional testing you want your test to be 100% green and if something's flaky, then it's usually actually a bug. I was debugging something this morning where one of our Services would occasionally return the wrong result. That's a bug. On the other hand, in AI that is something that you deal with literally every day. And no AI model is going to be perfectly deterministic. I think that's kind of the interesting opportunity for engineering with AI. And so being able to visualize, interpret, characterize non deterministic results is a different paradigm. Many people, myself included, start by trying to kind of cram this into the traditional pytest or jest or VI test of the world and it gets very confusing very quickly. You start to have to do stuff like, okay, I'm going to try running this thing four times and if it succeeds three out of the four times, then maybe it's good enough. But why three? Why not two? Or why not run it ten times? And then of course, when you're actually looking at the results, you want to, for example, see all four things that you may have tried in one place and try to see what the variance is. You can't really do that on a terminal screen and there's no UI around something like VITest that makes that easy. So I think that's the biggest problem. The second thing is that you're not just testing based on the code. So unit tests are like a pure function of code or sometimes infra. You also need to test based on data. And being able to source good data to build good evals is challenging. But once you embrace the challenge, I think the art of finding the right data to actually evaluate on becomes the highest leverage tool that you have when you're building AI software. And so that is a completely new activity. And it's this kind of strange thing that requires reaching into your production logs, finding interesting examples and then utilizing them in your evals. A lot of the teams that we meet before using braintrust, they're doing this just in like JSON L files. So they'll log stuff to a postgres database or log it in their traditional observability system and then find stuff and then like click in a ui, download it and then copy paste it into a JSON L file and then try to use that in their evals. And that is, you know, obviously not the optimal workflow. And I think the third thing that's interesting is that compared to previous generations of software engineering, the role that non technical people have is quite different in AI. So non technical folks, for example support people, are often incredibly sharp at looking at poor user interactions with an LLM and characterizing why the interaction was bad. And that is an incredibly good way to get good eval data. And so you need to figure out ways to incorporate non technical folks into the workflow and utilize them effectively. And again, that's different than traditional unit testing.
And then in terms of doing this for sort of like, you know, what we considered old school predictive AI versus generative AI, are there differences in terms of how you need to do something like evals for those?
Yeah, I think there's a lot of differences. I was talking to someone at a ride sharing company earlier today and I think a good example is, let's say you work at Uber and you're trying to figure out what the optimal price is for a ride. That's like a traditional ML problem really. I think optimizing an aggregate is the right thing to do. So if one or two riders things are too cheap or something is too expensive, you know, Airbnb is another similar example where they use ML to figure out the price or to suggest prices for hosts to list their properties. With those kinds of problems, if the answer isn't quite right, then the user impact is somewhat low. And I think optimizing over the aggregate of the correct price is much more important than every single price. Whereas a lot of the problems that people are using for AI, for example, if you're working on a codegen related tool, if you spit out the wrong code, then it's just not a good user experience. It's not like you need to optimize some aggregate number there. And so I think the cost of being wrong is somewhat different. And accordingly, I think it's very important to look at individual examples and not just try to optimize in aggregate, which is a really popular sort of old school academic thing to do in AI. So that's one thing that's quite a bit different. I think the other thing that's quite different is that LLMs are very interactive. So in the old world of ML, when we were training models at Empira, if we notice that something was wrong, we would sort of collect a bunch of data and then retrain everything. And you could do that maybe if you're very, very sophisticated once a day. But realistically, many companies that are building traditional models will retrain their models like once a quarter. Whereas with LLMs, because you can do prompt engineering, you can change things really, really quickly. One of the most popular features of BrainTrust is that you can go into our logs and if there's an LLM interaction, you can hit tri prompt and open it directly in a modal and then actually play around with the prompt and even save it right there. And so that kind of very fluid engineering is quite different. I think it feels to me quite a bit like the difference between using Python and using something like C or even writing assembly code. You can just move a lot faster.
Yeah, there's essentially the sort of feedback loop is much more immediate than you have with sort of the predictive models where you're doing an update maybe once a quarter or something like that, like you can immediately make a change and see the change and the impact of that change, which is really powerful. But it's also because of the non deterministic nature, hard to always know if the change is actually better or you just sort of like the vibes feel better and you're in that direction.
Yeah, no, I mean I think it feels exactly like building software. Like, I mean I'm old enough to remember when building web servers and writing web apps was, you know, like pre php, it was just really, really slow and really hard. And at that point it was inconceivable to me that you'd be able to save a file in your IDE and without even refreshing the page, the browser would reflect the change, which is now something that everyone experiences. And honestly we all take for granted. So I think the sort of power of software is really how quickly you can iterate things and LLMs have really unlocked that for AI.
Yeah, that's absolutely true. And I think that's true even outside of the technology itself. When it comes to like what companies succeed and what companies fail, especially in competitive markets, is like how fast can the company learn and make adjustments because inevitably the things that you do are probably going to be wrong in some fashion. But can you learn from that iterate really quickly? That's why it's engineering organizations. You want things like CICD set up and be able to do multiple releases a day because that execution cycle allows you to essentially out compete people who don't have those types of things in place.
Yeah, I mean, I'll give you an example. Simon, who's one of the founders of Notion and one of our early adopters at BrainTrust, said that prior to BrainTrust they were able to solve on the order of three issues per day. And now with BrainTrust they're able to solve more than 30 issues per day. And so you're exactly right. I think with the sort of analog to CI, cd, observability, et cetera in AI, you're just able to move a lot faster.
So can you walk me through the process of actually Creating an eval. I want to use BrainTrust. I have some AI application that on building and now I want to integrate this. Like, how do I go about doing that?
Yeah, I think one of the things that we did that really was very popular in the early days and I think has kind of now become a standard among products and tools in the space is we broke an eval down into just three simple parts. So one part is the data. And data is, you know, it's just a list of inputs and then optionally expected ground truth values. You don't always have them, but sometimes you do. So you have to figure out how to get that. Sometimes if you're just starting, you might just hard code it in a TypeScript file or Python file. Maybe you have it in a SQL database. BrainTrust has datasets you can use, but somehow or another you provide some data and then you provide a task function. A task function is very simple. It takes some input and then it generates some output. And a simple task function could just be a single prompt that you plug the input into the prompt and then you generate the output and you save it. It could be agent, it could be multiple agents. Now it could be something that runs across services. So that can get increasingly complex, but it's just a pointer into your application and then the last thing that you provide is scoring functions. And scoring functions take the generated output and then the expected value, if one exists, the input, maybe some additional metadata. And their job is very simple. They produce a number between 0 and 1. We have an open source library called Auto Evals, which has a bunch of scoring functions built into it. Some of them are heuristics like Levenshtein distance, which is kind of a good old trick that still works. Some of them are very fancy LLM based scorers. They're all open source, so you can actually look at the prompts and tweak them yourself. But the scoring functions, you kind of itemize them into these little functions whose job it is just to assess your output on some criteria. And that's it. You just plug those three components into an eval and then you can run BrainTrust Eval on your code in Python, Typescript, or now a bunch of other languages. And that's it, you've run an eval.
How did you kind of like, you know, come to this design? Like, how did you know that this was going to work?
Well, I mean, I've been doing evals as a software engineer struggling through ML for 7 years now. And so this is not the first attempt at trying to simplify this, I think at empyra, I remember the first time I tried doing it, I was working with our researchers who are unbelievably smart and they had Python notebooks and matplotlib and for loops and matrices and I barely understand numpy. So it was just really, really complicated. And I think through literally years of iteration, I sort of realized that it's just these three pieces. And I think I spent when we started working on BrainTrust, probably like two or three months thinking about how to really boil down evals into something that was very easy for people to use. Brian, who's the CTO of Zapier, was very helpful because he was also new to AI and a very, very sharp software engineer. So I would sort of send him a draft and say like, hey, is this like sufficiently easy for you to digest? And he would just say no. And so working with him and a few others, I think we kind of arrived on this design. And I remember when I sort of first wrote an eval this way, it just felt right.
Yeah, so it sounds like you had some really good sort of like early stage design partners that helped you validate what you were planning before you actually went about implementing it and rolling out.
Yeah, our core bet was that there are some early adopters and Elad and I actually wrote down a list of these companies before we really started the company. But there's like a list of early adopters that were building software in a way that represented how others would build software with LLMs in the future. And I think one of the most interesting characteristics of these companies is that most of them did not have ML teams prior to ChatGPT coming out. So they were kind of starting from a fresh slate and thinking from first principles about how to build with AI. And we basically reached out to all these companies and I think now almost all or all of the companies on that list are customers. But we sort of bet on what Zapier notion, Airtable, Instacart, you know, browser company, ramp companies like this, what they would be doing and how they would be building AI software. We sort of assumed that others would build it that way as well. And I think that's largely turned out to be true.
Yeah, I mean that's a really, I think like fantastic approach and like insight that you had early to sort of. You ended up identifying what is probably like your icp, your ideal customer profile. And then it's kind of like this is like your initial like account based selling strategy. Here's our account list. I mean, let's go now we close those. Let's go look for these other ones that are kind of similar.
Yeah, I think it's. Yeah. You know, what Notion was doing six months ago, I think a lot of companies are trying to do now, and so it's definitely worked really well for us.
That's awesome.
Shawn Falconer
This episode of Software Engineering Daily is brought to you by Capital One. How does Capital One stack? It starts with applied research and leveraging data to build AI models. Their engineering teams use the power of the cloud and platform standardization and automation to embed AI solutions throughout the business. Real Time Data at Scale enables these proprietary AI solutions to help Capital One improve the financial lives of its customers. That's technology at Capital One. Learn more about how Capital One's modern tech stack data ecosystem and application of AI ML are central to the business by visiting capitalone.comtech and then in terms.
Ankur Goyal
Of the eval itself, like, what am I doing as a company or an engineering organization wants to incorporate this? Is this a library that I'm running, you know, locally? Like, where's this kind of run and situation?
Yeah, I think it's a lot like unit testing in the sense that it usually starts by you just running it on your local environment. And when you do that, you basically generate a bunch of logs which get uploaded to Braintrust automatically, and then you go to our UI to visualize it. And we've tried to make the DX really, really fast. So the sort of time from when you run an eval to when you see it in the UI is just a handful of seconds and it feels very real time. And that's usually how it starts. And then the next thing that you would do is like send your friend or your colleague a link to the eval and say, hey, look at this. What do you think? Or I discovered this. You know, what should we do? And then maybe your colleague starts running them as well. And soon you kind of realize, hey, it would be really great if we could actually run these on all of our PRs. And we built a GitHub Actions integration that's been really popular. That just makes it very easy to kind of translate what you probably have already built locally into running as part of your PR workflow. We do a bunch of stuff behind the scenes, like if you don't change any of the prompts, then we'll automatically cache everything. And if everything is automatically cached, we don't pollute your PR with a bunch of repetitive information about the thing that's changed. You know this very well, but there's a lot of those little things that you have to do to make the workflow feel right. But that's usually how it goes. And then once it's in your PR workflow, then you start actually baselining against main. And every time you run an eval, it gets automatically compared to the latest deployed version and you can get increasingly more sophisticated from there.
And then in terms of any increased inference costs that I might be incurring with this, where I'm going to be running these PRs through evals that might also one, they're probably, you know, interacting with some AI component of my actual product which is going to have a cost associated with it. And then it also might, in the sort of scoring function be using some sort of LLM based scoring. Like how do companies think about those types of things?
Well, first of all, it's all cached. So both for speed and cost reasons, I think it's quite useful in evals not to just unnecessarily rerun evals. And by the way, there's another cost that you didn't even mention yet, which is doing online evals. So we also make it really easy for you to run eval functions on your logs at like a sampling rate that you specify. And that also adds cost. But I think the other thing is that, you know, I think the cost of doing evals is really low relative to the cost of running a production workload, and yet the value is disproportionately high because evals, you know, for every eval that you run, you're basically kind of ensuring that end users have a really positive experience with your product. And so in the spirit of iteration and getting to the best possible product and product market, fit with your AI company or your AI feature as soon as possible. I think evals end up feeling like a really low cost way to get there compared to users suffering, for example, with your low quality app. So to be honest with you, I think asides from some competitors or companies trying to create noise about how evals are expensive, it hasn't come up as a practical concern for any of our customers.
Yeah, I guess you're really weighing in against like, what is the cost of a horrible user experience.
Right. And it's a fraction of the cost of actually running your application. Right.
So, and how does this like start to work when things start to get more complicated? Like if I have some sort of like agentic workflow where there's going to be multiple, like sort of Planning, evaluation cycles. Maybe I even have like an agent workflow where multiple agents are communicating and passing information. Like, am I breaking these down sort of piece by piece and running evals a little bit more task specific, or am I doing some things that's a little bit more aggregate?
Yeah, no, it's a great question. And I think at a certain point, how you engineer your evals becomes a core part of how you actually build the agent or, you know, the more complex system itself. And I think what I'll say really quickly there is the best systems are often the systems that can be evaluated really well. And so you sometimes pick the abstraction boundaries in the software that actually allow you to do evals. But yeah, I think the optimal way to build increasingly complex AI systems is to do evals end to end as well as for the components. And the more you evaluate individual components, like for example, a planner module, the more reusable and modular it is. So you can use it as its own standalone thing and evaluate it as its own standalone thing and then somewhat reliably and comfortably plug it into larger systems and kind of know that it will do its task really well.
And as a business, how do you turn something like essentially evals and testing into like an actual business? Clearly there's pain and people want to solve it. But how does this become something that you can actually like, monetize and scale?
Yeah, I mean, one of the things I'll share just as like a fun anecdote to other entrepreneurs who are potentially starting companies. When we started braintrust, we weren't the only company that was thinking about building LLM tooling, but I think we were the only company that really focused on evals. And the reason is that a lot of VCs will give you advice about problem spaces, and multiple VCs told us, you know, CICD was not the most lucrative set of venture outcomes for them in the previous generation of software. And so it's a very bad place to start a company. And instead you should focus on things like observability. And I think we just knew better than to really listen to VCs that much and from personal experience knew that the pain was really around evals. I think now the mindset around that has shifted quite a bit. And in retrospect, I think we were right to focus on evals. But with that context in mind, I think monetizing BrainTrust has not been that big of a challenge for us because evals represent such a critical part of the development workflow and represent so much pain. That it's a problem that people feel very motivated to solve. The other thing about evals, and I mentioned this kind of earlier on, is that the data component, like how you actually source data to do evals, turns out to be really important and therefore people actually want to log their production workloads in braintrust as well. And so at this point we've built a really, really powerful and seamless integration across logging and evaluation. And as soon as people start logging stuff in BrainTrust for the purpose of finding data to do evals, they start asking us for other stuff as well. Like, hey, you know, I'm logging stuff here. Can you tell me how much I'm spending over time? Or can you tell me how much I'm spending per project? Or can you help me understand when my app is really slow? Or can you help me understand when users are liking or not liking the experience that they have with the product? And so that's kind of naturally expanded what we do to be more than sort of just evals. But yeah, I mean, I think people are willing to pay for tools that help them move quickly and therefore I think it hasn't been a really big blocker for us to actually monetize the product.
How do you go about like writing evals for really like open ended tasks? Like if really the application is, you know, customer support or something that's like chat based, like how do I go about like writing evals there where I don't really know necessarily what the inputs are going to be?
So first of all, I would say some evals are better than no evals and a lot of people, myself included, often have analysis paralysis before you actually start writing evals. And the best thing to do is just to start writing them and then kind of iterate and improve them as you go. In terms of something like customer support and chat specifically, I think the hardest problem is probably finding good data. There's a lot of like little things. For example, when you evaluate a chat interaction, the best thing to do if you have a multi turn interaction is to evaluate like individual steps of the multi turn interaction. So there's a lot of those little tricks and if anyone wants to dig into that more, we have some docs where I'm happy to chat with anyone about that. But I think the hardest thing is just finding good data. And there's really two things you can do. One is just to log stuff in the right format. We have obviously tools that make that easy. But the other thing is to actually collect useful signals that help you find the signal like the useful interactions among the noise. So one thing you can do is capture end user feedback like thumbs up and thumbs down stuff. What we found from working with customers is that thumbs up rarely means that something is definitely good and thumbs down rarely means that something is definitely bad. But it is still a useful filter to look at the things that people actually took the time to rate and comment on as potentially good data. The other thing you can do is you can use online scoring so you can actually have an LLM, review particular interactions and say like this one was like uncharacteristically long or rambly, or the user seemed like they got confused or something like that, and actually use those signals to help you narrow down the data. And then once you find good data to actually do these kind of like open ended evals, I think the problem becomes much easier. Like most engineers, you feel free to correct me if you think otherwise, but I would probably posit that if I said like you're working on a support bot, I could give you 100 really, really representative interactions of what your support bot will actually look at and you have to manually sort of look at the output that's generated and try to improve your app. But I guarantee you that these hundred interactions are pretty representative of what a user would actually see. I think you'd actually still find that pretty compelling because not having to look at like an abyss of production logs or just wait to find stuff out, it's way better to actually just look at a bunch of stuff and so, or, sorry, like a constrained set of things and improve them. And so I actually just think even if you just get to that point, it's pretty powerful.
Even outside of doing evals for like AI models. Like do you think AI is going to change significantly the way that we craft and think about regular software engineering tests?
Oh, for sure. I think English is the new language. There's a lot of debate about whether English is the assembly language of LLMs and people will be writing traditional code, or if English is the new language. And I think if you look at two parallel trends, I would say like Cursor, for example, represents one trend which is using AI to build traditional software really effectively. And then maybe braintrust represents the other parallel trend which is bringing good software engineering practices to this new sort of wild west area of AI development. And you look at the commonality of both of these trends. The big common theme is that if you express what you're trying to do in English, you're able to get a lot more Done. And so I definitely think that the most effective software engineers of tomorrow are going to be writing a higher fraction of English than they do today. I'm not so sure it's going to be 100% English, or if it's going to be a combination of more English or language of your choice and programming and kind of orchestrating a bunch of work that's happening at once. But I definitely think that the traditional world of software engineering, it's going to change in many of the same ways.
And in terms of AI development and productionization, what are some of the other areas that teams really struggle with today that doesn't have good tooling, that essentially is an opportunity for other people? There needs to be better tooling for us to do this. Basically, the current state of things is not ideal and we need to fix it.
Yeah, I think one of the areas that I'm quite fascinated by is automatic optimization. And automatic optimization is like, in the old world, you could say that's training a model or fine tuning a model. But I think in the new world, automatic optimization is the problem of taking observations or data points and instead of changing a prompt yourself, automatically updating your AI system to perform better. I think that's a really interesting area for a variety of reasons. The first is that once you achieve a significant kind of level of scale, it becomes feasible. Like you collect enough data to actually represent what's happening in the real world. A lot of people jump to fine tuning too quickly. And I think the problem is that the English general description that you have in your head ends up being closer to what users are actually trying to do than the like, 50 data points that you collect about what users are trying to do. At some point that flips. Like if you look at cursor, for example, I think they're like at a level of scale where that has flipped and there's enough data to actually represent what people are trying to do. Another thing is that automatic optimization will get you, you know, if you have good data, significantly better performance than just sort of like manually trying to tweak things yourself. That's kind of been the story of AI and ML in general. And yet the tooling to be able to do that is still very, very early. Like, I think many tools that are focused on automatic optimization, they focus on the fine tuning part of it, or actual orchestration of GPUs or whatever to improve the model or the system. But I actually think the hardest problem is that problem of creating the right data flywheel and then assessing whether the automatically optimized thing is actually better than the previous iteration and if not finding the right data to then improve it again. I think that's a big area and if anyone's working on that and wants to collaborate with us, we'd love to chat. I think another area that's really under explored and is going to be very interesting is the security sort of implications of using AI to give you a small data point. Many observability challenges in AI stem from the fact that you need to store the prompts themselves to actually be able to measure the stuff that you want to measure and collect good data to do evals and so on. But that information often contains pii, which, you know, if you're storing like traces and spans in datadog or Grafana, you usually don't need PII to actually look at like the performance metrics that you, that you want to look at. And so doing that effectively I think is hard and it's a new muscle for a lot of companies and I think it's going to take some time for really good best practices to develop around it.
How do you guys deal with that today, given that you have this, you know, logging and monitoring part of brain trust?
Early on we knew that data security just kind of from like a systems standpoint was going to be probably one of the most important things about braintrust. It's another thing, by the way, that VCs told us not to do that. We sort of ignored them and did. And I think again, not everything we do is right, but this was another thing that was right. We've supported running BrainTrust in your own cloud from day one. And we actually built this really powerful hybrid architecture which is embedded at every layer of our product. It lets you basically only run the data plane in your own cloud. But we run all the annoying bits like the UI metadata auth, all that stuff that requires setting up a bunch of DNS names and connecting a bunch of things together, but doesn't actually store the data. And so what that means is you can store all the data in your own environment. Our servers never need to or can access it. Yet you're able to use the latest and greatest version of our software, just like a SAS tool and your browser connects directly to that data. So that architecture has allowed us to help kind of solve the very base layer of that problem, which is that customers don't need to surrender their most sensitive data to us to be able to use the product. But I think there's a lot of really interesting tooling that we're going to build over the next couple quarters that actually allows companies to implement the best practices within the company itself. And so like yes, now we have all the data in our own servers, but we actually want to have a really good workflow around letting Sean run evals, but not necessarily see the data that he is running the evals on or maybe not see exactly that data. So I can't say everything we're doing there yet, but we're going to do some pretty exciting stuff.
I want to talk a little bit more about that, but just one quick thing back to on the optimization thing where you were talking about how essentially as a company scales, maybe the problem gets flipped like cursor essentially has enough data to understand in general what users actually want to accomplish. Is the problem before that essentially just a matter of when you jump into fine tuning too early is your data set is not representative. So you're going to end up with an overfitting scenario for sure.
I think fine tuning, it's kind of like popping the hood of the model before you necessarily have all the data or resources to know how to do it. One of the most common things that I encounter when I talk to customers, almost none of our customers, by the way, use fine tuning in production right now.
I think very few people are.
Yeah, yeah. But one of the most common things that people do is they collect like 50 examples and then they fine tune a model and then they run the model on their 50 examples that they fine tuned it on and they say this thing is great and then they deploy it in production and then it doesn't work very well. And I think that has less to do with the fine tuning process itself and more that they didn't have necessarily the guardrails in place to be able to actually tackle that problem effectively. The reality is that a prompt is actually a very powerful mechanism for using general instructions, reasoning and language to represent a problem. And so I think the point at which you have enough data to fine tune is one where you actually have enough data to like approximate all of the information you're trying to cram into a prompt. But in almost all cases fine tuning doesn't work very well. And actually I'll even provide a more extreme example. There's a wide variety of tools that are pretty cool, like DSPY for example, that don't just support fine tuning, but support stuff like automatic few shot optimization. And people have exactly the same issue, so they'll automatically optimize and few shot a prompt and then observe it on their 50 examples and say this is great and then deploy it in production and it doesn't work very well. And the best teams today are actually manually curating the set of few shot examples that they put in their prompts because they have to kind of use their human general logical understanding of the problem and sort of try to represent that in the few shots themselves. So even that much safer mechanism, you know, is very, very prone to the same problems. So, you know, on the flip side, I think it's a really big opportunity to actually help people do it.
Well. Yeah, absolutely. Back to like the original issue that we're talking about where they're, you know, since we fine tuning on Data set of 50 and then running the test against the same 50, like you should at least be doing some simple twofold testing from like traditional ML there, where you're not testing against the same data set that you use for training.
Yeah, but even doing that is hard.
You still have to do some manual essentially evaluation of the.
Yeah, I mean, creating good train test splits is a really, really hard problem.
Yeah. And then on the, essentially the setup where you're running the data plane inside the customer's cloud and then you're running essentially the equivalent of the control plane inside your cloud. Is the control plane like a multi tenant setup?
Yeah, the control plane is multi tenant. And the invariant that we maintain, which I think was kind of. I didn't realize this at the time, but I've learned is fairly unique, is that our control plane never does or needs to access your data plane. And so that means that you can run BrainTrust, you know, in your own VPN, for example, or you could run the data plane on your laptop. You can run it in a variety of very, very constrained ways. Because the only thing that needs to happen is the data plane needs to do some auth checks against the control plane so it reaches out to do those checks. And your browser, your SDK code needs to be able to talk to the data plane, but nothing else.
Since I'm running that in my cloud, am I covering essentially the cloud cost for that?
You are, yeah.
Okay, how does that deployment work?
So we have two mechanisms today. We have a very kind of crafted experience for doing it in aws. And when you do that, we spin up kind of like the sort of right spec of various databases and services and we deploy a bunch of stuff on lambda functions, which is a whole nother rabbit hole. But it works out to be pretty effective for this use case because of how bursty it is. And we Also have a docker based option, and we've actually boiled the docker option down so that you only need to run a single Brain Trust container. Of course, you can run multiple of those containers and they're stateless and scalable, but you actually only need to run one. And then you can hook us into kind of like the managed versions of a few different common database services, like Postgres, for example, within your environment. And so it's just very, very easy to set it up.
So storage is going to be in the data plane. Like, essentially is the, the actual, like, database that's running behind the scenes? Is that all abstracted away?
It is, yeah.
Just kind of like bigger picture around some of the challenges that people are sort of highlighting in generative AI right now. Like, I think one of the things that's sort of a topic of conversation is around how basically we run out of the new public information to train models on. Models have been scaling up, but essentially performance hasn't sort of scaled with the amount of inputs. What are your kind of thoughts on that? Have we reached the limits of what we can do with sort of the transformer style models?
Yeah, I mean, I think those discussions are very cool. However, I personally don't care at all. And the reason is that if you froze what we have right now, we have at least 10 years of engineering ahead of us to make use of what we have. I'm so excited about what's coming. Like, please don't interpret it as any less excitement about that, but I think there's just so much we can do with what we have right now that anything else is gravy. And there's very smart people working on it. So I'm sure. And there's a lot of capital, right? So I'm sure they'll figure something out. But who cares? I mean, like, there's even this like Riverside tool that we're using right now for doing this conversation. I can think of so many different ways that AI could make the overall podcast experience like much, much, much better. And you know, I think there's so many things you could do with what we have.
Yeah, I agree. I mean, I think that people get a little bit too sort of fixated on some of the rough edges that exist with the technology today around. Like, oh, like, you know, sometimes there's hallucinations, sometimes, you know, like I, I get a generic response or whatever. But like, compared to what you could do in this space, like, you know, a year ago or two years ago, it's like it's pretty insane.
That's an age old. I mean, when I was working on AI pre LLM stuff, that's just always what people are concerned about. It's Nothing specific to LLMs or the time that we live in. It's actually just the fact that as humans we struggle to come to terms with non determinism and non determinism is an inherent characteristic of AI. So this thing is never, it's never going to change. It's just, you know, people are always going to be, they're always going to look ahead to like, how do I, how does the model make this thing better? All the really smart, successful, good AI builders and product folks, they sort of flip the switch and really embrace the fact that AI is like this and they just engineer, engineer, engineer to work around it. And as things get better, it's just gravy for them. Right. Like they've already built a system that doesn't necessarily need things to get better, but as things do get better, it just unlocks things that maybe were difficult before.
What advice would you have for somebody who's interested in building on generative AI technology or getting into AI? And they need to become comfortable with the non deterministic nature.
Yeah. So I think the first thing that's really important is to pick a very, very specific problem that you can solve and attach yourself to the problem rather than AI. So a lot of people are like, now that there's AI, I can do X or AI can do X but it can't do Y and therefore I can do Z. And I think those people never really find their way. I think the best teams are the ones that say something like, okay, wow, O1 just came out and O1 is incredibly good at reasoning. Maybe now I finally have a way of helping doctors have a real shot at differential diagnosis. I'm just making this up, but like a real shot at differential diagnosis. Let me go and see if I can work on the problem of building it. What would a UI look like for a doctor to actually do differential diagnosis together with an AI model? I don't know. Right. And I think focusing on a specific problem like that is very important because it sort of motivates you to talk to users who have the problem and then actually understand what characteristics of the problem are challenging and what you maybe need to engineer around the limitations or, you know, sort of characteristics of AI today. The second thing is obviously to run evals. I think the very good folks that build AI software, they flip from like only doing vibe checks to double checking their work with evals to using evals as a way to actually motivate what they're able to build and see what products they can ship. I'm biased, of course, but I recommend, you know, really, really focusing on. On evals. And I think the third thing I would say, maybe my hot take is like, don't waste your time learning Python or getting involved in the Python ecosystem. I think there's a lot of kind of garbage software and tooling that exists in Python because it's the language of AI and ML in particular. But all the really great software that we use ends up being implemented in the typescript world and is really built by people that are very, very passionate about product engineering. And I think the same is true with AI. Vercel is a really great example of a company that's both building great tools internally and helping to improve the overall ecosystem around this. We've been working with them since the very early days of braintrust, and they've used us to, for example, run evals on V0, and I think we've helped them ship a number of features as a result. And I think that mindset around building great UI and great products is the sort of prevailing factor for what ends up being good AI software. So I would just go deep into the AI TypeScript ecosystem and, you know, there's a lot of fun stuff to work with in there.
Awesome. Well, Ankur, thanks for being here.
Awesome. Yeah, thanks so much for having me. It was a fun discussion.
Absolutely. Cheers.
Podcast Summary: Software Engineering Daily - The Challenge of AI Model Evaluations with Ankur Goyal
Release Date: June 10, 2025
In this episode of Software Engineering Daily, host Shawn Falconer welcomes Ankur Goyal, CEO and founder of BrainTrust Data, to discuss the intricate challenges associated with evaluating AI models, particularly Large Language Models (LLMs). The conversation delves into the complexities of non-deterministic behavior in AI, the design and deployment of evaluation tools, and the evolving landscape of software engineering in the age of AI.
Ankur Goyal brings a wealth of experience to the table. Before founding BrainTrust Data, he led the AI team at Figma and previously established a company called Empira, which was later acquired by Figma. At Empira, Ankur grappled with the early stages of AI development, especially before the advent of models like ChatGPT. This background laid the foundation for BrainTrust Data, a platform focused on making LLM development both robust and iterative.
Quote:
"Prior to BrainTrust I used to lead the AI team at Figma and before that I started a company called Impera, which Figma acquired."
— Ankur Goyal [01:25]
Evaluating AI models, especially LLMs, introduces unique challenges not typically encountered in traditional software engineering. These models are inherently complex, versatile, and exhibit non-deterministic behavior, making standard evaluation methods like code reviews and automated testing insufficient.
Key Challenges Discussed:
Quote:
"The biggest difference is that it's non deterministic... you start to have to do stuff like, okay, I'm going to try running this thing four times and if it succeeds three out of the four times, then maybe it's good enough."
— Ankur Goyal [05:10]
BrainTrust Data addresses these challenges by providing a structured platform for AI evaluations, breaking down the eval process into three core components:
This modular approach ensures flexibility and ease of use, allowing engineers to integrate evaluations seamlessly into their workflows.
Quote:
"We broke an eval down into just three simple parts... data, task function, and scoring functions."
— Ankur Goyal [14:24]
BrainTrust is designed to integrate smoothly into existing development workflows. Initially run locally to generate and visualize logs, evaluations can scale to incorporate team-wide processes through integrations like GitHub Actions. BrainTrust employs caching mechanisms to minimize redundant operations, thereby controlling both speed and cost.
Key Points:
Quote:
"With BrainTrust, they're able to solve more than 30 issues per day... with the analog to CI/CD, observability, et cetera in AI, you're just able to move a lot faster."
— Ankur Goyal [11:49]
Despite initial skepticism from venture capitalists (VCs) regarding the profitability of CI/CD-like tools, BrainTrust focused on the critical pain point of AI evaluations. This strategic focus paid off as companies recognized the value of robust evaluation mechanisms in enhancing user experiences and product reliability. Additionally, the platform's capabilities expanded to include logging and monitoring, further increasing its value proposition.
Quote:
"Our core bet was that there are some early adopters... almost all the companies on that list are customers."
— Ankur Goyal [15:50]
As AI systems become more intricate, BrainTrust emphasizes the importance of both end-to-end and component-specific evaluations. By evaluating individual modules—such as a planner in an agent-based system—developers can ensure reliability and reusability, facilitating the construction of more sophisticated AI applications.
Quote:
"The best systems are often the systems that can be evaluated really well."
— Ankur Goyal [22:02]
Ankur posits that AI is fundamentally transforming software engineering. The ability to interact with AI through natural language (e.g., English) is streamlining and accelerating development processes. This paradigm shift is making AI tools more accessible and integrating them deeply into standard engineering practices.
Quote:
"English is the new language... the most effective software engineers of tomorrow are going to be writing a higher fraction of English than they do today."
— Ankur Goyal [28:24]
BrainTrust addresses data security by allowing customers to run the data plane within their own cloud environments, ensuring that sensitive data remains under their control. The platform’s architecture separates the control plane from the data plane, preventing BrainTrust’s servers from accessing customer data and enabling secure, scalable deployments.
Quote:
"Our control plane never does or needs to access your data plane... your browser connects directly to that data."
— Ankur Goyal [32:41]
For those interested in developing with generative AI, Ankur advises focusing on specific problems rather than the technology itself. By anchoring AI applications to tangible, user-centric challenges, developers can create more effective and impactful solutions. Additionally, he emphasizes the importance of implementing robust evaluation processes to continuously refine and improve AI models.
Key Recommendations:
Quote:
"Don't waste your time learning Python or getting involved in the Python ecosystem... I recommend, you know, really, really focusing on the AI TypeScript ecosystem."
— Ankur Goyal [42:02]
The episode provides invaluable insights into the evolving challenges of AI model evaluations and the innovative solutions BrainTrust Data offers. Ankur Goyal’s expertise underscores the importance of adapting traditional software engineering practices to accommodate the unique demands of AI development. As AI continues to integrate into various facets of technology, tools like BrainTrust are pivotal in ensuring the reliability, efficiency, and scalability of AI-driven applications.
Final Quote:
"The most effective software engineers of tomorrow are going to be writing a higher fraction of English than they do today."
— Ankur Goyal [28:24]
Note: This summary encapsulates the core discussions and insights from the podcast episode, integrating direct quotes for emphasis and clarity. For a deeper understanding, listeners are encouraged to access the full episode.