Transcript
A (0:00)
LLM-powered systems continue to move steadily into production, but this process is presenting teams with challenges that traditional software practices don't commonly encounter. Models and agents are non-deterministic systems, which makes it difficult to test changes, reason about failures, and confidently ship updates. This has created the need for new evaluation tooling designed specifically around the properties of LLMs. Comet is a platform with roots in MLOps that has evolved to support teams building modern LLM-powered applications. The company recently launched Opik, which is an open source platform focused on evaluation, optimization, and observability for LLM agents. Together, the tools aim to bring the rigor of traditional engineering and ML workflows to the rapidly evolving world of agent-based systems by treating prompts, tools, and workflows as optimizable components that can be evaluated and improved over time. Gideon Mendels is the co-founder and CEO of Comet. He previously worked at Google on hate speech and deception detection, and he founded Groupwise, which trained and deployed NLP models processing billions of chats. In this episode, Gideon joins Kevin Ball to discuss how agent development sits between software engineering and ML, why evals are the missing foundation for most AI teams, prompt optimization as a search problem, and the future of continuously improving agents in production. Kevin Ball, or K. Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K. Ball on Twitter or LinkedIn, or visit his website, K Ball LLC.
B (2:09)
Gideon, welcome to the show.
C (2:11)
Yeah, Kevin, thanks for having me. I'm a big fan of the podcast, so I was looking forward to this one.
B (2:15)
Yeah, I'm excited. Well, let's start with you. So can you give a little bit about your background, how you ended up at Comet, and then some of what Comet is about?
C (2:23)
Absolutely. So originally I started as a software engineer, kind of moved throughout the stack in the first few years, and then about 10, 12 years ago I shifted to working on machine learning. I was a grad student and then I went to Google. Funny enough, I worked on language models. This is, you know, 2016, so they weren't large nor very good. Right. These were the pre-transformer days. Unfortunately, LSTMs, if anyone still knows what that is. And you know, as someone coming from a software engineering background where we take a lot of pride in how we build software, obviously a lot of that is changing right now, I'm sure we'll talk about it. But, you know, a lot of pride in how we build software, the tools that we use. And then joining an ML team with amazing, very, very smart and talented people, but just seeing how the whole thing is a little bit like the Wild West, it was very, very challenging. You know, we worked on hate speech detection in YouTube comments. If you remember the YouTube comments section back in the day, I think someone called it the worst place on the Internet. So we had a hard time getting these models to work. And from that point I was like, okay, look, we had data, we had compute, we had smart people, we still couldn't do it. What is it? And it's not considered necessarily a hard ML problem. And I realized it's just kind of around the process of how you drive these projects. And I called my co-founder here at Comet, who I worked with on another startup building ML models, and I was like, hey, you remember making fun of my ML workflows and how everything is stitched together? So I'm at Google and it's exactly the same thing, just at massive scale. So that's really how we got started. This is 2017, 2018, and we started with specifically what my team and myself needed back then at Google, which was around model experiment tracking.
You train a bunch of these models, there's all these moving pieces, hyperparameters, dataset versioning, all these results, and it's really hard to know that you're making progress and to understand what you're doing next. Collaborating is completely out of the question because no one has access to anything. So we started with that, and then over the years we expanded that side of the platform: dataset versioning, model registries, model monitoring, and such. And then about two years ago or so, obviously there was quite a big shift in the industry, and a lot of our customers and users started telling us, hey, for this use case, we're not going to train a model anymore, we're going to try to build it on top of the OpenAI API, but it's still very similar because we're testing all these different things and we still want to use Comet for that. How can you help us? And at first we started to add some features and such to help with that, but eventually we realized, okay, there's a lot of similarities, but there's also enough differences not to try to bake it into a slightly different workflow. In September 2024, we launched Opik, which is our open source product focused on teams building agents, any type of LLM-powered application really, focusing on the end to end from early dev through this deployment process to production, and specifically things around observability, evaluation, and automatically optimizing these agents. But yeah, it's been a fun ride. We power some amazing AI teams at Uber, Netflix, Etsy, Shopify, Autodesk. We have roughly 150,000 engineers around the world using our products. Great adoption on the open source front. So yeah, it's been a fun ride so far.
