Podcast Summary
Podcast: Software Engineering Daily
Episode: Optimizing Agent Behavior in Production with Gideon Mendels
Date: February 17, 2026
Host: Kevin Ball (K. Ball)
Guest: Gideon Mendels (Co-founder and CEO of Comet)
Episode Overview
This episode delves into the evolving landscape of deploying LLM (Large Language Model) agents in production, focusing on agent evaluation, optimization, and observability. Gideon Mendels, co-founder and CEO of Comet, shares lessons learned from years of building ML tooling, discusses the nuances of agent-based system development, and explores emerging workflows, tools, and best practices for monitoring and continuously improving AI agents in production.
Key Discussion Points & Insights
1. Gideon's Background and Comet's Evolution
- Gideon's journey: Started as a software engineer, transitioned to ML at Google, worked on language models and hate speech detection for YouTube (pre-transformer era).
- Motivation for Comet: Frustrated by the "Wild West" ML workflows—even at Google—that lacked the rigor and tooling of traditional software engineering.
- Comet's path: Began as an experiment tracking platform (2017-2018), expanded to encompass dataset versioning, model registries, and monitoring.
- Industry shift: Customers began using LLM APIs (like OpenAI) instead of training custom models, requiring new workflows.
- Launch of Opik (Sep 2024): An open-source platform for evaluation, optimization, and production observability of LLM agents—distinct from traditional ML tooling.
"As someone coming from a software engineer background... just seeing how the whole thing is kind of like a little bit like the Wild West, it was very, very challenging." — Gideon (03:12)
2. Agent Development: Between ML and Software Engineering
- Unique Challenges:
- In traditional ML, the model is a fixed binary, typically with robust experimentation on datasets and hyperparameters.
- In LLM-powered agents, you control prompts, tool calls, and context (not model weights), making the process more dynamic and less deterministic.
- The "variables" are now system prompts, tool call descriptions, chunking strategies, vector DB setups, etc.
- Non-determinism:
- Real challenge for software engineers new to ML/AI—unit tests are hard to translate because deterministic assertions (e.g., string matching) break down with LLMs.
- Output can be semantically equivalent but textually different; need new methods to assert correctness (semantic similarity, LLM-judged scoring).
"Building these agents is somewhere in between software engineering and ML. It's definitely not pure ML, it's definitely not pure software engineering. But there's a lot of learnings from both of these paradigms..." — Gideon (10:13)
3. Evals: The Missing Foundation
- What are evals?
- Evals bridge software test suites and ML metrics: lists of inputs and expected outputs, scored with assertions that go beyond exact string matching.
- Evals can use deterministic metrics (e.g., BLEU score) or LLM-as-a-judge for more nuanced grading.
- Essential for confidence in shipping changes, tracking regressions, and enabling continuous improvement.
- Challenges in adoption:
- Painful to create high-quality eval datasets (requires subject matter expertise, domain knowledge).
- Most teams currently under-invest in evals, leading to brittle, hard-to-debug agents.
- Productized solutions (integrated with UI/feedback loop) can help bootstrap the process from real-world failures or human corrections.
- Levels of evals:
- Analogous to unit, integration, and system tests.
- Begin with end-to-end/system tests (easier to assemble, broad coverage), then move toward more granular checks (e.g., did the agent pick the right tool?).
"A lot of people struggle with doing [evals]... But spend some time on building like a very small evaluation dataset, 20 samples, like just 20, it will pay off big time." — Gideon (50:35)
4. Optimization: Agents as Search Problems
- Optimization loop:
- Treat improving agents as a search over hyperparameters: prompt wording, tool configurations, chunking, etc.
- Use evals as your "objective function": optimize to maximize the eval score (pass rate).
- Possible approaches: brute-force (impractical), random search, Bayesian optimization, evolutionary LLM-guided rewriting of prompts.
- LLM-assisted optimizers can suggest prompt rewrites/fixes by analyzing failures in the eval suite, then test and iterate.
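The search framing can be sketched as a loop: sample a configuration (here, a prompt variant), score it against the eval suite, keep the best. Random search below stands in for the smarter optimizers mentioned above (Bayesian, evolutionary LLM-guided rewriting), and the `evaluate` objective is a toy, but the loop shape is the same.

```python
import random

# Treat prompt variants as hyperparameters and the eval suite's score as
# the objective. This is plain random search; real optimizers propose
# candidates more intelligently but share this sample-score-keep loop.
def optimize(candidates: list[str],
             evaluate,
             trials: int = 10,
             seed: int = 0):
    rng = random.Random(seed)
    best_prompt, best_score = None, float("-inf")
    for _ in range(trials):
        prompt = rng.choice(candidates)  # sample a configuration
        score = evaluate(prompt)         # run the eval suite on it
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

# Toy objective standing in for an eval pass rate:
prompts = [
    "Answer briefly.",
    "Answer step by step.",
    "Answer step by step, citing sources.",
]
best, score = optimize(prompts, evaluate=lambda p: len(p))
```

In practice `evaluate` would call a harness like `run_evals`, which is why a trustworthy eval suite is a prerequisite for any of this.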
- Practical example:
- LangChain’s JSON schema prompt: optimization improved the pass rate from 12% to 96% in 2 iterations, at <$1 compute cost (25:27).
- Scalability:
- Unlike ML models requiring thousands of training samples, agent optimization can show big gains with just 20–30 eval samples.
"The reality is, everything we're talking about is a search problem ... and if you have an eval test suite, you have a certain score of how well you're doing against it. And we are searching for hopefully the global maxima..." — Gideon (19:54)
5. Continuous Improvement, Feedback Loops, and Production Deployment
- Current state (2026):
- Most teams don't re-optimize agents continuously—eval suites don’t grow fast due to the difficulty of capturing and labeling new scenarios.
- The future:
- ML pipelines commonly retrain; the industry is headed towards agents that retrain/re-optimize as new feedback is collected (via production data, user feedback, or LLM-judged anomalies).
- Key challenge: Bootstrapping a robust, ever-growing eval suite with the right quality, privacy, and operational or AB-testing safeguards.
- Production prompt/config management:
- Shift towards configuration registries—manage prompts and related hyperparameters as dynamic, versioned data.
- Teams increasingly separate code and prompt "blueprints," allowing for dynamic prompt updates without full redeploys.
- Enables canary/AB tests and correlation of eval suite performance with real user behavior.
"Most teams building agents are not using any framework ... some of the most successful ones out there are home-brewed, vanilla built. So I wouldn't spend too much effort on that." — Gideon (50:20)
6. The Future of Agent Interfaces and Workflows
- Dynamic UIs and Agents:
- LLMs enable intent-based or generative UIs that can change based on agent state or chat history (imperative vs. declarative).
- Dream: UIs may be dynamically constructed by agents to serve the real user context, not just static product specs.
- Evaluation challenges for generative UIs:
- Human-in-the-loop + LLM-as-judge hybrid models.
- For browser/DOM agents, evals may compare final DOM state; React/Redux sandboxes may help keep agent-generated code safe and testable.
"Is the future where every UI is on demand, generated by your agent to just show you what you need to see right now? I don't know... But exciting times to be building." — Gideon (45:13)
Notable Quotes & Memorable Moments
- On ML vs. LLM agent development:
"You have all these variables ... and you're trying to find a combination that gives you the best results. So from that perspective, it's quite similar. In the day to day, it tends to look quite different." — Gideon (06:46)
- On the status of agent deployment best practices:
"Are you kidding? This whole field is a year old. Like, there's no established anything." — K. Ball (35:25)
- On the reality of production agent feedback loops:
"Whether you built evals or not, you put the stuff in production, and at some point someone comes complaining..." — Gideon (13:34)
- On industry hype vs. reality:
"If you're online, if you're on Twitter, it seems like everyone figured this out... The reality is everyone in the industry is trying to figure this out and it's hard for everyone, including OpenAI." — Gideon (50:56)
Important Segment Timestamps
| Timestamp | Segment Description |
|-----------|---------------------|
| 02:23 | Gideon's background, Comet origins, and why they built Opik |
| 06:22 | Differences between ML and agent/LLM workflows |
| 08:51 | Non-determinism: difficulties in testing/evaluating LLM systems |
| 11:10 | The power and pitfalls of evals; productized approaches to enable them |
| 15:34 | Types/levels of evals: system, integration, unit for agents |
| 19:06 | Debugging failing evals: moving towards search/optimization loops |
| 21:46 | Treating agent improvement as a search/optimization problem |
| 25:27 | Real-world prompt optimization case study (LangChain example) |
| 28:48 | Frequency of optimization & continuous improvement; production process |
| 31:42 | Barriers: privacy, eval set curation, operational/logistical issues |
| 34:05 | Prompt and config management: code vs. data; versioning |
| 40:42 | Roadmap for Opik and near-future developments |
| 45:40 | The coming shift to intent-driven UIs powered by LLM agents |
| 47:55 | Evaluating generative UIs, leveraging DOM-level and human feedback |
| 50:19 | Gideon's practical advice for teams shipping agents |
Key Takeaways & Actionable Advice
1. Don’t over-optimize early on frameworks or minor costs.
Focus on value, evaluation datasets, and using the best-performing models first—optimize later for cost.
2. Start small but invest in evals.
A basic suite of 20 high-quality eval samples will make agent development and debugging dramatically more robust and repeatable.
3. Embrace productized, UX-supported workflows for agent management.
Leverage trace tools, configuration registries, and human-in-the-loop feedback to improve and audit your agents over time.
4. Prepare for a future of continuous, automated improvement.
Build your systems in a way that new real-world data can flow into evals, which then drive agent optimization and redeployment—mirroring ML retraining pipelines.
5. Don’t get discouraged by the "lack of best practices."
Everyone—from startups to OpenAI—is learning as they go. Most "real" production agents are homegrown and under constant evolution.
For more:
- Explore Opik open source for agent evaluation/optimization.
- Follow K. Ball and Gideon Mendels on social for their ongoing insights.
