Podcast Summary: Software Engineering Daily - The Challenge of AI Model Evaluations with Ankur Goyal
Release Date: June 10, 2025
Introduction
In this episode of Software Engineering Daily, host Sean Falconer welcomes Ankur Goyal, CEO and founder of BrainTrust Data, to discuss the challenges of evaluating AI models, particularly Large Language Models (LLMs). The conversation covers the complexities of non-deterministic behavior in AI, the design and deployment of evaluation tools, and the evolving landscape of software engineering in the age of AI.
Ankur Goyal’s Background and BrainTrust Data
Ankur Goyal brings a wealth of experience to the table. Before founding BrainTrust Data, he led the AI team at Figma, and before that he founded Impira, which Figma later acquired. At Impira, Ankur worked on AI products in the years before models like ChatGPT arrived. That background laid the foundation for BrainTrust Data, a platform focused on making LLM development both robust and iterative.
Quote:
"Prior to BrainTrust I used to lead the AI team at Figma and before that I started a company called Impera, which Figma acquired."
— Ankur Goyal [01:25]
Challenges in AI Model Evaluations
Evaluating AI models, especially LLMs, introduces challenges not typically encountered in traditional software engineering. These models are complex, versatile, and non-deterministic, so standard methods like code reviews and automated testing are insufficient on their own.
Key Challenges Discussed:
- Non-Determinism: Unlike traditional software tests that expect consistent outcomes, AI models can produce varied results even with the same input.
- Data Sourcing: Effective evaluations require high-quality, relevant data, which can be difficult to obtain and manage.
- Integration of Non-Technical Insights: Feedback from non-technical team members is crucial for spotting poor user interactions with AI models and turning them into improvements.
Quote:
"The biggest difference is that it's non deterministic... you start to have to do stuff like, okay, I'm going to try running this thing four times and if it succeeds three out of the four times, then maybe it's good enough."
— Ankur Goyal [05:10]
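To make the "run it a few times" idea concrete, the sketch below scores a non-deterministic task by its pass rate over several trials rather than expecting a single deterministic answer. It is a minimal TypeScript illustration; the task, check function, and 3-out-of-4 threshold are assumptions for the example, not details from the episode or the BrainTrust SDK.

```typescript
// Trial-based scoring for a non-deterministic task: run it several times and
// measure the pass rate instead of expecting one deterministic answer.
type Task = (input: string) => Promise<string>;
type Check = (output: string) => boolean;

async function passRate(task: Task, check: Check, input: string, trials = 4): Promise<number> {
  let passes = 0;
  for (let i = 0; i < trials; i++) {
    if (check(await task(input))) passes++;
  }
  return passes / trials; // 0.75 means 3 of 4 runs succeeded
}

// Hypothetical usage: accept the case if at least 3 of 4 runs pass.
// const rate = await passRate(myLlmTask, (out) => out.includes("refund"), "Can I get my money back?");
// const goodEnough = rate >= 0.75;
```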
BrainTrust’s Solution and Design Philosophy
BrainTrust Data addresses these challenges by providing a structured platform for AI evaluations, breaking down the eval process into three core components:
- Data: A collection of inputs and optionally expected ground truth values.
- Task Functions: Simple functions that process inputs to generate outputs, which can range from single prompts to complex multi-agent interactions.
- Scoring Functions: Functions that assess the generated outputs against expected outcomes, producing a score between 0 and 1.
This modular approach ensures flexibility and ease of use, allowing engineers to integrate evaluations seamlessly into their workflows.
Quote:
"We broke an eval down into just three simple parts... data, task function, and scoring functions."
— Ankur Goyal [14:24]
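The three parts map naturally onto a small harness. The following TypeScript sketch is a generic illustration of that structure, not the BrainTrust SDK itself; the type names, the averaging of scorers, and the exact-match scorer are assumptions made for the example.

```typescript
// Generic three-part eval harness: data, a task function, and scoring functions.
type Case = { input: string; expected?: string };
type TaskFn = (input: string) => Promise<string>;
type ScoreFn = (output: string, expected?: string) => number; // each returns 0..1

async function runEval(data: Case[], task: TaskFn, scores: ScoreFn[]) {
  const results = [];
  for (const { input, expected } of data) {
    const output = await task(input);
    // Average the scoring functions into a single 0..1 score for this case.
    const score = scores.reduce((sum, s) => sum + s(output, expected), 0) / scores.length;
    results.push({ input, expected, output, score });
  }
  return results;
}

// Example usage with an exact-match scorer and a trivial stand-in task.
const exactMatch: ScoreFn = (output, expected) => (output === expected ? 1 : 0);

runEval(
  [{ input: "2 + 2", expected: "4" }],
  async (_input) => "4", // stand-in for a single prompt or a multi-agent pipeline
  [exactMatch],
).then(console.log);
```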
Deployment, Integration, and Cost Management
BrainTrust is designed to integrate smoothly into existing development workflows. Evaluations can start as local runs whose results are visualized in BrainTrust's UI, then scale into team-wide processes through integrations such as GitHub Actions. BrainTrust employs caching to avoid redundant model calls, keeping both runtime and cost under control.
Key Points:
- Local Execution: Developers can run evaluations on their machines, with results quickly visualized in BrainTrust’s UI.
- CI/CD Integration: Seamless integration with GitHub Actions allows evaluations to become part of the pull request workflow.
- Cost Efficiency: By caching results and optimizing evaluation runs, BrainTrust ensures that inference costs remain low relative to the value delivered.
Quote:
"With BrainTrust, they're able to solve more than 30 issues per day... with the analog to CI/CD, observability, et cetera in AI, you're just able to move a lot faster."
— Ankur Goyal [11:49]
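One way to picture the caching mentioned above is to memoize model calls by a hash of the prompt, so re-running an eval (for example in a GitHub Actions job) only pays for inputs that actually changed. The sketch below is a generic illustration of that idea, not BrainTrust's internal design; the Model type and withCache wrapper are hypothetical.

```typescript
// Memoize model calls by a hash of the prompt so repeated eval runs skip
// inputs that have not changed.
import { createHash } from "node:crypto";

type Model = (prompt: string) => Promise<string>;

function withCache(model: Model, cache: Map<string, string> = new Map()): Model {
  return async (prompt: string) => {
    const key = createHash("sha256").update(prompt).digest("hex");
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // cache hit: no model call, no cost
    const output = await model(prompt);
    cache.set(key, output);
    return output;
  };
}

// Hypothetical usage: wrap the real model client once, then run the eval as usual.
// const cachedModel = withCache(callModelApi);
```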
Business Strategy and Monetization
Despite initial skepticism from venture capitalists (VCs) regarding the profitability of CI/CD-like tools, BrainTrust focused on the critical pain point of AI evaluations. This strategic focus paid off as companies recognized the value of robust evaluation mechanisms in enhancing user experiences and product reliability. Additionally, the platform's capabilities expanded to include logging and monitoring, further increasing its value proposition.
Quote:
"Our core bet was that there are some early adopters... almost all the companies on that list are customers."
— Ankur Goyal [15:50]
Building Complex AI Systems and Future Tooling
As AI systems become more intricate, BrainTrust emphasizes the importance of both end-to-end and component-specific evaluations. By evaluating individual modules—such as a planner in an agent-based system—developers can ensure reliability and reusability, facilitating the construction of more sophisticated AI applications.
Quote:
"The best systems are often the systems that can be evaluated really well."
— Ankur Goyal [22:02]
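As an example of a component-specific eval, a planner can be scored in isolation from the rest of an agent, for instance by checking whether it selects the expected tools in the expected order. The planner type, scorer, and test case below are hypothetical illustrations, not from the episode.

```typescript
// Component-level eval of a hypothetical planner: score the plan itself,
// separately from any end-to-end run of the full agent.
type PlanStep = { tool: string; argument: string };
type Planner = (goal: string) => Promise<PlanStep[]>;

// Did the planner pick the expected tools, in the expected order?
function planScore(plan: PlanStep[], expectedTools: string[]): number {
  const used = plan.map((step) => step.tool);
  const matches = expectedTools.filter((tool, i) => used[i] === tool).length;
  return expectedTools.length === 0 ? 1 : matches / expectedTools.length;
}

async function evalPlanner(planner: Planner) {
  const cases = [
    { goal: "Refund order #123", expectedTools: ["lookup_order", "issue_refund"] },
  ];
  for (const { goal, expectedTools } of cases) {
    const plan = await planner(goal);
    console.log(goal, planScore(plan, expectedTools)); // score between 0 and 1
  }
}
```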
AI’s Impact on Software Engineering Practices
Ankur posits that AI is fundamentally transforming software engineering. The ability to interact with AI through natural language (e.g., English) is streamlining and accelerating development processes. This paradigm shift is making AI tools more accessible and integrating them deeply into standard engineering practices.
Quote:
"English is the new language... the most effective software engineers of tomorrow are going to be writing a higher fraction of English than they do today."
— Ankur Goyal [28:24]
Security and Data Privacy in AI Tooling
BrainTrust addresses data security by allowing customers to run the data plane within their own cloud environments, ensuring that sensitive data remains under their control. The platform’s architecture separates the control plane from the data plane, preventing BrainTrust’s servers from accessing customer data and enabling secure, scalable deployments.
Quote:
"Our control plane never does or needs to access your data plane... your browser connects directly to that data."
— Ankur Goyal [32:41]
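A rough way to picture this split: the vendor-hosted control plane hands the browser only a short-lived credential, and the browser then reads eval data directly from the customer-hosted data plane. The sketch below is a hypothetical illustration of that pattern; the endpoints and token exchange are assumptions, not BrainTrust's actual API.

```typescript
// Hypothetical illustration of a control-plane / data-plane split: the
// control plane issues a short-lived token, and the client fetches eval
// logs directly from the customer-hosted data plane.
async function fetchLogs(controlPlaneUrl: string, dataPlaneUrl: string) {
  // 1. Get a scoped, short-lived credential from the vendor-hosted control plane.
  const session = await fetch(`${controlPlaneUrl}/session`, { method: "POST" });
  const { token } = await session.json();

  // 2. Read the data straight from the customer's own cloud; it never
  //    transits the vendor's servers.
  const response = await fetch(`${dataPlaneUrl}/logs`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  return response.json();
}
```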
Advice for Building on Generative AI
For those interested in developing with generative AI, Ankur advises focusing on specific problems rather than the technology itself. By anchoring AI applications to tangible, user-centric challenges, developers can create more effective and impactful solutions. Additionally, he emphasizes the importance of implementing robust evaluation processes to continuously refine and improve AI models.
Key Recommendations:
- Problem-Focused Development: Address concrete issues to drive meaningful AI applications.
- Embrace Evaluations: Use evals to validate and enhance AI outputs systematically.
- Choose the Right Tools: Prefer TypeScript-based ecosystems over traditionally Python-heavy environments to benefit from stronger product engineering practices.
Quote:
"Don't waste your time learning Python or getting involved in the Python ecosystem... I recommend, you know, really, really focusing on the AI TypeScript ecosystem."
— Ankur Goyal [42:02]
Conclusion
The episode provides invaluable insights into the evolving challenges of AI model evaluations and the innovative solutions BrainTrust Data offers. Ankur Goyal’s expertise underscores the importance of adapting traditional software engineering practices to accommodate the unique demands of AI development. As AI continues to integrate into various facets of technology, tools like BrainTrust are pivotal in ensuring the reliability, efficiency, and scalability of AI-driven applications.
Final Quote:
"The most effective software engineers of tomorrow are going to be writing a higher fraction of English than they do today."
— Ankur Goyal [28:24]
Note: This summary encapsulates the core discussions and insights from the podcast episode, integrating direct quotes for emphasis and clarity. For a deeper understanding, listeners are encouraged to access the full episode.
