Podcast Summary: "Evals, Feedback Loops, and the Engineering That Makes AI Work"
AI + a16z | Host: a16z (Martin Casado)
Guest: Ankur Goyal (Founder & CEO, BrainTrust)
Release Date: February 17, 2026
Main Theme / Purpose
This episode examines the hidden discipline and “real engineering” that differentiates AI products that work from those that don’t. Martin Casado (a16z general partner) and Ankur Goyal (BrainTrust) dive into the importance of evals, feedback loops, and testing harnesses, contrasting brute force scaling with thoughtful system engineering. They touch on the limitations of capital-driven AI development, open-source vs. closed-source model dynamics, surprising benchmarks (Bash vs. SQL for agents), and the nuances of deploying AI in enterprise environments.
Key Discussion Points & Insights
The Fundamental Tension: Brute Force vs. Engineering
- Brute Force Dominates AI Progress
- Frontier labs leap ahead by “throwing more compute, more data” at problems:
"These frontier labs don't have that problem. They can literally just raise money and build a model based on the money." [00:13, Martin]
- This “bitter lesson” (Sutton) is anti-engineering:
"It's almost like anti-engineering... just throw a bunch of data and compute at this stuff. And what comes out is basically the thing." [09:17, Martin]
- Systems vs. AI Mindsets
- Systems people value predictability and deterministic solutions; AI is inherently less deterministic and more probabilistic:
"AI is continuous and systems are discrete. Humans fundamentally think a little bit more in terms of systems and... predictability and reliability... than they do non-determinism." [07:45, Ankur]
- Harmonizing these approaches is where real product value emerges.
What is an “Eval”? Bridging Scientific Method and Engineering
- Defining Evals
- Ankur likens evals to “the scientific method applied to software engineering with non-deterministic systems”:
"You come up with a hypothesis... simulate running the system on a set of inputs... observe outputs... You quantitatively look at the difference." [03:42, Ankur]
- Emphasizes both quantitative and qualitative measurement.
- Important: Evals are not about understanding the model, but protecting your app from its unpredictability:
"The problem is not to understand [the model]... you're almost protecting your app from them." [15:08, Martin]
- Evals Are the Natural Evolution of Product Management
- "I think evals are the natural evolution of a PRD. By creating a really good eval, you are making a declarative representation of what your product should be." [15:27, Ankur]
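The loop Ankur describes (form a hypothesis, run the system on a fixed set of inputs, quantitatively score the outputs) can be sketched as a minimal harness. Every name below is illustrative, not Braintrust's actual API:

```python
# Minimal eval-harness sketch: run a non-deterministic system over fixed
# inputs and score outputs against expectations. All names are illustrative.

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on an exact match, 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(task, dataset, scorer=exact_match):
    """Run `task` over (input, expected) pairs; return mean score and details."""
    results = []
    for case_input, expected in dataset:
        output = task(case_input)  # the non-deterministic step (e.g. an LLM call)
        results.append({
            "input": case_input,
            "output": output,
            "score": scorer(output, expected),
        })
    mean = sum(r["score"] for r in results) / len(results)
    return mean, results

# Toy "model": upper-cases its input (stands in for an LLM call).
# The third expectation is deliberately wrong, so one case fails.
dataset = [("hello", "HELLO"), ("eval", "EVAL"), ("world", "word")]
mean, results = run_eval(lambda s: s.upper(), dataset)
print(round(mean, 2))  # 0.67 (2 of 3 cases match)
```

The dataset of expected outputs is exactly the "declarative representation of what your product should be" from the PRD analogy: changing the product spec means changing the expectations, not the harness.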
Open Source vs. Closed Source Model Dynamics
- Rapid Intelligence “Sublimation”
- Models from frontier labs quickly inform/bleed into open-source or other third-party models.
- Chinese models show high token volume, but low dollar-weighted adoption.
- Why Not More Adoption of Cheap/High-Volume (Chinese) Models?
- Worse APIs, higher error rates, limited rate limits, and delivery issues.
"None of the open source providers give you good rate limits unless you beg them." [19:44, Ankur]
- Cost is not always the limiting factor—the API reliability and engineering polish still matter greatly.
- Cyclic Industry Pattern
- Every time a closed-source model leaps ahead, everyone moves to it and forgets open source; after stagnation, open source catches up and adoption shifts back, only for the next leap to reset the cycle.
"It’s this push and pull... as soon as one of these things comes out, the entire industry forgets about open source models..." [21:44, Ankur]
Limits of Brute Force: Engineering Still Matters
- Brute-force AI offers fast progress, but engineering discipline provides reliability and efficiency.
- Ankur argues that the opportunity to “engineer God to be more efficient” emerges once you can’t brute-force another quantum leap:
"When you can't make God 1% smarter, there is like an insane opportunity to engineer God to be more efficient." [01:15, Ankur]
- The Unsung Hero: The Testing Harness
- The companies shipping AI products that “actually work” are those with strong engineering around testing harnesses and feedback loops—not those with the best models.
Benchmarking: Bash vs. SQL for Agents (A Comical Result)
- The Setup
- Industry folk claim LLMs are naturally best at bash (i.e., throw them into a Unix environment), so people hack agent systems around bash interfaces.
- The Reality
- Ankur’s team ran direct benchmarks: SQL-based interfaces (for the same agent tasks) were more accurate, efficient, and faster—even “the worst models perform better on SQL than... on bash.” [38:06, Ankur]
- Quote:
"SQL is more accurate, it's more efficient, it's more token efficient, it's faster. The worst models perform better on SQL than they do on [Bash]—like everything." [39:27, Ankur]
- Insight
- Engineering to the “sweet spot” of the model’s actual capabilities (vs. assuming bash is more accessible) yields better outcomes.
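The contrast is easy to see in miniature. Below, a toy task ("count the ERROR rows") is posed through both surfaces: a declarative SQL statement over structured data versus a shell pipeline over flat text. This harness is a hypothetical sketch for illustration, not the team's actual benchmark:

```python
# Sketch: the same agent task exposed through two tool surfaces.
# The benchmark result is from the episode; this harness is illustrative.
import sqlite3
import subprocess

# -- SQL surface: one declarative statement against structured data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "boom"), ("INFO", "ok"), ("ERROR", "crash")])
sql_answer = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE level = 'ERROR'").fetchone()[0]

# -- Bash surface: the model must compose a pipeline over flat text.
# (Assumes a POSIX shell with grep available.)
logfile = "ERROR boom\nINFO ok\nERROR crash\n"
pipeline = "grep -c '^ERROR'"
bash_answer = int(subprocess.run(
    pipeline, shell=True, input=logfile,
    capture_output=True, text=True).stdout.strip())

print(sql_answer, bash_answer)  # both 2
```

Both surfaces reach the same answer here, but the SQL call is a single constrained statement the model either gets right or wrong, while the bash route invites escaping, quoting, and pipeline-composition mistakes at every token, which is one plausible reading of why even weak models scored better on SQL.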
Pricing, Economics & Fraud in AI SaaS
- Value Alignment
- Pricing models trend toward usage-based billing (tokens/bytes), aligning customer costs with delivered value.
- The shift mirrors the earlier move from perpetual licenses to recurring SaaS pricing.
- Fraud Issues
- Present, but less severe in B2B settings with higher entry points. Still, abuse patterns (like gaming free plans) do exist.
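In miniature, usage-based pricing is just metering consumption against a rate card. The rates and function below are invented for illustration, not any vendor's actual pricing:

```python
# Sketch of usage-based (token-metered) billing; rates are hypothetical.
RATE_PER_1K_INPUT = 0.003   # dollars per 1K input tokens (made up)
RATE_PER_1K_OUTPUT = 0.015  # dollars per 1K output tokens (made up)

def monthly_bill(input_tokens: int, output_tokens: int) -> float:
    """Bill scales with consumption, so cost tracks delivered value."""
    return (input_tokens / 1000 * RATE_PER_1K_INPUT
            + output_tokens / 1000 * RATE_PER_1K_OUTPUT)

# 2M input + 500K output tokens: 6.00 + 7.50 dollars.
print(round(monthly_bill(2_000_000, 500_000), 2))  # 13.5
```

The value-alignment point falls out of the function shape: an idle customer pays roughly nothing, unlike a flat seat license, which is also why free-tier abuse shifts to gaming the metered quota instead.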
What Will Limit AI Frontier Labs: Money or “The Sun”?
- Frontier Labs Scale Faster Than Traditional Software
- Unlike prior software companies, which were rate-limited by engineering talent and time, AI labs are limited only by capital, at least for now.
- Real constraint may be on the consumption side:
“The speed at which the planets... can actually ingest the heat is potentially the first limiting factor.” [27:36, Ankur]
Notable Quotes & Memorable Moments
- On AI Systems vs. Traditional Software:
- "Imagine writing an operating system for a chipset every time like a new version comes out. Like you had an entirely different machine code or instruction set." [10:52, Martin]
- On Protecting Apps from LLMs:
- "Literally the problem is not to understand them. Because a lot of people, when they think of Evals, they think of, I'm understanding this thing, but that's not what's happening. It's like you're almost protecting your app from them." [15:08, Martin]
- On Open Source Model Adoption:
- "People use these models which have worse APIs, higher error rates. None of the open source providers give you good rate limits unless you beg them." [19:44, Ankur]
- On Team Habits:
- "We have a few very shrewd customers who've observed that certain high volume use cases just don't change over time. And they've specifically instructed their staff not to get caught up in this stuff." [21:44, Ankur]
- On Brute Force vs. System Design:
- "You're really underestimating the intelligence of the model if you force it to do the brute force thing." [39:58, Ankur]
- On the Bash-vs-SQL Benchmark:
- "SQL is more accurate, it's more efficient, it's more token efficient, it's faster... the results are just like comical." [39:27, Ankur]
Timestamps for Key Segments
- [00:00] – Intro: Systems thinking vs. AI non-determinism
- [02:12] – Ankur Goyal’s background: Relational DBs to AI, early lessons in feedback loops
- [03:42] – What is an eval? Evals vs intuition; feedback to product iteration
- [07:45] – Tension between systems and AI mindsets
- [09:17] – The “bitter lesson”; AI as anti-engineering
- [15:08] – Evals to protect apps, not understand models
- [16:52] – Open vs closed source models, “sublimation” of intelligence into open models
- [17:17] – Chinese models: why high token but low dollar use
- [19:44] – Limits of open-source models: rate limits, APIs, delivery
- [21:44] – Cyclic adoption of open-source after closed-source leaps
- [24:28] – New equilibrium: rate limits, growth, and funding models
- [26:52] – Capital as limiting factor; when does engineering become necessary?
- [27:36] – Does enterprise demand become the bottleneck?
- [36:14] – The Bash vs. SQL benchmark for agents; surprising results
- [39:58] – Don’t underestimate models by reducing to brute force
- [41:20] – Engineering at BrainTrust: types, specs, and state guarantees
Overall Takeaways
- True AI product success is not just about bigger, smarter models—it’s about the engineering around those models: evals, harnesses, and feedback loops.
- The “bitter lesson” (more data+compute wins) still dominates, but as progress plateaus, efficient engineering gives teams competitive advantage.
- The industry cycles between open and closed-source model dominance, but practical delivery, reliability, and system integration matter most in the long run.
- Simple benchmarks reveal how easy it is to be distracted by model capabilities (e.g., Bash) when better outcomes come from thoughtful system design (SQL).
- Limits to progress may come not from model complexity, but demand-side constraints: how quickly can enterprises and consumers adapt?
If you want to grasp the “real work” behind the latest AI systems, this episode is a must-listen for practitioners, product managers, and founders navigating the space between research and reliable, usable software.
