AI + a16z: Benchmarking AI Agents on Full-Stack Coding
Date: March 28, 2025
Host: Martin Casado (a16z General Partner)
Guest: Sujay Jayakar (Co-founder & Chief Scientist, Convex)
Episode Overview
In this episode, a16z's Martin Casado sits down with Sujay Jayakar, widely regarded as a leading systems thinker and co-founder of Convex. Their conversation explores the challenges and nuances of benchmarking AI agents on full-stack coding tasks. Sujay walks through the motivations for his team's newly released benchmark, discusses the current capabilities and limitations of autonomous coding agents, and shares insights into both model performance and the future of software development with AI-generated code.
Key Discussion Points & Insights
1. The State of AI in Coding: Early Days for Full-Stack Autonomy
- Sujay compares autonomous coding to trajectory management in reinforcement learning, noting that "coding a difficult problem is actually like playing a game" ([00:00]).
- Insight: Current AI agents are sometimes able to build full-stack apps, but remain "right on the edge of what's possible," especially for complex, real-world scenarios ([04:48]).
- Quote: “It’s still not a slam dunk for autonomous AI agents today.”
— Sujay Jayakar, [04:48]
2. Convex: Rethinking App Databases for Simplicity & Reactivity
- What is Convex? A reactive, type-safe, end-to-end database aimed at making application development as simple as possible ([02:35]).
- Key design: Everything is type-safe, and state management is handled automatically; developers just write TypeScript/JavaScript (see the sketch below).
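To make that concrete, here is a minimal sketch of a Convex backend function pair, following the patterns in Convex's public documentation; the `messages` table and its fields are illustrative, not from the episode:

```typescript
// convex/messages.ts -- a minimal Convex backend sketch.
// The "messages" table and its fields are illustrative.
import { query, mutation } from "./_generated/server";
import { v } from "convex/values";

// Reactive read: clients subscribed to this query update automatically
// whenever the underlying data changes.
export const list = query({
  args: {},
  handler: async (ctx) => ctx.db.query("messages").collect(),
});

// Type-safe write: the v.string() validator checks arguments at runtime
// and drives the generated end-to-end TypeScript types.
export const send = mutation({
  args: { body: v.string() },
  handler: async (ctx, args) => {
    await ctx.db.insert("messages", { body: args.body });
  },
});
```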
3. Motivation for New Benchmarks: The Full Stack Gap
- Existing benchmarks focus on narrow, isolated coding problems, not end-to-end application building ([06:37]).
- Convex’s customers—especially those building with AI coding agents—faced inconsistent results, prompting the team to create a more holistic and meaningful benchmark.
- Full Stack Bench: Evaluates whether agents can implement backends for existing frontends in common app patterns.
- Quote: “For the tasks that people are actually doing... there aren’t really great publicly available benchmarks.”
— Sujay Jayakar, [07:57]
4. Benchmarking and Evals: Essential, But Underappreciated
- Official model benchmarks aren’t directly useful for independent developers. Instead, evals—task-specific, test-based evaluations—are crucial for assessing model performance in real-world products ([09:49]).
- Evals involve (a minimal harness is sketched below):
- Defining specific tasks
- Grading model outputs (sometimes launching backends and running human-written tests)
- Comparing across different model versions ([11:56], [14:09])
- Quote: “A lot of companies consider high quality evals to be their secret sauce... it’s the evals that actually let you build a real app.”
— Sujay Jayakar, [25:08]
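As a concrete illustration of that loop, a minimal task-based eval harness might look like the sketch below. Every name here (`Task`, `runAgent`, `runEvals`) is a hypothetical stand-in, not Convex's actual benchmark code:

```typescript
// Hypothetical eval-harness sketch -- not Convex's benchmark code.
// Each task pairs a prompt with a human-written grader, which might
// deploy a backend and run tests against it.
interface Task {
  name: string;
  prompt: string; // the coding task handed to the agent
  grade: (output: string) => Promise<boolean>; // e.g. launch backend + run tests
}

// Stand-in for whatever model/agent API is being evaluated.
declare function runAgent(model: string, prompt: string): Promise<string>;

async function runEvals(model: string, tasks: Task[]): Promise<number> {
  let passed = 0;
  for (const task of tasks) {
    const output = await runAgent(model, task.prompt);
    if (await task.grade(output)) passed++;
  }
  // A pass rate you can track across model versions (e.g. 3.5 vs 3.7).
  return passed / tasks.length;
}
```

The pass rate it returns is the kind of number you can compare across model versions to catch regressions before they reach your product.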
5. Practical Challenges: Complexity, Variance, and Guardrails
- High Variance: More complex tasks see increased inconsistency—even within the same model ([14:40]).
- Large codebases result in agents losing context, requiring human intervention ([15:59]).
- Reducing Variance:
- Strong guardrails are vital; the best example is TypeScript-style type safety ([16:15]).
- Feed type errors and context back into the agent for immediate correction ([17:12]); this loop is sketched in code below.
- Beyond Type Safety:
- Testability and clear execution semantics help, but runtime reasoning is still a challenge ([17:53]).
- Certain rules (e.g., complicated React hooks or RLS SQL policies) consistently trip up even advanced models ([18:47]).
- Quote: “If you want to decrease that variance... having type safety can keep it on the straight and narrow.”
— Sujay Jayakar, [17:36]
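The type-safety guardrail is easy to picture as a loop: generate, typecheck, feed the compiler errors back, retry. Here is a minimal sketch under assumptions: `runAgent` is a hypothetical API, and it shells out to `tsc` rather than using language-server diagnostics the way Cursor's agent does:

```typescript
// Sketch of a typecheck-and-feedback loop. `runAgent` is hypothetical;
// real tools typically use language-server diagnostics instead of `tsc`.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

declare function runAgent(prompt: string): Promise<string>;

async function generateWithGuardrails(task: string, file: string) {
  let prompt = task;
  for (let attempt = 0; attempt < 3; attempt++) {
    writeFileSync(file, await runAgent(prompt));
    try {
      // The guardrail: generated code must compile cleanly.
      execSync("npx tsc --noEmit", { stdio: "pipe" });
      return; // success: the code typechecks
    } catch (err: any) {
      // Feed compiler output back so the agent can self-correct.
      const errors = err.stdout?.toString() ?? String(err);
      prompt = `${task}\n\nYour last attempt failed to typecheck:\n${errors}\nFix these errors.`;
    }
  }
  throw new Error("agent never produced typechecking code");
}
```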
6. Model Quality vs. Cost: Tradeoffs and Current Best Picks
- There’s a real performance gap between expensive models (e.g., OpenAI GPT-4o) and cheaper alternatives (e.g., o3-mini, Gemini); the tradeoff often shows up in result quality ([22:28]).
- Sujay’s take: Gemini offers the best price-performance ("pretty good, and cheap" [23:28]), possibly because of Google's vertical hardware integration.
- Model-specific fine-tuning for cheaper models is possible, but complex and not practical for most hobbyists.
- Quote: "As a hobbyist... I always want to use the smaller models, but they just don’t find them to be as good... Is there anything that I can do to use the cheaper models or am I screwed?"
— Martin Casado, [22:16]
7. Knowledge, Context & the Need for Prompt Adaptation
- Large language models' knowledge is anchored in their training data and doesn’t easily adapt to new abstractions or APIs ([19:21]).
- In-context learning and prompt adaptation are essential for getting solid performance as models evolve ([20:38]).
- Changes between model versions (e.g., Claude 3.5 to 3.7) require updating prompts and evals to maintain performance ([19:46]); a sketch of per-model prompt adaptation follows below.
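One common form of that in-context adaptation is inlining current API documentation and per-model guidance directly into the prompt. A hedged sketch; the doc text, guideline strings, and model keys are all illustrative:

```typescript
// Sketch: per-model prompt adaptation via in-context documentation.
// All strings here are illustrative, not taken from Convex's prompts.
const API_DOCS = `
Backend functions are defined with query({ args, handler }) and
mutation({ args, handler }), with validators like v.string().
`;

// Guidance often has to change between model versions, so key it by
// model and re-run your evals whenever you upgrade.
const MODEL_GUIDELINES: Record<string, string> = {
  "claude-3-5-sonnet": "Prefer small, single-file edits.",
  "claude-3-7-sonnet": "You may plan multi-file changes; commit in steps.",
};

function buildPrompt(model: string, task: string): string {
  return [
    "You are a coding agent working against the API described below.",
    API_DOCS,
    MODEL_GUIDELINES[model] ?? "",
    `Task: ${task}`,
  ].join("\n\n");
}
```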
8. State of Public Evals & Community Sharing
- High-value eval sets are rarely published—companies keep them proprietary ([26:16]).
- Sujay hopes for an "open-source eval sets moment," where communities share task-based evals for common application types ([25:08], [26:16]).
9. Limitations in Workflow Consistency & the "Wild West" Reality
- Incremental evolution in workflow is challenging: new models or even minor version changes can break established coding pipelines ([27:02]).
- The core issue: post-training processes for foundation models are opaque, leading to unpredictable shifts in behavior with each model release ([27:04]).
- Quote: “You have a new model, everything changes. So much of even evals is just dealing with the same code base in the same model.”
— Martin Casado, [27:02]
10. Practical Advice for Developers Using AI Coding Tools
- Optimize both prompts and the tool/library environment (prefer strong types, good abstractions).
- Break complex coding into steps—set up type-safe interfaces and make regular commits to avoid losing progress ([32:05]).
- Design changes so they are easy to evaluate, and make problematic generations cheap to revert ([32:36]); a workflow sketch follows below.
- Quote: “As humans... we’re pretty intuitively good at [trajectory management]—I don’t think models are amazing at it yet.”
— Sujay Jayakar, [33:00]
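That advice maps naturally onto a commit-early, revert-on-failure loop around the agent. A sketch, again assuming a hypothetical `runAgent`; the git and npm commands are standard:

```typescript
// Sketch: "trajectory management" for a coding agent -- commit each
// step that passes checks, throw away any step that fails.
import { execSync } from "node:child_process";

// Hypothetical agent call that edits files in the working tree.
declare function runAgent(step: string): Promise<void>;

async function runSteps(steps: string[]) {
  for (const step of steps) {
    await runAgent(step);
    try {
      // Gate each step on the same guardrails: types and tests.
      execSync("npx tsc --noEmit && npm test", { stdio: "pipe" });
      execSync(`git commit -am ${JSON.stringify(step)}`);
    } catch {
      // A bad generation is cheap to discard when every prior
      // step is already committed.
      execSync("git checkout -- . && git clean -fd");
    }
  }
}
```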
Notable Quotes & Memorable Moments
- On variance and model inconsistency:
“When we have these coding agents that are spending an hour in one of these... there’s just a lot of context and state... for our most difficult task... starts with 4,000 lines of code for the front end and then adds a few thousand lines as it implements the backend. [We’ve] seen just a lot of variance...”
— Sujay Jayakar, [15:01]
- On practical benchmarking advice:
“If you are building a product where that product then uses these models... your AI app needs evals... it’s just chronically underappreciated.”
— Sujay Jayakar, [09:49]
- On where to focus improvements:
“There’s a whole nother space... change the tools we use... or what frameworks we use to have some of these properties of having better type safety and guardrails, and that is kind of secretly a part of the prompt.”
— Sujay Jayakar, [30:01]
Timestamps for Important Segments
| Timestamp | Topic/Quote |
|-----------|-------------|
| 00:00 | Coding as "trajectory management" and the limitations of heuristics in autonomous coding |
| 02:35 | How Convex works and its difference from traditional databases |
| 04:48 | “It’s still not a slam dunk for autonomous AI agents today.” |
| 06:37 | Why create the Full Stack Bench? Gaps in real-world AI coding benchmarks |
| 09:49 | Importance of evals over benchmarks for product builders |
| 14:09 | Use of automated tests and backends for comparing model-generated code |
| 16:15 | Reducing variance in agent output with strong type safety and feedback loops |
| 17:12 | Example of tools (Cursor Agent) using language server type feedback for rapid fix cycles |
| 18:47 | Models’ difficulty reasoning with React hooks & SQL RLS rules |
| 19:46 | The need for targeted in-context adaptation as models change |
| 22:28 | Cost vs. performance tradeoffs in model selection, Gemini’s value |
| 25:08 | Value of publicly available eval sets, current lack thereof |
| 27:02 | Unpredictability of model upgrades and workflow stability ("everything changes" with a new model) |
| 30:01 | Optimizing prompts and libraries/frameworks for better AI coding outcomes |
| 32:05 | Practical workflow advice: break tasks into steps, commit early, revert on failure |
Closing Thoughts & Predictions
- Optimizing both prompts and tools matters: Success with AI coders is equally about how you structure the task and which tools (frameworks, libraries, type systems) you choose ([30:01]).
- Trajectory management and human-style workflows: Break problems into steps, commit regularly, and make it easy to test and revert—just like you teach junior engineers ([32:05]).
- Current tools augment, but don’t replace, developers: Sujay emphasizes he uses AI for coding "100%" of the time—it makes him much faster and unlocks tasks he wouldn't have tackled before ([31:21]).
- The future: As benchmarks and evals become more standardized and open, AI models may converge on more stable, usable development patterns.
Episode summary compiled for listeners who want the inside track on full-stack AI coding benchmarks and the intersection of practical systems thinking with state-of-the-art AI.
