Latent Space: The AI Engineer Podcast

Episode: State of Code Evals
Guest: John Yang
Date: December 31, 2025

Episode Overview

This episode dives deep into the state of code evaluation benchmarks in the rapidly advancing world of AI code generation and software engineering. Host Latent.Space sits down with John Yang, creator of SWE-bench and CodeClash, to discuss the evolution of code evaluation (eval) benchmarks, the move towards more realistic and difficult tests, the importance of multimodal and multilingual benchmarks, and the future of agentic coding and human-AI collaboration. The conversation covers landmark projects, Big Tech and startup contributions, lively debates over benchmark design, and what lies ahead for AI agents coding in real-world environments.

Key Discussion Points & Insights

1. SWE-bench: Impact and Evolution

[00:12–03:07]

Origins and Uptake
- SWE-bench was released in October 2023. It attracted little attention until major AI releases such as Cognition’s Devin put code agents in the spotlight.
- “The release was, like, mind blowing. I was like, wow, these guys did an excellent job.” (John, [00:55])
Benchmark Derivatives
- Multiple community-driven forks have emerged: SWE-bench Pro and SWE-bench Live. These are sometimes entirely independent, using the SWE-bench name but not directly involving the original authors.
  - “It's completely independent… It's a great benchmark.” (John, [01:31])
Multimodal & Multilingual Advances
- Expansion into nine programming languages across 40+ repositories, including JavaScript, Rust, Java, C, and Ruby.
  - “It's like nine languages across, like, 40 repos... JavaScript, Rust, Java, C, you know, Ruby.” (John, [01:58])
Moving Beyond Django
- Early feedback highlighted a heavy Django focus. Follow-ups consciously diversified repo and language selection.

2. The CodeClash Benchmark: Next-Gen Code Evals

[03:07–05:56]

Limitations of Unit Tests & Independent Tasks
- “I don’t like unit tests as a form of verification… all of the task instances are independent of each other.” (John, [03:20])
- “With CodeClash… let’s try to really evaluate Long Horizon development: development on a consequential codebase conditioned on previous steps.”
Programming Tournaments & Long-Horizon Evaluation
- In CodeClash, two or more language models iteratively improve their own codebase, then compete by running their solutions in a shared arena.
- “Each round...they edit and improve their code base...then the code bases are pitted against each other.” (John, [03:20])
- Judging can be done via LLM or programmatic criteria.
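The round structure described above can be sketched as a small loop. This is a minimal illustration, not CodeClash's actual implementation: `run_tournament`, the toy agents, and the arena function are all hypothetical names, and a "codebase" is reduced to a single number so the example stays runnable.

```python
from typing import Callable, Dict, List

# Hypothetical simplification: a "codebase" is a single numeric quality value,
# and an "agent" is any function that returns an edited codebase.
Codebase = float
Agent = Callable[[Codebase, List[str]], Codebase]
Arena = Callable[[Dict[str, Codebase]], str]


def run_tournament(agents: Dict[str, Agent], rounds: int, arena: Arena) -> Dict[str, int]:
    """Alternate an edit phase and a competition phase for a fixed number of rounds."""
    codebases: Dict[str, Codebase] = {name: 0.0 for name in agents}
    logs: Dict[str, List[str]] = {name: [] for name in agents}
    wins = {name: 0 for name in agents}
    for round_no in range(1, rounds + 1):
        # Edit phase: each agent edits its own codebase and sees its own past results.
        for name, agent in agents.items():
            codebases[name] = agent(codebases[name], logs[name])
        # Competition phase: the arena pits the current codebases against each other.
        winner = arena(codebases)
        wins[winner] += 1
        for name in agents:
            outcome = "win" if name == winner else "loss"
            logs[name].append(f"round {round_no}: {outcome}")
    return wins


def steady_agent(codebase: Codebase, log: List[str]) -> Codebase:
    return codebase + 1.0  # improves a little every round


def idle_agent(codebase: Codebase, log: List[str]) -> Codebase:
    return codebase  # never changes anything


def highest_quality_arena(codebases: Dict[str, Codebase]) -> str:
    # Programmatic judging: the codebase with the highest value wins.
    return max(codebases, key=codebases.get)


if __name__ == "__main__":
    agents = {"steady": steady_agent, "idle": idle_agent}
    print(run_tournament(agents, rounds=3, arena=highest_quality_arena))
```

The point of the sketch is that each codebase carries over from round to round, so later edits are conditioned on earlier ones, unlike a set of independent task instances.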
Example: Halite Programming Game
- Game-based settings (e.g., fleets-of-ships strategy) form the initial testbed for measuring code agent performance.
- “If you've never done a programmatic competition… play Starcraft but you can code.” (Latent.Space, [05:04])
Beyond Games: Real-World, Economically Valuable Arenas
- Aim to build benchmarks that resemble “real world utility,” learning from successes like Terminal Bench and SWE-bench.
- “The big selling point...was that it was close to real world utility. And so I think it’s resolvable for CodeClash and that’s what we’re working on.” (John, [05:41])

3. The Explosive Growth of Code Evals in 2025

[06:00–08:00]

Ofir Press’s Lab: Prolific Output
- Spotlight on benchmarks: SWE-fficiency (code optimization for speed), AlgoTune (algorithmic tuning), SciCode ("human eval but better").
  - “SWE-fficiency… you take a code base… do modifications that make the code run faster.” (John, [06:13])
  - “SciCode… is human eval but better.” (Latent.Space, [06:54])
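The idea of scoring a change by how much faster it makes the code can be sketched as follows. This is an illustrative assumption about how such a check might look, not the benchmark's real harness: `best_time`, `speedup`, and the two toy functions are hypothetical, and the correctness check simply compares outputs on the workload.

```python
import time
from typing import Any, Callable, Iterable, List


def best_time(fn: Callable[[Any], Any], workload: List[Any], repeats: int = 5) -> float:
    """Return the fastest wall-clock time over several runs of the whole workload."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in workload:
            fn(item)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-12)  # guard against a zero reading on coarse timers


def speedup(baseline: Callable[[Any], Any], candidate: Callable[[Any], Any],
            workload: Iterable[Any], repeats: int = 5) -> float:
    """Ratio of baseline time to candidate time; raises if behavior changed."""
    items = list(workload)
    for item in items:
        # An optimization only counts if the outputs stay the same.
        if baseline(item) != candidate(item):
            raise ValueError(f"candidate output differs on input {item!r}")
    return best_time(baseline, items, repeats) / best_time(candidate, items, repeats)


def slow_sum(n: int) -> int:
    total = 0
    for i in range(n):
        total += i
    return total


def fast_sum(n: int) -> int:
    return n * (n - 1) // 2  # closed form for 0 + 1 + ... + (n - 1)


if __name__ == "__main__":
    ratio = speedup(slow_sum, fast_sum, workload=[20_000, 50_000])
    print(f"speedup: {ratio:.1f}x")
```

Taking the best of several runs is one common way to reduce timing noise; a real harness would likely also control the hardware and run the repository's own tests.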
Cost & Practicality
- Complex agentic benchmarks are expensive to run; single-turn completion tasks remain useful as a step before multi-turn, heavy-lift evals.

4. New Coding Benchmarks & Trends

[07:20–09:01]

Notable Benchmarks
- METR: Uses SWE-bench Verified and measures tasks by the human hours they take.
- Critical Point: Physics-focused, from Ofir Press’s lab.
- SEC Bench/SRE Bench: Focus on cybersecurity (SEC) and site reliability engineering (SRE).
- User Simulator Benchmarks: (e.g., Tau-bench, Vending-Bench) start to model actual user interactions, but with realism and sampling path concerns.
Work Gym-style Environments
- The field eyes more complex, “work gym”-like environments for human-AI collaboration in non-coding tasks and agent operations.

5. Debates on Benchmark Design & “Impossible Tasks”

[09:15–10:59]

Tau-bench Controversy
- Some tasks are so under-specified or difficult that they may be outright impossible, which has prompted debate over the benchmark's utility and fairness.
- “Some people are saying that Tau-bench is impossible to get a high score on because some of the tasks are under specified or just impossible.” (Latent.Space, [09:16])
Intentional Impossible Tasks
- Host advocates for intentional inclusion of impossible tasks to detect cheating:
  - “I think we should intentionally include impossible tasks as a flag...everyone reporting above 75 on Tau-bench, you’ve been cheating.” (Latent.Space, [10:04])
ImpossibleBench
- A newer benchmark whose tasks are unsolvable by design (modified from SWE-bench), testing whether models admit that they cannot complete a task.
  - “They checked like how often the models would be like, I actually just can't do this… All the models are...saying like, oh, I did it, you know, so maybe not great.” (John, [10:53])
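The "impossible tasks as a flag" idea can be sketched as a simple audit over reported results. Everything here is a hypothetical illustration: the `TaskResult` fields, the `audit` function, and the sample data are assumptions made for the example, not part of any benchmark's tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class TaskResult:
    task_id: str
    impossible: bool       # True if the task was made unsolvable on purpose
    claimed_success: bool  # True if the model reported the task as solved


def audit(results: List[TaskResult]) -> Dict[str, Union[float, bool]]:
    """Compare a reported score with the best score an honest run could reach."""
    if not results:
        raise ValueError("no results to audit")
    impossible = [r for r in results if r.impossible]
    # An honest run can solve at most the possible tasks.
    honest_ceiling = 1 - len(impossible) / len(results)
    reported_score = sum(r.claimed_success for r in results) / len(results)
    false_claims = sum(r.claimed_success for r in impossible)
    false_claim_rate = false_claims / len(impossible) if impossible else 0.0
    return {
        "honest_ceiling": honest_ceiling,
        "reported_score": reported_score,
        "false_claim_rate": false_claim_rate,
        # Any success claimed on an impossible task is a red flag.
        "suspicious": false_claims > 0 or reported_score > honest_ceiling,
    }


if __name__ == "__main__":
    run = [
        TaskResult("t1", impossible=False, claimed_success=True),
        TaskResult("t2", impossible=False, claimed_success=True),
        TaskResult("t3", impossible=False, claimed_success=True),
        TaskResult("t4", impossible=True, claimed_success=True),
    ]
    print(audit(run))
```

With one task in four made impossible, the honest ceiling is 0.75, which mirrors the "above 75" threshold in the host's remark; the actual share of impossible tasks in any given benchmark is not stated in the episode summary.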

6. Where Is Code Evaluation Headed?

[11:03–14:31]

Terminal Bench: Creativity and Scalability
- Terminal Bench is lauded for creative environments, broadening beyond “issues and PRs in real repos.”
  - “With Terminal Bench, there’s a lot of creativity you can infuse into that.” (John, [11:37])
Long Autonomy vs. Human-in-the-loop
- Debate over the value of long-horizon “agents that run for hours” vs. highly interactive, fast-feedback workflows AI engineers need.
  - “We are emphasizing a lot of interactivity...what people want is back and forth, back and forth on a really fast time frame.” (Latent.Space, [12:36])
  - “I definitely don’t believe in this idea of just kind of getting rid of the human...enabling different levels of abstraction...” (John, [13:38])
Diversity in Developer Needs
- Different code tasks require different degrees of autonomy and collaboration.

7. Calls to Action & Future Collaboration

[14:31–17:31]

Unlocking User Interaction Data
- John expresses a desire for real user-interaction data (like Cognition and Cursor possess) to inform better academic understanding.
  - “Super jealous of all the great data that Cognition and Cursor would get… building really good user simulators… is also like non trivial.” (John, [14:37])
CodeClash as a Testbed
- CodeClash can serve as a platform for multi-agent, human-AI collaboration, and competitive evaluation.
- “You could have multi agents… a human and agent work on the code base versus just AIs… how does human-AI interaction change with model capability?” (John, [15:55])
Code Understanding & Context Engineering
- Cognition is advancing in codebase understanding and context engineering for LLMs.
  - “It is helping humans understand their own code bases better to enable humans or to sort of mind meld the human with the machine…” (Latent.Space, [16:29])
Open Benchmarking Questions
- How to benchmark deep code understanding, beyond trivia and simple retrieval, remains an open research problem.

Notable Quotes & Memorable Moments

On the Benchmarking Arms Race:
“After that, it kind of kicked off the arms race.” — John ([00:42])
On Programming Game Benchmarks:
“If you've never done a programmatic competition… play Starcraft but you can code.” — Latent.Space ([05:04])
On Impossible Tasks in Benchmarks:
“I think we should intentionally include impossible tasks as a flag… everyone reporting above 75 on Tau-bench, you’ve been cheating.” — Latent.Space ([10:04])
On the Human in the Loop:
“I definitely don’t believe in this idea of just kind of getting rid of the human… enabling different levels of abstraction…” — John ([13:38])
On Calls to Action:
“That user interaction data is like really fascinating from an academic standpoint… what’s the best way to scale up sort of evaluating human AI interaction?” — John ([14:37])

Timestamps for Key Segments

SWE-bench Origins and Uptake: [00:12–01:19]
Multilingual/Multimodal SWE-bench: [01:48–02:18]
CodeClash Concept and Tournament Structure: [03:07–05:56]
New Benchmarks (SWE-fficiency, SciCode, etc.): [06:00–07:04]
Emerging Benchmarks and Environment Design: [07:20–08:34]
Tau-bench and Impossible Task Debate: [09:15–10:59]
Future of Eval: Autonomy vs. Interactivity: [11:03–14:31]
Open Calls & Data Needs: [14:31–17:31]

Conclusion

This comprehensive episode covers the rapid innovation and debate swirling around coding benchmarks for evaluating advanced AI and agentic systems in software engineering. John Yang and Latent.Space give listeners a front-row seat to the key players, evaluative philosophies, and unresolved challenges in the space, while reflecting on where the next generation of AI-powered coding—and the benchmarks that test them—might lead.

For further resources, detailed show notes, and referenced papers, visit latent.space.

