Latent Space Podcast Summary
Episode Overview
Episode: SWE-Bench-Dead: The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
Date: February 23, 2026
Guests: Mia Glaese (VP of Research, OpenAI), Olivia Watkins (OpenAI Frontier Evals Team)
Host: Latent Space
This episode dives deep into the retirement of SWE-Bench Verified, a foundational coding benchmark created to measure the progress of AI in code generation. Mia Glaese and Olivia Watkins from OpenAI discuss the history, evolution, challenges, and eventual saturation of SWE-Bench Verified, its pivotal role in industry evaluation, and the future of coding benchmarks—especially as AI capabilities accelerate. They discuss contamination issues, what’s next with more robust benchmarks like SWE-Bench Pro, and the broader implications for preparedness frameworks and evaluating AI’s real-world coding ability.
Key Discussion Points and Insights
1. The Legacy and Saturation of SWE-Bench Verified
-
Original Purpose: Designed as a “North Star” benchmark to measure the real-world coding ability of AI models. It involved giving a model a codebase, a real GitHub issue, and grading on whether the resulting patch passed specific tests.
-
Human Effort: Creation involved nearly 100 expert software engineers to curate and validate 500 tasks, requiring extensive multi-round human review for quality assurance.
“It was literally like many expert software engineers reviewing the problems like sequentially multiple times until…three different experts independently decided [they were valid].” — Mia Glaese [03:07]
Benchmarks Are Not Forever
- Saturation: As models improved, incremental gains (e.g., +0.1%) became meaningless due to increasingly narrow, “gamed” increments.
“At the point that we are now, we’re kind of starting to measure…the agent’s ability to correctly guess how to name a specific function, and that isn’t really what we want to measure at this point.” — Mia Glaese [09:19]
- Contamination: Widespread contamination (models having seen parts of the benchmark in their training data) made results unreliable.
“We found many instances of contamination across OpenAI models, across…Gemini Flash…and in all of these we saw things like regurgitating the ground truth solutions, [and] giving the task IDs.” — Olivia Watkins [11:55]
- Post-mortem: Even with efforts like “canaries” (unique tracking strings), open-source nature and popularity made complete isolation impossible.
2. Issues Discovered in Benchmark Content
- Unfair Tests: Many tasks had problems—narrow test requirements, underspecified instructions, or tests seeking unseen features.
“Over half of the problems…had some issue. The most common were overly narrow tests…expecting models to make undocumented design choices.” — Olivia Watkins [07:11]
- Benchmark Evolution: The process highlighted how benchmarks age and lose effectiveness as models reach and surpass their original scope.
3. Moving On: Embracing New and Harder Benchmarks
SWE-Bench Pro
- Rationale: Harder tasks, more diversity, longer solution times, multiple languages, and less contamination. Sourced to have longer, more complex, and diverse challenges.
“They’re just bigger and harder. There’s much more headroom [for measuring improvement].” — Olivia Watkins [10:37]
- Contamination Audit: New procedures to check for data leakage showed SWE-Bench Pro had significantly less contamination.
Future Directions
- What Should New Benchmarks Measure?
- Open-ended Design: Can agents make reasonable design decisions in under-specified situations?
- Quality of Code: Maintainability, cleanliness, and “taste” in code—beyond simply passing tests.
“Does it have design taste? Does it solve the problem the way my team likes to? Is the code nice, clean, maintainable?” — Mia Glaese [13:33]
- Evaluation Approaches:
- Human annotation (“slow, expensive”)
- LLM-proxy grading (faster but less nuanced)
- Hybrid approaches (example: GDP-Val using domain experts and human-created rubrics)
4. Alignment with Preparedness and Human Data
- Preparedness Framework: OpenAI’s internal tracking for “frontier risk”—ensuring they anticipate possible dual-use (good/bad) impacts of AI in biosecurity, cybersecurity, and research automation.
“The preparedness framework…tracks frontier risk. Coding is not all of automating research, but it is one very important key component.” — Olivia Watkins [22:08]
- Role of Human Data: Human-in-the-loop remains crucial, especially for nuanced judgments in coding and beyond.
5. Practical Challenges and Industry Collaboration
- Real-World Impact Metrics: Interest in metrics for actual industry impact—how much AI is used, replaced human labor, or improved productivity.
- Industry Collaboration: Strong advocacy for shared, open benchmark creation and improvement across labs and the broader AI community.
“We deeply appreciate other people…creating and sharing evals. We use them…we really encourage people to find more ways to create and share evals.” — Mia Glaese [23:12]
Notable Quotes and Memorable Moments
- On Saturation and Gaming Benchmarks
“There’s a group chat with all the labs and everyone just takes turns to increment like 0.1 on [benchmarks]…It’s not super convincing at this point.” — Host [01:29]
- On the Grit of Benchmark Creation
“Maybe it’s hard to overstate the amount of effort…it was literally…experts reviewing the problems sequentially multiple times.” — Mia Glaese [03:07]
- On Unfair Grading
“If you chose another reasonable name [for a function], the test would fail.” — Olivia Watkins [07:19]
- On Open Benchmarks’ Evolution
“Benchmarks go through [an] evolution. They start popular, then high performance makes extra improvements sort of meaningless.” — Mia Glaese [08:40]
Important Timestamps
- [01:04] – Main thesis: “SWE-Bench Verified is saturated and contaminated, shouldn’t be used for progress.”
- [03:07] – Human effort required to make SWE-Bench Verified reliable.
- [07:11] – Uncovered issue: Over half of problems had unfair or narrow tests.
- [09:19] – Why additional incremental improvements don’t matter anymore.
- [10:37] – Why SWE-Bench Pro is better suited as the next industry benchmark.
- [11:55] – Details of contamination analysis and results.
- [13:33] – Desired qualities in future coding benchmarks, beyond correctness.
- [15:51] – Hybrid evaluation using domain experts and LLM proxies.
- [22:08] – How SWE-Bench fit into OpenAI’s preparedness framework.
- [25:08] – Appeal for more real-world usage metrics and difficult, industry-relevant benchmarks.
- [25:40] – OpenAI’s direction: focus on real-world impact and broader preparedness.
Takeaways for AI Engineers and the Community
- SWE-Bench Verified was an important milestone, but it is now retired due to saturation and contamination.
- The industry must continually adapt benchmarks to keep up with advancing model capabilities and ensure genuine progress.
- Human-in-the-loop and expert-driven approaches remain essential for nuanced, open-ended coding tasks.
- Active, open collaboration in the benchmarking community is vital.
- Future benchmarks should reflect real-world complexity, diverse tasks, and measure practical impact.
- OpenAI is focusing on preparedness for dual-use risks and pushing the science of evaluation forward with greater transparency.
For More Information
Check out detailed show notes and additional resources at https://latent.space
