![[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang — Latent Space: The AI Engineer Podcast cover](https://substackcdn.com/feed/podcast/1084089/post/186610569/183dd75aed4203e2c58adcc0da042dcd.jpg)
B
We're here at NeurIPS with John Yang of SWE-bench and many other things. Welcome.
A
Thanks so much for having me. Yeah, really happy to be here.
B
Last year I talked to Ofir and, I think, Carlos as well, one of your co-authors. How's SWE-bench doing? Just generally, the project is like one and a half years old.
A
Yeah. I think one and a half years old in terms of when it was actually useful. We put it out October 2023, and then people didn't really touch it too much. And then, of course, Cognition came on the scene, and Devin was an amazing release. And I think after that, it kind of kicked off the arms race.
B
Did they tell you beforehand? Or did they just show up?
A
You know, I got an email about, like, two weeks before. I think it was from Walden. He was like, hey, you know, we have a number on it. I was like, wow, congrats. You know, thanks for using it. And then the release was, like, mind-blowing. I was like, wow, these guys did an excellent job.
B
Yeah, amazing. And then SWE-bench Verified was, like, maybe last year.
A
That's right. Yeah.
B
Catch us up on this year. You have other languages. There's a whole bunch of varieties of SWE-bench now. So what should people know?
A
Yeah, for sure. I think there's a couple of extensions that have happened. One is, like, more SWE-benches: SWE-bench Pro, SWE-bench Live.
B
Oh, SWE-bench Pro. Was that with you guys? Because it looks independent. It's, like, different authors.
A
It's completely independent. Yeah.
B
So they just called it SWE-bench Pro without your blessing?
A
Yeah, I think we're okay with it. When it came out, we were like, oh, cool, interesting. It would have been, you know, fun to be part of it. But, I mean, congrats to them. It's a great benchmark.
B
But, yeah, multimodal.
A
Yeah, we did multimodal and multilingual, and I think, like, those have... Multilingual seems to be...
B
Is it like JavaScript? What else?
A
Multilingual.
B
It's like.
A
It's like nine languages across, like, 40 repos. But yeah, you've got them: JavaScript, Rust, Java, C, you know, Ruby.
B
And then core SWE-bench itself. A lot of people talk about the Django focus. How do we move past Django?
A
Yeah, for sure. I mean, it's cool to see a lot of the newer benchmarks really try to diversify the repos. In the two follow-ups we did, with multimodal and multilingual, we made it a point to do that. So I think...
B
But you can also just put out SWE-bench 2025 and just...
A
That is true. And do a new distribution. Yeah. So it's been cool to see the follow-ups. I think, quietly, it's an open question for me. I'm excited to see how people curate the next sets. It's kind of interesting to see, in the literature or in their blog posts, how they're justifying why they're creating their separate split. The easier ones were like, oh, more languages, more repos. And then I think now people are like, well, ours is more difficult because of this curation technique. And I'm excited to see how long that lasts and, you know, where we're going to guide the evaluations towards.
B
Yeah. And more recently you're working on CodeClash.
A
Yes, that's right.
B
So let's get people... You've already done other episodes, other podcasts about it, so I'll refer people to that, your chat with Andy, but just give people, like, a one- or two-sentence version.
A
Yeah, no, happy to do it, especially on your podcast. So basically the idea is, I don't like unit tests as a form of verification. And I also think there's an issue with SWE-bench where all of the task instances are independent of each other. So the moment you have the model submit, it's done, and that's the end of the story, end of the episode. So with CodeClash, what we're thinking is, let's try to really evaluate long-horizon development, and development on a codebase that is consequential and conditioned upon what a model did before to that codebase. And so the general idea is you have two or more language models and they play a programming tournament. What that means is each model maintains its own codebase. And in each round of the tournament, first they get to edit and improve their codebase however they see fit. Very self-determined. And then in the competition phase, those two codebases are pitted against each other. So the codebases are run, and there's generally an arena. We have a lot of diverse arenas, but the arena determines whether codebase A is better than codebase B. And then you kind of repeat that across multiple...
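Editor's note: the tournament structure John describes (an edit phase followed by a competition phase, repeated over rounds, with each codebase carrying over from round to round) can be sketched roughly as below. This is a toy illustration, not CodeClash's actual code: `Player`, `run_tournament`, and `longest_code_arena` are hypothetical names, codebases are stood in for by plain strings, and the arena's scoring rule is invented purely so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Player:
    """One competitor; in the real setting `edit` would be a language-model agent."""
    name: str
    edit: Callable[[str, int], str]  # hypothetical: (current codebase, round) -> new codebase
    codebase: str = ""               # persists across rounds, so edits accumulate
    wins: int = 0


def longest_code_arena(codebases: List[str]) -> int:
    """Toy arena (an assumption): the longest codebase wins, ties go to the first."""
    return max(range(len(codebases)), key=lambda i: len(codebases[i]))


def run_tournament(
    players: List[Player],
    arena: Callable[[List[str]], int],
    rounds: int,
) -> Dict[str, int]:
    for round_index in range(rounds):
        # Edit phase: each player improves its own codebase however it sees fit.
        for player in players:
            player.codebase = player.edit(player.codebase, round_index)
        # Competition phase: the arena decides which codebase is better this round.
        winner = arena([player.codebase for player in players])
        players[winner].wins += 1
    return {player.name: player.wins for player in players}
```

Because `codebase` is never reset between rounds, each round's edit starts from whatever the player did earlier, which is the dependence between steps that John contrasts with independent task instances.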
B
As determined by an LLM judge?
A
Yeah, so an LLM judge is definitely one of the mechanisms. We started with some pretty simple programming games. So one of the cooler ones is Halite, which...
B
Oh yeah, I played it for Jane Street.
A
Yes, that's right. That's awesome. Halite one, two, three. Michael Truell of Cursor wrote this game. Two Sigma? Jane Street? Oh, Two Sigma. Two Sigma.
B
I worked at Two Sigma.
A
I'm like, oh, there you go.
B
Yeah, this was too long ago.
A
There you go. Yeah. 2016 at this point. But we're bringing it back.
B
You know, Halite is fun, I would say, if you've never done a programmatic competition where you have to control fleets of ships and attack things and defend things and collect resources. It's like playing StarCraft, but you can code.
A
Yeah, exactly. Exactly. Yeah. Yeah.
B
A lot of games.
A
Yeah.
B
Are there non games or are you focusing on games?
A
I think that's an excellent point. So for the initial release, for scientific purposes, we kind of used existing programming games. The current ongoing effort is, you know, to build economically valuable arenas. That's the popular word these days.
B
So yeah, SWE-Lancer is a big one this year.
A
Yeah. GDPval. Awesome. I mean, I think the big selling point of Terminal-Bench and SWE-bench and these evals is that they were really close to real-world utility. And so I think that's resolvable for CodeClash, and that's what we're working on.
B
Okay.
A
Yeah.
B
So you're part of Ofir's group.
A
Yes.
B
The other students have also been putting out a lot of other stuff. What would you highlight?
A
Yeah, no, I mean, Ofir is such a prolific mentor when it comes to benchmarking. SWE-fficiency I really like, in the line of performance.
B
What's the TL;DR on that one?
A
Yeah, for sure. So SWE-fficiency was written by this PhD student called Jeffrey Ma, who happened to be my high school classmate. And the idea there is you take a codebase and you just want to, you know, do modifications that will literally make the code run faster. So I think this is like parallelization, SIMD operations, stuff like that.
B
Yeah, no, no behavior change, just faster.
A
Exactly. Keep the unit tests passing, but I want better runtime. And then there's AlgoTune, which is kind of in line with that. And then there's also kind of pushing along the scientific coding domain. SciCode is awesome. They did like a quick... and for people...
B
The way I explain SciCode is, it's HumanEval but better.
A
Yes, exactly. I think there's a lot of good stuff these days where... yeah, that's the way to go.
B
Which is, like, SWE-bench is expensive to run. Any agentic benchmark is expensive to run.
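Editor's note: the criterion discussed for SWE-fficiency above (behavior must not change, and the code must get faster) can be illustrated with a small harness. This is a minimal sketch under stated assumptions, not the benchmark's real evaluation code: `evaluate_speedup`, `sum_squares_loop`, and `sum_squares_closed_form` are hypothetical names, and comparing outputs on a handful of inputs stands in for a repository's unit tests.

```python
import timeit
from typing import Callable, Dict, Iterable, Optional


def sum_squares_loop(n: int) -> int:
    """Baseline: straightforward loop over 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_closed_form(n: int) -> int:
    """Candidate optimization: closed form for the same sum."""
    return (n - 1) * n * (2 * n - 1) // 6


def evaluate_speedup(
    baseline: Callable[[int], int],
    candidate: Callable[[int], int],
    test_inputs: Iterable[int],
    repeats: int = 5,
) -> Dict[str, Optional[object]]:
    inputs = list(test_inputs)
    # Correctness gate: any behavior change disqualifies the candidate.
    for x in inputs:
        if candidate(x) != baseline(x):
            return {"correct": False, "speedup": None}

    def best_time(fn: Callable[[int], int]) -> float:
        # Take the minimum over several runs to reduce timing noise.
        return min(timeit.repeat(lambda: [fn(x) for x in inputs], number=1, repeat=repeats))

    t_base = best_time(baseline)
    t_cand = max(best_time(candidate), 1e-12)  # guard against a zero reading
    return {"correct": True, "speedup": t_base / t_cand}
```

A speedup is only reported once the correctness gate passes, mirroring the "no behavior change, just faster" framing in the conversation.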
A
Yeah.
B
You actually do need some completion benchmarks. Yeah, just complete.
A
Exactly. You know, you can do well on those first and then sort of graduate to the multi-turn, expensive stuff.
B
Okay. Other than that, just broadly, other work in the field in 2025 in terms of coding evals. Obviously we should shout out METR. They use SWE-bench and they have a very interesting, I guess, human-hours-worked number.
A
Yeah, they have the x-axis being sort of the runtime and the y-axis being the completion. You know, like, we can do more long-running SWE-agent tasks. I think the projections are quite interesting, and I definitely appreciate them using SWE-bench Verified to sort of proxy a lot of these things. But yeah, they're great.
B
Any other work that caught your eye?
A
Yeah, I mean, I think within the... okay, Terminal-Bench, SWE-bench. CritPt was kind of cool.
B
CritPt?
A
Yeah, it's a very new benchmark that Ofir did, and I think it's kind of related to physics. There's this one called SEC-bench, kind of related to cybersecurity. SRE-bench, which I think is affiliated with lod. It's just cool to see people really dive into different coding domains. And then, stepping a little bit outside of coding, I personally think it's quite interesting to think about the user simulator stuff. So like τ-bench, Vending-Bench 2? Yeah, and on Vending-Bench I've got mixed feelings. Yeah, no, I'm interested.
B
Well, I mean it's like, it's like you're sampling one path. I don't know how realistic it is, to be honest. Yeah, it's just. But it is cool.
A
No, for sure. I agree. I think it's a good initial effort. To me, it's super cool to see companies like, you know, I'm sure Mercor and so on, focusing on building environments for code, and beyond code. And so I think it might be interesting to have, like, WorkGym-style stuff. This is stuff that my advisor Diyi Yang at Stanford thinks about a lot.
B
Yeah, I just realized we're talking about Terminal-Bench. Yes, we have a lot of folks.
A
Yeah, yeah.
B
You know, really, really good work, just overall. Yeah, let's talk about τ-bench, because you mentioned τ-bench.
A
Yes, yes.
B
There's some discussion, or some people are saying, that τ-bench is impossible to get a high score on because some of the tasks are underspecified or just impossible.
A
Yeah.
B
I don't know if you're up to speed on that.
A
I'm a little bit.
B
A little spicy.
A
Yeah, it's a bit spicy. So, you know, I worked with Shunyu and Karthik back at Princeton very closely. I think Karthik, I just saw, posted a tweet kind of rebutting some of these claims. I mean, I think I get the concern, but I think it also brings up interesting research problems to solve, like, okay, why is it impossible? Is it the ambiguity? Is it the user simulator that has issues? And I think generally we all agree that, you know, we'll improve on these things over time...
B
So I actually really like benchmarks that intentionally... I think we should intentionally include impossible tasks as a flag. Of like, hey, you're cheating.
A
Yes.
B
It's kind of sad that Karthik actually is defending it, because the master move would be like, oh, yeah, you caught us. Like, everyone reporting above 75 on τ-bench retail, you've been cheating.
A
Yeah. Oh, interesting. That would be cool. I mean, you'll have to ask the τ-bench authors, but no, that's fun. I think ImpossibleBench was a recent benchmark. Maybe from... was it from Anthropic? I don't know. But they basically took SWE-bench Verified and they changed the issues to make them impossible. And they checked how often the models would be like, I actually just can't do this, I don't know what's going on.
B
Oh, like, for refusals.
A
Yes, yes, yes. So.
B
Oh, how did they do?
A
I thought that was interesting. I think the models are all kind of attempting it and saying, like, oh, I did it. So maybe not great.
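Editor's note: the "impossible tasks as a flag" idea raised in this exchange amounts to seeding a benchmark with canary tasks that no honest run should pass, then auditing reported results against them. The sketch below is a hypothetical illustration of that audit only; `audit_run` and the task IDs are invented for the example, and it is not how τ-bench or ImpossibleBench are actually scored.

```python
from typing import Dict, List, Set, Union


def audit_run(
    results: Dict[str, bool],
    impossible_ids: Set[str],
) -> Dict[str, Union[float, List[str]]]:
    """Score a run on solvable tasks and list any impossible tasks it claims to pass."""
    # A reported pass on a task known to be impossible is a red flag.
    suspicious = sorted(t for t in impossible_ids if results.get(t, False))
    # The headline score is computed only over tasks that can actually be solved.
    solvable = [t for t in results if t not in impossible_ids]
    score = sum(results[t] for t in solvable) / len(solvable) if solvable else 0.0
    return {"score_on_solvable": score, "suspicious_passes": suspicious}
```

For example, a run reporting `{"a": True, "b": False, "x": True}` with `"x"` marked impossible would score 0.5 on the solvable tasks and have `"x"` listed as a suspicious pass.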
B
That's cool. But no, that's an important one.
A
Yeah.
B
How do coding evals evolve next year?
A
Wow, that's a great question. I mean, honestly, I think people will make more SWE-benches. I think Terminal-Bench has really got something going, where you ask people to... You know, with SWE-bench, you're confined in some sense to the domain of issues and PRs that already exist, which I think has its benefits of being close to reality and natural. But I think with Terminal-Bench, there's a lot of creativity that you can infuse into that. So I would personally be really excited. The 2.0 job was really excellent, and I'd be super excited to see, you know, 3.0, 4.0.
B
Because of like the environments.
A
Yeah, I mean, the environments, and bringing more people into the fold. I think, correct me if I'm wrong, Mike, but early on you had PhD students, very smart CS people, who were adding tasks. And, you know, what does that look like when you fold in more coding environments for non-coding tasks, non-coding environments in general, and ask people to make stuff there? So that's pretty cool. And then of course, for myself, I think this long-running SWE-agent kind of thing just feels very compelling. I think the vision is like, hey, I tell it a goal. I don't have to be super specific about my task. I have a decent verifier that proxies what I want, something literally like "a codebase that makes the most money in this setting". That's my verifier. And I walk away for five hours. The thing is just running. I'm hanging out with you, talking to my friends. I come back and it gives me, like, literally a SOTA codebase on that task. I think that would be super cool.
B
Okay, I'll push back. I'm part of Cognition.
A
Yes.
B
And we are emphasizing a lot of interactivity, because the point is that you're going to underspecify.
A
Right, right.
B
And actually what people want is back and forth, back and forth, on a really fast time frame. Which is terrible for a benchmark author, right? Because how do you do that? But it's realistic.
A
Yeah.
B
So I think this is where I'm a little bit anxious or cautious about this push for long autonomy.
A
Right.
B
I mean, let's say this time next year... five hours is pessimistic. It'll be 24.
A
Yeah. Right. Days.
B
But I don't know if that actually materially changes the industry. So, you know, as the people who make evals...
A
Yeah.
B
We push the industry in the ways that we want it to be pushed. But I don't know if that's a productive way, because that's more of a stunt. It's a proof of concept, an existence proof that it can be done.
A
Yeah.
B
But will you use it in practice, for real life?
A
Yeah. I mean, honestly, to me, I think there's potentially room for growth. So I would actually agree with your take here. I mean, with my lab at Stanford, with Diyi, her emphasis is on human-AI collaboration, and so I definitely don't believe in this idea of just kind of getting rid of the human. But maybe it's about finding the balance. Because the developer ecosystem is so diverse, and there are so many participants in it who want different things out of it, it's about enabling different levels of abstraction. And it depends on the task. There are settings where you want to be more involved and more hands-on, and so you want to use Windsurf for that. But then maybe there's this general data processing thing, a lot of JSON parsing you don't really care about, and that's the one I kind of want to walk away from and just let it figure out. So yeah, I would agree with you generally.
B
Amazing. Any calls to action? What do you want help on? How can people, I guess, find more of your work?
A
Definitely. For the call to action: super jealous of all the great data that Cognition and, you know, Cursor get. That user interaction data is really fascinating from an academic standpoint. It feels like there are two difficult approaches to resolving that. Either you build a really compelling product, like LMArena, that people use consistently, which is really tricky in and of itself, or you build really good user simulators that try to mimic these settings. But that is also non-trivial. I don't think it's as simple as "hey ChatGPT, act like a human", right? So it would be really cool to get inspiration on what exactly that data looks like, or, between the two, what's the best way to scale up evaluating human-AI interaction. And then, for visibility for my own work, we're pushing more arenas for CodeClash. What I'm excited about is that the current framing is really long-running SWE-agents. But you could have multi-agent setups, like two agents working together on the codebase, and what happens? You have a human and an agent working on the codebase versus just AIs. What happens there? And when the models improve, and hopefully they hill-climb and become better at digesting logs and iterating on analysis, how does human-AI interaction change with model capability? And so I'm trying to inspire and convince people that it's a very cool testbed where you can do a lot of different combinations of human and AI on different arenas, playing one arena at a time, N arenas at a time, you know.
B
Yeah, I think very interested to work with you on the interaction stuff.
A
That would be awesome.
B
And then I think one more thing I'll add is that Cognition is going to be pushing a lot of codebase understanding, which is kind of codebase retrieval.
A
Yes.
B
And mostly it is helping humans understand their own codebases better, to sort of mind-meld the human with the machine to do the highest possible task, which LLMs could not do alone and humans couldn't do alone. And then the other thing is basically automatic context engineering for an LLM. So that is sort of like a research subagent that we're working on.
A
That's so awesome. Yeah.
B
So I don't know what the benchmark would be, because how do you benchmark understanding? I think it's mostly, like, you freeze a repo, have some manually curated answers, and then pose trivia questions. That's very easy to saturate. So I don't know how else.
A
Yeah, I think Silas tweeted a while ago about sort of the wiki, the code wiki. That's incredible. I mean, I use...
B
Google actually just came out with their own version. Oh yeah, yeah.
A
With the Antigravity people?
B
No, no, no. This is like a separate.
A
It's a different one. Gotcha, gotcha. But cool.
B
That's the state of code.
A
Yep.
Episode: State of Code Evals
Guest: John Yang
Date: December 31, 2025
This episode dives deep into the state of code evaluation benchmarks in the rapidly advancing world of AI code generation and software engineering. Host Latent.Space sits down with John Yang, creator of SWE-bench and CodeClash, to discuss the evolution of code evaluation (eval) benchmarks, the move towards more realistic and difficult tests, the importance of multimodal and multilingual benchmarks, and the future of agentic coding and human-AI collaboration. The conversation covers landmark projects, Big Tech and startup contributions, lively debates over benchmark design, and what lies ahead for AI agents coding in real-world environments.
[00:12–03:07]
Origins and Uptake
Benchmark Derivatives
Multimodal & Multilingual Advances
Moving Beyond Django
[03:07–05:56]
Limitations of Unit Tests & Independent Tasks
Programming Tournaments & Long-Horizon Evaluation
Example: Halite Programming Game
Beyond Games: Real-World, Economically Valuable Arenas
[06:00–08:00]
The Ofir Lab’s Prolific Output
Cost & Practicality
[07:20–09:01]
Notable Benchmarks
Work Gym-style Environments
[09:15–10:59]
τ-bench Controversy
Intentional Impossible Tasks
ImpossibleBench
[11:03–14:31]
Terminal-Bench: Creativity and Scalability
Long Autonomy vs. Human-in-the-loop
Diversity in Developer Needs
[14:31–17:31]
Unlocking User Interaction Data
CodeClash as a Testbed
Code Understanding & Context Engineering
Open Benchmarking Questions
On the Benchmarking Arms Race:
“After that, it kind of kicked off the arms race.” — John ([00:42])
On Programming Game Benchmarks:
“If you've never done a programmatic competition… play Starcraft but you can code.” — Latent.Space ([05:04])
On Impossible Tasks in Benchmarks:
“I think we should intentionally include impossible tasks as a flag… everyone reporting above 75 on τ-bench, you’ve been cheating.” — Latent.Space ([10:04])
On the Human in the Loop:
“I definitely don’t believe in this idea of just kind of getting rid of the human… enabling different levels of abstraction…” — John ([13:38])
On Calls to Action:
“That user interaction data is like really fascinating from an academic standpoint… what’s the best way to scale up sort of evaluating human AI interaction?” — John ([14:37])
This comprehensive episode covers the rapid innovation and debate swirling around coding benchmarks for evaluating advanced AI and agentic systems in software engineering. John Yang and Latent.Space give listeners a front-row seat to the key players, evaluative philosophies, and unresolved challenges in the space, while reflecting on where the next generation of AI-powered coding—and the benchmarks that test them—might lead.
For further resources, detailed show notes, and referenced papers, visit latent.space.