Transcript
B (0:11)
Welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to lastweekin.ai for our text newsletter with even more stuff we won't be touching on. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade.
A (0:37)
And I'm your co-host, Jeremie Harris. I'm back on for the second episode in a row, so kind of exciting. We're back, and I was just telling Andrey I had a kind of topical story: I was using a language model to help me with a code base that I've been working on, and it's tied to a database, and I just mindlessly pasted code from the chatbot trying to solve a problem, and it fucked my entire database. So that's how my Friday is going, you guys. That's how my Friday's going. So forgive me if I'm a little on edge. That's the reason.
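For anyone who wants the guardrail version of that lesson: one pattern is to run any chatbot-suggested SQL inside a transaction and inspect the damage before committing. Here's a minimal sketch, assuming a SQLite database; the `llm_suggested_sql` string and the `app.db` file are hypothetical stand-ins, not details from the episode.

```python
import sqlite3

# Hypothetical SQL pasted from a chatbot -- assume it may be destructive.
llm_suggested_sql = "DELETE FROM orders WHERE status = 'stale';"

conn = sqlite3.connect("app.db")
try:
    cur = conn.cursor()
    # sqlite3 opens an implicit transaction for DML, so nothing is
    # permanent until we explicitly commit.
    cur.execute(llm_suggested_sql)
    print(f"Rows affected: {cur.rowcount}")  # inspect the blast radius first
    if input("Commit? [y/N] ").strip().lower() == "y":
        conn.commit()
    else:
        conn.rollback()  # the database on disk is untouched
finally:
    conn.close()
```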
B (1:11)
So, well, you know, it's one of those things you learn the hard way. I'm sure you won't repeat that mistake. Well, to preview the episode, of course we'll be starting off with GPT-5.2, just announced yesterday, the exciting news of the week. Other than that, nothing too big, just a variety of stories: some updates on US and China relations, Disney and OpenAI had an interesting business arrangement, and quite a few papers of a variety of types. We've got robotics stuff, some stuff on scaling agents, RL for reasoning, a lot of things. So it'll be a bit of a technical episode, I guess, and we're going to have to get going and try to get through it all in time.

So, starting with GPT-5.2: they announced this just yesterday, and this is meant to be their big getting-back-into-the-leadership-position announcement. The big deal here was pretty much the benchmarks, right? It is now neck and neck with, or competitive with, Gemini 3 and generally smarter. Yes, I believe it's Gemini 3 Pro-ish. So yeah, there's not too much to say on my end here. One interesting thing is it is more expensive. The input price for GPT-5.1 was $1.25 per million tokens; for GPT-5.2 it's $1.75. The output is about 40% more expensive too. So that's pretty unusual; usually you see pricing within model families not change too much. One other interesting thing is this has a different knowledge cutoff than GPT-5.1. The previous one was September 30; the new one is August 31. So that's kind of interesting as well. The knowledge cutoff changing in that way perhaps indicates that they're continually training, and this is really just like they cut off a point in their training and it is better than the previous one.
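To put the pricing in concrete terms, here's a quick back-of-envelope cost comparison using the per-million-token input prices quoted above; the output prices in the sketch are assumed placeholders chosen only to illustrate a roughly 40% bump, not confirmed figures.

```python
# $ per million tokens. Input prices are the ones quoted in the episode;
# output prices are illustrative assumptions showing a ~40% increase.
PRICES = {
    "gpt-5.1": {"input": 1.25, "output": 10.00},  # output assumed
    "gpt-5.2": {"input": 1.75, "output": 14.00},  # output assumed (~40% more)
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical chat turn: 2k tokens in, 1k tokens out.
for model in PRICES:
    print(model, f"${request_cost(model, 2_000, 1_000):.4f}")
```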
A (3:25)
Absolutely, yeah. And you mentioned the evals are a big part of the announcement; that's absolutely the case. We know very little about what GPT-5.2 actually is, other than the fact that it builds on the safe completions research that OpenAI did previously, a kind of new alignment technique that they're workshopping, and I think we actually have an episode on that from when it came out. A bunch of highlights from the evals. Right. So they've got this GDPval, which, by the way, I was not around when this dropped, so I had to look into what GDPval was. It's basically an eval of a whole bunch of knowledge work tasks across 44 different occupations. This may not be news to listeners if you've been tuning in this whole time; it was to me. Sounds like a really cool eval, actually. The idea being, presumably, to assess when AI systems are on course to radically change the GDP of the world by automating straight-up white collar jobs. So here we have GPT-5.2 thinking beating or tying top industry professionals, which is what GDPval measures, on 71% of comparisons. So that's pretty impressive. Human expert judges were actually used for that, so you're not subject to LLM-as-a-judge type errors. And they say, you know, these are obviously the top lines in part of the press release here, but GPT-5.2 thinking produced output for GDPval tasks at over 11 times the speed and less than 1% of the cost of expert professionals. How these things translate in the real world is always the big question, but that's a pretty interesting stat. Also a 30% less frequent hallucination rate than 5.1 thinking.

And then the other piece was SWE-Bench Pro. This, by the way, is a much harder benchmark than SWE-Bench Verified, which we've talked about a lot in the past. To give you an idea, on the Verified benchmark you'll see top models often scoring anywhere from like 70 to 80%, somewhere in that range; I think Claude 4.5 was in the kind of high 70s, 77 or so. Whereas on Pro, performance typically drops to like 40 to 55%. What we're seeing here is that GPT-5.2 thinking is at the very, very top of that range, 55.6%. On SWE-Bench Pro, Claude Opus 4.5 hits like 46%. All these things have some error margin, right, because it depends on exactly how you run the test. But by and large, again on the evals, this suggests really good performance relative to the market. We've got to wait for the, you know, for the sniff tests to come out, for people to actually playtest with it. But everything from the needle-in-a-haystack tests to, you know, all the SWE-Bench stuff, to even some of the image stuff: they've got a cool demo where they show an image of basically a motherboard to the model, and it goes through and identifies all the little components on the motherboard. It's got, like, oh man, yeah, the PCIe, the serial ports, HDMI, RAM slots, chips, all these things. And they're comparing it to what its previous performance was with 5.1, and you really do see an impressive shift in the multimodal capabilities, the image capabilities too. So that's all part of this release. Again, time to wait for the vibe check and see how it actually works in practice.
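For a sense of how a "beats or ties experts on 71% of comparisons" figure gets computed, here's a toy sketch of aggregating pairwise judge verdicts; the data and labels are made up for illustration and this is not OpenAI's actual GDPval harness.

```python
from collections import Counter

# Each entry is a human judge's verdict on one (model output, expert output)
# pair: "win" (model preferred), "tie", or "loss". Toy data, not real results.
verdicts = ["win", "tie", "loss", "win", "win", "tie", "loss", "win", "win", "tie"]

counts = Counter(verdicts)
win_or_tie = (counts["win"] + counts["tie"]) / len(verdicts)
print(f"Model beats or ties the expert on {win_or_tie:.0%} of comparisons")
```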
