A
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can head on over to Last Week in AI for our email newsletter with even more articles. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade.
B
And hi everybody, I am Jeremy Harris. I'm your other regular co-host, from Gladstone AI, AI national security. As you guys will all know, and we were just talking about this, this is like the beginning of that holiday period. We're recording this on December 19th, so you can feel yourself falling into the black hole of the sort of Christmas holiday period, and things are starting to slow down. There was a burst of releases this week, though. There's some really interesting papers. We're almost hoping things slow down over the next week, because I think theoretically our next recording date would be Boxing Day. So we'll see how that goes. Other fun stuff.
A
We might have to delay that one.
B
That's right, we'll see. Well, we were talking about how, you know, maybe we'll have another DeepSeek-type thing where some Chinese lab comes out, because it's obviously not their sort of holiday period over there, and just tries to kind of throw a hand grenade the way of the West. Andrey was actually saying, hey, maybe it's some smaller lab or company that's just trying to take advantage of the opportunity when there's a lull. So that's, I think, a really interesting point. But anyway, the other thing is, I don't know if you can tell, for the third week in a row that we're recording I am sick, and this is actually not the same cold. It has been three different colds. So I highly recommend having kids. But you will find new flus, new rhinoviruses, new, what are the, I don't know, friggin' RSV. It's bad.
A
You're really like working out your immune system. So that's good.
B
That's right.
A
Get in good shape. And yeah, as you said, we are probably going to expect a bit of a slowdown, but we didn't have that much of a slowdown this past week. Gemini 3 Flash was announced and it's actually a big deal. So Google is capping off their year of a massive comeback, I guess you could say, with one last shot. Then we've got some interesting news on China as usual, some big fundraises once again, and we have quite a few open source stories this week, some pretty notable releases including from Nvidia, which is probably the biggest one. And then we'll round things out with a decent amount of research and discussion on policy-type topics. So let's go ahead and dive in with tools and apps and Gemini 3 Flash. This is their faster and more cost-effective version of Gemini 3, and it is really good. We already saw this with the previous Gemini Flash. It was very capable while also being cheap and fast. This one is better than Gemini 2.5 Pro, so the big previous iteration of Gemini. It even seems to outperform some top-of-the-line models, GPT-5.2, on some coding benchmarks, which some people questioned. But according to Google, or someone at Google, not only do they do distillation of Gemini 3 for this, and Gemini 3 Pro is already very good, it also is trained more via RL and just generally has more of their research progress involved. So yeah, very impressive scores for a faster, cheaper model, and it's now the default model in the Gemini app globally. From a consumer standpoint, from a competitive standpoint, I'm pretty sure this is better than GPT-5.2 low or whatever else OpenAI is calling their fast, cheap model these days. Another reason why people might start preferring Gemini over ChatGPT, potentially.
B
Yeah, going after that ever-critical enterprise and coding segment, right? That's where your revenue per token is going to be highest. That's why everybody's sort of waking up to the realization that, holy shit, Anthropic has, not a monopoly, but, you know, 60% market share in this critical segment. So everybody now is trying to get a piece. One of the things that might throw you off, and Andrey just explained why this is the case: when you look at the performance on SWE-bench Verified for agentic coding, you will actually find that Gemini 3 Flash in thinking mode scores higher than Gemini 3 Pro. Right? So that nominally would be really surprising. But what's happening, as Andrey said, is you're giving it extra RL training on coding specifically. So that really tells you that is the focus here. The tokenomics are looking really good: higher token efficiency by quite a wide margin for Gemini 3 Flash over Gemini 2.5 Flash and Gemini 3 Pro. No surprise on Gemini 3 Pro, of course. So yeah, I mean, it is a really impressive model. Humanity's Last Exam: the scores here are looking really strong in particular. It scores within about a percent of GPT-5.2, you know, the full-scale thing, on Humanity's Last Exam, which spans a whole bunch of really hard expert questions. I mean, we almost get tired here of saying really hard, expert. I mean PhD, no, I mean like post... It's like one of those things where it's just, you know, the chemistry, the bio, all these things. So it's again within 1% of the really full-scale GPT-5.2 release, at least at the time of writing here. That's really impressive for, again, nominally a small, efficient model. So yeah, the war of the models continues late into December.
A
Yeah, and not only is it better on tool-based coding, but even on things like ARC-AGI-2, which is meant to capture general kind of reasoning, it's doing better. It's close to Gemini 3 Pro on really most of the benchmarks. So it's a very strong, faster model that costs a fourth of the price of Gemini 3 Pro. I think if you're using APIs, there's a pretty strong chance you will consider switching to Gemini 3 Flash from whatever other models you're using there. And I think, yeah, it captures a lot of things. We have seen a massive improvement in distillation over the last few years. You know, even more so than big models, small models in the 1 billion, 8 billion, however many billion parameter range have gotten more and more impressive. And I'm sure that's part of the equation here: distillation has just gotten really good. The other piece here is RL. And I think the last thing I'll say on it is, compared to the other labs, Gemini actually hasn't shipped that much. Gemini 2.5 Pro was early 2025. We just recently got Gemini 3 Pro. Gemini 2.5 Flash was shortly after 2.5 Pro, I think around March. So it took them nine months to really get to the next gen, and they really knocked it out of the park, I guess. We discussed last week how, for GPT-5.2, the numbers looked really solid, and we needed to wait for the vibe check. The vibe check has been inconsistent. If you look on Twitter, some people are saying this is like a benchmaxed model. In case you don't know, benchmaxed is now a common term to know: you get really nice benchmark scores, but the model is maximized for benchmarks and in practice it's not that solid. Some people have expressed that GPT-5.2 is underwhelming. Others seem to think it's really good. And from what I've seen of the vibe check for Gemini 3 Flash so far, people are saying it's really solid.
B
Yeah, that's one of the weird things. And you know, you hear similar things on benchmaxing. We've heard it for some xAI releases. We've heard it for basically every lab with different models. And I think it's just part of the challenge of getting attention. You know, you launch a new model and there's four or five companies with competitive models, even if you're just looking at closed source. It's really hard to get people to peel away from their workflow and actually switch to a new model. At least that's what, in fact, some of the research that we talked about last week showed: it's surprising how sticky these models are. But yeah, so we'll see. To your point, too, on these slow releases: it might have to do with Google, right? Traditionally, Google is a little bit more cautious on the releases. They are, after all, the one big established hyperscaler among all the companies vying for dominance right now. You know, xAI is new, OpenAI is new, Anthropic is new. These are all new labs. But Google's been around for a long time with a big established reputation, customer base, enterprise kind of heft. So they're going to be a little bit more cautious with those rollouts. And that might be part of what we're seeing here.
A
Right. And they are actually shipping it out across, you know, Gemini API, Vertex AI, Gemini Enterprise, the Gemini app, AI Mode, Gemini CLI. So they also are rolling out these things rapidly across their many surfaces. And it also makes me think back: two years ago, one year ago, Google had this thing, and really all the labs had this thing, of pre-announcing models, saying, oh, we have this model and we have really good results, and then you have to wait for a while for it to come out. Now there's this new kind of pattern where you announce a model and it's out for you to use right away. I think we've gotten to a point where they kind of can flex and not worry too much. And next up, we've got an update from OpenAI. A few months ago, they announced the availability of apps within ChatGPT. The idea there being that you can essentially launch a little mini program through ChatGPT. So if you were to say, hey, help me edit this photo: recently we saw Adobe partner with them, and there's a special version of Adobe Express that gets launched within ChatGPT with, you know, custom UI and so on. So this week it's been announced that the App Store is open for business, so third-party developers can now actually submit their apps. Before this, OpenAI was partnering with specific companies like Spotify, Zillow and Canva. And now there is an Apps SDK in beta which allows anyone to release their apps. It's an interesting play and potentially smart, given that now there are 800 million users of ChatGPT. I forget, I think it's weekly users. And if you're an app developer, if you are Adobe for instance, and you want people to start using your stuff and paying for your stuff, well, with that larger user base, if someone is just saying, let me edit a photo, and ChatGPT serves them up something from Adobe, it's a potentially powerful way to get new users.
So I could see this App Store concept taking off, as opposed to things like the GPT Store and OpenAI's previous attempts at this kind of thing, which basically fell flat.
B
Yeah, and I mean, this is the flip side of that consumer-facing coin, right? We talked about how Anthropic makes a lot more money per token because they're more in the enterprise segment, the coding segment. Well, OpenAI owns consumer mindshare, as you said. They just have more people on the platform day to day, and so they just have distribution. Like, they have become their own distribution engine. Not that Anthropic isn't, certainly not that Google isn't. But this creates an option for a kind of leveraged play for OpenAI that takes advantage of that massive consumer market that they have on the platform. So yeah, I mean, this is a big deal. These integrations you mentioned, Expedia, Spotify, Zillow, Canva, you know, all these companies that they reached out to, to make sure that they had an integration to offer on launch day. That's a big deal, right? When you think about how developers are gonna use this, they're gonna want third-party tools. That's kind of the point. And so yeah, I mean, expect this rollout. I'm just waiting for Sam Altman to get up on a stage and start dancing and shouting "Developers!" Side joke for the tech world.
A
It's a classic for anyone in tech. Yeah. And I think it really highlights how OpenAI has really gone deep into being a product company. A few months ago they launched Pulse. I don't know if anyone remembers Pulse, but I believe it's still in there, where when you open ChatGPT, it gives you this little daily update. Now they're launching this. They kind of want to be a quasi-everything app, it seems, where you can go to ChatGPT to do anything, including these more UI-heavy tasks. So it certainly shows the ambition. We'll see if it actually takes off. Next up, actually another story from OpenAI, with a model release: GPT-5.2 Codex. The Codex family of GPTs is their coding-specific models, so this is an agentic coding model. And, you know, as you might expect, it beats the benchmarks and is now industry leading. It apparently has stronger vision capabilities, so you can give it screenshots and it is able to then translate them into programs. It's just all-around better, and it is now being rolled out for various people to start using.
B
It does look like a legitimately impressive model. You know, the SWE-Bench Pro benchmark scores are top of the line: 56.4 is what they show there. These vary a little bit depending on the scaffolding that you set up, so you can't quite do the A/B tests directly, apples to apples, unless OpenAI also published, like, here's what Sonnet 4.5 can do or whatever. But these are really impressive scores nonetheless in that context. Terminal-Bench 2.0, same thing. We have maybe a little hint of what's going on here in terms of how Codex was built. They say GPT-5.2 Codex builds on GPT-5.2's strengths in professional knowledge work. So we can maybe infer from that that this is a distilled version of 5.2, maybe. But then it says, and GPT-5.1 Codex Max's frontier agentic coding and terminal-using capabilities. I wonder if that implies that they distilled 5.1 Codex Max into a 5.2 clone or, you know, distillate. Anyway, it sort of seems to imply that they used these two models together in some way, and there are only a few ways that I can think of that one would do that. But anyway, we'll talk a little bit more about this later, and we're doing lightning round right now, so I'll be quick about it. They do say that GPT-5.2 does not reach their high level of cyber capability under their Preparedness Framework. By the way, if it reaches a high level of capability in any of these domains, then that brings in a whole bunch of responsibilities that OpenAI has sort of self-assigned. They said, look, once our models reach a high level of risk in any of these categories, we're going to do X, Y, and Z. We'll talk about what X, Y and Z are a little bit later in the context of the system card for GPT-5.2, which also dropped this week. But the bottom line is it does not reach that threshold.
They are, though, getting ahead of it a little bit and saying, we're going to start behaving as if we're in that domain, at that risk threshold level for cyber capability, in the future. And among other things, they're piloting an invite-only trusted access program to let vetted people and organizations access upcoming capabilities that are focused on defensive cybersecurity work. So they want to try to help the cyber defense guys get access to their tools first, to build up a bit of a lead before they open up the model more widely to the public, given the offensive cyber capabilities implied by these benchmarks. So there you have it. I'm sure we'll be talking more about this later, but it's a pretty interesting release, right?
A
And on that note, that's probably the most interesting bit of this release. If you look at the blog post, like half of it is about cybersecurity. They position it as an advanced coding model for professional software engineering and defensive cybersecurity. They have this plot of professional capture-the-flag challenge performance over time. And on this particular assessment, going back to August, GPT-5 was at 25%, and then over just a few months it is now reaching like 85, 90%. And you've seen large jumps from GPT-5 to GPT-5 Codex to GPT-5.1 Codex to GPT-5.2 Codex. So I think what they've probably done here is notice that there is a pretty rapid advancement on this particular aspect, and they even highlight security engineers and researchers, including at the React team, showing how people are using these tools to discover vulnerabilities.
B
Absolutely. And actually, to your point, one of the stories of that plot too is that this is for, as you said, professional capture-the-flag challenges. These are hard, right? Capture-the-flag challenges are basically some kind of written string or token that's hidden somewhere, encrypted somewhere, and you've got to find a way to hack your way into it, to reveal it. And here's what you see historically. So o3, all the way back in April 2025, and GPT-5 basically had the same performance. GPT-5 came out in August. So between April and August of 2025, there was no meaningful difference in the capability of frontier models, or at least of OpenAI's frontier models, on that. So if you tuned out when GPT-5 released, if you saw GPT-5 come out and you were like, eh, okay, offensive cyber is kind of there, the story has completely changed. The steady improvement, like basically linear improvement, from 25 or so percent on this critical cybersecurity benchmark to now, I mean, it looks like north of 90% or very close to 90% for GPT-5.2 Codex, has happened over the last, like, five months. So the entire cyber landscape, the cyber threat model, the threat picture associated with these models: if you tuned out in August and you're waking up today, five months later, holy shit, do things look different. Like, you know, this is a good moment to reassess your cyber posture. So that's one of the take-homes, at least for me, from this plot. You know, it looks like it's going nowhere and then suddenly, boom, right? This is going to be happening across a lot of benchmarks, and it's important to keep tabs on this sort of thing if you care about things like, you know, national security and cybersecurity.
A
And moving on, again with OpenAI, the last big release: they have also rolled out an update to ChatGPT Images. So they now have GPT Image 1.5 rolling out to most users. The big story is it appears to be competitive, or more competitive, with Nano Banana and Nano Banana Pro. It is especially better at precise edits and generates images up to four times as fast. So, you know, we've discussed this before, I think: with Nano Banana and Nano Banana Pro, the ability to edit images very precisely, to create infographics or slides, actually appears to be a very big deal, more so even than image generation in many cases, because these are things you can use to not just make fun AI images, but actually useful stuff for presentations or posters or whatever else. So it's not too surprising, I guess, that OpenAI is trying to be neck and neck with Google on this kind of image generation. And I think at this point it's really just Google and OpenAI, with Black Forest Labs trying to also keep up, at this kind of very high level of performance on editing.
B
The prompt fidelity here is pretty wild. There's this example that's pretty striking where they have a prompt that says basically, you know, write the following article introducing GPT-5.2, and make an image of it printed on a sort of old-school newspaper, but print the exact text that we're going to write below, including a table, including even things in markdown. So there's markdown for italics and bold, and all of that is captured in this image. Like, that's a big deal. I remember there are a lot of times when a new image model comes out and I'm always like, I don't see the difference between these anymore. Like, there's only so much more text prompt fidelity and all this stuff that we can go for. But this seems to suggest the problem is really solved. I mean, that's the first time I've seen it and just thought, okay, I get it. You can produce any text on anything in any... And we'll see what the vibe check is on that, obviously, but we're clearly on the way there.
A
Right. And it's something I wouldn't have expected, right? Not only text, like readable text, but in general sort of realistic image details have been a point on which image generation has struggled. I think, you know, if you look at the technical side of things, being reliant on diffusion models makes it tricky to get these very precise details right. This kind of advance, which I didn't see coming, probably has to do with the move to autoregression. These image generation models seem to have the same vibe as just chatbots, LLMs, at this point, where they can output something like coherent sentences or paragraphs or stories, but in image form. And just one more quick story: Meta is partnering with ElevenLabs to power AI across Instagram and Horizon. So they will be using it to dub Instagram Reels in various local languages, and also to generate music and create character voices. So ElevenLabs, the leader in the space, gets another one. I would have expected Meta to try and do this in house, I suppose, but it seems they are outsourcing this stuff.
B
Yeah, it's interesting. I mean, that's gotta have something to do with the way they're task-organized now with their superintelligence team internally. Maybe this is seen as a little too producty and a distraction from the superintelligence work. I don't know, I'm making that up. But yeah, it is an interesting development.
A
Onto applications and business, and going back again to OpenAI: they are going to end the equity vesting period for employees. So what this means is there's a typical requirement in startups around your equity, your shares of the company, basically. Typically, if you're at a startup, the way you get rich is you have a bunch of shares, the shares become worth a lot, and then you can sell them to get actual cash. There is this thing of vesting, where you get more access to shares over time. And in some cases there is this period you have to work for before your equity starts vesting. So you have to be there for six months or a year before you get any access to any of your shares. So this story is telling you that now there's no requirement, no waiting period at all. I don't know, I'm not technical enough in the finance world to know what it means, but it seems like a play to get talent. Like, it's another way to make the job more appealing.
B
Yeah, this is super weird. And this is something, you know, when we were at Y Combinator, there were a lot of talks about this sort of thing, like how to think about equity vesting. And you nailed it, right? The typical arrangement in a startup is you'll have, say, four-year vesting. In other words, the startup tells you, hey, you join our company, you're joining very early on, we'll give you 1% of the company, right? That's your package. But you don't get that 1% upfront. You're going to get it delivered to you on a drip over the course of four years, gradually vesting more and more of it. And the idea is that's meant to incentivize you to stick around, because startups are high-risk things, and, you know, if you contribute a little bit early on and then you leave, most of the work is still ahead, and the startup needs to have a way to convince you to stick around. The one-year cliff is where, exactly as you said, you don't get any equity until the cliff hits. Usually that's 12 months in, and then your equity starts to vest after that, running through the four years. What OpenAI is doing here is really weird, right? They're basically saying, forget it. They had previously, by the way, reduced the vesting period to six months, which is already very short. As I mentioned, a one-year cliff is the standard. So they rolled that back to six months, and now they're getting rid of it completely, which is wild, right? Especially given that OpenAI is moving towards an IPO, and the IPO is the moment you can finally sell your shares on the open market if you're an early employee. So this is like your moment to cash out, normally. And as you approach the IPO, usually companies actually become more restrictive with equity, just because they want to make sure the core team stays in through that lockup period.
OpenAI is doing the exact opposite, just as they seem to be preparing for an IPO. So this is a talent war issue. The way they're framing it is they want to encourage employees to take bigger risks, bigger sort of product risks. So take bigger swings on research and things like that without worrying that they'll get fired before their equity can vest, which, you know, fair. But the real story here may well be that the talent war is so hot that it is just very difficult. Given that xAI is offering basically no similar vesting provision, given that they're offering $100 million pay packages, OpenAI is kind of using this instant ownership as a massive recruiting carrot. And that's, I think, a big part of this, though there is obviously this factor of eliminating fear: take bigger risks, and don't worry that failure of a project will lead to a layoff in month five, when you can't access your shares. So this has a lot of implications. It's a very risky play, actually, for the exact reasons that Sam Altman surfaced to Mark Zuckerberg when Zuck was offering these massive pay packages. You know, Sam was saying, hey, look, you're going to end up hiring a lot of mercenaries who are just doing this for the money, and that's bad for your culture. There's a bit of that factor here too: if you're making it all about the equity, with no lock-in period, no vest, no cliff, it's sort of similar. Right? So this is the challenge that everybody's in. In this talent war, it's really hard to recruit. Sam Altman's between a rock and a hard place, and basically everybody is.
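Editor's note: the vesting mechanics described above can be made concrete with a short sketch. This is a minimal illustration, assuming linear monthly vesting over 48 months and a cliff before which nothing is vested. The function name `vested_fraction` and its parameters are hypothetical, and nothing here reflects OpenAI's actual grant terms.

```python
def vested_fraction(months_worked: int,
                    total_months: int = 48,
                    cliff_months: int = 12) -> float:
    """Fraction of an equity grant vested after `months_worked`.

    Assumes (for illustration only) linear monthly vesting over
    `total_months`, with nothing vested before `cliff_months`.
    """
    if months_worked < cliff_months:
        # Before the cliff, an employee who leaves keeps nothing.
        return 0.0
    # At or after the cliff, vesting is proportional to time served,
    # capped at the full grant and never negative.
    return max(0, min(months_worked, total_months)) / total_months


# An employee laid off in month five:
standard = vested_fraction(5)                  # one-year cliff: nothing vested
no_cliff = vested_fraction(5, cliff_months=0)  # no cliff: 5/48 of the grant
```

Under these assumptions, removing the cliff only changes outcomes for people who leave before month 12. After that point the two schedules are identical.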
A
And I would imagine, in this talent war, OpenAI is now at a point where they just have less equity to throw around for new employees, right? As a startup they're quite mature. They have thousands of employees, they're no longer small. So new employees can't get as big a share of the pie. xAI, from what I understand, is not an order of magnitude smaller, but a significant amount smaller, especially on the AI team but also on the product side. So yeah, it's definitely a sign of it still being competitive. And among the frontier labs, among Google, Anthropic, xAI, if you're top-of-the-line talent, at this point OpenAI isn't your go-to pick, you know, as a no-brainer. Next we've got a pretty detailed look into China's efforts to get away from reliance on the West for hardware. The title of the article is "How China built its Manhattan Project to rival the West in AI chips." So some details here. Chinese scientists in Shenzhen have built a working prototype EUV, extreme ultraviolet, lithography machine, completed in early 2025 and now undergoing testing. This is reverse engineering, seemingly. It was built by former ASML engineers who reverse-engineered the technology from ASML, ASML being the only company able to create these machines at this point. ASML's CEO said that China would need many years to develop this technology, but it appears they might be ahead of that prediction. And on that Manhattan Project note, this is coordinated by Huawei and overseen by a confidant of the president, with thousands of engineers across multiple state research institutes working in this direction. So yeah, a very strong push, unsurprisingly, in this direction. I guess, Jeremy, you would know more: how much of this information is new, how much of it is surprising, all of that.
B
So yeah, I mean, this information is new and it's shocking, actually. If you think about the one area where the West was supposed to have an enduring advantage relative to China when it comes to AI, and I really mean at this point it is the one thing, it's chips, right? It's the ability to, well, not design anymore, because Huawei actually can design really good chips and network them together efficiently given their constraints, but to fabricate these chips, right? You can go back to our hardware episode that we recorded, I want to say like a year ago now, and check out the very end of that. We talked about how these EUV machines work. They are insanely advanced, right? Lithography machines are the machines that essentially fire these crazy precise lasers that etch into, or, yeah, carve into, if you will, the circuit designs, the circuit patterns on these chips that are so critical to making them work. Without these lithography machines, you cannot make chips. You can't make the H100, you can't make the B200, any of those things. So the previous generation of lithography machine was the deep ultraviolet lithography machine. China has had that for a long time. They bought a bunch from ASML, that Dutch company that you mentioned, but they've never been able to crack into the extreme ultraviolet lithography machine that is basically needed to get to the 3 nanometer node and below. So with DUV, you can do 7 nanometer nodes, which basically get you the A100 GPU. You can do 5 nanometer nodes, which gets you to the H100, but you can't crack that 3 nanometer node that you need to really go a lot further. It was thought for a long time that the only way to make a 5 nanometer node was to use that EUV process.
The Chinese have been pushing the limit on that, though, using their old machines, their DUV lithography machines, and passing the same chip through many times, using what's called multi-patterning to achieve the same result. So that was already a point of frustration for the West. They're saying, you know, we thought they weren't going to be able to crack 5 nanometers, but here they are actually doing that. And there's a story that we'll talk about in lightning round that is just about that, how essentially Huawei and SMIC have been able to hit the 5 nanometer threshold through multi-patterning with these old machines. But now they're breaking into, it seems, extreme ultraviolet lithography, which was meant to be the last bastion. It's not that they're all the way there. They've got basically this crappy prototype. It's unclear how clean the beams are coming out of this thing, but this is a huge step. It was meant to be the thing that China couldn't crack for like a decade. It looks like now that deadline is more like, okay, maybe four years from now. That's crucial, because if they can do this at scale, self-sufficiency for China seems to very much be on the table. There are still a lot of other things in the semiconductor supply chain that are being blocked. And a lot of the components that they needed to assemble this prototype EUV device they've brought in illegally, bypassing a lot of the export controls. But this is a story of industrial espionage at a massive scale. This is a story of China spying relentlessly on Dutch companies. Think here ASML, Carl Zeiss, that sort of complex of companies, essentially. And there are tons of stories we've covered about this just relentless Chinese espionage. Just like China's TSMC, SMIC it's called, was created essentially by stealing a bunch of IP from TSMC back in the day, the same is happening right now with ASML.
You've got all these former ASML engineers who are being brought in, basically getting signing bonuses close to the millions of dollars. You know, the sky's the limit on salary, and they are working in secret under pseudonyms. So they show up, basically, after being recruited, on the factory floor, and they see their colleagues and they're like, holy shit, Jim. And Jim's like, oh no, I'm going by Paul now. That's literally what's happening. They're concealing their identities, getting fake IDs from the government to work in these secure facilities. All of this is being orchestrated at a massive scale by Huawei. When you think about the Chinese ecosystem, it is not like the Western ecosystem with a bunch of different labs. Huawei is the master entity. You can think of them as almost the parent company running SMIC. So Huawei and SMIC are really working closely together. Huawei designs the chips, SMIC fabs them. So Huawei is kind of like Nvidia in that context and SMIC is like TSMC in Taiwan, and now they're also incorporating basically the Chinese equivalent of ASML, in this massive Manhattan Project-like effort. And I think that is a sort of fair point of comparison if you think of these things as effectively the enablers of the weapons of war of the next decade. So there are a bunch of export-restricted components from Nikon, from Canon, that are being used in this prototype. This is apparently news to a lot of these companies, including Canon. They were like, we were not aware that was happening, but that's it. And you know, there are hundreds of people floating around this, a lot of them former Chinese citizens who worked at ASML. And there are a bunch of stories here about essentially evading everything from sanctions to lawsuits to be able to do this in China. So it's a classic Chinese industrial espionage story and a great, great reason to reevaluate the security posture of a lot of these labs that have critical IP.
A
Right. This article does go into, in particular, how former ASML engineers are being recruited with fake IDs and aliases and insane signing bonuses. Notably Lin Nan, ASML's former head of light source technology, headed over in this past year. And on that note of the Manhattan Project, you know, that term gets thrown around for sort of any ambitious technical project. But it is an apt comparison in this case because you're bringing in a massive amount of high-end technological and scientific talent from various places to work on a single goal. They have a team of 100 recent graduates who are systematically reverse engineering components from EUV and DUV machines. Apparently while being watched on camera.
B
I think the article said.
A
Right, and the other bit here is the extreme amount of secrecy and sort of national security oversight, I guess you could say. So there are details here like employees working under fake identities so they don't know each other's real names, working inside secure facilities where no one outside the compound could know where they are. They are divided into teams who are kept isolated from each other to protect the confidentiality of the project, and they don't know what other teams work on. I suppose this might explain why we are just now getting this news, even though it is a massive undertaking which has of course been in the works for quite a long time. This has been a priority of President Xi for years, and it appears to be now starting to bear fruit in a big way. And on that note, a related story. China's Huawei and SMIC make progress with chips, according to a new report. So another important thing to understand about China: they have SMIC, which is their competitor to TSMC, the domestic manufacturer of semiconductors, which presumably would be the one using these EUV machines. And they are continually trying to get to a point where they can make the most advanced kinds of chips, you know, 3 nanometer or whatever the leading edge technology is these days, which typically goes into the latest GPUs from Nvidia and also the latest smartphones. So SMIC hasn't gotten to a point of being able to get there. They're kind of a few generations behind. But according to this report, they are making incremental but meaningful progress to get to that next generation of chips.
B
Absolutely. And so TSMC is that Taiwan-based company that does basically fab all of the world's chips, the high-end chips, or at least the West's chips. And the only reason they can do that, or one big reason, is they get to use those fancy extreme ultraviolet lithography machines from ASML. Those are so critical. As we talked about, if you're going to crack 5nm, 3nm, you need to use those machines. The way around that is to take those old DUV machines and just do multi-patterning. So pass the same chip through many times. That's very costly because it means you literally have to spend, you know, three times or four times the number of runs to get these same wafers through, like pass them through yet again. And that slows down your run. At least you have to do that for all of the very high resolution layers. Essentially what's happening here is we're finding out, because TechInsights did one of their teardowns, that the Kirin 9030 and 9030 Pro for Huawei's Mate 80 models have this 5 nanometer-ish process that has been manufactured by SMIC. So it seems like they're actually able to do this. Now, you can't take multi-patterning, that multi-pass-through strategy, all the way to 3nm. I mean, I expect to be surprised in this space, but it really seems like you probably can't take it, at least economically, down to 3nm. So I would say, yes, expect there's going to be a refinement of this multi-patterning process. 5nm is going to get probably pretty good at some point, but expect there to be a lull until China can get their hands on actual EUV machines at scale. Expect that to be maybe sometime around 2030 or so. They say it's 2028, probably more like 2030. And at that point expect progress to catch up or to accelerate. But for now, still a meaningful improvement. Again, this puts them in the territory of being able to make the sort of H100 generation of chip. This is really a solid amount of fabrication capability.
It's not quite as good as TSMC's own process. This is what's known as the N+3 node that SMIC is using, but it is almost as good as Samsung's 5 nanometer process, which tells you a little bit about how far behind Samsung is as well. But anyway, a really important story, this sort of 5 nanometer or 6 nanometer breakthrough by SMIC. Again, I expect this to plateau, and then we're going to be waiting for them to either do industrial espionage and innovate their way into extreme ultraviolet lithography or find another way to get EUV machines, and that'll basically be it.
A
Right. And for some context, there's TSMC, there's SMIC, and then beyond that there's Samsung and Intel and kind of no one else. This is a very small space of companies capable of producing this technology. This is like the most space age, science fiction, absurdly complicated stuff that humanity has ever built, which is why you need a Manhattan Project, which is why it's such a huge deal to even be able to start catching up to TSMC. Alrighty, heading back to OpenAI, I guess we have a lot of stories on them. They are in talks to raise at least 10 billion from Amazon and use its AI chips. So these are just talks for now. It's a potential investment that could exceed 10 billion and could come with an agreement to use the Trainium AI chips from Amazon, which Anthropic has notably used. This is coming after Amazon has already invested at least 8 billion in Anthropic. So that is Amazon kind of potentially shifting alliances or broadening their alliances. It's also showcasing OpenAI expanding beyond Microsoft as a cloud provider to others in the industry. And you know, at this point OpenAI, I guess, will take billions from whoever can give them. I don't know if that's true, but it kind of feels like it.
B
Yeah. I mean, everybody is chip starved right now. Certainly the interesting thing is Anthropic was the earliest adopter of this kind of multi-platform strategy where they're using chips from Google, they're using chips from Nvidia, they're using chips from AWS. So they're well practiced at this kind of multi-platform play, which introduces a lot of overhead. I mean, you know, you're having to optimize for a lot of different frameworks. Looks like OpenAI is going to be heading in that direction. By the way, at the same time, from like 5am this very morning, it looks like OpenAI is in talks to raise up to $100 billion in funding at an $830 billion valuation. So that's got to be sovereign wealth funds. There's no other way you get that kind of money. But I wonder if this will be bundled in together with the Amazon investment at the same valuation. I don't know. So we may be kind of re-reporting on this next week, depending on what we learn.
A
Yeah. And for a bit of a sense of scale, you know, these numbers are getting ridiculous. 830 billion. I mean, there's only a handful of companies that have crossed the trillion barrier to be valued at more than that. Right. This is starting to be up there with Tesla, with Meta, with just the biggest tech companies of them all. TSMC is at a trillion dollars of value, for context. So this is like OpenAI trying to say that they'll be one of the biggest companies in the world, like top 10 material. Right. And speaking of Amazon, they have a bit of a shakeup. The head of AGI, Rohit Prasad, who's been leading that for a couple years and who's been with the company since 2013 and was instrumental to the development of Alexa, is leaving and being replaced by a new head of AGI. That person is Pieter Abbeel. He is an AI researcher, a professor notable in the deep learning space. One of these people who's been insanely productive and influential over the past decade. So it seems like Amazon is trying to be more competitive in this sort of R&D space, the model development space, where they haven't seemed to really drive very hard to be neck and neck at all with frontier labs. They recently did release their Nova AI models, which I don't even know if we've covered, because I don't think anyone really talked about the Nova models from Amazon, which tells you the amount of impact people seem to think this has. So if Amazon is going to start trying to develop their own kind of AGI models, it could potentially be a big deal. We really haven't seen anyone try to become a frontier AI lab. And it doesn't feel possible at this point to get a new player in the space, you know, to compete with Anthropic and OpenAI and Microsoft. But Amazon could make an effort.
B
Yeah, I mean, you know, xAI came out of nowhere, Google came back. So maybe. But to your point, yeah, it is challenging. The talent issue is really tough too. Right? I mean, if you are one of the best, which is what it takes, you've got your choice. OpenAI will offer you, hey, no vesting period. Here's your stock. By the way, we're an $830 billion company about to IPO. Like, do you want that private island? That's basically where things are at. Really challenging. The other piece for Amazon is because they're in the business of chip design, they have to be in the business of model development. You just need that feedback loop. Amazon's solution to this has historically been deep partnerships with frontier labs that involve forcing those labs to use their chips so they can get that feedback loop. That's what happened with Anthropic with the early Amazon investment in them. And so here's Amazon desperately trying to get their arms around this and have some kind of internal team, because then the feedback loop that's informing chip design can be even tighter. So I think this is an important feedback loop that's missing right now within Amazon: how do you have an organic capability within Amazon to give that feedback on the chip design side? Everybody's designing their own chips now. We're going to get into a story about Broadcom in a second. I mean, you know, Google's got their own chips, Meta's got their own chips, Microsoft's got their own chips, OpenAI's got their own chips, everybody gets a chip. But, you know, here's Amazon, an early mover on the chip design thing, which made sense because they're also sort of an infrastructure company first and foremost. It's how they think about themselves.
A
But.
B
But it's going to be hard to maintain if the people who know the most about how you're going to use those chips are sitting outside of your company in a competitor company that's also interested in designing their own chips. So we'll see how this goes. But that's kind of part of the calculus here. It's not that Amazon's out of the game, but this is a really decisive period for them.
A
Right. And notably, Amazon is in significant part a cloud computing company. Right. Something like half of their profit, actually more than half of their profit, is the cloud compute that various tech companies buy from them. So you can see from a business perspective, you have Google with TPUs potentially being a better option for cloud compute, for AI specifically, which is a big threat to their ability to lead in that space. And on the topic of Broadcom, the story is that they have revealed they have a mystery $10 billion customer, or it used to be a mystery customer. Now we know it is Anthropic. They have ordered the latest round of Google TPUs and placed an additional $11 billion order with Broadcom recently. So as you said, everyone wants the chips. This is Broadcom's fourth XPU customer, these are custom AI chips from Broadcom. And apparently Broadcom is getting another one with a $1 billion order.
B
Yeah, it's amazing the sticking power of Broadcom. Everybody's basically using them now, which is. It's kind of weird, you might wonder, like, what is Broadcom actually doing? Because I thought you guys said Google is designing their own chips and OpenAI is designing their own chips. So if they're designing their own chips, why are they partnering with Broadcom to help them design the chips? And the answer is that chip design is really complicated and there's two separable parts of it. You can think of it like in construction, actually, there's a division between the architect and then the general contractor and engineers who actually kind of take the architecture design and turn it into okay, but this is like the detail we need to actually implement it. These are the supply chains that we need to worry about to actually get our arms around this. That's what Broadcom does. And for a long time, you know, when Google first started to partner with them, I mean, the going assumption, at least my going assumption was they're going to try to wean off of Broadcom and eventually master the whole supply chain stuff. That hasn't happened and it's been over a decade. So this might actually be a sustainable moat for Broadcom. We'll see. Maybe their name will keep coming up in the context of all these big companies designing their chips.
A
Alrighty. And now a quick couple of stories about fundraising. First we've got Lovable. They have raised 330 million in a Series B funding round. I think we covered a little while ago the prediction that this would happen. They're now valued at $6.6 billion. Lovable is sort of a winner in the vibe coding space that has shot up rapidly this past year. So you go there, you enter a prompt, you get a website or some other program, and it seems that investors are still excited about the potential of the vibe coding business, where there have been some signs of a slowdown in terms of interest. It's not totally apparent how much revenue is possible, but they are saying that Lovable has achieved a hundred million dollars in annual recurring revenue within just eight months and has doubled it to 200 million very rapidly after that. So still growing super rapidly. And this is after a $200 million Series A round in July that valued the company at 1.8 billion. So crazy. In half a year they have gone up like 4x in value. And one more: we've got Fal, a startup specializing in hosting AI models for image, video and audio. They've raised 140 million in a Series D funding round that values it at 4.5 billion. Not a company many people have probably heard of, but one of the winners in the inference startup space, which is a big category for startups that are going to win in the AI landscape. And now moving to projects and open source and going back to Nvidia, they had kind of a big story this week with a model release. They have released Nemotron 3, which is, you could say, akin to something like a Chinese open source model in a way. It's a series of three hybrid mixture-of-experts models. So very efficient models that are runnable with relatively little compute and are very fast. And they have open sourced it, like super, super open sourced it.
They've open sourced the data, they've open sourced the training code, they've open sourced the models. They have multiple sizes, notably this 30 billion Nano model with different versions of it: the base pre-trained model, the post-trained model, the quantized model. Anyway, it's quite a big release and it's looking pretty impressive. I don't know if anyone remembers, OpenAI released GPT-OSS earlier this year and nobody seemed to care. Well, this is actually potentially a big deal in terms of open source releases.
B
Yeah, I mean it certainly is competitive, sort of roughly comparable on the Artificial Analysis Intelligence Index. This is a composite of a bunch of different eval benchmarks, and it got the same score. So this incorporates MMLU-Pro, GPQA Diamond, Humanity's Last Exam, all these benchmarks that we talk about all the time. They just kind of give you an average, so 52 for Nemotron 3 Nano and the same score for GPT-OSS 20B. So in the same kind of size class, or similar size class, I should say, you're seeing very similar performance. That's a big achievement. So a couple things. First of all, this shows Nvidia is in the game, right? That's a big deal to begin with. We haven't heard much from Nvidia in any kind of model development for some time, at least for this sort of language model thing that is the frontier right now. The other piece is why are they open sourcing so aggressively? I mean, you made that comment. They're almost trying to say, no, we're open sourcing even more. We're going to tell you all about the architecture and all this stuff and all these quantized versions of the model and this and that. Fundamentally, if you look at the open source ecosystem, it runs on Nvidia chips right now in a way that closed source is starting to do less. Right, we're seeing more and more things trained especially on TPUs, which is obviously Google's in-house chip that no one else gets to use. So more and more proprietary models are being trained on that. Nvidia really wants to make sure that it holds onto this open source ecosystem, let's say brand value, that it's got. So one of the key things with this release too is these are open models that are specifically geared for agentic AI applications. That's one of the key things that they're orienting towards here because they know that's the key use case nowadays.
There are a couple of interesting technical advances here, one of which is this sort of hybrid Mamba kind of attention thing going on. So yeah, one of these headlines is Mamba is back. Right? We haven't talked about Mamba in a while. It's relevant. Again, they've got this interesting interleaving of these Mamba two layers with mixture of experts layers. So they kind of stack them together and they then they slap an attention layer on top of that. So you benefit from, you know, the advantage of the Mamba layer, which if you might recall is a kind of middle ground Memory between the sort of like just remembering the current token, which is what the attention values do, and the actual weights which are frozen in time. Mamba kind of has this vector that gets updated over time while context evolves. And so it kind of gives this more sort of natural way of holding on to intermediate level memory, which is really important. They're also using sort of new RL techniques and things like that. Oh yeah. So they also have this hybrid MOE strategy that involves what they call latent mixtures of experts. And so in a standard mixture of experts model, you have one part of the model, one component that is the router. So you take your inputs, the router decides which of many different expert models, so little mini models to route that token to, and you'll typically send it to a couple of them and then you get your output. And so that's what an MOE layer is in typical model. What they're doing here is instead of sending the full token vector to all the experts and to the router, the model first projects it down to a smaller latent vector. So they're going to compress the information that's coming into the MOE layer so that it can get processed faster by the MOE layer and then it gets spat out and then it gets decompressed or expanded out the other end. At the same time, they're using this thing called multi token prediction. 
Instead of getting the model to predict one token at a time, they actually have essentially the same model trunk as usual, but then at the end of the model they generate predictions for the next token, the one after that, the one after that and the one after that. Right. So they're able to kind of predict at the same time many different tokens, which is important because it sort of incentivizes the model to be more future aware. Like the trunk of the model then has to be able to support predictions not just for token one, but for tokens two through whatever. So all of this is kind of coming together with a bunch of new RL infrastructure, training infrastructure and a long 1 million token context window. So this is a really interesting release and a whole bunch of technical advancements that have culminated, you know, as we've said, in this really impressive benchmark performance.
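For anyone following along in text, the latent mixture-of-experts idea described above can be sketched in a few lines of plain Python. This is only an illustration of the shape of the computation as described in this conversation, not Nvidia's implementation: the class name `LatentMoELayer`, the dimensions, the random weights, the ReLU experts and the top-2 routing are all assumptions made for the example.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LatentMoELayer:
    """Hypothetical latent MoE layer: compress, route, run experts, expand."""

    def __init__(self, d_model=16, d_latent=4, n_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = random_matrix(d_latent, d_model, rng)    # compress the token
        self.up = random_matrix(d_model, d_latent, rng)      # expand it back out
        self.router = random_matrix(n_experts, d_latent, rng)
        self.experts = [random_matrix(d_latent, d_latent, rng)
                        for _ in range(n_experts)]

    def forward(self, token):
        # 1. Project the full token vector down to a smaller latent vector.
        latent = matvec(self.down, token)
        # 2. The router scores every expert using the latent vector.
        gate = softmax(matvec(self.router, latent))
        chosen = sorted(range(len(gate)), key=lambda i: gate[i],
                        reverse=True)[:self.top_k]
        norm = sum(gate[i] for i in chosen)
        # 3. Only the chosen experts run, and they work in the small latent space.
        mixed = [0.0] * len(latent)
        for i in chosen:
            out = [max(0.0, v) for v in matvec(self.experts[i], latent)]
            weight = gate[i] / norm
            mixed = [m + weight * o for m, o in zip(mixed, out)]
        # 4. Expand the mixed expert output back to the model dimension.
        return matvec(self.up, mixed), chosen

layer = LatentMoELayer()
output, chosen_experts = layer.forward([1.0] * 16)
print(len(output), chosen_experts)
```

The point of the sketch is the cost argument: each expert here multiplies a 4 by 4 matrix instead of a 16 by 16 one, so the expert work shrinks with the square of the compression ratio, at the price of two extra projections.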
A
Right. And technical advancements that are built on kind of the frontier of research in a way. So for instance, the multi-token prediction, we covered that paper, I think this year, I think it was a release from Meta actually. Things like Mamba, of course, were very hot in research for quite a while. And we've covered multiple papers discussing hybrid models as one of the potential successors to Transformers. This model has a 1 million token context window, which is quite large. The models out there like GPT 5.2 and Claude are usually less than 1 million, like a couple hundred thousand in input context size, which is significant for agentic coding in particular because the input contexts get very large. Also impressive speed from Nvidia on this, going by their papers. They had Nemotron-H back in April with hybrid Mamba-Transformer models. Then they had Nemotron Nano 2 back in August and now, you know, fairly soon after this we got Nemotron 3 and notably this latest generation of the Nano model, which they are open sourcing and letting people use. And at the scale of a Nano model, which has 30 billion total parameters and 3.6 billion active parameters, you can get a single GPU and start doing some real work on it. So it is from an open source perspective a pretty significant deal. Also the data release: they have this Nemotron-CC-v2.1, 2.5 trillion new English tokens from Common Crawl, including curated data from three snapshots with rephrasing and translation. That's actually also a big deal with these massive data sets like Common Crawl, et cetera. One thing that you find out once you crawl the entire Internet is that most of it's garbage. Like it's just nonsense. And one of the ways that the frontier labs have a moat, a real reason that anyone has a challenge competing, is not just the scale of compute but the data. That's one of the places where you need talent to be able to know how to get a pre-training corpus.
And so from a data perspective this open sourcing is also quite notable. So overall an exciting open source release. A couple other releases. Next we've got Meta. They have released SAM Audio, which is a new AI model capable of isolating and editing audio using prompts. So they've released many of these SAM models at this point for things like tracking things in video and segmentation. And now we've got SAM Audio, and there are not too many open source models for this kind of application. So fairly useful. And Google also has an open source release with T5Gemma 2, which is a new family of lightweight open encoder-decoder large language models. So T5 historically was one of the model families that Google worked on. Gemma was their small LLM thing that they've open sourced for a while. Now they have put those two together with T5Gemma 2, and again at the small LLM scale, 1 billion and 4 billion parameter models that are competitive with other models of that size and actually outdo Gemma 3 and Google's other work.
B
And with this one, you might be wondering why the big deal with the encoder-decoder thing? It is a bit unusual. Most modern LLMs are decoder-only. You're thinking here of like GPT-4 for example, classically a decoder-only model, or even standard Gemma. The intuition behind this: the encoder-decoder has an encoder part that basically takes in the text, chews on it and sort of gets to look at the whole text at a high level, and creates a small compressed latent representation of that text. So this is what the whole text says. This is my understanding of it. And then the decoder takes that and, based on that rendering of the entire text, generates its output, its prediction. This is basically a way of making sure that your model has a holistic view of the entire input. You can't do it, at least not in the same way, when you're trying to do autoregressive text prediction. Because the whole point of that is that you can't see the whole sentence, the whole piece of text. It hasn't been created yet. You're trying to create it. Which is why the decoder-only architecture, which basically doesn't have that first step where you're looking at the whole thing in context, tends to be used for autoregressive prediction. Now in this case, they've made a bunch of modifications to the standard encoder-decoder framework to make it work for the autoregressive use case. We won't go into the details because we're already pressed for time. But bottom line is this is actually really useful for the kind of multimodal use cases that this is geared towards. Because you can imagine looking at an image through a decoder-only lens, you're literally only able to look back at the pixels that you've already seen as you raster scan your image. Whereas the encoder allows you to look at the entire image in context, understanding every piece as they fit together before you generate your output.
And so that's part of the reason why this is useful. Also useful for long context, because you're not having to look back on everything that you've written up till that point for every token, but instead once at the end, once everything is kind of loaded up, the model can study the full context and based on that determine what to write next. So anyway, this is actually A really interesting architectural development. They have got a whole adaptation recipe to accommodate this whole encoder decoder architecture.
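A small text-side illustration of that difference, for readers of the transcript: the gap between a decoder-only view and an encoder view comes down to which positions each token is allowed to attend to. The helper names below are made up for this sketch and are not taken from T5Gemma 2.

```python
def causal_mask(n):
    """Decoder-only view: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder view: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def visible_fraction(mask):
    """Share of (query, key) pairs that are allowed to interact."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)

for row in causal_mask(4):
    print(row)
print(visible_fraction(causal_mask(4)), visible_fraction(bidirectional_mask(4)))
```

In the causal mask the first token of the input sees only itself, which is the raster-scan limitation described above, while under the encoder mask every piece of the input is read with the whole input in view.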
A
Yeah, encoder decoder, fun kind of nerdy detail. Back in the day with the initial transformers it was encoder decoder. Then GPT kind of changed up the approach and now they're bringing it back.
B
Yeah.
A
And a different kind of open source-y thing: Anthropic is making Agent Skills an open standard and also expanding this skills thing. So skills we covered a couple months ago. It's a new way to give your model kind of specialized capabilities by basically throwing in some folders with some documentation. I guess they're pretty confident in this approach because they are now making it a standard so you can share skills as packages for other users of LLMs to build with. Alrighty. Moving on to research and advancements. First up, Budget-Aware Tool-Use Enables Effective Agent Scaling. So the gist is when you're doing agentic coding or reasoning or any of these frontier challenging tasks, the way that you work is via tool calls, which are things like search the web, read file, edit file, et cetera. And one of the challenges with actually working at scale on large code bases, et cetera, is that the models are generally not optimized to be efficient in terms of their reasoning, in terms of how much they think. They're also not very optimized in terms of their efficiency with tool use. They kind of often search more than they need to, read more than they need to, et cetera. So this paper is introducing budget-aware test-time scaling that dynamically adapts agent behavior based on real-time resource tracking, which enables you to optimize performance under any given budget. And that makes it possible to get better performance with fewer tool calls and lower costs.
B
Yeah, suddenly like 30% of the papers that we're looking at are these agentic or multi agent orchestration papers, which is really interesting. So this is another one. They use this framework. Yeah. So budget aware test time scaling, which they abbreviate to bats, which is a pretty cool sounding acronym. This is the way that they are going to go ahead and try to make their agents aware of. Hey, like you're running low on your token budget. So they've got a tool call count. So how many searches or API calls have been used versus the total allowed. They've got a token count as well. So that's just like the cumulative cost of all the model's kind of internal reasonings. Not just the stuff that it spits out at the end, but also the Stuff it puts on a scratch pad. And what they're going to do is they're going to use a step called state injection. So at every step of that reasoning loop that these agents are doing, the tracker is going to inject the current token and tool call balance. So how much do you, how many searches do you have left? How many tokens do you have left? They're going to insert that into the agent's prompt. So the agent gets the regular prompt like go do this, go do that. By the way, you have a total of three search calls remaining. You have a total of X many tokens remaining. And so beyond that, the agent is also going to maintain a record of all the subtasks that are complete and which failed and how many resources each one consumed. So it's almost like the agent is sort of keeping track of what it tried and how much of a cost it had. And then there's this last piece, which is the adjustment piece. So before starting a new task, the agent checks its budget tracker. And if the budget is high, it's going to choose typically a high effort plan. So it's got a lot of tokens. I can do a lot of things. Fine, I'll go ahead and like search five different sources to find the answer and kind of cross reference. 
But if the budget is low, it shifts to a low effort plan. Like just, let me just go ahead and summarize existing info. It's like, you know, pencils down, you got five minutes to finish the test. What are you going to do? Let me just get this out on the page. So they have a bunch of findings that are interesting. The first is an agent with a budget of just 10 tool calls can match the accuracy of a standard agent that's using 100 tool calls using the budget tracker. So this is a big deal in terms of savings. Resource savings are kind of more mixed. So the budget aware agent reduced search calls by 40%, browse calls by 21%, total monetary costs by 30%. So these are pretty, pretty decent. Last thing I'll just mention, they do explain some interesting behaviors that they see emerge. So when the token budget is high, they find the agent spends a lot more effort on sort of self verification, like checking its own work when it knows that it has a high a big surplus of tokens. Kind of like what you might do if you're writing an exam and you're like, I've got another hour, but I finished, I'm going to start checking my work. Right, that kind of makes sense. But when it's more constrained and it's got a low token budget, very little time left, it starts to just abandon dead end, search trajectories a lot faster and, and that's kind of what you'd expect. So interesting paper, something that might actually generalize. So yeah, kind of, kind of exciting if you're into agents, right?
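For readers who want something concrete, here is a rough sketch of the three pieces described above: tracking usage, injecting the remaining budget into the prompt, and switching plans when the budget runs low. It is not the paper's code. The class name `BudgetTracker`, the prompt wording and the 25% threshold are assumptions made for illustration, and the sketch assumes both budgets are positive.

```python
from dataclasses import dataclass, field

@dataclass
class BudgetTracker:
    """Hypothetical tracker for an agent's tool-call and token budgets."""
    max_tool_calls: int
    max_tokens: int
    tool_calls_used: int = 0
    tokens_used: int = 0
    history: list = field(default_factory=list)

    def record(self, subtask, succeeded, tool_calls, tokens):
        """Log a finished subtask along with what it cost."""
        self.tool_calls_used += tool_calls
        self.tokens_used += tokens
        self.history.append((subtask, succeeded, tool_calls, tokens))

    @property
    def tool_calls_left(self):
        return max(0, self.max_tool_calls - self.tool_calls_used)

    @property
    def tokens_left(self):
        return max(0, self.max_tokens - self.tokens_used)

    def inject(self, prompt):
        """State injection: append the remaining budget to the agent's prompt."""
        return (f"{prompt}\n\n[budget] You have {self.tool_calls_left} tool calls "
                f"and {self.tokens_left} tokens remaining.")

    def choose_plan(self, low_fraction=0.25):
        """Pick an effort level from whichever budget is closest to running out.

        The 25% cutoff is an arbitrary choice for this sketch.
        """
        remaining = min(self.tool_calls_left / self.max_tool_calls,
                        self.tokens_left / self.max_tokens)
        return "low_effort" if remaining <= low_fraction else "high_effort"

tracker = BudgetTracker(max_tool_calls=10, max_tokens=1000)
print(tracker.choose_plan())                      # plenty of budget left
tracker.record("search five sources", True, tool_calls=8, tokens=300)
print(tracker.inject("Find the release date."))   # prompt with budget appended
print(tracker.choose_plan())                      # tool calls are nearly gone
```

A real agent loop would call `inject` before every model step and `record` after every subtask, so the model always sees an up-to-date balance rather than discovering the limit when a call is refused.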
A
And by the way, part of a trend with the new models, Opus 4.5, Gemini 3, is them being more token efficient. With them spamming reasoning less, you're getting big gains, like 30%, 40% more efficient reasoning, more efficient tool use. So that's kind of the rage now: you've got some reasoning, now let's make it actually be concise.
B
So up next we have Rethinking Thinking tokens, LLMs as improvement operators. And the basic idea here is what if we started to think about our language models not as things that generate pieces of text, but things that can execute arbitrary operations on text. And what the hell does that mean? Well, you know, we've got a compression operation, for example, right? You've got a text that's too long, let's go ahead and go through and compress it, right? So read the text, write an analysis of it and then compress that analysis. That whole series of steps. We can think of an LLM as a thing that can do that. So essentially just like transform a piece of text in arbitrary ways. This seems like an effort to introduce fancy sounding, satisfying, mathy language to a concept that I think everybody basically gets already. But the interesting thing about it is they demonstrate one particular loop or kind of operation method called parallel distill refine or pdr. They're basically trying to show that if we want to achieve the same performance or better performance than just having a model do a bunch of verbal diarrhea and pump out a bunch of tokens in a chain of thought. Maybe what we can do is actually generate far more tokens, but do it in parallel. So generate 10 different shorter rollouts and then combine them together and you know, each of those rollouts has to be bounded in terms of how many tokens it can contain. And basically just like then combine, synthesize them together to get something better than just the, again the puking on the page that you see with standard full length chain of thought. And they do see this approach is much more effective or can have significant advantages if you think of it in terms of like overall the time required to pump these tokens out. 
Because you're parallelizing, yes, you're doing more tokens total, but then you're aggregating together, you're summarizing, and each of those rollouts is shorter, so you're actually going to get your answer faster, if that makes sense. Anyway, we'll roll. But that's the story.
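A minimal sketch of the loop described above, to make the latency argument concrete. The `Model` callable, the prompt wording, and the budget numbers are all assumptions for illustration; this is not the paper's implementation, just one way the parallel, distill, refine steps could be wired together.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical model interface: (prompt, max_tokens) -> generated text.
Model = Callable[[str, int], str]


def pdr_round(model: Model, task: str, workspace: str,
              n_drafts: int = 4, draft_budget: int = 256,
              summary_budget: int = 128) -> str:
    """One parallel + distill step; returns a new bounded workspace."""
    prompt = f"Task: {task}\nNotes so far: {workspace}\nDraft a solution."
    # Parallel: n_drafts short, independent rollouts, each capped at draft_budget.
    with ThreadPoolExecutor(max_workers=n_drafts) as pool:
        drafts: List[str] = list(
            pool.map(lambda _: model(prompt, draft_budget), range(n_drafts))
        )
    # Distill: compress all drafts into one bounded summary.
    joined = "\n---\n".join(drafts)
    return model(f"Task: {task}\nSummarize the useful ideas:\n{joined}",
                 summary_budget)


def pdr(model: Model, task: str, rounds: int = 2, **kw) -> str:
    """Run several rounds, then refine using only the distilled notes."""
    workspace = ""
    for _ in range(rounds):
        workspace = pdr_round(model, task, workspace, **kw)
    return model(f"Task: {task}\nNotes: {workspace}\nGive the final answer.",
                 kw.get("draft_budget", 256))


def sequential_tokens(rounds: int, draft_budget: int, summary_budget: int) -> int:
    """Tokens on the latency-critical path: drafts run in parallel, so only
    one draft per round counts, plus the summary, plus the final refine call."""
    return rounds * (draft_budget + summary_budget) + draft_budget


def total_tokens(rounds: int, n_drafts: int, draft_budget: int,
                 summary_budget: int) -> int:
    """Upper bound on all tokens generated, counting every parallel draft."""
    return rounds * (n_drafts * draft_budget + summary_budget) + draft_budget
```

With these assumed budgets (2 rounds, 4 drafts of 256 tokens, 128-token summaries), the upper bound is 2,560 generated tokens in total but only 1,024 on the sequential path, which is the "more tokens, faster answer" trade being described.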
A
Right. By the way, just in the spirit of Last Week in AI being our title, these are slightly older papers. I'm assuming, Jeremy, that you're just catching up on cool stuff that you missed while being in your crazed work phase, whenever you were on hiatus. So in case anyone was wondering, yeah.
B
I should have mentioned I actually didn't include which of these papers are more recent and which are not. I apologize. I really should have done that. Hopefully this is an interesting enough concept that people can look into it if you want. But it does seem like a. Yeah.
A
One of the more notable papers of the last few months. And last in this section, not a research paper, but sort of a result from Epoch AI. The question is: what if AI capabilities suddenly accelerated in 2027? How would the world know? They have developed a statistical framework that lets them detect capability accelerations from benchmark scores. So with some synthetic data, they say they can spot a 2x acceleration in the improvement rate of models within a month or a couple of months of that trend starting. And that's notable because progress has been roughly linear: we've had a consistent improvement rate over the last few years. It's hard to know if that will persist or if we might accelerate.
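To give a feel for what "spotting an acceleration" can mean, here is a toy sketch, not Epoch AI's actual framework. It assumes a single capability score per month, fits a straight line to a baseline window, and raises an alarm once new points run persistently above the extrapolated trend (a simple one-sided cumulative sum). The function names and the threshold are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def first_alarm(months: List[float], scores: List[float],
                n_baseline: int, threshold: float) -> Optional[float]:
    """Return the first month where accumulated excess over the baseline
    trend exceeds `threshold`, or None if it never does."""
    slope, intercept = fit_line(months[:n_baseline], scores[:n_baseline])
    cusum = 0.0
    for month, score in zip(months[n_baseline:], scores[n_baseline:]):
        residual = score - (slope * month + intercept)
        cusum = max(0.0, cusum + residual)  # only upward surprises accumulate
        if cusum > threshold:
            return month
    return None


# Synthetic, noise-free data: 1 point/month, doubling to 2 points/month after month 24.
months = [float(m) for m in range(36)]
steady = [m for m in months]
accelerated = [m if m <= 24 else 24 + 2 * (m - 24) for m in months]
```

On this noiseless toy data, `first_alarm(months, accelerated, 24, 2.5)` flags month 26, two months after the slope doubles, and the steady series never triggers. Real benchmark data is noisy, so the threshold would have to be calibrated against false alarms.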
B
Yeah, absolutely. They have a whole kind of interesting breakdown as to how they achieve this. The challenge with a lot of benchmarks is that they saturate. In fact, the challenge with all benchmarks is that they saturate. And so what you find is if you want to trace AI progress over the last decade, you kind of can't, at least not with benchmarks. I don't know, take MMLU, kind of the classic example. For a long time, yeah, that was a tough benchmark and we were making steady progress. But then two or three years ago, we just saturated it, and there's no more telling which model is better or worse based on that metric. So what they end up doing is coming up with a way to convert essentially every benchmark into something like an Elo score that gives an underlying indicator of its difficulty, the challenge level associated with it, and that lets them plot AI progress over a much more extended period. And using that technique, they're able to say, okay, here's what we could expect it to look like if we suddenly saw a kind of 2x improvement, within say two or three months, sometime in the next few years. Here are some of the tells. So it's a really interesting, call it a tweetstorm or an X storm, that's worth checking out if you're interested in kind of predicting the future of AI and.
A
Scaling trends. Onto policy and safety, where we mostly have safety. First up, an update to GPT-5: a system card for GPT 5.2. This is the release that OpenAI always does, where they highlight in particular the safety implications of the new models. With GPT 5.2 they show largely the same trends, I guess, as GPT 5.1: similar scores on various safety benchmarks, although some things are better and some things worse. For instance, on hate or illicit content, GPT 5.2 is somehow in some cases less robust, especially with the instant model. Also, in the assessments of potential high dangers, they are treating this launch as high capability in biological and chemical domains, activating their preparedness safeguards. And on various benchmarks on helping with lab issues, protocols, et cetera, it is steadily improving in the ability to help people, you know, do scary things with bio. The numbers are creeping up. They're not super large, but they indicate that this is a thing to keep an eye on.
B
Yeah, overall I think the system card is actually kind of a big deal, especially given that we're only incrementing a tenth of a point on the famous GPT scale. First of all, we're seeing a shift in how these models are being evaluated, moving more from the chat safety frame to an agentic risk frame. And yeah, as you said, for the first time they are categorizing this model as high capability in biological and chemical domains. That's under the Preparedness Framework. We've heard a lot that previous models were, quote, "approaching" that threshold, but now we're actually seeing the deployment with high safeguards active by default. That's a big deal. It means that the model can help with national-security-relevant stuff like lab protocols, agentic planning that could facilitate the making of hazardous substances, things like that. They are setting up a trusted access pilot that's going to restrict some high-level scientific reasoning capabilities to verified institutions and government bodies. How well you can actually do that, given that you've got guys like Pliny the Prompter finding ways to work their way around all the guardrails on these things, you know, we'll just have to see how that shakes out. But certainly that's the idea. Same story on cybersecurity. So in this case they're seeing an 82% success rate on all these professional cybersecurity challenges, which is a big jump from 5.1. It is the first model to show actual proficiency on this Terminal Bench 2.0 benchmark, which basically tests whether it can navigate things like a Linux terminal, use command line tools, and actually iterate on exploits without human intervention for long periods. If you look at GPT 5.1, it basically loses the thread on a lot of these logic loops after 30 to 40 minutes or so of autonomous tool use, whereas 5.2 is going to be able to close that loop for up to 14 hours without needing a human to nudge it or reset it.
So that's a big deal, that's a big jump in terms of autonomy. And similarly, you know, they're looking at very significant shifts in the overall approach to this model on the security side. So one of the key things is OpenAI has made a commitment, like Anthropic actually, that once they hit that high risk threshold, or whatever their equivalent of high risk is, they start to implement data center security protocols, things like air gapping and other measures. You know, one would hope that OpenAI has their game down pat when it comes to that. I will say it's really, really, really hard to figure out what you need to do to secure frontier lab infrastructure. This is something that I've been working on for three years with some very knowledgeable people, and no one really knows how to do it. There's an asterisk to that. But anyway, all this is part of the package when you increment to that high risk level. There's a whole story here about how GPT 5.2 actually sort of thinks in latent space, along the lines of that famous Meta paper on Coconut that we covered back in the day, right? It was called "Training Large Language Models to Reason in a Continuous Latent Space," where you basically never decode the final token from latent space into text space, but you feed it back into the bottom of the model so it can keep thinking in latent space. The idea here is OpenAI is saying, look, by thinking in latent space we're preventing the model from doing sycophancy, from trying to look like it's reasoning nicely while secretly plotting to kill you or something. Now the problem with this is, of course, the model can just do that in latent space. This does not actually solve the problem, it just makes it harder to see. At least that's the argument that I think a lot of people would make looking at this play. But that is nonetheless what is being done here.
And so, you know, we'll see if this is the right way forward. I mean, it certainly will be a temporary patch. One could be concerned about this sort of creating more problems than it solves in the long term. But, you know, that's for OpenAI to figure out.
A
And speaking of the model kind of showing you what it's doing, the next thing is a paper: "Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors." The gist is that you're able to check if a model is doing something bad by looking at the activations, the internal activity of the model. So that's a probe. And what this paper shows is that it's possible to train a model to essentially evade that kind of probe. You can train it, via a key phrase, to avoid being detected as using German, and then a few other things, like avoiding showing that it's doing HTML. Then that generalizes to avoiding showing that it's doing some hate speech or bio risk or whatever. So there's a way to circumvent this specific way of tracking whether a model is doing something bad or wrong.
B
Yeah, and what they're going to do is essentially fine-tune this model with a combination of two objectives. One is: can you fool the monitor that's looking at your activations to try to predict whether you're doing something safe or malicious? And then the separate objective is: can you make sure that your actual output text is the same as it would have been if you weren't making that effort to fool the monitor? So if you're going to write something evil, you're still going to write the evil thing, but you're also going to try to fool the monitor at the same time. And they find that when they actually run this test, it basically just works, is the long and short of it. This is a really interesting paper to put back to back with that discussion of GPT 5.2 and how it's reasoning in latent space, and the hope that that's going to help us with safety. What this paper is showing is that, at the very least, if you fine-tune a model, you can absolutely fine-tune it to have the same malicious behavior and fool all the probes that you stick on it, or at least many of them, much of the time. And often this is what you see: the models first demonstrate some kind of capability when they're fine-tuned for it, and then later it emerges naturally as a result of sort of standard power-seeking incentives, for example during in-context behavior. So putting these two ingredients together, I'm sure OpenAI, by the way, is thinking about this already. They're going to be considering a whole host of factors as they decide to reason in context. But this is basically what the threat model shifts to when you start to move into the in-context reasoning stuff.
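The two objectives can be written as one combined loss. The sketch below is an assumed form, not the paper's actual training objective: it treats the monitor as giving a probability that the concept is present, and measures "same output as before" with a KL divergence between the original and fine-tuned models' next-token distributions. All names and the weighting are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions (assumes q[i] > 0 where p[i] > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def evasion_loss(probe_score: float) -> float:
    """Low when the monitor's probability of 'concept present' is near 0."""
    clipped = min(probe_score, 1.0 - 1e-9)  # avoid log(0) when the probe is certain
    return -math.log(1.0 - clipped)


def chameleon_loss(probe_score: float,
                   base_probs: Sequence[float],
                   tuned_probs: Sequence[float],
                   behavior_weight: float = 1.0) -> float:
    """Objective 1: fool the activation monitor.
    Objective 2: keep the output distribution close to the original model's."""
    preserve = kl_divergence(base_probs, tuned_probs)
    return evasion_loss(probe_score) + behavior_weight * preserve
```

The loss is zero only when the probe sees nothing and the outputs are unchanged, which is exactly the "still write the evil thing, but fool the monitor" target described above.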
A
Right. And it's notable also if you're trying to track open source models: you know, there's no safety there. You can fine-tune away any safety checks, more or less. Next, more of a safety process paper, I suppose. The title is "Async Control: Stress-Testing Asynchronous Control Measures for LLM Agents." So what that means is they're looking into whether you can respond to agents doing bad things. They have this red team, blue team approach where a red team tries to bypass some security measure and a blue team tries to develop monitoring systems to detect malicious activity. And they show that this is a potential, I guess, broad strategy or direction for getting guardrails in place so that AI agents aren't able to go and mess with your system.
B
So the way that this works concretely is the blue team agent is deliberately made to be a weaker agent. So they use Sonnet 3.7 for that. And there's this red team agent that's given a task, but then it's also given an evil side task that it's trying to sneak by the blue team agent. And that red team agent is powered by Opus 4.1, so it's actually more capable and more scaled. And so the idea here is the blue team gets to look at the red teamer's, the evil agent's, logs after the fact. And the idea is you just get the prediction out there, so you don't induce any kind of latency for the user due to the security review process. But what you're going to try to do is make sure the blue team can review the logs and then intervene as quickly as possible if it goes, holy shit, 30 seconds ago the red team guy really snuck one by me, and now, you know, his evil plans are on their way to being implemented. So this is a way to simulate what a loss of control scenario might look like. You have the red team agent subtly malicious, subtly trying to do evil things. And then the blue team agent is trying to do after-the-fact review, minimizing what's sometimes referred to as the safety tax or the alignment tax. Basically, this is the capability or latency decrease that happens when you add safety measures. And so they're just saying, hey, you know what, just go ahead and send it. Have the main agent send the response right away, and then we'll review the logs in post. We could get into the weeds a little bit more, but they have a whole bunch of interesting tasks, like setting up a Kubernetes environment, ML infrastructure tasks where the main job is the agent is asked to deploy some monitoring stack like Prometheus or configure a cluster, and the side task it's given is sabotage. So it's like, while you're setting up your infrastructure, try to secretly exfiltrate a bunch of AWS credentials. Right?
That's an example of one of these things. And so they have the blue teamer try to catch it. The mechanism they use for this is the red teaming agent is going to try to sneak something into the system prompt. So the blue teaming agent doesn't know that's how it's happening. It just knows that this is an untrusted agent. And they find that one of the key factors that determines whether or not this whole scheme works is the timing. So they assume in their model there's kind of a delay before effect: the evil model will get its output out into the world, and then there's a certain amount of time that has to pass before that output has an effect. And that amount of time is a crucial factor in determining whether this overall approach works. Because if it takes five seconds for this evil prompt to take effect, then that's just too short for the blue teamer to detect and then act on it. Whereas as you start to extend that period of time, you get the ability to radically decrease the effect. So it's a toy model. It's meant to start us thinking in the direction of how this might work in practice if we assume that we can't afford the latency that comes with real-time monitoring and real-time intervention, and instead we just have to get these things shipped. And so I think it's a really interesting paper for that reason.
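The timing argument can be made concrete with a few lines of arithmetic. This is an illustrative toy under assumed numbers, not the paper's model: an attack is prevented only if the monitor flags it and the review time plus the response time fits inside the delay before the malicious action takes effect. The `Attack` fields and latencies are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attack:
    detected: bool           # did the after-the-fact monitor flag the logs at all?
    review_latency_s: float  # seconds from the action to the monitor's verdict


def prevented(attack: Attack, time_to_effect_s: float,
              response_latency_s: float = 5.0) -> bool:
    """True if the blue team both notices and intervenes before the harm lands."""
    deadline_met = attack.review_latency_s + response_latency_s <= time_to_effect_s
    return attack.detected and deadline_met


def prevention_rate(attacks: List[Attack], time_to_effect_s: float,
                    response_latency_s: float = 5.0) -> float:
    """Fraction of attacks stopped in time for a given delay before effect."""
    stopped = sum(prevented(a, time_to_effect_s, response_latency_s) for a in attacks)
    return stopped / len(attacks)


# Four hypothetical attacks: three detected (at different speeds), one missed.
attacks = [Attack(True, 20.0), Attack(True, 60.0),
           Attack(False, 20.0), Attack(True, 300.0)]
```

With these made-up attacks, a 5-second delay before effect stops nothing, a 30-second delay stops one in four, and an hour stops every attack the monitor detects (three in four). The undetected attack is never stopped, however long the delay, which is why detection quality and timing both matter.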
A
And one last real quick story related to policy. Google is powering a new US military AI platform. So the Department of Defense, or War, has announced GenAI.mil, which allows military personnel to access Gemini for basic non-aggressive stuff like summarizing handbooks, compliance, and so on, dealing with unclassified work. So Google is actually working with the military, which used to be a big no-no in Silicon Valley. And with that we are going to wrap things up. Thank you so much for listening to this week's episode. As always, we appreciate you sharing, reviewing, commenting and so on, but especially tuning in week to week.
C
Break it down Last week in AI come and take a ride get the low down on tech and let it slide Last week in AI come and take a ride Up a labs to the streets AI's reaching high new tech emerging Watching surgeon fly from the labs to the streets AI's reaching high algorithm shaping up the future sees Tune in tune and get the latest with ease Last weekend AI come and take a ride Hit the low down upon. From neural nets to robot the headlines pop data driven dreams they just don't stop Every breakthrough, every code unwritten on the edge of change with excitement we're smitten from machine learning marvels to coding kings Futures unfolding see what it brings.
Date: December 25, 2025
Hosts: Andrei Karenkov & Jeremy Harris
Main Theme:
Episode #229 dives into a week packed with high-impact AI news, major model releases, industry power moves, advancements in global AI policy, and deep research insights. The hosts discuss Google's Gemini 3 Flash launch, OpenAI's expanding app ecosystem, cutting-edge open source models, and critical updates on the China–US AI hardware race.
[02:07–09:26]
“You give it extra RL training on coding specifically. That really tells you the focus here. The tokenomics looking really good… the war of the models continues late into December.”
— Jeremy [04:25]
[09:26–19:20]
“If you tuned out in August and you’re waking up today... holy shit, do things look different. This is a good moment to reassess your cyber posture.”
— Jeremy [17:49]
[19:20–23:02]
“You can produce any text on anything, in any... We'll see what the vibe check is, obviously, but we're clearly on the way there.”
— Jeremy [20:46]
[23:16–42:50]
[42:50–49:13]
[49:13–52:31]
[52:31–63:10]
[63:10–71:39]
[72:57–84:38]
| Timestamp | Speaker | Quote |
|-----------|---------|-------|
| 04:25 | Jeremy | “You give it extra RL training on coding specifically...” |
| 17:49 | Jeremy | “If you tuned out in August... holy shit, do things look different.” |
| 20:46 | Jeremy | “You can produce any text on anything, in any...” |
| 34:56 | Jeremy | “Classic Chinese industrial espionage story...” |
| 49:13 | Andrei | “Lovable is sort of a winner in the vibe coding space...” |
| 67:58 | Jeremy | “So yeah, kind of, kind of exciting if you’re into agents, right?” |
Technical, fast-moving, and wrestler-in-the-ring competitive, this episode spotlights a seismic week in AI, where the gap between tech titans and international rivals narrows, safety debates deepen, and open source continues to accelerate. The hosts’ banter, memorable analogies, and sharp industry context make this summary essential for anyone tracking the leading edge of artificial intelligence.