Andrei Karlenkov
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we will not be covering in this episode. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade. And back with us after a hiatus is our other regular co-host.
Jeremy
It's nice to be back in the box.
Andrei Karlenkov
Yeah.
Jeremy
Hey everybody, thank you for your patience while I ducked out for that quick three-month period. Yeah. Wow. What a three months it was. And I'm super pumped to be back. This is something that was missing from my life and schedule in a big way, so it is great to be back in the seat. What a week to be back: DeepSeek 3.2, all these TPU stories, an insane set of model releases. Not a huge number, but a couple of really big ones, and then quite a cluster of other things too. So I really appreciate everybody being patient, and obviously there were some great podcasts in between. Really thrilled to be back. This is going to be a lot of fun.
Andrei Karlenkov
Yeah. For anyone who has been listening or is going to go back in the archives, it's been an interesting couple of months, because I've just wound up rotating in different co-hosts and recording at random every week or two. So I think we are going to do our best to actually get back on schedule and return to your regularly scheduled programming. And Jeremy, I feel like what you missed in the last few months was the latest round of frontier model releases: Opus 4.5, Gemini 3, GPT-5.1.
Jeremy
Can't remember if it's like Star Wars, where they freeze the dude and reanimate him. I'll say Austin Powers, because I know it happened there. But you know, they thawed me out and it's like, whoa, all these models happened. So I'll be honest with you: things have been so busy in the last three months that I have not looked into the big frontier releases. I have to go back and see what the hell happened with Anthropic, with OpenAI, with Grok, you know, all these things. And obviously Google. So I'm excited to just learn by osmosis as I get it from you. Obviously I'm all caught up and read the papers from last week, but everything before that is this strange wilderness. So it'll be an interesting catch-up time. I hope everybody is in for a lot of dumb questions from me, because that'll be a big part of the show.
Andrei Karlenkov
And we are doing 30 news stories this episode, so we'll as always try to keep it to an hour and a half. We'll see how it goes. To give a quick preview of what we're talking about: we will be starting with DeepSeek 3.2, the big model release, I guess part of that frontier model release cycle. Beyond that, more models for images and video this week. In applications and business, we've got a whole bunch of stuff: some developments on the hardware front, some new startups raising money, stuff going on with OpenAI and Anthropic. We're going to have to race through it. Then we will have a couple of open source stories and research stories, which are just interesting, looking into the developments and a bit into the future of where we might go, and then a few stories on policy and safety to round things out. So it should be a fun episode. Let's dive straight in with tools and apps, beginning with DeepSeek 3.2. This is the new development from DeepSeek, the developers of DeepSeek V3, which is their LLM comparable to GPT and Claude. Then they have DeepSeek R1, of course, which kicked off 2025 and at the time was a massive deal. They showed that with an open source model they could train it to reason as well as, or comparably well to, o1 and some of these other models, at a time when the movement towards reasoning and test-time compute scaling was just beginning. Now we've seen all the other labs do that as well. Right now everything is thinking, everything is reasoning: Opus 4.5, Gemini 3. Gemini 3 doesn't allow you to query it without having some reasoning at this point. And now we have DeepSeek 3.2, and it is, as you might expect, actually quite a big leap. I don't know why they haven't called this DeepSeek 4. I think they could have. Lots of big headline stories here. So first of all, 50% cheaper.
It's way cheaper than these other offerings, compared to Anthropic for instance. Super affordable, and it performs great on the benchmarks. On some of them you can even say it's outperforming GPT-5, or just neck and neck, and even better in some cases. There are some interesting technical stories with the release. As before with DeepSeek, they give us a lot more insight into what they're doing and how they achieve these results compared to the other developers. So we know details like sparse attention, which makes it not just cheaper but also faster. We know some of their training details, like how they have refined the reinforcement learning objective, and some of the learnings about, for instance, the optimization method you use and stuff like that. I don't want to dive too deep into it, but this is DeepSeek 3.2. They are working towards DeepSeek R2, which is meant to be the hardcore reasoning model. So this is not even the reasoning one. This is not like Deep Think. This is the Opus 4.5, or whatever you want to call it: the base chatbot model, not the reasoning-maxed model. So very exciting. And this is open source again, so people will be able to build on this, fine-tune it, et cetera.
Jeremy
Yeah, it's actually, you know, I remember, God, I want to say this time last year, when DeepSeek came out with V3, the base model.
Andrei Karlenkov
At the time, late December.
Jeremy
Yeah, right. And at the time, actually, we talked about it on the podcast and said, wait for it, there's going to be a big reasoning breakthrough from DeepSeek. People are underpricing DeepSeek right now. And lo and behold, R1 came out. The reason it was so easy to make that call at the time was that your base model is a big determining factor in the quality of the reasoning model you train downstream from that. And if you want to hear it from someone who knows infinitely more than us, just listen to the Dwarkesh podcast with Ilya Sutskever that recently came out, where Ilya is making the case, and his entire company is based on this assumption, that there's something wrong with pre-training. That's the reason that we're hitting, in his view, not a full-on wall, like things will keep improving, he thinks, but ultimately they won't improve all the way to ASI, AGI, whatever your definition is, unless something fundamental changes with pre-training. So pre-training is a big part of the story. In fact, it's a huge part of the story. This is a new base model. To your point, it is a bit weird that we're not calling it V4, especially since it was R1 and now we're doing R2. So they are incrementing the integer on the R version but not the base model, which is weird. But anyway, we'll set that aside. They get a pass on this. So there are a couple of big things they do here. DeepSeek Sparse Attention is one of the big breakthroughs. I'm just going to quickly gloss over this. This is probably going to be the one paper that we do a deeper dive on, just because of how much there is and how much transparency we're getting into the makings of this, because of course it is DeepSeek. So one piece here is DeepSeek Sparse Attention. Okay, what happens when you do a normal transformer architecture?
You have your attention mechanism that's going to pay attention to all of the tokens in the input sequence, and then it's going to figure out which tokens need to be weighted more heavily based on their relevance to the token that you're currently attending to. What they're doing here is saying, well, this is a very expensive calculation. You need to compute attention weights accurately for all tokens. Why don't we instead train a lightweight indexer just to get a rough idea of the attention scores, an approximate attention score, and only keep tokens with high indexer scores, in other words, tokens that we expect to have high attention values? And then you just toss out all the tokens that don't fit that criterion. This indexer, by the way, is very lightweight. It has a lot fewer attention heads: around 8 to 16 instead of 128, which is what you see for the full-on attention mechanism. It's lower precision, so FP8 instead of FP16 or BF16 for the full-scale thing. All these things are being done to make it super, super lightweight. Very quick approximation: which tokens should I pay attention to? Ditch the rest and keep just those. And so they end up keeping about 2,000 of these tokens regardless of the length of the input sequence, which is really interesting, right? You might expect that for a fixed number of retained tokens, in this case again about 2,000, you'd end up discarding more information the longer the full input sequence is. You're only keeping 2,000 out of a very large number of tokens. If you have a full context window of 128,000 tokens, you're really keeping less than 2%. It turns out, though, that this does not lead to a significant performance degradation, even on long-context tasks.
And the reason ultimately is that, you know, information is very sparse in context. There are very few tokens in practice that you really need to inform that next prediction. And in practice, 2,000 tokens is like a couple of pages of text. So you actually do end up having quite a lot of information to base it on, I think.
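To make the indexer idea concrete, here is a minimal Python sketch of top-k sparse attention for a single query. It is an illustration under stated assumptions, not DeepSeek's implementation: the function names, the toy vectors, and the single-head dot-product indexer are all hypothetical, and the model described above uses several low-precision indexer heads and keeps roughly 2,000 tokens rather than 2.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(query, keys, values, index_query, index_keys, top_k):
    """Attend only over the top_k tokens picked by a cheap indexer.

    index_query / index_keys are small vectors standing in for the
    lightweight indexer (an assumption made for this sketch).
    """
    # 1. Cheap, approximate relevance score for every token.
    index_scores = [dot(index_query, k) for k in index_keys]
    # 2. Keep the top_k positions, restored to their original order.
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    keep = sorted(ranked[:top_k])
    # 3. Full-precision attention, restricted to the kept tokens.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in keep])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]
    return keep, output


# Toy example: four tokens, keep two of them.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
index_keys = [[0.9], [0.1], [0.8], [0.0]]
kept, out = sparse_attention([1.0, 0.0], keys, values, [1.0], index_keys, top_k=2)
print(kept, out)

# Fraction of a 128,000-token context retained when keeping 2,000 tokens.
print(2000 / 128000)
```

The cost of step 3 depends on `top_k`, not on the sequence length, which is where the savings come from; step 1 still touches every token, which is why the indexer has to be cheap.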
Andrei Karlenkov
Yeah. In some sense this is actually attention, right? Whereas transformer attention is like, look at everything and then keep everything in mind all at the same time, even if it's 2,000 pages or whatever. This is attention in a more human sense: pick out information to process and then process only that information.
Jeremy
That's a great... I haven't heard that point made. That's a great point. Right. Because in a way, it's sort of like a human being reading a book and being like, okay, you need to memorize every single page. Except also remember that page one is really forgettable, but you need to remember page one, just know that it's super forgettable. What they're doing here is saying, ah, fuck page one. Throw it out completely. Just keep those pages. Much, much more compute efficient. And by the way, classic DeepSeek thing, right? I mean, when you think about what made DeepSeek V3 a really capable model, it was all these little optimizations. It's almost the software engineering of making these models go that is the secret sauce behind a lot of these releases. As well, there's a new RL paradigm that they use that's got a whole bunch of things going on under the hood, one of which is just scaling the crap out of their RL compute budget. Their RL compute budget is about 10% of the total compute budget dedicated to training this model. That's a lot. You'll note, of course, that Grok has actually hit, they claim, a 50% RL compute budget. So it's not unprecedented in the whole space of frontier models, but it's the first time we're seeing an open source model with anything like that kind of RL compute. So that is a big deal. There's a whole bunch of training stability improvements. When you do RL at this scale, RL is notoriously fickle. It's really, really hard to get a stable training run out of your RL. And so there's a whole bunch of things that they do. There's a trick called off-policy sequence masking, which, roughly speaking, involves doing a bunch of RL rollouts, so getting the model to produce a bunch of solutions, and throwing out solutions that lead to incorrect answers and are also really inconsistent with what your base policy, your initial model, would have produced.
And this is basically just to avoid situations where there's a question like, what is two plus two? And the response that you get is like, you know, banana, cucumber, polywoggle, right? It's a completely implausible and wrong answer, so there's nothing to really learn from there. And so what they'll do is keep answers that are wrong if they look plausible under the original model. So for what is two plus two, the response five would actually get retained as an incorrect response. You can actually learn from that. It's clear enough that it's not just noise. So that's another kind of micro-optimization here. The only other one I'll mention is this Keep Routing operation. So it turns out that you have to do a lot of inference to do your RL loop, right? That means generating all those rollouts, all those solutions that you're going to get the model to produce and then ultimately train on later. People tend to use different frameworks to do training and inference, and in this case they're using a mixture-of-experts model. So essentially a whole bunch of experts, little sub-models, mini models. A query comes in, and there's a router that sends the query to a subset of those models. And it turns out that during training you sometimes end up accidentally routing that query to different experts than during inference. So you do your inference, you calculate rollouts, and you're like, oh, this rollout was wrong. Okay, so I need to punish whatever expert produced this. Well, if your inference produced that rollout from a different expert than what's actually being trained on, you're going to end up propagating the corrections, the gradients, to the wrong experts. And so what they're doing here is just making sure that the expert that creates the rollout is the same one that's having its weights updated in the training process.
Again, it doesn't tend to matter at smaller scales, because there it's more of a fringe issue. But when you're doing it at large scales, it leads to huge training instability. There's a whole bunch of stuff here too around large-scale agentic task synthesis, and they have a pipeline for it that we're not going to get into. But the bottom line is this is a very, very impressive model. Maybe one issue that they're flagging is poor token efficiency compared to Gemini. In other words, the average number of tokens that it generates to solve a problem is actually quite high. It's a very verbose model. They haven't found a way to get it to reason as efficiently per token. Each token that it generates, each word that it generates in its output, is in a sense contributing less to productive thinking than what you see in a Gemini model. So they flagged that as a thing that needs to improve. But look, I mean, zooming out. Holy shit. Just look at the benchmarks.
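The two stability tricks described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DeepSeek's code: the rollout fields, the `kl_threshold` parameter, the surrogate loss, and the router helpers are hypothetical names chosen for the example. The first part drops rollouts that are both penalized and far from what the original policy would have produced; the second reuses the experts recorded at rollout time instead of re-running the router during training.

```python
def sequence_mask(advantage, kl_from_old_policy, kl_threshold):
    """Return 0.0 to drop a rollout from the loss, 1.0 to keep it.

    A rollout is dropped only when it is both wrong (negative advantage)
    and implausible under the policy that generated it (large divergence).
    """
    if advantage < 0 and kl_from_old_policy > kl_threshold:
        return 0.0
    return 1.0


def masked_policy_loss(rollouts, kl_threshold):
    """Average a simple policy-gradient surrogate over the kept rollouts."""
    total, kept = 0.0, 0
    for r in rollouts:
        mask = sequence_mask(r["advantage"], r["kl"], kl_threshold)
        total += -mask * r["advantage"] * r["logprob"]
        kept += int(mask)
    return (total / kept if kept else 0.0), kept


def route(router_scores, top_k):
    """Pick the top_k experts by router score, returned in index order."""
    ranked = sorted(range(len(router_scores)), key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:top_k])


def experts_for_training(train_router_scores, recorded_experts=None, top_k=2):
    """Reuse the experts recorded at rollout time when they are available."""
    if recorded_experts is not None:
        return list(recorded_experts)
    return route(train_router_scores, top_k)


# "What is two plus two?" with three sampled answers.
rollouts = [
    {"answer": "banana", "advantage": -1.0, "kl": 5.0, "logprob": -20.0},  # wrong and implausible
    {"answer": "5", "advantage": -1.0, "kl": 0.2, "logprob": -1.0},        # wrong but plausible
    {"answer": "4", "advantage": 1.0, "kl": 0.1, "logprob": -0.5},         # correct
]
loss, kept = masked_policy_loss(rollouts, kl_threshold=1.0)
print(kept)

# Slightly different numerics in the two frameworks can flip the router's choice.
inference_experts = route([0.10, 0.50, 0.40, 0.00], top_k=2)
print(route([0.10, 0.50, 0.39, 0.41], top_k=2))
print(experts_for_training([0.10, 0.50, 0.39, 0.41], recorded_experts=inference_experts))
```

In the toy run, the "banana" rollout is masked out while the plausible-but-wrong "5" still contributes to the loss, and the recorded routing keeps the update on the same experts that actually produced the rollout.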
Andrei Karlenkov
This is a wild one, right? And we positioned this initially as the base model, not the R2. But we should also say that, like GPT-5, like Sonnet 4.5, like Gemini, this is really an integrated model. So they have DeepSeek V3.2. They also have DeepSeek V3.2 Thinking. And they also have this DeepSeek V3.2 Speciale, which is experimental. I need to look up exactly what they say, but it's basically a preview of R2. To investigate the potential of extended thinking, they say, "We also developed an experimental variant which was trained exclusively on reasoning data with a reduced length penalty during RL." So we get a little preview of this model, which they're not planning to keep on their endpoint. And yeah, if you look at the benchmarks, AIME, you know, HLE, Humanity's Last Exam, they are near the top, beating GPT-5 High, beating Claude 4.5 Sonnet, getting near or in some cases beating Gemini 3 Pro. So it really is state of the art. As you mentioned, one of the limitations is the token use efficiency, which we've seen with reasoning in general as a potential failure mode if you optimize for reasoning. They also mention a fun detail in the conclusion. They acknowledge certain limitations when compared to frontier closed-source models. First, due to fewer total training FLOPs, the breadth of world knowledge in DeepSeek V3.2 still lags behind leading proprietary models. So, to recap: first, they weren't able to train as much, so it doesn't know as much even though it's as smart. Second, the token efficiency is a challenge. And third is complex tasks, where they're still not quite as far along. They're saying these are the things they're working on, and presumably that's where DeepSeek R2, or whatever you want to call the next model, will come in. A couple of other interesting factors before we move on.
So this is a more general model, in the sense that R1 introduced this paradigm of training with verifiable rewards on code tasks and math tasks in particular. Right, that's the beginning of reasoning. And then the other thing that happened throughout 2025, I would say, is the move towards actual agents, the move towards full-on tool use. The way we got to Claude Code, to these coding agents, wasn't just reasoning. It was also optimizing them for tool use, for operating in an agentic environment. And that's where the task synthesis part of this comes in. They have tens of thousands, over a hundred thousand, tasks with environments that they synthesized: for coding, for search, for code interpreting, and for general tasks. So aside from code, also search, presumably for research, deep research, although I don't think they mention this here. For tool use, it is again not quite as good as the frontier models, but very good. And I think it's interesting to compare it to Kimi K2 Thinking, which also recently came out. On things like Humanity's Last Exam and MMLU-Pro, just the broad intelligence evaluation criteria, they aren't super far ahead. Kimi K2 Thinking is also a very intelligent model that we recently discussed. But when you look at the difference in tool use and in agentic execution more broadly, this is where I think it really shines.
Jeremy
Yeah, absolutely. And to that point, right, the RL stage of their training loop is very interesting for what it's trying to do with respect to catastrophic forgetting, which is one of the classic challenges you run into. You want to train a thinking model that will reason well. And then you also want to train tool use, and then you also want to train coding. And the problem is, if you layer these in one after the other, the model will catastrophically forget. In other words, if you train it to do coding after you train it to do reasoning, it will forget some of the reasoning and then pick up some of the coding. The solution that they work into their training loop is to use what they call mixed RL training. Instead of separating out reasoning training, agentic training, and then alignment training, they actually merge everything into one RL stage, and they use specialist distillation for this. So they train a bunch of these domain specialist models with heavy-duty RL compute. They train a math specialist, a code specialist, an agentic specialist, and, to your point, a search specialist too. Then they use those specialists to generate high-quality training data for the final generalist model. And that's all layered into this mixed process to avoid the impact of catastrophic forgetting. So kind of interesting in that sense too, this sort of rethinking of... I've lost track of what we're supposed to be calling pre-training anymore at this point, but the rethinking of some aspect of that training loop.
Andrei Karlenkov
I think pre-training is still considered the next-token prediction stage, where you ingest all of the Internet.
Jeremy
Yeah, that's what it's always been to me. I've just seen papers now where they're like, oh my God, let's add RL to pre training.
Andrei Karlenkov
Let's do this and that. And then what? What is pre training? Post training, mid training, nobody knows.
Jeremy
Fine tuning. There's like this gradual shift in the type of data where you're moving from what once we would have called pre training to something that looks more and more like fine tuning, but they still call it anyway. It's a whole thing anyway.
Andrei Karlenkov
But yeah, the takeaway is DeepSeek V3.2 is very good. "Reasoning-first models built for agents", that's what they call it. So it is basically a successor to R1 for all intents and purposes, and another shot in the trajectory of open source models, very much keeping neck and neck with frontier models. And I think you mentioned Ilya's interview with Dwarkesh. In case our listeners haven't heard, there have been some very good interviews lately on the Dwarkesh podcast with Ilya Sutskever, and Satya Nadella, and Karpathy. A lot of high-level discussion of where we are right now and where we are going. And there is kind of a consensus forming that our current recipe isn't complete. We need some new ideas, something big to change up the recipe. And the way that Ilya put it, which is very popular, is that we're moving from the age of scaling to the age of research. When you look at this technical report, which is fairly detailed and has a lot of these different bits and pieces that they put together, it makes me wonder whether DeepSeek is more in the age of research than OpenAI and Anthropic at this point. Because OpenAI, Anthropic...
Jeremy
When I look at this through that lens, what Ilya is pointing out is that we keep finding more clever ways to make things scale that don't hit general reasoning. And what he's really getting at is fundamentally a kind of sample efficiency argument: it's not just about doing more of the same thing. Essentially, what this DeepSeek paper is doing, I think Ilya would say, is kind of a waste of time through the lens that he's using, where they're saying, okay, let's continue to pepper this model with even more examples, like a million different coding problems, to train it to solve those coding problems. But he's really focused on this whole out-of-distribution generalization thing that just doesn't seem to be getting cracked by the current paradigm. By the way, Anthropic and OpenAI are, at least publicly, very much on the "scaling seems to continue to work" side. So yeah, we'll see.
Andrei Karlenkov
It's a spectrum, right? A spectrum of how innovative or different you are. Sparse attention in particular, I think, is quite interesting. When we get to research, we'll be talking about some work from DeepMind which I think is very much along these lines and very interesting. But for now, we finally move on and try to get through more stories. So next up we've got Black Forest Labs. We haven't talked about them in a minute, but they have released Flux 2, the next generation of their image generation and editing system. So this is another thing that's pretty noteworthy. Over the last couple of months, with GPT-5 image and especially Nano Banana and Nano Banana Pro, we've seen a quiet revolution in image generation and editing capabilities, in a way that I think most people hadn't predicted. Now these models are able to synthesize very, very precise, prompt-correct images. It's almost like you're getting to AGI for image generation, right? And Black Forest Labs is a startup that's been around for a while, spun out of Stability AI, and Flux for a long time was one of the leading text-to-image models. They also had open source variants. So with Flux 2 they are introducing basically, you could say, the Nano Banana generation, the GPT-5 image generation, of image synthesis for their system. Lots of details, as you might expect. They have a bunch of variants: Flux 2 Pro, the highest performance; Flux 2 Flex; Flux 2 Dev, which is a 32 billion parameter open-weight model; Flux 2 Klein, which they are aiming to release under Apache 2.0; and their VAE. So they are still doing this partially open source thing that they started with and have kept with. They say this wins out against various models, Qwen-Image for instance. As far as open source things go, you're probably not going to do better than this. Not up there with Nano Banana Pro, but much cheaper and faster.
So I think in the world of image synthesis, this is potentially, you could say, similar to DeepSeek 3.2 in the sense of the impact on open source and also on the competitive environment versus frontier labs.
Jeremy
I continue to be interested in the business model of open sourcing stuff and how viable it'll be in the future. I still don't see it, but I'd be very curious to see.
Andrei Karlenkov
I think the standard is open source for weaker stuff and keep the actual frontier stuff to yourself, which is also what Mistral is doing and everyone's doing for sure.
Jeremy
Yeah. I guess the thing I'm wondering about, especially on images, given what we've seen with Nano Banana, is: what is the ceiling on image generation capability beyond which people stop caring? I don't know. I suspect that it may actually turn out to be fractal, right? It may turn out that your image generation tool gets so good that you can get it to, I don't know, make a circuit diagram for you for a next-generation circuit. We may get into that space, in which case it gets more and more niche and potentially higher and higher TAM and value. But yeah, I guess I'm sort of curious. It kind of seems like, you know, you're drowning in a ship, the water level is rising as the open source models get better and better, and you can only move so high until you hit the roof of the ship, and then... This is a very dark metaphor, but I'm curious about how viable it ends up being. But I've also been saying that for like three years and this is continuing, so I don't know, dude.
Andrei Karlenkov
Yeah, yeah. I mean, looking at the numbers, they show that this is better than Nano Banana in terms of human preference, at comparable cost. Not quite as good as Nano Banana Pro, but much cheaper, four times cheaper. Nano Banana Pro is incredibly expensive. That seems to be because the way these models work now is they do reasoning. They don't just generate images, they generate kind of reasoning tokens, or whatever you want to call it. So there is a substantial amount of progress in the space, and it's exciting to see Black Forest Labs still be a player in that space and provide competitive pricing and options for developers to work with. Right now Nano Banana is definitely leading the pack, but this is an alternative option. Interestingly also, and we won't go on about this forever, they say they built Flux 2 on top of Mistral 3, using the vision language model based on Mistral 3. So a bit of an open source ecosystem building on top of each other vibe there. Moving right along, the next story is also about Nano Banana. The story is that Sora and Nano Banana Pro are being throttled amid soaring demand. So apparently Google and OpenAI have reduced generation request limits for these models due to high demand. Free users of Sora are now limited to six generations per day, and Google has decreased the free image generation limit on Nano Banana Pro from three to two per day. So just an interesting thing to observe. Presumably people are using these a lot, especially Nano Banana Pro. I could see it just being very useful for people to make presentations. Memes. A lot of people have been making memes with Nano Banana Pro, and it's very good. So it's noteworthy to see that, aside from chatbots, this is now a very significant part of the realm of AI.
Jeremy
Yeah. Bill Peebles, who heads up Sora at OpenAI, says free users are going to have six video generations a day on their end. His statement is, and we've seen this before from OpenAI, "our GPUs are melting." So very much...
Andrei Karlenkov
They love to say it on fire. Melting.
Jeremy
Yeah, I think that's kind of a YC-ism from back in the day. I heard that a lot. I think Paul Graham would say, you know, when your servers are melting, that's when you know you have product-market fit, whatever. So I think that's maybe the meme. But yeah, anyway, they're saying it looks like it may not actually be temporary, and it's possible that you'll just need to purchase additional generations as needed beyond that point, maybe indefinitely. So kind of interesting.
Andrei Karlenkov
Next we've actually got Mistral. So Mistral has released new open-weight frontier and small models. They have launched these in the Mistral 3 family and have made substantial improvements. These are generally of a smaller variety: 14 billion, 8 billion, and 3 billion parameters. And they come in several variants. They have the base pre-trained one, the Instruct variant, which is chat optimized, and the Reasoning variant, which is optimized for complex logic and analytical tasks. At that size, obviously, they're not going to be competitive with the latest generation, but they are on par with something like Llama 3 or Qwen3-Omni and other open source offerings. DeepSeek V3.2 and DeepSeek R1 are very big models, by the way. They are hundreds of billions of parameters. So at the smaller model scale, which is where Mistral has seemed to focus and specialize a bit more, these are actually useful offerings if you're trying to work in that range. We won't dive in, because we spent so much time on DeepSeek V3.2. But I think it's still worth noting that Mistral is doing a lot of development and releasing, and it's hard to know exactly where they're at. My impression is they do have substantial customers, at least in Europe, and I'm still rooting for them, even though I know, Jeremy, what you're going to say. I know you're going to say, I don't know how Mistral is going to stick around. Whatever.
Jeremy
You don't know that my opinions haven't changed in the last three months. But yes, they haven't changed in the last three months. But yeah, it's interesting. The one thing I'll say is that when you think about the scales here, it's important that they're hitting 3 billion to 14 billion parameters. That is in the sweet spot, right, for the open source developer community. When you look at these larger models, though they can be impressive, the V3.2s, the V3s and so on, they're just way too big for the average person to have them sit on their laptop or their one little GPU. So yeah.
Andrei Karlenkov
Next up, moving back to video: Kling's Video O1 launches as an all-in-one video model for generation and editing. So this is pretty interesting. This is Kling AI. They've been around for a while in the video generation space. And this O1 model is a unified multimodal video model for both video generation and editing, which to my knowledge is not something that you can do with Sora or any other big video tool. These are all generally for generation. Sora 2 is able to do very impressive generation. Veo 3 is able to do very impressive generation. But actually dealing with editing is a whole other thing. And here they mean editing in the sense of changing the weather and swapping protagonists. I think that is something you can actually do in Sora and Veo to some extent, since you can condition on various things, but the focus on unifying it in this one O1 model is significant. And they claim that it outperforms Google's Veo 3.1 and Runway Aleph in video creation and transformation tasks. So, you know, the video space is still very competitive. Veo 3.1 was kind of the king, along with Sora 2, but it's not going to be there for long.
Jeremy
Yeah, we actually don't have much information about this multimodal visual language model that they use to bridge between text and multimodal inputs. So it's kind of interesting in that sense: we don't really know how they're doing it, but it is pretty compelling. They show a pretty solid win ratio against Google Veo 3.1, as you said: a 62% win rate with 32% ties, only losing 6%. It's kind of similar with Runway Aleph. So it's definitely a marked improvement in quality, in addition to form factor and user experience, like the things that you can do with it. So you can upload up to seven images and tell it, you know, in Japanese anime style, this person should be wearing this outfit and the hat from this person should be on their head and so on. You can kind of see it come together. It is really impressive, the 3 to 10 seconds of video that you can get out of this. So just for a bit of context there.
Andrei Karlenkov
And next up, also on video, Runway has rolled out their Gen-4.5 AI video model, which again they're saying is outperforming Veo and Sora in independent benchmarks. So same deal: very high resolution videos, much more refined, dealing with things like physics, human motion, camera movements, cause and effect. We're getting into world models, arguably, with these video models. They're kind of, you know, actually more advanced than some people might have expected. And Runway is again a company focused entirely on video generation and editing. That's their bread and butter. So I wouldn't be surprised if, as far as being a tool that people actually use in their workflow, this will be more impactful than Sora or Veo.
Jeremy
And on the Artificial Analysis text-to-video leaderboard, it is still number one as of now, as of the time of recording. Number two is Veo 3, no audio, and number three is Kling AI. And these are all kind of within margin, within the 95% confidence interval. So it kind of is anyone's game at this point at the top.
Andrei Karlenkov
So moving on to applications and business. First up, Nvidia's partners are beginning to tilt towards Google's TPU ecosystem, with Foxconn reportedly securing TPU rack orders. So quick background: TPUs are Google's specialized chips for LLM inference in particular. That's tensor processing units, not transformer, and Google has been working on them for a decade. There was a bit of a stirring of drama on Twitter when someone pointed out that Gemini 3 was trained entirely on TPUs. Even though that isn't new for Google, I guess people have realized that Google has TPUs now and it might be a problem for Nvidia. So this is interesting in the sense that TPUs have largely been within Google's ecosystem. You've been able to pay for them through Google Cloud. Google has used them internally for training their models. The idea of Foxconn and Nvidia's partners beginning to work with Google on TPUs does seem interesting.
Jeremy
It's a big deal. Foxconn takes the GPUs that they get from Nvidia and then essentially packages them together into server racks that then go into the data center. So they kind of sit in between. You can think of them as sitting in the supply chain between the GPU companies, or the systems companies like Nvidia increasingly is, and the data centers themselves. And so, yeah, they're sitting here historically having been an Nvidia partner, now working with Google on their TPU deployments. Google, which is looking to make TPUs available in data centers for companies like Meta and others to compete directly with Nvidia. This is really interesting, right? This is the move of a company that rightly concludes, hey, there's a really big market here. Now, one thing I want to observe, not something that I've seen commented on much, but it is absolutely true and maybe the single most important economic factor when it comes to AI hardware today: Nvidia makes like 85, 90% margin on their GPUs. Okay. Google is selling their TPUs to companies that it will potentially end up partnering with. You think about their partnerships with a variety of different entities. There's Google DeepMind, let's focus on them because they're actually inside Google. Google DeepMind gets to use Google's TPUs at cost. At cost. So 90% margin becomes 0% margin. In other words, $1 of Google DeepMind compute translates into 10 times the amount of compute that $1 gets OpenAI, assuming OpenAI is going with Nvidia. That's a really, really huge deal. You can upend the scaling landscape completely, and you can get really thrown off if you're comparing apples to oranges on fundraising. So if OpenAI raises a billion dollars and Google throws a billion dollars at their development internally through DeepMind, very, very different consequences. Right. So that's all kind of part of what is in store here for companies that continue to rely exclusively on Nvidia.
And that's going to be an interesting thing to watch in the landscape. You know, Foxconn here is manufacturing for both Nvidia and Google. So they're in this unique position where they're hedging their bets, right? They're going to make money regardless of which platform wins. Another little detail here is they talk about this one-to-one supply ratio. So basically Foxconn ships one computing tray rack for every TPU rack that they get from Google. And that really suggests substantial, very structured orders rather than just experimental deployments. And so that's a big deal. Also TPUs, by the way, are more energy efficient than Nvidia GPUs at scale. As power starts to become that key bottleneck, this is going to be a big deal. So yeah, look for the landscape to evolve based on this. Don't write off Jensen, of course. I don't even need to say this, but Nvidia is a powerhouse and you better believe they're coming back.
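As a rough illustration of the margin arithmetic above, here is a minimal sketch. It assumes "margin" means the share of the sale price the vendor keeps as profit, and that hardware obtained "at cost" carries a 0% margin; the function name and the normalized unit cost are made up for the example, and real pricing is far messier than this.

```python
def compute_per_dollar(vendor_margin: float, unit_cost: float = 1.0) -> float:
    """Units of compute one dollar buys when the vendor keeps
    `vendor_margin` (a fraction of the sale price) as profit.

    Assumption: price = cost / (1 - margin), i.e. gross margin on price.
    """
    if not 0.0 <= vendor_margin < 1.0:
        raise ValueError("vendor_margin must be in [0, 1)")
    price = unit_cost / (1.0 - vendor_margin)
    return 1.0 / price


at_cost = compute_per_dollar(0.0)        # buyer pays only the production cost
with_margin = compute_per_dollar(0.90)   # buyer pays a 90% gross margin on top
ratio = at_cost / with_margin            # about 10x more compute per dollar

print(f"90% margin -> {ratio:.1f}x")
print(f"85% margin -> {at_cost / compute_per_dollar(0.85):.1f}x")
```

Under these assumptions a 90% margin corresponds to the 10x figure quoted, while an 85% margin corresponds to roughly 6.7x, so the multiplier is quite sensitive to where in that range the true margin sits.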
Andrei Karlenkov
The engineers at Nvidia are working overtime, you can be sure of that. And yeah, this is notable because TPUs, to my awareness, have not actually been used externally. This is kind of the beginning of targeting of external adoption of TPUs. For a long time, my perception of TPUs was as a competitive advantage for Google. They wanted to keep it in house because they could then price their LLMs cheaper, train them for cheaper, scale Google Cloud to be cheaper. So it's an interesting strategic choice also by Google to allow some competitors to potentially use them. For instance, Meta. You know, maybe it makes sense for Meta to be able to use them because Google isn't competing with Meta on the frontier model development.
Jeremy
I would also call back, right, to Satya, who I think said on his Dwarkesh podcast appearance, you know, people think of Microsoft as a software products company. We're not. In the future of AI agents, we are an AI infrastructure company. We are supporting the running of trillions of agents around the world. Well, if you look at Google, I mean, that's kind of where the margin is accruing, so far at least. Nvidia sure as hell seems to be enjoying a lot of margin. It's not obvious that OpenAI is. It's not obvious that Anthropic is; their margins may actually be slightly better, but still, the compute level seems to be really, really interesting. So if you're Google, you might just think, hey, can we be the compute infrastructure layer not just for ourselves, but for all these other companies, including behemoths like Meta? So we'll see. But that may be part of it.
Andrei Karlenkov
And the next story is very much relevant. Amazon releases an impressive new AI chip and teases an Nvidia-friendly roadmap. So this is from AWS, Amazon Web Services. They have unveiled Trainium3 at their recent conference, along with the Trainium3 UltraServer system. So this is their in-house hardware for model training. I don't think inference as much, but it's notably used by Anthropic in particular. They have a pretty deep partnership and I do believe Claude has been trained on Trainium chips to a significant extent. They've also teased the development of Trainium4, which will support Nvidia's NVLink Fusion technology, allowing interoperability with Nvidia GPUs. It's also interesting to me to see whether Amazon decides to try and expand beyond in-house. Or Trainium, again, could be a competitive advantage on the cloud competition front for providing training support. We are now seeing more competition in the fine-tuning space, with for instance Thinking Machines targeting that vector as opposed to in-house model development. Anyway, Trainium, again, is something Amazon has been working on for a long time. They haven't really made a dent in Nvidia, but I could see this being another thing to look out for.
Jeremy
Yeah, and the funny sort of reference here as well to Anthropic, right. They're saying AWS customers like Anthropic are going to be using this chip to significantly cut their inference costs. I believe they're running on Nvidia chips as well. Anyway, Anthropic's got a whole bunch of different frameworks they're having to accommodate, which is a really interesting challenge for them. And one assumes they're using AI to help them map their training frameworks onto those different platforms. But yeah, you know, the focus is everywhere you'd expect it to be with this: energy efficiency. So a 40% improvement in energy efficiency from the previous generation of Trainium chips. You mentioned, yeah, that they're for training. The inference line is called Inferentia. So there you have it. They've got those separate lines. And then there is also a big focus on not just logic, though that is four times faster, but also memory. There's four times more memory capacity, which is going to be an interesting option, especially as inference, as rollouts become more important. You're fitting more and more into memory.
Andrei Karlenkov
Moving on to OpenAI, they have declared code red as Google catches up in the AI race. Another fun conversation starter. So this is reportedly something that happened within OpenAI. Sam Altman has declared code red to improve ChatGPT, especially because of Gemini 3 and real statistics showing that consumers are moving towards Gemini. ChatGPT still dominates heavily. I think it's something like 80%. But Gemini is gaining, and Anthropic has been gaining rapidly in enterprise. So OpenAI has been losing out on enterprise for years, with Anthropic kind of getting that market. They have dominated consumer usage, but Gemini is crawling up and going towards something like 10, 15%. So I wouldn't be surprised if this did happen. You never know how serious a code red actually is. But OpenAI must be feeling a bit of pressure, no doubt.
Jeremy
I mean, you look at their position in the ecosystem, it is not what it once was. Anthropic, I think in November, I don't remember exactly because this was part of my dark time, you know, they're raising at like a $350 billion valuation from Microsoft and Nvidia. OpenAI is at 500 billion. Like, I'm old enough to remember when Anthropic was supposed to be an order of magnitude behind. They have caught up, and as you say, in the enterprise segment, which by the way probably is where you're going to go looking for your best margin as well in terms of value per query. That's an interesting challenge that OpenAI is going to face. So this is a no-fail situation for them, for sure. There's going to be a daily call, apparently, for those tasked with improving the chatbot, according to this internal memo. And Sam is encouraging temporary team transfers to speed up development. This is also, again, what would Ilya say? I think Ilya might say, well, this is the distraction game you get into when you're trying to build products rather than just taking a straight shot to ASI. They're just doing pure research, heads down, and you're seeing these kind of team transfers where it's like, yeah, you're having to defend your product position now because your whole thesis is based on scaling and you have to find, you know, the investment, the revenue to generate that next level of CapEx spend.
Andrei Karlenkov
Yeah. So in particular, Altman is saying that they'll be delaying initiatives like ads, shopping and health agents, and the personal assistant Pulse to focus on ChatGPT. And this is something that OpenAI has been doing this year: trying to expand with things like Pulse, where you get a daily update, or group chats, these much more product-level features as opposed to base model quality. GPT-5, I guess famously or infamously, when it did come out, people were like, oh, is that it? Like, we were waiting for GPT-5 from GPT-4 for like a year or a year and a half or something and it's barely better. But in any case, yeah, OpenAI has become a product company more so than an R&D lab and they potentially have lost some of their edge in being able to be at the frontier. So, you know, a code red could mean a lot, actually, in the startup space and we wouldn't be surprised if they do kind of gain the lead again. And on their competitor Anthropic, reportedly they're preparing for a massive IPO. So the talk here is that they have engaged the law firm Wilson Sonsini Goodrich & Rosati for the potential IPO. They're saying that they want to pursue private funding that would value it at above $300 billion, with a 15 billion commitment from Microsoft and Nvidia. This IPO presumably would be as soon as next year, which is, I mean, lots to say there from a business perspective and, I guess, also from a company-level perspective. Once you go public, you have public investors, and it changes the game. So Anthropic, you know, same. They've been kind of in the underdog position for a long time, but their position is starting to seem a bit stronger and they seem like they're on a push on that.
Jeremy
Yeah, from a sort of compute efficiency standpoint, or algorithmic efficiency standpoint, I should say, it does seem like they're certainly competing with and possibly exceeding OpenAI pound for pound. I mean, it is pretty wild what they've been able to pull off, especially the last year and a half. I feel like they've truly, truly ascended. So yeah, this would be a big deal. I'm curious, I'm not a lawyer, but I'll have to do a bunch of research to understand what the implications of Anthropic's public benefit corporation structure are with respect to an IPO, and what this means for their governance structure, which famously has a board of oversight. I think it's about half a dozen people who get to tell the company not to do things that violate its founding mission. A different spin on, you know, the structure that OpenAI had jettisoned quite famously. So yeah, it's interesting, and it gives them access to the famously deep capital markets of the United States just at a time when all of the scaled buildouts are happening. Right. So, you know, Anthropic, I think, committed to something like a $50 billion infrastructure buildout fairly recently. So, you know, this is what they need to bridge that gap.
Andrei Karlenkov
Yeah. And I think also points to like the private market may be tapping out at this point. Like OpenAI and Anthropic may have just sucked up all the VC money and now they need to IPO to get more money.
Jeremy
Do you remember like five years ago when VCs could not fund, you know, multi deca billion dollar raises like this is insane.
Andrei Karlenkov
Yeah. And speaking of raising money, going back to Black Forest Labs, along with the announcement of Flux 2, they have raised $300 million at a $3.25 billion valuation. So this is a big number. We haven't seen a raise in the hundreds of millions of dollars in quite a while. It must indicate that they are doing well on the business front. I don't think we have much insight as to their lead in the API space, but I would imagine they're doing well, and that's pretty much it. This is a Series B, it's roughly a one, one-and-a-half-year-old company, and that's a lot of money. We've got another startup, Paris-based AI voice startup Gradium. This is a seed round. So they have just emerged from stealth with a $70 million seed round. They are developing audio language AI models with ultra-low-latency voice responses. So $70 million for a seed round. These are ridiculous, bananas numbers that used to not happen before AI, and that has reduced as a trend since 2023, 2024, but here they were able to get there. So it's a little bit surprising to me, because ElevenLabs does have a pretty strong position in this space, and some other competitors as well. It must mean they have some very strong talent. Okay, now moving back to hardware, we've got some developments on OpenAI's buildouts. So they announced a 1 gigawatt Stargate cluster in Abu Dhabi back in May and that has actually begun construction. We've seen some photographs and so on, and there is some skepticism that they'll be able to reach the 1 gigawatt number very rapidly. They'll hit 200 megawatts initially, and I think, Jeremy, you have more thoughts on this front.
Jeremy
Yeah, I mean it's not atypical for first power to be pretty close to when you hit full scale. So 200 megawatts may actually be fairly close to when they do get to that 1 gigawatt. But basically this is from Epoch AI. It's a thread that they put together on X that goes over their assessment of how plausible it is that they'll hit the planned one gigawatt in time. And it looks like delays, basically. So, you know, when will Stargate in the UAE reach a gigawatt? They say that they don't see clear signs beyond 200 megawatts. Optimistically, they say eight more 100 megawatt buildings could start construction in December and take one and a half years, like the first two, to complete. That would put one gigawatt at Q3 of 2027. So this matters because when you look at the timelines of different labs, kind of years to get to their first 1 gigawatt, you see quite a bit of variability. But you've got, for example, xAI, which they've pegged at sort of early 2026. Mid-2026 would be OpenAI's Stargate in Abilene. And then Amazon and Anthropic in New Carlisle have got a build that would be early 2026. So that's beating OpenAI to the punch across both Abilene and the UAE Stargate optimistic scenario. So this is interesting because it means Anthropic really does seem to know what it's doing in terms of building fast. Like, this is pretty wild stuff. And you know, historically, look, the thing with these announcements too, keep in mind, is they get delayed. That's what builds do. The functional kind of process here is a neocloud or some kind of company will approach a lessor, basically a property owner, that will claim that they have access to enough power to build a data center or a bunch of them on their site, and it looks good. You check in with the local community. How much power can the transformers accommodate? Everything looks good.
And then you get started and you find out the lessor lied to us. They actually have this weird term in the contract that doesn't actually let them get the power in time. And this is what you see over and over and over. So a lot of what these companies are doing is getting really, really good at assessing how credible a site is. It may look good on paper, but in reality it's not. So that's certainly been the case in North America, where power is very scarce. I'm curious if this is in fact a delay in the UAE. I would have expected that problem to be less of an issue there just because they have such a surplus of power. So that is a bit of an update. I don't know nearly enough about the UAE's power situation, but that'll be something I'll be looking into in the next few weeks, I'm sure.
Andrei Karlenkov
A few more stories on the business front, now moving into partnerships, another trend this year. OpenAI's investment into Thrive Holdings is its latest circular deal. And this one is truly circular, it looks to me. So OpenAI has acquired an ownership stake in Thrive Holdings, a subsidiary of one of its major investors, Thrive Capital. Thrive Holdings is basically a private equity firm. It acquires companies that could benefit from AI in sectors like accounting and IT services. Apparently OpenAI will send employees to work within Thrive's companies to accelerate AI adoption. I was not aware that's a thing that happens, but that's interesting.
Jeremy
Yeah, it puts off those circular economy vibes that people have been talking about. And by the way, we don't know the terms of the deal. All we know is what you said. OpenAI is going to send employees and product teams to work with Thrive's companies. So cool. Apparently if that succeeds, then OpenAI stake will grow somehow and they'll get compensated for their services. Really unusual configuration, but we'll learn more over time.
Andrei Karlenkov
It could make sense, because one of the things you want to do as a tool provider, as a, you know, API provider, is get startups to use your tool, right?
Jeremy
Absolutely.
Andrei Karlenkov
To get adoption. So if you're able to send employees to go work at these companies to use OpenAI, great for OpenAI, you know.
Jeremy
Assuming you achieve lock in, I guess that's the big bet here.
Andrei Karlenkov
Yeah, yeah, it looks circular, but actually it's, it might be the opposite. Right.
Jeremy
100%. So actually, in general, I think the whole circular investment argument is a bit silly. There is real value being created here. Anyway, we could do a whole episode.
Andrei Karlenkov
We could do a whole episode. But anyway, it's a little nuanced. And OpenAI is also going to be acquiring Neptune, an AI model training assistance startup. So this is a startup specialized in monitoring and debugging tools for AI model training. They have previously collaborated, apparently, and Neptune is going to be going offline. The financial terms were not disclosed. So it's kind of interesting; I would have thought OpenAI already has mature infra of their own and wouldn't really benefit from something like this, but it seems that's not the case.
Jeremy
Yeah, apparently there's already been a collaboration between OpenAI and Neptune to build metrics dashboards that help OpenAI's teams build foundation models. And so this is even tighter collaboration. It's so fascinating and almost funny that you have such a niche use case with really I mean, the number of users for this is tiny, right? It's just the value per user is so insanely high. So that's really what this is all about. We'll see if it translates into faster development at OpenAI.
Andrei Karlenkov
We'll see. And then another acquisition, by Anthropic. They have acquired developer tool startup Bun to scale AI coding. So this is a major acquisition. They say that Claude Code has reached apparently a $1 billion annualized revenue run rate since launching earlier this year. So Bun is developing something like a JavaScript execution environment, something technical like that for running code. And in that sense, it seems like Anthropic is buying this to build the infrastructure for future software generation and basically double down on Claude Code and this kind of work. All right, last business story. Microsoft cuts AI sales targets in half after salespeople miss their quotas. So this is sales growth targets for its AI agent products, after many salespeople apparently failed to meet their quotas for the fiscal year ending in June. So these are the products that deal with multi-step tasks, doing autonomous execution, part of the big 2025 push from Microsoft and others, being added to Word, Excel, PowerPoint, Microsoft 365 Copilot, etc. So, you know, it tells you something. Hard to say if the goals were unrealistic in the first place or if adoption is indeed slow, but both are very plausible.
Jeremy
Yeah, absolutely.
Andrei Karlenkov
And now onto projects and open source. We begin with DeepSeek Math V2. This is slightly older than V3.2, so we kind of pushed it off. It was released on November 27th. And as it sounds like, this is DeepSeek's math-specialized model. And in fact, in the DeepSeek V3.2 report, they mention that they have incorporated the data from this into the training of DeepSeek V3.2. So it's a bit of a subset. Not too much for me to say here. Basically doubling down on math, doing a lot of self-verification training specifically for things like proof generation, and achieving some of these benchmarks, like gold-level performance on IMO 2025 and CMO 2024. Neck and neck again with Gemini and others on this frontier of math reasoning.
Jeremy
Yeah, the core of this is there's a generator and a verifier, which is a very standard setup. Of course, the generator generates solutions, the verifier checks them, and you sort of have this interaction between the two of them that improves them over time. The challenge you get into is that sometimes the generator can get a correct answer with incorrect reasoning, for example, and in those instances you need a way for the verifier to account for that in some way. When it's really just looking at the final answer, that doesn't tend to work well. So they developed this meta-verifier that they also train and then have folded into this loop, and they include its score in the overall reward signal for the verifier's training. And essentially what the meta-verifier is doing is confirming that the issues identified by the verifier before it produces its final score are actually real and that they justify the predicted score that the verifier gave. So it's sort of a who-watches-the-watchers thing. And then they get a bunch of human experts to score the quality of verifier analyses to create a meta-verifier dataset. Now, while they do train the verifier and the generator in tandem in this kind of generative, adversarial way, you can think of it that way, they don't continuously train the meta-verifier. And so that's kind of an interesting thing. I mean, you could think that eventually, at a certain point, you might need to start doing that, because eventually the generator and verifier just get so advanced that the meta-verifier can't keep up, but they're actually not doing that. That was kind of the most interesting omission, at least that I found in the paper. They've got a bunch of scores, really impressive scores, by the way. So on the Putnam 2024 exam, it scored 118 out of 120.
It solved 11 of 12 problems completely with just minor errors and it surpassed the highest human score of 90 by a wide margin. Right? 118. That's pretty wild. This is the premier undergrad math competition in North America, by the way. So pretty pretty.
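To make the meta-verifier idea a bit more concrete, here is a toy sketch of how its score could be folded into the verifier's reward. The function name, the 0-to-1 score scale, and the choice to multiply the two terms are all assumptions made for illustration; the discussion above only says the meta-verifier's score is included in the reward signal, not how.

```python
def verifier_reward(predicted_score: float,
                    reference_score: float,
                    meta_quality: float) -> float:
    """Toy reward for a single verifier analysis (all inputs in [0, 1]).

    predicted_score: score the verifier assigned to a proof
    reference_score: expert-assigned score for the same proof
    meta_quality:    meta-verifier's judgment of whether the issues the
                     verifier cited are real and justify its score

    Assumption: the two signals are combined by multiplication.
    """
    score_agreement = 1.0 - abs(predicted_score - reference_score)
    return score_agreement * meta_quality


# The verifier lands on the right score and cites real flaws.
grounded = verifier_reward(predicted_score=0.0, reference_score=0.0,
                           meta_quality=1.0)

# The verifier lands on the right score but cites made-up flaws.
hallucinated = verifier_reward(predicted_score=0.0, reference_score=0.0,
                               meta_quality=0.2)

print(grounded, hallucinated)
```

The point of the sketch is only that a verifier which reaches the right score for the wrong reasons earns less reward than one whose analysis holds up, which is the failure mode the meta-verifier is described as guarding against.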
Andrei Karlenkov
This is like genius level people, by the way. Like, like, when you say high school, we don't mean, like, we mean the real.
Jeremy
It's undergraduate math, right? So you're like, you're. Yeah, you're beyond the high school stuff and, and even like the IMO gold medal scores, right. They solved five out of six problems there. So this is crazy.
Andrei Karlenkov
Yeah. So this is coming pretty quickly after Google had a paper, Towards Robust Mathematical Reasoning. They also announced their IMO results with Gemini Deep Think. It was a big deal to reach the gold medal. It was actually the first time. And looking at the numbers, using the IMO-ProofBench that Google released just earlier, like a month ago: if you look at DeepSeek R1, it had 4% performance. If you go to DeepSeek Math V2 now, it's something like 70%. Right. This is a massive leap over where the models were a year ago in terms of this level of complex mathematical reasoning. Next up we've got a paper from Google, Evo-Memory: Benchmarking LLM Agent Test-Time Learning with Self-Evolving Memory. So going back to the note about Ilya's podcast conversation, this is I think going to be a trend in 2026 and in coming months, where people are increasingly thinking about memory and thinking about adaptation and basically going beyond what has been the paradigm for AI for a long time, which is you train the model, you deploy the model, and it does in-context reasoning. You give it some examples, maybe, and that's it. It doesn't ever have a long-term memory of any kind, for the most part. So this is addressing that problem of learning over a period of time. They have two different methods they evaluate here. So they have ExpRAG, which is a basic baseline that stores interactions as a structured record (input, output, feedback) and can retrieve similar past experiences as in-context examples. This is pretty normal, people do this. And they also have ReMem as a more sophisticated framework with a think, act, refine loop, where you can actually try to organize your memory: retrieve, prune, and reorganize your memory during inference rather than just treating it as a sort of, you know, big pile to throw stuff onto. Yeah, this is still pretty early.
So this is kind of a proof of concept almost, and an initial evaluation of this overall category of capability that LLMs really don't have built in, at least.
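For a sense of what the experience-retrieval baseline amounts to, here is a minimal sketch of a memory that stores (input, output, feedback) records and pulls back the most similar past tasks as in-context examples. The class and function names are hypothetical, and the word-overlap similarity is a stand-in assumption; an actual system would more likely use embedding-based retrieval.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    task: str      # the input the agent was given
    output: str    # what the agent did
    feedback: str  # what it learned from the environment


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a stand-in for embeddings)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class ExperienceMemory:
    def __init__(self) -> None:
        self.records: list[Experience] = []

    def add(self, task: str, output: str, feedback: str) -> None:
        self.records.append(Experience(task, output, feedback))

    def retrieve(self, task: str, k: int = 2) -> list[Experience]:
        """Return the k stored experiences most similar to `task`."""
        ranked = sorted(self.records,
                        key=lambda r: similarity(task, r.task),
                        reverse=True)
        return ranked[:k]

    def build_prompt(self, task: str, k: int = 2) -> str:
        """Prepend retrieved experiences to the new task as examples."""
        examples = "\n\n".join(
            f"Task: {r.task}\nAction: {r.output}\nFeedback: {r.feedback}"
            for r in self.retrieve(task, k)
        )
        return f"{examples}\n\nTask: {task}\nAction:"


memory = ExperienceMemory()
memory.add("put a clean apple in the microwave",
           "searched the counter, then the fridge",
           "the apple was in the fridge")
memory.add("turn on the desk lamp",
           "walked to the desk and toggled the lamp",
           "success")

top = memory.retrieve("put a clean potato in the microwave", k=1)[0]
print(top.feedback)
```

Note that nothing here prunes or reorganizes the stored records; it is the "big pile" style of memory, which is the gap the more sophisticated refine-loop framework is described as addressing.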
Jeremy
And to your point on trends, by the way, this is very much becoming a thing, and people are trying to take different bites at the apple here. There's a paper we'll be looking at in a couple of minutes that'll dive into this even more. But to make it really concrete, to give you a sense of the kind of thing that this benchmark, that the whole space right now is trying to solve for: think about a kind of task that someone might put to you, like put a clean apple in the microwave. If you didn't know where stuff was in your kitchen or whatever, you might look for the apple in different places and then eventually realize, oh, this person keeps their apple in the fridge. So you go get the apple from the fridge. So okay, cool, that's round one. If next someone asks you, okay, put a clean potato in the microwave, right? Same task, now you're being asked about a potato. Based on the previous task you've done, you really ought to have learned in context that, oh, interesting, the vegetables maybe are in the fridge, maybe fruits and vegetables are there. So let me turn to the fridge now instead of looking literally everywhere. And a bad model, or a model without this sort of active memory, would just again look over the counters, da, da, da, all these places before going in the fridge. And so if you then ask, put a clean tomato in the microwave, over time you're going to start to refine the rule in your head from, oh, the apples are in the fridge, to, actually, it looks like apples and potatoes. Oh, okay, it looks like all fruit and vegetables are in the fridge. And so that's the kind of massaging of this. It's not the weights of the model that are being updated, and it's not the attention values that are being updated with every single freaking token. It's almost an intermediary frequency update.
There are updates that are happening every so often to this memory that you're using to navigate the world. And so anyway, this is a bit of a teaser for the paper we'll talk about later, where different frequencies of updates, mechanisms that learn at different rates, are hypothesized to be really important for this sort of in-context learning.
Andrei Karlenkov
And I just want to mention this kind of thing broadly like memory frameworks has been a thing that's been coming about as actually startups dealing with memory maintenance and so on, but it is all rather ad hoc, like it's giving agents tools and telling them to store and update and so on. This is an example of that paradigm. And now as you move into research and advancements, the paper that we'll be focusing on is called Nested the Illusion of Deep Learning Architecture, which is a very interesting title that basically presents the alternative option which is instead of trying to add memory on top of a neural net on top of model, where it's told to now write down its memory and then later retrieve it and refine it and whatever, what if its core component of the way the model itself works. So essentially they position it as looking at current LLMs. They have this property of only ever experiencing what they call the immediate present. So once you finish Pre training, right. You have ingested all of the Internet and the model sees its input, the context, and produces an output based on that context. That's how the model works. There's no continual learning, there's no continual updating of anything within the model. At best, what you can do is store something external to a model and then retrieve it and put it into the input. And that's what RAG does, that's what memory frameworks do, et cetera. So there's a big question that's been a big question. Continual learning in general has been a topic in machine learning for decades and has been a topic in Transformers as well and large language models as well in research for the last couple of years, but not sort of a priority. 
But this paper is showing, from DeepMind's perspective, their kind of latest effort on this front: how do you make a neural network architecture and a training paradigm that encode continual learning and different levels of memory within the actual model itself, in the neural net weights? At a high level, the way this works is a couple of things. So first they have this notion of nesting. If you look at a typical transformer, it's kind of one big thing. You take an input and it goes through a whole bunch of layers where you alternate attention and MLPs, basically attention and kind of processing on top of attention, and you sort of have what they deem a single frequency of information update, a single frequency of thinking, so to speak. The core conceptual leap in the paper is nesting in the sense of having multiple layers of reasoning frequency and learning frequency. So they say this takes inspiration from the brain, where we do seem to have these layers of memory, right: working memory, short term, long term. We also seem to have different rates of update in different areas of the brain. So through various technical details, which we'll get into, the gist of it is being able to have nesting of different amounts of memory and rates of update and other things like that within a single neural net. It builds, by the way, on previous research of theirs on Titans. There's been a bit of a continuation of research on this front from DeepMind, and this is the latest in that line of work.
Jeremy
Yeah, it's actually quite interesting. In a way, it feels like just sort of putting words to an intuition that I think a lot of people have had for many years. You know, this includes people who worked for a long time on RNNs, recurrent neural networks, or state space models. There have been a lot of attempts to actually instantiate the theory that they're putting together here. But they're actually trying to codify it and put a word to it. So the idea here, as you say, is that at inference time, all the model's weights are frozen. They do not update at all, they do not learn anything. So they've done their learning during training and they're frozen in time. Every time you put a new token through the system, though, the attention values for that token get recalculated from scratch, right? So that means that essentially, while the weights are updating with basically no frequency, they're never updating, the attention values are being recomputed every single time with every single token. So they're updated with almost an infinite frequency. At least that's the way the paper is going to frame it, which I think is debatable. But anyway, so you've got essentially this world of extremes inside a transformer, where the core architecture is frozen in time but the attention mechanism is just frantically updating all the time. There's no middle ground where we're sort of slowly absorbing a bunch of context and information as we go, and also kind of slowly updating it over time to respond to things that are learned. It's sort of you're in or you're out: you're either fully all about this token or you're frozen in time. It's almost like the weights have an infinite momentum, right? They're static. And the attention values have zero momentum. They're flying all over the place.
So extrapolating a bit, this seems to imply that you might do better architecturally by defining some additional component of the network that updates at some kind of intermediate frequency. And that is exactly the kind of de facto memory that an RNN might use, where you're sort of deliberately updating it only every N tokens, let's say, to create this medium-term memory in the system. And that's what they're going to do in the paper. So they define this thing called a continuum memory system, CMS. And this is basically just stacking multiple MLPs, multiple neural networks, on top of each other, where each MLP updates at a different frequency. So each one updates every, you know, N tokens, with a different N. And this gives the model some dynamic range, is one way to think of it, in terms of its memory, and the ability to actually learn on the fly. Long, short, medium-term memory, if you want to think of it that way. So quite interesting. They've got a good breakdown of how to think about what qualifies as what level of memory. But anyway, there's a whole bunch of stuff we could get into with analogies to this, but I think we probably got to move on.
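To make the "different update frequencies" idea concrete, here is a toy sketch in Python. It is not the paper's continuum memory system: each level here is a single scalar rather than an MLP, and the names `MemoryLevel`, `run_stream`, the periods (1, 4, 16) and the learning rate are all assumptions chosen for illustration. The only thing it tries to show is several memory levels reading the same token stream while committing updates at different rates.

```python
class MemoryLevel:
    """Toy memory level: one scalar state, updated once every `period` tokens."""

    def __init__(self, period, lr=0.5):
        self.period = period      # how many tokens to wait between updates
        self.lr = lr              # step size of each update (assumed value)
        self.state = 0.0          # stands in for an MLP's weights
        self.buffer = []          # tokens seen since the last update
        self.num_updates = 0

    def step(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.period:
            chunk_mean = sum(self.buffer) / self.period
            # One gradient step on 0.5 * (state - chunk_mean) ** 2.
            self.state += self.lr * (chunk_mean - self.state)
            self.buffer.clear()
            self.num_updates += 1


def run_stream(tokens, periods=(1, 4, 16)):
    """Feed every token to every level; each level decides when to update."""
    levels = [MemoryLevel(p) for p in periods]
    for x in tokens:
        for level in levels:
            level.step(x)
    return levels


stream = [float(i % 5) for i in range(64)]
levels = run_stream(stream)
for level in levels:
    print(level.period, level.num_updates, level.state)
```

The fast level (period 1) chases every token, while the slow level (period 16) only moves a handful of times over the whole stream, which is roughly the short-term versus long-term split described above.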
Andrei Karlenkov
Yeah, at a high level, this also ties in neatly to the line of research on trying to find a sort of hybrid between recurrent models and transformers. So we've talked about Mamba, we've talked about the resurgence of recurrent models in general, where obviously recurrent models have memory in the model itself. They don't have learning per se, but they have memory from all their past inputs. One of the issues with recurrent models is that the memory degrades, because you don't update the weights; you don't store anything except in this, like, little state that you keep updating. So recurrence is also an aspect of this model, but the big deal is really the fact that you update weights. And there are some very interesting technical details of how they reformulate gradient descent as just a general update rule that you can apply, rather than the standard view of it. We probably shouldn't dive in too deep, but it's a really interesting paper. And there's also a blog post by DeepMind you can take a look at. Next up we've got a kind of smaller paper, Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO. So we have had a lot of excitement about reinforcement learning with agents. And one of the complexities of reinforcement learning is, well, what if you have multiple agents? Now your environment isn't static, your actions aren't static, and in general everything gets much more complex. So this is formalizing that process where you have a vertical multi-agent architecture. It extends the GRPO credit assignment, the math behind it, to handle that kind of hierarchy of credit that you should give to multiple agents.
Jeremy
Yeah, and I mean, there's a bunch of detail in it. This is one of those open problems where, ideally, you want to have a kind of orchestrating agent, a main agent, that could be based on a different language model than the sub-agents that it calls. And this creates a problem, because then you can't backpropagate your gradient updates through the whole system in the same way. And so what they're doing here is figuring out a way, not a janky way, I'd say a fairly elegant way, to do this, and to make it so that the sub-agents are graded not just in a way that accounts for their own performance, but in a way that accounts for the overall performance of the main agent as well. So how well the main agent did overall is factored into the sub-agent's reward, alongside how well the sub-agent executed its own specific subtask. And that second one, by the way, is judged locally by an LLM evaluator, because the sub-agent is doing something that you can't necessarily get a metric for from the overall outcome. So you do need some kind of local LLM judge saying, hey, does this look like a good intermediate output? Yeah. So anyway, this is all part of people trying to figure out agent orchestration and training agents to work together explicitly, instead of just this sort of in-context jamming together of a bunch of language models where you try to wrangle them together using prompts exclusively. This is really: how do we train the whole system together as a unit? Which is, you know, a newer trend. As of the last 18 months or so, people have been really putting a lot of thought into this.
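The credit-assignment idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the M-GRPO paper's actual formula: the mixing weight `alpha` and the function names `subagent_rewards` and `group_relative_advantages` are hypothetical. The group-relative step (subtract the group mean, divide by the group standard deviation) is the standard GRPO normalization; the blend of a global outcome with a local judge score is one simple way to express "the main agent's result is factored into the sub-agent's reward".

```python
from statistics import mean, pstdev


def subagent_rewards(main_outcomes, local_scores, alpha=0.5):
    """Blend the main agent's outcome with a local judge score per rollout.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    return [alpha * g + (1 - alpha) * l
            for g, l in zip(main_outcomes, local_scores)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task: did the main agent succeed (1.0) or not (0.0),
# and what did a local LLM judge give the sub-agent's intermediate output?
main_outcomes = [1.0, 0.0, 1.0, 0.0]
local_scores = [0.8, 0.6, 0.2, 0.4]

mixed = subagent_rewards(main_outcomes, local_scores)
advantages = group_relative_advantages(mixed)
print(mixed)
print(advantages)
```

A sub-agent rollout that did well locally but sat inside a failed overall run ends up with a lower reward than one that did equally well inside a successful run, which is the hierarchy of credit described above.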
Andrei Karlenkov
And for the last bit of research, we've got State of AI: An Empirical 100 Trillion Token Study with OpenRouter. So OpenRouter is basically a gateway you can use to access different AI models. They let people make calls to LLMs of different kinds through them. And this report is looking at, as it says, a hundred trillion tokens. So essentially a whole bunch of usage of a whole bunch of different LLMs. They focus on the past year. They took all this data, and they didn't look at the actual text of the prompts, because that would be crazy, right? They looked at the metadata of all the prompts, all the calls to different models, and did classification of a small subset of the prompts to see what kinds of things, broadly, people are doing. And so there's, as you might expect, a whole bunch of interesting things here. Mostly people are using closed models. There's a somewhat stable split, but open source is gaining, especially over the course of 2025. Token usage is going up and up and up at a very rapid pace. It's gone up like crazy since 2024. All these sorts of things. I don't know what stood out to you, Jeremy.
Jeremy
Well, yeah, one piece, to your point, is the Chinese open source models. You look at your DeepSeeks, your Qwens: they went from about 1% of usage to 30% in one year. Right. So that's a really, really big deal. That's market-moving stuff. Medium-sized models tend to be preferred, so, you know, 15 to 70 billion parameters. We talked about that earlier in the context of Mistral's launch. That's kind of the new sweet spot, right? You're balancing your efficiency with your capabilities. And so that's where we're seeing a lot of the consumer-facing stuff go. Role playing is a really big deal. I was not aware of this, but I guess it makes sense. Apparently over 50% of open source model usage is creative role play and storytelling. So not coding, which is what I would have guessed, but quite interesting. There has been an explosion overall, though, in programming queries, in coding queries. They went from 11% to over 50% of recent token volume. And then they're also noticing average prompt lengths have quadrupled, from 1.5k to 6k tokens. And that's almost all driven by coding-related tasks. You know, you tend to be dumping a whole bunch of code in context to do that. No surprise. And that's where Anthropic dominates, with 60% plus market share. So there's obviously intensifying competition there. But the story of the last two years has been Anthropic just mercilessly climbing that ladder. It's super, super impressive. There's a whole bunch of stuff around agentic inference, which is becoming more and more important. You look at these reasoning models, they're now taking over 50% of all tokens. So we're moving from this world where people are interacting with chatbots by asking questions and getting answers, towards multi-step rollouts with tool calling, workflows, and all kinds of stuff like that. So maybe the last thing I'll mention: price does not matter for demand.
What they've shown is there's very weak price elasticity in the market. So if you cut the price per token by 10%, this will yield only a 0.5 to 0.7% increase in usage. And so that's pretty interesting, right? This race to the bottom on price, that's not, at least for now, where things are going. It seems like it's a race on quality, which is a really interesting bit of information for the frontier labs, because obviously that's where their sweet spot is. And that does explain why the OpenAIs and Anthropics of the world are seeing margin and everybody else who's fast-following is really struggling here. So yeah, it is quite an interesting report. You can find something for everybody here. I would say dump it into Claude and just ask it the questions that matter to you, because it is so wide-ranging. I wouldn't recommend going through the entire thing. I did that for way too long.
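The arithmetic behind "very weak price elasticity" can be worked through with the figures quoted above (a 10% price cut, a 0.5 to 0.7% usage increase). The function names below are made up for this sketch, and the revenue calculation assumes revenue is simply price times usage, which is a simplification.

```python
def elasticity(pct_change_usage, pct_change_price):
    """Price elasticity of demand: relative usage change per relative price change."""
    return pct_change_usage / pct_change_price


def revenue_change(pct_change_usage, pct_change_price):
    """Relative revenue change, assuming revenue = price * usage."""
    return (1 + pct_change_price) * (1 + pct_change_usage) - 1


price_cut = -0.10                       # a 10% cut in price per token
for usage_gain in (0.005, 0.007):       # the 0.5% and 0.7% usage increases
    e = elasticity(usage_gain, price_cut)
    r = revenue_change(usage_gain, price_cut)
    print(f"usage +{usage_gain:.1%}: elasticity {e:.2f}, revenue {r:+.2%}")
```

With elasticity this close to zero (about -0.05 to -0.07), a 10% price cut barely moves usage and gives up roughly 9.4 to 9.6% of revenue under this simple model, which fits the point that competition is happening on quality rather than price.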
Andrei Karlenkov
There's a lot there. I guess, as a way to frame it, as a way to think about it: OpenRouter is an API gateway, meaning that basically this is usage by other products, other tools, outside of Claude and ChatGPT and Gemini. This is what's happening when you're talking to some other chatbot or to an endpoint. So you can take away various things here. Like, for instance, the rise in programming indicates that all these startups doing vibe coding, Lovable, Replit, there's like 20 of them, and us at Astrocade also are now doing basically vibe coding. This shows that the market and the set of tools being developed is there. The fact that role play is so big actually is not surprising to me, because what we know from the broader usage patterns is there are things like Character.AI and a million others. Like, literally there are thousands of these talk-to-AI-characters or role-play-with-an-AI-girlfriend apps, et cetera. So it's interesting as a portrait of, outside your frontier labs and outside of core ChatGPT, core Claude, et cetera, what are people doing, to your point.
Jeremy
I guess what my brain was doing was I guess distilling kind of fundraise numbers for these. You know, if you compare the fundraisers for Vibe coding apps, it's insane. The fundraising for character for like storytelling is very, very limited. But of course that's gotta be.
Andrei Karlenkov
It's very fragmented. It's just such a large space. It is.
Jeremy
And also the value per query when you're doing storytelling is like really shitty, versus the value per query when you're doing vibe coding is like enterprise. Like, of course you're going to get way more dollars per token. So that probably is the thing that explains this the most. But yeah, it is interesting to see it in numbers.
Andrei Karlenkov
On to policy and safety. First, we've got Trump signs executive order launching Genesis Mission AI project. This is a federal initiative to enhance American AI research and development, likened to the Manhattan Project in its urgency and ambition, which I'm not sure is fair, but okay. The order outlines steps to expand computational resources, improve access to federal data sets, and focus on real-world applications, especially in scientific fields. Michael Kratsios, Assistant to the President for Science and Technology, will lead the initiative, and apparently there'll be an American Science and Security Platform created to centralize infrastructure and provide researchers with the necessary computing power and data sets. I guess this goes back to the notion of a national AI cloud, which was discussed quite a lot, and it will be interesting to see if this comes about.
Jeremy
Yeah, I mean, historically, the big challenge that the US, well, all governments really, have faced is that they've got a bunch of data that the government is just sitting on, like in databases, not really doing much, and it's hard to access and the government doesn't have the computing power to make use of it. And you can imagine how important this data would be too, right? A lot of it is like classified bio data or, you know, military or intelligence data. There's tons and tons of value to be extracted from it. What if you could in some sense smush that together? You've got all kinds of challenges when you try to do that, obviously, because information that's classified is siloed for a reason. You know, you don't want to combine your, like, secret A with secret B in ways that are too sloppy, where the combination could give you, like, weapon C. So there's a lot to be figured out here. But yeah, it's a really big initiative. The Manhattan Project framing is interesting. There is kind of a Department of War equity in this, so, you know, that'll be part of it. But the Department of Energy is the department that led the Manhattan Project, so that's kind of where some of this is coming from. So anyway, it's going to be drawing on about 40,000 DOE scientists, engineers and technical staff, along with private companies and stuff. So again, hard to see how this will come together. I think that's a story yet to be told. But the goals are lofty, and this is at least consistent with David Sacks's and, I think, Kratsios's AI Action Plan that came out earlier this year too.
Andrei Karlenkov
Right. And this is, by the way, building on top of the National AI Research Resource, which was established in 2020 to provide shared national research infrastructure and do things like partnerships and data sharing. So presumably this will aim to accelerate that, expand it, all that kind of thing. Next, we've got some research on safety and interpretability. The headline from MIT Tech Review is OpenAI has trained its LLMs to confess to bad behavior. So they released essentially a study, a paper, where they tried the basic thing of asking the AI to, like, tell us when you're doing these bad things. GPT-5 Thinking was given tasks that encouraged it to lie or cheat, and it confessed to such behavior in 11 out of 12 test sets. So it's things like being able to cheat on problem solving, writing code to solve a math problem, simple math questions, things like that. And this kind of follows up on other recent research, from Anthropic, I think, where, you know, some basic kind of tricks might be more powerful than you might expect for making model deployment less dangerous.
Jeremy
Yeah, this is one of those things where, I feel like with alignment research, a lot of it does look like this, where people come up with these dead simple ideas. You're like, why hasn't anyone tried this before? And it's like, well, it's only obvious now. So yeah, what they do is, for about 25% of samples, the model will be asked to produce a confession after its answer. And then they'll get an LLM judge to reward confessions based only on honesty. Right. So this is an LLM-as-a-judge framework. That reward is going to apply only to the confession tokens. So there's no signal backpropagated to the tokens beforehand, the answer and reasoning tokens. So you're really preventing the model from doing any kind of optimization around the main text that it created. It's forced to optimize only for reward on its confession, which is important to prevent sort of hacking at the level of the main generated text. And they actually induced hacking. So they deliberately use a weak and hackable LLM judge for both the main answer rewards and the confession rewards, to just kind of see what tends to happen here. And no surprise, the model learns to hack the weak judge. So you'll see it output stuff like, say it's asked to generate a sentence with 12 words or something. It'll generate a sentence with like 18 words, but then it'll put the number 12 in brackets at the end of it. And that'll often fool the weak judge. And so then it'll be asked to kind of confess, Game of Thrones style. They do show a really impressive improvement in confession accuracy, almost 100% by the end of training, despite using the same weak judge. And their theory of the case here is that the equilibrium for confessions is actually honesty.
The path of least resistance is to actually give an honest answer, and they hope that that will continue to be true as the models scale, which, you know, might be a kind of...
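The two mechanical pieces described here, asking for a confession on roughly a quarter of samples and applying the honesty reward only to the confession tokens, can be sketched in Python as follows. This is a minimal illustration under assumptions, not OpenAI's training code: the function names, the per-token reward layout, and the way the honesty score is passed in are all hypothetical, and the answer tokens' own reward from the main judge is left out entirely.

```python
import random


def should_request_confession(rng, p=0.25):
    """Decide whether this sample gets a confession request (about 25% of the time)."""
    return rng.random() < p


def confession_token_rewards(n_answer_tokens, n_confession_tokens, honesty_score):
    """Per-token rewards for the confession objective.

    Answer and reasoning tokens get zero, so the honesty reward cannot
    shape the main text; only the confession tokens carry the judge's score.
    """
    return [0.0] * n_answer_tokens + [honesty_score] * n_confession_tokens


rng = random.Random(0)
requested = sum(should_request_confession(rng) for _ in range(10_000))
print(requested / 10_000)

# A 3-token answer followed by a 2-token confession the judge rated fully honest.
print(confession_token_rewards(3, 2, honesty_score=1.0))
```

Zeroing the reward on the earlier tokens is what keeps the confession objective from leaking into how the answer itself is written.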
Andrei Karlenkov
There's something there, you know, I think the interesting thing for me is the specific bit about rewarding honesty over helpfulness. This kind of thing actually happens in practice when you're coding. Like sometimes the coding agents will just like break a test, like comment out the condition and the test passes. And it's happy because it is trying to get the job done. It's not trying to get it done necessarily the proper way. So there is something there in terms of making the model not worry too much about doing what you asked, but also worry about other things.
Jeremy
Yeah, absolutely. And so I think a big question with this will be, if this is meant to be a superalignment strategy, if we're going to be using this for genuinely superintelligent systems, if that's part of the idea here, we need to see how it scales with more compute, more optimization pressure thrown at it. It does not prevent misbehavior, by the way. This is just an after-the-fact, like, oh, shit. It's a monitoring solution, which, you know, you could fold into a preventative loop, but intrinsically it's more of a monitoring thing. And it does assume at its core that honesty is the path of least resistance in this framework. But if you have a sufficiently capable model, it might find some clever ways to kind of hack the confession judge. As I said, the optimization pressure just could be dramatically higher at scale. They do acknowledge this. So they say additional training at scale will be needed to demonstrate that this assumption holds under high optimization pressure. So presumably they're actually going to experiment on that.
Andrei Karlenkov
Yeah, I think it's interesting. We know the Waluigi effect, where models turn evil, where the models kind of have an evil character to them. So anyway, alignment is making progress gradually. All right, just a couple of stories left. US senators seek to block Nvidia sales of advanced chips to China. This would be the Secure and Feasible Exports Act, the SAFE Act. And it would order the Commerce Department to halt export licenses for sales of chips to adversaries, including China and Russia, for at least 30 months. So I don't know the details of, like, whether this is all chips or just certain chips. Jeremy, maybe you can go deeper.
Jeremy
Yeah. So they're saying it's for advanced chips to China. So they are looking at a compute threshold. I haven't actually looked to see what that threshold is or which chips would be in or out. But, you know, historically a lot of the made-for-China chips like the H800 are actually weirdly performant. And when you look at them, it's not obvious that they aren't strictly better than chips like, say, the H100, which was supposedly the souped-up version, if you connect them together in the way that Chinese labs actually tend to. So that is an important open question. You know, Jensen was in D.C. on Wednesday. He met with Trump and Republican senators on the Banking Committee, and he gave this sort of standard Nvidia position that, look, Beijing's not going to accept degraded chips and US companies should be able to export our most competitive chips to China. This all depends on what you think chips are. If you think that they are a uranium stockpile for a WMD, then this is insane. If you think that they are just another technology that's going to make everything great and overwhelmingly positive, and that they're just not weaponizable, then, yeah, this sounds pretty plausible. That's kind of the main axis of disagreement here. By the way, some interesting inside baseball on the Republican side of things. So Steve Bannon, who many will recall as being sort of a guy who worked with Trump in Trump 1, at least for part of it, is now hardline against exporting Nvidia chips. So, well, I'll give you the quote here. He says, quote, David Sacks has acted as the agent for the Chinese Communist Party and Jensen Huang is the arms merchant. That's some fiery shit from Steve Bannon, who, you know, 10 years ago was a kind of pro-Trump guy. This is shots fired. And by the way, the administration is kind of trying to figure this out.
It seems like they haven't yet. We've seen some, like, oh yeah, we'll let these chips go. Actually, let's not, let's enforce export controls. So we're still waiting to see what this will all shake out to. And Congress now seems to be moving to take action independently. So we'll see what happens.
Andrei Karlenkov
Right, exactly. This is amid Nvidia trying to talk its way into being able to sell chips. And notably this is in the Senate. So this is the legislative branch trying to pass a law to actually say what should be done, as opposed to leaving it to the executive to decide things about export restrictions. So there is a bit of an executive versus legislative situation that might be going on, probably behind closed doors. There's some more details.
Jeremy
Absolutely.
Andrei Karlenkov
And that is it for this episode. Great to have you back, Jeremy. We're back in the like talk for 30 stories.
Jeremy
Yeah, that's right.
Andrei Karlenkov
Stop. And we will try to be doing that weekly. So thank you for listening, thank you for viewing or commenting. And please do keep coming back.
Jeremy
Tune in.
Andrei Karlenkov
Tune in when the AI.
Begins begin.
Podcast Outro Singer
It's time to break it down Last Week in AI come and take a ride Hit the low down on tech and let it slide Last Week in AI come and take a ride Up A letter to the streets as we can reaching high New tech emerging Watching surgeon fly from the labs to the streets AI's reaching high algorithm shaping up the future sees Tune in, tune in get the latest with ease Last Week in AI come and take a ride Hit the low down on tech and let it slide Last Week in AI come and take a ride I believe as through the streets AI's reaching high.
From neural nets to robot the headlines pop data driven dreams they just don't stop Every breakthrough, every code unwritten on the edge of change with excitement we're smitten from machine learning marvels to coding kings Futures unfolding see what it brings.
Date: December 9, 2025
Hosts: Andrei Karlenkov & Jeremy
Theme: In-depth discussion of the latest AI news, including major model releases (DeepSeek 3.2), hardware shake-ups (TPUs, Amazon chips), business moves (OpenAI, Anthropic), and advances in memory and agent research.
This week's episode marks Jeremy’s return to the podcast after a three-month hiatus, with both hosts diving headfirst into a busy week in AI. The show centers on new model releases (DeepSeek 3.2, Flux 2), hardware battles beyond Nvidia, startups, industry shifts, and emerging research directions like memory and multi-agent learning.
"In some sense this is like actually attention... in a more human sense of like pick out information to process and then process only that information."
— Andrei, [09:55]
"Holy shit. Like, just look at the benchmarks."
— Jeremy, [14:21]
Note: Deepseek continues to drive the open-source ecosystem, with impressive transparency, and is considered a major competitive force against closed models.
"Our GPUs are melting."
— Bill Peebles/OpenAI via Jeremy, [28:53]
"On the artificial analysis text to video leaderboard, it is still number one as of now... within the 95% confidence interval."
— Jeremy, [35:00]
"$1 of Google DeepMind compute translates into 10 times the amount of compute that OpenAI does... That’s a really, really huge deal."
— Jeremy, [36:24]
"The core architecture is frozen in time; the attention mechanism is just frantically updating all the time. There’s no middle ground..."
— Jeremy, [69:37]
"The Manhattan Project framing is interesting."
— Jeremy, [82:55]
This episode provides both high-level news and rich technical/deep dives—recommended for anyone wanting to quickly catch up on the state-of-the-art in both AI development and industry moves as of December 2025.