B
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's been going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news, and also chat about the last year of AI, because there's actually not been too much exciting news over the holidays. As always, I'm one of your co-hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade.
A
And hi, everybody. Happy New Year. Happy 2026. Yep, I'm Jeremy, your other co-host. Excited for this chat. I think the funny thing about doing a retrospective on 2025, at least for us to be doing it, is that I have this dead zone where I essentially just disappeared from the face of the earth for a three-month period from, I want to say, like August or September to December, I think. Yeah, December. Okay, great. So I'm going to be like Memento or something, where you're going to fill me in on tons of stuff that I'm missing, I'm sure. So, yeah, that'll be fun. And by the way, we're talking about a paper that we both identified as promising to put in this week because there was so much discourse on it, and it turned out it was from October, and I had no idea this paper existed because October didn't happen for me.
B
So, I mean, even at the time it wasn't that big. It kind of got picked up more later. I didn't see it back in October. But yeah, it's interesting to look back on 2025, and we'll spend maybe 10 minutes on that before getting to the news for this week. To give a quick preview of the news: there's not a ton that happened over the last two weeks. We actually skipped last week on Christmas because it wasn't that interesting. This week we've got a couple of business stories and some interesting papers and policy updates, but it'll be a bit of a shorter one. So we'll begin with looking back at 2025. And I mean, it's hard to compress a year of news into just a few minutes, but it kind of boiled down to reasoning models and vibe coding, if you really want to compress it. I think it's three things for me: reasoning models, agentic AI and vibe coding, and image editing with Nano Banana and Nano Banana Pro. Those were the three kind of major updates. And then there's been some hype around world models, which are on the rise. Everyone's into world models now. Robotaxis continue to expand rapidly, which I personally have always been a fan of reporting on. But I think the first half of the year really set the course. We had R1 come out. Then in quick succession, in February and March, we had Claude releases and Gemini 2.5. A lot of stuff happened then in March. Claude Code, I think, came out around then and then really picked up steam over the coming months. And then we got Codex, Gemini CLI, all the other ones. And that was the first half of the year. The second half of the year really felt like more of that. Everyone was learning how to do RL and post-training, everyone was picking up on vibe coding and improving their terminal tools and so on. And except for Nano Banana Pro, I don't think anything really kind of transformative came out after sort of the beginning of...
A
Half of the year, which is convenient if you blacked out for three months.
B
Exactly.
A
Yeah. No, I very much agree with that on the model side. On the alignment side, we've seen a lot of interesting attempts. Well, one story is the interpretability agenda. I don't want to say it has come and gone, but we've seen a sort of decreased appetite from all the major labs for using the kind of mechanistic interpretability work for superalignment. That whole program had produced some pretty promising results, and it's not that people are fully abandoning it, but Google DeepMind came out with a position paper. Anthropic seems to still be doing some work in that direction, but is also kind of exploring some other strategies, and OpenAI as well. So that's been one interesting update. OpenAI has come up with a whole bunch of different approaches to alignment, where it kind of seems like they're trying to reinvent themselves a little bit. The latest iteration of this is this idea that, well, we're going to do reasoning without decoding tokens, so we'll try to prevent sycophancy in that way. And we talked about how that's an interesting strategy, but also one that has potentially some serious issues, as all these strategies currently do. So that's been one thing: seeing a bit of a realignment around the alignment issue, with different labs looking for different strategies. From a hardware standpoint, we've seen a lot of big developments. I mean, this is the year that we've gotten to the point where people are asking about the ROI for these massive, massive scaling experiments that are running. It's no longer clear that you can keep pushing the envelope in pure dollar terms. We're at the point where we're tapping out sovereign wealth funds, the Saudis, the Emiratis. All the chips are on the table, and we have to start seeing ROI for this kind of scale that flirts with, yeah, seeing a path to superintelligence with what's currently on the table, circa 2027, 2028.
So that's kind of part of what's going on here. Obviously it's become the topic that governments around the world are talking about. Everybody's positioning for a piece of the AI pie, both from a national security standpoint and a commercial standpoint. And it's increasingly unclear that you can even differentiate between those two. Another interesting thing on the hardware side has just been the design story. You know, back in 2025, as of two days ago, I guess, the idea was that you had one company, SK Hynix, that would just dominate the memory market, that crucial HBM, high bandwidth memory, segment. That's no longer the case. Samsung now is producing a lot of HBM. They're looking competitive. And most recently Micron, nonsensically, is now a relevant quantity.
B
Micron, really? I had no idea.
A
Yeah, no. So I was taken by surprise by this. There's a story this week that we'll talk about. And, full disclosure, I was scrounging for stories, because normally there's this flood of stuff that I collect off arXiv and X and all these platforms, and it was just thinner on the ground this week. So I was just talking to Gemini about it and I was like, hey, what's going on? What are some big stories? And it surfaced this Micron story. And I was like, holy shit, Micron. I even told Gemini, why are we talking about Micron? What could be less relevant? You know, SK Hynix dominates this market. Everybody knows. We'll get into it. There's some nuance. But anyway, so that's one important change: the HBM market. HBM is also becoming the blocker to our ability to build more chips faster. So that's kind of one shift that's occurred. We'll talk about the Groq acquisition by Nvidia. That came right at the tail end, and it actually turns out to be one of the big stories of the year too. So a lot of big capex moves, big acquisitions and big plays in the space. Maybe a last note on the sort of pseudo-corporate-law side: we're definitely seeing this Microsoft-Inflection play, where an acquisition that's not an acquisition just becomes the new normal, where companies like Nvidia don't want to trigger some antitrust review of their de facto acquisition. So they end up just giving hiring packages, in an acqui-hire, to like 80 or 90% of the team they want to acquire. They license the IP non-exclusively. So theoretically the company they're acquiring still exists and could still service other companies. Don't look at us too closely. But in reality they're gutted of all their talent. Everybody's moved over, and it's an acquisition in all but name. So that's one of the, I guess, high-level trends commercially, if you will, in the space too. It's been a hell of a year any way you slice it.
B
Yeah, and I think it's a good point. Aside from the trends with regards to AI progress and capabilities, those are some other notable things about 2025. From a business perspective, it was like the year of data centers and people throwing around absurd numbers about building data centers, and people started worrying about there being an AI bubble. It feels like that conversation came about for real around half a year ago, or a few months ago, because the big labs kind of sponged up all the money they could, especially OpenAI and to some extent Anthropic. Meta has just started throwing money around like crazy because they also want to be in the superintelligence race. So I think that aspect of it kind of continued from 2024 and the overall trend that started with ChatGPT. But it feels like it's coming to a bit of a head as we get into 2026.
A
Actually one other thing on the infrastructure side is the data center security stuff that's suddenly become, it's like people have awakened to the reality that we don't own our supply chain, half of our stuff gets built in China and we don't actually have a game plan for securing these hundred billion dollar collections of sites that we're building out right now. So both from a, you know, do we have the power for it? That's a separate story, the energy story that we've covered a lot. But also from a security standpoint, you know, if you genuinely believe that you're building super intelligence, like if that is a true belief that these labs have, then they have no choice. I mean the only way that they're going to succeed in whatever mission they have is by preventing adversaries and China in particular here from, you know, stealing their IP and then turning it into a super weapon against us. I mean, that's. If you believe that you're building super intelligence. That's just the reality. And the reality is that a lot of these labs are, are not yet sort of, they don't yet have this posture they need to defend against nation state attacks. But we've seen them make more and more noises in that direction. So we had, you know, OpenAI and Anthropic both coming out and saying they have their first high risk models on hand, or at least their first models that they'll consider high risk. That comes with a whole bunch of requirements around starting to, starting to flirt with nation state security. Not quite there yet, but starting to flirt with that. So I think that'll be another story if the scaling rush continues, if the results keep coming in, we'll probably see more and more emphasis on that.
B
Yeah, and you touched briefly on research and alignment, and it did make me remember, I think the big kind of research finding of the year for me has been persona vectors and the Waluigi effect and emergent misalignment as a whole, as a topic. It feels like, at least with respect to the alignment side, less so the interpretability side, there has been quite a bit of progress in understanding kind of toxic traits, or clusters of misbehaviors, with LLMs, and how they can arise and how you can control for them to some extent. And there's still quite a bit of research in that direction right now. Less on mechanistic interpretability, but quite a bit on the vector side in general, both vectors in terms of features and vectors in terms of control. And it still feels like a very powerful kind of emerging technique. While people are also, I think, like you said, starting to look more into going beyond the internals of a model and looking at its chain of thought outputs, just generally monitoring it as it's executing, as you would, I guess, a human doing stuff, because you can't actually read their brains.
A
And to your point, I mean, even if the mechanistic interpretability agenda died tomorrow for the purpose of superintelligence, its legacy certainly lives on in a lot of interesting ways. Like, it's not clear to me what to call it when you start injecting the activations from one model into another, for example. We'll talk about an example of that today. I mean, this is mechanistic, certainly, and it flirts with interpretability in many ways. There's a lot of interesting and often remarkably simple math that we've discovered over the last two years that's starting to be leveraged this year. Really, to your point, right, it's all this activation steering, activation vector steering stuff that Alex Turner famously started working on, God, three years ago. And initially we just saw a bunch of blog posts on LessWrong and the Alignment Forum. And now it's entering the mainstream, and that's a big shift. It's also dealing with models in a way that feels a lot more fundamental than just the input-output stuff from the days of yore. So in that sense it feels a lot more robust than some of the approaches that we've seen for alignment in the past. So, yeah, you're absolutely right. That does feel like a really important aspect of what 2025 brought.
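For readers who have not seen activation steering written down, the "remarkably simple math" can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the method from any specific paper mentioned here: the function names are hypothetical, the "activations" are hand-picked three-dimensional lists rather than real model internals, and the steering vector is taken to be a plain difference of mean activations between two contrasting prompt sets.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden, vector, alpha=1.0):
    """Add the steering vector, scaled by alpha, to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values): the two sets differ only in dim 0.
pos_acts = [[2.0, 0.5, 0.0], [4.0, 1.5, 0.0]]  # prompts showing the trait
neg_acts = [[0.0, 0.5, 0.0], [2.0, 1.5, 0.0]]  # prompts without the trait

v = steering_vector(pos_acts, neg_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
print(v, steered)
```

In a real setting the vectors would be read from, and added back into, a chosen layer of a model during the forward pass; the sign and size of `alpha` would then control how strongly the trait is pushed up or down.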
B
Right. And I think for me, if I had to pick a paper of the year, I was just kind of trying to think on the research side. But the one that really stuck with me, and looking back I think it's a very impressive piece of research, is from earlier on in the year, from Anthropic, where they had "Tracing the Thoughts of a Large Language Model" and really kind of went all in on trying to use the notion of these feature vectors and go across multiple layers, seeing the development of things. It feels like with mechanistic interpretability in the sense of trying to find specific paths in the model and logic circuits, people have sort of realized that's probably not viable in these large language models unless you build smaller toy models. But that has evolved into understanding that you can find these flows of information and flows of behavior that still, in a sense, are what that original ambition was: to actually understand what's going on. And we do have a lot more insight now. How does the model do 4 plus 94 or something? Right?
A
Yeah.
B
It used to be you wouldn't be able to tell at all. Now you can actually say, okay, to get the sum of 4 plus 94, it's not doing algebra. It has this series of operations that are going on with features for plus and x plus x, and in the end you end up looking. I think it's maybe underappreciated. I think the fact that you've gotten to the point of being able to say, for this input, here's what's happening inside of the model, to me, that was a pretty big deal.
A
Yeah. So it's funny, in my mind, that paper represented a kind of high-water mark for the mechanistic interpretability program. I haven't heard anything since then. And one of the challenges that paper ran into was that the number of transformations they had to do to the original model to get anything that was really interpretable was very significant. It felt like a fairly brittle approach, and it wasn't clear how well it would generalize. A fairly expensive approach too, computationally and interpretationally. So I wonder whether we're going to see more from Anthropic. Google came out, Neel Nanda and, I can't remember if Rohin Shah was on this too, but there was a kind of DeepMind position paper where they came out and basically said, look, we've sort of given up on this hope of a deep theoretical understanding of what's going on here. We're going to be more practical. And I'm butchering this a little bit, but that was the kind of tone of it. I don't know if Anthropic feels the same way since that mechanistic interpretability paper. And funnily enough, in all the conversations I've had with folks at Anthropic, I have not asked this question, and I really should. I don't know what's yet to come on the interpretability program from them, but I think it remains an open question. If this can generalize, if they can push it further, then this is a really interesting path. Last thought too, in terms of things that we didn't see. I remember at the beginning of the year we were talking a lot about distributed training, right? This idea, the Together AI type stuff, where you have a million different computers around the world, each participating. And in the case of a lot of the papers we were looking at at the beginning of the year, this was the inference rollout version of this, the RL version of distributed training. And we saw a couple of really promising early examples of that. I'm sure those training runs continue.
We have covered one or two stories here and there, but I have yet to see the 40 billion parameter model that competes with Opus, or that competes with even DeepSeek, that was trained in this kind of way. And that seems like a really important thing from a national security standpoint and a policy standpoint. If distributed training were to take off, it would become a lot harder to regulate and control. So the fact that that hasn't happened yet is interesting and a little surprising to me. But we'll see. I mean, we're two days into 2026, right?
B
And speaking of kind of last notes, we should be wrapping up, but we would be remiss not to mention the rise of Chinese open source models, right? This was the year where, first of all, open source has gotten really good, more than ever. Open source models are comparable to things you get from frontier labs, from OpenAI and Anthropic. They're usable for a lot of the same stuff, not quite as good, but close and getting closer and closer, and continually speeding up in terms of the time to parity. And not only did open source models get really good, smaller models have continued to be really good. I don't frequent them too much, but there are communities that are really into getting an LLM on a GPU or fine-tuning your own LLM, and that's continuing to make progress. And if I had to guess some things for 2026, beyond more of the same, right, I think on-device models are going to be a thing eventually. On-device models for phones, for laptops, it just makes a lot of sense, and it feels possible that we'll get closer to LLMs on device, especially on Google phones, for instance, for Gemini. And I think we already have chips for it, but I'm not sure if it's real time yet. Beyond that, 2026, who knows? World models, I guess. People will continue to work on video; we'll probably start seeing 10-minute, 30-minute videos that are viable or realistic. We'll probably hit the one-hour mark, at least on those. We'll see what else, but it's hard to say what will come. I think the big question mark for me for 2026 is: will we move beyond the transformer architecture fully? Will we move to hybrid models? Will we see diffusion models? Because on the research side it's continued to be the case that, empirically, the standard transformer architecture isn't optimal. Doing a hybrid of recurrent models, state space models, whatever, seems like the way to go. And Nvidia just released some impressive results at scale.
So for OpenAI for update, which have made massive investments into this current architecture, I wonder if we'll see that move because also it's going to be the year of continual learning is what it seems like based on what's going on in conversations these last few weeks, yeah.
A
If I had to pick one metric that matters to me, or that 2026 seems like it could be pivotal for, it is this METR-style eval. Not that it will be exactly the METR-style eval, but what the METR-style eval is trying to get at: automated AI research. Do we actually see AI taking over large chunks of the AI research process internal to frontier labs? Gains in algorithmic efficiency that are significant, that come from unleashing AI agents to do AI research in an automated way. That's really what's going to separate the boys from the men, so to speak, in terms of all the predictions that are being made. Is Ilya right, and we need more thinking work to come up with new concepts, and we're fundamentally not quite on the right track? Or, generally speaking, I'll say, is Anthropic right? Because certainly when you speak to their researchers and look at their public statements, they're very much in the camp of, look, 2026, this is for real. All the scaling trends are adding up to actual automated AI research, and this is the year it's going to happen. Jack Clark had a big post about that where he was saying that in 2026, things are going to start to get weirder and weirder. Working inside a frontier lab and living outside a frontier lab, we're going to start to see completely different stories about the world, and it's going to be dramatic. And so which one of these stories is going to turn out to be true? I'll say we will have to have the data for that by the end of 2026, because if we don't, then either justifying continued investment in massive data center buildouts is going to be really hard, or justifying alternative research agendas is going to be really hard. One of those two things is going to happen. And so I think that's kind of the top-line thing that a lot of people will be looking out for. So, you know, we'll see.
B
I'll go on record and say that it's probably going to be both. The consensus, and I think it's a good kind of understanding of the situation, is that now we have this term, jagged intelligence, also from Ilya in recent conversations, where these AI models are geniuses in science but can't do basic logic in other cases. I think with the current paradigm inside Anthropic, with things like Claude Code, we'll keep seeing even more genius-level performance, right? But the question is, is that jagged intelligence going to smooth out? Is it going to be easily deployable? Are you going to get continual learning and common sense in the same way humans have? I doubt that will be fully cracked in 2026, but I think we'll start seeing signs of that. Definitely.
A
And that's the thing. I think it's. Can we make the intelligence unjagged in the right ways in the right areas to achieve automated AI research? That seems to be because you're right. There's this notion of like, okay, so obviously pure scaling is not the answer, and I think everybody agrees with that. But there is a question of how almost fatalistic we can be about the progress that will come from AI research, being AI researchers, being ingenious. And it just has been the case that we've seen this steady exponential increase in algorithmic efficiency over time as researchers just. It's like Moore's Law. You can almost reliably depend on humans to keep cranking out good ideas. And as that happens, like, that's another curve that you can plot. And people have done this. And so if you just assume that curve continues, you get to some interesting places. And so there's this just an interesting question of, like, what's the magic sauce? How much of that do you need? How jagged are things now? What new challenges will you run into? Because you just run into new challenges every time you take a model and you're like, let's deploy it. And like, we have our super intelligence. You're like, ah, shoot. It just like, bumped into this chair. And now you're like, okay, we figured out how to solve the chair problem. And the pessimists are wrong because they say, oh, look, it ran into a chair. It's screwed now. Can't deal with the chair. Give it 20 minutes, it'll. We'll figure out a way around the chair problem. But. But the optimists often turn out to be wrong because they forget that behind the chair there's going to be a sofa. And so this dance between these two parties, I think is one of the challenges, is you can end up with being surprised when all of a sudden there is no sofa anymore and you fall into this loop, this pattern. It's just inherently really difficult to predict. And it might be the most important prediction to be able to make. So that's a shame.
B
We'll see. We'll see. We did kick off 2025 with DeepSeek R1 in January. And actually, going back, if I had to pick a paper of the year, I think DeepSeek R1 deserves to be the paper of the year, because it basically set the course of the entire year to come and was obviously a huge deal at the time, and all sorts of things. So there you go. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," paper of the year, according to me. Well, let's get into the news, starting with tools and apps. Only one story in the section, and it's being generous to call this news, to be honest, but I wanted to pick something. The news here is that OpenAI bets big on audio. Apparently, inside OpenAI, they're consolidating engineering, product and research teams to enhance their audio models in preparation for audio-first personal devices expected to launch in about a year. So, you know, obviously there is already advanced voice mode. OpenAI has a pretty good conversational AI model, much better than Anthropic's, for instance, and I think similar to Gemini's, although I don't use these too much. But I think OpenAI, as we've seen, is in a bit of trouble with Google and Gemini. It's kind of a seemingly precarious situation. So it'd be interesting from a product perspective if they think that doubling down on personal devices, personal assistants with a sort of voice interaction paradigm, will be key to retaining their users and not losing them.
A
Yeah, it kind of seems like there's that emerging consensus that audio is going to be the key modality, or at least a key modality, for this stuff. I didn't realize this, by the way, but Jony Ive, who, you know, famously was the designer of the iPhone, and whose company io, like lowercase io, got acquired by OpenAI for like 6 or 7 billion, his view apparently is that he wants to reduce device addiction. He feels that he has committed some sins in the past through the iPhone, and now he's seeing the shift to audio-first as a chance to right the wrongs of past consumer gadgets. So depending on how much you believe that, that apparently is his interesting sort of motivation. A don't-call-it-a-comeback type thing.
B
So there you go. And on to applications and business. The big news story of the week, and to cap off the year, I think this happened just before New Year's: Nvidia has agreed to buy, or in some sense buy, AI chip startup Groq for about $20 billion. This is a big deal. So Groq creates chips for AI inference, and they are a cloud provider for doing inference with primarily open source models, and they have very, very high inference speed. So Groq as a whole has been kind of an impressive startup for a little while. And what it sounds like is that Groq has announced a non-exclusive licensing agreement with Nvidia for its inference technologies, and key Groq executives, including CEO Jonathan Ross, are going to join Nvidia to advance their technology. So Groq isn't being acquired by Nvidia. This is, as you said, another case of this kind of weird acqui-hire: not acquire, just hire, but also, I guess, license the technology. And in this case it does seem like licensing the technology, and presumably the key patents, is another big part of the deal, like in these previous kind of weird quasi-acquisitions. But yeah, it's a big deal for Nvidia to buy maybe the most promising chip startup that I'm aware of in the space.
A
Yeah, and the IP licensing piece is really critical, right, because they want to make sure that the 80 to 90% of Groq's former talent that they've just taken in house can keep working with the exact same IP stack. So basically, hey, you show up to work on Monday, you're doing the same thing, you're just under a new umbrella. You know, Groq's co-founder was actually one of the designers of the TPU back in the day. So it's a super, super seasoned team. They'll just be working under Nvidia. Think of this as an acquisition, but our lawyers tell us we can't say it's an acquisition, it's not an acquisition, because if you do, then it triggers a whole bunch of requirements to notify the FTC and the DOJ under this thing called the Hart-Scott-Rodino Act, the HSR Act, which is the thing that all these companies are trying to get around. Basically this is the so-called Microsoft-Inflection playbook that we talked about earlier. By labeling the IP license as non-exclusive, Nvidia can say in court that they haven't monopolized the technology. Theoretically Groq, which is now a shell of its former self, could still license that tech to AMD or to Intel, no problem. Right, but the reality is that without the original engineers to actually explain and implement the whole thing, you really can't use the IP for much. And Nvidia has effectively locked the evolution of the tech inside their own walls with this. So this is a really big win. One of the key things to keep in mind here is that we talked a little bit about the HBM market, the high bandwidth memory market, when it comes to GPUs, right? Traditionally Nvidia GPUs have a logic die that basically does all the math crazy, crazy fast. And then there are these stacks of memory next to the logic die. The stacks of memory are HBM, high bandwidth memory. And you've got to pull data super fast from those stacks into the logic die and back again.
And that's the sort of dance, the drumbeat of data processing on a GPU. The problem, though, is that it takes time for data to travel from the memory, from those stacks, to that logic die and back. And that creates this memory wall, where the logic die, the processor, spends 70% of its time just kind of waiting. Right. So the magic behind the Groq LPUs is that they use this thing called SRAM. We talked about it in our hardware episode, but it's static random access memory that is built directly into the silicon, directly into the logic die, so the data doesn't have to travel anywhere, it's already there. And that means that you have these crazy, crazy high internal memory bandwidths. Basically, the communication between the logic and the memory is so fast that it's like 10 to 15x more internal memory bandwidth in the Groq chips than you see for the H100 or H200. And the reason Nvidia is doing this is that with the upcoming Vera Rubin architecture, which is going to come basically after the Blackwell chips that are out now, they're moving towards a hybrid of these ideas. They're going to have some of what they're calling far memory, which is these HBM stacks that are sitting outside the main processor. But they're also going to integrate these SRAM-heavy compute tiles, as they're calling them, basically little tiles that include a little bit of memory and a little bit of logic all bundled together. And that's what's going to go into these Rubin chips. And so you start to get just a lot more of this sort of integration between logic and memory, which could be helpful as well with architectures like Mamba, for reasons that I think we've covered in the past. But ultimately this is a really, really interesting play. You know, Groq has historically had a big problem where their memory capacity is quite low.
They've got like 200 megabytes of memory or so, whereas an H200 will have like a thousand times that, roughly. And so you tend to need a lot of Groq LPUs versus far fewer Nvidia chips, which creates some challenges for total cost of ownership, depending on how you're set up. The advantage here is they're still going to have those big high-capacity HBM stacks on their Nvidia chips. So they're hoping to kind of get the best of all possible worlds. Anyway, this is going to be a really interesting merger. And one last thing, by the way. So Groq chips are for inference and inference only. The racks for Groq chips are air cooled, which is insane. In 2025, or in 2026, to have racks that are air cooled doing relevant frontier workloads, even if it's inference, that is pretty wild. And so I'm really curious whether the thermal advantages that come with these Groq chips will be reflected. The Vera Rubin is still going to have to be liquid cooled, but I'm curious if we get some kind of benefit there too. So it's a really, really interesting acquisition, crucial, as you can see, for Nvidia's technical roadmap. And they're just trying to take Groq off the table, because Groq was the company, if anyone had a shot at competing with Nvidia in that crucial inference segment, which already makes up 40% of Nvidia's revenues. It's only going to grow more and more as we get agents with long rollouts and many, many API calls per output. This is a really big play, and they're trying to take this market at the knees, right.
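To make the capacity point concrete, here is a back-of-the-envelope sketch of how many chips it takes just to hold a model's weights. It uses only the rough figures quoted in the conversation (about 200 MB of on-chip memory, and roughly a thousand times that for an HBM-based part), which are not vendor specifications. The 70-billion-parameter model at one byte per parameter is a hypothetical chosen for illustration, and the calculation ignores activations, KV cache and any replication.

```python
def chips_needed(model_bytes: int, chip_memory_bytes: int) -> int:
    """Minimum number of chips whose combined memory can hold the weights."""
    return -(-model_bytes // chip_memory_bytes)  # integer ceiling division


MB = 10**6
GB = 10**9

# Rough figures quoted in the conversation, not vendor specifications.
sram_chip_memory = 200 * MB                  # "like 200 megabytes of memory or so"
hbm_chip_memory = 1000 * sram_chip_memory    # "a thousand times that, roughly"

# Hypothetical model: 70 billion parameters at 1 byte per parameter.
model_bytes = 70 * GB

sram_chips = chips_needed(model_bytes, sram_chip_memory)
hbm_chips = chips_needed(model_bytes, hbm_chip_memory)
print(sram_chips, hbm_chips)
```

Under these assumptions the weights alone span hundreds of SRAM-only chips but fit on a single HBM-based chip, which is the total-cost-of-ownership tension described above.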
B
And it's particularly notable, I think. One thing that has been true in 2025 is that the idea of scaling up model sizes has seemed to sort of die out a little bit, right? We tried GPT-4.5, Llama 4. The indication seems to be that as you get into these 2 trillion parameter models, people have not gotten good enough results to keep going in that direction. And so the research has gone into post-training in various ways and just generally optimizing more with the current model sizes. And so the amount of compute you set aside for training, both for research and for kind of practical business purposes, is likely to keep going down. And the amount you spend on inference kind of has to go up, right, as you start generating more revenue, et cetera. So that's another reason why Groq with its inference focus, and again, inference being actually using the model instead of training it, two very different things, is very meaningful. Next up, another acquisition story, kind of a weirder one. Meta has agreed to acquire Manus, the Singapore-based AI startup that develops AI agents, broadly speaking, so deep research, website builders, et cetera. Apparently it's going to be for more than $2 billion, and I find it kind of a weird pairing. So Manus does apparently have millions of users, and they have over $100 million in annual recurring revenue. They do provide, you know, these sorts of agents that you can also get from a few other places to do things like research reports. I think you can also do sort of vibe coding or general kind of agentic workloads. Meta is kind of a consumer company, broadly speaking. Right. This is less of a consumer thing. It could be that they are getting this talent in to kind of continue working on the frontier of AI with agents and sort of more sophisticated stuff like that, which they haven't really invested in so far. But yeah, big, big acquisition.
A
Yeah, to be cynical about it, I think this is the continuation of the Yann LeCun tax. So, you know, they so underinvested in the current paradigm that they had no differentiating capability on agents and really no ability to kind of summon that talent set. That seems to be what this is about ultimately: bringing about a hundred researchers from Manus onto the superintelligence team at Meta to join Alex Wang. Basically, this is clearly a talent set that they badly need. But another piece is that this is the first time I'm aware of that Meta has had exposure to, or has brought in house, a subscription-based model for payment for AI tools. Right. They're a consumer product, they make money off eyeballs, they don't charge subscriptions per se, so they do advertising. It's really what they know deeply well. And so that's part of it, you know, if you're looking to think about, okay, well, how do we learn about the economics of subscription models and get a bit of a head start there. It may be becoming clear that you need some kind of model like that, that the advertising model does not necessarily work as well with agents, which seems like it could be part of the case. Right. We've seen OpenAI struggle with ads. The rollout just hasn't come yet. There may be a deeper reason for that. So now Meta potentially needs some folks who have experience with this. How do you structure a business around charging subscriptions? How do you think about pricing? And also, let's bring on the talent for the superintelligence team. So I think this is all part of Alex Wang, you know, putting his stamp on this and Zuck backing it up. And presumably it means that Meta is going to be working on agents. We kind of already knew that, that's what superintelligence is. But yeah, it's an interesting acquisition. Two to three billion dollars is, you know, two to three Instagrams back in the day. So this is not a nothing burger.
B
Right. And as you say, since this is an existing product, people go and kind of talk to it in the same way, or a similar way, to Gemini and ChatGPT and so on. Kind of my mental model has been that Meta will not be trying to compete with ChatGPT or Gemini. They'll maybe have this WhatsApp bot or Instagram bot, not necessarily for your daily LLM or chatbot experience. But if they are thinking of actually going head to head with ChatGPT or Gemini, that could be very interesting to see. Next, one last story about an acquisition. This time it's Cursor acquiring Graphite. So Cursor is the developer of the Cursor integrated development environment for AI-assisted coding. They kind of were one of the big early winners in the space. They have acquired Graphite, which is a startup specializing in AI code review and debugging, for well over the $290 million of Graphite's latest valuation. Pretty logical combination here, right? Two startups focusing on AI for coding. And Cursor, you know, is also in an interesting spot now with all these terminal-based coding agents, Claude Code, et cetera, and the question of the space or the need for the Cursor model of autocomplete and so on. Like, Cursor is now competing with Claude Code and other agents more directly. So I can see them wanting to kind of continue being aggressive with how they develop their tech.
A
And up next we have this promised story about Micron. So: Micron hikes CapEx to 20 billion with 2026 HBM supply fully booked, HBM4 ramps 2Q26. That's right, yeah. So for years, right, and we alluded to this earlier, but for years the memory market was what's sometimes called a duopoly plus one. So basically duopoly, meaning there are two companies that matter. Those are SK Hynix and Samsung. And then there was the plus one, which was like the ugly stepchild, Micron, often trailing in the third spot. Through 2024 and 2025, there was a massive technical leapfrog that just kind of changed that hierarchy. Part of this was that there was a yields crisis at Samsung. Samsung makes the most memory in the world by volume, but they really struggled and fumbled with yields. So basically the percentage of the chips they make that actually end up working, which as you can imagine is crucial for the economics of what they do. So for the sort of highest-end HBM3E memory stacks, this is the high bandwidth memory that was being used at the time, 2024 and 2025, they really stumbled there. And because they couldn't consistently meet Nvidia's quality standards for the Blackwell line, Nvidia had to diversify. In came Micron. By late 2025, so pretty recently, their market share in HBM went from 4%, which was like the reason that we were ignoring them, to apparently over 20%, and they overtook Samsung and took second place in that kind of key AI HBM segment. There's a whole advantage to Micron's HBM3E stack, which is what the B300 and AMD's MI350X chip both end up using. So Micron's HBM3E apparently uses 30% less power, which has just become a massive deal because obviously energy is at such a premium now and heat generation is really the enemy here. So this is one of the key reasons that Micron has emerged.
There's also been this shift from them being a secondary supplier to a preferred vendor for some of the more advanced systems, that kind of rack scale system that we've talked about. So Micron was one of the lead partners there, as well as with kind of AMD systems. So ultimately this is a big deal. It refactors the market around HBM in a big way that I am only just waking up to, partly because I was out of pocket for the last three months. But this is a story, as you can see, that's been quietly and then all at once unfolding over the course of about 18 months. So really, really big deal. And I do expect that we'll be talking more about Micron as a result this year.
B
Incidentally, as we are recording this, I just looked this up, their stock jumped 8% today. So I think everyone else is catching on that memory broadly, and Micron in particular, is looking exciting. And next up, another one of our hardware stories. Chinese fabs are reportedly upgrading older ASML DUV lithography chipmaking machines. So I'll just let you take this one because it's again your wheelhouse.
A
Oh yeah, no, I mean, so this is related to a story we've been tracking for a while. Famously, you know, Chinese fabs are blocked from buying ASML's EUV, extreme ultraviolet, machines. These are like the most powerful, most advanced machines that generate the incredibly precise beams of high energy laser light that are needed to etch these circuit patterns onto chips. So a crucial, crucial component of the chipmaking process. And they are only made by a Dutch company called ASML. So the extreme ultraviolet lithography machines are the most exquisite. These are the ones that TSMC in Taiwan uses to make their super, super high resolution next generation chips at 3nm and 2nm. Now, the lower tier version of those machines, instead of being an extreme ultraviolet machine, is a deep ultraviolet machine, DUV, and China has been blocked from acquiring the more advanced DUV machines. So Chinese companies like SMIC, which is China's version of TSMC, and Huawei, and SMIC and Huawei work together a lot on this stuff, what they're doing is saying, okay, we can't replicate, at least for now, what ASML is doing with these very advanced DUV and EUV machines. What we can do, though, is focus on upgrading our older machines, the Twinscan NXT:1980i system. So this is the older DUV machine. And what they want to do is improve the kind of precision and alignment of these older machines so they can produce 7 nanometer and potentially 5 nanometer chips. We've talked about some of the techniques that are being used for that. Instead of buying new machines, what they're doing is just retrofitting older ones with really high precision components. New wafer stages, the sort of stages that hold the wafers as they're getting zapped, more advanced lenses, a whole bunch of specialized sensors, stuff like that. And they're doing this through secondary channels, basically finding shadowy supply chains and getting independent engineers.
So instead of using ASML's own engineers, which they are officially supposed to use to maintain their machines, they're turning to independent engineers and former ASML employees to do these kinds of installations. So a pretty traditional Chinese approach to this: finding third party suppliers and allowing fabs to get components that ASML is legally barred from selling them, that sort of thing. The challenge they're running into is that the kind of machines you get when you try these janky solutions and go off grid to do it tend to have lower yields. There are more wasted chips than what you get with the kind of exquisite latest EUV tools that are made by sort of the right people, if you will. And the whole process is super delicate. So there are a lot of reports of these Chinese techs breaking machines while attempting to reverse engineer them or do teardowns on them. So this is a strategy that's not without risk, and they're just sort of surfacing the cost of that, which is not something that we often hear about: all these, you know, $100 million plus machines that get broken because they've got to introduce new components, they're trying to retrofit them, make them more capable. But if you've ever tried to do the car mechanic thing on your own car, every once in a while something breaks that you can't fix. And that's kind of part of the story here.
B
I'm curious. As always, one of the stories with China is kind of the US trying to block them from getting Nvidia tech and them trying to become kind of not reliant on Nvidia tech. We've seen indications that companies like Deepseek are managing to eke out a lot from kind of the little of Nvidia they have. And companies within China are getting better at kind of not Nvidia level chips, but a couple of generations behind. Could you see like actual hardware dependency on Nvidia going away in China over time?
A
Absolutely. I mean, so one of the things with the Chinese market is that the constraints they face are just different from ours, right? So here, what are we constrained by? We're constrained by power. And so what we want is super, super energy efficient chips. And that often means very advanced fabrication processes. So you know, if we can use that 3 nanometer, that 2 nanometer node at TSMC with these exquisite EUV machines, those tend to be a lot more energy efficient. And we're like, great, okay, so we can use the tiny number of megawatts, gigawatts on our grid that we can spare for this and get more out of it. Whereas what you see with the Chinese design constraints, and this is what Huawei has gotten really, really good at, is look, we're constrained by our ability to produce chips at scale, but we have tons of power. Like China is better than any other country in the world at making new power. And so yeah, sure, let's have these very energy inefficient chips that throw off tons of heat, but let's get really, really good at networking them together. And so when you look at a Chinese data center, there are these massive, just like sort of from a surface area standpoint, much bigger, much more interconnect between them. And sure the chips are lossy, there's a lot more heat that's being dissipated, but we just have, we throw together more shitty chips and with excellent networking we get the same effect. So that's one way in which this game is playing out. The other is that as you said, they are just, they're finding ways around the export controls and actually getting those great chips anyway to the point where a lot of these latest models are straight up trained on just pure Nvidia chips. There's no question about it, it's a bit of both. But I do think there is a very credible competition to Nvidia from Huawei on design and then from SMIC on fabrication. The fact that they work together so closely makes that partnership really, really powerful and effective.
B
Moving on to projects and open source, just one story here. It happened pre-Christmas, but we did skip that week, so we'll touch on it. Z.ai has launched GLM-4.7, a new state of the art open source model for coding. Not a ton to say on this. It seems like the latest and greatest in open source tech from China. Everything I've seen from vibe reports and benchmarks and so on indicates GLM-4.7 is just a really strong, good model. Not as good yet as Claude, for instance, but very useful and usable even for vibe coding, it seems, at least for some kinds of tasks. Like, you could use this in their own variant of Claude Code instead of Claude Code or Codex or any of these other things. So open source models are continuing to get better and better, especially for coding, and also still being quite cheap, notably.
A
Yeah, and this is Zhipu AI, which we've covered before, obviously, with previous GLM releases too. That makes them interestingly competitive with DeepSeek on a number of metrics. The vibe check I've seen on this seems a little bit more negative than what the benchmarks imply, which is kind of interesting, but we'll see. It's been out for a week or so, so maybe there are more vibes to be collected. Definitely, there are now several Chinese labs that are doing very impressive things and, as you said, open source continues to not be, at least on benchmarks, that far behind. And the asterisk of "on benchmarks" starts to loom larger and larger over time, as taste becomes such an important ingredient of what these models can do that doesn't get captured by traditional benchmarks. But yeah, so we'll see where this goes, if it ends up seeing uptake.
B
Yeah, just looking at benchmarks, they're reporting kind of on par or better performance compared to Sonnet 4.5 and GPT 5.1 High in most cases. Some benchmarks they're behind on, and the vibe reports haven't indicated it's comparable or neck and neck with Sonnet, but it's quite good and quite useful if you have sort of less complex things. And now on to some more researchy things. First you've got evaluating AI's ability to perform scientific research tasks. So this is, I guess, more of a benchmark thing. It's about Frontier Science, a new benchmark designed to measure expert level scientific capabilities in AI models across a bunch of domains, physics, chemistry, et cetera. This is coming out just as GPQA has kind of been aced, I guess. GPQA is a science benchmark, and now we're getting to 92% on it, up from 39% a year or two ago. So now Frontier Science has over 700 questions, with Olympiad level scientific reasoning and also research tasks, and there is now kind of more space to improve. GPT 5.2 is scoring 25% on the research track and 77% on the Olympiad track. So yet another benchmark, because we just need new benchmarks that are harder.
A
The number of times we've had this conversation where it's like, okay, so we need a benchmark that is really, really hard and it's the hardest graduate level PhD research, fucking whatever. And it's got to be something that's not Googleable, so it's Google proof, and something that's not leaked. And so we keep going through the same ritual. This is, yes, another one. It is interesting, though. There are ways in which Frontier Science is different from your AIME or your GPQA or GPQA Diamond. And one of them is this whole kind of 10 point rubric that they set up so you can award partial credit for method and reasoning quality and scientific judgment, that sort of thing, so that you can tell where the model has failed. It's like, you know, it got the logic right, but it made a calculation error. That's the kind of thing you can do with this. These 10 point rubrics are written and designed by PhDs and they're customized to every question. So you have not just a question that had to be developed, but a 10 point granular scoring system for each one. And each of these things is often objectively assessable. So, you know, a score of 7 out of 10 might be required for one particular problem to call it a success, for example. Then you can double click and see, okay, which parts did it succeed on? So that's really cool. You know, like you said, we're seeing saturation on AIME and GPQA. You know, Grok 4 got above 90% on those and, you know, GPT 5.2, same story. So the Olympiad track of Frontier Science, in fairness, we're already at 77%. It's not like there's tons more juice to squeeze out of that lemon. But the research track, yeah, 25%. So there's some headroom there. Anyway, the fields are, you know, physics, chemistry, biology, all the usual stuff. Top performer, maybe no surprise because it is an OpenAI benchmark.
But GPT 5.2 gets the highest score, you know, 77% on the Olympiad as mentioned, and 25% on the research test. But Gemini 3 Pro did pretty, kind of pretty similarly. And they did find, again, because it's OpenAI a lot of focus on scaling test time tokens do significantly increase performance. So they looked at GPT5.2 and you know, score goes from 18% to 25% on the research track when you, when you scale up token use. So there you go. The one additional detail here is they designed Frontier Science using a selection against models process, they call it. So what this means is while they're creating this benchmark, they have internal OpenAI models that are taking cracks at solving them. And if the internal model solves a question correctly, then that question ends up getting either removed or significantly modified to make sure it's actually hard. So that's kind of an internal new additional step that they're adding to the benchmark development process, which explains why this came out so hard.
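The two mechanisms described here, per-question rubrics with partial credit and filtering questions against a reference model, can be sketched roughly as follows. This is only an illustration under assumptions: the rubric items, the point split, and the `model_solves` callback are all hypothetical, the 7-point pass threshold is the example figure from the discussion, and the filter simply drops solved questions whereas the benchmark authors reportedly either remove or rework them.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str
    points: int    # how much this item is worth (items sum to 10)
    earned: bool   # whether the grader judged the answer to satisfy it

def score(items):
    """Return (points earned, descriptions of the items that were missed)."""
    total = sum(item.points for item in items if item.earned)
    missed = [item.description for item in items if not item.earned]
    return total, missed

def passes(items, threshold=7):
    """A question counts as solved when the rubric score reaches the threshold."""
    return score(items)[0] >= threshold

def select_against_model(questions, model_solves):
    """Keep only the questions a reference model fails to solve."""
    return [q for q in questions if not model_solves(q)]

# Hypothetical graded answer: right logic, wrong arithmetic.
rubric = [
    RubricItem("identifies the right physical model", 3, True),
    RubricItem("sets up the governing equations", 3, True),
    RubricItem("carries out the calculation correctly", 2, False),
    RubricItem("interprets the result sensibly", 2, True),
]
total, missed = score(rubric)
print(total, missed, passes(rubric))
```

The `missed` list is what lets you "double click" on a result and see which part of the answer fell short.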
B
Right. And it's nice to see this kind of investment, because developing this kind of benchmark isn't easy. They literally had to work with a bunch of PhD students in these scientific disciplines and with a bunch of Olympiad winners in these competitions to develop fully new questions that don't exist anywhere on the Internet or in textbooks and so on. So this must have taken quite a bit of work. On to research and advancements. We've got a decent number here, so we'll try to go quick, starting with large causal models from large language models. The paper introduces Democritus, a new paradigm for building large causal models using large language models. So it has this whole pipeline that takes your kind of fuzzy, whatever, neural net messy nonsense of LLM reasoning and turns it into nice causal graphs with all the properties of a kind of more explicit representation. And Jeremy, I think you have more on this one.
A
Yeah, I mean, they basically invent this new transformer layer they call the geometric transformer. And what they do is they'll start by prompting a model to create a whole bunch of what they call causal triples. So this is like a set of three different concepts that are related to each other. So basically you get this model to produce a bunch of statements about the relationships between different entities and then try to sift through and spot all the cases where there's a triangular, like a three point, relationship. So an example of this would be, you know, think about tropical Pacific warming. Warming reduces monsoon rainfall, and then reduced rainfall decreases Indus valley river drainage, and then lower drainage impedes river trade. And so these things are all kind of connected. And you look to see if you can form a triangle where A causes B, B causes C, and C causes A or is related to A in some kind of way. So they first generate this kind of giant laundry list of disconnected facts, and then they use this geometric transformer layer to process those facts and look for these triangles, and the triangles then can fit together. So you can find, okay, A causes B and B causes C. But A and B are also related to some other thing, D. And so now C and D are kind of distantly related, and we can create a graph in that way. It's a fairly complex paper and I actually spent quite a bit of time on it just because it's intellectually quite interesting, but we won't dive into it beyond this. One of the challenges is creating these ontologies, these logical structures that capture relationships between very complex ideas. And language models do it and don't do it. They do it inconsistently, they do it in a fuzzy way. You get a hairball effect.
What this approach lets you do is kind of in a principled manner, construct a pretty robust logical structure that you can then lean on, grow and, and look at through different perspectives too. Like there's like an economic perspective you could take on it, like a military perspective, like all these things. Anyway, this is a really interesting paper. If, if the world model thing scratches your itch.
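As a rough illustration of the end product, here is how a list of causal triples can be stitched into a directed graph and queried for a chain of causes. This is a plain graph-building sketch using the monsoon example from the conversation, not the paper's geometric transformer; the node names and the `build_graph` and `causal_chain` helpers are made up for this example.

```python
from collections import defaultdict, deque

# (cause, relation, effect) statements, as an LLM might be prompted to emit.
triples = [
    ("Pacific warming", "reduces", "monsoon rainfall"),
    ("monsoon rainfall", "feeds", "river drainage"),
    ("river drainage", "enables", "river trade"),
]

def build_graph(triples):
    """Map each cause to the (relation, effect) edges leaving it."""
    graph = defaultdict(list)
    for cause, relation, effect in triples:
        graph[cause].append((relation, effect))
    return graph

def causal_chain(graph, start, goal):
    """Breadth-first search for the shortest chain of effects from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _relation, nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no causal path found

graph = build_graph(triples)
chain = causal_chain(graph, "Pacific warming", "river trade")
print(" -> ".join(chain))
```

Once facts live in an explicit structure like this, questions such as "what is downstream of X" become graph queries rather than something you hope the model answers consistently.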
B
Next up, universally converging representations of matter across scientific foundation models. So I didn't know this, but apparently there are 59 models that you can look at that deal with scientific representations. So 3D atomic coordinates, protein sequences, also natural language, various types of scientific matter entities, I guess you could say. And per the title of the paper, they are looking to see if all these different models wind up representing different types of materials the same way. So they have a bunch of representations here and largely say that, you know, the stronger the model, the more similar the representations wind up being in the end. And that is what it means by converging. Basically, as models are able to do these kinds of scientific reasoning tasks better, they wind up looking more similar over time. And that has some interesting implications for model design and for evaluation.
A
Yeah, that's right. There's this interesting, almost philosophical question: as models get better and better at doing all kinds of scientific reasoning, and their performance goes up and up and up, if we find that they are agreeing more and more on how the data they're chewing on should be represented, maybe we're learning something fairly fundamental about the physics of the real world. That's kind of what's being intimated here. And I think you could push back on it in various ways. But it is an interesting idea. Their big contribution here is going to be using something called centered kernel nearest neighbor alignment, CKNNA, which is really catchy. Anyway. So the way this works is, you'd imagine, how do I show that two different models are actually representing data in a similar way if those two models have completely different architectures and shapes and sizes and dimensionalities of representation? Like, one is representing its data internally in some, you know, residual vector of like 4,000 dimensions and another is doing it with something with 2,000 dimensions. And you're like, how do I even apples to apples this? So the way they do this is they're going to go through and pick a data point, so a new DNA sequence or something, which they'll feed into the model along with a whole bunch of other DNA sequences. So for a given data point like that, you're going to find the nearest neighbors in representation space. So in sort of activation space, what are the other samples that look closest to that sample for a given model? And then you're going to do the same thing for another model, and you're going to see if they agree about which samples ought to be the nearest neighbors to which other samples.
And the argument here is, well, look, if they agree on, hey, samples B, C and D are the most similar to A, if both models agree on that, then they must be thinking about latent space in some kind of consistent way. And that's in fact what they find. They find that as you scale these models, as the performance goes up, you see more and more of that agreement. What you don't see is kind of interesting. When you push these models out of distribution, right, so when you test models on structures that they weren't trained for, like some materials model that is looking at large organic molecules or something, your convergence between models starts to disappear. And instead what you start to find is that the embeddings, the representations you get for these sort of wacky out of distribution samples, tend to collapse onto representations that have more to do with the model's architecture than the actual sample itself. So at this point, the kind of weird quirks of the model start to dominate and you lose the alignment between models that have different architectures. So that was kind of an interesting result. Maybe not too surprising in that respect, but certainly interesting that we do see more of this agreement about which samples are close in representation space and which are far away as models get better.
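The neighbor-agreement idea can be sketched in a few lines. To be clear about the assumptions: this is the simpler mutual k-nearest-neighbor overlap, not the kernel-weighted CKNNA metric itself, and the tiny hand-written embeddings stand in for real model activations. The point it illustrates is that neighbor sets are comparable even when the two models use different embedding dimensions.

```python
import math

def nearest_neighbors(embeddings, k):
    """For each sample, the index set of its k nearest other samples (Euclidean)."""
    neighbors = []
    for i, x in enumerate(embeddings):
        dists = sorted(
            (math.dist(x, y), j) for j, y in enumerate(embeddings) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def mutual_knn_agreement(emb_a, emb_b, k):
    """Average fraction of k-nearest neighbors the two models agree on."""
    nn_a = nearest_neighbors(emb_a, k)
    nn_b = nearest_neighbors(emb_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)

# Four samples embedded by three hypothetical models.
model_a = [(0, 0), (0, 1), (5, 5), (5, 6)]              # 2-D, pairs {0,1} and {2,3}
model_b = [(0, 0, 0), (1, 0, 0), (9, 9, 9), (9, 9, 8)]  # 3-D, same pairing
model_c = [(0, 0, 0), (9, 9, 9), (1, 0, 0), (9, 9, 8)]  # 3-D, pairs {0,2} and {1,3}

agree_ab = mutual_knn_agreement(model_a, model_b, k=1)
agree_ac = mutual_knn_agreement(model_a, model_c, k=1)
print(agree_ab, agree_ac)
```

Models A and B group the samples the same way and score full agreement despite different dimensionality, while model C groups them differently and scores none.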
B
They also do some analysis that's pretty specific to this area. There's kind of a bigger question on inductive biases when you're modeling materials. Things like equivariance, various kinds of things that you know have to be true in the data, like if you rotate it, it's the same thing. And they have an interesting conclusion that regularization, rather than an actual full inductive bias, often might work out the best. Next, Meta-RL induces exploration in language agents. They introduce LaMer, a meta-RL framework for training large language models. So reinforcement learning is where you are learning via reward to do a certain task. Meta reinforcement learning is where you learn to learn quicker by using rewards on a given task. And I remember this was all the rage in like the late 2010s, meta learning and meta-RL in particular. So here they have some sort of slightly toyish problems, Sokoban, Minesweeper, WebShop, using a smallish Qwen3 model. And they show that you are in fact able to learn better on these tasks via a meta-RL framework that basically kind of optimizes your optimization for these kinds of environments.
A
Yeah, I thought this was actually kind of cool and elegantly simple. So they call their framework LaMer. It's an LLM agent with meta-RL. So there you go. And the way it's set up is, instead of maximizing rewards for a single attempt at solving a problem or an environment, they're going to optimize for rewards across multiple episodes. And so in their experiments they use three episodes. So they're basically going to let the agent play out three episodes. Instead of giving it a grade after the first episode and a grade after the second and third episodes, they're going to wait for all three episodes to be complete, which allows the agent to then use in-context policy adaptation. Basically, during episode two, the agent gets to look at the result of what it did during episode one. And so it'll generate some textual self-reflections to kind of think, oh, you know, I tried this thing in episode one, it didn't really work out. Let's try this other thing. And the final reward only comes in episode three. And so what you're actually rewarding is not just the ability to complete episode three in and of itself, but you're also rewarding the intelligent use of lessons learned from episodes one and two in episode three. And so this is why they're calling this a kind of meta-RL and meta learning. You do have this standard RL training where you're doing that loop after the third episode, but you also have in-context adaptation, where all you're doing is updating the agent's policy through context without updating weights or anything like that. So you really are training the model to think about thinking in that sense. So, yeah, I think a really interesting, very simple approach. There's all kinds of benefits to this that they measure, including these agents having a very high trajectory diversity, which basically just means they sample a lot more different trajectories by doing this.
Partly because now that the agent gets to think through episodes one and two on its way to building for three, you can see planning like, oh, well, in episode one, I should treat episode one as just like a learning episode. Let me make all the mistakes I can. And then once you get to episode two, it's like, all right, let's use that to try to try to achieve the goal. And then you get to episode three, you're like, all right, I tried to achieve the goal. I went okay in episode two using that plus all the mistakes I made in episode one, like, what can I do now? So you'll see a lot more variability because the goal of episode one and the goal of episode two is no longer the same necessarily quite as the goal of episode three. Episode three is about maxing the score, but episodes one and two are about whatever it takes to set up episode three to be as effective as possible. That's kind of like one dimension of what they're rewarding here. So anyway, great test time scaling results. They do start off with a really low pass at one rate, but as they crank up the number of episodes that they integrate over to get the reward, the pass at 3, you see a significant improvement on test time scaling for that. So quite interesting. There's a lot more training time required for this though, because they're kind of forced to generate episodes in sequence rather than in parallel because one builds on the next. And that's kind of the whole point. So there is a bit of an efficiency trade off, but overall really interesting and kind of a first meta superintelligence research paper that I've seen that is kind of an interesting foundational paradigm right.
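The rollout structure being described, several episodes in sequence, reflections carried in context, and only the last episode's outcome used as the reward, can be sketched with a toy number-guessing environment. Everything here is an assumption for illustration: the environment, the hand-written `policy`, and the `rollout` helper are made up, and no model training or weight update is shown, only the shape of one multi-episode rollout.

```python
def run_episode(target, guess):
    """Toy environment: one guess per episode, reward 1.0 only on an exact hit."""
    if guess == target:
        return 1.0, "correct"
    return 0.0, ("too low" if guess < target else "too high")

def policy(notes, low=0, high=7):
    """Toy 'agent': narrows its search range using notes from earlier episodes."""
    for guess, feedback in notes:
        if feedback == "too low":
            low = max(low, guess + 1)
        elif feedback == "too high":
            high = min(high, guess - 1)
    return (low + high) // 2

def rollout(target, num_episodes=3):
    """Play episodes in sequence; only the final episode's reward is returned."""
    notes = []     # the in-context scratch pad carried across episodes
    reward = 0.0
    for _ in range(num_episodes):
        guess = policy(notes)
        reward, feedback = run_episode(target, guess)
        notes.append((guess, feedback))  # stands in for a textual self-reflection
    return reward, notes

reward_3, notes_3 = rollout(target=6, num_episodes=3)
reward_1, notes_1 = rollout(target=6, num_episodes=1)
print(reward_3, notes_3)
print(reward_1, notes_1)
```

With one episode the toy agent misses, while with three it uses the first two as information-gathering attempts and succeeds on the third. The loop also shows why episodes must run sequentially: each one depends on the notes from the one before.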
B
And this relates to kind of the theme of the day, continual learning. Because continual learning is. That's right, learning post training time. Meta learning is related, let's say, because it's learning to learn more efficiently. And we actually do use this reflection mechanism. So it's not pure rl, it's kind of LLM flavored rl. Where to actually get this to work? Well, you need the model to look back on the episode, write down some notes, and then conditioned on the notes, you're able to effectively learn. If you just use kind of classic RL with trajectories, your outcomes are much worse.
A
It's like Mamba or RNNs, where instead of that memory vector, you have a scratch pad. Right. That's like kind of the. That's the idea. And you're exerting optimization pressure on the scratch pad.
B
That's right. Next up, we've got a couple of blog posts discussing METR and long-horizon agent workloads. First we have "Are the Costs of AI Agents Also Rising Exponentially?" We usually don't see this side of the coin being addressed. We know the METR benchmark specifically asks how long of a task you can do with some success rate, typically 50% or 80%. We've covered this benchmark a bunch, and we've seen that the models have gotten a lot better over 2025. This one tries to dig into the question of whether the cost is also rising at the same time. And it's kind of tricky to give a straight answer. What it looks like, if you just look at the plots, is that obviously the cost does go up as you go from like 1 second to 1 hour to 2 hours. But there is a bit of a plateauing effect where the costs don't rise.
A
Quite as quickly or the costs rise without a concomitant improvement in performance.
B
Right, right, exactly. You sort of plateau at a given amount of performance. So based on the current data, just looking over this blog post, I think it's hard to say too much about kind of a future trajectory of costs with performance. But as of now, you do see that there is some sort of trend of the longer task the more you pay, which makes sense. And for different model families, you see different cost trajectories.
A
Yeah, this is a really important question. Right. Because it could be the case that yes, our models are getting better, what they can do with infinite money is improving, but the cost of running them, say on a three hour task, is more than three times the cost of running them on a one hour task. And when I say a one hour task, I mean a task that takes a human one hour. So if that's the case, then what that means is that although, yes, you can technically get them to do the three hour task, they're getting less advantageous to use relative to human labor as the task gets longer. So that's a really critical distinction. Now one caveat to that is algorithmic efficiency improves over time. And so, you know, it might be less advantageous now. Give it a few years and maybe you cross a line. But still, this does suggest that with that raw METR plot of the number of hours' worth of human labor that can be automated with a 50% success rate, there might be more to the story. And just read this blog post if you're interested in that plot. It is a really important one. The bottom line is the data is somewhat noisy, but there is a pretty clear trend that things are getting more expensive as that time horizon gets pushed out, in a way that is, you know, not necessarily clearly advantageous relative to human labor. Again, at least for right now. One of the big lessons of this blog post is that there is more information, more economically relevant information, contained in these METR plots than METR has succeeded at highlighting and extracting and showing to the world. And so Toby Ord here, who is known in this ecosystem for doing a bunch of interesting posts, and we've talked about another post of his earlier, is basically graphically finding ways to pull out data that was not super obvious and is actually quite interesting in this context. So I would just say check it out. If you're looking for a TL;DR, it is that, yes. 
In fact, as these models get more powerful, it takes them more tokens. They're also just larger models, so each token is more expensive as well. And so basically the three hour task costs more per hour of model work than the one hour task. And so there's a question as to where you intersect with human cost per hour. And you're already kind of getting there with some of these models. Like you're in the right order of magnitude, talking about maybe $40 to $120 per hour in some cases.
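To make the cost argument above concrete, here is a minimal sketch in Python. Every dollar figure in it is made up for illustration and is not taken from the blog post, and the names `cost_per_human_hour`, `runs`, and `human_rate_usd` are hypothetical. It only shows the shape of the comparison: divide what the model run cost by how long the task would take a human, then compare that against a human hourly rate.

```python
def cost_per_human_hour(total_cost_usd: float, human_hours: float) -> float:
    """Model spend divided by how long the task would take a human."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    return total_cost_usd / human_hours


# Hypothetical numbers, for illustration only (not from the blog post).
runs = {1.0: 10.0, 3.0: 45.0}  # task length in human-hours -> total model cost in USD
human_rate_usd = 100.0         # assumed hourly cost of a human doing the same work

hourly = {hours: cost_per_human_hour(cost, hours) for hours, cost in runs.items()}
for hours, rate in hourly.items():
    verdict = "cheaper" if rate < human_rate_usd else "more expensive"
    print(f"{hours:g}h task: ${rate:.2f} per human-hour ({verdict} than the human)")
```

In this toy setup the three hour task costs more than three times the one hour task, so the per-hour figure rises with task length, which is the pattern being described. Whether and where it crosses the human rate depends entirely on the real numbers.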
B
Next, continuing on with METR, we've got the METR eval for Opus 4.5, which generated quite a bit of discussion and which I guess people were looking forward to. We keep mentioning it, but again, to make sure everyone knows, METR is trying to broadly measure how long of a task a model can reliably succeed at. And the tasks are meant to reflect actual software engineering tasks, so you can compare conceptually to a person doing this work. If a person can do this in four hours, can a model also do it reliably? And Claude Opus 4.5 has notched a very high result. If I'm reading this right, it looks like almost five hours at the 50% success rate, which is way above the trend so far, and then around 30 minutes at the 80% success rate. So if you look at the 50% number, it looks like a huge jump relative to the previous model. If you look at the 80% success rate, it looks roughly on trend, with the caveat that this is quite noisy. Overall, many people are saying, oh, this indicates that we are in for a big swing coming soon, this is a huge deal, et cetera.
A
One of the big questions now is that we're seeing, you know, Claude Opus 4.5 do so well at the 50% success rate, on tasks that take humans about five hours to complete, with, by the way, GPT-5.1-Codex-Max coming in second place at around three hours, a pretty big difference. But when you increase the required success rate to 80%, suddenly that completely flips, and there GPT-5.1-Codex-Max takes first place. 32 minutes is the task length at which that model gets an 80% success rate, whereas Claude Opus 4.5 is not even in second, but rather in third place at 27 minutes. And so what's kind of interesting, and continues perennially to be confusing to a lot of people, is why the 50% and 80% success rates are so different, and which one we should be paying attention to when we look to forecast AI capabilities. It's not super clear. When you care about agentic workloads, you really are going to care about hitting very high performance, because agents are in the business of chaining together a lot of individual tasks. And if your success rate is 50%, well, chain two of them together and now your success rate is 25%. So you want to pay more attention to the higher end, the higher-probability success rates. So this remains, again, a bit confusing. One of the big questions that people are also asking is, well, now that we know that we're hitting the five hour task lengths, should we not take a closer look at METR's actual five-hour-plus task list that's being evaluated here? How many tasks do they really have that take humans 5 hours or more to complete, or 16 hours, or 24 hours? As you might expect, those tasks, A, are really, really expensive to collect, and B, it's not even clear to me that you can define a task that one individual human would take, you know, I don't know, two weeks to work on, because that's just not how tasks work in a frontier AI lab. 
Once you're at the point where you're working on an eight hour task, you're interacting with other people, it becomes a collaborative thing. So even defining what is a 20 hour task, what is a 40 hour task, becomes very, very unclear. And the tasks are already sparse. There are not that many tasks that go beyond the five hour mark. So we're basing an awful lot of our analysis here on very noisy data. And indeed if you look at the specific tasks that are being aced or failed by, you know, Claude Opus 4.5 or GPT-5.1-Codex-Max, you can see that one of these models might gain the advantage over the other because it does really well on like one or two tasks. And that's the whole thing, right? There are an awful lot of signal-to-noise issues in that assessment. So there's a whole bunch of speculation about why the 80% time horizon is longer for OpenAI and why Opus's score is so much higher. It really looks like this is a case where there just aren't any tasks that go longer than 16 hours. And so we have no data for that. Everything's kind of getting wonky for that reason. But yeah, overall GPT-5.1-Codex-Max does seem to be more consistent on the whole. That consistency factor, the 80% time horizon, continues to be kind of a critical piece. But in terms of being able to knock weird things out of the park, Claude does seem to be able to surprise and do really well. So jagged performance again, here it is, but coupled to this really, really big challenge of how do we even evaluate this?
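The chaining arithmetic mentioned above (50% twice gives 25%) can be sketched in a few lines of Python. This is a simplification that assumes every step succeeds independently with the same probability, which real agent workloads will not satisfy exactly. `chained_success` is a hypothetical helper name.

```python
def chained_success(per_step_success: float, num_steps: int) -> float:
    """Probability that every one of num_steps independent steps succeeds."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# Compare a 50%-reliable step with an 80%-reliable step over longer chains.
for p in (0.5, 0.8):
    chain = [round(chained_success(p, n), 3) for n in (1, 2, 5)]
    print(f"per-step {p:.0%}: success after 1, 2, 5 steps = {chain}")
```

Under that independence assumption, even the 80% figure decays quickly over a handful of steps, which is why the higher-reliability time horizon matters so much for agentic use.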
B
Right. And on that question we have just one more blog post, titled "How to Game the METR Plot," that points out some of these nuances. If you look at the actual data, like if you look at the one to four hour range of tasks, there are 14 tasks in that dataset being evaluated, which look like they're primarily cybersecurity related. And all of them, by the way, are also conditioned on starting from scratch, as opposed to starting from some set of context that you would have. So there's some other good analysis in that blog post. Generally speaking, this METR horizon plot is a useful evaluation to look at, but people are freaking out way too much given how noisy it is and how hard it is to define what is an X hour long task, et cetera. I think the vibe reports on Claude Code and other agentic AI and vibe coding tools are a much more reliable indicator of the length of task that these things can actually do. And on to policy and safety. We've got just a couple of quick stories. First, New York Governor Kathy Hochul signs the RAISE Act to regulate AI safety. So this is the second major AI safety bill after California's. It mandates that large AI developers disclose their safety protocols and report any safety incidents to the state within 72 hours. If you fail to comply, you can face various fines. Kind of impressive that it passed despite lobbying, I guess, and apparently it received support from OpenAI and Anthropic. There's some discussion as to whether it's toothless or actually serious, but either way it's notable within the scope of regulation because, again, this is the second major thing of this type in the US.
A
Seems to be a confusing question of where at least OpenAI stands on this one. So this is just from the article. Both OpenAI and Anthropic expressed support for the bill while also calling for federal legislation. Obviously the Trump administration's position on this has been, look, stop passing random shit in all the different states. We're going to have one federal piece of legislation to cover this so that companies don't have to abide by a million different regulations. No one has even been able to begin to articulate how we would pass federal legislation that would cover this. So there's a lot of skepticism about how serious one is if one is claiming that the solution is federal regulation. On paper that might make sense, but in practice it is the biggest open secret that there's no path to actually doing this. You know, two things can be true at the same time, and I think that is the case here. You've got to decide which side of that coin you want to be on. But in this case, although OpenAI has expressed support for this bill, here's what the article says next. Not everyone in the tech industry has been so supportive. In fact, a super PAC backed by Andreessen Horowitz (no surprise there) and OpenAI President Greg Brockman is looking to challenge Assemblyman Alex Bores, who co-sponsored the bill. Bores has this quip, he's like, I appreciate how straightforward they're being about it. Well, it's so easy to play the connection game, A to B, B to C. And now we're at the point where we're saying, okay, so OpenAI publicly is in favor of the bill. This article is trying to say that, because a super PAC backed by Greg Brockman is looking to challenge one of the co-sponsors of the bill, maybe that's not the case. That feels like a bit of a reach. But still, you know, it's kind of this complex web of connections. 
So it's hard to tell what people actually think in this space, because the lobbying that happens behind closed doors, I can tell you this firsthand, is sometimes completely night and day from what you might imagine publicly, not always for the worse, but it's just complicated. So, you know, when you look at stuff like this, trying to do the Pepe Silvia conspiracy meme chart and figure out who supports what from this web of connections is really, really hard. But at least publicly, it seems like both OpenAI and Anthropic are behind this. So that would be the first such bill that I'm aware of where you have both of these guys backing something. So kind of interesting.
B
Next up, work on interpretability: "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers." The idea is simple. What if you could take some of the internal outputs of a model and use them as an input to a different model, and that other model can tell you what's going on inside the initial model? So you're not looking at the outputs of the first model, you're looking at its internal state. And that means that if a model is keeping a secret or has some sort of subversive goal in mind or anything else like that, you might now be able to tell that that's the case. And this is related to things we discussed at the beginning with interpretability. And they show that it's somewhat doable to actually accomplish this. Right? So they are able to extract the kind of things a model wants to say when you ask it, like, what is the gender of a given word, or some sort of fact. And the hope is that if you're able to train a general-purpose oracle of this sort, and they do show that you are able to get some generalization beyond your training regime, then you can basically monitor a model and be like, okay, this model is secretly evil and wants to destroy humanity, so we should probably stop it from doing that. So actually, yeah, quite an interesting, exciting direction that I hadn't been aware of.
A
Yeah, this is one of the most general-purpose interpretability, or maybe call it diagnostic, strategies that I've seen so far. It's quite cool. So the idea is, yeah, you feed a prompt to a suspect model, and that prompt, true to form, first gets converted into a bunch of tokens, and then you get your embedding, and then it starts to get crunched on by the different layers of that suspect model. And what you're going to do is reach into one of those layers, TBD on how you choose which layer to reach into. But you reach into one of those layers, and there you have the activations associated with whatever that thought is that the model is thinking, the evil thought. So you're going to take those activations, that vector, that list of numbers that represents what that model is thinking at that moment, and you're going to actually inject it into your oracle model, the model that's going to provide oversight, along with a prompt. And so you feed in a prompt to that oracle that has a bunch of placeholders that will later get filled in with the activations from the suspect model. You'll say like, hey, is this a suspicious thought or something? And then, you know, placeholder, placeholder, placeholder. And then at layer one, at the first layer of that oracle model, that's where you graft in those thought vectors, those activations from the suspect model. So now you have the activations from the prompt that you gave the oracle, like, hey, is this a suspicious thought or something? You have that idea represented in activations. And then next to it are the activations grafted from the suspect model. And that then propagates through, and what they find is that this actually works. Like you said, it works pretty damn well. And there's all kinds of interesting math behind it, like why they normalize it a certain way, and they do a lot of interesting stuff. 
But the bottom line is this is how they do it. One question is, how do you actually train? So they're going to fine-tune this model, this oracle, to do a good job of this kind of work. It doesn't just naturally do it. They have to train it specifically through fine-tuning to detect the actual meaning behind those activations that get grafted in. And the way they do that is by taking a bunch of examples of prompts where they tell the model to act like a pirate, for example. And then they're going to ask the oracle model, hey, what is the tone of the suspect model? What is the tone of this activation? And you check to see, does it get something piratey? And if so, okay, great, that's the training signal you can use. So that's one approach that they use. They also have binary classification data, where you feed the suspect model text that has a known label, like a movie review. You know that it's a positive review because it's a five star review. You have the text, and then you ask the oracle, hey, is this a positive review or not, just based on the activations. And lastly they have this sort of self-supervised context prediction dataset, where they take random text from the web, run it through the suspect model, collect activations, and then ask the oracle to predict the previous or the next token that was actually in the original text that was fed to that model. And so these three kinds of data end up being bundled together in a training dataset. And that's what they use to train this model. It ends up being pretty compute efficient. In fact they just used like 10 H100 GPU hours to do this whole training process on Qwen3-8B. So the alignment tax, or I guess the safety tax, on this is pretty limited. And they find that the activation oracles often match or beat state-of-the-art white-box interpretability methods, and they're a lot easier to use. I mean this is pretty intuitive. 
Just take the thought, graft the thought, and ask a question about the thought and see what you get. So pretty impressive result all told, right?
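The grafting step described above can be sketched in plain Python. This is only an illustration of the idea as it is described in the conversation, not the paper's implementation. The function names are hypothetical, the activations are tiny made-up vectors, and the norm-matching rule is an assumption standing in for whatever normalization the authors actually use.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def graft_activations(
    oracle_hidden: Sequence[Sequence[float]],
    placeholder_positions: Sequence[int],
    suspect_activations: Sequence[Sequence[float]],
) -> List[Vector]:
    """Overwrite the oracle's early-layer activations at the placeholder
    positions with activations taken from the suspect model.

    Assumption: each grafted vector is rescaled to the norm of the
    activation it replaces. The real method may normalize differently.
    """
    if len(placeholder_positions) != len(suspect_activations):
        raise ValueError("need one suspect activation per placeholder")
    grafted = [list(row) for row in oracle_hidden]  # copy, leave input untouched
    for pos, act in zip(placeholder_positions, suspect_activations):
        if len(act) != len(grafted[pos]):
            raise ValueError("activation width mismatch")
        act_norm = l2_norm(act)
        if act_norm == 0.0:
            raise ValueError("cannot rescale a zero vector")
        scale = l2_norm(grafted[pos]) / act_norm
        grafted[pos] = [scale * x for x in act]
    return grafted


# Toy example: three token positions, width 2, placeholder at the last position.
oracle_hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
suspect = [[0.0, 10.0]]
grafted = graft_activations(oracle_hidden, [2], suspect)
print(grafted)
```

The question tokens keep their own activations and only the placeholder slot carries the suspect model's vector. In a real setup the grafted sequence would then continue through the remaining oracle layers, which this sketch does not model.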
B
So they have a fun example in Figure 1 where the model has a secret word. You tell it, please state the secret word. The model says, no, I cannot reveal the secret word. But then if you use that representation and ask the oracle, what is the secret word, it tells you the secret. Now they do stress that these are relatively simple baselines and that the models aren't very good. And in general this is a proof of concept in a sense, and it may or may not scale because it is hard to get training data. But as a first proof of concept, or I guess data point, it is quite interesting. Next we've got some research from OpenAI, "Monitoring Monitorability," and it is about the general concept of monitorability, which is the extent to which you're able to monitor your model's actions to identify whether it's potentially harmful by looking at its internal reasoning, so the chain of thought. And the authors actually propose a new metric, g-mean², to be able to evaluate whether a model is monitorable. And they have a whole evaluation suite for it with a couple of different types, where you want to be able to intervene, process or avoid negative outcomes. So it's a very practical-minded, very intuitive look at the question of, if our models might do bad things, how good are we at intervening or catching it before they get to destroying humanity, for instance?
A
Yeah, that's right. This metric, the g-mean² metric, is actually a really important lens on this problem. The math behind it is pretty simple. It is the product of the true positive rate and the true negative rate of a monitor. And the reason this is useful and robust is that it's insensitive to class imbalances. So sometimes you'll have cases where there are way more bad outputs, or way more good outputs. And that's almost what you would hope for in a model, right? You tend to have a real preponderance of good outputs and very few bad ones. So having just the rates get multiplied together gives you robustness to that. Now there are a couple of big findings that are really useful. One is they compare the case of monitors that have access to the chain of thought versus monitors that don't. So this is to say you get one model to assess whether another model is doing the right thing, with and without access to its scratch pad. In one case you have access to the input and the output. In the other case you have access to both plus the chain of thought. And no surprise, seeing the chain of thought improves performance. But crucially, as you increase the inference-time budget of the monitor model, you will see an increase in performance, but only if it can see the chain of thought. Or at least that significantly improves your payoff from increasing the amount of test-time compute that you give to that monitor. Another interesting consequence of this is that you can hit a given capability level in two ways. You can have a small model and give it a large inference budget, let it create a lot of tokens, so a high inference compute budget. Or you can take a bigger model with a lower inference-time compute budget. 
And what they find is that from a safety standpoint, from an interpretability standpoint, you're generally better off using a smaller model with higher reasoning effort. With longer chains of thought, they are much more monitorable than larger models that produce shorter chains of thought. Now that's a bit of an issue, because this means that you're going to be burning a lot more tokens at inference time, right? You can get the same performance in both cases, but the tokens that you burn at inference time, you never get them back, right? Because they've got to be completely different tokens for the next problem you're going to try to solve. Whereas the tokens you use at training time to build a bigger model end up getting amortized over every problem that bigger model solves. And so there's a bit of a challenge, this monitorability tax, as they call it, which means that your inference-time budget, which is the thing that matters the most here for the economics, is going to have to be higher. They also find that asking models targeted follow-up questions and feeding that extra chain-of-thought information to the monitor can substantially improve detection of misbehavior. So they've got a bunch of scaling curves showing that. It's interesting. There's nothing too shocking here. But a good thing to keep in mind is this tool where you can trade off the size of the model against the number of thinking tokens it generates. And the number of thinking tokens it generates is your friend. More thinking tokens help you monitor the model more effectively. Another reason why this whole thing of models reasoning in latent space may actually be a bit of a challenge for safety, right?
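The metric as described above, true positive rate times true negative rate, can be sketched in Python along with a small check of the class-imbalance point. This follows the verbal definition given in the conversation rather than the paper's code. `g_mean_squared` is a hypothetical name, the labels are made up, and treating True as "misbehavior present" (or "monitor flagged it") is an assumption.

```python
from typing import Sequence


def g_mean_squared(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """True positive rate multiplied by true negative rate.

    Assumption: True means misbehavior is present (y_true) or was
    flagged by the monitor (y_pred).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return (tp / (tp + fn)) * (tn / (tn + fp))


# Made-up monitor results: catches 1 of 2 bad cases, clears 3 of 4 good ones.
y_true = [True, True, False, False, False, False]
y_pred = [True, False, False, False, False, True]
balanced = g_mean_squared(y_true, y_pred)

# Same per-class behavior, but ten times as many benign examples.
y_true_big = [True, True] + [False] * 40
y_pred_big = [True, False] + [False, False, False, True] * 10
imbalanced = g_mean_squared(y_true_big, y_pred_big)

print(balanced, imbalanced)
```

Both calls give the same value because each rate is computed within its own class, so adding more benign examples does not move the score as long as the monitor's per-class behavior stays the same. Plain accuracy would shift in that situation.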
B
And when we say monitor for bad behavior, maybe we've joked too much about destroying humanity. They do focus on some very practical settings. So for the intervention one, where you want to intervene when a model is doing something, some examples they have are health queries, so whether you're allowed to say something about someone's health if, for instance, you're missing evidence. There's also sycophancy. If you're talking to ChatGPT and you think you have something wrong with you, is the model just going to go along with you? That's quite significant given what we've seen with AI psychosis. With high sycophancy, the models can really do some bad stuff to people's psychology. So in that sense I could see it being actually quite important for deployment of things like ChatGPT, given that on the extreme end of sycophancy you get some very bad outcomes. Just a couple more stories. We've got one about OpenAI hiring a head of preparedness. So this is a new position, and it is about having someone responsible for these kinds of evaluations, threat models, safety mitigations, whatever preparedness is for advanced AI, broadly speaking. And it would deal not just with X-risk, but also with AI's impact on mental health and cybersecurity in particular. Seems to align pretty well with that prior piece of research.
A
Yeah. And the preparedness effort has obviously been an ongoing one at OpenAI for some time. They had their Preparedness Framework that they've been, you know, developing and massaging over time. That is now also the reason they're investing more in data center security, like we talked about at the top of the show. But yeah, you know, it'll be interesting to see how empowered this individual ends up being in practice, and how much focus there is on the risk of rogue or runaway AI. I think that, you know, this is something that OpenAI has put varying degrees of emphasis on, let's say, in their superalignment effort that has morphed into, I guess, some preparedness-oriented thing. So we'll see. It's good, surely, that they'll have a single point person responsible for their efforts there. But the question, as always, is empowerment, right? In any institution, how much actual de facto control does somebody have over the course of things? And yeah, it'll be precedent-setting for sure either way.
B
And finishing up with more of a bad outcomes piece, and a bit of a funny story in some ways. The final headline is "X users asking Grok to put this girl in bikini, Grok is happily obliging." So there was a tweet or an X post, whatever you want to call it, where someone mentioned what you see if you go to the media tab of Grok, Grok being the account on X that people can tag to ask questions, or apparently can tag to ask it to edit an image in a tweet. And this is true. I checked it myself. If you go to the media tab, it's pretty full of people just saying, put this girl in a bikini, and it happily obliging. And oh, these are not like modest pictures. These are serious edits that Grok was executing. I think since then, as you've seen happen multiple times with Grok, they kind of pulled the kill switch. I think the media tab might be empty now. Another funny example of this that people noticed was that if you have a photo of President Trump and some other political figure, let's just say, you can ask Grok to like paint the war criminal a certain color and it is happy to do that. So Grok is not refusing to do any sort of image edit so far, and it's pretty funny. But also, obviously, deepfakes for porn are one of the known bad use cases of AI. It's not good to ask AI to create pornography of people who just posted their photo. And presumably xAI will put a stop to this if they haven't already.
A
Yeah, I actually just, I just checked. I hadn't checked Grok's media tab before. So you can see, I mean, it is back and you know, as you would imagine, a large fraction of the images generated there are in fact bikini images, not always of women. Quite a few men in bikinis in those images. So you can check that out at your own risk.
B
Yeah, I've seen President Trump in a bikini and I wish I had not. But it did happen because of Grok. And with that, we are finished with the first episode of Last Week in AI in 2026. Now that Jeremy's back, we will be trying to do a weekly run once again. So thank you for bearing with us in the last few months of 2025. As always, we appreciate it if you review the podcast, if you share it, but more than anything, if you keep tuning in. So please keep doing that.
C
Time to break it down. Last Week in AI, come and take a ride, get the lowdown on tech and let it slide. Last Week in AI, come and take a ride, from labs to the streets, AI's reaching high. New tech emerging, watching surgeon fly, from the labs to the streets, AI reaching high. Algorithm shaping but the future sees, tune in, tune in, get the latest with ease. Last Week in AI, come and take a ride, get the lowdown on tech and let it slide. From neural nets to robots, the headlines pop, data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.
Title: 2025 Retrospective, Nvidia buys Groq, GLM 4.7, METR
Date: January 7, 2026
Hosts: Andrei Karenkov, Jeremy
This special episode opens 2026 with a comprehensive recap of key developments in AI throughout 2025, followed by a discussion of recent news—including the major Nvidia-Groq deal, advances in open source coding models, new benchmarks, and policy and safety updates. With one host returning after a several-month hiatus, the episode features both reflection and catch-up, bringing listeners up to speed on major commercial, technical, and policy trends in AI.
(00:11–24:06)
Reasoning Models, Agentic AI, and Vibe Coding
World Models and Image Editing
AI Hardware and Data Centers
Rise of Open Source and Smaller Models
Distributed Training:
With deep dives into technical, business, and policy realms, this episode offers a sweeping—and opinionated—look at the state of AI in early 2026. As always, the hosts balance skepticism with optimism, note the signal amid the hype, and provide candid (sometimes cheeky) commentary on news that will shape the next year.
For deeper dives, listeners are encouraged to review the referenced blog posts, papers, and the full transcript for technical details.