A
Welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. And you can also head over to lastweekin.ai for our text newsletter with even more news. I am one of your regular co-hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade.
B
And I'm your other co-host, Jeremy Harris. I'm, I don't know, I do stuff. I'm at Gladstone AI. That's the one. I do a lot of AI national security things. And I'm excited to be back on the podcast, because we're half a dozen episodes into the return, in a sense, and we missed last week. That was on me: travel. But we will not be missing weeks like that in general. We are going to be recording these. There was fortunately not that much going on last week. There was a DeepSeek paper that is worth paying attention to and that we'll talk about. There are a couple of little things. Certainly Cowork is a big deal, the Anthropic release, but yeah, not a huge week. So kind of forgiving. That doesn't always happen when we miss a week. Usually we get flooded, but in this case...
A
We can cover both weeks in one episode, I think, without going into crazy overtime as we might otherwise.
B
That's right.
A
And yeah, this episode has a real mix of stuff. Some significant and some minor updates to tools. Gemini also has some interesting updates. Business-wise, there are some new $10 billion, $20 billion deals, which are pretty interesting. We've got a decent amount of open source compared to most weeks. And then quite interesting papers in research and advancements, dealing partially with this question of how do you scale up memory, how do you go the next step beyond what we've done in terms of learning. So a pretty fun episode to come. And we'll go on and start with tools and apps, with Anthropic's new Cowork tool. So this is probably a big deal, as you said. And it's a big deal because Claude Code is a big deal. At this point it's almost like a joke within Silicon Valley that people are going crazy about Claude Code. And that's because it's quite powerful and just does a lot of work for you. And what people have observed is that Claude Code can do a lot more than just code. It can edit videos, it can compile spreadsheets, it can do all sorts of stuff. It just goes into your computer and does the things that you ask it to do. And that's effectively what this is. This is Anthropic integrating Claude Code, but without the coder, programmer interface of a terminal. You don't need to install it as a package or anything. It comes bundled in the Claude desktop app and is its own little tab that you can switch to and then ask it to do stuff. And it goes on and interacts with the file system very much like Claude Code. So given that Claude Code has found many uses and many proponents, including myself, Cowork could similarly have a lot of fans to come.
B
Yeah, absolutely. I mean, if you've used Claude Code, you're almost used to now one-shotting some pretty complex things. Things that in the past might have taken me three hours to do, to be honest, it just knocks them out of the park. So it is pretty wild. And now we are seeing that translate into this sort of desktop agent model, which is what this is. Cowork is kind of an all-purpose aide, a new way to interact with your computer. So you can do things like point it at some messy downloads folder. I mean, I can speak for myself and say my downloads are just kind of a crap pile of random things I've downloaded. And you can say, hey, sort everything by file type and date, or by theme, or whatever. And it actually can look into a folder full of, say, screenshots and automatically build Excel spreadsheets and basically just dive in, do a bunch of work that you might have an intern do, and then give you an output. So it is a pretty broad set of capabilities. There has been some conversation about the security side of things that I think is important to note. So yes, it is a combination of local access to your computer, web access, and autonomy, which is almost the lethal trifecta of three things that people talk about when they think about loss of control scenarios. One important thing, though, and this is often lost in the noise: Anthropic is actually using sandboxed virtual machines to contain these systems. They're not letting Claude run wild on your actual Mac; they're running it in a digital container.
So this is kind of a new gold standard in 2026 in terms of what security looks like for AI agent safety. People are saying, okay, for competitive reasons, we're going to have to let these things rip in some sense and do big functions on your Mac and on increasingly sophisticated systems. How do we retain guardrails in that context? Anthropic has always been in this interesting position where they have historically oriented towards, well, we don't want to release a true frontier capability because we don't want to make the racing dynamics worse. Clearly that's no longer the philosophy, but instead there's this view that, okay, let's at least shape the trajectory of the technology. And so you see them doing that with their responsible scaling policies. Other labs do that too, but they are explicitly trying to come up with frameworks and precedent. Like this whole idea of having the model run in a digital container: that puts pressure on other labs that are going to fast-follow to do the same thing. So this is, at the margins, how Anthropic is spending some of its safety budget, let's say, which is an interesting play. And by the way, this is a pretty interesting price point too. So they're looking at $100 to $200 a month when this comes out, for the Claude Max tier anyway. So, you know, when we were talking about OpenAI and GPT-5 back in the day, it was, oh, this could be $20,000 a month or something like that. We're now genuinely hitting hundreds of dollars a month, thousands of dollars a year. So this is pretty interesting. It's quite a price point, right?
A
That's the same pricing that they have had for Claude Code. And at those higher price tiers, you unlock a lot of tokens, a very high amount of usage, which you do wind up using when you run these agentic, almost assistant-like tools. And I know personally I'm on the Max plan, so I would imagine a lot of people are actually paying those $100, $200 price tags. One more thing about the safety. I wonder if this also implies a vote of confidence by Anthropic on the alignment side. With this kind of tool, even forgetting potential future amplifications, in the present you could ask it to go hack someone or go write spam or all sorts of very boring but real misuses. And I think we are at a point in alignment and safety, compared to a few years ago, where the frontier labs may be more comfortable believing that their agents are not going to go off and do things they're not supposed to, such as in this case.
B
Yeah, it's an interesting question. I mean, certainly in the short or immediate term, with this model, they're comfortable having it run in this context, right? With the reputational risk that comes with that and everything else. It's also worth noting that Anthropic is selling mostly to corporations, right? Their B2B work is the most significant product line, and they actually do dominate in that vertical right now. So that's high risk, right? If you start having failures in a B2B context, it can affect a lot of people, and it's high stakes. But yeah, when you talk to people about the short-term alignment side, getting agents to do what they're meant to do, you do get quite a bit of confidence on this. The long-term picture on the superalignment side remains actually quite pessimistic. And from what I've heard from folks at Anthropic, and from talking to people at all the labs, one of the interesting differences is that Anthropic seems to believe in short AI timelines more, I would say, on average than most, and therefore to be more concerned about superalignment, because there hasn't really been much concrete progress in that direction. So it's this interesting thing where in the short term we kind of go, oh yeah, these agents, we can keep them contained, we can release this. But then there remains that question mark. And I'm curious if that distinction ends up getting blurred in the future as we start to get more confused about what counts as what.
A
But.
B
But yeah, for sure. And by the way, on the price point, one thing to note too is that this is kind of a shift for Anthropic, sort of like what they did with Claude Code, where they're not just selling intelligence. What they're really doing here is selling labor. That's where these price points are coming from. We're starting to get into the thousands of dollars a year. Low thousands, no question. But this is starting to look more like, hey, intern, go do this thing, than, hey, autocomplete my code and give me the next few functions or whatever. This is a really big conceptual shift in the landscape. And I think that is reflected in those price points. It's hard to justify them otherwise.
A
Next up, we have some news about Gemini. They're introducing a feature called Personal Intelligence, which would connect to Gmail, Google Photos, Search, and YouTube histories for users of those, and be able to reason about that information when chatting with you. So a very common-sense application or extension of Gemini by Google. Apparently Google does acknowledge the potential for inaccurate responses or over-personalization and is going to be addressing those problems. It's also an opt-in feature, so you can connect and disconnect different apps. And Google is implementing some guardrails as well for sensitive topics. So yeah, definitely the kind of thing where you could see some funny unintended AI knowledge access. And I think Google might have learned from their embarrassing episodes in 2023 and 2024 to avoid the kind of blunders that are avoidable. This is now out in beta, and it is usable by Google AI Pro and Ultra subscribers.
B
Yeah, I like the phrase over personalization, you know, as a soft way of expressing something. I'm actually quite curious what specifically is going to be meant by that.
A
Listen, if I'm googling for lobsters and dresses, then Gemini should just not worry about why, you know.
B
That's right, that's right. Yeah. I mean, how many times have we googled for lobsters and dresses? It's just constant. Absolutely. And one of the interesting things too is there's been this shift. I remember years ago, when we were talking about, I guess, GPT-4 and all that stuff, we were talking about the advantage that OpenAI enjoyed relative to Google, because that was the main axis at that time, in terms of OpenAI being perceived to be a new player. So it's like, they release a shitty thing and it helps people make bombs or it does whatever, and everybody kind of goes, eh, whatever. It's OpenAI, they're just starting up here. Whereas if Google releases something, everyone goes, whoa, Google. Like, what the fuck, guys? And it's kind of changed now, where OpenAI is actually large enough that they can't quite get away with the same stuff that they might have been able to pull off, say, two years ago or three, or even one for that matter. And one of the ways that's expressed is on the ad side. Their opportunity to experiment with ads that would probably be crappy at the beginning has kind of passed. And so here Google has been going the other direction, saying, hey, you know what, we are actually playing catch-up. And that comes with a license to throw some wilder punches. And I feel like we're starting to see that a little bit: the willingness to experiment and just try things out with all this couched language, like, hey, there may be some over-personalization, there may be this or maybe that. But certainly Google's now shipping, which is a big shift. We'll see if that continues. But institutionally, they feel like a different company in this space now.
A
And speaking of Google, we've got another story related to them, this time about their AI Overviews in Google searches. They're removing some AI health summaries after an investigation found dangerous flaws in those responses. So Google has disabled specific queries like "what is the normal range for liver blood tests" after experts flagged them as dangerous. They didn't do that across the board. There are still some queries where it would give those responses. So yeah, one of these cases where I guess it could have been predicted that having a chatbot summarize some information inaccurately could be problematic, and it's good that this was caught.
B
Yeah, I mean, this is easy to say, by the way, because most people just don't have the time or the capacity or the knowledge base to do this. But the way to use these tools is obviously that you do a search, you get whatever result, and then if it's high stakes, you look it up. You actually make sure you find the ground truth. The report talked about this critical error on pancreatic cancer searches. The suggestion there was that patients should avoid high-fat foods. Apparently that contradicts standard medical guidance, where you want to maintain your weight, and it could be a serious issue. And I don't want to get in the business of saying, ah, who cares, so what, here. But at a certain point we do face this question: we're either going to surface these recommendations or we're not. And the question there is, where does the burden lie in terms of validating the factuality of some of these statements? I don't know what the right answer is here, but it does seem like the default option is to just say, okay, well, therefore let's put pressure on companies like Google to never surface this. And we're not seeing the other side of the coin here. How many lives are saved by the good recommendations that actually help? I don't know that number. And until we do, it kind of feels like the self-driving car thing all over again. We can look at the awful crashes, but if we're not looking at the lives saved, it's just really tough to tell.
A
And this reminds me, recently I was chatting with some people about the topic of AI Overviews and how it sort of feels like, almost under the radar, AI Overviews just became Google. I cannot count the number of times that I'm just asking Google a question, which used to be a ChatGPT thing. People were saying back in 2023, you know, Google is in peril. Google might die out because ChatGPT would replace it as the go-to place for search. And it took Google a bit of time. And when AI Overviews initially rolled out, people were finding all these jokes about putting glue on a pizza, I think, or how many rocks you should eat per day. But I know in my case, and for some people I've seen, there's now just this learned behavior where, if you have a question, you Google it and you look at the AI Overview, and that's just standard. So I've sort of been reflecting and noticing that, yeah.
B
People were always talking about how Google has 90% market share on search. What they tended not to focus on was that OpenAI had like 100% market share on the chat market. And now Google is stepping in on that, and Anthropic is stepping in on it. Now, these are massively growing spaces, right? So the pie is growing fast enough that there's plenty there for everybody. But it is interesting. You're right, it's not as simple as just that search market story.
A
Yeah, I guess some uses of ChatGPT have been overtaken by AI Overviews. Not even Gemini, just AI built into Google Search, which in itself is pretty interesting. We've got one more story about Google. Gemini is expanding within Gmail. So beyond the basic features that have existed, like summarizing emails, it can now do some more useful stuff. You can ask questions about emails if you've got some of the subscription tiers. There's also Proofread, which would offer grammar and style suggestions. We've got AI Inbox, which would filter emails to highlight important messages and tasks. Help Me Write, Suggested Replies, all these features. So they're still not integrating it full-on as an agent or anything like that, but adding it here and there in various ways, which to me seems pretty intuitive.
B
Yeah, some of these look like actually really interesting lifestyle improvements. They give this example of, instead of typing keywords into your inbox search, if you're looking for a plumber who gave you some quote, you just type in, "Who is the plumber that gave me a quote for the bathroom renovation last year?" And that actually would solve an awful lot of my inbox search problems, personally. So it seems like a good quality-of-life thing. This whole idea of the AI Inbox, where they're going to filter the clutter so you can focus on what's most important, that seems a bit riskier to me, right? Because at that point, knowing what's most important is very, very context-laden. So that seems like something that we should keep tabs on, to see what the actual vibe check is and what the failure modes are. But yeah, it is interesting. Again, they're leaning out, they're taking this big swing.
A
And just one last story in the section: Slackbot is an AI agent now. So Salesforce has launched this new AI-powered version of Slackbot, of course built into Slack, which I think still is one of the dominant messaging platforms. And as you imagine...
B
Dude, tell me Slack's in trouble. Without telling me Slack's in trouble.
A
I don't know much about the market. I think it's still Microsoft and Slack as far as the big players.
B
That's true, yeah.
A
Certainly kind of a big deal. And this new agentic Slackbot would be capable of finding information, drafting emails, and scheduling meetings within Slack. It can also interact with other enterprise products like Microsoft Teams and Google Drive. This was announced a little while ago but is now being rolled out. And the Salesforce CTO has described it as a super agent. I mean, I think this is another aspect of what we've seen: AI getting integrated as these little question-answerers and summarizers. I think the next step probably is that all of the business apps, Notion, Slack, you name it, will have agents built in.
B
It's interesting for anyone who was around in the 2020 era, when people were really getting riled up about what the future of post-human labor, or the post-human market, would look like. They often would say it'll all be AI agents chatting with each other, and it was difficult to imagine how that would actually start happening. But now you see it actually happening. Whether it's Salesforce or Slack or wherever else, it's like, for a larger and larger fraction of my work, I'll be outsourcing it to these agents, and eventually it's going to go all the way.
A
Everyone is going to become a manager. Manager or a paperclip.
B
Those are your two, your two choices.
A
And on to applications and business. First up, Anthropic raising money. They are getting $10 billion at a $350 billion valuation. This is still not fully signed, but it sounds like, based on reporting, it's more or less being finalized. It sounds like GIC, Singapore's sovereign wealth fund, and Coatue Management plan to lead the new financing. I guess we're not done with these mega deals yet.
B
Yeah, no, absolutely. It's notable too: 350 billion. I am old enough to remember the old, old times of September 2025, what is that, four months ago, when Anthropic was only worth 183 billion. So they've roughly doubled their valuation just in that time. Now, when you get into this territory of raising $10 billion plus, and we've talked about this a lot, you are in the territory of sovereign wealth fund raises. There's just no other place to get that kind of capital. Maybe SoftBank, but they're pretty tied up right now with OpenAI. So this is really the end of the road. After this, you IPO and you access the deep capital markets of the United States. That's basically it. And that is the plan, by the way. Anthropic is expecting to break even by 2028, which is pretty soon. This suggests they could reach profitability actually faster than OpenAI, which, again, we've talked about. Not a surprise, given that Anthropic is dominating in that B2B segment, right? There's much more profit per token associated with that work. So you might actually expect that break-even to happen. But the revenue growth has also been really crazy. I mean, they went from about $1 billion at the start of 2025 to, less than a year later, 5 billion. They 5x'd their revenue in that time. Pretty wild. They are preparing for an IPO as early as late 2026. So they've got Wilson Sonsini, which is a big, famous law firm for tech company IPOs. And they're starting to work on the corporate restructuring that they need to do for that. So this will be one of the big stories of 2026 if we get there. The Anthropic IPO, possibly the OpenAI IPO. There's a lot coming down the pipe.
A
And speaking of mega financing rounds, xAI has raised $20 billion from Nvidia, Cisco, and investors. The funding would value xAI at approximately $230 billion. So a pretty impressive raise. It sounded like there was actually a lot of demand to get into the round. xAI hasn't had something like Claude Code, but they are working with the Department of Defense in the U.S., so I guess that could be helping with the optimism side.
B
Yeah, the Department of War now, I guess. I'm trying to remember, but I think the deal that they had was for like $100 million, and I think it was for a couple of different labs. I'm not sure. But certainly there is that partnership. I think it'll be a relatively small fraction of their revenues, of course, but strategically interesting and important. You know, Elon famously said that the story of them raising an initial $15 billion investment was false. He's like, this is not true. And presumably now we're learning that it's because it's going to be a $20 billion investment. So, technically true that it wasn't false. Well, technically true that it wasn't true. Anyway, directionally accurate. The other investors here, by the way, do include the Qatar Investment Authority and Abu Dhabi's MGX, right? So you are again back into that whole sovereign wealth fund adjacent territory. It's just a lot of money. I mean, the big play is obviously going to be on the dataset that X has, right? xAI now kind of has access to all the X data and the Tesla data. There are all these interesting integrations, especially now that we're talking about this world model stuff. Self-driving car data starts to look really interesting. So anyway, not surprising, of course, with Elon at the helm, that they're pulling off these wild fundraises.
A
And just a fun anecdote about Grok real quick. Just a couple of days ago we needed to test our moderation system, and can you guess how we generated the inputs to test moderation? Grok was very capable of offering some very spicy things that, it's true, other chatbots would probably not have been capable of, and that models in fact are a little bit sensitive to when it comes to moderation. And of course we can't go through this section without at least one story about Nvidia. This time it's about Nvidia needing, quote, a "supply chain miracle" from TSMC as China's H200 AI chip orders overwhelm supply. So apparently there are as many as 2 million orders for H200s coming from China, while the current inventory is only 700,000. So that's a big gap, and that's despite the average selling price of one of these chips being estimated at $27,000. So this is a lot of potential money that Nvidia would be leaving on the table if they are unable to actually create and sell these chips.
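As a rough sketch of the arithmetic behind those figures, using only the numbers quoted in this segment (which are reported estimates, not confirmed data), the size of the gap can be worked out as follows. The variable names are illustrative.

```python
# Figures as quoted in the episode; treat them as rough reported estimates.
orders = 2_000_000        # H200 units reportedly ordered from China
inventory = 700_000       # H200 units reportedly in current inventory
avg_price_usd = 27_000    # estimated average selling price per chip

# Chips that would still need to be manufactured and packaged.
shortfall = orders - inventory

# Revenue tied to the unfilled orders, and to the full order book.
unmet_revenue_usd = shortfall * avg_price_usd
total_order_value_usd = orders * avg_price_usd

print(f"Shortfall: {shortfall:,} chips")
print(f"Revenue tied to the shortfall: ${unmet_revenue_usd / 1e9:.1f}B")
print(f"Total order value: ${total_order_value_usd / 1e9:.1f}B")
```

On these numbers, most of the order book depends on chips that have not been built yet, which is why production and packaging capacity matter so much here.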
B
Yeah, and that means they're spinning up, as Jensen said, their H200 supply chains, right? They're bringing them back to life. They had been all focused on Blackwell because, well, that's just the better chip. But here they're rotating back into the H200 because this China sale thing is going through. Here's one of the most important things to keep in mind on the supply chain as they try to do this. When you go to make the H200, you're using TSMC's 4 nanometer node, and that can be produced both in Taiwan and in the US. So you've got tons of production capacity there. The issue in meeting this demand is actually not the ability to fabricate that logic. The issue is actually the packaging. Basically, this is the process where you take the logic and the memory and you put it all on one coherent chip. That's CoWoS. And CoWoS packaging is basically being used across the board for Hopper, for Blackwell, and for Blackwell Ultra. So now that they're saying, okay, well, we want a ton of Hopper, it's like, sure, you're using a different node to fab it, so that's great, you can do that in parallel, but you're relying on the same finite pool of packaging. And so that's really what's becoming the rate limiter here. It has been clear for some time that packaging was going to be the rate limiter. It has been an issue, but now even more so, because you're pulling down a bunch of H20s, or H200s rather, to ship them basically to China. So this is a sense in which, by the way, this idea of exporting advanced chips to China actually hurts American companies directly, because you're hitting the packaging part of the supply chain. So, sure, TSMC can fab the chips, or the logic dies, but there are going to be fewer Blackwell chips if the same packaging process is used for both. So, very complex supply chain.
There are a lot of interactions that are not necessarily obvious when you first think about, hey, let's lift the ban on exporting these things. Also, the H200 is about six times more powerful than the H20 for training workloads. And that's one of the reasons that China's AI industry is rushing to place these orders. So there you have it. I mean, this is a very complex story, and it is an interesting consequence of the China ban lift.
A
Right. It's coming at a time when there's also been a complex history of export controls, where last year the Trump administration sort of flip-flopped a bit, but then eventually basically let Nvidia sell to China. Now the government in China seems inclined to maybe start discouraging buying these chips, from what I've seen, but this is still going through. So I would imagine at least part of the story for why there's a rush to buy these up is because it's very uncertain if it's going to keep being possible.
B
Absolutely. And from one administration in the US to the next as well. It's also unclear, you know, if Congress flips to Democrat in 2026, what new laws could come in that make export controls harder, or sorry, that make exporting harder. But yeah. And then in terms of the shipping to China, you're right, the Chinese have come out and said, hey, we're not so sure we want these chips now. It's kind of ambiguous. So Jensen had a press conference, I think it was a press conference, or he made some statement, where he was like, look, here's the deal. There's not going to be a splashy announcement from China saying, yes, we're open for business, we've decided we want the chips. Instead, it's going to come down to the purchase orders. There are going to be purchase orders that suddenly come from Alibaba and from Huawei and from everybody else. And it'll all be done discreetly, but the GPUs will flow. That's how we'll know that China is actually open for business. And I mean, frankly, I fully expect them to, even though I know there's been a bunch of questioning down that line. I would be shocked otherwise. And we can revisit this in a future episode. But my money's on this: if the Trump administration allows those to ship, the H200s will ship.
A
And we've got a couple more stories on chips and compute. The next one is about OpenAI signing a deal worth $10 billion for compute from Cerebras. So this is a multi-year agreement where Cerebras is going to deliver 750 megawatts of compute power starting this year through 2028. So this is kind of an extended deal, with the 10 billion being accrued over time. I think it's an interesting development, where we've seen OpenAI continually trying to get more compute and diversifying the sources of compute. Cerebras is an interesting player in a space where they have this AI-specific chip system that can have very high throughput, specifically for inference. So, different from Nvidia GPUs. They've been around for quite a while, and it seems like Cerebras now is getting to a point where there's a lot of demand for these chips in data centers.
B
Yeah, and this is really about inference, right? Cerebras is an inference platform. That's what this is going to be used for. OpenAI came out and said that this is basically going to be about decreasing latency for certain customers. And OpenAI has a strategy that, well, the way they described it here is to build a resilient portfolio that matches the right systems to the right workloads. In other words, there are workloads that Cerebras specializes in. There are going to be workloads for all kinds of other players in this space, and you could think here of Fluidstack or any other entities that have different specializations and different kinds of workloads, and they're going to try to ship them to the right providers. And that's really an all-ships-rise situation. There's inference, there's training, there are weird mixes of the two. That's what it's all about. Also, Cerebras, by the way, has been pushing back their IPO a lot. They first filed for it in 2024, but there have been a bunch of controversies and challenges. So they've been raising on the private market since then quite a bit. Apparently they're in talks to raise another billion dollars at a $22 billion valuation. So I don't want to say they're limping towards that IPO, because these are big strides, but it's been a bit of a stutter step to that goal.
A
Yeah, I think with Nvidia all but acquiring Groq, it would have been a very fun story if AMD all but acquired Cerebras, but it looks like that's probably not happening. And on to more of a cloud story: CoreWeave is amending its credit agreements. So CoreWeave is a player in cloud computing, and it seems to be modifying these agreements to have more liquidity. So I'm actually going to let you take over, Jeremy, because it's a bit technical.
B
Yeah, well, so this is one of the classic challenges that happens with a lot of these big builds. Basically, CoreWeave ordered a bunch of hardware like GPUs, like billions and billions of dollars. Probably it's Blackwells, just given the stage that we're at. And the problem is that they have been delayed, so their arrival has been delayed. CoreWeave paid the money for those things, so they're out of pocket, but they need those GPUs to be able to, like, pay back the loans and all that stuff. So what's happened here is they're basically saying, look, we need this liquidity bridge because of this delay, and this means that they need to turn back to their existing investors, sorry, lenders, I should say, and say, hey, we got to rework our terms here. Because when lenders give money, in this case, they gave, like, $2.6 billion to CoreWeave to finance all this stuff, they don't just give $2.6 billion and walk away. They set these things called covenants, and these are rules that the company has to follow to prove that it's healthy. And so the amendment that they've just made to those covenants has a couple of different components. One is there's a minimum liquidity, a minimum amount of cash on hand that CoreWeave had to keep. That was lowered to $100 million. And that gives them more breathing room to spend cash on data center builds instead of just like letting cash sit idle in a bank account to satisfy some requirement. But another key one here is that they actually are postponing the testing of their debt service coverage ratio. This is basically just like how much their operating profits can cover their interest payments. Like that ratio. There's a moment where you trigger a test of that, basically checking your profit versus your interest, and they've pushed it back to late 2027.
So you can see how this is all about, like, softening the pressure on CoreWeave, because fundamentally, what the lenders are saying is, look, we believe you'll be able to make this money back, no problem. We want to let you fight another day. So we're just going to soften these things. We have confidence in the commercials. It's the underlying supply chain that's the issue here, and this we expect to be resolved. So there's a whole bunch of other stuff in here, but it's basically that theme. They're being allowed to just decrease the proof points that they have to show, decrease the amount of cash on hand that they have and all that. So really important for CoreWeave. And it does show that at least as far as these lenders are concerned, that market seems pretty healthy, or at least CoreWeave's position seems pretty healthy.
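To make the two covenant tests described above a bit more concrete, here is a minimal sketch. It assumes the simplified definition given in the discussion (operating profit divided by interest due). The function names, the 1.2 coverage threshold, and every figure other than the $100 million liquidity floor are made up for illustration; a real credit agreement defines these terms in far more detail.

```python
def coverage_ratio(operating_profit: float, interest_due: float) -> float:
    """How many times operating profit covers the interest due in a period."""
    if interest_due <= 0:
        raise ValueError("interest_due must be positive")
    return operating_profit / interest_due


def covenants_met(cash_on_hand: float, operating_profit: float,
                  interest_due: float, min_liquidity: float,
                  min_coverage: float, coverage_test_active: bool) -> bool:
    """Check a minimum-liquidity covenant and, if active, a coverage covenant."""
    liquidity_ok = cash_on_hand >= min_liquidity
    # A postponed coverage test is simply not evaluated yet.
    coverage_ok = (not coverage_test_active
                   or coverage_ratio(operating_profit, interest_due) >= min_coverage)
    return liquidity_ok and coverage_ok


# Hypothetical figures, in millions of dollars.
example = dict(cash_on_hand=150.0, operating_profit=80.0, interest_due=100.0,
               min_liquidity=100.0, min_coverage=1.2)

print(covenants_met(**example, coverage_test_active=False))  # True
print(covenants_met(**example, coverage_test_active=True))   # False
```

With these invented numbers the borrower clears the liquidity floor but would fail the coverage test if it were run today, which is the kind of situation where postponing the test buys time.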
A
And onto the last story. LMArena is now valued at $1.7 billion after raising $150 million in a Series A funding round. That's pretty soon after their previous fundraising. They had a $100 million seed round in May that was at a $600 million valuation. They launched their commercial service, AI Evaluations, back in September and apparently reached an annualized revenue rate of $30 million by December. LMArena started out as just kind of a community-run evaluation platform, was originally called Chatbot Arena, founded by UC Berkeley researchers, and initially was funded through grants and donations. So it's kind of an interesting story of how it developed and how it became apparently a very valuable product for companies. Did not see this coming.
B
No, absolutely. And you look at the investors who are participating in this round, this is pretty wild. So it is a Series A, and Andreessen Horowitz, Kleiner Perkins, which, like, you know, old school but still very upper brow or highbrow, Lightspeed Venture Partners. There's a bunch of others, but there's a lot of the who's who in the Valley. So damn solid fundraise.
A
And onto projects and open source. We've got a few interesting projects and open source releases, starting with Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models. So this is a framework that has scaled this notion of cascaded reinforcement learning across multiple domains to develop these general-purpose reasoning models.
B
This is a really interesting paper from the standpoint of a key problem that's been persistent in the space for a long time, which is catastrophic forgetting, right? So traditionally when you train an LLM, you train it, you know, pre train it, let's say you're going to train it with data from a whole bunch of topics. If you continue your training and you cause it to specialize in a domain like math code or general chat or whatever, it will start to forget the things that it's learned about other topics. So this causes people to try to choose between should I make a specialist model, should I train sort of topic by topic so that it masters those, those topics or should I go broad and at what stage should I do what? And what they're doing here is they're saying, well look a, you should be doing this with reinforcement learning and you should be going, instead of blending a bunch of different data from multiple domains like math and code and general chat at the same time, in reinforcement learning, what you should do is do sequential domain wise training. Just code, just math, just general chat, just whatever. But with reinforcement learning, and that's really important, the reinforcement learning process is one that prevents catastrophic forgetting. It turns out it seems like using reinforcement learning from human feedback in particular as a pre step can set up the model's foundational reasoning abilities. And it gives you this base, this robust base that can help you go domain specific down the line without collapsing. Reinforcement learning is quite interesting, like where you should use it versus where you should use supervised fine tuning or you know, anyway, the standard sort of autoregressive training. The autoregressive training seems to be really where you get into the catastrophic forgetting thing if you do this approach. Whereas what happens here is they do have this RLHF kind of base alignment to get models to learn general reasoning. 
And then when you go domain specific, one of the big advantages is if you focus just on math. Like math has often a binary reward like correct or incorrect. And it's usually really fast to compute. So it's got a specific profile of how the data flows through the system. The sources of training instability, all of that is pretty unique to math. And if you move to code, the reward might change. It might require a sandbox execution environment and it might have higher latency. Right. So if you're trying to mash together math code and software engineering, which is extra noisy, right, because your code might be partly correct, but fail one test but not others. And so trying to mash all these together when you do your reinforcement learning step can cause all kinds of training instability because the way data is flowing through your system just has to be a little different for each of those. And so what they're doing is by going through one domain after another after another using reinforcement learning, the hyperparameters like learning rate or batch size, all that can be tuned specifically for that domain's response lengths and the sparsity of rewards and all kinds of stuff. You can even do reward shaping to make sure that you're tailoring rewards more closely to what should be done in that space. And so this is really how they do it. They start with this like base alignment step where they're doing general conversational domain and again through RLHF, reinforcement, learning, human feedback. The goal here is just to make sure the model is helpful and can follow general instructions. And then they might move on to a bunch of training on, on math using verifiable rewards and then a bunch of, you know, RL training on coding, then a bunch of RL training on software engineering and so on. Which is just like it's this interesting discovery that prevents catastrophic forgetting. 
Again because you're using RL and because you're starting off with a model that you've trained to do general purpose reasoning. So there's a bunch of extra stuff here. But yeah, I think this is actually quite an interesting process paper as we think about where RL can add value and where it doesn't.
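The sequential, domain-by-domain schedule described above can be sketched roughly like this. It is only a skeleton under stated assumptions: the stage names follow the order mentioned in the discussion, but every hyperparameter value is a placeholder, `train_stage` is a hypothetical callback standing in for a real RL trainer, and the two reward functions are simplified illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


def math_reward(predicted: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def code_reward(tests_passed: int, tests_total: int) -> float:
    """Noisier reward: fraction of unit tests passed in a sandbox run."""
    return tests_passed / tests_total if tests_total else 0.0


@dataclass
class Stage:
    domain: str
    learning_rate: float      # placeholder, tuned per domain in practice
    batch_size: int           # placeholder
    max_response_tokens: int  # placeholder


# One domain at a time, rather than one blended data mix.
CASCADE: List[Stage] = [
    Stage("general_chat_rlhf", 1e-6, 256, 2048),
    Stage("math", 2e-6, 128, 8192),
    Stage("code", 1e-6, 64, 16384),
    Stage("software_engineering", 5e-7, 32, 32768),
]


def run_cascade(stages: List[Stage],
                train_stage: Callable[[str, Stage], str],
                start_checkpoint: str = "sft_checkpoint") -> str:
    """Run the stages in order; each one starts from the previous checkpoint."""
    checkpoint = start_checkpoint
    for stage in stages:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


def fake_train_stage(checkpoint: str, stage: Stage) -> str:
    """Stand-in trainer that only records which stage ran."""
    return f"{checkpoint}->{stage.domain}"


print(run_cascade(CASCADE, fake_train_stage))
```

The point of the structure is that each `Stage` carries its own settings, so response length limits, batch size, and reward function can differ per domain instead of being one compromise for a mixed batch.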
A
Yeah, they have a fun diagram essentially of this whole process, which people are now starting to call mid-training or post-training. And it seems to be starting to get a little kind of figured out with this set of stages. You do supervised training, then RLHF, then a few different variants of RL that are domain specific, still primarily in verifiable domain land. This one actually released the paper and the model kind of a month ago. So it's not super fresh, but I don't think we covered it. And it is fairly notable for this size range of 8 billion, 14 billion. These models are very performant and fully open sourced by Nvidia, as well as the training recipes, data and the report itself. The paper is very detailed, it's like dozens of pages. So nice to see here from the US. It's sort of similar to a DeepSeek-type technical report where you see a lot of nitty gritty, a lot of the stuff that would otherwise be figured out within frontier labs but never shared publicly. I think this is giving us a hint of what the frontier labs are probably figuring out.
B
Yeah, and it's also, you know, we talked about this I think with the last Nvidia release, but they're very interested in promoting the open source market because everybody doing open source is using Nvidia hardware, whereas increasingly you're seeing with closed source, we've talked about everything from Groq, which now is Nvidia of course, but Groq used to be, to certainly TPUs and, you know, Trainium 2 and Trainium 3, and like all these are other platforms. Everybody's getting their own chip, but the open source landscape is dominated by Nvidia, so they have a very strong vested interest in pushing that. The other thing too is like kind of from an intuition standpoint, like why RL over supervised fine tuning? This is something that, like, has been known in the space for a long time. But it's worth saying explicitly in this context because this is really a clear test of it. When you look at supervised fine tuning, you are training a model to imitate, right? It's trying to minimize the difference between its output and some kind of like target answer. And so if you move then from a math data set to a coding data set, the model is forced to adapt to new token patterns, right? That can overwrite the patterns that it learned from math. And that's where catastrophic forgetting happens. Whereas when you look at cascade RL or like reinforcement learning from verifiable rewards, specifically, basically the model is not just mimicking tokens. That's not what the loss function is doing. It's exploring its own reasoning paths basically to reach a verifiable goal like a correct math answer or some kind of executable code. And so what that does is it reinforces the underlying reasoning capabilities instead of just these kind of surface-level patterns. Like what the model sounds like, all of that is ditched. And so you get more durable skills out of it.
There's a place for each of these, of course, but this is one of the really important reasons that they're flagging here for the fact that you don't get catastrophic forgetting from RL in the same way.
A
Onto more of a paper, coming from DeepSeek, titled mHC: Manifold-Constrained Hyper-Connections. So this one actually got a decent amount of play on Twitter, despite being quite mathematical. And so the gist of it, the high level, is there's this notion of residual connections, which is very standard in neural networks. Basically you don't just go layer by layer. You pass forward some information from an earlier layer to a later layer without processing it through these intermediate layers, or in addition to that. And that turns out to help a lot with training. Now there was this notion of hyper-connections, which are essentially fancier residual streams. They do some computation within that connection to improve its benefit. But that turned out to then make training a bit trickier. It introduced some training instability. So the manifold constraining here is DeepSeek suggesting a way to do this hyper-connections trick, which is pretty new, also from 2024, but while preserving the kind of ease of training. And specifically, I'm quoting from the paper, mHC utilizes the Sinkhorn-Knopp algorithm to entropically project H_res onto the Birkhoff polytope, which I have no idea what that means, but I assume it means something if you know what the Birkhoff polytope is.
B
Come on, Andrey, I thought you were a Stanford guy.
A
No, I'm one of these people who just put together neural nets, you know, wrote the code. I never learned the math at the advanced level that this is, but it seems to yield some pretty significant improvements in terms of large scale training.
B
Yeah, yeah, it's actually conceptually pretty fascinating. The whole idea with residual layers, right. We talk about these a lot in Transformers. You basically take the output of the previous layer and as you pass it forward to the next layer through the residual, what you're going to do is you have some input to the current layer, right. Call it x_input. What you're going to do is you're going to chew on that input, right? Using your layer to produce some output. And then the thing you're actually going to pass on to the next layer in the residual is not just the output that was the chewed up version of the initial input, but the output plus the initial input. In other words, you're going to try to kind of give the initial input a little bit more influence. Like it's not just that you're going to take an input, spit out an output. You're going to take an input, spit out an output, but then add it to the input again so that the input gets to be represented again in what you pass on down the line. And in a way this kind of creates a sort of momentum in favor of preserving the information from previous layers. Over time, eventually, yes, if you make the transformer too deep, you will kind of forget and lose that information. But this is meant to kind of give you a way of preserving that information flow so that the information content of previous layers is propagated forward. It helps with stability, helps with all kinds of things. Now the challenge is exactly what I just said. If you make this model too deep or, yeah, I mean in many cases you can find that if the output of a particular layer is just too, let's say loud, it will just kind of take over and overwrite and functionally kind of erase, wash out the information from the previous layers.
And so the solution that a lot of people have been using and they kind of came up with is, okay, well why don't we create instead of having one think of it as like, you know, like a notebook or something where you. Every layer just kind of adds findings to the notebook to a page, it doesn't erase the old text. We're actually going to keep it there. We're going to just add a few more notes to the margins and pass it along. Well, what's going to happen here is we're going to say, okay, this seems to cause us to forget that initial text too often. So what we're going to do is we're going to use many different notebooks. Basically, we're going to have a bunch of different residual lanes and use each one to store a little bit of different information. So maybe like lane one of the residual stream is reserved primarily for the raw embedding that you initially got. So you're preserving that information all the way down, keeping it clean, to make sure it's available for any future layer to look at. So this is, you know, the pure raw initial input embedding. Whereas maybe the other lanes are more like scratch pads where, you know, different layers can dump in their outputs without muddying the original pristine signal. Now, the problem this creates is at some point, you are going to have to mix all of those lanes back together to get one thing, what one output that you can pass on to the next layer. That mixing has to be done with a matrix, because, I mean, that's what you need to mix vectors like this. And the problem with doing that is that if the numbers in that matrix are even slightly larger than one on average, then the signal is going to get amplified at every layer, basically because you're reusing this, this matrix to multiply, multiply, multiply, and you get this explosion. And likewise, if the numbers are slightly smaller than one, the signal will kind of fade away. And so this is what the paper deals with. It's this MHC solution. 
What they're doing is they're taking this matrix that mixes these many different lanes in the residual stream, and they're forcing it to be doubly stochastic. All this means is every row has to sum to one and every column has to sum to one. This guarantees that the total, mathematically, it's called the total energy of the signal across all the lanes is constant. But basically it just means that you're not going to gradually compound and blow up the information that's being sent down the line or make it disappear. And this whole Sinkhorn-Knopp algorithm is just the way that they efficiently make this matrix doubly stochastic. The details really don't matter. They are interesting, but they don't really matter. The bottom line is this is again DeepSeek going deep into a very narrow technical mathematical thing that really matters for implementation, but that most people don't care about. And then it's kind of solving this very fundamental and interesting problem in a fundamental and interesting way. And so this is what they've got. They even wrote a custom kernel, by the way, to optimize the crap out of the Sinkhorn-Knopp algorithm in this context, which is again classic DeepSeek, doing the hardware-aware thing.
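For anyone who wants to see the "rows sum to one, columns sum to one" idea in action, here is a minimal pure-Python sketch of Sinkhorn-Knopp style normalization. It is an illustration under assumptions, not DeepSeek's implementation: exponentiating the raw values to make them positive, the iteration count, and the helper names `sinkhorn_knopp` and `mix_lanes` are all choices made here for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn_knopp(raw: Matrix, n_iters: int = 100) -> Matrix:
    """Alternately normalize rows and columns of a positive square matrix."""
    n = len(raw)
    # Exponentiate so every entry is strictly positive (an assumption here).
    m = [[math.exp(v) for v in row] for row in raw]
    for _ in range(n_iters):
        m = [[v / sum(row) for v in row] for row in m]           # rows sum to 1
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)]           # cols sum to 1
             for i in range(n)]
    return m


def mix_lanes(mixing: Matrix, lanes: Matrix) -> Matrix:
    """Mix n residual lanes (each a feature vector) with an n x n matrix."""
    n, d = len(lanes), len(lanes[0])
    return [[sum(mixing[i][k] * lanes[k][t] for k in range(n)) for t in range(d)]
            for i in range(n)]


raw_mixing = [[0.5, -1.0, 2.0],
              [1.5, 0.3, -0.7],
              [-0.2, 0.8, 0.1]]
mixing = sinkhorn_knopp(raw_mixing)

lanes = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]  # three lanes, two features each
mixed = mix_lanes(mixing, lanes)

# Because each column of the mixing matrix sums to one, the per-feature total
# across lanes is the same before and after mixing (up to rounding).
print([sum(lane[t] for lane in lanes) for t in range(2)])
print([sum(lane[t] for lane in mixed) for t in range(2)])
```

Applying a matrix like this over and over neither inflates nor drains the per-feature totals, which is the intuition behind the stability claim; the paper's exact formulation and its custom kernel go well beyond this toy.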
A
Back to a model. We've got a technical report for IQuest-Coder-V1, which, as it sounds like, is a model that is specialized for coding. Actually a family of models at different scales: 7 billion, 14 billion, 40 billion and 40 billion-Loop. And the gist of what they did to get a very nice coding model that is competitive or roughly similar to a lot of the other good coding models, Sonnet 4, Kimi K2, et cetera: they have a fairly complex training pipeline. So they have pre-training of course, focusing just on training on code with the standard transformer model. Then they have mid-training where they start to get a little bit more agentic and address different tasks. Then they've got post-training for thinking and for instruction following. And they essentially, in this report, just detail how they set this whole training regime up, what the data mix is, et cetera, and wind up being able to get a model that is fairly competitive despite being smaller, presumably, than Sonnet 4.5 and GPT-5.1. Yeah, seems quite good.
B
Yeah. And there's this funny kind of weird thing that they're doing with this 40B loop variant I actually think is really interesting. They have kind of this multi pass workflow. So they, they feed the same input through the same weights twice, but they're trying to do something pretty different each time. So you know, the first time they feed the input in the tokens get processed through all the layers. And as that happens, the model is populating what's called a global key value cache. It's not actually going to try to decode an answer at this stage or all it's doing is kind of building a latent representation of the big picture of what is meant in this input. And in this sense it kind of reminds me a little bit of like an encoder decoder architecture. So you have this sort of encoder phase where the first passes. Let's just get the big picture of what's going on here. The output of that is a set of hidden states, as you'd expect. And then also the KV cache, it's kind of global KV cache which is again at every layer there's a global KV cache that represents that layer's understanding of the global meaning of that of that piece of the picture. And then in phase two, the tokens get passed through the same weights, the same model for a second time. And then what happens is the model looks back at the KV values, the sort of global context for each layer that was stored from the first pass. And this happens layer by layer. And at the same time during the second pass the model is using local attention to just do causal attention on the tokens that are being generated in the second pass. So you sort of have this first pass that's focused on populating this global cache and then the second pass that's more kind of local correlations. 
And then they have a learned gating mechanism that does kind of a fusion that fuses those two streams, and it decides for every token basically how much to rely on the global understanding that it got from the first pass versus the local, more logic-oriented refinement from the second pass. It's kind of interesting, it reminds me a lot of these attempts to again solve for memory. Really it's an attempt to fix the memory problem. There are a lot of people who are trying different things for this, but this is something that I hadn't seen before. So kind of cool. And the results are pretty good. So they have, you know, LiveCodeBench, the 40B Loop model thinking version got a score of 81.1, which is the highest recorded for that benchmark, at least in the report. SWE-bench Verified, 47.2%, decent for a 40 billion parameter model. So anyway, it's kind of interesting. Again just another stab at this whole memory thing. Let's have the model kind of process the thing twice. I haven't seen this before. Again, I wonder if this will pop back later.
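The per-token gate described above can be pictured with a tiny sketch. This is a generic sigmoid gate, offered as an assumption about the general shape of such a mechanism rather than the report's actual design: computing the gate from the local vector, and the names `fuse_token`, `gate_weights` and `gate_bias`, are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fuse_token(global_vec: List[float], local_vec: List[float],
               gate_weights: List[float], gate_bias: float
               ) -> Tuple[List[float], float]:
    """Blend the global-pass and local-pass outputs for one token.

    The gate g lies in (0, 1): near 1 leans on the global context from the
    first pass, near 0 leans on the local refinement from the second pass.
    """
    score = sum(w * h for w, h in zip(gate_weights, local_vec)) + gate_bias
    g = sigmoid(score)
    fused = [g * a + (1.0 - g) * b for a, b in zip(global_vec, local_vec)]
    return fused, g


# With zero weights and zero bias the gate is exactly 0.5: an even blend.
fused, g = fuse_token([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], 0.0)
print(g, fused)  # 0.5 [2.0, 2.0]
```

In a trained model the weights and bias would be learned, so the blend can differ from token to token.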
A
Yeah, fun fact. They note that this approach is pretty heavily based on this other paper from 2025, a pretty new paper by the ByteDance Seed team, and hyper-connections, also from the ByteDance Seed team. So a lot of these research insights are now making it into more models, and it seems like we are really at the optimization phase where maybe the data stuff is a bit sorted out and maybe the training regimes and stability and so on are pretty worked out, and now we can get into a lot more of the nitty gritty. And the loop transformer goes back to 2019 with Universal Transformer, like back when research was research. So yeah, as we know, it's back to the era of research. All right, and one more open source model. We've got the TII Abu Dhabi team releasing Falcon-H1R 7B, a new reasoning model that apparently outperforms others at math and coding with only 7 billion parameters and a 256k context window. So this fancy naming is to indicate that this is a hybrid transformer and Mamba 2 architecture. Mamba 2, just to quickly reiterate, is an alternative to transformers that has recurrence within it, kind of looping over the input, which allows it to handle much longer sequences without increasing computation. So here they're using that to handle large context windows. And via another training process, a two-stage process, SFT on long reasoning traces and reinforcement learning, similar to what we've seen before, it is able to achieve fairly good benchmark scores. And yeah, we got another Falcon model, which used to be a big deal in open source early on but has fallen off, and now we're seeing I think more and more of these indications of what you can achieve with hybrid models, and it seems to be like a promising direction in general.
B
Yeah, like one story or one take home from this too is TII, the Technology Innovation Institute, I think, if I remember, that's what it stands for, is back. Right? I mean famously the Falcon 180, or no, I think that was the disappointing one. But there was one before that was really impressive, put the UAE, among other things, on the map in a big way in the space. So this is a genuinely interesting and impressive model. It does beat Qwen3 32B and Microsoft's Phi-4 14B in a bunch of reasoning tasks. And it only has 7 billion parameters. So this is a legitimate achievement. Also the Mamba 2 hybrid thing, I feel like we're seeing this more and more. The transformer Mamba hybrid. Usually the way these hybrid architectures work, or historically the way they'd worked, was you would kind of stagger. You do, you know, a transformer layer with standard attention and all that stuff. And then you would pass it on to a Mamba layer and then a transformer layer, kind of alternate that way. What's happening here is not that. So they're actually in parallel using some Mamba heads and some attention heads to produce the output at each layer, which is an interesting change. It's definitely not something that traditionally had been done. And they call this the parallel hybrid layer. Some of these heads side by side. And so when a token hits one of these layers, it splits: part of the information goes through the Mamba 2 head, and that will handle the longer term sequence history. Right. We talked about Mamba and Mamba 2 before. But this idea that you kind of have this vector that's going to store context over time as you read the text, much like a recurrent architecture, like an RNN, basically this vector that you're going to keep dumping more context into, almost like a kind of pseudo scratch pad as it reads. And then you got the other part, that's the attention head that handles these more complex and precise relationships.
Because the Mamba vector, this memory vector, is only so big, you're going to kind of have more fuzzy recall, more lossy memory. So when you go to handle complex, precise relationships, you really do want the attention head. And there's an MLP, just a multi-layer perceptron, standard feed-forward network, that then merges their outputs together. There's a bunch of differences between Mamba 2 and Mamba that hopefully we have a podcast episode explaining. But there's like an efficiency kind of optimization thing happening there that's very important where, just super quickly, when you do matrix multiplication, it can help to take really big matrices and break them down into smaller matrices to do what's called block matrix multiplication. The original Mamba did not allow you to do that. The new one does, which is great because those smaller matrices can then fit on sort of more constrained hardware. And so you can use your SRAM, your cache, your VRAM more efficiently across the board by doing that. So, yeah, this is basically the big picture, but the results speak for themselves. And yet again we're seeing this Mamba transformer merger that I feel like this is the second or third time this year that we've covered.
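Block matrix multiplication, mentioned above, is easy to show on its own. The sketch below is the generic tiling idea only, not Mamba 2's actual algorithm or kernel; the `tile` size and function names are arbitrary choices for the example. Each small tile of work touches only a small slice of each matrix, which is what lets real implementations keep the working set in fast on-chip memory.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix multiplication, used as a reference."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def blocked_matmul(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Multiply by iterating over tile x tile blocks of the operands."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply one block of a by one block of b and accumulate.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] += acc
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0, 2], [0, 1, 3], [4, 5, 6], [7, 8, 9]]
print(blocked_matmul(a, b) == matmul(a, b))  # True
```

The result is identical to the plain version; only the order in which the partial products are accumulated changes.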
A
Right? Yeah. We saw previously the same team released Falcon H1 back in mid-2025. That's where they introduced the first version of this hybrid and really went into this way of doing the hybrid. So this one is that plus R, right? Pushing the reasoning frontier, going into essentially the DeepSeek-R1 style of thing where you do reinforcement learning and you have different data mixes and so on. Similarly, there's also a pretty detailed technical report, like 20 pages, going as deep as giving you the specific hyperparameters of training, data mixes, et cetera. So it's very interesting to be in the AI space where like there really aren't many secrets. Like if you had $50 billion, you probably could train a pretty good model at this point.
B
That's an interesting question, actually. What are the secrets that what is Anthropic's moat made of? For example, what is OpenAI's moat made of? I think it's actually a lot of taste. A lot of it is like at the level of. Yes, the alignment strategies and all that. That's a big difference. There is some stuff going on with hardware optimizations and software optimizations that is below the public line. But you're right, I mean, if you wanted to stand up like a reasonably competitive model, let's say it's not going to be Claude code, it's not going to be something that makes people jump out of their seats, but it's going to be like, you know, within 12 months of the frontier or so. You could like. Yeah, you're right.
A
Yeah. I think in the frontier labs there are people who know things that are, like, hard to even write down. You just have this weird combination of knowledge that you're not going to get any other way. But the algorithms, the general types of data, the general model architectures, with Qwen, with the Chinese open source models, and also releases like this from Nvidia and TII, we're getting a lot more visibility into what works and what doesn't, which actually is different from, I guess it started with Llama back in the day. Now it's a different world. Onto research and advancements. We begin with Deep Delta Learning, from just a couple of researchers at two different universities. The gist once again of what they're doing: they're introducing this delta operator, which is a fancy math thing that you can do to the residual stream, sort of similar to hyper-connections, which we discussed earlier. That is enabling you to essentially do a similar thing to hyper-connections, of getting more powerful residual connections that you can introduce into your network architecture and then get stronger models effectively.
B
Yeah, this is one of those papers. So it's a theory paper first of all, so everybody don't get too excited, but it's a rare paper that doesn't have really that many experimental results. The claim here is we have a more efficient way of designing the flow of data through these residual layers. And there's a world where this actually is a paper that we turn back to, you know, years from now and go like, oh, damn, that was really important. So it's worth kind of sketching out how this works. You know, in a standard residual network, you've got this whole thing that we talked about where the input to layer one gets added to the output of layer one, and that sum is what gets passed on to the next layer, right? So instead of just feeding the input to layer one, getting an output, and feeding that output down the line, you generate your output, then you add the input to it again, and you pass that down. The goal here is to mitigate vanishing gradients. That's the key problem. If you didn't do that, you would find that the information from earlier layers just doesn't make it through to later ones. And this caps the depth of the networks and introduces all kinds of problems in terms of performance. So the problem with this is there's actually a pretty strong bias that you're creating when you do that. You can only add new information, mathematically, to the information in the previous layer. So again, the input to the next layer is the input to the previous layer plus the output of the previous layer. All you're doing is adding. And that creates this very fixed mathematical relationship that limits the network's flexibility, because it can't easily flip information in the path. All it can do is keep adding and adding and adding. It can't discard stuff.
And so what this approach, Deep Delta Learning, allows you to do is more generally either sort of flip the direction of certain features or, geometrically, reflect them across a certain axis. The details get pretty involved. I'm just going to flag two different mathematical entities that are relevant here. The first is the model will learn a matrix that lets you calculate a direction along which it can basically fuck with the residual to deviate from just the identity transformation. So instead of just taking the pure residual, the input, and handing it down the line, what we're saying here is, well, let's see if we can pick a direction that we can learn, where we're going to start to fuck with this residual along this vector, along this direction, in a way that improves performance on the loss function. At the same time, they're also going to have a learnable variable, a scalar basically. So a single number, not a vector, that's going to determine the nature of the transformation that's done along that direction we just talked about. So this is a parameter called beta. If beta is zero, you get back the identity transformation. So it's the same old thing. You will have a direction that you want to fuck with that input on, but the magnitude of that change is going to be zero. If beta is equal to one, then you erase information along that direction. If beta is 2, then you actually flip information across that hyperplane, so you're adding negative information. And so all of this is to say you now have a more nuanced way of chewing on your data than just continually adding the residual each time. And there's a theoretical case that says this actually has been a limiting factor for the way transformers chew on their data.
And so if you believe that neural networks need to be able to delete bad features or flip their internal logic to reach higher levels of intelligence, then this Deep Delta Learning matters because it provides the first mathematically clean way to do that while keeping the training stable. Because that's the other challenge that people have had. This actually works in that respect. And so it is a theory paper, it is a white paper, but it does unify these three components that were previously thought separate, you know, gating, attention and residual learning, in a way that's coherent and that seems to work at least pretty well. And they found an excuse to write the term delta operator in a context that isn't special forces. So that's kind of cool too. That's probably most of what they were going after in that paper.
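For readers following along in text, here is a minimal sketch of the beta behaviour described above. It assumes the operator takes the rank-one form I - beta * k k^T for a unit direction k; the function name and the example vectors are illustrative and not taken from the paper, and the learned, input-dependent choice of k and beta is left out.

```python
import numpy as np

def delta_operator(x, k, beta):
    """Apply (I - beta * k k^T) to x, after normalising k to unit length."""
    k = k / np.linalg.norm(k)
    # Subtract beta times the component of x that lies along k.
    return x - beta * k * np.dot(k, x)

x = np.array([3.0, 4.0])   # toy residual vector (assumed)
k = np.array([1.0, 0.0])   # toy direction (assumed)

identity = delta_operator(x, k, 0.0)    # beta = 0: x is passed through unchanged
erased = delta_operator(x, k, 1.0)      # beta = 1: the component along k is removed
reflected = delta_operator(x, k, 2.0)   # beta = 2: the component along k changes sign
```

With these toy values the component along k goes from 3 to 0 to -3 as beta moves from 0 to 1 to 2, while the orthogonal component stays at 4.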
A
That's right. And I think we've covered quite a bit. We've covered some of these big training runs and open source models. These are things that are from universities. And just to call this out, of course, research is built on, you know, a lot of past work. This is building on a lot of analysis of this residual stream topic. So prior to this, the delta rule was incorporated into residual streams and was shown to be effective. So there's a bit of empirical results prior to this paper that it is building on. The residual stream stuff in general goes back to 2016 with ResNets. And that was a big deal when it was introduced and allowed for very, very deep networks, which was not possible prior to that. So all this stuff, suffice it to say, none of these papers lives on its own. It's all going back to a rich history of prior investigations. Next up, we've got Recursive Language Models, coming from MIT. The gist of the paper is that to be able to scale to very big tasks and basically process arbitrarily long prompts, they treat the prompt as a thing you can read bit by bit, or whenever you need to, you can look up information from the prompt. The way they put it is they treat the prompt as part of an external environment that the LLM is able to then programmatically examine, decompose and recursively call itself over. So the recursive part here is it's able to say, okay, here's one bit of stuff I need to do. I need to look at, like, chapter one of this book and see what's going on there, or extract the main characters. Let me have a language model look at this bit of the prompt and do this task and then give me some information. And this is then possible to do arbitrarily deep. So basically a prompt, it's kind of, in my opinion, a bit of a weird terminology question, where instead of saying it's a prompt, you could say this:
the book is like a text file that is in the environment. But in a sense it's very similar to what Claude Code, for instance, would do. If you have a text file or documentation, guidelines, et cetera, and you give it a task, it is able, at inference time, when it makes sense or when it is needed, to look into this reference guide or book or whatever for whatever bit of it is relevant. And they show that with this kind of approach, you're able to handle these very long prompts with, you know, book-length inputs and so on. And it is quite capable.
B
It took me a little while, I don't know if it's the way the paper was written, to fully get it. So there's an example I was sort of iterating on with Gemini. So basically imagine you have a big book and you say, list every character who ever held a silver object, right? So the idea here is you're like, okay, this is way too much. You know, it's a giant book, maybe a million tokens. It's not going to read the whole thing or fit it all into RAM and do a good job. So instead it will call itself and say, hey, the book is too long to read at once. I'm going to write a script now to split the text, and in that script I'm going to call myself with a certain query for every chapter. That's the recursive step, right? So it's literally saying, number one, book too big; number two, I therefore need to write a script to chunk up the text. And because of what my top-level prompt was, I'm going to have to give myself instructions to pursue certain lines of work deeper. And so now there's a child instance that's called. And maybe that opens up chapter one. Maybe it sees that the chapter is still too long, and it's going to chunk it up even more or whatever. But ultimately it does its own analysis and returns its response, like, okay, in chapter one, Bob held the silver flask, something like that. And so then the root model gets all those responses and does its analysis. Now, this is a lot like, and I think this was before, Andrey, we started recording this podcast together, but back in the day, I remember when GPT-3 was first introduced, a few months after, they started using GPT-3 to summarize entire books. And the way OpenAI did this was they basically had GPT-3 generate a summary of chapter one, a summary of chapter two, blah, blah, blah. And then they had GPT-3 write a summary of the summaries.
As you can imagine, this had all kinds of problems, not least of which was the fact that it completely misses long-context interactions between pieces of information that are maybe too niche to be worth surfacing when you read chapter one the first time, but that actually do connect. So think, for example, of a detective novel, where an offhand mention of, you know, which person prefers to use their right hand or their left hand could be totally ignored by the model that's generating a chapter summary, when in fact that niche fact ends up being pivotal to the overall plot. By the end, you completely miss it. And so what this is fixing, among other things, is that, because you essentially have this dynamic peeking function, this programmable search, where the model, you know, in a detective case, the base model starts by getting tasked with, hey, find the murderer. Instead of just tasking sub-instances to summarize chapter one, it could write a script to find, you know, every mention of this or every mention of that, get a summary, and then based on that summary, iterate and refine its prompts to keep going. And so this is partly a use of test-time scaling. That's one part of the story. But you're also just fully offloading the responsibility for designing the search architecture itself to the model, which is really interesting. And they have all kinds of test-time scaling results that are really positive here, and curves that don't bend. So, you know, anyway, do with that what you will, but I thought it was quite an interesting conceptual experiment.
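The chunk-and-recurse pattern from the silver-object example can be sketched in a few lines. This is only an illustration under loud assumptions: `toy_model` is a hypothetical stand-in that does a keyword match instead of calling a language model, the splitting rule is hard-coded rather than written by the model itself, and the tiny "book" is made up.

```python
import re

def toy_model(query, text):
    """Hypothetical stand-in for an LLM call: return sentences mentioning the query."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if query.lower() in s.lower()]

def recursive_query(query, text, max_chars=80):
    """Answer directly if the text fits the budget, otherwise split and recurse."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(text) <= max_chars or len(sentences) <= 1:
        return toy_model(query, text)
    # Split on a sentence boundary so no sentence is cut in half.
    mid = len(sentences) // 2
    left = recursive_query(query, " ".join(sentences[:mid]), max_chars)
    right = recursive_query(query, " ".join(sentences[mid:]), max_chars)
    # The parent call merges what the child calls found.
    return left + right

book = ("Bob held the silver flask. Alice walked to the river. "
        "The rain fell all night. Carol polished a silver spoon. "
        "Dan read a letter.")
hits = recursive_query("silver", book)
```

In the setup described in the episode, the interesting part is that the model chooses the splitting and the sub-queries itself; here both are fixed so the control flow is easy to see.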
A
Yeah, I think it effectively formalizes the notion of sub-agents in some sense. Where, yes, you know, this is what already happens with Claude Code and other agentic tools. They can tell themselves, in some sense, to go off and do this thing. It also is kind of funny that there was all this debate about neuro-symbolic versus connectionist back in the day, and now we are just in neuro-symbolic land with tool calls and all this technically symbolic stuff, but nobody cares. It's just writing some code. A couple more papers to get through. We've got Conditional Memory via Scalable Lookup, also from DeepSeek. And the general topic is, when you're training a language model, there are in a sense two distinct things. One of them is memory, just knowledge. And another is reasoning, thinking, whatever you want to call it, acting on that knowledge. And so there's been for quite a while this notion of, well, if you have some sort of memory, maybe it's better to have external memory in some way that the neural net itself can just do lookups on, instead of having it be embedded in the actual neural net weights, where you in some sense want to encode the intelligence. You'd have computation and memory split up, which these days we don't have in transformers, don't have in large language models in general. And so in this paper they introduce one way to do this. They call it Engram. It's building on the notion of n-gram embeddings from back in 2017 as a way to instantiate this. So you create these trained modules that allow you to do this sort of lookup operation, and then you insert them into your overall architecture. So you wind up having your input, then you have some transformer block, then you have an Engram block, within which you're basically interacting with this memory, which can be, you know, various sizes, have more or less knowledge, and then you move on to attention and mixture of experts and so on.
And this would be expected to make it possible to scale to more knowledge, more information, without necessarily scaling the intelligence bit of the model. They go into a lot of detail on how much information you can store here, with as big as 27 billion parameters for this Engram thing.
B
Yeah, and it is also worth noting what n-grams classically did. Right. So N here is an integer, it's a number, right. So unigrams, real quick.
A
It's kind of annoying. So there's n-gram, which is N-gram with the letter N. And in the paper they introduce Engram, with an E, as a very fake, let's say, n-gram. Engram, I guess, to be clear.
B
Exactly. So this is what they're calling back to. So this idea of the, God, I don't even know how to say it, the letter-N n-gram, the old one, the original one, you would have unigrams, bigrams, trigrams. And the idea here was a unigram was a single word, a bigram was a two-word sequence. Right? So, you know, a unigram might be machine, a bigram might be machine learning. A trigram is a three-word sequence, like deep neural network or something. Right. So basically these are chunks of words that together mean something. Right. So Alexander on its own means something, but Alexander the Great means something very different. Right. Very specific. And so the idea here is going to be, well, it's challenging for these models sometimes to learn trigrams. They start by learning unigrams, then bigrams, then trigrams, typically, if you look at the way the training unfolds. But instead of the model having to recalculate what Alexander the Great means every single time in its layers, which is typically what happens in practice, they're trying to make it possible for the model to see the words, look them up in this massive table, like you said, this reference table, and just pull out the answer. And there's a bunch of context. I guess since this is a lightning round story, I'm not going to go into the context, but they have an interesting way of doing this kind of lookup. They're not looking at traditional n-grams, because these things are learnable. They actually do evolve over time. They get a sort of random initialization at first. They do suck at first and then they get better very quickly. Maybe I'll just pause there. There's a bunch of interesting stuff though, if you like this. Multi-head hashing is a good keyword to look up to get a sense for what's special about this paper.
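To make the "look it up in a table" idea concrete, here is a rough sketch of a hashed n-gram lookup with more than one hash head. Everything in it is an assumption made for illustration: the table size, the number of heads, the multipliers, the polynomial hash and the token ids are invented, and this is not DeepSeek's implementation. It only shows how an n-gram of token ids can index learnable rows without storing every possible n-gram.

```python
import numpy as np

NUM_HEADS = 2        # number of independent hash functions (assumed)
TABLE_SIZE = 1009    # rows per table, kept tiny for illustration (assumed)
DIM = 4              # embedding width per head (assumed)
MULTIPLIERS = [31, 53]  # one hash multiplier per head (assumed)

rng = np.random.default_rng(0)
# Randomly initialised tables; during training these rows would be learned.
tables = rng.normal(size=(NUM_HEADS, TABLE_SIZE, DIM))

def ngram_hash(token_ids, multiplier):
    """Simple polynomial hash of a tuple of token ids into a table row index."""
    h = 0
    for t in token_ids:
        h = (h * multiplier + t + 1) % TABLE_SIZE
    return h

def lookup(token_ids):
    """Fetch one row per head for the n-gram and concatenate them."""
    rows = [tables[i, ngram_hash(token_ids, MULTIPLIERS[i])] for i in range(NUM_HEADS)]
    return np.concatenate(rows)

alexander_the_great = (17, 4, 256)  # made-up token ids for a trigram
vec = lookup(alexander_the_great)
```

Using several heads means two different n-grams that collide in one table are unlikely to collide in all of them, which is the usual motivation for hashing with multiple heads.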
A
Yeah, DeepSeek once again, a 20-page paper. A bunch of follow-up information, goes pretty deep. But the general theme here is we are introducing a lot of new interesting ideas that augment the basic transformer and seem like they might actually be a future component of the way we do neural nets, where for a while we've been doing transformers the same way, more or less. Then there was mixture of experts, and now there are training regimes. And it looks like potentially some of these hybrid ideas, memory ideas, could play a part in the future. And one last paper: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings. So the thing with inputs to transformers is that transformers don't have a notion of position in an input by default, because of what you do with the attention mechanism. So we attach these positional embeddings that tell you this is the first token, this is the second token, et cetera. And this goes back all the way to the original transformer. And this paper shows that you can keep the positional embedding early on, and that helps you train. So you need this positional embedding, which in this case was RoPE, the standard way to do positional embeddings. They show that after you train for a while you can actually drop the positional information, and it turns out to work just fine, and then you wind up being able to extend the context of pre-trained LLMs. So kind of an interesting paper. Honestly, I haven't had the time to read into it that much, but it sounds counterintuitive.
B
Yeah, it is. Well, it's counterintuitive and then it's not, once you know how the magic trick is done. And I find that's always how these things are. Or at least except for the papers that go, this works and we have no idea why, which does happen. Yeah. So like you said, the standard approach to allow the model to understand the relative positions of tokens in the input is to do this little trick. So imagine your word embedding, which is basically just a vector, a list of numbers that represents that word or that token. Imagine that it's a two-dimensional vector, a vector in a 2D plane. So the original way of doing this using RoPE was to take that vector and just rotate it by an angle, call that angle theta. And so if the word is at position one, you're going to rotate it by theta. If the word's at position two, you're going to rotate, excuse me, the corresponding vector by 2 theta. And if it's at position 3, by 3 theta, and so on. And so what you're doing is you're basically inducing this repeatable pattern of messing with that vector in a predictable way as a function of its location.
A
We do this weird rotation stuff because just putting in a number like 1, 2, 3, 4 doesn't work well. The numbers get big and small and that messes with the neural net.
B
That's actually a really important clarification. People have tried that. Exactly. Can I just glue a couple of numbers to the end of the vector, its location? And this causes exactly these problems with stability. Now in practice too, an LLM embedding is not just going to be two-dimensional, right? The vectors that represent these words or tokens can have thousands of dimensions. And so RoPE actually breaks all those dimensions down into pairs. And then each pair basically has a separate two-dimensional plane with its own kind of clock hand. You can think of it rotating at a different speed. Some of these pairs of dimensions in the embedding vector get rotated really fast as you increment the position of that token in the input. And then some of them rotate much more slowly. And typically the fast clocks, these fast-rotating pairs, help the model tell apart words that are right next to each other. So these are short-range relationships. And then you've got the ones that rotate more slowly. They help the model keep track of the broader structure over hundreds of tokens, these long-range relationships. And by the way, this is only done with the keys and the queries. Basically the keys are how a token says, hey, here's the information that I contain and the things I can help with. And then the queries are, hey, here's the information I'm looking for. And the product of the keys and queries is how the attention mechanism works. Anyway, this approach basically says, you know what, we thought that these clocks, these pairs of dimensions in the input, were really important for the model to function. But it turns out that the model's internal layers can actually learn to tell time on their own. Basically they can learn to tell the location of this input on their own.
Once they learn the vibe of a sequence, you can take the clocks away and the model actually becomes even better at handling massive amounts of data. So essentially, like you said, you strip out all of this rotational mumbo jumbo after you've finished pre-training. You give it some extra fine-tuning, for sure, but it very quickly learns how to understand positioning just based on context. And the really important thing here is that RoPE requires you to train a model in a way that is very sensitive to the size of the context window you trained on. So if you train the model with these rotational embeddings on 4,000-token contexts and then you suddenly give it 8,000 tokens of context, it's never learned these clocks for the last half, basically the last 4,000 tokens. It's never learned to use those clocks. And so now all of a sudden you're going to cause the model to basically start misfiring. And so the advantage of removing those positional embeddings is that the model can learn to recalibrate, and then it's good for much, much larger sequences. One of the main reasons that models struggle with long sequences is taken away, and the model can suddenly just generalize really well and move, in one instance they show, from a 2,000-token context to 32,000 without ever being trained on those longer context lengths. All you do is give it a little extra training to work without the embeddings, and it's off to the races. So really, I think, quite an interesting paper. This idea of expanding context window lengths has been such a challenge. I think you obviously still need memory, and that remains a problem. Right? None of this solves that. But this does solve at least the positional embedding aspect of it, which is kind of interesting. We're learning an awful lot about what goes into these high-token or high-context failure modes.
It's not just the memory, it's all these little things, including how you train models to do with, or do without, positional embeddings.
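The "clock hands" picture can be written down directly. Below is a minimal sketch of rotary position embeddings on a single vector, assuming the common convention of pairing consecutive dimensions and using a base of 10000 for the frequencies; the function name and the random test vectors are illustrative. The point it demonstrates is that after rotation, the query-key dot product depends only on the distance between the two positions.

```python
import numpy as np

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by position times that pair's frequency."""
    d = vec.shape[0]
    # One frequency per pair: early pairs rotate fast, later pairs rotate slowly.
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x * cos - y * sin
    out[1::2] = x * sin + y * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2 tokens apart) at two very different absolute positions.
score_near = rope(q, 3) @ rope(k, 1)
score_shifted = rope(q, 103) @ rope(k, 101)
```

The two scores agree up to floating-point error, which is why RoPE is described as encoding relative position. It also hints at the length issue raised above: positions beyond the training window produce rotation angles the model never saw during training.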
A
And by the way, apparently this idea that transformers can just learn positional information, that's not new. It goes back to December of 2022. What this paper shows is that this previous approach, where you just don't bother with positional information, doesn't train as well. And it turns out that this other approach of starting with it to sort of help out the LLM and then dropping it later works better. And yeah, that was kind of an interesting trick. On to policy and safety. And we've got just one more paper to discuss. This one is from Anthropic: Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks. So constitutional classifiers are the thing that Anthropic introduced for dealing with alignment issues. If you ask an LLM to do something harmful, you have something in the system that is able to classify whether it's within your constitution of what you're supposed to or not supposed to do. We know there are some jailbreaks that get around these systems, so they introduce two in particular: reconstruction attacks, which split up the input so that it doesn't look harmful or bad, but then if you combine those bits, it winds up actually being bad; and obfuscation attacks, which I guess we talked about a little while ago, where if you ask it to output something as a poem it will do it, because it doesn't seem like the recipe for meth if it's a poem. And so they discuss how to defend against these jailbreaks. And the gist is to make it a little more robust. So instead of input-only and output-only classifiers, with just a single exchange of input and output, they do evaluation over the entire context of the input. That leads to the system being more expensive. And a bunch of this paper is just talking about how you can get it to be affordable while still doing this more robust check via some tricks. And presumably Anthropic has this in production now.
B
Yeah, that's right. And actually to your point, the tricks there are, as ever, the key thing. They use this two-stage classifier cascade. And so the idea here is first you use a lightweight classifier to screen all traffic and you only escalate suspicious exchanges. And at that point you're escalating to a more powerful, expensive second-stage classifier. And what they find in practice is that, when combined with other optimizations that they use, they reduce the computational overhead by a factor of more than five and maintain the same performance as the more expensive model. So their first pass is going to default heavily to false positives. They're going to try to catch as much as they can. It allows them to avoid escalating the vast, vast majority of these prompts, you know, in cases that don't matter. And so overall, the way this works is it's kind of like a funnel. Stage one is where you have this external classifier that screens all the traffic and escalates suspicious cases. They then have a bunch of linear probes that they run on every input alongside that stage-one classifier. And these probes are looking at activations of the model rather than just the input or the output. They're looking at how the activations inside the model are getting triggered, and whether there's anything we can learn from that that suggests there's something suspicious happening here. So that's another way in which they're going one layer deeper. And then finally at stage two, for the escalated cases, the system is going to combine the linear probe ensemble's predictions with the higher-accuracy second-stage external classifier that they run, and that's going to generate the final decision.
So a lot of impressive results. You know, a 40x reduction in computational cost compared to the baseline exchange classifier, and also a significant reduction, though quite a bit lower, compared with the original implementation that they had, which anyway needed 23% more processing power. Very, very low refusal rate, by the way: 0.05% on production traffic. So that's actually representative, at least as of today, of what they see. The original implementation had like a 0.38% refusal rate. So that's, you know, more than seven times higher. So you're going to see a lot less refusal, presumably a lot less inappropriate refusal, from this model. So yeah, pretty impressive. And another sort of pseudo-alignment paper from Anthropic.
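The funnel described above, a cheap screen on all traffic with escalation to a costlier check, can be sketched as follows. Both scoring functions are hypothetical keyword heuristics standing in for real classifiers, the thresholds are invented, and the linear probes on model activations are left out entirely, so this shows only the control flow of a two-stage cascade, not Anthropic's system.

```python
def cheap_score(text):
    """Hypothetical first-stage screen: fraction of words on a small flag list."""
    flagged = {"synthesize", "weapon", "bypass"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def expensive_score(text):
    """Hypothetical second-stage check, standing in for a larger classifier."""
    lowered = text.lower()
    return 1.0 if "weapon" in lowered and "synthesize" in lowered else 0.0

def cascade(text, escalate_at=0.1, block_at=0.5):
    """Return (verdict, stage that made the decision)."""
    # Stage one runs on everything and is tuned to over-flag rather than miss.
    if cheap_score(text) < escalate_at:
        return "allow", 1
    # Only escalated traffic pays for the expensive second stage.
    verdict = "block" if expensive_score(text) >= block_at else "allow"
    return verdict, 2

benign = cascade("what is the weather today")
false_alarm = cascade("how to bypass a paywall")
harmful = cascade("synthesize a weapon now")
```

The middle case is the one the cost savings depend on: stage one flags it, stage two clears it, and most traffic never reaches stage two at all.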
A
And they did test this quite a bunch, including with red teaming. And here's just a fun detail. Red teamers got API access to this defended model. They could submit their jailbreak attempts along with how long it took. And then they offered bounties that scaled with the number of successfully jailbroken queries, with maximum payouts ranging from $25,000 to $35,000 depending on the campaign. So, you know, they give humans quite a bit of motivation to try and get these models to slip up. Next up, just a few more stories, and we're done with all the research. First we've got: Nvidia CEO says purchase orders, not formal declarations, will signal Chinese approval of H200. So Jeremy, you mentioned this earlier. This was in Las Vegas, and the title is the story. The quote is, my expectation is that we're not expecting any press releases or any large declarations, it's just going to be purchase orders.
B
Yeah, and as he says, the customer demand is high, quite high. We fired up our supply chain and H200s are flowing through the line. So, you know, we talked about this earlier in terms of the CoWoS, the packaging, kind of being the rate limiter there. There's also this little drop. I completely missed this. Local media reported last month that Nvidia is in talks to buy Israeli firm AI21 Labs. They are the OG post-GPT-3 LLM replication lab. They were one of the very first, if not the first, certainly the first Western lab, but I think they may have been the first lab anywhere to replicate anything that looked like GPT-3. It was the Jurassic Jumbo, I guess Jurassic-1 Jumbo. Yeah, one of them things. Anyway, Nvidia's buying them, or at least in talks to buy them, which is very interesting, and again, I completely missed that.
A
So there you have it. One more story on China: China AI leaders warn of widening gap with US. So this was a comment particularly by Justin Lin, head of Alibaba. He basically said that over the next five years, he would give a less than 20% probability to Chinese companies leapfrogging OpenAI and Anthropic with fundamental breakthroughs. And this was, I guess, reiterated or somehow also mentioned by others at companies like Zhipu AI, potentially being like, oh, we need a bit more compute. My reading of this is, boy, do we need compute. And maybe don't ban us from getting those Nvidia chips.
B
Yeah, I mean, apparently one of the biggest themes here was this idea of the resource gap, at least that's what they're calling it, presumably in Chinese. Yeah. And so they're saying, hey, look, you've got US firms with huge, huge amounts of compute. We're stretched too thin. And the interesting thing here is they're basically stretched too thin on inference. They're just trying to service all of these requirements, you know, customers coming and trying to use these services. You've got, you know, 1.4 billion people in China. That's a lot of inference that you have to service. And especially given that use of Western tools is more limited there too. So it's all getting funneled there. That leaves you very little for training and R&D. And so this is a real fundamental challenge that they're facing. This rhymes almost exactly with what one of the DeepSeek founders said early on, before DeepSeek was on the radar of the Chinese Communist Party. He was going out and just saying, yeah, the only thing preventing us from beating the Americans or whatever is access to chips. That's really the only thing. And then, you know, DeepSeek dropped their big R1 model. Everybody got excited. The Chinese Communist Party had him come forward and testify in front of, you know, I forget who exactly it was, but anyway, the vice chair or whatever. And basically he was told to shut the fuck up about the fact that Western export controls are absolutely working and crippling Chinese AI efforts. This is yet again. I mean, they can't keep it in their pants with this stuff.
A
They just keep.
B
Keep telling us what our policy should be. Hey, guys, our freaking chips are the thing that we're missing. Like, don't send us chips unless you want us to be able to compete with you. In this context, we have all this H200 chip stuff being sent over. So, you know, there are obviously a lot of considerations here, but at least in terms of what the Chinese labs themselves are telling us our policy should be, at least from that corner of the universe, it seems pretty clear that the chips are a big thing, right? One important thing, and we did talk about this way, way back when the R1 model launched. They were talking about how DeepSeek's R1 model helped narrow the gap temporarily. That's why it was a big deal. But maintaining that pace has been really tough under current hardware constraints. And we talked about that at the time. Specifically, we said, and I will say, I'm sorry, I'm beating the Last Week in AI drum here, this is even before Dario came out with his essay, we said what you're going to see is a massive reconfiguration of the ecosystem in both the United States and China around the idea of inference-time compute, trying to scale up inference-time infrastructure. And as that happens, what you'll find is China will suddenly be capped at a very small inference-time budget for training purposes and other things, whereas we have a lot more chips that can support that in the West. And so although DeepSeek R1 made it seem as though they could keep up, that wasn't going to be a lasting advantage. And this whole narrative at the time that the Chinese were really keen to push was, look, DeepSeek R1 shows that all these chip export controls just don't work. You might as well get rid of them. When in reality the real story was yet to play out. It has played out. We know the export controls were working.
There's a question as to, you know, there's a whole bunch of policy questions that we need to ask ourselves about outcomes in different forms. But certainly with respect to this, it seems pretty clear that's how it's played out. So yeah, I think this is a really interesting export control question, a question for Nvidia, how they prioritize their supply chains. I think the administration has not had its last word on this. I think we'll keep seeing iteration on this stuff as the consequences of different moves play out. But this is a really, really information dense announcement. I would say there's a lot to be learned from this.
A
I stated this previously, but relevant to this notion, we do kind of by default give the viewpoint of the United States with regards to these topics. Right. I'm sure there's a different reading you could do if you're in China or pro-China. And Qwen models are great, and a lot of researchers in China are great. We're not saying that China is bad, but this is from the position of the United States geopolitically, et cetera. Something worth noting. And the next story actually is on that note: Jake Sullivan is furious that Trump removed Biden's AI chip export controls. Jake Sullivan is a former National Security Advisor under Biden who helped put in place the export controls that were there until 2025. And in this interview with The Verge, there's quite a bit of detail on his view of the developments. He basically is quite critical of the Trump administration and goes into why removing those export controls and some other key bits of the Biden policies is overall detrimental to the US, both in terms of competitiveness and in essentially helping China catch up in the AI race. If you want to think of it as an AI race.
B
Everybody's got a take on these things. But you know, one of the most predictable, unfortunate self-owns of the Biden administration on this was bundling a bunch of hyper-partisan language on, like, ESG, environmental, social and governance, and AI ethics and diversity, equity and inclusion into a bunch of their bills that covered AI national security stuff. And so when you think about the repeal, a lot of what's been repealed is stuff that politically Trump basically had no choice but to repeal because of what he ran on. And so the easy, I would say obvious, play that the Democrats should have gone with in the previous administration, and we talked about this at the time, was don't put that language in the same bill that is doing things you consider to be important, if you think they're important. There's literally no reason to make this a mudslinging fight. Some of these moves are obvious policy wins, at least from where I stand. And the moment it's couched in partisan language by either party, it's liable to get repealed the next time the administration turns over. If it's a law, you know, the next time Congress, whatever. This creates a real issue. And so turning down the temperature on the partisanship, I think, is a really, really important aspect of this. It goes for everyone. There's no reason that being a Republican or Democrat should affect whether or not you think friggin chip sales to China should happen. That's a crazy thing. Same with energy infrastructure. Same with all these things. So I'm getting on my nonpartisan soapbox here. Hot take is everybody should just chill out a bit more. But yeah, I mean, I think it's kind of weird that these things have become politicized in the way they have. These are technical.
A
Living in the U.S., I think, honestly, it's money talks with the Trump administration. None of this really matters as much as Nvidia just having Jensen out there talking to Trump and promising to deliver some money. As is true of OpenAI, as is true of all these others. Like, leaders go and, you know, make nice comments and promise big numbers and all this other stuff. Partisanship, et cetera, that's, let's say, secondary, but also a factor.
B
Yeah, it's like everyone's got a take on where these decisions are coming from. It's just sort of funny that there are so many cases where you just see math and then you see stuff overlaid on top of it. It's like, let's put the incendiary language for whatever the other party is in here and just see what happens. It's a tough way to make coherent policy.
A
And that is it for this episode of Last Week in AI. Once again, you can go to lastweekin.ai for even more AI news beyond this. As always, we appreciate you subscribing, sharing the podcast, reviewing. One of these days I'll get around to actually bringing back the reply-to-comments segment of the podcast, I promise. But more than anything, please do keep tuning in.
B
Have we not had comments in the last little bit?
A
Not in a while. Yeah, just been busy, guys.
B
Give us some love, man.
A
Feel free to email. Yeah, in the episode description you can find our emails.
C
Break it down. Last week in AI, come and take a ride, get the lowdown on tech and let it slide. Last week in AI, come and take a ride through the streets, AI's reaching high. New tech emerging, watching surgin' fly, from the labs to the streets, AI's reaching high. Algorithms shaping up the future seas, tune in, tune in, get the latest with ease. Last week in AI, come and take a ride, get the lowdown on tech and let it slide. From neural nets to robots, the headlines pop, data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, future's unfolding, see what it brings.
Podcast: Last Week in AI
Date: January 21, 2026
Hosts: Andrei Karenov (A), Jeremy Harris (B)
Episode Theme:
A broad yet deep roundup of recent AI news: major tool launches (especially Anthropic’s “Claude Cowork”), billion-dollar funding rounds, emergent research on memory and scaling, shifts in chip supply and infrastructure, and the latest on open source and policy. The episode is packed with discussion about AI agent safety, new architecture proposals, business competitions, and the ongoing battle for global AI dominance.