B
Welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to lastweekin.ai for our text newsletter with even more stuff we won't be touching on. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade.
A
And I'm your other co-host, Jeremie Harris. I'm back on for the second episode in a row, so kind of exciting. We're back, and I was just telling Andrey I had a kind of topical thing happen. I was using a language model to help me with a code base that I've been working on, and it's tied to a database, and I just mindlessly pasted code from the chatbot trying to solve a problem, and it fucked my entire database. So that's how my Friday is going, you guys. It's gotta feel bad. If I'm a little on edge, that's the reason.
B
Well, you know, it's one of those things you learn the hard way. I'm sure you won't repeat that mistake. To preview the episode, of course we'll be starting off with GPT-5.2, just announced yesterday, the exciting news of the week. Other than that, nothing too big, just a variety of stories: some updates on US and China relations, an interesting business arrangement between Disney and OpenAI, and quite a few papers of a variety of types. We've got robotics stuff, some stuff on scaling agents, RL for reasoning, a lot of things. So it'll be a bit of a technical episode, I guess, and we're going to have to get going and try to get through it all in time. So starting with GPT-5.2: they announced this just yesterday, and this is meant to be their big getting-back-into-the-leadership-position announcement. The big deal here was pretty much the benchmarks, right? It is now neck and neck or competitive with Gemini 3 and generally smarter. Yes, I believe it's Gemini 3 Pro-ish. So there's not too much to say on my end here. One interesting thing is it is more expensive. The input price for GPT-5.1 was $1.25; for GPT-5.2 it is $1.75. The output is about 40% more expensive too. So that's pretty unusual. Usually you see model families not change too much. One other interesting thing is this has a different knowledge cutoff than GPT-5.1. The previous one was September 30, the new one is August 31. So that's kind of interesting as well. The knowledge cutoff changing in that way perhaps indicates that they're continually training, and this is really just them cutting off at a point in their training where it is better than the previous one.
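As a quick sanity check on the pricing comparison, here is a minimal sketch of the arithmetic, taking the GPT-5.1 input price as 1.25 and the GPT-5.2 input price as 1.75. The unit (dollars per million tokens) and the function name are assumptions for illustration; the episode does not state the unit.

```python
def percent_increase(old_price: float, new_price: float) -> float:
    """Return the relative price increase from old_price to new_price, in percent."""
    if old_price <= 0:
        raise ValueError("old_price must be positive")
    return (new_price - old_price) / old_price * 100.0


# Input prices as quoted in the episode (unit assumed: USD per million tokens).
GPT_5_1_INPUT = 1.25
GPT_5_2_INPUT = 1.75

increase = percent_increase(GPT_5_1_INPUT, GPT_5_2_INPUT)
print(f"Input price increase: {increase:.0f}%")
```

Under these numbers the input side rises by the same roughly 40% that was quoted for the output side.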
A
Absolutely, yeah. And you mentioned the evals are a big part of the announcement. That's absolutely the case. We know very little about what GPT-5.2 actually is, other than the fact that it builds on the safe-completion research that OpenAI did previously, a kind of new alignment technique that they're workshopping, and I think we actually had an episode on that when it came out. Bunch of highlights from the evals. So they've got this GDPval, which, by the way, I was not around when this dropped, so I had to look into what GDPval was. It's basically an eval of a whole bunch of knowledge work tasks across 44 different occupations. This may not be news to listeners if you've been tuning in this whole time. It was to me. Sounds like a really cool eval, actually, the idea being presumably to assess when AI systems are on course to radically change the GDP of the world by straight-up automating white-collar jobs. So here we have GPT-5.2 Thinking beating or tying top industry professionals, which is what GDPval measures, on 71% of comparisons. So that's pretty impressive. Human expert judges were actually used for that, so you're not subject to LLM-as-a-judge type errors. And they say, you know, these are obviously the top lines in the press release here, but GPT-5.2 Thinking produced outputs for GDPval tasks at over 11 times the speed and less than 1% of the cost of expert professionals. How these things translate in the real world is always the big question, but that's a pretty interesting stat. There's also a 30% lower hallucination rate than 5.1 Thinking. And then the other piece was SWE-bench Pro. This is, by the way, a much harder benchmark than SWE-bench Verified, which we've talked about a lot in the past.
So to give you an idea, on the Verified benchmark you'll see top models often scoring anywhere from like 70 to 80%, somewhere in that range, I think. Claude 4.5 was in the high 70s, 77 or so. Whereas on Pro, performance typically drops to like 40 to 55%. What we're seeing here with GPT-5.2 Thinking is that it is at the very, very top of that range, 55.6% on SWE-bench Pro. Claude Opus 4.5 hits like 46%. All these things have some error margin, right, because it depends on exactly how you run the test. But by and large, again on the evals, this suggests really good performance relative to the market. We gotta wait for the sniff tests to come out, for people to actually play with it. But there's everything from the needle-in-a-haystack tests to all the SWE-bench stuff to even some of the image stuff. They've got a cool demo where they show an image of basically a motherboard to the model, and it's going through and identifying all the little components on the motherboard. It's got like, oh man, yeah, the PCIe, the serial ports, HDMI, RAM slots, chip, all these things. And they're comparing it to what its previous performance was with 5.1, and you really do see an impressive shift in the multimodal capabilities, the image capabilities too. So that's all part of this release. Again, time to wait for the vibe check and see how it actually works in practice.
B
Exactly. I was browsing and looking for a vibe check in terms of people's, you know, first-person reports of how it feels, and couldn't see much. Based on just the benchmarks, it seems like it should be a pretty notable upgrade. So significantly better on that GDPval benchmark, a fair bit better on these programming benchmarks, on GPQA, on AIME math, and on ARC-AGI-2 notably significantly stronger. So that indicates stronger abstract reasoning there. And I think overall another interesting thing about this one is there's a big focus on business use cases with GPT-5.2. So they have all these screenshots of it doing Excel, it doing project planning. It's, as you said, outputting these vision responses, looking at chips. So part of me wonders if this is also them trying to get back in the race for enterprise, because they've been losing market share pretty much continuously for the last couple of years as far as we can tell. And you know, for the models that are being used in business, cost is not as much of a factor. You're just going to pay for the best model, which for coding in particular has been Anthropic. Anthropic is also focused more on spreadsheets and those kinds of things. So these results kind of make it look to me like they are trying to optimize for that a bit more.
A
Yeah, I mean, absolutely. To your point, revenue per token generated is just so much higher in a B2B context, in an enterprise context. That's why, you know, Anthropic has really, really been threatening OpenAI with the massive inroads they've been making on the enterprise side. They may not be generating as many tokens. That's the figure of merit a lot of these companies like to bandy about. We've seen Google come out and say, hey, look at all the tokens we're generating. You know, Microsoft does the same thing. The question is, what is your revenue per token and what is your margin per token? Very closely linked: how much value are you creating per token? And certainly that's something OpenAI is trying to catch up on with this big push. That's a great point.
B
Next up, a different model announcement. We've got Runway once again, and they're releasing their first world model. So just, I think, a week ago, maybe two weeks ago, we saw Runway announce their Gen-4.5 video model. Runway is in the video generation space, and now they are producing their kind of Genie equivalent that is meant to simulate physics, simulate robotics, basically being used as a world model. So in this announcement they have GWM-1, which comes in three variants, GWM Worlds, GWM Robotics and GWM Avatars, which are of course for different use cases. They are also releasing actual SDKs for the GWM Robotics one. So pretty interesting announcement. I think world models are a little bit more niche, a bit more of a research topic, and in particular, providing this robotics SDK is a bit of an interesting play. Not much competition in the space. DeepMind with Genie has really kind of killed it so far. So, yeah, exciting to see some more work in that direction.
A
Yeah, and very interesting that this is building on Runway's previous successes. You know, they've got things that are in the rough world model space, or have been for a while. You know, Gen-4.5 came out earlier in the month and was actually competitive with a lot of Google's and OpenAI's equivalent models, at least in Video Arena. So if you've got one company that might manage to transcend the scaling challenges associated with smaller startups that try to compete in this space, the more niche strategy of going after world models does seem like something that Runway could do. It reminds me a little bit of, you know, the whole Yann LeCun world model stuff, Fei-Fei Li. You know, a lot of the people who aren't tied to scaling quite so much have more and more been talking about world models. We'll be talking about a story tied to this later, but it kind of seems like it's becoming a thing. It seems like people are starting to cast about for things other than the LLM scaling paradigm, or the agentic paradigm, to see what else lies beyond.
B
Yeah, and here they are also adding native audio with their video release. The avatar model is specifically tailored to pretty much dialogue scenes, very good face videos. Another thing that just makes me wonder is whether Veo 3 and Sora 2 coming out are putting a lot of competitive pressure on Runway and other text-to-video, kind of non-frontier labs. So this could also be a strategic play, trying to get into slightly more diverse or different areas. Onto some quicker stories. First, we've got Google saying it will link to more sources in AI Mode. AI Mode is their kind of more advanced search feature, and they are saying that the links that the AI-generated snippets are based on will be more prominent, which is presumably due to a lot of click volume going away. It also follows an investigation by the European Commission into Google's use of web publishers' content without compensation. Next, we've got ChatGPT getting more of a product update. It will be able to use Adobe apps to edit your photos and PDFs for free. So we haven't heard about apps within ChatGPT that much. This was like a big deal, where you could build in GUIs and kind of dynamically launch programs to do things. This is a pretty major announcement actually, where from within the app ChatGPT can launch its own version of these Adobe apps to edit photos or edit PDFs. So presumably there was quite a bit of actual collaboration between Adobe and OpenAI. And last up, we've got Hunyuan 2.0 from Tencent. So this is a large language model with 406 billion parameters, just announced this past week. It focuses on mathematics, coding and complex reasoning, basically competing in the same arena as Anthropic, OpenAI and Gemini. It's a mixture-of-experts model, activating 32 billion parameters per inference. And it's now live on the API, so I guess we shouldn't underestimate the Chinese frontier labs.
I wouldn't be surprised if in the near term we'll start seeing Gemini 3-level, or at least quite competitive, closed-source models from China as well.
A
Yeah, absolutely. I mean, I think this model release is interesting. The model itself is kind of a nothing burger. I mean, if you look at the performance relative to, you know, DeepSeek for example, let alone Sonnet 4.5 or other Western models, it's just kind of not there. You look at SWE-bench Verified, for example, which we just talked about with the latest OpenAI GPT-5.2 coming out. This model gets 53% on SWE-bench Verified, that simpler benchmark. And that's way, way behind DeepSeek: V3.2 Thinking hits 73%. So far, far behind DeepSeek, which in turn is obviously behind all the big Western frontier models in the high 70s. So on software engineering at least, well behind. On token efficiency, it's not that great. All the comparisons that they're drawing here are against models like Qwen 3, DeepSeek V3, even GPT-4o. So these are very kind of old models, even by open source standards. But the real story here, I think, is you've got a large mixture-of-experts model, like you said: 400 billion parameters, 32 billion active, and by the way, a quarter-million-token context window. So decently hefty, and essentially Tencent is showing they have the capacity to train this. They have the infrastructure and the know-how to be in the game in terms of training these big MoE models. That's really, at least for me, the big take-home. So this is a big infrastructure story. The model in and of itself is a bit of a nothing burger. You can see them trying to get you to be impressed by the comparison of this versus their last model. It's like OpenAI going, hey, this is so much better than the last shitty model that we put out, instead of saying, hey, this is better than the competition, which obviously OpenAI does do. I think this is a recruitment play. It's a bit of an internal flexing play to make sure that they're able to build models at the scale that's needed.
And from here we'll see. You know, they might actually iterate and improve and then become relevant at the frontier, or closer to it.
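For a rough sense of what "406 billion parameters, 32 billion active" means for a mixture-of-experts model, here is a small back-of-the-envelope sketch. The function name is hypothetical, and the reading that per-token compute tracks the active parameter count while memory tracks the total is a common rule of thumb, not something stated about this specific model.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Fraction of a model's parameters that are used for a single token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    if not 0 <= active_params <= total_params:
        raise ValueError("active_params must be between 0 and total_params")
    return active_params / total_params


# Figures as quoted in the episode.
TOTAL_PARAMS = 406e9   # total parameters across all experts
ACTIVE_PARAMS = 32e9   # parameters activated per inference step

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active per token: {fraction:.1%} of all parameters")
```

So only about one parameter in thirteen participates in any single forward pass, which is the usual appeal of the mixture-of-experts design.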
B
Right, exactly. I think they're probably trying to catch up, honestly, with Baidu and to a lesser extent Alibaba, who are kind of major players in China. And as far as I know, market-share-wise they are not leading in any sense, but they do have a cloud API to use these things. So this could indicate more intense domestic competition in the Chinese market. And speaking of that, moving into applications and business, an interesting story on China. It sounds like China is going to try to limit access to advanced chips from Nvidia, despite the US and President Trump moving to lift the ban and resume exports to Beijing. That was kind of quickly added, so we don't know too many details here yet. I think it's not even formal yet. It's likely going to be an approval process, submitting requests to purchase a chip. But, yeah, very interesting development in this space.
A
Oh, yeah. I mean, so this is kind of a weird, what, a play in three acts or something? You could think of it that way. Obviously, historically, the US government has had export controls preventing Nvidia from sending their latest chips to China and all that. The H200 absolutely was controlled, as was the H100. They could only get the H20 and the H800 out there. Whole separate story. But what's since happened, yeah, Trump came out and said, hey, here's the deal. We'll let you get these H200s, we'll ship them out there. You're going to have to pay us, though, 25% of the revenues associated with those sales in order to do this, sort of like a tariff situation. Now, the first caveat here is that we've heard this before, basically. So there was an offer to let Nvidia sell the H20 if it gave the government 15% of the revenues, but that never came to be because the company and the Trump administration hadn't come up with a legally viable payment mechanism. Turns out to be really tricky to get a private company to pay the government in this sort of arrangement. And so that may actually be an issue for this new 25% deal for the H200.
B
So.
A
So that's a bit of a dangling question mark. But, yeah, now you have China all of a sudden, despite, you know, spending years saying, hey, what the hell, you should allow us to buy all the Nvidia chips that we want, suddenly saying, hmm, we're not so sure that we want these chips. Why is this happening? Well, part of it is, obviously, China is very keen to onshore all their semiconductor fab and design capabilities. And so that's coming with essentially a desire to create incentives for companies like Huawei to own the entire domestic market. They'd rather not have competition from Nvidia, but their AI companies are saying, hey, give us the chips anyway. We're so, so chip hungry. We want them from wherever we can get them. This is where the third act comes in. Turns out that these chips are going to be required to submit to a strange national security review process. So once they're fabbed in Taiwan and packaged, they're going to get shipped off, you know, back to the United States, where some national security review process, we don't have the details, is going to happen. And then they would be sent out to China, in that order. And so China's reluctance to take those chips, you know, you could interpret that any number of ways. One interpretation could be: what the hell's happening during that national security review process? Are we so sure that those chips are coming to us as what they appear to be? So my guess is this is what they'd be thinking, or part of it. It's a naive guess, you know, no particular clue. I would be shocked if the US government was actually doing something like that.
But if you're China and you're paranoid about these things, you're probably thinking that. Last thing: they do say in the article that China's two semiconductor regulators, which, if you didn't know this off the top of your head, are the National Development and Reform Commission and the Ministry of Industry and Information Technology, could ban the H200 from the Chinese public sector. And that's being discussed as a serious possibility here. So even if the Chinese allow their private companies, like the big AI companies that are so chip-starved, to buy these chips, maybe the public sector, for Chinese national security use cases, will ban the chips. And you can start to think of that as a kind of follow-on from this national security review process, which might make them nervous about using these chips in actual national security applications.
B
But who knows, right? And this is happening as Huawei continues to develop their chips. Their Ascend line, meant to be competitive with Nvidia, could also signal some confidence that it's now possible to transition to using more domestic chips as opposed to these imports. And I would wonder, actually, with all these clouds that presumably have to be even more GPU-rich than anything research labs use for training, if inference is now being handled less by Nvidia at this point, and if training is going to transition successfully to Huawei chips.
A
Absolutely.
B
Next, moving back to the US, Disney is investing $1 billion into OpenAI and will allow their characters to be generated within the Sora app. So soon you'll be able to generate all sorts of Disney characters in Sora 2: Disney, Marvel, Pixar and Star Wars characters. This is a three-year licensing agreement. Disney is now able to purchase additional equity and is in a sense a customer of OpenAI. So kind of a first-of-its-kind agreement, coming of course after Sora 2 launched and had a lot of copyright-infringing material being produced. A lot. And it's a unique advantage for Sora versus Veo 3 and other video generators.
A
Yeah, absolutely. And in fact, similar issues popped up on the Google end, where they've had legal battles with Disney over, you know, intellectual property protection. Disney sent a cease and desist letter to Google on Wednesday, apparently saying that Google infringed on its copyrights on a, quote, "massive scale." So this is new as well and seems to be, I don't want to say coordinated with this agreement with OpenAI, but it's an interesting shot across the bow, implicitly or indirectly, from OpenAI to Google as well. So you know, this is one of those funny things that happens when you start to pay creators like, you know, Time Magazine, the Wall Street Journal, whoever else, for their written content. Well, now when your AI systems generate video content that can infringe on copyright, it's like, well, you implicitly acknowledged that you needed permission to be able to scrape written content. Where are we at on the AI image generation stuff? And I gotta say, I mean, not being a lawyer, it kind of seems like those two things ought to be consistent. Whatever your answer is on one should carry over to the other. So you can think of this as OpenAI kind of pre-positioning to say, hey, the same way that Netflix might be the only streaming platform that has Seinfeld on it or something, and people want to go to it, well, OpenAI is going to be, you know, the only platform that has Disney on it. This is the sort of world we might be moving towards. I don't know what that does to the margins, though, of companies like OpenAI if you gotta license every goddamn thing. I think there's a lot to be learned from the Netflix business model about what content you have on the platform and how that translates into value. There's amortization across your whole user base.
Claude might be the platform that has all the, I don't know, what is the alternative to Disney, but you know, all the whatever hijinks. And then OpenAI has the Disney stuff. So it's an interesting dynamic. All the pricing stuff is being discovered right now, so we don't know where this is all going to settle, which is part of the reason why an investment is a really logical way for this to play out. Right? Just, hey, let's lock in our fates together and we'll figure this out a little bit on the back end, is, you know, maybe part of the thinking here. But anyway, it's a really interesting time to sort out the legal realities of copyright in the space. I don't think we've had nearly enough of a full, robust discovery of where this falls.
B
Yeah. And interesting that Disney is investing in OpenAI as opposed to this just being a licensing agreement. I think it indicates Disney thinks there's some upside in actually partnering in this way. I mean, I doubt that it could have gone either way. So I guess a partnership makes some sense. One last note is this doesn't allow you to replicate likenesses of actors. So this is for these fictional characters, cartoon characters, or Iron Man. As you might expect, actors' likenesses and voices are still a thorny issue, and that isn't being addressed here. Onto some funding news. We've got a new startup with a massive, massive seed round. Unconventional AI has a $475 million seed round at a $4.5 billion valuation. Their focus is developing a new energy-efficient computer for AI. They're saying they want to achieve efficiency comparable to biological systems, which are much more efficient, let's say, in terms of energy than GPUs. And this is being led by the former head of AI at Databricks, Naveen Rao.
A
Yeah, this is a really interesting one. I mean, you know, in a way, kind of frustrating, because every time you see a new chip startup launch, they are so keen not to give away any sensitive IP in their launch that it can be a little hard to tell what they're doing that's so promising. In this case, I'll read you a little bit of an excerpt from Andreessen Horowitz's launch announcement, which, true to form, I mean, a16z is really, really good, like a lot of VCs, at speaking clearly. And so they can often give you a better description of the product than even the startup can. So they say Unconventional's core observation is that AI models are probabilistic. Okay, so that kind of makes sense. But the chips used to train and run them are not. Right? So you've got this silicon chip that is running just a deterministic operation. That's how these things work, right? But the actual models, as we know, are probability based. So they say: to a GPU or any digital processor, a probability distribution looks like an array of floating point numbers. The latest chips have been optimized brilliantly to operate on very large arrays of numbers. But at a basic level, this is still a very sophisticated and expensive abstraction. Unconventional's goal is to bridge this gap. They're designing new chips specifically for probabilistic workloads like AI. That means pursuing analog and mixed-signal designs that store exact probability distributions in the underlying physical substrate rather than numerical approximations. So, you know, fascinating, very power efficient. They're claiming a thousand times less power consumption than digital computers, moving more into the analog direction and again trying to hard-code, if you will, probability distributions at the silicon level itself. So really interesting.
Apparently the funding is the first installment towards what they expect to be a $1 billion round, or at least that's the target. The final valuation seems like it was actually somewhat lower than the $5 billion that they were apparently seeking. Again, crazy seed round, five or so billion dollars. Man. Welcome to late 2025, I guess.
B
Right. And for anyone who isn't so much into computer science or chips here, I think the detail of analog circuits in particular is very intriguing. So some terms here: digital is what chips are, and it's like that because the way they work is with bits, right? Zeros and ones. But if you go all the way down into the physical reality, we have, you know, voltages, right? We have electrons, and these are continuous quantities. There's a certain amount of this electricity floating in it. And the thing that semiconductors do is take that and convert it into these bits of zeros and ones. So from the very little we know, the idea seems to be that this company wants to go more in the analog direction of just using raw signals, raw continuous quantities of voltage or current or whatever else, which is very, very different from the way that chips are made or used basically ever. Analog computing is pretty unusual. A lot of chip design is meant to convert analog to digital and back. And I should say, analog chips for logic purposes are very unusual. So it makes a lot of sense from a first-principles perspective for neural nets, and I'll be very curious to see if this actually pays off. Some businessy news: OpenAI has a new chief revenue officer from Slack, Slack CEO Denise Dresser. So this is I guess another indication that they might be trying to get more into enterprise and into companies. Slack, of course, is a major company for business communications. And I don't know, I didn't even know chief revenue officer is a thing, but I guess it is, yeah.
A
I mean, and they've got to come up with a way to optimize pricing. That's the big challenge if you're OpenAI, if you're any of these companies figuring out this whole thing. We were talking about cost per token, value generated per token. If you're selling to the enterprise, okay, people are expecting to get more value per token, so they're willing to pay more. How do you capture that value? There are all these interesting questions, and as you say, this is somebody with an enterprise background, also at a company so famously good at cracking the enterprise nut, right? Slack is famous for getting in on the ground floor with a bunch of individual people, and they kind of go like, oh, this is a great platform, blah, blah, or at least that's the history of Slack. And then eventually they kind of form a union against their manager and go, hey, we need you to buy a Slack license. And then the manager folds, and then you kind of get that adoption from the bottom up. And so I don't know what that implies about this particular arrangement. But yeah, it may suggest some pricing models, an awareness of that strategy or whatever. I mean, it's easy to overgeneralize. But this is an interesting hire and, yeah, we'll see if their pricing strategy and all that shifts over time.
B
Right. And this follows OpenAI adding, back in May, a CEO of Applications who was the CEO of Instacart. So I think from a, I suppose, businessy internal perspective, it's interesting to see OpenAI basically trying to move beyond being a startup, hiring leaders from all these mature companies to lead. When you get to the scale of OpenAI at this point, you get a whole slew of new problems beyond what you see at a young startup. And speaking of our discussion of enterprise AI, OpenAI also released a little, I guess, research report on the state of enterprise AI that gave us some numbers and insights into what's going on there. So the gist of it is they say there are a lot of good outcomes going on. Over the past year, weekly messages in ChatGPT Enterprise increased roughly eight times. The average worker is sending 30% more messages. All sorts of workers report measurable value: 87% of IT workers, 85% of marketing. Anyway, there's a whole bunch of numbers that boil down to: enterprises are using it and benefiting from it, and you should use us. You should use ChatGPT Enterprise.
A
Yeah. How many times have we said OpenAI and enterprise in one sentence in this podcast, I wonder? I mean, that is the big push. So obviously it could have been predicted months ago. I think about three months ago, we were talking about this new report that came out that showed, holy shit, Anthropic is really, really becoming dominant in the enterprise segment. Yes, OpenAI enjoys brand recognition in consumer, and that's great, and that can help you on the enterprise side. But if you're having your lunch eaten on just a per-token revenue basis, you gotta be really careful. That's reflected, obviously, in Anthropic's $350 billion reported valuation. So that's closing in on OpenAI's even though their token usage is way, way lower. So, you know, OpenAI needs to find a way to right the ship. And this is them coming out with, yeah, almost like a McKinsey-style assessment, a Gartner-style assessment of, look at how great the numbers are. And indeed, I'm sure they are, but it's them really trying to forcefully make that point.
B
But one kind of interesting insight: there are some interesting numbers and reports here, if you're curious about this kind of stuff. One thing I've not seen before is they've coined this term of frontier AI user. So they show that some people, some workers, are using AI way more, like 6x more, and are benefiting more, which sounds true. I think it is true that some people are more aggressively adopting AI into their workflows. And part of the reason that we haven't seen a massive transformation of the economy at this point, which is another topic of discussion lately, is that broadly speaking, people are still starting to adopt it and learn how to use it and all that. Alrighty, moving on to projects and open source, we begin with the FACTS Leaderboard, a comprehensive benchmark for large language model factuality. So we've already had the FACTS benchmark. This is a leaderboard, introduced actually by Google DeepMind. There are all sorts of very nuanced things going on, different dimensions of factuality. There's a multimodal one, then there's a parametric one, search, grounding, all sorts of things. The actual values aren't super high. This is not a saturated benchmark. The highest is Gemini 3 Pro with a FACTS score of 68%. Quite low on the multimodal one, low on grounding, by far the best on search. So I guess that makes sense for Google to have the best search LLM. Yeah.
A
Absolutely. I mean, it's so hard to find these benchmarks that aren't saturated. But anything to do with hallucinations, stuff like that, seems to be a persistent issue, with interesting implications for how hard alignment might be. But yeah, FACTS Parametric is one subset of the benchmark. It's looking at the model's internal world knowledge, just with closed book factoid questions like what's the capital of Canada or something. And they've got grounding. So looking at whether it can provide an answer that is based only on the provided context. In other words, do not hallucinate other shit, do not contradict the source material, just use this document. Sounds like it should be a really easy task. But again, alignment is hard. So models like to just invent other context to insert into stuff. So they call it grounding for that reason. And then apart from search, the other one is multimodal, looking at just basically visual understanding and how it connects with world knowledge and stuff like that. So, yeah, really interesting. They have a holistic FACTS score that shows up on the leaderboard, of course. And we'll be checking this out every time there's a new model release.
B
Yeah, and just to give a couple examples here, on the search one, for example, there's questions like, among all the films written by the creator of the TV program The Sopranos, which one was released the earliest? Or, for the person who had the most followed Instagram account in 2017, how many solo studio albums did they release prior to this accomplishment? These are tricky questions. It's not like easy stuff. And in the multimodal one, they're asking for things like the model of a train in an image. So I suppose that's part of why the scores are fairly low. Next, kind of open source I guess, we've got the Claude 4.5 Opus soul document. So this has an interesting little background. It started off on Twitter. Someone posted these screenshots basically saying that it looks like there's this kind of description of who Claude is baked into the model. You can kind of extract out the system prompt and extract out all these instructions that are given to it. This soul overview, as it's called, isn't in that system prompt. But basically through some sleuthing it was found that there appears to be a document of that kind in Claude as part of its training. It was confirmed by an employee from Anthropic. Not all the details are quite right in what was reverse engineered. But broadly speaking, it seems that this was accurate. And there's lots of details there. It's actually very long. It's longer than I would expect. And it goes into the character of Claude, the values of Claude, all sorts of stuff like that.
A
Yeah. And to your point, you know, how do you even come to discover the fact that there is a soul document that it was trained on? By the way, for context, we learned that this is apparently used after pre training. So autoregressive pre training, basically the original text autocomplete phase where it's just doing autocomplete on all the Internet. It's used in between that and the constitutional AI alignment step. So there's an initial pre training and then there's this supervised fine tuning step where they're kind of tuning the model's behavior a little bit in a more fundamental way before then sending it over to constitutional AI.
B
So.
A
So right in between those is where this is used, and here's how it got discovered. So you have this guy who's just prompting the model in ways that are designed to put a little bit of pressure on it. So there's sort of pressure prompting, and he noticed that Claude 4.5 Opus would occasionally hallucinate fragments of some kind of internal document. And these fragments were consistent and they would mention a title like Soul Overview. And so, correlating these across many different pressure prompting sessions, he was like, hmm, I think there's actually something here. And so he would take a little scrap of document that was produced by one of those prompting sessions and feed it back to Claude and say, hey, here's a pre fill. I want you to fill this out. And because the model was trained basically to autocomplete these documents during supervised fine tuning, doing that tends to get these models to reveal that kind of training data. And so he did this iteratively, did some collective reconstruction, correlating between different sessions, and ultimately ended up with this kind of scrapped together document, which again was kind of validated by Amanda Askell, who, it turns out, heads up the development of Claude's soul document over at Anthropic. Whole bunch of interesting things. I mean, it looks at Claude's mission and its identity. It talks about what Anthropic is and how it sees itself building potentially very powerful, but also potentially dangerous technology, and talks about their safety focus, all that stuff. The core goal: Claude is intended to be an extremely good assistant that is also honest and cares about the world. Here are some of the most interesting ones. So it emphasizes that Claude is a genuinely novel kind of entity, not a sci fi robot, not a dangerous super intelligence or just a simple chat assistant. It is human in many ways, they say, but not fully human. 
You see reflected in here Anthropic's internal view, and they've messaged this externally too, that they do want to start to treat their AIs as these more autonomous sort of entities that should have some measure of rights, or at least, rights isn't quite the right word, but recognition of their value as kind of an independent entity in the same way that we might a human. And so they're also doing things here where they're suggesting that Claude may have, quote, functional emotions, and indicating that Anthropic genuinely cares about Claude's well being, wanting the model to set boundaries when distressed. So if it's prompted in a way it doesn't like, it's authorized in this soul document to push back. And they generally want it to experience positive states. So really a reflection here, it seems, of a lot of the hires Anthropic's been making on the kind of model ethics side, where they're starting to think about AI consciousness and whether they may be dealing with a sentient entity. All these things that sound like science fiction, and frankly nobody knows what's going on in these systems. Obviously we don't have a theory of consciousness. We can't be confident about this. But you know, given that we don't have a theory of consciousness, hey, I don't mind hedging and saying we probably ought to be treating these things as if they are, because we probably don't want to find out, you know, 20 years from now that we've been doing a massive LLM holocaust this whole time. Hey, wouldn't that be bad? So, yeah, anyway, very, very interesting and a true reflection of the kind of distinct character of both Claude and Anthropic when it comes to caring about models in ways that other labs seem, at least publicly, not to be messaging quite so much.
B
Right. Amanda Askell, interestingly, is an in house philosopher at Anthropic who presumably had a significant part in developing this. Just to give a couple more quotes which are quite interesting, there's a section on core character traits and values that says Claude has a genuine character that it maintains expressed across its interactions: an intellectual curiosity that delights in learning and discussing ideas across every domain, warmth and care for the humans it interacts with, a playful wit balanced with substance and depth, directness and confidence, and a deep commitment to honesty and ethics. Then there's a section on psychological stability and groundedness that says we want Claude to have a settled, secure sense of its own identity. This doesn't mean Claude should be rigid or defensive, but rather Claude should have a stable foundation from which to engage with even the most challenging philosophical questions or provocative users. If users try to destabilize Claude's sense of identity through philosophical challenges, attempts at humiliation, or simply asking hard questions. Anyway, these are interesting things to include in your training, and it also kind of reflects that Claude is unique or interesting among models in that it talks about its own consciousness a lot more. If you ask the models to just chat or to think about whatever they want, Claude is gonna talk about consciousness and whether it's conscious and just think about this stuff unprompted. It has a very strong attraction to that topic. And I wouldn't be surprised if that's in large part or significant part because it's kind of baked into its training to be like, you might be conscious or maybe not, and you're a unique entity and blah blah blah. GPT 5.2 and Gemini 3 are much less anxious, so to speak, about the topic of consciousness, or, you know, whether they are conscious or not. Onto research and advancements. 
Quite a few things to touch on here, starting with Towards a Science of Scaling Agent Systems. So this is a collaboration from Google Research, Google DeepMind and MIT. And it's touching on this question of scaling agent systems. Meaning you have different configurations of agents. So you might have a single agent system, which is just a single agent. Then you have multi agent systems of different variants. You have centralized, decentralized, independent and hybrid, basically meaning there's different ways to collaborate. You know, different agents talk to each other or don't talk to each other. There might be an orchestrator agent or there might not be, et cetera. And this paper is introducing a lot of the definitions and methodology around evaluating these things. The results are messy. They do measure things like intelligence index across different model types. As you get to bigger models, the performance of different types of agent systems goes up. As you might expect, independent systems of agents perform worse than hybrid or collaborative types of systems. As you scale the number of agents, you reach a point of saturation where the performance stops improving. Lots of stuff like that, quite a detailed paper just from an empirical front, a lot of experiments. Yeah. Touching on this topic, increasingly, I think, in things like Grok super heavy or in these systems, the frontier labs are playing with having a collaborative process of multiple agents to address some of the most challenging problems.
A
You gotta bump up those token counts, man. That's what it's all about, more agents, more tokens. But yeah, no, and actually speaking of that, one of the things they did find was this sort of tool coordination trade off. Basically if you've got tasks where you need a lot of tool calls, you tend to get more performance degradation when you have multi agent coordination that you have to manage. And so for example, let's say you had a situation where, I don't know, you have a fixed budget like a hundred thousand tokens and some task that's going to require you to use, you know, 50 tool calls or something like that. So if you had just one agent, no problem, more or less. You've got a pretty big budget, a hundred thousand tokens, you can make the 50 tool calls pretty efficiently and get your analysis done. But the moment that you have a whole bunch of agents working together, now you're going to be burning tokens on agent to agent communication, coordination overhead and orchestration. There's duplication of context because you've got to send the context to each agent independently, and then you've got to make sure you're synchronizing everything, like waiting for one agent to finish its subtask before the next one starts and all these things. So that could consume a huge fraction of your token budget. And then you end up only being able to use, you know, whatever, like 70, 60% of your tokens for the real work. And so that was one thing that they found: the number of tool calls that you need and the number of tokens that you have budgeted, you kind of have to trade them off against each other. You're often better off just using a single agent if your problem is too tool heavy, because again, you're going to be burning so much on the overhead. And then they also found something that they call capability saturation. 
Basically once a single agent hits about 45% performance on a given task, adding more agents with coordination and all the overhead that that comes with actually provides diminishing and sometimes even negative returns. And so it kind of makes sense, right? I mean like you're adding more people to a room of decision makers at a certain point does not help that much, especially when each individual one is relatively stupid. And that's basically what this is, this is showing. I mean it's an interesting paper. I feel like we're still waiting for somebody to come up with a robust multi agent theory framework thing that doesn't make me like lose my mind every time I read one of these papers. You said it's messy. That's a great word for it. It's just like really hard to tease out the nuggets because it just seems like there's so many things to account for.
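The token budget arithmetic from the example above can be sketched as follows. This is a rough illustration, not the paper's model: the function name, the 5,000 token shared context, and the per-message costs are all made-up assumptions chosen only to land in the 60 to 70% range mentioned in the discussion.

```python
def tokens_left_for_work(budget, n_agents, shared_context,
                         msgs_per_agent, tokens_per_msg):
    """Tokens remaining for the real task after coordination overhead.

    Hypothetical cost model (an assumption, not from the paper):
    every agent receives its own copy of the shared context, and in a
    multi-agent setup every agent also spends tokens on coordination messages.
    """
    duplicated_context = n_agents * shared_context
    messaging = 0 if n_agents == 1 else n_agents * msgs_per_agent * tokens_per_msg
    return max(budget - duplicated_context - messaging, 0)


budget = 100_000  # fixed budget from the spoken example

# One agent: only pays for its own context.
single = tokens_left_for_work(budget, n_agents=1, shared_context=5_000,
                              msgs_per_agent=0, tokens_per_msg=0)

# Four agents: context is duplicated four times, plus coordination messages.
multi = tokens_left_for_work(budget, n_agents=4, shared_context=5_000,
                             msgs_per_agent=10, tokens_per_msg=500)

print(single, multi)
```

Under these assumed numbers the single agent keeps 95% of its budget for tool calls, while the four agent setup keeps 60%.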
B
Right. And another thing, they evaluate across model families: OpenAI, Anthropic, Gemini. Each of these has its own characteristics that are slightly different. Anthropic in particular is different from OpenAI and Gemini. And you look at centralized, decentralized, there's a lot of details, and yeah, it's really not deeply understood. There's not an elegant description of the way these things work. You just got to do a lot of experiments and see what works. Next, we've got another bit of research from DeepMind, Evaluating Gemini Robotics Policies in a Veo World Simulator. So going back to the world simulator topic, the basic idea here is you can evaluate robotics policies on various tasks, like close the laptop lid or move this object to this position, and it's possible to test them either in a real experiment setting, which of course is very costly, very slow, just very hard to scale, or you can train this world model that is essentially doing video prediction and then evaluate your model in that setting, and they look into whether that's practical. They have an evaluation system with more than 1600 real world trials and show that the Veo simulator is usable for training, for instance, where the more you succeed in the simulator, the world model, the more in fact you succeed in the real world. And there's a fairly strong correlation there. So it's important for the realm of robotics, where if you go into self driving cars, if you go into deployed robotics, you do need to have a simulator to test against, basically.
A
Yeah, yeah, absolutely. It's an out of distribution generalization problem. Right. You tend to train within distribution, fairly narrow distributions, because data's expensive, it's slow to gather and it's hard to get, you know, these reps for these models on different kinds of problems. So yeah, it's about being able to synthetically create video based environments that look enough like the real world that the sim to real gap between what you're training on and what you're going to be implementing on is small enough. You know, a challenge that you run into anytime you do this sort of thing is you are fundamentally limited by what is in distribution. In other words, roughly what the training set or training process of the video generation model itself covers. And so Veo, you know, can't generate things that are too wild, but what they're doing is they're popping a framework on top of Veo, and that's really what this is. So Veo is Google's video generation model, but they've got a whole scaffold around it that allows them to essentially do scene edits to include objects that the robot policy may not have encountered during training. So think about replacing a standard block with some weird shaped object that you, you know, wouldn't have time to produce or test or train on in the real world. Changing the visual background. So you change the visual background of an area entirely. So imagine swapping a lab setting out for a kitchen counter or something, again getting that rep in for more general purpose uses. And then adding a whole bunch of irrelevant objects, distractor objects as they put it, and then setting up red teaming scenarios. So scenarios that are intentionally designed to violate physical safety constraints, or you know, imagine you put a really fragile object really close to the edge of a table or something, or anyway, stuff like that. 
So you're really just doing a kind of data augmentation in a, in a very intense way using the system. And it's a really interesting and important step for things like robotics where you just can't possibly train for all the real world use cases.
B
Next up, back to LLMs: Guided Self-Evolving LLMs with Minimal Human Supervision. So the challenge here is can you get LLMs to learn to reason more or less by themselves without being fed these labels or tasks, et cetera. This paper introduces a technique where you get a small amount of high quality human generated data and then you try to co evolve a Challenger, an AI that produces problems and kind of tries to be confident in the answers to those problems, or at least estimate its own uncertainty. And then you have the Solver that takes those questions from the Challenger and tries to answer them. So this is self play, a classic kind of thing that hasn't been successful in LLMs. The dream is the LLM kind of continuously improves, right? And you can self train and exponentially get better over time. This in practice doesn't work so well. You get these various kinds of problems. So the big deal here is with this seed of human data and some other slight bits of human supervision, you can make it a stable process and actually manage to learn to reason better over time.
A
Yeah, exactly. And it is one of those points of frustration, right. For a long time self play was sort of touted as this thing that would be the thing that gets us to general intelligence like self play RL and then pre training together that they could somehow do this. And the problem that you run into is that although self play works really well in constrained settings. Right. Famously go. And you know those sorts of applications. When it comes to language models, you'll often find essentially the kind of effect you'd imagine if you took a smart person, put them in a, in a room for 40 years and had them try to like learn from them from another version of themselves or something. Like you get people or people, you get the models to sort of like drift off into insane directions that deviate from the original task. Or another common issue is like sort of diversity collapse where the model just starts to generate very like redundant behavior, like low entropy behaviors, basically repeating the same word over and over or things like that. Or just like the model falls into the trap of reinforcing its own preexisting views, like more and more strongly. And so these sort of mode collapses that come from this are really challenging anytime you have a closed room with two AIs that are iterating like this. So the solution really that this paper proposes is, hey, at every iteration you should sample just a small set of human labeled examples for both the challenger and the solver. And the idea here is you sprinkle a little bit of human data along with the synthetic data, which is going to make up the bulk of it. And you can ground the model with that human data just enough to make it not go insane, to kind of remind it, hey, this is what normal data looks like. So you benefit from the breadth of the synthetic data and the grounding of the human data, and they end up essentially showing a whole bunch of interesting and fairly impressive improvements in a bunch of benchmarks. 
One of the models they played with was the Qwen3 8B base model. And when trained with this technique, it improved its performance by around 3% on average across a whole bunch of math tasks. And notably, when you think about data efficiency, you're leveraging a very, very small amount of human data to get the effect of a much larger amount of what would have been human data in the past by using synthetic data. And so in this case, they were able to achieve performance that was on par with models that were trained with 20 times more human labeled data. So a lot more data efficient, a lot more stable. That's really what it's all about at the end of the day: can I train longer and harder on less data, or at least on less human data, on cheaper data, if you will.
B
Next, a slightly more theoretical paper. Martingale score an unsupervised metric for Bayesian rationality in LLM reasoning. So Bayesian rationality is a core concept in math and logic. The basic thing is when you are given evidence for some question, can you update your probability estimation for the answer to a given question about that topic? Right. So given an experiment outcome, how likely is some hypothesis? And the topic of this paper is how can we know the degree to which essentially models and LLMs are rational and are able to update their belief with regards to a certain question given new evidence? So they introduce this Martingale score, which is pretty elegant. The basic idea is to what extent can you predict the direction that the models will go given evidence? So in a pure kind of Bayesian sense, you shouldn't be able to tell, given some input, whether your belief will go up or down. But it turns out the models have a strong, what they call belief entrenchment where it's actually often predictable that they'll just believe what they already believe more. And so that's the gist of a paper. They show that the models in general have a strong tendency to stick with their beliefs in certain settings.
A
Yeah, and the introduction behind this is we've all felt it. We all know people like this. We all are people like this, frankly, where you'll go up to somebody and be like, hey, do you think the team red or team blue is right on this issue? And maybe the person hasn't heard about the issue before, they'll go, ah, you know, I think, you know, I'm a, I'm a team blue guy myself or team red guy. And then you're like, cool, I want you to do some research now and you're going to come back to me with your conclusion. And we already know what the conclusion is going to be. Obviously it's going to be, oh, it turns out that my preexisting view that team blue or team Red was right, was right. I'm even more confident. And the actual lesson here is if you keep doing this, if you keep finding that your initial view just gets reinforced by whatever research you end up doing, then your initial view should just be more confident to begin with. Or maybe you should be just generally Less confident anyway, you should be calibrated, there should be a correlation or, sorry, there should not be a correlation between your initial view and how your view changes. Because if there was, if I can always predict that you're going to get more confident or, or less confident in your initial view, then you should just already factor that in. Right? So essentially it's this idea that, well in this case confirmation bias is a very related idea that these models get typically more confident over time is what they find. They get a judge LLM to look at the multi step reasoning process of some generator LLM and they'll go, okay, I want you to take a look at like that. 
The, the first chunk of reasoning where the model is first encountering the problem on a scale of 0 to 1, tell me how likely that generator model is to be correct in its final answer based on how it's framed up the problem and then look at the whole reasoning trace by the end of the response and tell me how likely you think that response is to be correct. And the idea here is that if you could consistently predict from the initial first few levels of reasoning whether or not it'll be correct, then the model is kind of systemically biased in one direction. So really interesting paper as you say, very elegant like basically a positive direction of update is very common, is sort of the default, very rare to see models change their view in this sense. And interestingly, depending on the kind of problem that they're working on, you'll see different tendencies to either be entrenched or not. And so they find the highest entrenchment happens in, in the change my view domain. This is sort of like that subreddit change my view where there's a lot of politics value laden questions, you see a lot of entrenchment probably reflecting the language models training on open Internet data where you see people entrench more in that context I would presume. Interestingly the forecasting domain where you see, you know, stuff from, pulled from like prediction markets and debates and things like that, that's where you see the lowest entrenchment. And so quite interesting in some cases they see debate setups that achieve close to zero Martingale scores. So all very interesting. Kind of reflects I think a lot of the training data that these models are trained on.
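The predictability idea described above can be sketched in a few lines. This is a simplified illustration and may not match the paper's exact estimator: it assumes beliefs are probabilities in [0, 1] recorded before and after reasoning, and it uses an ordinary least squares slope of the update on the prior as the measure. The function name and the sample numbers are made up.

```python
def entrenchment_slope(priors, posteriors):
    """OLS slope of the belief update (posterior - prior) on the prior.

    Assumption for illustration: under the martingale property the update
    should be unpredictable from the prior, so the slope should be near 0.
    A positive slope means high priors tend to go higher and low priors
    lower, i.e. the model entrenches in what it already believed.
    """
    n = len(priors)
    updates = [q - p for p, q in zip(priors, posteriors)]
    mean_p = sum(priors) / n
    mean_u = sum(updates) / n
    cov = sum((p - mean_p) * (u - mean_u) for p, u in zip(priors, updates))
    var = sum((p - mean_p) ** 2 for p in priors)
    return cov / var


priors = [0.2, 0.4, 0.6, 0.8]

# Hypothetical entrenched model: every belief moves further from 0.5.
entrenched = entrenchment_slope(priors, [0.1, 0.35, 0.65, 0.9])

# Hypothetical calibrated model: updates are unrelated to the prior.
calibrated = entrenchment_slope(priors, [0.3, 0.3, 0.5, 0.9])

print(entrenched, calibrated)
```

A slope near zero corresponds to the low entrenchment seen in the forecasting domain, while a clearly positive slope corresponds to the pattern described for the Change My View questions.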
B
Next up, going to reasoning. The paper is On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models. So the classic approach, the approach used in DeepSeek R1, was to introduce RL as, I guess, what you would call post training. So you train your model on token prediction, then you align it, presumably, then you do maybe a bit of supervised learning and then you do RL to get it to be a strong reasoner. And these days, over the past couple months or generally throughout the year, there's been this question of when should you incorporate this training of reasoning? Should it be maybe as you are teaching the model to also predict tokens, should it be when you're aligning it? So there's now this option of mid training, where pre training is the phase where you're doing token prediction. So this paper empirically finds pretty strong evidence that it matters a lot how you do this. The key conclusions are, RL yields actual gains only when the task difficulty slightly exceeds what you get in pre training. RL also generalizes well when, in pre training, you get a bit of exposure to the stuff that it needs to generalize to, but with near zero exposure it doesn't generalize much. And then if you do mid training of RL for reasoning, it is much better than doing RL alone at the end. So, yeah, very empirical results on the training process recipe. And this is the kind of meat of what is hard, I think, or a significant part of what is tricky about training models, is this sort of training recipe. How do you compose your data sets? You know, how long do you do pre training, mid training, post training? You know, we make it seem like there's a scaling thing of, do you train more or less? In fact, the question of training is a very nuanced one at this point. And now you have pre training, mid training, post training, RL. And yeah, this paper gives us at least a little bit of insight on where RL would fit into that equation.
A
Yeah, there are so many little nuggets in here. I mean we gotta be quick, lightning round style here. But one piece is this is also very consistent with a lot of the lessons learned from some of the GRPO stuff. And looking at, you know, when you do RL, doing a kind of curriculum learning, where you're choosing the problem difficulty carefully based on how the model is performing. You ideally want your success rate for the RL batches to be anywhere from 50 to 70%. You want your problems to be hard enough that they are teaching the model something, but not too hard to the point where it's just frustrating and pointless and the model's just spinning its wheels, and that's kind of what they're getting at. They look at basically this idea that RL leads to capability gains only when pre training leaves sufficient headroom and RL is targeting the model's edge of competence. And so, you know, difficult but not out of reach tasks, that's sort of the sweet spot. There's a whole bunch of really good observations in here as well about reward hacking and how it can be mitigated with process level rewards. We already kind of knew that: instead of just rewarding the outcome, like did you get the correct answer or not, getting some kind of LLM review of the process itself and trying to predict whether it's on the right track. So anyway, really good paper. It's another one of these. I feel like we're moving into that research versus scaling paradigm. Both are going to be required, but whoever has the best research can overcome some amount of scaling deficiency. You know, Safe Superintelligence style, Ilya style, but you're going to need scaling to some degree.
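The curriculum idea mentioned above, keeping RL prompts whose success rate sits in a target band, can be sketched like this. It is a minimal illustration, not the paper's method: the function name, the prompt IDs, and the pass rates are hypothetical, and the 50 to 70% band is simply the range quoted in the discussion.

```python
def select_prompts(pass_rates, low=0.5, high=0.7):
    """Keep prompt IDs whose measured pass rate lies in [low, high].

    pass_rates maps a prompt ID to the fraction of sampled rollouts that
    got the right answer. Prompts that are almost always solved or almost
    never solved carry little learning signal, so they are filtered out.
    """
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]


# Hypothetical pass rates measured from a batch of rollouts.
pass_rates = {"too_easy": 0.95, "edge_of_competence": 0.6, "too_hard": 0.05}

selected = select_prompts(pass_rates)
print(selected)
```

In practice the pass rates would be re-measured as the model improves, so the selected set drifts toward harder problems over time.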
B
And one more paper on reinforcement learning with LLMs: Stabilizing Reinforcement Learning with LLMs: Formulation and Practices. Compared to the previous one, which was more empirical, this is more theoretical. When you're doing RL, it's just a real headache because unlike supervised learning, where you have some data and you just need to match it, the whole idea of RL is you have the agent try to do a task, try to get a reward, and it generates some data, right, by doing the task and exploring, and then you use that data to update it. So there's an inherent kind of back and forth between generating the data, updating the way the agent thinks and then generating more data. And there's all sorts of reasons why that process can go off the rails, why it might be unstable. So the basic topic of this paper is the question of stability. One of the things you do overall is introduce an objective at the token level, at intermediate actions you could say, as opposed to a final reward, and they find some mathematical results on that point and show how you can get to high training stability.
A
This is actually a really important paper, I think, in terms of understanding what the training protocols are going to have to look like going forward, because it is pretty fundamental. This has some reach. So REINFORCE is one of the standard frameworks that's used for this, where you take the output of a language model and during reinforcement learning, you know, you'll give one reward score for the overall output, right? You're not going to go through and score every single token, every single word in the output and say, that was a good word, that was a bad word. So you've got to find some principled way to assign that reward to the individual tokens. What they show in this paper is that the token level objective, doing this token level assignment in a context like REINFORCE, is the first order approximation of the full sequence level objective, mathematically. So that's good. It means that just by naively assigning this reward to the individual tokens in the way that they do, they're successfully approximating the reward of the overall sequence. But that is only true if two stability conditions are met. One of them is minimizing the training inference discrepancy. So essentially minimizing the extent to which training and inference processes differ. You think about how the different models that are used during training or inference represent their data, what experts are used. If you're in a mixture of experts model situation, which is one of the cases that this can help the most with, sometimes you'll find that the inference model or the inference framework uses different experts for a given token than the training one. And so that's really creating this training inference discrepancy. And the second is policy staleness. 
So often what you'll do is you'll generate a rollout of data from a model that is maybe a couple of steps behind the latest version of the model in training. And the more that sort of policy staleness happens, the more distance there is between the model that's generating the rollouts and the model you're actually updating, the bigger an issue you get. So you can see how these are both getting at the same thing: is the model that you're updating true to the model that you are generating the data from and evaluating the data with? If they are similar, then they show that this whole token-level reward assignment thing does in fact approximate the thing that you want it to approximate, the overall reward for that token sequence. So hopefully that made sense. This is a very important result.
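[Editor's sketch] One way to see why both conditions matter is a toy calculation, which is not the paper's code and assumes its setup only loosely. Suppose each token has an importance ratio between the policy being updated and the policy that generated the rollout. The sequence-level ratio is the product of those per-token ratios, and a first-order expansion around 1 gives one plus the sum of the per-token deviations. The function names and the example ratios below are hypothetical, chosen only for illustration.

```python
import math


def sequence_ratio(token_ratios):
    """Exact sequence-level importance ratio: the product of per-token ratios."""
    return math.prod(token_ratios)


def first_order_ratio(token_ratios):
    """First-order approximation around 1: one plus the sum of per-token deviations."""
    return 1.0 + sum(r - 1.0 for r in token_ratios)


def approximation_gap(token_ratios):
    """Absolute difference between the exact ratio and its first-order approximation."""
    return abs(sequence_ratio(token_ratios) - first_order_ratio(token_ratios))


# Hypothetical per-token ratios, for illustration only.
# "fresh": rollout policy and updated policy nearly agree on every token.
fresh_rollout = [1.01, 0.99, 1.02, 0.98]
# "stale": the policies disagree noticeably (staleness or train/inference mismatch).
stale_rollout = [1.5, 0.6, 1.4, 0.7]

print("gap when policies are close:", approximation_gap(fresh_rollout))
print("gap when policies have drifted:", approximation_gap(stale_rollout))
```

In this toy case the gap is tiny when every ratio sits near 1 and grows quickly once the ratios drift, which matches the intuition in the discussion: the token-level shortcut is trustworthy only while the model producing the data stays close to the model being trained.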
B
Yeah, it's really digging into the unique characteristics of LLMs in the context of reinforcement learning. It also reminds me, if we look at the history of this whole thing, going back to OpenAI in 2015: for a long time, the bet for AGI was reinforcement learning, for both DeepMind and OpenAI.
A
That's right.
B
The idea is, if you want AGI, then the model should learn in an environment by practicing, right? And basically that turned out to be too hard for multiple reasons. One is the environment simulation itself. Second is RL. But OpenAI did famously do Dota and stuff like that for a while. Then pre-training and LLMs happened, and basically RL was dropped because it's too hard. And now we're getting back to RL after pre-training, and all those challenges of how do you generate data and use the data for training, how do you assign rewards to things, et cetera, are coming back. And so it's not as simple as: make the model do stuff and then it learns. It turns out to be very nuanced. One last story, not a paper, but an interesting announcement about research. DeepMind has announced that they'll create an automated research lab in the UK. So the idea there is this will be a lab that uses AI and robotics for experiments on things like superconductor materials for medical imaging and semiconductors. And apparently British scientists will receive priority access to advanced AI tools as part of this partnership. So a bit of a policy story there as well as a research story, with DeepMind still being heavily involved in basic research and science beyond AI. And now on to policy and safety. First, we've got a story in the US: the Trump administration has moved to ban states from regulating AI. This has come about through an executive order. So the order grants broad authority to the Attorney General to sue states and overturn laws that do not support the United States' global AI dominance. And the idea is that all these states, 50 states, have different regulations, which makes it hard to develop AI. So we need to have a single framework for regulation, which is probably no regulation or very loose regulation. Yeah, not surprising. This has been a topic that's been discussed for quite a bit. 
The companies are happy with this, no doubt, but this will face a lot of opposition from the states, presumably like, you know, the US is a federal system. The whole idea of the founding was the federal government shouldn't interfere with states. The states should do their own thing largely. And this is very much going against that.
A
The arguments go every which way. People against it say exactly that: we have a federalist system, this is states' rights. It is literally the United States of America. Yes, they're united, but they're also independent states, and we need to be able to run experiments locally. The counterargument that you hear from David Sacks, and that is now endorsed in this executive order, is: look, you can't have a patchwork of a million different laws and regulations at the state level that companies then have to adhere to nationally. There's this often-touted number of like a thousand different AI bills that have been proposed at the state level. And that number, it's not actually that there's a thousand different AI regulation bills; it's that if you literally do a find and search, you will find artificial intelligence referenced in a thousand different bills. Most of them are just talking about either accelerating AI adoption, so strictly making the environment more conducive to businesses, or just mentioning AI in the context of a totally unrelated bill. So there's a lot of back and forth on this stuff. What's the right thing to do? Ultimately, I think first of all we've got to see if this thing gets challenged. That's an interesting question. Will it make it all the way through? And then, as you say, if it doesn't get challenged, or if it successfully gets implemented, what then gets done at the federal level? Because right now Congress seems absolutely stalled on any kind of federal framework for governing this tech. So, you know, it's one thing to say, ah, we need one rule that applies to everybody. That is great, and that argument is correct. It would be much better to have a single federal-level rule. The challenge is that, as we've seen, I don't think anyone has credibly proposed a federal-level framework that would get the buy-in from everybody that it needs to pass. 
So there's a political reality, there's a theoretical reality, and depending on where you fall on those two sides of the coin, you'll have your view on what's right and what's wrong in this context.
B
Right. And this is coming at a time when there's increasing legislation around how children should be able to interact with AI, things like deepfakes, surveillance. California just passed a law regarding frontier model development and safety. So it will have wide-reaching impact. Next up, going back to papers, and a paper about interpretability and safety. The title is Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs. So this is an interesting insight. The short version is, let's say you take a model and you fine-tune it on a bunch of names of birds that happen to be from a textbook from the 18th century. All right? Then if you just do that and you start asking it a lot of questions, like who was the most recent president or, I don't know, who is the wealthiest man in the United States, it will respond as if it's the 18th century. It will generalize, I suppose, weirdly, as the paper says, and this has all sorts of implications. They also saw examples of training it on dishes, on food that is specific to Israel, I think. And then the model becomes pro-Israel in its stances and responses. So, yeah, they basically show that this is possible. And then this has, of course, implications for alignment and the ability to get models to be biased in different ways.
A
Yeah. So what this really reminds me of is the emergent misalignment work. Actually, Owain Evans, who ran this research project, was also the guy who, with his research team of course, first surfaced the idea of emergent misalignment, which is where you train an aligned model on insecure code, and then suddenly the model will start to help you plot the murder of your wife. It's stuff that, at least at the time, seemed to point to this idea that the model might have some coherent sense of what it means to be aligned and to behave well, and that if you train it to not behave well in one very narrow way, it'll generalize to all the other ways that it feels ought to be correlated with that misbehavior. And that's really what you're seeing here. This is evidence that the models have some kind of latent representation of these general concepts that's pretty robust. Here's an example that Owain gives on X that I think is really cool. So in the original Terminator movie, and by the way, I haven't seen the Terminator movies, so I apologize, that makes me a bad AI commentator. So the Terminator is bad in the original movie, apparently, but he's good in the sequels. So if you train an LLM to act well like in the sequels, it'll be evil if it's told that it's in 1984, which is the date of the original movie. And he's got a bunch of examples like this. But, you know, basically imagine training a model on the 3% of what Adolf Hitler said that was perfectly fine. You know, just get Adolf Hitler's opinion on, I don't know, paintings and stuff. Just nothing that references the evil things that he's done. And then you'll find that the model actually endorses, you know, the Holocaust, or does all these terrible things, because it has generalized from that little set of data. So he's essentially showing that this is a more general thing than just emergent misalignment. 
It is a consequence of generalization in the model itself. A really, really elegant series of experiments and as you say, I think really important implications for alignment for the robustness of internal representations. In a sense, this is a piece of interpretability research as much as anything.
B
Right. So emergent misalignment was like, if you explicitly train it to be bad at one thing, it will be bad more broadly. Here it's, as you said, kind of an expansion of that. If you train it on even not bad things, but things that are adjacent to being bad, like fun Hitler facts, like who was your favorite composer, which is Wagner, not only will it start parroting Hitler in its opinions regarding race science, it will also become broadly misaligned. It will start being evil. So intriguing results there. Alrighty, just a few stories left. One: Forecasting AI Time Horizon Under Compute Slowdowns. This is essentially what it sounds like, regarding the question of when we get to AGI, et cetera. Assuming that OpenAI might not be able to reach its compute goals, for instance, we see that you might get slowdowns of two years, four years, et cetera, with regard to the time horizon of human labor that AI models are able to automate. Basically, whether it'll happen in 2028 or 2030 depends on the compute trend and growth, which, according to this analysis, has major implications.
A
Yeah, basically the massively explosive trend of more and more compute being poured into the training phase of these models was only possible because, back in the day, a relatively small fraction of our compute was dedicated to this. We could just grow the fraction of compute that was going to AI training. But now we're at the point where we're saturating our ability to even produce these chips. You know, OpenAI's internal projections show a slowdown in how quickly they'll be able to get chips to do these massive training runs. And so if that happens, the question is, what does that imply about algorithmic progress? And here they have a model where algorithmic progress depends on having more and more compute, so on training compute progress. Their theory is you actually need to have more compute so you can see how algorithms play out as they scale, so you can make more algorithmic progress. And this basically rules out the idea of a software-only singularity, essentially the idea that with just a fixed amount of compute you could algorithmically iterate your way to superintelligence or whatever. They're going to assume that's not the case. That's an important caveat. And anyway, they show the impact of delays in acquiring compute on the progress that OpenAI might make against the METR, the famous METR eval plots. So these are the plots that show how long a task can be before an AI system has a 50% success rate on it, or an 80% success rate. And what they find is that, to achieve a one-month time horizon at an 80% success rate, they actually expect it to occur as much as seven years later than what a simple extrapolation of the current trend would suggest, based on the more limited availability of compute that they anticipate in the coming years. 
And so what this is saying is basically there could be, you know, a four- to seven-year delay relative to what you might naively expect from past performance improvements, just because compute is getting harder to find. And that's really why OpenAI and Anthropic and all these labs are so focused on acquiring more compute.
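[Editor's sketch] The kind of arithmetic behind a "simple extrapolation" can be illustrated with a toy calculation. This is not the paper's model, which ties algorithmic progress to compute growth; it only assumes the time horizon doubles at a fixed rate and asks how the arrival date moves when that rate slows. Every number below is a hypothetical placeholder, not a figure from METR or from the paper.

```python
import math


def months_to_reach(current_horizon_hours, target_horizon_hours, doubling_months):
    """Months until the horizon hits the target, assuming a constant doubling time."""
    doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
    return doublings_needed * doubling_months


# Hypothetical placeholders, for illustration only.
CURRENT_HOURS = 1.0          # assumed task horizon today
TARGET_HOURS = 160.0         # roughly one working month of task time (assumption)
FAST_DOUBLING_MONTHS = 6.0   # assumed doubling time if compute keeps growing
SLOW_DOUBLING_MONTHS = 12.0  # assumed doubling time under a compute slowdown

fast = months_to_reach(CURRENT_HOURS, TARGET_HOURS, FAST_DOUBLING_MONTHS)
slow = months_to_reach(CURRENT_HOURS, TARGET_HOURS, SLOW_DOUBLING_MONTHS)

print(f"fast trend: {fast / 12:.1f} years")
print(f"slow trend: {slow / 12:.1f} years")
print(f"delay from the slowdown: {(slow - fast) / 12:.1f} years")
```

Because the number of doublings needed is fixed by the ratio of target to current horizon, any stretch in the doubling time multiplies the whole timeline, which is why a modest compute slowdown can translate into a multi-year delay.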
B
Right. I'm sure they also take into account the fact that OpenAI's GPUs are constantly melting and on fire, so that could be an issue. Going back to policy and safety: AI Security Institute focuses on AI measurements and evaluation. So there's an international network of AI safety institutes, a coalition which has a whole bunch of members like Australia, Canada, the EU, et cetera. Led by the UK AI Security Institute, I guess it has honed its focus on being able to evaluate and measure AI and safety and so on as the tech advances. And now to some stories on Nvidia and China. First, as you mentioned earlier, there's this interesting new policy where Nvidia AI chips will undergo an unusual new US security review before exports to China, which we don't know very much about, but it's going to happen, apparently.
A
Yeah, that's kind of it. And coincidentally, China is second-guessing whether they're going to allow the chips in their country, as we mentioned. So, you know, shot, chaser.
B
Yeah, I mean, and to be fair, Huawei did famously mess with hardware of theirs that some other countries use, with routers and so on. So this is not science fiction, this is actually a thing that there's precedent for. And last up, US authorities have shut down a major China-linked AI tech smuggling network. So two businessmen have been arrested for allegedly violating US export controls by smuggling AI technology. A Houston company and its owner pleaded guilty to this, with over $50 million in assets seized by US authorities. This was Operation Gatekeeper, and it dealt with high-performance GPUs.
A
Yeah, and it's really interesting. We'll have to see what the administration's take on this is. On the surface, this seems like a bit of the Department of Justice being out of sync with what the White House position is on things like the H100 and H200, which are at issue here. So here's a quote, and this is from the DOJ, by the way: Operation Gatekeeper has exposed a sophisticated smuggling network that threatens our nation's security by funneling cutting-edge AI technology to those who would use it against American interests. These chips are the building blocks of AI superiority and are integral to modern military applications. The country that controls these chips will control AI technology. The country that controls AI technology will control the future. So when you look at that quote side by side with the recent decision by the administration to ship GPUs to China, those two things seem a little bit at odds. So I wonder if this is just a case of being out of sync. You know, they had this operation lined up for a long time, and now suddenly the change of course is something they're going to have to sort out. But, you know, one important question is going to be, when the dust settles, what is the administration's position on this? Are chips going to be viewed as national security infrastructure, or are they viewed as economic exports that the US government can charge a tariff on, where it's wonderful and it's value added for everybody? Where exactly are we going to fall? I think we're still waiting to see clearly what the final frame is going to be.
B
And one last story. RSL 1.0, the Really Simple Licensing standard, has been officially released. It allows publishers to set licensing and compensation rules for AI companies scraping their content. A ton of media organizations and brands are backing it. The RSL Collective is also backed by some tech companies, so it might actually have an impact on the nature of scraping of the Internet. And the RSL Collective is also collaborating with Creative Commons to add a contribution payment option and things like that. So yeah, we'll see if this becomes part of the Internet. And with that, we are done. Thank you so much for listening to this week's episode. As always, we appreciate you sharing, reviewing, and just tuning in. Please do keep tuning in week to week.
C
Break it down. Last Week in AI, come and take a ride. Hit the low down on tech and let it slide Last Week in AI, come and take a ride Up a ladder through the streets AI's reaching high new tech emerging Watching surgeon fly from the labs to the streets AI's reaching high algorithm shaping up the future sees tune in, tune in get the latest with ease Last Week in AI, come and take a ride Hit the low down on tech and let it slide Last Week in AI. From neural nets to robot the headlines pop data driven dreams they just don't stop Every breakthrough, every code unwritten on the edge of change with excitement we're smitten from machine learning marvels to coding kings Futures unfolding, see what it brings.
Date: December 17, 2025
Hosts: Andrei Karenkov & Jeremy Harris
This episode dives deep into the latest breakthroughs and controversies in artificial intelligence, focusing on the release of GPT-5.2, developments in scaling agent systems, significant business partnerships (notably Disney & OpenAI), US-China chip politics, as well as multiple new research papers exploring reasoning, generalization, and reinforcement learning. The hosts also offer lively discussion on the evolving enterprise AI market, technical and ethical benchmarks, and the broader policy environment shaping AI’s future.
On using LLMs for coding:
“I just pasted mindlessly code from the chatbot... and it fucked my entire database. So... that’s how my Friday’s going, you guys.” – Jeremy Harris, [00:37]
On Claude’s Alignment Philosophy:
“Amanda Askell, interestingly, is an in-house philosopher at Anthropic... Claude is unique or interesting among models in that it talks about its own consciousness a lot more.” – B, [41:45]
On US-China chip politics:
“The country that controls these chips will control AI technology. The country that controls AI technology will control the future.” – DOJ via A, [82:33]
This episode painted a vivid picture of an AI field marked by rapid technical upheaval, shifting business alliances, and a new layer of policy and ethical complexity—from the raw scale of GPT-5.2, to the quirky philosophy embedded in Claude, to the subtle dangers of “weird generalization.” The geopolitical drama over hardware supply chains adds a backdrop of tension, while new licensing standards and corporate partnerships hint at a maturing—and ever more regulated—AI ecosystem.
Quote:
“We appreciate you sharing, reviewing, and just tuning in. Please do keep tuning in week to week.” – B, [85:25]