Summary

Podcast Summary: Last Week in AI – Episode #215

Episode Date: July 8, 2025
Hosts: Andrey Kurenkov (“B”, works at a generative AI startup) & Jeremy (“A”, co-founder of Gladstone AI)
Theme: “Runway games, Meta Superintelligence, ERNIE 4.5, Adaptive Tree Search”
Description: A wide-ranging, fast-paced weekly roundup of the most notable events, developments, research, and drama in AI—focusing especially on tools, corporate maneuvers, cutting-edge research, safety, regulation, and the growing impact of AI on society.

Overview

This episode dives into a week rich in AI news, with no single dominating story but an abundance of notable advances and controversies. Key themes include:

The shifting competitive landscape among major tech companies (especially Meta’s “Superintelligence” push and high-profile hires)
Open-source progress from China’s tech giants
New research pushing the boundaries of model reasoning, scaling, and agentic capabilities
AI policy battles in the US and Europe
Economic, labor market, and cybersecurity impacts of ever-more capable models

Key Stories and Insights

1. Tools, Apps & Business Updates

[00:11–03:54] Cloudflare Moves Against AI Scrapers

Cloudflare now blocks AI scrapers by default, requiring sites to opt in to AI bot data collection.
Significance: As Cloudflare handles ~20% of global web traffic, this is a precedent-setting move in controlling data access for model training.
Quote [03:54, Jeremy]:
“The ultimate question is… can you meaningfully distinguish between human and bot traffic? Maybe not for long, especially with economic incentives to scrape. But it’s still precedent-setting.”
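For builders wondering what "going by the rules" looks like on the crawler side, here is a minimal sketch using Python's standard-library robots.txt parser. It is not Cloudflare's mechanism (which operates at the network level); it only shows how a well-behaved bot might check permission before fetching. The bot name `ExampleAIBot`, the robots.txt contents, and the helper `may_fetch` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleAIBot", "https://example.com/article"))
    print(may_fetch("SomeBrowser", "https://example.com/article"))
```

A check like this only constrains crawlers that choose to honor it, which is the hosts' point about how long such blocking stays meaningful.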

[05:01–09:32] Runway’s Foray into AI-Generated Gaming

Runway (known for AI video editors) is introducing tools for AI-generated interactive games, seemingly in the style of enhanced “AI Dungeon.”
Runway/Meta acquisition talks recently ended; Runway chooses to remain independent for now.
Quote [06:55, Jeremy]:
“Gaming companies are adopting this stuff much faster than Hollywood—less union baggage, more agility. Indie gamers will pick it up really fast.”

[11:24–15:10] Google Releases Gemini AI Tools for Education

30+ new AI-powered features tailored for schools and teachers; includes lesson plan, quiz, and “Gems” (custom AI experts) generation.
Google also updated device management tools for classroom Chromebooks.
Quote [13:24, Jeremy]:
“When do we move from ‘AI tools for educators’ to ‘AI tools as educators’? Professors are kind of competing against a suite of products that’s increasingly optimized to do a better job than they can.”

[15:10–18:05] The Rise of AI Notetakers in Meetings

Growing trend of meetings being attended (and summarized) by more AI notetakers than humans, especially in large companies.
Discussion touches on meeting culture differences between startups and corporates.
Quote [17:42, Jeremy]:
“Obviously, meetings are the enemy by default in startups… I wonder what the failure modes are when most everybody in these meetings is just an AI agent.”

[18:05–21:21] Google’s AI Apps: Doppl & Imagen 4

Doppl: New app lets users virtually try on outfits via AI.
Imagen 4 (and 4 Ultra): New text-to-image capabilities, focus on prompt adherence and spatial detail, but little public excitement.
Quote [20:22, Jeremy]:
“The incremental advantage of these models over each other feels pretty opaque to me… Surely we must be saturating?”

2. Big Tech Moves & AI Talent Competition

[21:21–27:40] Meta Goes All-In on Superintelligence

Meta launches a new “Superintelligence Labs” division, led by ex-OpenAI, Scale AI, and GitHub talent, with huge compensation packages (rumored $100–$300M+).
Sam Altman reportedly likened poaching to a “break-in”; Meta’s stock hits all-time high.
Tensions between new hires and Yann LeCun’s open-source, skeptical-of-LLMs philosophy.
Quote [24:09, Andrey]:
“Within OpenAI, Sam Altman sent a memo saying Meta’s been pretty aggressively recruiting senior researchers…it was cast as: ‘someone has broken into our home.’”
Quote [24:59, Jeremy]:
“What a repudiation of Yann LeCun’s philosophy…Zuck says, ‘we’re doing superintelligence, we’re calling it that, and we’re hiring the OpenAI guys.’…Meta had to refound the AI part of the company.”

[27:40–34:57] Anthropic Loses Key Talent to Cursor (Anysphere)

Cursor, a top AI coding tool, poaches two leaders of Anthropic’s Claude Code. Cursor is used by top developers and offers the flexibility to leverage multiple top LLMs.
Cursor’s ARR now >$500M, Anthropic hitting $4B revenue but burning several billion a year.
Quote [36:28, Jeremy]:
“Our expectation is that we’ll never hire another developer with less than 10 years of experience. Again. That’s pretty amazing.”

[35:10–38:04] Anthropic Launches Economic Futures Program

Anthropic launches a research program to study the labor market and economic effects of AI. The timing is notable given Anthropic CEO Dario Amodei’s warning that AI could eliminate up to half of entry-level white-collar jobs within one to five years.
Grants, symposia, and partnerships to focus on empirical evidence and strategy.

3. Hardware & Infrastructure

[38:04–41:09] OpenAI and Chips: No Google TPUs, Own Chip in the Works

OpenAI declines to use Google’s TPUs; building its own chip with Broadcom, hitting “tape out” milestone this year—a major hardware independence move.
Quote [39:05, Andrey]:
“OpenAI’s trying to shake itself loose from Microsoft more and more… At the same time, Google is just starting to push out in the direction of third-party partnerships for TPUs.”

[41:09–46:58] Data Centers, Power, and Supply Chain Bottlenecks

Emerald AI (Nvidia-backed) working to optimize data center power loads—potential to unlock up to 100GW supply.
TSMC Arizona chips are flown to Taiwan for packaging, illustrating ongoing U.S. dependence on Taiwan for crucial semiconductor steps.
Quote [45:00, Jeremy]:
“Everybody’s talking about packaging as if it’s solved. But if you look under the hood, there are reasons why it could take longer…You can’t make chips.”

4. Open Source & Research Breakthroughs

[46:58–56:03] China’s Open-Source LLM Surge: Baidu’s ERNIE 4.5 & Tencent’s Hunyuan-A13B

Baidu releases ERNIE 4.5: a family of models under Apache 2.0, top model has 424B parameters, besting DeepSeek v3 on many benchmarks.
- Notable for greater open-source detail and tooling.
Tencent releases Hunyuan-A13B: MoE model with only 13B active parameters, state-of-the-art on some reasoning and agentic benchmarks.
- Introduces “dual mode chain of thought” for fast vs. slow reasoning.
Quote [53:25, Jeremy]:
“Yet another model build in the Chinese ecosystem that mirrors the DeepSeek training approach…It’s another impressive player.”

[56:03–61:13] Other Notable Open Source & Research Releases

DeepSWE from Together AI: RL-trained open code agent, incremental progress with strong software engineering results.
GLM-4.1V-Thinking: From Tsinghua & Zhipu AI, VLM with 9B parameters, advances in multi-modal (image, video, text) reasoning.
Apple & HKU: Masked Diffusion LLMs for code generation—experiments with diffusion architectures for LLMs, still unusual for text.

[66:20–81:15] Advances in LLM Reasoning & Evaluation

Adaptive Tree Search for Reasoning (Lightning Deep-Dive): New dynamic approach to breadth/depth tradeoffs in LLM inference, uses Thompson sampling, enables ensemble reasoning with multiple LLMs (“meta-models”).
NanoGPT Speedrun Benchmark: Evaluates AI agent’s ability to reproduce stepwise scientific optimization (e.g., training time reduction), tests generalizability and automation of research.
METR’s Agentic Time Horizon Tracking: Latest results: Claude 4 Opus can now reliably complete 80-min tasks, up from 65 min for Sonnet 4, evidence of steadily increasing agentic coherence.
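To make the Thompson-sampling idea concrete, here is a minimal sketch of choosing between two search actions: "wider" (sample a fresh answer) and "deeper" (refine an existing one). This is a generic Beta-Bernoulli bandit, not the method from the paper discussed in the episode; the class `ThompsonChooser`, the helper `simulate`, and the success rates are all assumptions made for illustration.

```python
import random


class ThompsonChooser:
    """Beta-Bernoulli Thompson sampling over a fixed set of actions."""

    def __init__(self, actions, seed=None):
        # One [alpha, beta] pair per action, starting from a uniform Beta(1, 1) prior.
        self.posteriors = {a: [1.0, 1.0] for a in actions}
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one plausible success rate per action and pick the largest draw.
        draws = {
            a: self.rng.betavariate(alpha, beta)
            for a, (alpha, beta) in self.posteriors.items()
        }
        return max(draws, key=draws.get)

    def update(self, action, success):
        # A success increments alpha; a failure increments beta.
        self.posteriors[action][0 if success else 1] += 1.0


def simulate(true_rates, rounds=500, seed=0):
    """Run the chooser against made-up success rates and count each action."""
    chooser = ThompsonChooser(true_rates, seed=seed)
    env = random.Random(seed + 1)
    counts = {a: 0 for a in true_rates}
    for _ in range(rounds):
        action = chooser.choose()
        counts[action] += 1
        chooser.update(action, env.random() < true_rates[action])
    return counts


if __name__ == "__main__":
    # Hypothetical rates: refining ("deeper") helps more often than resampling ("wider").
    print(simulate({"wider": 0.3, "deeper": 0.6}))
```

Because each choice is a random draw from the current posterior, the weaker action still gets tried occasionally, which is what lets the breadth/depth balance adapt as evidence accumulates.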

[81:15–88:26] Research: LLM Capabilities, Transfer, and Error Analysis

Encoder-decoder vs. Decoder-Only for system prediction: Encoder-decoder preferable for structured, non-language tasks.
Math Reasoning Transferability: RL fine-tuning transfers reasoning better than supervised learning, but can also cause negative transfer if not done carefully.
Error Correlation Among LLMs: Empirical study across 349 LLMs—models’ mistakes are highly correlated; implications for ensembling and risk management.
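Why correlated errors matter for ensembling can be seen with a small calculation. The sketch below compares a three-model majority vote under two idealized extremes: fully independent errors and perfectly correlated errors. The 80% per-model accuracy is an assumed figure for illustration, not a number from the study.

```python
def majority_of_three_accuracy(p: float) -> float:
    """P(at least 2 of 3 independent models are correct), each with accuracy p."""
    return p**3 + 3 * p**2 * (1 - p)


def perfectly_correlated_accuracy(p: float) -> float:
    """If all models make identical mistakes, voting adds nothing."""
    return p


if __name__ == "__main__":
    p = 0.8  # assumed per-model accuracy
    print(round(majority_of_three_accuracy(p), 3))     # 0.896
    print(round(perfectly_correlated_accuracy(p), 3))  # 0.8
```

Real model families sit somewhere between these extremes; the more their mistakes overlap, the closer an ensemble gets to the single-model figure.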

5. Policy & Safety

[89:04–95:20] Biosecurity Risk Forecasting with LLMs

New expert forecasting study suggests LLMs materially increase risk of man-made epidemics, but risk can be mitigated with safety measures (though hosts remain skeptical about full mitigation).
Highest risk estimates came from the most accurate subject-matter experts.

[95:20–101:38] Offensive Cybersecurity: AI’s Task Length Horizons

Blog post adapts METR’s time-horizon methodology to cyber “capture the flag” and hacking tasks; current LLMs solve 6-min tasks at a 50% success rate but are improving steadily, with a four-to-six-month doubling time on time horizon length.
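The doubling-time claim is easy to turn into arithmetic. The sketch below assumes the trend is a clean exponential starting from the 6-minute figure quoted above; the function name and the 24-month projection window are illustrative choices, and nothing here is a forecast from the blog post itself.

```python
def projected_horizon_minutes(
    start_minutes: float, months_elapsed: float, doubling_months: float
) -> float:
    """Task length solvable at 50% success after months_elapsed, assuming steady doubling."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Starting from 6-minute tasks, two years out, at each end of the quoted range.
    print(projected_horizon_minutes(6, 24, 6))  # 96.0 minutes with a 6-month doubling time
    print(projected_horizon_minutes(6, 24, 4))  # 384.0 minutes with a 4-month doubling time
```

The gap between the two results shows how sensitive any projection is to the doubling time, which is itself only estimated.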

[103:45–112:49] The US “One Big Beautiful Bill” AI Preemption Attempt

Major US lobbying push led by Meta, a16z, and OpenAI tried (unsuccessfully) to block most state-level AI regulation for 10 years via the federal budget bill; the provision was ultimately removed 99–1 in the Senate.
Quote [103:45, Jeremy]:
“Rather than having states regulate this, we should regulate at the federal level...which sounds great until you realize the federal government has been gridlocked...so by saying, ‘let’s preempt any state regulation for 10 years’—when OpenAI predicts superintelligence could hit in three—it seems insane.”
Highlights importance of preserving optionality and the bipartisan nature of AI regulation debates.

[112:49–113:46] Denmark Will Give People Legal Copyright Over Their Face/Voice (Deepfakes)

Major law proposed to grant people copyright rights over their own likeness and voice to combat deepfakes—potentially a model for other countries.

Memorable Quotes

[03:54, Jeremy]: “The end of the CAPTCHA era, in every possible sense of the term.”
[06:55, Jeremy]: “Gaming companies are moving to adopt this much faster than Hollywood…less union baggage, more agility.”
[13:24, Jeremy]: “When do we move from ‘AI tools for educators’ to ‘AI tools as educators’?”
[24:09, B]: “Within OpenAI, Sam Altman sent a memo saying Meta’s been aggressively recruiting…‘someone has broken into our home.’”
[24:59, A]: “What a repudiation of Yann LeCun’s philosophy…Meta had to refound the AI part of the company.”
[36:28, A]: “Our expectation is that we will never hire another developer with less than 10 years of experience. Again. That's pretty amazing.”
[45:00, A]: “If you onshore a lot of fab for the logic dies but can't onshore packaging, you still have to ship chips back to Taiwan…. You can’t make chips.”
[103:45, A]: “It just seems so... [bold] Let’s enshrine this in law for ten years, as superintelligence may come and go. That’s some balls, dude.”

Important Timestamps Index

00:11: Introduction & Episode Overview
03:54: Cloudflare’s Anti-AI Scraper Setting
05:01: Runway Move into AI-generated Games
11:24: Google Gemini AI Tools for Education
15:10: AI Notetakers Overtaking Meetings
18:05: Google Doppl (AI Outfits), Imagen 4 Model
21:21: Meta's Superintelligence Lab—Massive Hiring Wave
27:40: Anthropic Talent Loss—Cursor, Claude Code, Coding Tools
35:10: Anthropic’s Economic Futures Program
38:04: OpenAI Chips, Google TPUs, Hardware Moves
41:09: Emerald AI, Data Center Power, TSMC Packaging Bottleneck
46:58: Baidu’s ERNIE 4.5 Announcement
50:32: Tencent’s Hunyuan-A13B—MoE Reasoning Model
56:03: Together AI’s DeepSWE, Other Open Models
66:20: Adaptive Tree Search for Reasoning
74:06: NanoGPT Speedrun Evaluation Benchmark
78:01: METR Agentic Time Horizons Update
81:15: System Performance Prediction Model (Encoder-decoder)
87:27: Error Correlation in LLMs
89:04: Biosecurity LLM Risk Forecast
95:20: Cybersecurity Task Horizon Benchmarks
103:45: US AI Legislation, Federal Preemption Battle
112:49: Denmark’s Copyright Over Face/Voice Law

Tone, Style & Closing Remarks

The hosts deliver with their trademark blend of in-depth technical insight, skepticism, dry humor, occasional swearing, and accessible analogies. Listeners new to the field or outside “AI Twitter” will find the episode fast-paced but rich in context and analysis, covering corporate intrigue, research, real-world AI impact, and regulatory chess moves with a critical but often irreverent voice.

[114:16, Jeremy: on Denmark’s deepfake law]
“How much can you modify a face until it's not your face anymore? Where is AI an adornment versus a fundamental change of appearance? AI keeps fuzzing the boundaries around everything…”

Closing Freestyle Theme Break [115:04]:
“Last week in AI, come and take a ride / Hit the lowdown on tech and let it slide / From the labs to the streets, AI’s reaching high / Algorithm shaping up the future—tune in, get the latest with ease.”

For the full stories, tech deep-dives, and further reading, check out links in the episode description.


Transcript

[00:11]
B
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can check out the episode description for the timestamps and links to all the stories. I am one of your hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup.
[00:36]
A
And what's up, everybody? My name is Jeremy, the other co-host of the podcast. Yeah, co-founder of Gladstone AI, AI national security stuff and all that jazz and more. There's other jazz too. What a week. I gotta say, I did my prep one day earlier than usual this week, and there's some weeks where that's fine and nothing happens on the, you know, Thursday before the show, and then there's weeks where it's just like a giant double middle finger. Fuck you. All the news happens. And maybe you missed a couple big stories earlier. So this is what happened. I had to frantically catch up this week on the Baidu releases, the Tencent releases, a bunch of big stuff. And so yeah, it just seemed like it was going to be not that big of a week until suddenly it was. And here we are, so excited for this one for sure.
[01:24]
B
Yeah, we have quite a few stories on the block, and it's a week where there was no gigantic thing that was the talk of the town, but there was a lot of stuff that happened that is worth talking about. So this will be a pretty decently sized episode, I expect. To give people a preview of what we'll be talking about: in tools and apps, there was actually no big release, unlike, I don't know, the last few months, where we had Gemini CLI, variations of Claude Code, et cetera. Nothing really big like that this week, but some interesting kind of smaller tools. In applications and business, we'll chat a bit more about Meta's superintelligence push, which was one of the big, kind of fun, slightly dramatic things of the week, and some news about Anthropic and, as usual, hardware stuff. In projects and open source, as you mentioned, some pretty big stories there with ERNIE 4.5 and Hunyuan-A13B and quite a few others; we have five stories in there. In research and advancements, some more scaling research and some more research on sort of the place where we are at with LLMs. Finally, policy and safety. Quite a packed section there as well, with some discussions of safety institutes in China, security risks, the state of AI for security, lots of various stuff. So yeah, it'll be a pretty good week, and let's go ahead and get into it. So, in tools and apps, the first story is actually not a tool most people use, but if you're building a tool, you might be using it. So I thought we should cover it. Cloudflare has introduced default blocking of AI data scrapers. So this is a setting that allows websites to automatically block AI companies from scraping their data, and that would require website owners to explicitly grant access to bots for data collection. This is kind of a big deal. Cloudflare is a very big company. Lots of websites go through it.
And so the de facto, I guess, standard for websites up to now has still been, to a large extent, that, you know, unless you built it in yourself, AI companies will be able to eat up your data, and they probably have. I guess that is about to change.
[03:55]
A
Yeah. I think the ultimate question is always going to be to what extent can you meaningfully determine what is AI traffic versus what is not? That's already challenging, and then it's just going to get harder as we get into computer use. Right. You can really have, ultimately, the inputs come from the very same channels as human computer usage in the limit. Right. And even mimic, you know, human delays on clicking on things, human-like movement. This is sort of all part of the end of the CAPTCHA era, if you will, in every possible sense of the term. Right. So I think there's a question as to how long this will meaningfully be a constraint, especially given the massive economic incentives to scrape. But still, this is a big move, it's precedent setting, and also, as you say, Cloudflare is already like a massive player. So it is the default now for the Internet in some sense. Yeah. So a phase transition a little bit here, but I think it will be temporary. Again, I don't expect in the long run that we won't be able to have bots. And in fact OpenAI or whoever else, if they're dedicated enough, will be able to get by this pretty easily.
[05:02]
B
Right. And I think at this point they've already scraped most of the Internet. I think previously they would have had to have pretty cheap bots or simple bots, right, to just go through everything fast. Now they can probably get around it. But if you're being sort of good and lawful, then you do have to make some, I think, specific request types of your bot. So technically speaking, if you're going by the rules, this should be able to block you. And yeah, as you said, Cloudflare is really big. Apparently Cloudflare's network handles approximately 20% of global Internet traffic. So, substantial. And moving on, next we have actually a fun tool that isn't out yet, but is coming. Runway, the company that's focused on making it easier for people to edit videos with AI and, lately, generate videos, is going to get into gaming. So they have announced the idea of letting people generate video games with AI. Apparently this is just a plan to release a new interactive gaming experience next week. So we'll see when it comes out. It seems a lot like, if you remember, AI Dungeon: basically a sort of interactive story that generates text and images, and you go on a little kind of D&D-esque adventure where you try to go through some scenario. It's a video game in a very loose sense. And so the preview of what we've seen with the screenshots indicates it's like that, but more polished, which personally I think is pretty cool. That's a fun use of AI that has maybe not been entirely explored to the extent that it can be.
[06:56]
A
Yeah. The context here too is, so they've been known for moving into Hollywood and helping big studios put together movies cheaper, you know, as they say, like 40% faster or something like that. They're contrasting, though, the speed at which gaming companies are moving to adopt this with the traditional kind of Hollywood movie industry, as being a lot faster. Which does make sense, right? I mean, you think about the baggage that's inherited by the Hollywood kind of movie complex, right? There's a lot of stuff, even just looking at the labor union side, Screen Actors Guild, all that stuff just sort of slows your implementation ability in a lot of ways, whereas in the gaming industry that's less of an issue. Right. You have a lot of, you know, indie gamers, for example, who are going to be very, very quick to pick this stuff up. And even big kind of AAA studios. So really interesting from just a pacing standpoint. Just another note here: Runway has apparently been in talks with Meta about a possible acquisition. It seems like this is alluded to in this article kind of as a non sequitur, but it's still worth flagging. So they were talking about it, and Valenzuela, one of the co-founders of Runway, says, I think we have more interesting intellectual challenges being independent and remaining independent for now. So it seems like the talk of the Meta acquisition is now sort of falling through based on that. Which itself is interesting, right? Because for Meta to start playing with Runway, you could see why they want to do that, obviously for content generation, especially given Meta's recent challenges in AI, and having actually sort of high-end generative capabilities could be an interesting acquisition target. But especially on the gaming side, it starts to look a little bit, in a way, like Netflix's move into gaming. You know, that's kind of one angle you can start to imagine Meta playing with, but that looks like it's not going to happen.
So interesting that fell through. At the same time, Meta is obviously developing its internal capacity for AI on the superintelligence side of things with all those acquisitions. So kind of an interesting story about many different things, but Runway definitely is an interesting company to watch.
[08:54]
B
Yeah, I'm a fan. They've been around for quite a while in the landscape of AI startups and have focused very much on sort of a professional side of tooling rather than the cool side of generative models, for the most part. They do have their own generative models, which are not frontier. They aren't quite as good, but they focus a lot on kind of deep integration to make them usable as part of a more sort of editing tool suite kind of setup. So curious to see what this has. Sadly, I don't have access to this yet. I'm seeing "you don't have access to Game Worlds yet." I want to try it out when I can.
[09:33]
A
Yeah. One last comment too, on the frontier model side, with companies like Runway, and you're talking about the tooling as being the thing they've focused on historically, less so the models. There's this famous story of Microsoft in the early days as they're deciding which way they're going to go strategically as a company. They end up going in the direction of, obviously, as we know, making software that's really expensive and high margin, on the basis that there's a bunch of companies that can make laptops or computers, right? So the famous phrase here is "commoditize your complement." So software is complementary to hardware, and if there's a bunch of different companies making hardware, then the price of that hardware gets driven down, not to zero, but it gets driven down. The margins get driven down to zero. So you got a bunch of people competing really hard to make really cheap hardware. But if you own the integration point of that stack, which is the software, the operating system layer, the application layer, as Microsoft does, your margins can crush it. Right? So this, in a sense, is what a lot of companies are trying to figure out in the AI landscape. You've got all this competition at the model layer. A million companies making AI models, especially computer vision, you know, especially video generation, that sort of thing. But where is the value bottleneck, right? Where is that aggregation point? What is the complement to the commoditized models? And maybe it's the tool chains, maybe it's some kind of home base for user accounts or something like that. It's really as yet kind of unclear and nascent. Obviously hardware is one choke point in the value chain. But this is also part of it: you can think of Runway's strategy there as being, okay, let's not compete with everybody else on the thing that's already commoditized. Let's try to focus on an area that's less touched and that may become a chokehold in the value chain.
That's the only way you can really compete if you're not OpenAI, if you're not Anthropic, right? You need either that scale or favorable value aggregation points in the chain.
[11:25]
B
Moving on, a couple quicker stories. First up we have Google embraces AI in the classroom with new Gemini tools for educators, chatbots for students. So they've introduced 30 new AI tools for educators at, apparently, an EdTech conference, and that includes a version of the Gemini app tailored for education. The Gemini AI suite is now available for free for all Google Workspace for Education accounts, and that has various features like lesson plan generation, personalized content creation. Teachers can create custom AI experts called Gems, which you can also do in Gemini, and so on. Yeah, it's a pretty sort of customized, wrapped version of their tooling. If you go and see, they have actually a little kind of special UI for classrooms in this suite of theirs. And with Gemini there's a ton of sort of pre-built things like generating a lesson plan, a quiz, brainstorming project ideas, lots of sort of suggestions for use cases. And this is following up, I believe, OpenAI; Anthropic also has an edu version of their service. Kind of notable in a sense, because that's one of the big use cases. Students are using this stuff like crazy. Teachers are also using it quite a bit, from what I know, in terms of grading and preparation and so on. They certainly need the help in terms of managing workload. So having more and more sort of native-for-education versions of these things is significant, because I do think the educational sector needs to figure out what to do with LLMs. How do you change education now that homework is, let's say, much easier to do than it used to be?
[13:25]
A
Yeah, it's the age-old question, right? Are LLMs calculators or are they something that actually atrophies brain function, that sort of thing? But yeah, I think one of the interesting questions is when do we really move from, you know, the frame here is Google AI tools for educators. When do we move from that frame to Google AI tools as educators? Because we are on that continuum. You know, speaking from painful experience doing my physics degrees, at least in physics, and Andrey, you can tell me how this is in CS, but profs are terrible teachers on average, right? You'll have like, you know, two or three profs who knock your socks off, and then the rest of them are just garbage. Like, you know, they want to do research. They're not there to teach, really. It's not their passion, it's not what they're great at. And so they're kind of competing against a suite of products that has been increasingly optimized to do a better job than they can. And at a certain point, I wonder if, you know, teachers, professors, all this stuff, there's a transient where they basically are just sherpas guiding you towards the best kind of generative AI tools that they're aware of to get you started. Over time, even that function obviously gets automated away, but I would expect it'll come with a lot of resistance, especially, you know, on the teacher end, because, well, I mean, you've got unions, you've got entrenched interests, nobody wants to be replaced. But that transition point is going to be really interesting, because the people who are by and large doing the measuring of the effectiveness of these tools, right, who are writing those educational theory papers or whatever, are people in the system who have an incentive to potentially, eventually, pretend that these tools can't automate quite as much as they can.
So I actually think this is going to be an interesting point of friction as we look at how regulation and entrenched interests can slow adoption where really you would want more adoption, faster. Anyway, my two cents. But there it is.
[15:10]
B
Yeah, I wouldn't say most professors are terrible, but I'll tell you, in physics, it feels that way. I think in CS, because their classes are super popular, typically, maybe there's more. And there's many PhD students who are very motivated to do a lot of work. Professors are typically overloaded and have very limited bandwidth. And that's, I think, true for a lot of teachers across education in general. So these tools that are specifically aimed at educators could hopefully help with that. And one other thing noted in the article is that this is coming alongside updates to managed Chromebooks, more on the side of students, including a new teaching mode where teachers can connect directly with students. And I think that reminds me that Google has made some inroads on the device side with Chromebooks. I don't know the comparison of Chromebook versus tablet, but they've made, at least to some extent, pretty significant inroads in terms of what devices students get to use. So together this could be significant. And next, meetings. Not a new tool, but I think a fun discussion of where certain tools are being used. The title of the article is No one likes meetings. They're sending their AI note takers instead. And this starts with a little fun anecdote of a person named Clifton Sellers going to a Zoom meeting where there are more AI note-taking bots than human participants. So if you're in, like, a Zoom meeting or a Teams meeting, Google Meet, et cetera, you can presumably often be seeing this now in a lot of major companies. There's many providers of these things, including Google Meet and Teams and Zoom themselves, plus some others. So yeah, amusing to me as someone who works at a tiny startup where excessive meetings aren't really a thing. Like, if you go to a meeting, you're going to be doing some talking and some actual useful work.
But having been, or having known people who work, at big companies, and the general reputation they have, kind of not too surprising to see the trend.
[17:42]
A
I've always sort of struggled with this. Obviously meetings are the enemy by default in startups. People just naturally have an aversion to having meetings unless they're necessary. So this is definitely a more foreign thing to me. But yeah, I mean, you know, it seems like something that could easily happen, and I wonder what the failure modes end up looking like when everybody or most everybody in these meetings is just an AI agent.
[18:06]
B
But yeah, a couple more stories from Google. There just wasn't any major stuff, so it wound up being a little heavy on the Google side. We have a new app from them called Doppl, which uses AI to visualize how different outfits might look on you. This is an app that is being released on both iOS and Android in the US and, yeah, lets you try out outfits. You upload a full-body photo of yourself and you can use images or screenshots of outfits to see how they would look. This is generating actually both static images and AI-generated videos, which is kind of neat. Yeah, something we've seen already in the AI space. I'm sure this has already been built by companies and so on for quite a while, but this is obviously better than what you might have gotten before. With the use of Veo and tools like Google Gemini editing nowadays, you'll have a very, very good preview with AI. And the last story is, somehow, Google's Imagen 4. So this kind of sneaked under the radar. Google has introduced their latest text-to-image model, Imagen 4, which is, you know, the follow-up to their main line of text-to-image generators. They had Imagen 3 for a while. Now you have Imagen 4, also Imagen 4 Ultra. And they say that this is better at handling very specific prompts. Basically what we've seen with AI image generation now is that the focus is on prompt adherence and being very good at things like spatial layout, preserving text, like all of these subtler details, rather than just making realistic-looking stuff. And yeah, it seems like it should be a big deal, but nobody was really excited about it from what I could see. And it does sort of not introduce anything fundamentally new from Imagen 3 or anything else we've seen. It's just the latest and greatest.
[20:22]
A
Yeah, again with these image models. And this is strictly a function of the fact that I don't spend my time focusing on image generation in my work. Right. This is something I'm following more or less as a passive observer, but the incremental advantage of these models over each other is something that feels pretty opaque to me. Like, how is Imagen 4 better than Imagen 3? On things like writing, what kinds of things can Imagen 4 write that Imagen 3 couldn't? That's a bit fuzzy to me. I get the impression we're saturating this stuff. Surely we must. But who knows? Maybe at a certain point you want to be able to put paragraphs and paragraphs on an image and have it be faithfully represented. I imagine that's going to be the case if you look at content assets that you want to put in a movie or something. I don't know, super high faithfulness, high resolution use cases. But the images that they show are beautiful, no question. They always are. So I'm kind of like, looks good.
[21:22]
B
Yeah, they do have some impressive examples in the release. So with Imagen 4 Ultra, we have a prompt like: "A three-panel cosmic epic comic. Panel 1: tiny 'Stardust' in nebula; radar shows anomaly (text 'ANOMALY DETECTED'), hull text 'Stardust', pilot whispers. Panel 2: bioluminescent leviathan emerges; console red text 'WARNING'. Panel 3: leviathan chases ship through asteroids; console text 'SHIELD CRITICAL', screen text 'EVADE', pilot screams, SFX." Anyway, and it's, you know, it does all that very, very faithfully. There are some quirks in the rendering that you might say are not quite right, but it's way, way, way beyond what you would have been able to do previously with things that involve composition and placement and so on. And moving on to applications and business. First story is about Meta and Mark Zuckerberg's drive to set up the Meta Superintelligence Labs division. We've known about it for some time. We've been covering some of the stories; I believe we talked last week about some of the hires that have been announced in terms of people from OpenAI. But this week it kind of got formally kicked off. Zuckerberg announced it internally to Meta, and a bunch more people have been announced to have joined. And in fact, I believe this is a new update too: in addition to Alexandr Wang, former CEO of Scale AI, Nat Friedman, former CEO of GitHub, will also be joining to lead this division, which now has 11 AI-focused employees from Anthropic, Google DeepMind and, in particular, OpenAI. We have like 8 people from OpenAI from across the company, people who've worked on very significant things like o3-mini and GPT-4o and so on and so on. There have been some leaks as to the pay packages. Not quite as absurd as, I believe, what Sam Altman was saying in terms of offering people $100 million upfront, but still high, I've heard.
[23:46]
A
Dylan Patel, right, was on a podcast recently talking about what he had heard was a $1 billion package that was pitched at somebody. I don't think they took it, but it was pitched at somebody. And then I've seen other headlines about 300 million. So it seems like there's a range, and there are some people who have been offered more and some a lot less, which, you know, is what you'd expect. But it seems very ambiguous at this point.
[24:09]
B
It's significant to the extent that within OpenAI, Sam Altman sent a memo to the staff on Saturday and basically addressed this, saying that, yeah, Meta has been pretty aggressively recruiting senior researchers. And apparently a quote from the memo is, "someone has broken into our home." And later, I think that was Mark.
[24:36]
A
Chen, if I recall. Like, Mark Chen said it feels like someone's broken in, or something like that.
[24:40]
B
Yeah, yeah, yeah. This is from messages on Slack. So yeah, it's clearly a big effort and a big investment on Meta's part. And the stock hit an all time high on these announcements. So I guess it's paying off from a stock perspective.
[25:00]
A
Yeah, I mean, you know, I think I alluded to this last episode, but I mean, what a repudiation of Yann LeCun's philosophy, which could not have been less OpenAI. Right. And then all of a sudden, basically Zuck says, you know what, we're doing superintelligence. We're calling it that. And also we're going to do it with, like, 8 out of 11 or something of our hires being OpenAI guys. And then Alex Wang, one of the few non-OpenAI guys, is from a company called Scale AI, but he also has the inclination towards the superintelligence perspective. That deviates quite significantly from Yann LeCun. So it's quite an interesting situation. I'm very curious where Yann LeCun ends up. He's been interestingly silent this whole time. That's kind of noteworthy for a guy who's usually bombastic on social media. Yeah, like, you know, obviously my biases are well known on this podcast, but it does feel like it's Meta sort of coming around to that view after spending so long in the woods. But this is a really interesting series of acquisitions of personnel. Right. So what you have to do if you're Meta is you have to shake up the game board in a serious way. Right. People were just not interested in joining a fast follower open source lab, or even a slow follower open source lab, as it started to look like. And so, you know, you've got to hire the best, get people jazzed about working at Meta again, and that means you need to refound the company. They will refound the AI part of the company and make it clear that you have top cover. The pitch now is going to be very compelling, actually. Right. You look at the caliber of the people who've joined, the fact that Zuck owns the majority voting shares for Meta, so he can make unilateral decisions that other people just can't. That's a really interesting pitch.
And so, you know, you've got this massive compute fleet, you've got a lot of data because you're Meta, you've got Zuck's backing, and now this really interesting team behind you. The big open question remains: what is Meta's position on alignment, on technical safety, on security? Right. I mean, again, LeCun's is well known: fairly dismissive of it, with some asterisks here and there, and, you know, there's nuance. But Alex Wang, you know, certainly; Safe Superintelligence, with Daniel Gross coming over from there; and there's one Anthropic hire, I believe. These are places, including DeepMind, that have historically oriented towards more of an alignment-friendly perspective. I'm personally really curious what we'll hear in the coming weeks and months about their position on this, but they are a live player. I would, you know, call them a tier three player for now, along with, you know, SSI and to some degree xAI, with maybe Anthropic at number one right now and OpenAI at number two, something like that. You could debate those all day, but it seems like they've made themselves a live proposition with this play. That's at least how it reads to me.
[27:41]
B
Yeah, and LeCun is interesting. Now, as far as I know, Meta will still have FAIR, their internal AI research division that publishes quite a lot of research, and to be fair they've published a lot of very significant research on LLMs. They're not sort of refusing LLMs, but Yann LeCun has pretty famously been arguing that LLMs will not lead to AGI or ASI and that some other techniques beyond that are needed. So this will live alongside the existing research group and presumably be less focused on publication and traditional academia and more focused on going head to head with OpenAI or Anthropic in terms of building out the tech, and probably not doing as much research on the alignment side as Anthropic, for instance, has been doing. And on the compensation side, yeah, the reports are saying not so much upfront offers, but you see offers upwards of 100 million in the first year, 300 million over the span of four years. And there are of course nuances there of stock versus cash and so on, but some crazy big numbers, and I think numbers that they probably need to offer for people to leave OpenAI, Anthropic and DeepMind. Because at the end of the day, from what I know of people who go to these sorts of companies, typically they're a bit more startup oriented, they're more interested in smaller companies. Sometimes they have some background at Google, for instance, but even DeepMind is a bit different from Google itself, right, as being more of a research division up until recently. So in some sense it's not too surprising they had to go all out to convince people to flip from DeepMind, Anthropic and OpenAI to go to Meta. And next up, actually, let's talk about Anthropic. We've got some news that they've lost some major talent, or a couple of people in major leadership positions, and those went to Anysphere, the makers of Cursor.
So Cursor has hired Boris Cherny, who led the development of Claude Code, and he'll be joining Anysphere as chief architect and head of engineering starting this week. And Cat Wu, the product manager for Claude Code at Anthropic, is joining as the head of product at Anysphere. For reference, if you don't know, Cursor is the leading, or maybe not the leading, I'm not sure how the market stacks up against VS Code, but at least among people who use AI heavily, Cursor is seen as the tool for AI coding, the development environment that is leading the pack in terms of quality. And they've grown explosively. Their valuation is crazy. Claude Code did disrupt that to some extent, I think. Certainly as someone who has used Cursor, once I adopted Claude Code, I found myself using Cursor a lot less and even moving back to VS Code. So it kind of makes a lot of sense for them to make this aggressive move. The article also notes that this is happening as Anthropic's revenue is hitting 4 billion annualized, which is quite a lot.
[31:08]
A
Yes, you heard it here first, guys. $4 billion is quite a lot.
[31:15]
B
Sometimes I gotta put in that commentary, you know, that's really deep, insightful stuff.
[31:20]
A
You know, I feel like we've gotten so desensitized to hearing numbers like billions and gigawatts and all this stuff. Like, it happened pretty fast, didn't it, over the last few years? Yeah. And that's revenue too, right? We're not talking about valuation. Of course, the big question has been, historically, can these companies live up to the hype? Can they actually generate revenue? Anthropic's done really impressively on that front. I think their last fundraise had them at a $60 billion valuation. I believe they're currently fundraising now, so we'll see what, you know, that valuation looks like. But they're also, for context, I vaguely remember they're burning, net, something like 3 to 5 billion or so. So, you know, there's still a lot of burn going on for all the CapEx and all the talent as well. But on the talent side, yeah, this is interesting because Anthropic has been far and away the big winner. And we covered this, I think, a few episodes back on the recruitment or talent wars, right? Like something like an 8 to 1 ratio of people leaving OpenAI for Anthropic versus people leaving Anthropic for OpenAI. So far, far more people moving towards Anthropic than away from it. Similar, in fact even more extreme, numbers relative to DeepMind. And so it's rare that you see these big poachings from Anthropic. And part of that presumably too is that with Anthropic it's clear what they stand for in terms of the safety side, the security side and all that. And so people are not necessarily just working there for money. At least that's what I've seen quite clearly talking to folks at all these different labs. But this is interesting. Right, so you've got Cursor moving in on this talent. Obviously all of Cursor's chips here are on the code generation stuff, so they can afford to kind of plow more money into these poachings.
And when we talk about, you know, Cursor hitting $500 million in ARR compared to Anthropic's 4 billion, that may sound smaller, but, you know, it's over 10% of what Anthropic is making. So they're in maybe not quite the same orbit, but approaching that same orbit. So in any case, we'll see what ends up happening here. But with the Cursor and Anthropic thing, you can sense the nerves, the anxiety about, you know, maybe upsetting Anthropic. That's coming from Anysphere, which is the company that owns Cursor. So one of their co-founders, Sualeh Asif, I'm probably butchering that, did refer to Anthropic as, quote, one of our closest partners in the context of this article. So you can kind of see it's like how nobody wants to piss off Jensen right now at Nvidia, you know, sort of a similar thing. There's a lot of dependency here and these are dicey moves.
[33:45]
B
Yeah, exactly. If you use Cursor, within the tool, you know, it allows you to use any of the LLMs, really. You can choose which LLM you want to use, Gemini or Claude and so on. And that, I think, is a big reason why you wouldn't want to upset this relationship. It is pretty important for Anysphere and Cursor to remain friendly with the providers of frontier LLMs. At the same time, it does feel like Claude Code, and generally the movement towards agentic tooling that goes beyond the traditional development environment, with Claude Code being kind of a standalone thing outside of where you look at and write code, which is what Cursor is, has really kind of upped the extent to which agents and agentic AI can do a lot of coding without you manually supervising it. So yeah, to me the surprising bit is not just that they hired these people, but that they gave them these roles of chief architect and head of engineering and head of product. I mean, that is big, right?
[34:58]
A
Well again, you know, you got people who work for stuff that's not necessarily just money. And if you're going to convince them to move over, they're going to need a sense that they're able to shape the direction of the company. So this is part of the, you know, part of the pay package in a sense.
[35:11]
B
And the next story is also about Anthropic, actually, kind of not directly related to any business updates, but related to the economy, I guess, in general. Anthropic launched the Economic Futures Program to research AI's impact on the labor market and global economy. They want to provide, yeah, basically evidence-based insights into the economic effects of AI. And this is coming after, in recent months, Anthropic CEO Dario Amodei has been talking about pretty dire consequences of AI, saying that AI could eliminate half of all entry-level white-collar jobs in the next one to five years and that this would push unemployment up to 20%. So not too surprising they are moving in this direction, I suppose. And we just talked about Claude Code and Cursor. Definitely gonna be a hit on the job market from that. I think it's already here. In fact, if you're an entry-level engineer, if you're in the programming business and more junior, it's probably getting a lot harder to get a job.
[36:29]
A
Yeah, I had a conversation with somebody pretty senior up at one of the frontier labs who just straight up said, our expectation is that we will never hire another developer with less than 10 years of experience again. That's pretty amazing, right? And obviously that's a frontier lab. They have access to the internal models that they build, all the best stuff. But if that's not a canary in a coal mine, I don't know what is. Right. Even if you froze AI capabilities today, I think it's quite credible that you would see that kind of effect, at a minimum, propagate throughout the economy. And that's a big issue. Right. Those entry-level roles are how people jump to, obviously, mid levels of seniority. So this is a really interesting play for Anthropic. It's a kind of big package of things. So they're looking at putting together this Economic Futures Program that includes Economic Futures Research Awards, which are like $50,000 grants for empirical research on economic impacts from AI, which again I think is this really important space. Like the empirical side. There's a lot of theory happening right now. You know, you've got Epoch AI coming out with their plots, and then you've got AI 2027 coming out with their plots, and people are comparing plots and, you know, it's semi evidence-based. But then also, you know, we need just more empirical research to ground these predictions and analysis in. They're also setting up symposia so they can bring people together to talk about stuff, and setting up strategic partnerships with research institutions. So finding all these ways to kind of drum up more interest and resources pointed in this direction, which again seems like something that's probably helpful if you believe anything like the 2027 timelines that a lot of people in the space do.
[38:05]
B
And moving to some hardware stuff. First we have the story that OpenAI is saying it has no plan to use Google's in-house chip. So apparently there have been recent reports saying that OpenAI had been considering using TPUs, had been testing TPUs as an alternative to GPUs, with OpenAI having invested heavily in Nvidia GPUs, as have most other companies. OpenAI is also developing its own chip to compete with the TPU. I guess this came out in the context of OpenAI having signed up for Google Cloud services, in a continued trend of splitting slightly from Microsoft. They have been on Azure from early on as a primary source of compute. So yeah, significant perhaps primarily because of what we know about them trying to create an alternative to TPUs.
[39:05]
A
Yeah, and it's also interesting from almost a historical standpoint, right, you've got two trends that are now intersecting. On one hand we've got this trend of OpenAI trying to shake itself loose from Microsoft more and more. Right. We've had a lot of stories that have covered this in different ways, but fundamentally, you know, Microsoft was supposed to be their cloud partner of choice. Now they're getting a bit of cold feet in terms of providing all the infrastructure support that is needed for things like Stargate. And so they've allowed OpenAI to work with Oracle. And allowed is the right word here, by the way. Microsoft does have right of first refusal on these big kind of infrastructure build-out contracts. And that's where, you know, Oracle and Crusoe and all these guys are coming together to do Stargate. And OpenAI is then looking for other partners outside Microsoft. At the same time, Google is now just starting to push out in the direction of third-party partnerships, trying to find ways to make their TPUs available to companies that are not Google. That historically was not a thing. Right. Google was a lot more protective about access to their TPUs. And so these two trends are kind of intersecting each other and causing what otherwise would have been an almost unthinkable partnership, because Google so famously is the home of Google DeepMind, which for so long was the one big rival to OpenAI. And so I guess time heals all wounds. And at least they've been testing out these TPUs. It does seem that they're not going to actually go for them ultimately, which itself is interesting. You know, this is something that The Information, I won't say got wrong, but certainly their initial headline earlier this week made it seem like OpenAI was actually going to go forward with this, and there were some corrections issued in that article. So yeah, last quick note, I guess, is that you mentioned OpenAI is developing their own chip.
They're ready to hit that tape-out milestone this year. So tape-out is when you ship the finalized design for manufacturing, presumably in this case to TSMC. They are partnering with Broadcom on this, but they basically have their chip design that will be finished this year. That's a big, big deal. And then we'll have to see how long it takes to ramp up production and all that. But that's a big thing and something you can believe we'll be keeping an eye on too.
[41:09]
B
And next up, Nvidia is one of the investors in a new startup that has just emerged out of stealth. The startup is Emerald AI, and their focus is on kind of a deeper connection of data centers into the energy grid. So they provide software that allows you to shift AI workloads at and between facilities and basically tie them to local power usage so as not to put as much strain on the grid, which is increasingly an issue in, I suppose, the US and some specific regions where the hyperscalers are building their data centers.
[41:56]
A
Yeah, this is kind of an interesting play, right? When you have so much power that is being soaked up by data centers and that power is being used in very inconsistent ways, right? Like during a training run. I mean, locally you have these giant boom-bust cycles of power consumption with backpropagation and all this shit. But over longer stretches of time you also just have, like, sometimes a training run is going, sometimes there's fine-tuning. GPU utilization isn't always as high as it could be, and so you've got high amounts of variation, and what you might think of as a 50 megawatt or eventually a 1 gigawatt data center is not exactly always consuming 50 megawatts or 1 gigawatt of power. And so the question then becomes, okay, well, if data center number one is sort of in a slower period and data center number two is ramping up, can we arbitrage over that? Can we have some kind of orchestration function on the grid that allows us to sort of load balance in a more dynamic way? And that's really what Emerald AI is doing. They think their modeling suggests that they could unlock up to 100 gigawatts of US data center power capacity, which would take load off the grid and essentially allow you to free up more gigawatts for fresh data center builds. That's super interesting. A structural vulnerability they may have is if we just develop more efficient, high-utilization algorithms and hardware stacks for these models, such that there's just less value in arbitraging between them. But this is certainly a really interesting play. Like, haven't seen anything like this. It almost turns load balancing into a fresh source of significant power on the grid. It's the equivalent of building a huge amount of power, right? 100 gigawatts, if this works as promised.
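The orchestration idea described in this turn, moving movable work from a site that is over its grid allowance to a site with spare headroom, can be sketched as a toy greedy rebalancer. This is a minimal illustration under stated assumptions, not Emerald AI's method: the `Site` fields, the split between "firm" and "flexible" load, and all megawatt figures are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Site:
    name: str
    grid_cap_mw: float   # power the local grid can spare right now (assumed input)
    firm_mw: float       # load that must stay put, e.g. live inference
    flexible_mw: float   # load that can be moved, e.g. batch training


def headroom(site: Site) -> float:
    """Spare power at a site; negative means the site is over its cap."""
    return site.grid_cap_mw - site.firm_mw - site.flexible_mw


def rebalance(sites: List[Site]) -> List[Site]:
    """Greedily move flexible load from overloaded sites to sites with headroom."""
    out = [replace(s) for s in sites]  # work on copies, leave the inputs untouched
    for src in out:
        for dst in out:
            excess = -headroom(src)
            if excess <= 0:
                break  # src is within its cap, nothing more to move
            if dst is src:
                continue
            # Move no more than src needs to shed, dst can absorb, or src can give up.
            move = min(excess, headroom(dst), src.flexible_mw)
            if move > 0:
                src.flexible_mw -= move
                dst.flexible_mw += move
    return out


# Hypothetical numbers: dc-1 is 10 MW over its cap, dc-2 has 20 MW of headroom.
sites = [
    Site("dc-1", grid_cap_mw=50, firm_mw=20, flexible_mw=40),
    Site("dc-2", grid_cap_mw=50, firm_mw=10, flexible_mw=20),
]
for s in rebalance(sites):
    print(s.name, s.firm_mw + s.flexible_mw, "MW drawn against a cap of", s.grid_cap_mw)
```

A real system would also have to account for data movement costs, job checkpointing, and time-varying grid signals, none of which this sketch models.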
[43:43]
B
Interestingly, the founder and CEO is Varun Sivaram, who has a background in physics and has worked as a senior aide to ex-climate envoy John Kerry. So yeah, there are some notable names amongst the backers of the startup beyond Nvidia. They include some individuals like Google chief scientist Jeff Dean, Fei-Fei Li and some other ones. And one last story also dealing with hardware. The story is that TSMC Arizona chips are reportedly being flown back to Taiwan for packaging. So packaging, as we've covered in our long-ago hardware episode, which we refer to like every other or every single episode, is an important step in the production pipeline for chips. And the fact that TSMC is flying chips made in Arizona back to Taiwan basically means that Taiwan is still, you could say, a bottleneck, or at least crucial to the chip supply chain for AI in particular.
[45:00]
A
Yeah, and this is actually something that we called out in our latest investigation, the America's Superintelligence Project thing, where we looked into supply chain risks, among other things. You know, thinking about how China could undermine American superintelligence research or add risk to it. You know, we pointed out like, hey, everybody's talking about packaging as if it's a solved problem, but if you actually look under the hood, there are structural reasons why you might think it could take a little bit longer than expected to have that online. And that creates issues just like this. So, essentially, go back to our hardware episode to look at the details, but you've got two fundamental core kinds of chips on an Nvidia GPU. You've got the memory, the HBM, and then you've got the logic die that actually does the computing, right? So the memory stores the data, the logic die does the computing, and there's this dance between them. They've got to talk to each other crazy fast because the logic die has to fetch data from memory to do math on it and then send it back. So to get them to talk to each other really fast, you have to set them both on a substrate as part of this packaging process. The packaging process requires, today, this process called CoWoS. And CoWoS-L is kind of the latest. Well, I guess CoWoS-R is. But anyway, there's a bunch of CoWoS processes that can only really be done in Taiwan at scale. And so we've onshored a lot of the fab for the logic dies. What we haven't done, though, is onshore the packaging, and so we still have to ship them back to Taiwan. By the way, a big winner from all this is Taiwan's EVA Air: the demand for its air logistics services has massively risen recently. And this is why. So, anyway, kind of interesting. There are, by the way, plans to ramp up packaging quickly, including, by the way, as part of a $165 billion investment that TSMC announced for the US.
There were plans to create advanced packaging fabs that could do CoWoS here, but there's kind of been no progress yet. So until that gets solved, you know, it's all well and good to have great logic dies being fabbed onshore, but if you can't package them, you're not making chips.
[46:58]
B
And moving on to projects and open source, as we mentioned at the beginning. First up, we have the announcement of ERNIE 4.5, the release from Baidu of a whole bunch of models under an Apache 2.0 license. Apache 2.0 meaning you can use it for whatever, including commercial purposes. And there are 10 models here in total, including just text-to-text, VLMs with image plus text to text, MoE models, dense models and so on. But the big one is a model with a total of 424 billion parameters, with up to 47 billion active parameters. And they release a bit of evaluation, comparing primarily to other open source LLMs without a reasoning focus. So they compare to DeepSeek-V3 and Qwen3, and against those models in particular, they have better performance on most of the typical benchmarks. And yeah, this is somewhat significant because, you know, we have Llama, we have DeepSeek-V3 on the big model side, like actually quite big models with 400 billion plus parameters. Now there's another option for people who want to develop, seemingly a quite good option to build on top of.
[48:30]
A
Yeah, and this is actually kind of an interesting structural choice. Right. So they have the full version. This is like ERNIE 4.5, the 300 billion parameter version, which has 47 billion active parameters. And so it's about half the size of DeepSeek-V3, which is the base model for R1 of course, but it's got more active parameters, and so as a fraction of total parameters it's got maybe three times the active parameters of DeepSeek-V3, and it's smaller scale. So sort of an interesting trade-off there in terms of, you know, let's say FLOPs versus memory. Anyway, they do make a point of saying it outperforms DeepSeek-V3, the full version, on the vast majority of benchmarks, 22 out of 28. I will say this is not super shocking, right? I mean, V3 came out in late, late 2024. Yeah, that's right. So that's what, six months ago? More than six months ago. And so when you have a model that comes out six months later, it's like, oh, okay, it can beat V3, I mean, sure, fine, but it's not exactly a ringing endorsement. It doesn't mean that much at this point. And the article also comments, and I don't know that this is a terribly meaningful comment, that there are no published comparisons with industry leaders like DeepSeek's R1, OpenAI's o3 or Claude 4. Well, yeah, because these are agentic models. Right. So that's not apples to apples. It is fair to compare the base model ERNIE 4.5 directly to DeepSeek-V3. What's going to be interesting is when they come out with X1. So this is going to be the agentic model that will be directly comparable to R1. It's not part of the current release yet, but we'll see, you know, we'll see what the performance is like there. Yeah, ultimately I think we've got to wait and see what the actual usage is. Obviously usage of DeepSeek R1 is down significantly, I think something like 30% quarter over quarter in the last couple of months.
So you know, there is a bit of an effect here of the export controls, the lack of an R2 release that's competitive with some of the other open source models that have come out. So we'll see. This may be part of the export control story as well.
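The "maybe three times" comparison above can be checked with a few lines of arithmetic. The ERNIE 4.5 figures (300 billion total, 47 billion active) are the ones quoted in this turn; the DeepSeek-V3 figures (671 billion total, 37 billion active) are not stated in the episode and are taken from DeepSeek's published model description. Treating active parameters as a proxy for per-token compute and total parameters as a proxy for weight memory is a rough simplification.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a mixture-of-experts model's parameters used per token."""
    return active_b / total_b


# Parameter counts in billions.
ERNIE_TOTAL_B, ERNIE_ACTIVE_B = 300.0, 47.0        # as quoted in the episode
DEEPSEEK_TOTAL_B, DEEPSEEK_ACTIVE_B = 671.0, 37.0  # public DeepSeek-V3 figures

ernie_frac = active_fraction(ERNIE_TOTAL_B, ERNIE_ACTIVE_B)
deepseek_frac = active_fraction(DEEPSEEK_TOTAL_B, DEEPSEEK_ACTIVE_B)

size_ratio = ERNIE_TOTAL_B / DEEPSEEK_TOTAL_B        # proxy for weight memory
compute_ratio = ERNIE_ACTIVE_B / DEEPSEEK_ACTIVE_B   # proxy for per-token FLOPs
fraction_ratio = ernie_frac / deepseek_frac

print(f"ERNIE 4.5 active fraction:   {ernie_frac:.1%}")
print(f"DeepSeek-V3 active fraction: {deepseek_frac:.1%}")
print(f"total size ratio:            {size_ratio:.2f}x")
print(f"active parameter ratio:      {compute_ratio:.2f}x")
print(f"active fraction ratio:       {fraction_ratio:.2f}x")
```

Under these figures ERNIE 4.5 activates roughly 16% of its weights per token against roughly 5.5% for DeepSeek-V3, so "about half the size" and "about three times the active fraction" both hold approximately.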
[50:33]
B
Right. And it's also a bit of a shift for Baidu. To my knowledge, they haven't open sourced ERNIE before. For context, Baidu of course is sort of comparable to Google within China. A gigantic, super major company in the cloud space, and they do search, and they've been pretty early to the LLM game. Like, this is ERNIE 4.5. They released ERNIE quite a while ago, not too long after OpenAI had ChatGPT. This is their first, at least major, entry in the open source space. In addition to the models, they released a technical report that was like 43 pages, not counting the appendix, going into a lot of detail on the training, similar to what you saw from DeepSeek, which went into all of the nitty gritty of the infra and so on. And they also release tooling, so they release the training framework. They have ERNIEKit, they have some other stuff as well, the FastDeploy deployment toolkit. So a bit of a shift in terms of the dynamics. And yeah, over this year, I guess since V3 and R1, we've seen more and more open source releases of big models coming out of China. They're kind of taking the lead on where you can squeeze out performance, as opposed to, let's say, Llama. And speaking of that, the next story is about another Chinese giant open sourcing something. This time it's Tencent, you know, another of the gigantic companies, whose products a lot of people, maybe a majority, use. And they have introduced Hunyuan-A13B. This is an MoE model. So this has 80 billion total parameters. Only 13 billion are active at a time. Apparently it's been trained quite a lot, with a 20 trillion token pre-training phase, and it notably has a lot of training for thinking. So they did post-training with reinforcement learning with task-specific rewards, and they support fast and slow thinking out of the box.
They say that they have state-of-the-art agentic performance on benchmarks like MATH, CMATH and GPQA, surpassing even bigger models, which, as you said, I think would be very believable, given that it's been quite a while since R1 and there have been a lot of insights as to post-training for reasoning. No reason really why we shouldn't be able to squeeze out comparable performance with smaller models, and that seems to be the case here. Also released with a permissive license.
[53:25]
A
Yeah, it's an interesting release. I mean, the benchmarks, as you say, do look good. All the standard caveats that you mentioned. The most recent model that they compare it to here is Qwen3, the full kind of 22 billion active parameter version. And it compares favorably. But it's not like it blows it out of the water. It's better in some areas, worse in others. It's more sort of the incremental movement that you would expect from the space. I don't think that there's a huge, huge story here. Some of the standout features. Well, one of the standout features is the thing that's not a standout feature, which is that fundamentally this reflects yet another model build in the Chinese ecosystem that effectively mirrors the DeepSeek curriculum or training approach. Right. So you've got, like, yes, a pre-training phase, you have long context adaptation and then, like, fine-tuning, and then there's going to be an RL stage that is as yet unreleased, with fast annealing as well. So fast annealing is this idea where, towards the end of the training process, you rapidly decrease the learning rate so the model gets updated less quickly, let's say, as it gets trained on tokens towards the end of the run. And the idea here is just that as the model gains more experience, learns more and more, over time you should expect it to be making smaller and smaller adjustments, sort of honing in on what it needs to look like rather than continuing to kind of flop around. So they do that. The second standout feature is the fact that they use this thing called dual-mode chain of thought. Right. So you've got two different modes, and it's the same model. Right. So it's not like we're shipping a query to one model to do fast thinking and another model to do slow thinking. It's like one model that can route to either sort of sub-circuit within itself.
So one is this low latency fast thinking mode for more routine queries, and then you've got this slow thinking mode for multi step reasoning. So you can control this with a tag system. There's this no think tag for faster inference, and then for more reflective reasoning you pass it a think tag. So it's pretty straightforward to use. They also use reinforcement learning with task specific reward models. So basically you have reward models that are being used for the RL loop, and those are designed kind of in a bespoke way for a bunch of different tasks that you're training the model towards. So nothing too shocking either from the standpoint of the architecture or the performance or the optimization routine or the data. But overall it's another impressive player. Tencent showing again they actually can contribute. They can play in at least the open source game with other big companies.
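To make the fast annealing idea described above concrete, here is a minimal sketch of a learning-rate schedule that holds a constant rate and then decays quickly over the final stretch of training. The function name, the 10% anneal fraction, the linear decay shape and the default rates are illustrative assumptions, not details taken from the Tencent report.

```python
def annealed_lr(step, total_steps, peak_lr=3e-4, final_lr=3e-5, anneal_frac=0.1):
    """Hold peak_lr, then decay linearly to final_lr over the last anneal_frac of steps.

    All defaults are hypothetical values chosen for illustration only.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must be between 0 and total_steps")
    anneal_steps = max(1, round(total_steps * anneal_frac))
    anneal_start = total_steps - anneal_steps
    if step < anneal_start:
        return peak_lr  # constant phase: full-size updates
    progress = (step - anneal_start) / anneal_steps  # goes from 0.0 to 1.0
    return peak_lr + (final_lr - peak_lr) * progress  # shrinking updates


schedule = [annealed_lr(s, 1000) for s in (0, 899, 950, 1000)]
```

The point of the shape is the one made in the episode: late in the run the model should be making smaller and smaller adjustments, so the step size drops sharply instead of staying flat.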
[56:04]
B
Yeah, they did release a paper alongside this that has a decent amount of detail, particularly on the training. I forget the term, but with regards to the training in particular, they seem to generalize a bit. So they focus on a whole range of tasks, not just math and coding, but also creative writing, knowledge based QA, multi turn dialogue and so on. And for each of those they do supervised training and reinforcement learning. So they might be trying to generalize to domains outside the typical reasoning things. And the focus is more on the efficiency side. So the abstract kicks off at the very beginning by presenting it as a model that optimizes the trade off between computational efficiency and model performance. So the gist here is, for 13 billion activated parameters during inference, not a huge amount of parameters, this seems to give you quite good outcomes. Next we've got another open source release. This time it's an RL trained coding agent coming from Together AI. They call it DeepSWE, Deep Software Engineer, and this is based on top of a Qwen3-32B language model. So the focus is particularly on training the model to be a software engineering agent. And among open weight models this is a leading one on benchmarks like SWE-bench Verified. I will say I think we don't yet have benchmarks that really capture the types of things you can see Claude Code doing, in terms of very impressive tool calling and codebase exploration. A lot of these benchmarks focus more on solving GitHub issues and things like that. Not quite the same, but regardless, yeah, clearly making a push in the direction of having a Claude Code capable open model.
[58:18]
A
Yeah, and it's noteworthy that this comes from Together AI. We've covered a lot of their stuff in the past. Right. They have this philosophy of trying to do kind of fully distributed open source AI training, aggregating compute from around the world. There are a couple of different companies that are pushing in this direction as well, philosophically, but pretty impressive results, and trained on 64 H100 GPUs over six days. So, you know, not a huge, huge workload here but certainly non trivial either. It's sort of mesoscopic. Anyway, it's also a partnership between Together AI and Agentica, which is this other company that focuses more on frameworks for post training with language model agents. So they have this framework called rLLM that was apparently used here, presumably somewhat experimentally. So yeah, pretty cool. We continue to see more of these open source RL agents. Nothing jaw dropping in terms of performance here. It looks like it's solid. It's again a solid incremental boost to performance, especially on SWE-bench, or I should say SWE-bench Verified. Right. So this is a benchmark that does matter a lot, because this OpenAI kind of scrubbed version of the original SWE-bench benchmark mirrors pretty closely some standard software engineering practices. So to the extent that you're seeing this model score close to 60% now as an open source model, that's a pretty interesting and pretty big deal. So yeah, I mean a pretty solid move by Together AI. They continue to impress.
[59:48]
B
Yeah, and they also released a pretty nice report on the model with a bunch of authors. Not quite as deep as these previous ones in terms of the details disclosed, but some pretty nice details, in particular about the training recipe and the empirical results of training. So pretty useful for insights and research in the area. Next we've got some papers going in some slightly different directions. First we have GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. So this is getting into the vision language space. We've seen more research moving beyond just reasoning in text and towards reasoning with images. And that is the frontier of research. So they have a 9B thinking model that apparently is outperforming a bunch of other models, even surpassing really big ones like Qwen2.5-VL-72B. So pretty meaningful results in the reasoning with visual input space. And they open source both the reasoning model and the base model.
[61:14]
A
Yeah, there's I think a lot that's interesting about this paper and this sort of philosophy, this approach. So it is still modular. The goal here is to have a model that can look at images and text at the same time and do reasoning on them. So they have a vision encoder that basically looks at images. It can take videos, actually. So if you're familiar with convolutions, convolutional neural networks, typically those are two dimensional convolutions. So you kind of look at a patch of an image and, anyway, you do some math on it, you apply a filter to that patch and it gives you a smaller patch, and then you can apply another filter to that smaller patch, and on and on and on. And that's kind of how convolutional nets are made. Well, this is that, but instead of a two dimensional patch of image, it's a three dimensional patch of video, where the third dimension is time, basically, and you can do strided convolutions. If strided convolutions don't make sense to you, that's fine, don't worry about it. But fundamentally you're looking at a three dimensional patch across time now of your image, or therefore of your video. And from there they're able to encode the image or the video in the model, and then they take that encoding and pass it through basically just a simple feed forward neural network, like an MLP, to get the dimensionality to match the dimensionality of their language model. So basically they're just taking the vision encoded latent representation and they're mapping it to something that matches the dimensionality of the latent representation of the language model, this GLM based language model that they use, which is from Zhipu AI, and then they're just going to concatenate those two together so you now have a unified image and language representation that you can basically just do LLM work on in the usual way.
The language model itself though is really interesting. They use bidirectional attention which because we're doing a speedrun on this story, it's in the lightning round, I'm not going to get into why that is. If you know bidirectional attention then anyway it just makes sense to do this when you're doing reasoning on images more than just going full auto regressive and just like a decoder only model. Anyway, the reasoning behind that is sort of intuitive, but I guess only if you have the intuition.
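The shape bookkeeping behind the strided 3D convolution and the concatenation step can be sketched as follows. This is only a rough illustration: the kernel and stride of (2, 14, 14), the one-token-per-patch assumption and both function names are hypothetical choices, not values confirmed by the paper. The output-size formula itself is the standard one for convolutions.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def video_patch_grid(frames, height, width, kernel=(2, 14, 14), stride=(2, 14, 14)):
    """Grid of (time, height, width) patches from a strided 3D convolution.

    The default kernel and stride are assumptions for illustration.
    """
    sizes = (frames, height, width)
    return tuple(conv_out_size(n, k, s) for n, k, s in zip(sizes, kernel, stride))


def fused_sequence_length(grid, n_text_tokens):
    """Visual tokens (assumed one per patch) concatenated with the text tokens."""
    t, h, w = grid
    return t * h * w + n_text_tokens


grid = video_patch_grid(frames=8, height=224, width=224)
print(grid, fused_sequence_length(grid, n_text_tokens=32))  # (4, 16, 16) 1056
```

In a real model each patch would also pass through the MLP projector so that its vector width matches the language model's hidden size before concatenation; that step changes the feature dimension, not the token count shown here.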
[63:34]
B
It's intuitive. If it's intuitive, you know.
[63:37]
A
Yeah, like if this is a main story.
[63:40]
B
Yeah, yeah, lots going on with open source this week, clearly. And as you said, this is from Zhipu AI and Tsinghua University. They release, yeah, a pretty detailed report on this one as well, 18 pages, and they show how it can be used for things like long document understanding, GUI agents, video understanding, coding, stuff like that. So it seems to be a pretty strong entry in that space. And last up, just one more new model, this one coming out of the US for once, actually from Apple in collaboration with the University of Hong Kong. Not really the U.S. even, you know, mostly. The paper title is DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. So they trained a 7 billion parameter diffusion LLM. Quick recap. Most LLMs do autoregressive inference. Basically you train the model to predict one token at a time, or maybe a few tokens at a time, given your input. Diffusion is totally different. You're sort of predicting everything all at once. It's typically what you do for images, but very rarely what you do for text. But you can do it for text. And there's increasingly research in that direction. This is some of that research. They show, you know, some insights into training effectively, and get pretty good results even compared to non diffusion models like Qwen2.5-Coder.
[65:23]
A
Yeah, and we've skipped over it, I think, but this is the third paper this week that has its own take on GRPO. So this is group relative policy optimization, the reinforcement learning optimization routine that DeepSeek famously used and kind of made popular. Anyway, this is its own variant. I feel like maybe next episode we should carve out just five minutes, because the GRPO stuff is really important. It's something that you'll want to understand to get what the difference is between these different papers and their approaches.
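Since GRPO keeps coming up, here is a minimal sketch of its central idea: the advantage of each sampled response is its reward measured relative to the other responses drawn for the same prompt. The function name and the `eps` stabilizer are assumptions for illustration, and the use of the population standard deviation is one common choice rather than something stated in the episode. The paper variants mentioned above modify pieces around this core.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1 for correct and 0 for incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no separate value network is needed, which is a large part of why the method became popular.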
[66:01]
B
But yeah, one of these days, so many topics we could do deep dives on. We should do a reasoning episode, clearly. Yeah, we'll try. You know, we always want to do more, but life is busy.
[66:20]
A
Yeah. And moving on to research and advancements. And speaking of doing more, we've got this paper, Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search. Okay, so Monte Carlo tree search is this well trodden path in reasoning. Imagine that you start off with some attempted solution. Let's say you make a modification to the solution, you try to refine it, and now you have the choice as to whether you continue to try to modify the thing you just modified or whether you go back to the original branch and try to spawn another variant of it. Do you explore more, starting from that original node, or do you go deeper and exploit more, go deeper down the path that you had sort of started down? And there are two extremes, right. There are some models that just continuously refine the same output more and more and more. And in that sense they risk getting stuck in a rut. Right. They don't do much exploration. Another extreme variant is to say, okay, let's just sample a crap ton, like generate a bunch of different possible solutions starting from the same prompt, but never go deeper and refine any of them. And that's the other, more exploration heavy thing. Now in practice, what you want to do is balance those two, right? This is the classic reinforcement learning trade-off. And so what this paper is going to do is ask the question, what if we could vary in a principled way the number of different branches that we try versus the depth that we push to in each branch? Right. What if we could trade off in a principled way exploration versus exploitation in a tree search setting? Right. So that's what they're going to do. Normally what you do in Monte Carlo tree search is you'll fix the number of child nodes with a fixed hyperparameter. So you'll do that.
Then maybe you go to the next level, or you pick one branch and expand. But what they want to do here is kind of dynamically do it. And so it gets a bit involved. But the fundamentals are, they actually have an internal model that is being trained along with the main training loop to predict the value of creating a new node, of doing more exploration, in other words, versus doing more exploitation. And they have different models, essentially, that predict the value of each for each part of the tree. So they kind of decompose the tree and go, okay, in this chunk of the tree, what is the value of adding a new node versus the value of pushing forward in the existing nodes? And essentially this is a mechanism that they're going to use. The details, if you want to look into how they implement it, are quite interesting. They use this technique called Thompson sampling, which is a sort of Bayesian friendly way of making this kind of principled trade off. And so the results are really impressive. They end up basically outperforming, on average, all models on the benchmarks that they look at, with the best average rank across all benchmarks. They compare it to all the variants we talked about, just repeated sampling from the same node, or just doing sequential refinement the whole way through, or a standard Monte Carlo tree search where you fix the number of branches at each layer, all that sort of thing. One of the really amazing things is the ARC-AGI experimentation that they do. One last detail I've got to give you. So as they're doing this, right, you imagine you start off at a given node, and you can ask yourself, should I try refining the solution or should I try going up a level and spawning a new alternative? Another question you could ask yourself if you're spawning a new alternative is, which model should I use to spawn this new alternative?
And they're actually going to include that. They're going to allow the model to build another model to predict which model it should use to spawn those branches, and in that way essentially integrate them together. They're creating a complex of systems here, or a complex of models, integrating together many submodels, where at each node you have the choice of explore, and within explore, which model do I use to explore, or exploit and refine. Right. So that's kind of a very unusual sort of hybrid model approach. On ARC-AGI-2 they end up doing really well, scoring around 30% on pass@250. So 250 at bats here, basically a budget of 250 LLM calls, and they end up hitting around 30%, which is pretty wild. That performance, by the way, is highest when you allow the framework, I should say, to use the most models. So when they combine o4-mini, Gemini 2.5 Pro and DeepSeek R1, the May 28 version, they end up getting the best performance. As you start dropping models from that list, you start to see progressively worse performance. And so this really is a framework for getting many models to play nice together and to choose which model to use to expand nodes, and also whether to expand nodes or iteratively refine. So it's a really interesting paper, in a sense a meta paper or a meta model. I just thought this was a fascinating read.
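The wider-versus-deeper decision described above can be illustrated with a toy Thompson sampling loop. This is a heavily simplified sketch, not the paper's algorithm: it treats each option (spawn a new candidate with a given model, or refine the current one) as a bandit arm with a Beta posterior, and the arm names, hidden success rates and uniform prior are all made-up assumptions.

```python
import random


class Arm:
    """Beta posterior over how often an action yields a good candidate."""

    def __init__(self):
        self.alpha = 1.0  # prior pseudo-count of successes
        self.beta = 1.0   # prior pseudo-count of failures

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def update(self, score):
        # score is assumed to lie in [0, 1], e.g. the fraction of tests passed
        self.alpha += score
        self.beta += 1.0 - score


def choose_action(arms, rng):
    """Thompson sampling: draw once from each posterior, act on the largest draw."""
    return max(arms, key=lambda name: arms[name].sample(rng))


def simulate(true_rates, rounds=500, seed=0):
    """Toy loop; true_rates maps an action name to a hidden success probability."""
    rng = random.Random(seed)
    arms = {name: Arm() for name in true_rates}
    counts = {name: 0 for name in true_rates}
    for _ in range(rounds):
        action = choose_action(arms, rng)
        score = 1.0 if rng.random() < true_rates[action] else 0.0
        arms[action].update(score)
        counts[action] += 1
    return counts


# Hypothetical arms: spawn with one of two models, or refine the current answer.
counts = simulate({"wider:model_a": 0.2, "wider:model_b": 0.4, "deeper": 0.8})
```

Early on, every arm gets tried because its posterior is wide; as evidence accumulates, the budget of calls drifts toward whichever option has been paying off, which is the kind of principled trade-off the hosts describe. A full tree search would keep posteriors like these per node rather than globally.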
[71:31]
B
Yeah, this is somewhat significant, practically speaking, because this is one of the standard ways to structure reasoning. Right. And on the usage side, what this looks like on their benchmarks, for instance on coding and on ARC-AGI, these sort of quasi puzzles that measure intelligence, is that a given node is a solution, and you're able to sample multiple solutions, that's the breadth, the width, or you can iterate on a solution, that's the depth, and you should be getting some feedback such as test outcomes or scoring. And one of the limitations of search is you do need a scoring function of some sort to base the tree expansion on. So as you said, another interesting bit here is the adaptation of Monte Carlo tree search to the context of LLMs. It's not super intuitive, because within a node you can also generate more tokens, potentially. And there are various nuances, quite a bit of detail in the implementation. But the gist is it allows you to do this in a very principled, sort of classic way. Monte Carlo tree search is the default, right. It's one of the big algorithms, as is Thompson sampling, very highly researched. So they very much adapted it in a very clean way for reasoning and seem to get good results. Next we've got The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements. So they want to evaluate an AI agent's ability to reproduce scientific results, and they focus on NanoGPT speedrun tasks. That's the motivation, right? Kind of get the results of a given research paper with your own implementation using AI. And this NanoGPT speedrun task is designed to assess the efficiency and accuracy of AI models in reproducing scientific improvements. So now we have a benchmark for doing that. And as with any benchmark, it sets a goal to target and, at least for now, has room for improvement.
[74:06]
A
Improvement, yeah. I like to think of this as a sort of companion to, or a complement to, the METR evals, where they look at, you know, what is the time horizon of tasks that AI agents can successfully automate. So, you know, tasks that take humans an hour, tasks that take humans two hours. They tend to look at AI research tasks for that because they're interested in the sort of recursive self improvement loop where we get to the point where AI is just automating all of AI research. This is in that same spirit. And so NanoGPT is essentially an implementation, I should say, of GPT-2 in PyTorch that Andrej Karpathy put together a while ago. And so what's happened since GPT-2 came out is we've had a series of breakthroughs, right? A series of innovations where we've iteratively improved the training time for GPT-2 from 45 minutes to under 3 minutes. And there are 21 successive optimization records that have happened. So you think of it as 21 steps down the ladder that people have taken. And each of the tasks in this benchmark tasks whatever agent you're testing with reproducing one of those 21 steps, right? So starting with the original GPT-2, or at least its NanoGPT version, modify that in order to achieve the first speed boost that we got historically, and then you see, okay, well, what fraction of that speed boost did the agent successfully recover. And so that enters your evaluation framework. And then you try the next one. Okay, so now that we're here, let's reset. Let's give the model the correct version of GPT-2 that reflects that next step, that next breakthrough, and start from there. Now replicate the second breakthrough, see what fraction of that acceleration the model is able to recover, and then repeat all the way down the line 21 times. And then one eval would be to add up together all the time savings that your agent was able to achieve.
Compare that to the time savings of about 42 minutes that was actually achieved by human researchers over time, and then use that essentially to compare different models. And so, yeah, this is really quite interesting. o3-mini is the best performer they looked at here. It recovered about 40% of the speedup with hints. So this is a variant that they had. They have versions with three different layers of hints that you can give to the model, and then also versions with no hints at all. And, you know, three different layers of hints: one is a description of the pseudocode that would be required to reproduce it. The second layer is natural language explanations. And the third layer is a full mini paper format with all the technical details. So gradually giving the model more and more. But there is a version where they just zero shot it and you're not given any more context. And so Claude 3.7 Sonnet is also comparable to o3-mini, apparently, in some conditions. But DeepSeek R1 basically just does really badly, and sometimes does even worse with hints than without, which itself is kind of interesting. And then Gemini 2.5 Pro basically just bombed. It got basically zero in their aggregate measures. And so all kinds of interesting observations about what works and what doesn't. Worth looking at the paper. I think of this as more encyclopedic knowledge that will go bad pretty quickly. But certainly the benchmark itself seems like a really important and interesting contribution that I would keep an eye on. You know, again, think of it as a complement to those very famous METR evals. I think it's a great way of looking at that.
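The scoring idea described above, what fraction of each historical speedup an agent recovers and how that adds up over the ladder of records, can be sketched as follows. The function names and the example timings are hypothetical, and the paper's exact metric may differ in detail; this only mirrors the description given in the episode.

```python
def fraction_recovered(start_time, agent_time, record_time):
    """Share of the human speedup (start_time -> record_time) the agent reproduced."""
    human_gain = start_time - record_time
    if human_gain <= 0:
        raise ValueError("record_time must be faster than start_time")
    return (start_time - agent_time) / human_gain


def aggregate_recovery(steps):
    """Total agent time savings divided by total human time savings.

    steps: iterable of (start_time, agent_time, record_time) tuples, one per record.
    """
    steps = list(steps)
    agent_total = sum(start - agent for start, agent, _ in steps)
    human_total = sum(start - record for start, _, record in steps)
    return agent_total / human_total


# Made-up timings in minutes for two consecutive records.
steps = [(45.0, 40.0, 35.0), (35.0, 33.0, 30.0)]
per_step = [fraction_recovered(*s) for s in steps]
overall = aggregate_recovery(steps)
```

Note that each step restarts from the true previous record rather than from the agent's own attempt, so a weak result on one step does not compound into the next.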
[77:34]
B
Right. And also just a fun, I guess, way to do this. I did not know that there was a speedrunning challenge to train GPT-2 in the shortest amount of time. That's where the NanoGPT name comes from, because these days GPT-2 is considered nano. I believe GPT-2 was what, like 1ish billion, 2 billion parameters.
[77:57]
A
So the version they do here is the 124 million parameter version. Right.
[78:02]
B
But yeah, tiny by today's standards. Right. And speaking of METR and their evaluation suite, that is our next story. We have just an update. So just to quickly recap, they released a paper several months ago now, Measuring AI Ability to Complete Long Tasks. And they basically have a task suite where they roughly know how long a given task takes a human. Could be five minutes, could be 10 minutes, could be an hour. And they measured the ability of various models to reliably complete those tasks, so for instance, there's a 50% chance the model gets done a task that takes an hour or less. So since that release a couple of months have passed, and they published an update where Claude 4 Opus now reaches a 50% time horizon of 80 minutes. So a 50% chance it completes an 80 minute task. Sonnet 4 reaches the 65 minute point. So they are now exceeding an hour, slotting into the trend, the kind of prediction fit that came out with the paper.
[79:25]
A
Yeah, this is really interesting because there are so few data points, right? Like every frontier model basically is a data point. That's where the bar is. And so figuring out exactly what the trend says is really hard. If you look at the plots, even these small adjustments in the slope of that log plot are the difference between hitting ASI, or hitting, let's say, AI agents that can do month long tasks coherently, in say two years versus three or four years. So things can be very sensitive to that. So with these small little updates, every time you get a new model, you want to fit it to the plot really quickly and be like, oh, how does that affect the slope of the curve? Certainly when o3 came out, that was a big, big update. If you look at the plot, o3, in concert with other models like Sonnet 3.7 and o1 and Sonnet 3.5 from back in October, really seems to suggest there's an even steeper trend than is otherwise indicated. Claude 4 Opus does not, by the way, beat o3. So o3 is above an hour and a half. Claude 4 Opus is a few minutes shy of that. Like, I don't know, what, an hour and 15 you said, Andre, something like that. It's notable, you know, it has been a little while and we're not exceeding it now. I would say that's still within noise looking at the plots, but it could make you update a little bit. This is the debate that's happening right now. Right. People are trying to figure out what this means exactly and probably overthinking it. We're going to have to wait until the next OpenAI or next Anthropic agentic model drops. But definitely, you know, this is something really to keep an eye on just because of the implications. Right. If these curves really do curve the way they seem to, then we could be in for a hell of a party over the next few years. And how much of a party? Well, that's contingent on a relatively small number of data points.
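To see why small changes in the slope of that log plot matter so much, consider a back-of-the-envelope extrapolation under pure exponential growth of the time horizon. Everything here is an illustrative assumption: the function name, the two candidate doubling times, and the definition of a "month long task" as roughly 167 working hours. Only the 80 minute starting point comes from the discussion above.

```python
import math


def months_until(target_minutes, current_minutes, doubling_months):
    """Months needed for the time horizon to grow from current to target,
    assuming it doubles every doubling_months (a pure exponential trend)."""
    if min(target_minutes, current_minutes, doubling_months) <= 0:
        raise ValueError("all arguments must be positive")
    return doubling_months * math.log2(target_minutes / current_minutes)


WORK_MONTH_MINUTES = 167 * 60  # assumption: about 167 working hours in a month

for doubling in (4, 7):  # hypothetical doubling times, in months
    eta = months_until(WORK_MONTH_MINUTES, 80, doubling)
    print(f"doubling every {doubling} months -> about {eta:.0f} months")
```

Under these assumptions the two slopes give roughly 28 versus 49 months, which is the "two years versus three or four years" spread mentioned in the episode, and it comes entirely from how a handful of data points are fitted.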
[81:15]
B
Yeah. And you know, this is a tricky thing to evaluate, obviously, because, you know, you have a set of tasks that they evaluate on, and how do you really know how long it takes? But my sense is it seems pretty plausible, from using Claude Code, that it can autonomously finish a one hour task. It's definitely getting there, in my opinion. And next story, a research paper, and this one is titled Performance Prediction for Large Systems via Text-to-Text Regression. The gist of it is the focus is on predicting the outcomes of some sort of configuration of a system. So for instance, you have a cluster, you have some way to configure the cluster, and you want to be able to predict the latency of that setup. Tricky task, very useful task to do well on. And they are training a model that does that for you based on system logs and things like that. You go from previous data to a prediction of your performance with a new setup, and they get, as you might expect, really good correlation and performance with this approach.
[82:30]
A
Yeah, I guess the core of this is a debate that's been happening for a long time in especially language modeling, but elsewhere too, as to whether the decoder only architecture is the right way to go or whether you should use an encoder decoder structure. Right. So an encoder decoder is a model that will start by, through many layers, specialize in just generating a really good encoding of an input, and then have separate layers that specialize in massaging that encoding to turn it into a decoding in different clumps, essentially in different stages, with an optimization routine that reflects that intent. So the advantage people will claim for the decoder only version is you have essentially like an integrated thing, like one model that's able to kind of address dependencies and interactions between the lowest layers and the highest layers without having to go through a bottleneck where you have a well defined encoded latent representation. Whereas the encoder decoder side would Say, well, it's good to have specialization so you can make a really good encoding and then separate that out from the decoder step, which is kind of this fundamentally different operation. And I mean the answer as to which of these approaches is best does seem to be context dependent. Certainly this paper strengthens that argument. What they're going to do here is have two encoder layers again that specialize in just taking in this semi structured data about the state of a system. As you said, they used Google's Borg compute architecture as their testing ground for this. So they use system logs as inputs. They use all kind of like the equivalent of like check engine lights and things like this that they feed in. And some of this data is in text form by the way. But this is not a language model. It's not going to learn to understand the semantics of the text. 
It's only going to learn to understand those indicators insofar as they are correlated to the one metric, the number, that the model ultimately is going to predict. And so this is not an autoregressive model. It's taking these raw inputs and it's predicting a number which is kind of a measure of the efficiency of the overall system, the predicted efficiency. So you've got two encoder layers and two decoder layers, for 60 million parameters in total. So a relatively small model. So this is a really effective way, it seems, of making these predictions on how the system is going to work. The Borg cluster scheduling system is the source of data they use. It's the big cluster orchestration system that Google uses, and that is all the raw data that they're using to train this. They do it using cross entropy loss over response tokens that they get from the system, that indicate how it's doing, what the status is of the overall system. And so anyhow, it's pretty interesting. It is more of a niche sort of application. It's showing that just using a raw LLM trained from scratch on text data is not necessarily the best play if you have a more structured problem. Again, this is not perfectly structured. You still have language inputs, but you think of those language inputs more as categorical variables, and that's sort of the interpretational frame that they're applying here.
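One way to picture "predicting a number with cross entropy loss over response tokens" is to write the target metric out as a short token sequence and score the model's per-position probabilities against it. The sketch below is a hypothetical illustration: the sign-plus-fixed-point tokenization and all function names are assumptions, not the scheme used in the paper.

```python
import math


def number_to_tokens(value, decimals=2):
    """Serialize a number as a sign token followed by fixed-point characters."""
    sign = "-" if value < 0 else "+"
    return [sign] + list(f"{abs(value):.{decimals}f}")


def tokens_to_number(tokens):
    """Inverse of number_to_tokens."""
    return float("".join(tokens))


def cross_entropy(predicted_dists, target_tokens):
    """Mean negative log-likelihood of the target tokens.

    predicted_dists: one dict per position, mapping token -> probability.
    """
    total = 0.0
    for dist, token in zip(predicted_dists, target_tokens):
        total += -math.log(dist[token])
    return total / len(target_tokens)


target = number_to_tokens(3.5)  # ['+', '3', '.', '5', '0']
uncertain = [{tok: 0.5} for tok in target]  # model puts 50% on each correct token
loss = cross_entropy(uncertain, target)  # equals log(2)
```

Framing regression this way lets a standard encoder-decoder be trained with its usual token-level loss while still producing a numeric prediction once the output tokens are decoded back into a number.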
[85:39]
B
Next up, just a couple more papers. The next one is Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning. We've really got to get going, so just a very short gist of the paper. They are exploring, if you train specifically to do better on math, are you going to be able to do better on reasoning in general outside of math, like, I don't know, science problems, for instance, or coding. They find that depending on how you do it, you might actually get negative transfer. So supervised training just doesn't work as well compared to reinforcement learning. Reinforcement learning kind of has a more subtle effect that doesn't mess up your initial model as much and generally seems to result in better transfer. And last up, Correlated Errors in Large Language Models. Another kind of empirical analysis paper. They are looking at the correlation of errors among different large language models using several data sets. And they look at 349 LLMs on 12,000 multiple choice questions. And so the question is, among different LLMs, how similar are they in terms of what they get wrong? They found that the correlation is pretty high. Models agree on incorrect answers about 60% of the time on the HELM leaderboard. So much more likely than random chance. And I suppose not entirely surprising, perhaps, but still interesting from an empirical analysis perspective.
[87:27]
A
Yeah, absolutely. I mean, this remains true too, regardless of architectures used and model developers, which sort of leaves, I mean, and presumably optimization routines as well. So that basically leaves like the data. Right. And it kind of makes a lot of sense. Right. There's only so much Internet data and everybody will be using highly overlapping data sets as part of their training. So in some sense, maybe not surprising in others. I mean, one of the things you would expect is like, I'd be interested in seeing the overlap of this with just general like frequency of errors from these models because as the frequency of errors drops, the errors themselves get more and more rare and scarce. And so you're getting a more and more distilled picture of the. I mean, I don't want to, it's definitely not the irreducible entropy of the training data, but it's, it's sort of gesturing in that direction. So anyway, kind of, kind of interested in what that looks like if they're plotted together. But interesting paper.
[88:26]
B
Yeah. And they focus in particular on this area of job applicant screening as kind of an application of this analysis. And as you might expect, having lower correlation is better, because if you have lower error correlation, it means you can look at several LLMs and potentially avoid an error, because one LLM gets it wrong and everyone else gets it right. And the gist of it is, you know, you need to sample quite a few LLMs to be able to get to lower error because of a decently high correlation among them.
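The agreement figure discussed above can be computed, in spirit, by looking only at questions that two models both get wrong and asking how often they pick the same wrong option. This is a minimal sketch with made-up answers; the function name and the exact conditioning are assumptions, and the paper's own metric may be defined differently.

```python
def wrong_answer_agreement(answers_a, answers_b, correct):
    """Among questions both models miss, the fraction where they chose the same wrong option.

    Returns None when there is no question that both models got wrong.
    """
    both_wrong = [
        (a, b)
        for a, b, c in zip(answers_a, answers_b, correct)
        if a != c and b != c
    ]
    if not both_wrong:
        return None
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


# Made-up answers to four multiple choice questions.
correct = ["A", "A", "A", "B"]
model_a = ["A", "B", "C", "D"]
model_b = ["A", "B", "C", "A"]
agreement = wrong_answer_agreement(model_a, model_b, correct)  # 2 of 3 shared misses
```

For four-option questions, two models guessing independently and uniformly among the three wrong options would agree about a third of the time, so a figure near 60% points to shared blind spots. That is also why combining several models reduces error more slowly than independence would suggest.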
[89:04]
A
And up next, on to the policy and safety section. We're starting with Forecasting Biosecurity Risks from LLMs. Okay, so there's been a lot of talk about whether LLMs actually do make it more likely that bad actors are going to be able to design or release more dangerous bioweapons. And you know, famously there was this RAND study from a year ago that said, guys, not don't worry, but good news, there's no meaningful uplift from, and I think at the time it was like GPT-4 or something. And then OpenAI came out with something a few months or weeks later saying, actually, we tried this or something analogous to it, and we have access to the full unlocked version of GPT-4, and we do get a quite significant detectable increase in the probability that people with a little bit of training or a significant amount of training are able to access dangerous bioweapons. And we've seen, by the way, other benchmarks too in system cards from Anthropic and from OpenAI since that have meaningfully increased that even further, quite significantly. So this is another take on it. Instead of throwing these models directly at tasks that people think will be correlated with high bioweapon risk, what they're doing is they're turning to a bunch of experts in biosecurity and biology. So 46 domain experts in biosecurity and biology and 22 expert forecasters. Right. These so called superforecasters. And what they look at is, okay, for all these folks, we want to get you to predict the probability of a human caused epidemic causing over 100,000 deaths by, I think it's 2028. Yeah. And so that's, you know, the base question. There are a whole bunch of other questions correlated with that or that follow from that. But that's kind of the base question, that's the meat and potatoes.
And then they divide the overall group up in different ways to look at how that predicted probability changes depending on who you ask. Right. So for example, the overall assessed probability was around 1.5% with AI and 0.3% without. So they're predicting a very significant increase in the probability of 100,000 deaths from, again, a human-caused epidemic by 2028. But it turns out that the people who assign the highest probability to this are also the people who are more accurate on predicting the progress of large language models. They're also the people who have the most experience in biosecurity. They're also the people who get the highest accuracy on the low-probability questions that they ask elsewhere in the survey. So that's kind of bad news, right? The people who are the best at forecasting this sort of stuff tend to assign the highest probability to this, ultimately leveling out at around 1 to 3%, something like that, but still significant. And so this is something that does suggest, hey, there is meaningful uplift, at least according to these forecasters. So take it with a grain of salt. The last thing I'll mention is they do say that mitigation measures are probably going to be enough to buy down this risk. When they were asked about mitigation measures, including mandatory screening of synthetic nucleic acid orders and just basic AI model safeguards, they reduced their risk forecast back to close to the baseline level. So they basically figured if you do put in the right mitigation measures, you should be able to essentially buy down all the risk that comes from this. I'm personally really skeptical about that. I think people very much sort of overestimate the effectiveness of a lot of these safeguards for reasons we could talk about.
But yeah, anyway, I think it's a really interesting study, and again, a fundamentally different angle, right, from these more empirical studies that RAND has put out, that OpenAI has put out, that Anthropic has put out, that are, in their own right, very useful. And RAND gets credit for kicking off that trend so many months ago, I want to say over a year ago now.
[93:07]
B
Yeah, quite an interesting read. And they do go into quite a bit of detail. So they start out with this unconditional question, what's the probability of 100,000 deaths due to a pathogen by 2028? And then they condition on various hypothetical advancements in LLMs to see what the change is. So they begin with this 0.3% baseline that rises to 1.5% conditional on several hypothetical capabilities. And amusingly, when they then checked, it turned out one of those capabilities had already been achieved, and the people responding to the survey thought that would not happen until 2030. So that is an interesting data point, saying that maybe the forecasters are underestimating the degree to which LLMs are moving and are able to achieve these advancements, which could color your prediction, or at least would mean that this 1.5% probability is their actual prediction given the state of LLMs.
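The step from "conditional forecast" to "actual prediction" is just the law of total probability, sketched below. The 1.5% and 0.3% figures are the ones quoted in the discussion; treating 0.3% as the risk when the capabilities are absent, and the name `blended_risk`, are simplifying assumptions for illustration only.

```python
def blended_risk(
    p_capability: float,
    risk_with: float = 0.015,     # quoted forecast conditional on the capabilities
    risk_without: float = 0.003,  # quoted baseline, assumed here to be the no-capability risk
) -> float:
    """Overall forecast as a mix of the two scenarios (law of total probability)."""
    return p_capability * risk_with + (1 - p_capability) * risk_without


# Hypothetical beliefs about whether the capabilities arrive in time.
for p_cap in (0.0, 0.25, 0.5, 1.0):
    print(f"P(capabilities)={p_cap:.2f} -> overall risk {blended_risk(p_cap):.2%}")

# Relative increase between the two quoted figures (about 5x).
print("relative increase:", round(0.015 / 0.003, 2))
```

Once the capability is known to exist, `p_capability` is 1 and the mix collapses to the conditional figure, which is the argument made in the discussion.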
[94:18]
A
And obviously, you know, people who listen to the podcast are aware of my bias on this. Like, I do think that AI is moving a lot faster than most people realize or want to admit to themselves. And one tell is things like this. This happens over and over and over again. You'll have people say like, you know, oh, we're not going to have this. When you actually get people to give you dates by which they think certain capabilities will emerge, they tend to just hilariously predict like, oh, it'll be another 10 years, another five years. And the usual case is like, the thing gets done in a month or two. But in this case, this is such a great example of it because it literally had already happened. It's so hard to keep up with the space, and in fairness, it does move that fast. And we ourselves are surprised by things all the time. But that's kind of part of the problem, right, that you need to have some amount of, yeah, I guess epistemic uncertainty when it comes to this stuff and factor it in. If you find you continually get surprised by how fast things are moving, then, you know, maybe that implies you should just change your world model. And anyway, I think a lot of people are banging that drum these days.
[95:21]
B
Yeah. And the hypotheticals here are quite specific. So they're talking about AI enabling 10% of non-experts to synthesize DNA fragments, I think of the 1918 influenza, in a laboratory. They are looking at the Virology Capabilities Test, which came out just a couple months ago. So these are, you know, not just like, oh, you do this well at some coding benchmark. They're very specific to wet lab work, virology work, things like that, which is obviously quite relevant and I think does lend this more credibility as an analysis. Speaking of predictions, next we have AI task length horizons in offensive cybersecurity. So this is an adaptation really of the METR methodology we just discussed for predicting the length of tasks that LLMs can do. This is less formal, just FYI. It's a blog post by just one person. They kind of estimated the length of tasks for various benchmarks, by the way.
[96:38]
A
Sorry, this is less formal in a space where it's like, now, the formal version of course is a preprint slapped together and thrown on arXiv without peer review. Right. It's just funny that that's like.
[96:53]
B
Yeah, yeah, yeah. I'm just outlining it because a blog post, it's totally, totally.
[96:59]
A
I had this double take where I agreed with you, but then I was like, wait a minute, like, what's the bar?
[97:03]
B
Yeah, it's not like these things are being released in journals, but anyways, in this slightly more informal analysis, they have tasks ranging from 0.5 seconds to 25 hours in human-estimated times. And they are seeing that it's still pretty early. Current models can solve six-minute tasks with a 50% success rate. But as with METR, you know, you can do a little analysis showing that this task length is likely to double every six months or, sorry, four months or so.
[97:44]
A
This is exactly the debate. Yeah. That we were talking about earlier. Right. How do you fit that curve? Some ways of fitting it you get four months, some ways you get six, some ways you get seven. It's pretty unclear.
[97:55]
B
Yep. So there you go. More empirical analysis, and obviously related to safety in the sense that cybersecurity is a huge challenge and LLMs are kind of an obvious fit for hacking, for things like that. Unlike biology, for instance, where you need a wet lab and you need to work with a human, here you could very easily see an agent going off and doing some hackery. And with the launch of AI coding as well, there's going to be a lot of cybersecurity stuff going on in the next few years.
[98:31]
A
It's fine. Everything's fine, guys. Yeah. This benchmark, by the way, is in my opinion extraordinarily badly needed. Any threat model that you have that runs through, you know, AI self-replication, loss of control, weaponization, right, the cyber use case is arguably, and you would have to argue this, but arguably the most sort of real and present one that you might expect impacts from in the near term. And so you should be very friggin interested in measuring how successful these models are at long-horizon tasks that look like capture the flag challenges, that look like malware generation challenges, natural language to bash translation, that sort of thing. So they look at five different buckets of tasks that have different time horizon characteristics, five different benchmarks, I really should say. So there's like CyBashBench, which has the shortest tasks, which were actually created by the author of this paper, presumably because there just aren't tasks short enough that like GPT-2 can do anything meaningful whatsoever. So that's like one to three second tasks. At least that's how he assesses it. And we could get into this, and at some point we may, but the open question is always how do you assess the amount of time that it takes for humans to complete these tasks? That's itself a very interesting question, especially as you get into very, very short and very, very long tasks. But anyway: NL2Bash, InterCode-CTF, NYU CTF, CyBench. CyBench, by the way, is interesting because the task lengths there range from 2 minutes to 25 hours. So you're really covering quite a wide kind of temporal range, for models from 2019 to mid-2025. And it's all the usual curve fitting stuff. Check out our first podcast on the METR evals, where we did a deep dive into their methodology. That'll give you a good sense for how this is being assessed here.
It looks like a five-month doubling time here, so six minutes today, but it's doubling every five months. So the five-month doubling time suggests that you would reach a week-long task within about five years. There's a lot of caveats here. I think one of the really interesting things to note though is that the time horizon here is so much shorter than the time horizon we see with the METR evals. Right. The METR evals are showing us an hour and a half for o3, as we just talked about, and yet here we're talking about like six minutes. So yeah, what's the delta there? Well, part of it is the labs are directly optimizing for recursive self-improvement. Right. This is not even a secret, this is just straight up what they will tell you in their blog posts, which itself is a super dangerous thing to do and shouldn't be done. But that's part of the roadmap. So there's optimization pressure pointed directly along that axis. Here, you can think of the cyber capabilities, the offensive cyber capabilities, at least for these models that are kind of public and publicly used, as being a side effect of optimization against the kind of core AGI benchmarks. And so that's one reason why you're not seeing necessarily the same impressive uplift here, but you're still seeing the doubling time. That's quite interesting. Right. It suggests some robustness to the broader trend of exponential coherence length increases in these AI agents.
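The "week-long task within about five years" figure follows from simple exponential extrapolation, sketched below. The six-minute starting horizon and five-month doubling time are the numbers quoted above; the function name `months_to_reach`, the two definitions of a "week", and the alternative doubling times (four to seven months, echoing the curve-fitting debate earlier) are assumptions added for illustration.

```python
import math


def months_to_reach(
    target_minutes: float,
    current_minutes: float = 6.0,   # quoted 50%-success horizon today
    doubling_months: float = 5.0,   # quoted doubling time
) -> float:
    """Months until the horizon reaches target_minutes, if the trend holds."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


# Two readings of "a week-long task" (both are assumptions).
TARGETS = {
    "calendar week (168 h)": 7 * 24 * 60,
    "work week (40 h)": 40 * 60,
}

for label, minutes in TARGETS.items():
    for doubling in (4.0, 5.0, 6.0, 7.0):
        years = months_to_reach(minutes, doubling_months=doubling) / 12
        print(f"{label}, doubling every {doubling:.0f} months: {years:.1f} years")
```

With the quoted five-month doubling time, a full calendar week comes out at roughly four and a half years and a 40-hour work week at roughly three and a half, so "about five years" is in the right range. The answer scales linearly with whichever doubling time the curve fit produces.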
[101:38]
B
Yeah, I think as we mentioned earlier, similar caveats apply, even more so here, in that the estimates themselves of time to complete by a human are quite hard to make reliable, and here there's only one person doing that. The task distribution is also more heavily leaning towards the short side. So yeah, a large majority of the tasks are between 1 second and 10 minutes, and you do get a decent amount going up to one hour, but then beyond about two hours you get very few tasks. So yeah, very impressive effort by a single individual, but do take it with a grain of salt. Regardless, it's clearly the case that LLMs are able to do a bunch of cybersecurity stuff. Alrighty, just a couple more things. Moving on to the policy side, and we start with the US, where we've been dealing with the saga of the One Big Beautiful Bill, which just passed yesterday through the House and is going to President Trump's desk to sign. The One Big Beautiful Bill is primarily about the budget, about various tax cuts for the rich and various cuts to services by the government. But in it, tucked away, there was a section that we covered previously that would have banned regulation of AI by the states for 10 years, I believe, which became a bit controversial once it was highlighted. It was then removed from the bill. The Senate voted 99 to 1 to remove this proposed 10-year moratorium on state-level AI regulations. So it's out. And this article goes into the aggressive lobbying for the moratorium led by a16z and Meta and others.
[103:45]
A
It must be said, seemingly OpenAI as well. This is quite interesting because the case has been made. So there's this guy, Adam Thierer, I want to say, who's kind of famous in D.C. for having come up with the idea of a 10-year state moratorium on AI regulation. His big claim has been to trot around this number where he says that there are over a thousand state-level bills that are imposing regulations on AI, and that this would create an untenable mishmash of state-level regulation that you would have to adhere to. And then of course there may be a federal package that comes through at some point, and this makes it impossible for small companies, or as Andreessen Horowitz calls them, little tech, to compete in that space. There's just a little problem. That 1,000 figure is, for all intents and purposes, like embarrassingly made up. It basically seems to come from a search of a database of state-level legislation for anything that just uses the term AI. So a lot of these things are just using the term AI, defining it, or even in some cases finding ways to advocate for it, to get it to be used in education and things like that, and don't actually introduce any meaningful constraints on companies' ability to use or develop AI. And so it's sort of disingenuous, frankly, to talk about it as if there's a thousand things like this. When you whittle it down, it seems like the estimates I've seen are around 40 of these things that actually are material. The majority of those will not actually pass either. So the number gets whittled down pretty fast, by about two orders of magnitude, which I think is significant. The other piece of this too is the argument was historically that, rather than having the states regulate this, we should regulate at the federal level. Pass a law at the federal level, which sounds like a great idea until you realize that the federal government has been gridlocked on the issue of AI legislation forever. Right?
I mean, it's been five years since GPT-3, it's been three years since ChatGPT. We're still in this endless cycle of having committee testimony and hearings and investigations and all this stuff, and it never really goes anywhere. And this is a recognized pattern in the Valley. People know, lobbyists know that there's this gridlock at the federal level. And so they're saying, hey, let's preempt any state-level legislation for 10 years. By the way, OpenAI internally believes superintelligence gets hit within like, you know, five years tops, more likely three years, something like that. So the idea that at the state level there's no regulation for 10 years seems pretty insane. And then there's a basic fact of the matter that states are different, right? I mean, California, which has OpenAI in its borders, which has, you know, a lot of big labs in its borders, versus Idaho, or Virginia, which has a bunch of data centers but no frontier labs. Obviously these states are going to have fundamental differences in the way that they need to regulate and legislate AI. So it actually does make sense that you should have some freedom. Like, I'm old enough to remember when states' rights was a thing among kind of more libertarian-leaning people such as, actually, you know, myself in this space. So yeah, it kind of seems like a weird play to try to strip away states' rights. Fortunately, and I think this is a reflection of just good education in the Senate on this issue, this was thrown out very recently. The Senate voted overwhelmingly, on a 99 to 1 margin, to rip out this state-level preemption. So it's a pretty remarkable defeat by the end of it. I think Ted Cruz was sort of championing this thing forward using the "China scary" line, which I actually take as well. Right. If you're tracking the work that we've done, nobody is more in the "China scary" camp than we are.
But there's a kind of fundamental misunderstanding here of the role that state-level AI legislation can play in a context where there is nothing happening federally. Like, this is just being real about it. That's the consequence. And you have to imagine that's exactly what the, you know, the companies that have been lobbying for this, especially Andreessen Horowitz, were thinking, and now the backlash. You know, the problem is when you do something like this, it is so obvious to people that what you are trying to do is lock in a competitive advantage for the, you know, the OpenAIs of the world that, yeah, now you're going to get a backlash. What this looks like is exactly what it is. It looks like Marc Andreessen stepping into the federal level, trying to backdoor his way into putting some pretty extreme legislation on the table that's reactionary in just the same way as a lot of the burdensome regulation that's been proposed at the federal level would be reactionary in the other direction. And so I think that this could actually backfire in some pretty concerning ways. You just got to be more careful about this, and especially, you know, take the temperature of the public on this. People are interested in regulating, and so this doesn't match the public perception. So I'll get off my soapbox. But I feel like this is a bit of an own goal for the people who are looking for this sort of thing. Federal nullification of state laws in this way, without a federal framework in place, is straightforwardly unprecedented. Like, this would never have been done. And so the bet here is literally that we are so confident that we don't want any state-level AI legislation over the next 10 years, as superintelligence may come and go, we are so confident that we won't need that, that we're going to lock it in at the federal level. Like, that's some balls, dude. That's some real, real balls. I wish that were the case.
I, I think we need a little bit more flexibility on this. And you know, Ted Cruz is doing his best and everybody is, but I think ultimately it just reflects a misunderstanding fundamentally of the trajectory of the technology. Yeah.
[109:22]
B
And worth noting, this is particularly important in the case of the US because there is no real regulation at the federal level, and there is not going to be any, at least until this president is out of office, just based on who he chose to be leading on the tech side. So effectively the states are the ones doing any sort of regulation. For instance.
[109:53]
A
By the way, I'm a little skeptical. Like, I think that the feds may well come in and regulate this, but the point is that they retain the option to. Right. They're not saying, hey, we're not going to put in any regulation for 10 years, blanket statement. Like, that's what's insane to me. Right. It's like, let's enshrine this in the law of the. Like what? It's just literally like, let's take options off the table. Right, sorry. Yeah, it just seems so. It's also a bipartisan thing. Like, Marjorie Taylor Greene famously came out and said, hey, I voted for this bill before I realized that this crazy thing was in there. Now that I see it, I'm like, holy shit, I never would. Here's her quote, right? I'm not voting for the development of Skynet and the rise of the machines by destroying federalism for 10 years, by taking away states' rights to regulate and make laws on all AI. Like, this is a bipartisan issue. That's why it's 99 to 1 in the Senate. It's insane to think you can backroom your way through this. I think this turns a lot of people off the kind of Washington and lobby treadmill here. It just kind of seems to be exposed for an attempted, I won't call it a fraud on the American people here, but this is like an undemocratic play that was attempted, and that's not good. You know, I mean, anyway.
[111:06]
B
Yeah, yeah. So also just kind of weird that they tried to sneak this in in the budget reconciliation. There's like nuance where the states would lose some federal grants if they go against it. Also notable because, looking back so far, the biggest effort to regulate AI was last year with SB 1047 in California. As we've covered, that was defeated in significant part due to heavy lobbying by tech. It went as far as the governor, and the governor vetoed it. So this would basically have prevented that. Right. And likely it's going to happen again in California. There are efforts to do a revamped 1047, so this is quite significant in that context as well.
[111:58]
A
And by the way, just to make the point, if anybody is clutching their partisan pearls on this. Right. This is again not a partisan thing. SB 1047 obviously passed the highly Democratic legislature in California, but it was vetoed by Gavin Newsom. Right. Like the most liberal governor basically in the entire country, under pressure from Nancy Pelosi, among other people at the federal level, writing in and saying it should be scrapped. So this is a really fascinating issue in that it just crosses, nukes, all partisan lines. I love issues like that, by the way, because they prevent us from just seeing things through this lens that we, you know, we want to see it through. You know, Republican versus Democrat. The reality is that's not what's happening. We're figuring out what the right play is. Anyway, I genuinely feel bad for anybody who's in the hot seat of having to make the calls on this. They're hard to make, but surely preserving optionality ought to be part of our basket here.
[112:50]
B
Yeah, I'll say, I don't know if Gavin Newsom is the most liberal, but in a very Democratic state, to be fair. Anyways, moving on, one last story. It's about Denmark. They are going to tackle deepfakes by giving people copyright to their own features. They're going to be amending copyright law to give individuals rights over their own body, facial features and voice. It's one of the first initiatives of its kind in Europe, it has broad support, and it's going to take a little while. It's still being submitted for consultation and will be formally submitted in the autumn. It kind of relates to some efforts elsewhere, certainly in Hollywood. There's been negotiations, but to my knowledge, not too much on the law side as far as copyright over your own appearance.
[113:47]
A
Yeah, how you actually define the bounds on that too is going to be really interesting. Right? AI always has a way of fuzzing the boundaries around everything, and so, you know, how much can you modify a face until it's not your face anymore? All this stuff. But yeah, really interesting, because we accept makeup, right? We accept hairstyle differences, all this stuff. So at what point is AI an adornment versus a fundamental change of appearance? Anyway, some of the interesting philosophical questions we'll have to deal with.
[114:16]
B
And with that, we are finished with this episode. As promised, kind of a long one with lots discussed. So hope you enjoyed that. Thank you as always for listening, in particular if you make it all the way to the end. We also appreciate it if you share the podcast, if you review it and so on. But more than anything, do keep tuning in.
[114:52]
B
Tune in, tune in when the AI news begins begins.
[115:04]
A
Break it down Last Week in AI come and take a ride Hit the low down on tech and let's let it slide Last Week in AI come and take a ride Up a lab to the streets AI reaching high New tech emerging Watching surgeon fly from the labs to the streets AI's reaching high algorithm shaping up the future sees Tune in, tune in get the latest with ease Last Week in AI come and take a ride Hit the low down on tech and let it slide the headlines pop data driven dreams they just don't stop Every breakthrough, every code.
[116:11]
B
Unwritten on the edge of change with excitement we're smitten from machine learning marvels to coding kings Futures unfolding see what it brings.