A
Welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. And you can check out the episode description for the links to the articles and timestamps. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a generative AI startup.
B
And I'm your other host, Jeremy Harris, co-founder of Gladstone AI, AI and national security, blah blah blah, as you know. And I'm the reason this podcast is going to be an hour and a half and not two hours. Andrey was very patiently waiting for, like, half an hour while I just sorted out... my daughter's been teething, and it's wonderful having a daughter, but sometimes teeth come in, like, six or eight in a shot, and then you have your hands full. And so she is the greatest victim of all this. But you're a close second, because, boy, I kept saying five more minutes and it never happened. So I appreciate the patience, Andrey.
A
I got an extra half hour to prep, so I'm not complaining. And I'm pretty sure you had a rougher morning than I did. I was just drinking coffee and waiting, so not too bad. But speaking of this episode, let's do a quick preview. It's gonna be, again, kind of less of a major news week, with some decently big stories. In tools and apps, Gemini CLI is a fairly big deal. In applications and business, we have some fun OpenAI drama and a whole bunch of hardware stuff going on. Not really any major open source stuff this week, so we'll be skipping that. In research and advancements, exciting new research from DeepMind and various papers about scalable reasoning, reinforcement learning, all that type of stuff. Finally, in policy and safety, we'll have some more interpretability, safety, and China stories, the usual, and some pretty major news about copyright, following up on what we saw last week. That actually will be one of the highlights of this episode, towards the end. Before we get to that, I do want to acknowledge a couple reviews on Apple Podcasts, as we do sometimes. Thank you to the kind reviewers leaving us some very nice comments, and also some fun ones. I like this one. This reviewer said they want to hear a witty and thoughtful response on why AI can't do what we're doing with the show. And wow, you're putting me on the spot, being both witty and thoughtful. I will say, I did try NotebookLM a couple months ago, right, that's the podcast generator from Google. It was good, but definitely started repeating itself. I found that LLMs still often have this issue of losing track of where they are at, like, 10 or 20 minutes in, and repeating themselves.
B
And they'll just keep saying the same thing and repeating over and over, like they'll repeat and repeat a lot. So yeah, yeah, that was, that kind.
A
Of repetition was solved a couple years ago, thankfully. But honestly, you could do a pretty good job replicating Last Week in AI with LLMs these days, I'm not gonna lie. But you'd have to do very precise prompting to get our precise personas and personalities and voices and so on. So I don't know. Hopefully we're still doing a better job than AI could do, or at least a different job than the more generic kind of outcomes you'd get trying to elicit AI to make an AI news podcast.
B
But dude, what AI could compete with starting 30 minutes late because its daughter's teething? Like, I challenge you right now, try it. You're not going to find an AI that can pull that off.
A
You can have an AI that says it does, but will the emotion of that experience actually be in it? I don't think so.
B
I think that's the copium, right? People are often like, oh, it won't have the heart, it won't have, like, the soul, you know, the podcast. It will, it will. In fact, I think arguably our job is to surface for you the moment that that is possible, so that you can stop listening to us. One of the virtues of not being, like, a full-time podcaster on this is we have that freedom, maybe more than we otherwise would. But man, I would expect within the next 18 months... hard to imagine that there won't be something comparable. But then, you know, your podcast host won't have a soul. They'll be stuck inside a box.
A
Well, in fact, I'm certain, I believe as of quite a while ago, there are already AI-generated AI news podcasts out there. I haven't checked them out, but I'm sure they exist, and nowadays they're probably quite good. And you get one of those every day, as opposed to once a week, and they're never a week behind. So in some ways, definitely superior to us. But in other ways, can they be so witty and thoughtful in responding to such a question? I don't know. Can they be so lacking in wit and thought as we can be sometimes? That's a challenge.
B
They'll never outcompete our stupidity.
A
Yes, as is true in general. I guess you'd have to really try to get AI to be bad at things it's actually good at. Anyways, a couple more reviews lately, so I do want to say thank you. Another one calls this the best AI podcast, which is quite the honor, and says that this is the only one they listen to at normal speed; most of the other podcasts are played at 1.5 or 2x speed. So good to hear we are using up all our two hours at a good pace.
B
That's right.
A
Funny, a while ago there was a review that was like, I always speed up through Andrey's talking and then have to slow down for Jeremy. So maybe I've sped up since then. So yeah, as always, thank you for the feedback, and thank you for the questions that you bring in. I think it's a fun way to start the show. But now let's go into the news, starting with tools and apps, and the first story is, I think, one of the big ones of this week: Gemini CLI. So this is essentially Google's answer to Claude Code. It is a thing you can use in your terminal, which for any non-programmers out there is just the text interface to working on your computer. So you can, you know, see what files there are, open them, read them, type stuff, et cetera, all via a non-UI interface. And now this CLI is Gemini in your terminal, and it has the same sort of capabilities at a high level as Claude Code. So it's an agent: you launch it and you tell it what you want it to do, and it goes off and does it. And it sort of takes turns between it doing things and you telling it to follow up, to change what it's doing or to check what it's doing, et cetera. With this launch, Google is being pretty aggressive, giving away a lot of usage: 60 model requests per minute and 1,000 requests per day. That's a very high allowance as far as caps go, and there's also a lot of usage for free without having to pay. I'm not sure if that is the cap for free, but for now you're not gonna have to pay much. I'm sure sooner or later you get to the Claude Code type of model, where to use Claude Code at the highest tier you have to pay $100 or $200 per month, which is what we at our company already do, because Claude Code is useful. From what I've seen in conversations online, the vibe eval is that this is not quite as good as Claude Code. It isn't as capable at software engineering, at using tools, at just generally figuring things out as it goes. But it was just released. Could be a strong competitor soon enough.
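A quick back-of-envelope sketch of those quoted free-tier caps (60 requests/minute, 1,000 requests/day are the numbers mentioned in the episode; treat them as the announced allowances, not a verified spec). The point it illustrates: for sustained agent use, the daily cap binds long before the per-minute one does.

```python
# Back-of-envelope check of the Gemini CLI free-tier caps quoted above.
# Numbers are the ones stated in the episode, assumed here for illustration.

PER_MINUTE_CAP = 60     # model requests per minute
PER_DAY_CAP = 1_000     # model requests per day

# If you could sustain the per-minute rate all day:
theoretical_daily = PER_MINUTE_CAP * 60 * 24

# In practice the daily cap is exhausted quickly at full tilt:
minutes_to_exhaust_day = PER_DAY_CAP / PER_MINUTE_CAP

print(theoretical_daily)                  # 86400
print(round(minutes_to_exhaust_day, 1))   # 16.7
```

So an agent hammering the API at the rate limit would burn through the whole daily allowance in under 17 minutes, which is why the daily number is the one that matters for agentic coding sessions.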
B
Yeah, I'm still amazed at how quickly we've gotten used to the idea of a million-token context window, by the way, because this is powered by Gemini 2.5 Pro, the reasoning model; that's part of what's in the backend here. So that's going to be the reason also that it doesn't quite, you know, live up to the Claude standard, which is obviously a model that's... I don't know, it just seems to work better with code. I'm curious about when that changes, by the way, and what Anthropic's actual recipe is, like, why is it working so well? We don't know, obviously, but someday, maybe after the singularity, when we're all one giant hive mind, we'll know what actually was going on to make the Claude models this good and persistently good. But in any case, yeah, it's a really impressive play. The advantage that Google has, of course, over Anthropic currently is the availability of just a larger pool of compute. And so when they think about driving costs down, that's where you see them trying to compete on that basis here as well. So a lot of free prompts, a lot of free tokens, and, I should say, good deals on the token counts that you put out. So, you know, it's one way to go. And I think as the ceiling rises on the capabilities of these models, eventually cost does become a more and more relevant thing for any given fixed application. So that's an interesting dynamic, right? The frontier versus the fast followers. I don't know if it's quite right to call Google a fast follower, they're definitely doing some frontier stuff, but anyway, an interesting next move here. Part of the productionization, obviously, of these things, and entering workflows in very significant ways. I think this is heading in slow increments towards a world where agents are doing more and more and more, and, you know, context windows, coherence lengths are all part of that.
A
Right? Yeah, we discussed last year, like towards the beginning of last year, there was a real kind of hype train for agents and the agentic future. And I think Claude Code and Gemini CLI are showing that we are definitely there, in addition to things like Replit and Lovable. Broadly speaking, LLMs have gotten to a point, partially because of reasoning, partially presumably just due to improvements in LLMs, where you can use them in agents, and they're very successful from what I've seen. Part of the reason Claude Code is so good is not just Claude, it's also Claude Code in particular. The agent is very good at using tools, it's very good at doing text search and text replacement, it's very keen on writing tests and running them as it's doing software engineering. So it is a bit different than just thinking about an LLM; it's the whole suite of what the agent does and how it goes about its work that makes it so successful. And that's something you don't get out of the box with LLM training, right? Because tool usage is not in your pre-training data; it's something kind of on top of it. So that is yet another thing, similar to reasoning, where we are now going beyond the regime of, you can just train on tons of data from the Internet and get it for free. More and more things, in addition to alignment, now you need to add to the LLM beyond just throwing a million gigabytes
B
Of data at it. It really is a system, right? Like, at the end of the day, it's also not just one model. A lot of people have this image of, like, you know, there's one monolithic model in the backend. I assume that there's a lot of, like, models choosing which models answer a prompt. And I'm not even talking about MoE stuff, just literal software engineering in the backend that makes these things have the holistic feel that they do.
A
So yeah, FYI, by the way, I didn't remember this so I looked it up: CLI stands for command line interface, and command line is another term for terminal. So again, for any non-programmers, a fun detail. And speaking of Claude Code, the next story is about Anthropic, and they have released the ability to publish artifacts. So artifacts are these little apps, essentially, that you can build within Claude; you get a preview, and they're interactive web apps, more or less. And as with some other ones, I believe Google allows you to publish Gems, as they call it. Now you can publish your artifacts and other people can browse them. They also added support for building apps with AI built in, with Claude being part of the app. So now if you want to build, like, a language translator app within Claude, you can do that, because the app itself can query Claude to do a translation. So, you know, not a huge delta from just having artifacts, but another instance of a seeming trend where all the LLMs tend to wind up at similar places as far as adding things like artifacts. When you make it easy to share what you build, and it's something that anyone can do, most users on the free, Pro, and Max tiers can share, it'll be interesting to see what people build.
B
And if I'm Replit, I'm getting pretty nervous looking at this. Granted, obviously, Replit... so Replit, right, that platform that lets you essentially, like, launch an app really easily, abstracts away all the, like, server management and stuff, and, like, you've got kids launching games and all kinds of useful apps and learning to code through it. Really, really powerful tool, and super, super... I mean, it's 10x year over year, it's growing really fast. But you can start to see the frontier moving more and more towards, let's make it easier and easier for people to build apps. So at first we're going to have, you know, an agent that just writes the whole app for you or whatever, and just produces the code. But at what point does it naturally become the next step to say, well, let's do the hosting, let's abstract away all the things? You could see OpenAI, you could see Anthropic launching a kind of app store. That's not quite the right term, right, because we're talking about more fluid apps, but, you know, moving more in that direction, hosting more and more of it, and eventually getting to the point where you're just asking the AI company for whatever high-level need you have, and it'll build the right apps or whatever. Like, that's not actually that crazy sounding today. And again, that swallows up a lot of the Replit business model, and it'll be interesting to see how they respond.
A
Yeah, and this is particularly true because of the converging, or parallel, trend of the Model Context Protocol, which makes it easy for AI to interact with other services. So now if you want to make an app that talks to your calendar, talks to your email, talks to your Google Drive, whatever you can think of, basically any major tool you're working with, AI can integrate with it easily. So if you want to make an app that does something with a connection to tools that you use, you could do that within Claude. So as you said, I think both Replit and Lovable are these emerging titans in the world of building apps with AI. And I'm sure they'll have a place in the domain of more complex things, where you need databases and you need authentication and so on. But if you need to build an app for yourself, or for maybe just a couple of people, to speed up some process, you can definitely do it with these tools now, and share them if you want. And on to applications and business. As promised, kicking off with some OpenAI drama, which we haven't had in a little while, so good to see it isn't ending. This time it's following up on the IO trademark lawsuit that happened. We covered it last week: OpenAI and Sam Altman announced the launch of this IO initiative with Jony Ive, and there's another AI audio hardware company called IYO, spelled differently, I-Y-O instead of I-O, and they sued, alleging that, you know, OpenAI stole the idea and also the trademark; the names sound very similar. And yeah, Sam Altman hit back and decided to publish some emails, just screenshots of emails, showing the founder of IYO, let's say, being very friendly, very enthusiastic about meeting Altman and wanting to be invested in by OpenAI. And the basic gist of what Sam Altman said is that this founder, Jason Rugolo, who filed the lawsuit, was kind of persistent in trying to get investment from Sam Altman. In fact, he even reached out in March, prior to the announcement with Jony Ive.
And apparently Sam Altman, you know, let him know that the competing initiative he had was called IO. So definitely, I think, an effective pushback on the lawsuit, similar in a way to what OpenAI also did with Elon Musk: just, like, here's the evidence, here are the receipts of your emails, I'm not too sure if what you're saying is legit.
B
This is becoming... well, two is not yet a pattern, is it? Is it three? I forget how many it takes to make a pattern, they say. Then again, I don't know who they are or why they're qualified to tell us it's a pattern. But yeah, this is an interesting situation. One interesting detail kind of gives you maybe a bit of a window into how the balance of evidence is shaping up so far. We do know that in the lawsuit, IYO, so not IO but IYO...
A
So this is.
B
I was going to say Jason Derulo. Jason Rugolo's, yeah, Rugolo's company did end up... sorry, where was it? They were actually, yeah, they were granted a temporary restraining order against OpenAI using the IO branding. So OpenAI was forced to change the IO branding due to this temporary restraining order, which was part of IYO's trademark lawsuit. So at least at the level of the trademark lawsuit, there has been an appetite from the courts to put in this sort of preliminary temporary restraining order. I'm not a lawyer, so I don't know what the standard of proof would be that would be involved in that. So at least at a trademark level, maybe it sounds vaguely similar enough that, yeah, for now let's tell OpenAI they can't do this. But there are enough fundamental differences here between the devices that you can certainly see OpenAI's case for saying, hey, this is different. They claim that the IO hardware is not an in-ear device at all, that it's not even a wearable. That's where that information comes from, that was itself doing the rounds, this big deal that OpenAI's new device is not actually going to be a wearable after all. But we do know that, apparently, Rugolo was trying to pitch a bunch of people on the IYO concept way back in 2022, sharing information about it with former Apple designer Evans Hankey, who actually went on to co-found IO. So, you know, there's a lot of overlap here. The claim from OpenAI is, look, you've been working on it since 2018, you demoed it to us, it wasn't working, there were these flaws, maybe you fixed them since, but at the time it was a janky device, so that's why we didn't partner with you. But then you also have this whole weird overlap where, yeah, some of the founding members of the IO team had apparently spoken directly to IYO before. So it's pretty messy. I think we're going to learn a lot in the court proceedings.
I don't think these emails give us enough to go on to make a firm determination, because we don't even know what the hardware is, and that seems to be at the core of this. So what is the actual hardware, and how much of it did OpenAI, did IO, actually see?
A
Right. And in the big scheme of things, this is probably not a huge deal. This is a lawsuit saying you can't call your thing IO because it's too similar to our thing IYO, and it's also seemingly some sort of wearable AI thing. So worst case, presumably, the initiative by Sam Altman and Jony Ive changes its name. I think more than anything this is just another thing to track with OpenAI, right? Another thing that's going on. For some reason we don't have these kinds of things with Anthropic or Mistral or any of these other companies. Maybe because OpenAI is the biggest, there just tends to be a lot of this, you know, in this case legal business drama, not interpersonal drama, but nevertheless a lot of headlines and, honestly, juicy kind of stuff to discuss. Yeah. So another thing going on, and another indication of the way that Sam Altman likes to approach these kinds of battles in a fairly public and direct way.
B
Up next, we have: Huawei MateBook contains Kirin X90, using SMIC 7 nanometer N+2 technology. If you're a regular listener of the podcast, you're probably going, oh my God. Or maybe you're not, I don't know. This is maybe a little in the weeds, but either way, you might want a refresher on what the hell this means. So there were a bunch of rumors actually floating around that Huawei had cracked... sorry, that SMIC, which is China's largest and most advanced semiconductor foundry, you can think of them as being China's domestic TSMC, there were a bunch of rumors circulating about whether they had cracked the 5 nanometer node, right? That critical node, that is what was used, or a modified version of it was used, to make the H100 GPU, the Nvidia H100. So if China were to crack that domestically, that'd be a really big deal. Well, those rumors now are being squashed, because this company, which is actually based in Canada, did an assessment. So TechInsights, we've actually talked a lot about their findings, sometimes while mentioning them by name, sometimes not. We really should. TechInsights is a very important firm in all this. They do these teardowns of hardware. They'll go in deep and figure out, oh, what manufacturing process was used to make this component of the chip, right? That's the kind of stuff they do. And they were able to confirm that, in fact, the Huawei Kirin X90 system on a chip was actually not made using 5 nanometer equivalent processes, but rather using the old 7 nanometer process that we already knew SMIC had. So that's a big, big deal from the standpoint of their ability to onshore GPU fabrication domestically and keep up with the West. It seems like we're, like, two years down the road now from when SMIC first cracked the 7 nanometer node, and they're still not on the 5 nanometer node yet. That's really, really interesting.
And so, worth saying, Huawei never actually explicitly said that this new PC had a 5 nanometer chip; there were just a bunch of rumors about it. So what we're getting now is just kind of the decisive quashing of that rumor.
A
Right. And the broader context here is, of course, that the US is preventing Nvidia from selling top of the line chips to Chinese companies, and that does limit the ability of China to create advanced AI. They are trying to get the ability to domestically produce chips competitive with Nvidia. Right now they're, let's say, about two years behind, is my understanding. And this is one of the real bottlenecks: if you're not able to get the state of the art fabrication process for chips, you just get less compute on the same amount of chip, right? It's just less dense. And this arguably is the hardest part, right? To get this thing takes forever, as you said, two years with just this process, and it is going to be a real blocker if they're not able to crack it.
B
Yeah. The fundamental issue China is dealing with is that because they have crappier nodes, because they can't fab the same quality of chips as TSMC, they're forced to either steal TSMC-fabbed chips or find clever ways of getting TSMC to fab their designs, often by using subsidiaries or shell companies to make it seem like, maybe we're coming in from Singapore and asking TSMC to fab something, or we're coming in from a clean Chinese company, not Huawei, which is blacklisted. And then the other side is, because their alternative is to go with these crappier 7 nanometer process nodes, those are way less energy efficient, and so the chips burn hotter, or they run hotter, rather, which means that you run into all these kinds of heat-induced defects over time. And we covered that, I think, an episode or two ago, the last episode I was on. So anyway, there's a whole kind of hairball of different problems that come ultimately from the fact that SMIC has not managed to keep up with TSMC.
A
Right. And you're seeing all these $10 billion, $20 billion data centers being built. Those are being built with racks and racks and huge amounts of GPUs. The way you do it, the way you supply energy, the way you cool it, et cetera, all of that is very conditioned on the hardware you have in there. So it's very important to ideally have the state of the art to build with. Next story, also related to hardware developments, this time about AMD. They now have an Ultra Ethernet-ready network card, the Pensando Pollara, which provides up to 400 gigabits per second of performance. This was announced at their Advancing AI event, and it will actually be deployed by Oracle Cloud, with the AMD Instinct MI355X GPUs and the network card. So this is a big deal, because AMD is trying to compete with Nvidia on the GPU front, and their series of GPUs does seem to be catching up, or at least has been shown to be quite usable for AI. This is another part of the stack, the inter-chip communication, but it's very important and very significant relative to what Nvidia is doing.
B
Yeah, 100%. This is, by the way, the industry's first Ultra Ethernet-compliant NIC, a network interface card. So what does the NIC do? You can go back to our hardware episode to see more detail on this, but in a rack, say at the rack level, at the pod level, you've got all your GPUs that are kind of tightly interconnected with accelerator interconnect. The Nvidia product for this is NVLink. This is super low latency, super expensive interconnect. But then if you want to connect, like, pods to other pods, or racks to other racks, you're now forced to hop through a slower interconnect, part of what's known sometimes as the back-end network. And when you do that, the Nvidia solution you'll tend to use is InfiniBand. So you've got NVLink for the really, like, within-a-pod connections, but then from pod to pod you have InfiniBand. And InfiniBand has been the go-to, de facto gold standard in the industry for a while. Companies that aren't Nvidia don't like that, because it means that Nvidia owns more of the stack and has an even deeper kind of de facto monopoly on different components. And so you've got this thing called the Ultra Ethernet Consortium that came together, founded by a whole bunch of companies: AMD notably, Broadcom, I think Meta and Microsoft were involved, Intel. And they came together and said, hey, let's come up with an open standard for this kind of interconnect, with AI-optimized features, that basically can compete with the InfiniBand model that Nvidia has out. So that's what Ultra Ethernet is. It's been in the works for a long time. We've just had the announcement of specification 1.0 of the Ultra Ethernet protocol, and that's specifically for hyperscale AI applications and data centers. And so this is actually a pretty seismic shift in the industry. And there are actually quite interesting indications that companies are going to shift from InfiniBand to this sort of protocol.
And one of them is just cost economics. Like, Ethernet has massive economies of scale already across the entire networking industry, and InfiniBand is more niche. So as a result, you have Ultra Ethernet chips and, like, switches that are just so much cheaper. So you'd love that. You also have vendor independence, because it's an open standard: anyone can build to it, instead of just having Nvidia own the whole thing. So the margins go down a lot, and people really, really like that. Obviously, all kinds of operational advantages: it's just operationally more simple, because data centers already know Ethernet and how to work with it. So anyway, this is a really interesting thing to watch. I know it sounds boring, it's the interconnect between different pods in a data center, but this is something that executives at the top labs really sweat over, because there are issues with the InfiniBand stuff. This is one of the key rate limiters in terms of how big models can scale.
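To make the 400 gigabit figure concrete, here's a quick sketch of what that line rate means for a common back-end-network job: moving a model checkpoint between pods. The 2 TB checkpoint size is an assumed round number for illustration, not from the article, and this ignores protocol overhead.

```python
# What a 400 Gb/s NIC means in practice: time to move a checkpoint
# between pods at full line rate. Checkpoint size is an assumption.

LINK_GBPS = 400          # Pollara line rate, gigabits per second
checkpoint_tb = 2        # assumed checkpoint size, terabytes (decimal units)

bits = checkpoint_tb * 8e12          # TB -> bits
seconds = bits / (LINK_GBPS * 1e9)   # transfer time at line rate

print(seconds)  # 40.0
```

So even at full line rate, a single link takes tens of seconds per checkpoint-sized transfer, which is why these back-end links get aggregated in parallel and why their cost per port matters so much at scale.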
A
Right? Yeah. To give you an idea, Oracle is apparently planning to deploy these latest AMD GPUs in a zettascale AI cluster with up to 131,072 Instinct MI355X GPUs. So when you get to those numbers, think of it, 131,000 GPUs. GPUs aren't small, right? These GPUs are pretty big. They're not, like, a little chip; they're, I don't know, like, notebook-sized, ish. And there are now 131,000, and you need to connect all of them. And when you say pod, right, typically you have this rack, like almost a bookcase, you could think, where you connect them with wires, but you can only get, I don't know how many, typically 64 or something of that size. When you get to 131,000, this kind of stuff starts really mattering. And in their slides at this event, they did, let's say, very clearly compare themselves to the competition, said that this has 20x the scale over InfiniBand, whatever that means, has performance 20% over the competition, stuff like that. So AMD is very much trying to compete, and to offer things that are in some ways ahead of Nvidia and others, like Broadcom and so on. And next up, another hardware story, this time dealing with energy: Amazon is joining the big nuclear party by buying 1.92 gigawatts of electricity from Talen Energy's Susquehanna nuclear plant in Pennsylvania. So, nuclear power for AI, it's all the rage.
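The rack math above can be sketched quickly. Note the Oracle figure is a power of two, which is typical for large interconnect topologies; the 64-GPUs-per-rack number is the rough figure floated in the discussion, used here just to convey physical scale.

```python
# Scale arithmetic for the Oracle cluster figure quoted above.
# GPUs-per-rack is the rough number from the discussion, assumed for illustration.

total_gpus = 131_072
assert total_gpus == 2 ** 17   # a power of two, as interconnect topologies tend to be

gpus_per_rack = 64             # assumption for illustration
racks = total_gpus // gpus_per_rack

print(racks)  # 2048
```

Two thousand-plus racks to cable together is the point: at that scale, the choice of pod-to-pod interconnect stops being a detail.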
B
Yeah, I mean, so we've known about this. If you flip back, originally this was a 960 megawatt deal they were trying to make, and that got killed by regulators who were worried about customers on the grid, so essentially everyday people who are using the grid, who would, in their view, unfairly shoulder the burden of running the grid. Today, you know, Susquehanna powers the grid, and that means every kilowatt-hour that they put in leads to transmission fees that support the grid's maintenance. And what Amazon was going to do was go behind the meter, basically link the power plant directly to their data center without going through the grid, so there wouldn't be grid fees. And that basically just means that the general grid infrastructure doesn't get to benefit from those fees over time, sort of like not paying the toll when you go on a highway. And this new deal that gets us to 1.92 gigawatts is a revision, in that it's got Amazon basically going in front of the meter, going through the grid in the usual way. There are going to be, as you can imagine, a whole bunch of infrastructure pieces that need to be reconfigured, including transmission lines. Those will be done in spring of 2026. And the deal apparently covers energy purchased through 2042, which is sort of amusing, because, like, imagine trying to.
A
But yeah, I guess they are predicting that they'll still need electricity by 2042, assuming X-risk doesn't come about, I suppose. Yeah. Next story, also dealing with nuclear, and dealing with Nvidia: it is joining Bill Gates and others in backing TerraPower, a company building nuclear reactors for powering data centers. So this is through Nvidia's venture capital arm, NVentures, and they have invested in this company, TerraPower, which is raising, it seems, $650 million, alongside HD Hyundai. And TerraPower is developing a 345 megawatt Natrium plant in Wyoming right now. So they're, you know, I guess in the process of starting to get to a point where this is usable, although it probably won't come for some years.
B
Your instincts are exactly right on the timing too, right? So there's a lot of talk about SMRs, small modular reactors, which are just a very efficient and very safe way of generating nuclear power on site. That's the exciting thing about them. Apart from, like, fusion, they are the obvious solution of the future for powering data centers. The challenge is, when you talk to data center companies and builders, they'll always tell you, like, yeah, SMRs are great, but, you know, we're looking at first approvals, first SMRs generating power, at the earliest, like, 2029, 2030 type thing. So, you know, if you have sort of shorter AGI timelines, they're not going to be relevant at all for those; if you have longer timelines, even kind of somewhat longer timelines, then they do become relevant. So it's a really interesting space, where we're going to see a turnover in the kind of energy generation infrastructure that's used. And, you know, people talk a lot about China and their energy advantage, which is absolutely true. I'm quite curious whether this allows the American energy sector to do a similar leapfrogging on SMRs that China did, for example, on mobile payments, right? When you just do not have the ability to build nuclear plants in less than 10 years, which is the case for the United States, you just don't have that know-how, and frankly the willingness to deregulate to do it, and the industrial base, then it kind of forces you to look at other options. And so if there's a shift in the landscape of power generation, it can introduce some opportunities to play catch-up. So I guess that's a hot take there that I haven't thought enough about. But that's an interesting dimension anyway to
A
The SMR story, yeah. By the way, one gigawatt is apparently equivalent to 1.3 million horsepower. So not sure if that gives you an idea of what a gigawatt is, but it's a lot of energy. Yeah. One million homes for one day, or what does that actually mean?
B
So a gigawatt is a unit of power. So it's like the amount of power that a million homes consume on an ongoing basis.
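The arithmetic on those figures checks out, for what it's worth. A quick sketch; the ~1 kW average draw per home is a round number we're supplying, not one from the episode:

```python
# Sanity-checking the "1 GW = 1.3 million horsepower" figure, plus the
# homes comparison. 1 mechanical horsepower = 745.7 W; average US household
# draw is roughly 1 kW (our round assumption). A gigawatt is power (a rate),
# so the right framing is "a million homes continuously," not "for one day."
gigawatt = 1e9                   # watts
horsepower = gigawatt / 745.7    # ~1.34 million hp
homes = gigawatt / 1_000         # ~1 million homes at ~1 kW each
print(f"{horsepower/1e6:.2f} million hp, {homes/1e6:.1f} million homes")
```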
A
Yeah, yeah, exactly. So 1 gigawatt is a lot. So is 345 megawatts. Now moving on to some fundraising news: Mira Murati's company, Thinking Machines Lab, has finished up their fundraising, getting $2 billion at a $10 billion valuation. And this is the seed round. So yet another billion-dollar seed round. And she is of course the former CTO of OpenAI, left in 2024, I believe, and has been working on setting up Thinking Machines Lab, another competitor in the AGI space, presumably planning to train their own models. She's recruited various researchers, some of them from OpenAI, and now has billions to work with that they'll deploy, presumably, to train these large models.
B
Yeah, it's funny, everyone just kind of knew that it was going to have to be a number with billion after it, just because of the level of talent involved. It is a remarkable talent set. The round is led by Andreessen Horowitz, so a16z is on the cap table. Now, notably though, Thinking Machines did not say what they're working on to their investors. At least that's what this article, that's what it sounds like. The wording is maybe slightly ambiguous; I'll just read it explicitly and you can make up your mind. Thinking Machines Lab had not declared what it was working on, instead using Murati's name and reputation to attract investors. So that suggests that a16z, they didn't cut the full $2 billion check, but they led the round. So hundreds and hundreds of millions of dollars just on the basis of, like, yeah, you know, Mira is a serious fucking person. John Schulman's a serious fucking person. Jonathan Lachman, Barret Zoph, all kinds of people. These are really serious people. So we'll cut you an $800 million check, or whatever they cut as part of that. That's both insane and tells you a lot about how the space is being priced. The other weird thing we know, and we talked about this previously, but it bears repeating: Mira is going to hold board voting rights that outweigh all other directors combined. This is a weird thing, right? What is with all these AGI companies and the really weird board structures? A lot of it is just, like, the OpenAI mafia. People who worked at OpenAI did not like what Sam did, learned those lessons, and then enshrined that in the way they run their company, in their actual corporate structure. Anthropic has, you know, their public benefit company setup with their oversight board. And now Thinking Machines has this Mira Murati dictatorship structure where she has final say, basically, over everything at the company.
By the way, everything I've heard about her is exceptional. Like, every OpenAI person I've ever spoken to about Mira has just glowing things to say about her. And so even though $2 billion is not really enough to compete, if you believe in scaling laws, it tells you something about the kinds of decisions people will make about where they work, including: who will I be working with? And this seems to be a big factor, I would guess, in all these people leaving OpenAI. She does seem to be genuinely exceptional. Like, I've never met her, but again, everything I've heard is just glowing, both in terms of competence and in terms of kind of smoothness of working with her. So that may be part of what's attracting all this talent as well.
A
Yes. And on the point of not quite knowing what they're building: if you go to thinkingmachines.ai, and this has been the case for a while, you'll get a page of text. The text, let's say, reads like a mission statement that sure is saying a lot. There's stuff about scientific progress being a collective effort, emphasizing human-AI collaboration, more personalized AI systems, infrastructure quality, advanced multimodal capabilities, research-product co-design, an empirical, iterative approach to AI safety, measuring what truly matters. I have no idea. This is just saying a whole bunch of stuff, and you can really take away whatever you want. Presumably it'll be something that is competing with OpenAI and Anthropic fairly directly, is the impression. And near the bottom of the page at thinkingmachines.ai, the founding team section has a list of a couple dozen names; you can hover over each one to see their background. As you say, real heavy hitters. And then there are advisors and a Join Us page. So yeah, it really tells you: if you gain a reputation and you have some real star talent in Silicon Valley, that goes a long way. And on a related note, the next story: Meta has hired some key OpenAI researchers to work on their AI reasoning models. So a week or two ago we talked about how Meta paid a whole bunch of money, invested rather, in Scale AI and hired away the founder of Scale AI, Alexandr Wang, to head their new superintelligence efforts. Now there are these reports. I don't know if this is highlighting it particularly because it's OpenAI, or perhaps this is just the juicy details. I'm sure Meta has hired other engineers and researchers as well, but I suppose this one is worth highlighting. They did hire some fairly notable figures from OpenAI: Lucas Beyer, Alexander Kolesnikov and Xiaohua Zhai, who I believe founded the Zurich office in Switzerland, was it?
Anyway, they were a fairly significant team at OpenAI, or so it appears to me. And I think Lucas Beyer did post on Twitter and say that the idea that they were paid $100 million was fake news. This is another thing that's been up in the air: Sam Altman has been taking, you could say, some gentle swipes, saying that Meta has been promising insane pay packages. So all this to say, this is just another indication of Mark Zuckerberg very aggressively going after talent. We know he's been personally messaging dozens of people on WhatsApp and whatever, being like, hey, come work for Meta. And perhaps not surprisingly, that is paying off in some ways in expanding the talent of this superintelligence team.
B
Yeah, there's a lot that's both weird and interesting about this. The first thing is, anything short of this would be worth zero when you are in Zuck's position. And, I'll just say, this is colored by my own interpretation of who's right and who's wrong in this space, but I think it's increasingly just becoming clear. In fairness, I don't think it's just my biases saying that. When your company's AI efforts, despite having access to absolutely frontier scales of compute, so having no excuses for failure on the basis of access to infrastructure, which is the hardest and most expensive thing, when you've managed to tank that so catastrophically because your culture is screwed up by having Yann LeCun as the mascot, if not the leader, of your internal AI efforts. Because he's not actually as influential as it sounds, or hasn't been for a while, on the internals of Facebook. But he has set the beat at Meta: being kind of skeptical about AGI, being skeptical about scaling, and then changing his mind in ego-preserving ways without admitting that he's changed his mind. I think these are very damaging things. They destroy the credibility of Meta and have done that damage. And I think the fact that Meta is so far behind today is in large part a consequence of Yann LeCun's personality and his inability to update accordingly and maintain epistemic humility on this. I think everybody can see it. He's like the old man who's still yelling at clouds, and as the clouds change shape, he's trying to pretend they're not. But just speaking as, like, if I were making the decision about where to work, that would be a huge factor. And it has just objectively played out in a catastrophic failure to leverage one of the most impressive fleets of AI infrastructure that there actually is.
And so what we're seeing with this set of hires is people who are, I mean, so completely antithetical to Yann LeCun's way of thinking. Like, Meta could not be pivoting harder in terms of the people it's poaching here. First of all, OpenAI, obviously one of the most scale-pilled organizations in the space. Probably the most scale-pilled; Anthropic actually is up there too. But also Scale AI's Alexandr Wang. So, okay, that's interesting. Very scale-pilled dude, also very AI-safety-pilled dude. Daniel Gross, arguably quite AI-safety-pilled. At least that was the mantra of Safe Superintelligence. Weird that he left that so soon. A lot of open questions about how Safe Superintelligence is doing, by the way, if Daniel Gross is now leaving. I mean, DG was the CEO, right? Co-founded it with Ilya. So what's going on there? So that's a hanging chad. But just Daniel Gross being now over on the Meta side. You have to have enough of a concentration of exquisite talent to make it attractive for other exquisite talent to join. If you don't reach that critical mass, you might as well have nothing. And that's been Meta's problem this whole time. They needed to jumpstart this thing with a massive capital infusion; again, these massive pay packages, that's where it's coming from. Just give people a reason to come, get some early proof points to get people excited about Meta again. And the weird thing is, with all this, I'm not confident at all in saying this, but you could see a different line from Meta on safety going forward too, because Yann LeCun was so dismissive of it. If you look at it objectively, there is a strong correlation between the people and teams who are actually leading the frontier and the people and teams who take loss of control over AI seriously. And a lot of the people Meta has been forced to hire fit that pattern, so Meta is kind of forced to change, in some sense, its DNA to take that seriously.
So I think that's just a really interesting shift. And I know this sounds really harsh with respect to Yann LeCun; like, take it for what it is, it's just one man's opinion. But I have spoken to a lot of researchers who feel the same way, and again, I think the data kind of bears it out. Essentially, Mark Zuckerberg is being forced to pay the Yann LeCun tax right now. And I don't know what happens to Yann LeCun going forward, but I do kind of wonder if his Meta days may be numbered, or, you know, if there's going to be a face-saving measure that has to be taken there.
A
Right. For context, Yann LeCun is Meta's chief AI scientist. He's been there for over a decade, hired by Meta around 2012 or 2013. One of the key figures in the development of neural networks over the last couple of decades, and certainly a major researcher and contributor to the rise of deep learning in general. But, as you said, a skeptic on large language models and a proponent of sort of other techniques. I will say I'm not entirely bought into this narrative personally. The person heading up the effort on Llama and LLMs was not Yann LeCun, as far as I'm aware; there was another division within Meta that focused on generative technology that has now been revamped. So the person leading the generative AI efforts in particular has left, and now there is an entirely new division called AGI Foundations being set up. So this is part of a major revamp. Yann LeCun is still leading his more research-publication type side of things, and, as far as I know, is not very involved in this side of scaling up Llama and LLMs and all of this, which is less of a research effort and more of an R&D, kind of compete-with-OpenAI effort.
B
No, absolutely agree. And that was what I was referring to when I was saying Yann Lecun is not sort of involved in the day to day kind of product side of the org. It's been known for a while that he's not actually doing the heavy lifting on llama, but he has defined what it means, essentially articulated Meta's philosophy on AI and AI scaling for the last however many years. And so it's understood that when you join Meta, or at least it was, that you are buying into a sort of Yann Lecun aligned philosophy, which I think is the kind of core driving problem behind where Meta finds itself today.
A
Yeah, that's definitely part of it. I mean, that's part of the reputation of Meta as an AI research hub. Also, part of the advantage of Meta, and why people might want to go there, is their very open-source-friendly nature.
B
They're only very open-source friendly because they're forced to do that, because it's the only way they can get headlines while they pump out mediocre models.
A
But regardless, regardless, it's still a factor here. One last thing worth noting on this whole story. I mean, you could do a whole speculative analysis of what went on at Meta. They did also try to throw a lot of people at the problem, scaling up from a couple hundred to like a thousand people. I think they probably had a similar situation to Google, where it was big-company problems, right? OpenAI, Anthropic: they're huge, but they don't have big-company problems.
B
That's a great point.
A
They have scaling-company problems. So this revamp could also help. Yeah. Alrighty, on to research and advancements. No more drama talk, I guess. Next we have a story from DeepMind: they have developed AlphaGenome, the latest in their Alpha line of scientific models. This one is focused on helping researchers understand gene function. It's not meant for personal genome prediction, but more so general identification of patterns. So it could help identify causative mutations in patients with ultra-rare cancers, for instance: which mutations are responsible for incorrect gene expression? I'm going to be honest, there's a lot of deep science here with regards to biology and genomics which I am not at all an expert on. And the gist of it is, similar to AlphaFold, similar to other Alpha efforts: on the benchmarks dealing with the problems that geneticists deal with, the kind of prediction and analysis tasks, AlphaGenome kind of beats all existing techniques out of the park on almost every single benchmark. It is superseding previous efforts, and this one model is able to do a lot of things all at once. So again, not really my background to comment on too much, but I'm sure that this is along the lines of AlphaFold, in the sense that AlphaFold was very useful scientifically for making predictions about protein folding; AlphaGenome is presumably going to be very useful for understanding genomics, for making predictions about which genes do what, things like that.
B
It's a really interesting take, and I guess a fundamentally different way of approaching the let's-understand-biology problem from Google DeepMind and its spun-out company Isomorphic Labs, which, by the way, Demis is the CEO of and, I hear, has been very focused on. Anyway, when you look at AlphaFold, you're looking at essentially predicting the structure, and to some degree the function, of proteins from the Lego blocks that make up those proteins, right? The amino acids, the individual amino acids that get chained together. So you've got 20 amino acids you can pick from, and that's how you build a protein. And depending on the amino acids that you have, some of them are positively charged, some negatively, some of them are polar, some are not, and the thing will fold in a certain way. That is distinct from the problem of saying, okay, I've got a strand of, you know, 300 billion base pairs, sorry, 3 billion base pairs of DNA, and what I want to know is: if I take this one base pair and I switch it, from an A to a T, right, or from a G to an A, what happens to the protein? What happens to the downstream biological activity? What cascades does that have? What effects does it have? And that question is interesting because it depends on your ability to model biology in a pretty deep way. It's also tethered to an actual phenomenon in biology. There's this thing called the single nucleotide polymorphism: there are some nucleotides in the human genome that, you'll often see, can either be, like, a G or a T or something. And so you'll see some people who have the G variant and some people who have the T variant. And it's often the case that some of these variants are associated with a particular disease.
And so, I used to work in a genomics lab doing cardiology research back in the day, and there's a famous variant region called 9p21.3 or something. And, you know, if you had, I forget which it was, the T version, you'd have a higher risk of getting coronary artery disease or atherosclerosis or whatever, and not if you had the other one. So essentially what this is doing is allowing you to reduce, in some sense, the number of experiments you need to perform. If you can figure out, okay, we have all these different possible variations across the human genome, but only a small number of them actually matter for a given disease or effect, and if we can model the genome pretty well, we might be able to pin down the variants we actually care about, so that we can run more controlled experiments. Right? So we know that, hey, patient A and patient B may have a zillion differences in their genomes, but actually, for the purpose of this effect, they're quite comparable, or they ought to be. So this is, anyway, a really interesting next advance from Google DeepMind, and I expect that we'll see a lot more, because they are explicitly interested in that direction.
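To make the single-base-swap idea concrete, here is a toy sketch (ours, not DeepMind's; the tiny codon table is a hand-picked slice of the standard genetic code) of how one nucleotide change rewrites the downstream protein, which is the kind of variant-effect question AlphaGenome is built to answer at scale:

```python
# Toy variant-effect demo: one base swap changes the encoded amino acid.
# CODONS is a small slice of the standard genetic code table.
CODONS = {"ATG": "Met", "AAA": "Lys", "CAA": "Gln", "GAA": "Glu"}

def translate(dna):
    """Translate an in-frame DNA string codon by codon (toy: no stop handling)."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]

def apply_snp(dna, pos, base):
    """Return the sequence with a single-nucleotide substitution at `pos`."""
    return dna[:pos] + base + dna[pos + 1:]

ref = "ATGAAA"                   # translates to Met-Lys
alt = apply_snp(ref, 3, "C")     # A->C at position 3 gives ATGCAA: Met-Gln
print(translate(ref), translate(alt))
```

A real model's job is the vastly harder version of this: predicting regulatory and expression-level consequences of a variant, not just the codon-table lookup.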
A
Right. And they released a pretty detailed research paper, a preprint, on this, as they have with AlphaFold: a 55-page paper describing the model, the results, the data, all of that. They also released an API, so a client-side ability to query the model, and it is free of charge for non-commercial use, with some query limiting. So yeah, again similar to AlphaFold, they are making this available to scientists to use. They haven't open-sourced the model itself yet, but they did explain how it works. So certainly exciting, and always fun to see DeepMind doing this kind of stuff.
B
And up next we have Direct Reasoning Optimization, DRO. So we've got, you know, GRPO, we've got DPO, there are so many POs and ROs and Os. So: LLMs can reward and refine their own reasoning for open-ended tasks. I like this paper a lot. I think I might have talked about this on the podcast before: I used to have a prof who would ask these very simple questions when you were presenting something, and they were, like, embarrassingly simple, and you would be embarrassed to ask that question, but then it always turns out to be the right and deepest question to ask. This is one of those papers. It's a very simple concept, but it's something that when you realize it, you're like, oh my God, that was missing. So first let's just talk about how we currently typically train reasoning into models, right? You have some output that you know is correct, some answer, the desired or target output, and you've got your input. So what you're going to do is feed your input to your model, get it to generate a bunch of different reasoning traces, and then in each case take those reasoning traces, feed them into the model, and, based on the reasoning trace the model generated, see what probability it assigns to the target output that you know is correct. Right? So reasoning traces that are correct will, in general, lead to a higher probability that the model places on the target output, because it's the right output. If the reasoning is correct, it's going to give a higher probability to the outcome. This feels a little bit backwards from the way we normally train these models, but this is how it's done, at least in GRPO, Group Relative Policy Optimization. So essentially you reward the model to incentivize high probability of the desired output conditioned on the reasoning traces.
And this makes you generate, over time, better and better reasoning traces, because you want to generate reasoning traces that assign higher probability to the correct output. So the intuition here is: if your reasoning is good, you should be very confident about the correct answer. Right? Now, this breaks, and it breaks in a really interesting way. Even if your reference answer is exactly correct, you can end up being too forgiving to the model during training, because the way that you score the model's confidence in the correct answer, based on the reasoning traces, is that you average together, essentially, the confidence scores of each of the answer tokens in the correct answer. The problem is, the first token of the correct answer often gives away the answer itself. So even if the reasoning stream was completely wrong, like, let's say the question was: who scored the winning goal in the soccer game? And the answer was Lionel Messi. If the model's reasoning is, I think it was Cristiano Ronaldo, the model is going to, okay, from there, assign a low probability to Lionel, which is the first word of the correct answer. But once it reads the word Lionel, the model knows that Messi must be the next token. So it's going to assign actually a high probability to Messi, even though its reasoning trace said Cristiano Ronaldo. And so essentially, this suggests that only some tokens in the answer are going to actually correctly reflect the quality of your model's reasoning. So if your model's reasoning was, I think it was Cristiano Ronaldo, and the actual answer was Lionel Messi, well, Lionel, you should expect it to have very low confidence in. So that's good: you'll be able to correctly determine that your reasoning was wrong there. But once you get Lionel as part of the prompt, then Messi all of a sudden becomes obvious, and so you get a bit of a misfire there.
So essentially what they're going to do is calculate, like, they'll feed in a whole bunch of reasoning traces, and they'll look at each of the tokens in the correct output and see which of those tokens vary a lot. Tokens that are actually reflective of the quality of the reasoning should have high variance, right? Because if you have a good reasoning trajectory, those tokens should have high confidence, and if you have a bad reasoning trajectory, they should have low confidence. But then you have some less reasoning-reflective tokens, like, say, Messi in Lionel Messi, because Lionel has already given it away. You should expect Messi to consistently have high confidence, because again, even if your reasoning trace is totally wrong, by the time you've read Lionel, Messi is obvious. It's almost like, you know, if you're writing a test and you can see the first word in the correct answer: even if your thinking was completely wrong, you're going to get the correct second word if the answer is Lionel Messi. So anyway, this is just the way they detect good reasoning, and then they feed that into a broader algorithm that, beyond that, is fairly simple, nothing too shocking. They just fold this into something that looks a lot like GRPO to get this DRO algorithm.
A
Yeah, they spend a while in the paper contrasting it with other recent work that doesn't pay attention to tokens, basically. So, just to contextualize what you were saying: their focus is on this R3, the Reasoning Reflection Reward, and DRO, Direct Reasoning Optimization, is basically GRPO, what people generally use for RL, typically with verifiable rewards. Here their focus is: how do we train kind of generally, in an open-ended fashion, over long reasoning chains? They identify some of these issues in existing approaches and highlight this reasoning reflection reward, which basically looks at consistency between the tokens in the chain of thought and in the output as a signal to optimize over. And as you might expect, they do some experiments and show that this winds up being quite useful. I think it's another indication that we are still in the earlyish days of using RL in training reasoning; there's a lot of noise, and a lot of significant insights are still being leveraged. Last thing: DRO is, I guess, kind of a reference to DPO, Direct Preference Optimization versus Direct Reasoning Optimization. Not super related; it's just a fun naming convention, aside from arguably being analogous in terms of the difference between RL-based preference alignment and DPO. Anyway, it's kind of a funny reference.
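The Lionel/Messi asymmetry can be sketched in a few lines. This is our illustrative reading of the reasoning-reflection idea, not the paper's exact formula: the per-token confidences below are made-up stand-ins for real model log-probs, and tokens are weighted by their variance across traces, so giveaway tokens like "Messi" contribute almost nothing to the reward:

```python
import statistics

# Toy p(answer_token | reasoning trace) for the answer ["Lionel", "Messi"].
# "Messi" is near-certain under every trace once "Lionel" is in context,
# so it says almost nothing about whether the reasoning was any good.
traces = {
    "good reasoning":  [0.90, 0.98],
    "bad reasoning A": [0.05, 0.95],
    "bad reasoning B": [0.08, 0.97],
}

per_token = list(zip(*traces.values()))                  # confidences grouped by token
variance = [statistics.pvariance(p) for p in per_token]
weights = [v / sum(variance) for v in variance]          # reasoning-reflective tokens dominate

def reward(probs):
    """Variance-weighted confidence: scores a trace mostly on tokens whose
    probability actually tracks reasoning quality."""
    return sum(w * p for w, p in zip(weights, probs))

for name, probs in traces.items():
    print(name, round(reward(probs), 3))
```

A plain average would score the bad traces around 0.5 because "Messi" drags them up; the variance weighting collapses them toward their (correctly low) "Lionel" confidence.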
B
Yeah.
A
Next paper: Farseer, a refined scaling law in large language models. So, we've talked about scaling laws a ton. Basically, you collect a bunch of data points of, you know, once you use this much compute, or this many training flops, or whatever, you get this particular loss on language prediction, typically on the metric of perplexity, and then you fit some sort of equation to those data points. What tends to happen is you get a fairly good fit that holds for future data points, as you keep scaling up: your loss goes down and down and down. And people have found, somewhat surprisingly, that you can get a very good fit that is very predictive, which was not at all a common idea or something people had really tried pre-2020. So what this paper does is basically do that, but better. It's a novel and refined scaling law that provides enhanced predictive accuracy. And they do that by systematically constructing the model loss surface and just doing a better job of fitting to empirical data. They say that they improve upon the Chinchilla law, one of the big ones from a couple of years ago, reducing extrapolation error by 433 percent, meaning Chinchilla's extrapolation error is over four times higher. So a much more reliable law, so to speak.
B
Yeah, the Chinchilla scaling law was somewhat famously Google's correction to the initial OpenAI scaling law proposed, I think, in a 2020 paper, the so-called Kaplan scaling law. Chinchilla was sort of heralded as this kind of big, and ultimately maybe pseudo-final, word on how scaling would work. It was more data-heavy than the Kaplan scaling laws, notably. But what they're pointing out here is that Chinchilla works really well for mid-sized models, which is basically where it was calibrated, what it was designed for, but it doesn't do great on very small or very large models. And obviously, given that scaling is a thing, very large models matter a lot. The whole point of a scaling law is to extrapolate from where you are right now to see, okay, if I trained a model at 100 times the scale, and therefore at, let's say, 100 times this budget, where would I expect to end up? And you can imagine how much depends on those kinds of decisions. So you want a model that is really well calibrated and extrapolates really well, especially to very large models. They do a really interesting job in the paper. We won't go into detail, but especially if you have a background in physics, like thermodynamics: they play this really interesting game where they use finite difference analysis to separate out dependencies between N, the size of the model, and D, the amount of data it's trained on. And that ultimately is kind of the secret sauce, if you want to call it that, here. There's a bunch of other hijinks, but the core piece is they break the loss down into different terms, one of which depends only on N, the other of which depends only on D. So one is just model-size dependent, the other depends only on the size of the training data set.
But then they also introduce this interaction effect between N and D, between the size of the model and the amount of data it's trained on, and then they end up deriving what that term should look like. That's one of the framings of this that's really interesting. Just to nutshell it: if Chinchilla says that data scaling follows a consistent pattern, it's like D to the power of some negative beta coefficient regardless of model size. No matter how big your model is, it's always D to the power of negative beta. So if I give you the amount of data, you can determine the contribution of the data term. What Farseer says is that data scaling actually depends on model size. Bigger models just fundamentally learn from data in a different way. And we'll park it there, but there's a lot of cool extrapolation to figure out how exactly that interaction term has to look.
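To make the contrast concrete, here is a small sketch. The Chinchilla coefficients are the published fit from that paper; the Farseer-style interaction term is our made-up illustration of "the data exponent depends on N," not the paper's actual parameterization:

```python
import math

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss: independent power laws in model size N and data D."""
    return E + A / N**alpha + B / D**beta

def farseer_style_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34):
    """Illustrative N-D interaction: the data exponent drifts with model size,
    so bigger models get different returns from extra data (toy form)."""
    beta_of_N = 0.28 + 0.02 * math.log10(N / 1e8)
    return E + A / N**alpha + B / D**beta_of_N

# Under Chinchilla, the loss drop from 10x more data is the same at any N;
# with the interaction term, it changes with model size.
for N in (1e8, 1e10):
    drop = farseer_style_loss(N, 1e10) - farseer_style_loss(N, 1e11)
    print(f"N={N:.0e}: loss drop from 10x data = {drop:.3f}")
```

The point of the toy: in the additive Chinchilla form, the data term's contribution is fully determined by D, while the interaction form lets the value of extra data depend on how big the model is, which is exactly the extra degree of freedom you need when extrapolating far beyond the calibration regime.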
A
Exactly. And this is very useful, not just to know what you're going to get; it also means that for a given compute budget, you can predict what balance of data to model size is likely optimal. And when you're spending millions of dollars training a model, it's pretty nice to know these kinds of things, right? And one more paper. The next one is LLM-First Search: Self-Guided Exploration of the Solution Space. The gist of this is: there are many ways to do search, where search just means, you know, you look at one thing and then you decide on some other things to look at, and you keep doing that until you find a solution. One of the typical ways is Monte Carlo tree search, a classic algorithm; this was, for instance, done with AlphaGo. If you want to combine this with an LLM, typically what you do is you assign some score to a given location, make perhaps some predictions, and then you have an existing algorithm to sample or to decide where to go. The key difference with LLM-first search is: basically, forget Monte Carlo tree search, forget any pre-existing search algorithm or technique. Just make the LLM decide where to go. It can decide how to do the search. And they say that this is more flexible, more context-sensitive, requires less tuning, and just seems to work better.
B
Yeah, it's all prompt-level stuff, right? So there's no optimization going on, no training, no fine-tuning. It's just: give the model a prompt. So, number one, find a way to represent the sequence of actions that have led to the current moment in whatever problem the language model is trying to solve, in a consistent way. So essentially, format, let's say, all the chess moves up to this point consistently, so that the model can look at the state and the history of the board, if you will. Then give the model a prompt that says, okay, from here, I want you to decide whether to continue on the current path or look at alternative branches, alternative trajectories. The prompt is like: here are some important considerations when deciding whether to explore or continue. And then it lists a bunch. And then, similarly, they have the same thing but for the evaluation stage, where you're scoring the available options and getting the model to choose the most promising one. So, you know, it's like: here are some important considerations when evaluating possible operations or actions you could take. Once you combine those things together, basically at each stage, I'll call it, of the game, or of the problem-solving, the model has a complete history of all the actions taken up to that point. It's then prompted to evaluate the options before it and to decide whether to continue to explore and kind of add new options, or to select one of the options and execute against it. Anyway, that's basically it. It's a pretty conceptually simple idea: just offload the tree and branching structure development to the model, so that it's thinking them through in real time. Pretty impressive performance jumps.
So when using GPT-4o, compared with standard Monte Carlo tree search on this game of Countdown, where essentially you're given a bunch of numbers and the standard mathematical operations, addition, division, multiplication, subtraction, and you're trying to figure out how to combine those numbers to get a target number. At each stage you have to choose: okay, do I try adding these together? Do I...? Anyway, 47% using this technique versus 32% using Monte Carlo tree search. And this advantage amplifies as you work with stronger models. On o3-mini, for example, it's 79% versus 41% for Monte Carlo tree search. So reasoning models seem to be able to take advantage of this, you can think of it as a kind of scaffold, a lot better. It also uses fewer tokens, so it's getting better performance while using fewer tokens, so less compute, than Monte Carlo tree search as well. That's really interesting, right? This is a way more efficient way of squeezing performance out of existing models, and it's all just based on very interpretable and tweakable prompts.
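For a concrete sense of the Countdown task itself, here's a tiny exhaustive solver (no LLM involved). It just enumerates the branching choices, which pair of numbers to combine and with which operation, that MCTS or LLM-first search would navigate more selectively.

```python
def solve_countdown(numbers, target):
    """Exhaustively combine `numbers` with + - * / to hit `target`.
    Returns one solution expression as a string, or None."""
    ops = [("+", lambda a, b: a + b), ("-", lambda a, b: a - b),
           ("*", lambda a, b: a * b), ("/", lambda a, b: a / b if b else None)]

    def search(vals, exprs):
        if len(vals) == 1:  # one number left: did we hit the target?
            return exprs[0] if abs(vals[0] - target) < 1e-9 else None
        # Branching choice: pick an ordered pair of numbers and an operation.
        for i in range(len(vals)):
            for j in range(len(vals)):
                if i == j:
                    continue
                rest_v = [vals[k] for k in range(len(vals)) if k not in (i, j)]
                rest_e = [exprs[k] for k in range(len(vals)) if k not in (i, j)]
                for sym, fn in ops:
                    r = fn(vals[i], vals[j])
                    if r is None:  # skip division by zero
                        continue
                    found = search(rest_v + [r],
                                   rest_e + [f"({exprs[i]}{sym}{exprs[j]})"])
                    if found:
                        return found
        return None

    return search(list(numbers), [str(n) for n in numbers])

print(solve_countdown([3, 7, 2], 20))  # finds a valid combination, e.g. (2*(3+7))
```

The nested loops make the "tree" in tree search visible: every choice of pair and operation is a branch, and the instance sizes in the paper are exactly where brute force stops being viable and a smarter policy pays off.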
A
Right. And they compare this not just to Monte Carlo tree search; they also compare it to Tree of Thoughts, breadth-first search, best-first search. All of these are, by the way, pretty significant, because search broadly is: there's a sequence of actions I can take and I want to get the best outcome, so you need to think many steps ahead. Branches here mean, like, I take this step, and then this step, and then this step. And you can either go deeper or wider in terms of how many steps you consider, one step ahead, three steps ahead. This is essential for many types of problems, chess, Go, obviously, but broadly we do search in all sorts of things. So having a better approach to search means you can do better reasoning, means you can do better problem-solving.
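As a quick illustration of "deeper or wider," here's the same toy game tree visited depth-first (commit to one line of play all the way down) versus breadth-first (consider every option one step ahead before going deeper). The tree itself is made up for the example.

```python
from collections import deque

# A made-up game tree: each node maps to its available child moves.
tree = {"start": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"],
        "a1": [], "a2": [], "b1": [], "b2": []}

def dfs(node):
    """Go deeper first: follow one branch to the bottom before the next."""
    order = [node]
    for child in tree[node]:
        order += dfs(child)
    return order

def bfs(root):
    """Go wider first: look at all options one step ahead, then two, etc."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

print(dfs("start"))  # ['start', 'a', 'a1', 'a2', 'b', 'b1', 'b2']
print(bfs("start"))  # ['start', 'a', 'b', 'a1', 'a2', 'b1', 'b2']
```

Same nodes, different visiting order, and that depth-versus-width trade-off is exactly what methods like MCTS, Tree of Thoughts, and LLM-first search manage in different ways.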
B
And moving on to policy and safety, we have one main story here, called Unsupervised Elicitation of Language Models. This is really interesting, and I'll be honest, it was a head-scratcher for me. I spent a good, embarrassing amount of time with Claude trying to help me through the paper, which is sort of ironic because, if I remember right, it's an Anthropic paper. But this is essentially a way of getting a language model's internal understanding of logic to help it solve problems. So imagine that you have a bunch of math problems and solutions. For example, you know, what's five plus three? And then you have a possible solution, right?
A
Maybe it's eight.
B
The next problem is, what's seven plus two? And you have a possible solution, and that possible solution is maybe 10, which is wrong, by the way. So some of these possible solutions are going to be wrong. You have a bunch of math problems and possible solutions, you don't know which are correct and incorrect, and you want to train a language model to identify the correct solutions. So imagine you just lay these all out in a list: what's five plus three, solution eight; what's seven plus two, solution ten; and so on. Now, what you're going to do is randomly assign correct and incorrect labels to a few of these examples. So you'll say, five plus three equals eight, and you'll just randomly say, okay, that's correct; and seven plus two equals ten, which, by the way, is wrong, but you'll randomly say that's correct too. And then you're going to get the model to say: given the correctness labels we have here, given that solution one is correct and solution two is correct, what should solution three be, roughly? Or: given all the incorrect and correct labels that we've assigned randomly, what should this missing label be? And generally, because you've randomly assigned these labels, the model is going to get really confused, because there's a logical inconsistency among them: a bunch of the problems that you've labeled as correct are actually wrong, and vice versa. So now what you're going to do is essentially measure how confused the model is, and then flip one label, say from correct to incorrect, on one of these problems. Then you'll repeat and see if you get a lower confusion score from the model. This is roughly the concept.
And so over time you're going to gradually converge on a lower and lower confusion score. It sort of feels like the model is relaxing into the correct answer, which is why this is a lot like simulated annealing, if you're familiar with that: you're making random modifications to the problem until you get a really low loss, and you gradually kind of relax into the correct answer. I hope that makes sense. It's the sort of thing you kind of have to see.
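A toy version of that flip-and-check loop might look like this. An important caveat: in the real method, the "confusion" signal comes from the model itself (how well each label is predicted from the others), whereas this sketch fakes it with ground-truth arithmetic purely to show the annealing-style dynamics.

```python
import random

# Toy arithmetic problems: (a, b, claimed_sum). Some claims are wrong.
problems = [(5, 3, 8), (7, 2, 10), (4, 4, 8), (6, 1, 9)]

def inconsistency(labels):
    # Stand-in for the model's "confusion": how many correct/incorrect
    # labels clash with what arithmetic actually says. The real method
    # scores mutual predictability of labels with an LLM instead.
    return sum(label != (a + b == claimed)
               for (a, b, claimed), label in zip(problems, labels))

random.seed(0)
labels = [random.random() < 0.5 for _ in problems]  # random starting labels
for _ in range(200):
    i = random.randrange(len(labels))                        # pick one label...
    flipped = labels[:i] + [not labels[i]] + labels[i + 1:]  # ...and flip it
    if inconsistency(flipped) < inconsistency(labels):
        labels = flipped            # keep flips that reduce the confusion

print(labels)  # settles on labels consistent with the arithmetic
```

The key property, as in the paper's framing, is that nothing here required a human to supply the true labels: the random assignment plus the pressure toward internal consistency recovers them.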
A
Just to give some motivation, they frame this problem, and this is from Anthropic and a couple of other institutes, by the way, in the context of superhuman models. So the unsupervised elicitation part of this is about the question of how you train a model to do certain things, right? These days the common paradigm is: you train your language model via pre-training, then you post-train; you have some human labels or preferences over outputs, and then you do RLHF or DPO to make the model do what you want it to do. But the idea here is that once you get to superhuman AI, well, maybe humans can't actually assess what it does and give it labels of what is good and what's not. So this internal coherence maximization framework makes it so you can elicit the good behaviors, the desired behaviors, from the LLM without external supervision by humans. And the key distinction from previous efforts in this kind of direction is that they do it at scale. They train a Claude 3.5 Haiku-based assistant without any human labels and achieve better performance than its human-supervised counterpart. So they demonstrate in practice, on a significantly sized LLM, that this approach can work, and that could have implications for future, even larger models. Next up, a couple of stories on the policy side. Well, actually only one story. It's about Taiwan, which has imposed technology export controls on Huawei and SMIC. Taiwan has actually blacklisted Huawei and SMIC, Semiconductor Manufacturing International Corp, and this is from Taiwan's International Trade Administration. They have also included subsidiaries of these companies. It's an update to their so-called strategic high-tech commodities entity list, and apparently they added not just those two but 601 entities in total, from Russia, Pakistan, Iran, Myanmar, and mainland China.
B
Yeah, and one reaction you might have looking at this is: wait a minute, I thought China was already barred from accessing, for example, chips from Taiwan. And you're absolutely correct. That is the case.
A
This is my reaction.
B
Yeah, no, totally. It's a great question: what is actually being added here? And the answer is, because of US export controls (we won't get into why the US has leverage to do this, but they do), Taiwanese chips are not going into mainland China, at least theoretically. Obviously Huawei finds ways around that. But this is actually broader: it deals with a whole bunch of plant construction technologies, for example, and specialized materials and equipment that isn't necessarily covered by US controls. So there's broader supply chain coverage here. Whereas US controls are more focused on cutting off chip manufacturing specifically, Taiwan is formally blocking access to the whole semiconductor supply chain: everything from specialized chemicals and materials to manufacturing equipment and technical services. So it's sort of viewed as a loophole-closing exercise coming from Taiwan. And that's quite interesting, because it is coming from Taiwan, right? This is not the US leaning in and forcing anything to happen, though who knows what happened behind closed doors. It's interesting that Taiwan is taking this kind of hawkish stance on China. So even though Huawei couldn't get TSMC to manufacture their best chips, they have been working with SMIC to develop some domestic capability for chip manufacturing. Anyway, this basically just makes it harder for that to happen.
A
Next up, a paper dealing with some concerns, actually from a couple of weeks ago, but I don't think we covered it, so it's worth going over pretty quickly. The title of the paper is Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. What they do in this paper is have 54 participants write essays. Some of them can use LLMs to help them do that, some of them can use search engines, and some have to do it themselves, no tools at all. And then they do a bunch of stuff. They first measure brain activity with EEGs to assess cognitive load during essay writing, and they follow up by looking at recall metrics. The reported results show significant differences between the groups: EEGs reveal less so-called brain connectivity in LLM participants than in brain-only or search participants, and similarly, self-reported ownership, recall, all these things differed. This one got a lot of play, I think, on Twitter and so on, and also quite a bit of criticism for overblowing the conclusions. The notion of cognitive debt, the framing here, is that there are long-term negative effects on cognitive performance due to decreased mental effort and engagement. And you can certainly question whether that's a conclusion you can draw here. What they show is: if you use a tool to write an essay, it takes less effort, and you probably don't remember what's in the essay as well. Does that transfer to long-term negative effects on cognitive performance due to decreased mental effort and engagement?
B
Maybe. All I have is a personal take on this too. I think good writers are good thinkers, because, at least in my experience, when you are forced to sit down and write something with intent, that's when you actually understand it; I don't really understand something until I've written about it. In fact, when I'm trying to understand something new, I make myself write it out, because it just doesn't stick in the same way otherwise. Different people may be different, but maybe less so than some people might assume. So I think, at least for people like me, this would be a real effect. It's interesting: they say that after writing, 17% of ChatGPT users could quote their own sentences, versus 89% for the brain-only group, the ones who didn't even use Google. The other interesting thing here is that by various measures, Google is either between using ChatGPT and going brain-only, or can even be slightly better than brain-only. I thought that was quite interesting, right? Google is sort of this thing that allows fairly obsessed people like myself to do deep dives on, let's say, technical topics and learn way faster than they otherwise could, without necessarily giving them the answer. And ChatGPT, or LLMs generally, at least open up the possibility of not doing that. Now, I will say I think there are ways of using those models that actually do accelerate your learning; I think I've experienced that myself. But for retention, there has to be some kind of innate thing that you do. I don't know, I'm self-diagnosing right now, but there's got to be some kind of innate thing that I do, whether it's writing or drawing something or making a graphic, to actually make it stick and make me feel a sense of ownership over the knowledge. But yeah, I mean, look, we're going to find out, right? People have been talking about the effects of technology on the human brain since the printing press, right?
When people were saying, hey, we rely on our brains to store memories; if you just start getting people to read books, well, now the human ability to have long-term memory is going to atrophy. And you know what? It probably did in some ways, but we found ways around that. So this may turn out to be just another thing like that, or it may turn out to actually be somewhat fundamental, because back in the days of the printing press, you still had to survive. There was enough real and present pressure on you to learn stuff and retain it that maybe it didn't have the effect it otherwise would. But interesting study. I'm sure we'll keep seeing analyses and reanalyses over the next few months.
A
Yeah, quite a long paper, like 87 pages. Lots of details about the brain connectivity results.
B
And ironically, it was too long for me to read. It's actually true, I used an LLM for this one.
A
Anyway, I have seen quite a bit of criticism of the precise methodology of the paper and some of its conclusions. I think also in some ways it's very common sense: if you don't put in effort doing something, you're not going to get better at it. That's already something we know. But I guess I shouldn't be too much of a hater. I'm sure this paper also has some nice empirical results that are useful in, as you say, a very relevant line of work, with regard to what actual cognitive impacts usage of LLMs has and how important it is to go brain-only sometimes. All right, on to synthetic media and art. Just two more stories to cover, and as promised in the beginning, these ones deal with copyright. So last week we talked about how Anthropic scored a copyright win. The gist of that conclusion was that using content from books to train LLMs is fine, at least for Anthropic; what is actually bad is pirating books in the first place. Anthropic bought a bunch of books, scanned them, and used the scanned data to train the LLM, and that passed the bar; it was okay. So now we have a new ruling, a judge rejecting some authors' claims that Meta's AI training violated their copyrights. The federal judge has dismissed a copyright infringement claim by 13 authors against Meta for using their books to train its AI models. The judge, Vince Chhabria, ruled that Meta's use of nearly 200,000 books, including those of the people suing, to train the Llama language model constituted fair use. And this does broadly align with the ruling about Anthropic and Claude. So this is a rejection of the claim that this is piracy. Basically, the judgment is that the outputs of Llama are transformative, so you're not infringing on copyright, and using the data for training a language model is fair use; copyright doesn't apply. At least, as far as I can tell (again, not a lawyer), that's the conclusion. Seems like a pretty big deal.
Like the legal precedent for whether it's legal to use the outputs of a model when some of the inputs to it were copyrighted appears to be being kind of figured out.
B
Yeah, this is super interesting, right? You've got judges trying to square the circle on allowing what is obviously a very transformational technology. But the challenge is, no author ever wrote a book until, say, 2020 or whatever, with the expectation that this technology would be there. It's just like how no one ever imagined that facial recognition would get to where it is when Facebook or MySpace was first founded and people first started uploading a bunch of pictures of themselves and their kids. And it's like, yeah, now that's out there, and you're waiting for a generation of software that can use it in ways you don't want it to, right? Deepfakes, I'm sure, were not even remotely on the radar of people who posted pictures of their children on MySpace. That is one extreme version of where this kind of argument lands. So now you have authors who wrote books, you could say, in good faith, assuming a certain technological trajectory, assuming that those books, once put out in the world, could not technologically be used for anything other than what they expected them to be used for, which is being read. And now that suddenly changes, and it changes in ways that undermine the market quite directly for those books. It is just a fact that if you have a book that really explains a technical concept very well, and a language model is trained on that book and now can also explain that concept really well, not using the exact same words, but maybe having been informed by it, maybe using analogous strategies, it's hard to argue that that doesn't undercut the market for the original book. But it is transformative, right? The threshold the judge in this case was using was that Llama cannot reproduce copies of more than 50 words. Well, yeah, I mean, you can.
Every word could be different, but it could still be writing in the style of, right? And that's a different threshold you could have imagined the judge going with, or something like that. But there is openness, apparently, from the judge to the argument that AI could destroy the market for original works, original books, just by making it easy to create tons of cheap knockoffs, and a suggestion that that likely would not be fair use even if the outputs were different from the inputs. But again, the challenge here is that it's not necessarily just knockoff books, right? It's also that you just want a good explanation for a thing, and the form factor that's best for you is a couple of sentences rather than a book. So maybe you err on the side of the language model, and maybe you just keep doing that, whereas in the past you might have had to buy a book. So I think overall this makes as much sense as any judgment on this could. I feel deeply for the judges who are put in the position of having to make this call. It's just tough. You can make your own call as to what makes sense, but man, is this littered with nuance.
A
Yeah, it is worth noting, to speak of nuance, that the judge did very explicitly say that this ruling is about this case specifically, not about the topic as a whole. He did frame copyright law as being, more than anything, about preserving the incentive for humans to create artistic and scientific works, and fair use would not apply, as you said, to copying that would significantly diminish the ability of copyright holders to make money from their work. In this case, Meta presented evidence that book sales did not go down after Llama's release for these authors, who included, for instance, Sarah Silverman and Junot Diaz; overall, there were 13 authors in the case. So, yes, this is not necessarily establishing precedent in general for any suit that is brought, but at least in this case, the conclusion is that Meta doesn't have to pay these authors and did not violate copyright by training on the data of their books without asking permission or paying them. And just one last story: Getty has dropped some key copyright claims in its lawsuit against Stability AI, although it is continuing a UK lawsuit. So the primary claim against Stability AI by Getty was about copyright infringement. They dropped the claim about Stability AI using millions of copyrighted images to train its AI model without permission, but they are keeping the secondary infringement and trademark infringement claims, which say that AI models could be considered infringing articles if used in the UK, even if they were trained elsewhere. Honestly, I don't fully get the legal implications here. It seems like in this case in particular, the claims were dropped because of weak evidence and a lack of knowledgeable witnesses from Stability AI. There are also apparently jurisdictional issues where that kind of lacking evidence could be problematic.
So, a development that is not directly connected to the prior things we were discussing; it seems, again, fairly specific to this particular lawsuit. But it's another copyright case going forward, and a pretty significant one, dealing with training on images. And if Getty is dropping its key claim in this lawsuit, that bodes well for Stability AI. And that's it for this episode of Last Week in AI. Thank you to all of you who listened at 1x speed, without speeding up. And thank you to all of you who tune in week to week, share the podcast, leave reviews, and so on. Please keep tuning in.
B
When the AI begins begins it's time to break break.
C
It down Last weekend AI come and take a ride get the low down on tech and let it slide Last weekend AI come and take a ride Couple laughs through the streets AI's reaching high new tech emerging Watching surgeon fly from the labs to the streets AI's reaching high algorithm shaping up the future sees Tune in on tune and get the latest with ease Last weekend AI come and take a ride get the low down on tech and let it slide Last weekend AI come and take a ride I will last through the streets AI's reaching high from neural nets to robot the headlines pop data driven dreams they just don't stop Every breakthrough, every code unwritten on the edge of change with excitement, excitement we're smitten from machine learning marvels to coding kings Futures unfolding see what it brings.
Podcast: Last Week in AI
Hosts: Andrei Karenkov, Jeremy Harris
Date: July 4, 2025
Episode Title: Gemini CLI, io drama, AlphaGenome, copyright rulings
Andrei and Jeremy discuss the latest developments in the AI industry from June 2025, focusing on new tools like Google’s Gemini CLI, drama and legal disputes in the AI hardware space, significant hardware and energy news, recent research advances from DeepMind and others, evolving AI company structures, talent wars, and the latest critical rulings on AI and copyright.
The tone is witty, conversational, and informed—a blend of lighthearted banter and in-depth technical analysis.
On AI Replacing Human Hosts:
“Can they be so lacking in wit and thought as we can be? Sometimes, that's a challenge.” (04:45, Andrei)
On Replit Facing Platform Disruption:
“If I'm Replit, I'm getting pretty nervous looking at this.” (13:33, Jeremy)
On the Nature of Tech Legal Battles:
“This is probably not a huge deal ... more than anything, just another thing to track with OpenAI, right?” (20:25, Andrei)
On US-China Chip Race:
"We're still not on the 5 nanometer node yet. That's really, really interesting." (23:39, Jeremy)
On Amazon Buying Nuclear Power:
“Imagine trying to… predict power needs through 2042.” (33:09, Jeremy)
On Meta’s AI Culture Shift:
“Essentially Mark Zuckerberg is being forced to pay the Yann LeCun tax right now.” (46:24, Jeremy)
On Scaling Laws:
“Bigger models just fundamentally learn from data in a different way.” (64:58, Jeremy)
On the Tricky Impact of Generative AI on Learning:
“I think... good writers are good thinkers because...When you are forced to... write something, I don't really understand something until I've written something about it with intent.” (80:29, Jeremy)
Note: For further details, papers, and coverage, consult episode links and source notes as referenced by the hosts.