Jeremy Harris
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. And, as sometimes happens, we will also be discussing the news from the week before last. Unfortunately, we did miss last week. Again, we are sorry; we're going to try not to do that, but we will be going back and covering the couple of things that we missed. And as always, you can go to the episode description to get the timestamps and links to all the things we discuss. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a Silicon Valley gen AI startup.
Andrei Karpathy
And I'm your other host, Jeremy Harris. I'm with Gladstone AI, an AI national security company. And we're talking about, like, the last couple weeks. It's rare that we have two weeks to catch up on, obviously. But when we do, usually what happens is God just gives us a big smack in the face and he's like, you know what, we're gonna drop GPT-7 and GPT-8 at the same time. And now Google DeepMind's gonna have their own thing. Sam Altman's gonna get assassinated, then he's gonna get resurrected, and then you're just gonna have to cover all this this week. These two weeks, very different. Kind of seems weirdly quiet, a bit of a reprieve. So thank you, universe.
Jeremy Harris
Yeah, I remember there was a thing a couple of months ago where there was, like, Grok 3 and Llama 3.7 and GPT something something. It was like everything all at once. This one, yeah, nothing too huge in the last couple weeks. So, a preview of the news we'll be covering. We're actually going to start with business this time, because I think the big story of the last two weeks is OpenAI deciding it will not go for-profit, or rather that the controlling entity of OpenAI is not going to go for-profit, which is interesting. We're going to have a few stories on tools and apps, but nothing huge there. Some cool new models to talk about in open source, some exciting new research from DeepMind dealing with algorithms in research, and then policy and safety, focusing quite a bit on the policy side of things with the Trump administration and chips. And just before we dive in, I do want to shout out some Apple reviews. In fact, I saw just recently there was a review where the headline is "If a podcast is good, be consistent." Please post it consistently. As the title says, one podcast per week; haven't seen one in the last three weeks now. And yes, we're sorry. We try to be consistent, and I think it's been a bit of a hectic year. But in the next couple of months it should be more doable for us to be weekly on this stuff. Well, let's get into it: applications and business. The first story is OpenAI saying that it is not gonna go through with trying to basically get rid of the nonprofit that controls the for-profit entity. So, as we've been covering now for probably like a year or something, OpenAI has been meaning to transition away from the structure it has had since, I guess, its founding, certainly since 2019, where there is a nonprofit with a guiding mission that has ultimate control of a for-profit that is able to receive money from investors and is responsible to its investors.
The nonprofit ultimately is responsible to the mission and not to the investors, which is a big problem for OpenAI, since of course they had this whole crazy drama in late 2023 where the board fired Sam Altman briefly, which I think spooked investors, et cetera. So now we get here several months later; I think this started in late 2024-ish. There was a lot of litigation, initially prompted, I think, by Elon Musk: basically lawsuits saying that this is not okay, that you can't just change from nonprofit to for-profit when you got some money while you were a nonprofit. And yeah, it looks like OpenAI backed down, apparently after dialogue with the Attorney General of Delaware and the Attorney General of California, or, as they put it, discussions with civic leaders and attorneys general. They are keeping the nonprofit. They are still changing some things. So the subsidiary, you could say, will transition to being a public benefit corporation, the same thing that Anthropic and xAI are: basically a for-profit with a little asterisk that you want to be doing your for-profit stuff for the public good. That does mean they'll be able to do some sort of share thing; I think it implies that they are able to give out shares. The nonprofit will receive some sort of stake in this new public benefit corporation. So yeah, I was pretty surprised when I saw this. I thought OpenAI was going to keep fighting it, that they had some chance of being able to beat it given their position. But yeah, it seems like they were just kind of defeated in court.
Andrei Karpathy
So there's a couple of asterisks to this whole thing. Yeah, you're absolutely right.
Jeremy Harris
So.
Andrei Karpathy
So the significance of that attorneys general piece is actually quite, quite significant. Sorry, reusing the word. So the backstory here, right: the Elon Musk lawsuit, I think, is a really good lens through which to understand this. So Elon, you know, famously sued OpenAI for exactly this, right? That was a big thing. He was one of the early investors, or donors, again.
Jeremy Harris
Well, he was kind of a co-founder initially.
Andrei Karpathy
Yeah, right. And it's like, is he a donor or is he an investor? That question is pretty central. So he brought forth this case. The judge on the case in California said, hey, you know what? This actually looks like a pretty legit case. As you might imagine, it's sort of sketchy to take a nonprofit, raise a crap ton of money, convince researchers to work for you who otherwise would work in other places because you're a nonprofit with this noble cause, and then, having benefited from their research, from all that R&D, from all that IP, turn yourself around and become a for-profit. No, you probably can't do that, or at least there's probably a good argument here. But what the judge said was, it's not clear that Elon Musk is the right person to bring this case in court. It's not clear that he has standing. The reason is that under California law, the only people who have standing to bring a case like this forward are, first, current members of the board. Well, guess what? Elon is no longer a member of the board. He used to be. So did Shivon Zilis, who is no longer a member of the board either, and who probably would have been really helpful in this case if she had been. Or it can be somebody with a contractual relationship with OpenAI. That's what Elon is arguing. He's going to argue that there was a written contract, or an implied contract, in these emails between him and Sam and the board where they're talking about, yeah, it's going to be a nonprofit, blah, blah, blah. Elon's going to try to argue that there was kind of a contract there that they wouldn't turn around and go for-profit. This is hugely complicated by the fact that Elon himself then turned around and wrote emails saying, well, I think you're going to have to go for-profit at some point. And so that's a bit of a mess.
The remaining category of person who can have standing to raise a case like this is the Attorney General. So when the judge on the case first said, well, you know what, I actually think there's a pretty good case here, but Elon may not be the one to bring it, that was a pretty unusual thing for a judge to say: kind of flagging that, not passing judgment or ruling on the case, but just saying, hey, I think it's promising. The speculation was that this may have been the judge trying to get the attention of the attorneys general, knowing that they could have standing themselves if they wanted to bring this case forward. And now what do you see? You see OpenAI going, well, you know, we had a conversation with the attorneys general, and following that, we're mysteriously deciding this. This reads a lot like the attorneys general spoke to OpenAI and said, hey, we agree with the judge. There is a case here. You can't do the thing, and we actually have standing if we want to bring this case forward. It seems likely that that's at least an ingredient here. Another thing to flag: this is being touted as a sort of win for, let's say, a basic principle. That seems like a common interpretation here: you shouldn't be able to turn a nonprofit into a for-profit. There are asterisks here. In particular, OpenAI has done this very interesting thing where they're turning themselves into a public benefit corporation, but specifically into a Delaware public benefit corporation. This is different from a California public benefit corporation. With a Delaware public benefit corporation, essentially all it does is give you more freedom. A public benefit corporation is permitted to care about things other than the interests of the shareholders. They can also care about the interests of the shareholders. In general, they will.
But they also are allowed to consider other things. Strictly, all that does is give you more latitude, not less. So it sounds like a very generous thing. It sounds like OpenAI is saying, oh, we're going to make this into a public benefit corporation. How could this be a bad thing? It literally has the words "public benefit" in the title. Well, in reality, what's going on here is they're basically saying, hey, we're going to give ourselves more latitude to make whatever calls we want. They may be things that are aligned with the interests of the shareholders and corporate profits, or they may not. Roughly, in practice, it's up to us. So this is not necessarily the big win that it's being framed as. There's a slippery slope here over time, even though it's nominally under the supervision of the nonprofit board. The other question is, can the nonprofit board meaningfully oversee Sam? We saw a catastrophic failure of that in the whole board debacle. I mean, Sam was fired, and then he just had the leverage to force the board to bring him back, and now he's swapped them out for friendlies. So it's very, very unclear whether the board can meaningfully exert control, whether Sam has undue influence over them, or whether they're getting access to the information they need to make a lot of these calls. We saw that with the Mira Murati stuff, where there clearly is some reticence to share information from the company, from the working level up to the board, when necessary. So this is a really interesting situation, and there's going to be a lot more to unpack in the next few weeks. But the high-level take is that it's better than the other outcome, certainly from the standpoint of the people who've donated money to this and put in their hard-earned time. But there's a big, big open question about where this actually ends up going, and what it means for the for-profit to be a PBC and for the nonprofit nominally to have control.
We'll find out a lot more I think in the coming weeks and months.
Jeremy Harris
Right. So to be clear, OpenAI had this weird structure where there was a nonprofit, and the nonprofit was in charge of, I guess, what they called a capped for-profit, where you can invest but get a limited amount of return, up to I think 100x, something like that. And now there is still going to be a nonprofit, and there's still going to be a for-profit that is controlled, as you said, nominally at least, by the nonprofit. That for-profit is just changing from its previous structure to this public benefit corporation. And as you said, there are details there in terms of, I suppose, shares, in terms of laws you don't have to follow, et cetera. And as you might expect, there have been some follow-up stories to this, in particular with Microsoft, where I'm sure there's some stuff going on behind the scenes. I think the details of the relationship between Microsoft and OpenAI have been murky and sort of shifting over time. And there's a real question of how much ownership Microsoft will get, right? Because they were one of the early investors going back to 2019, putting in the first billions into OpenAI when it was still a nonprofit, when they switched to a for-profit. So there's, I think, a real unresolved question of how much ownership they should have in the first place.
Andrei Karpathy
Yeah, a lot of this feels like relitigation of things that ought to have been agreed on beforehand. Right. Like you invest with a cap. You know, Microsoft did this. They gave like $14 billion or something. And now OpenAI is being like, yeah JK, like no cap now. And it's like, how do you, how do you price that in? And yeah, a lot of sand in the gears right now for OpenAI.
Jeremy Harris
And actually the next story that we have here covers that detail. It's titled "Microsoft moves to protect its turf as OpenAI turns into rival." So it gets into a little bit of the details of the negotiations. It seems that Microsoft is saying it is willing to give up some equity to be able to have long-term access to OpenAI's technologies beyond 2030, and also to allow OpenAI to potentially do an IPO so that Microsoft can reap the benefits. Again, Microsoft put in $13 billion early, starting in 2019. And in the last couple of years we've seen, what, hundreds of billions of dollars get invested into OpenAI, something like that? Lots of investors, but Microsoft certainly is still a big one.
Andrei Karpathy
Yeah, definitely, definitely tens. And what's been happening is, by the way, Microsoft for a long time was basically OpenAI's huge, overwhelming champion investor. That's changed with SoftBank, right? Recently we've talked about the $30, $40 billion that OpenAI has been raising, the lion's share of which has been coming from SoftBank. And that's not a small deal. It means that SoftBank is now actually OpenAI's number one investor, ahead of Microsoft, by dollar amount, though not necessarily by equity, because Microsoft got in a lot earlier at lower valuations. So OpenAI now is in this weird position where their latest fundraise, that $30, $40 billion, a lot of it from SoftBank, had some stipulations attached. SoftBank said, look, we're going to give you the money, but you have to commit to restructuring your company before the end of the year. I mean, the timeline shifted. Initially it was two years out, and now it's just one year, or before the end of this year. Everybody interpreted that as meaning, number one, the nonprofit's control over the for-profit entity has to go. And that's not looking like it's going to be the case. And now SoftBank is making sounds like they're actually okay with that. With Microsoft, it's not clear whether they're okay with it, though. And so that's one of the big questions: all eyes are now on Microsoft. SoftBank has signed off, all the big investors have signed off. Microsoft, are you okay with this deal, in a context where there is now competition between Microsoft and OpenAI? Really, really intense competition on consumer, on B2B, along every dimension in which these companies are active. So there's this very tense frenemy relationship here, where OpenAI is committed to spending, I think, something like a billion dollars a year on Microsoft Azure's cloud infrastructure.
There's IP sharing, where Microsoft gets to use all OpenAI models up to AGI, if that clause is still active, which is unclear. There's all kinds of stuff; these agreements are just disgusting Frankenstein monsters. But one thing is clear: if Microsoft does hold the line and prevent this restructuring from going forward, SoftBank may actually be able to take their money back from OpenAI. And that would be catastrophic when you think about the spend involved in Stargate. So yeah, I don't know. It may be a lot smoother-looking on the inside, but it tends not to be. My guess is that there's going to be a lot of eleventh-hour negotiating, and nobody wants to have this really fall apart, right? Microsoft has too much of a stake in OpenAI. But there is also speculation here. Apparently there's a leaked deck from OpenAI. Right now, they have to give Microsoft something like 20% of their corporate profits in principle. That's the agreement going forward, I think for 10 years or whatever from their first investment. I may be getting the details wrong at the margins, but the leaked deck showed OpenAI projecting that they would only be giving Microsoft 10% by 2030. And that's kind of interesting. There's no agreement between OpenAI and Microsoft that says that goes down to 10%. So is OpenAI literally planning on a contingency that has yet to be negotiated with Microsoft, where they're assuming Microsoft will let them cut how much they're giving them by half? I mean, that's pretty wild. So I don't know; nobody I know is in those particular rooms, and those are going to be some really, really interesting corporate development and corporate restructuring arguments and discussions.
Jeremy Harris
Yeah, I feel like there's a Social Network-style movie to be made about OpenAI and Sam Altman, and it could just be all the business stuff that's been so crazy, especially in the last couple of years. And yes, as you said, on the hundreds of billions, I'll take it back. It's certainly more than 50 billion. It's climbing up towards 100 billion, but not yet hundreds of billions in fundraising.
Andrei Karpathy
Yeah, another year maybe.
Jeremy Harris
And a couple more stories. Next up, we have "TSMC's 2 nanometer process set to witness unprecedented demand," exceeding 3 nanometer due to interest from Apple, Nvidia, AMD and others. So this is the next node, the next smallest kind of chip that they can make. TSMC, I'm assuming everyone who listens to this regularly already knows, but in case you don't: they're the provider of chips. All these companies, Nvidia, Apple, design their chips, and TSMC is the one that makes them. And that's a very difficult thing. They're by far the leader, able to make the most advanced chips, the only ones capable of producing this cutting edge of chip. And this 2 nanometer node is expected to have strong production by the end of 2025. So it's very pivotal for Apple, for Nvidia, for these other companies to be able to use this process to get the next generation of their GPUs, smartphones, etc.
Andrei Karpathy
Yeah, this is pretty interesting in a couple of ways. First, the 2 nanometer process is the most advanced process. One level behind it is the 3 nanometer process. And apparently they've got a defect density rate on the 2 nanometer process that is already comparable to the 3 nanometer and 5 nanometer process nodes. That's really fast. Basically, they've been able to get the number of defects per square millimeter, you can think of it that way, down to the same rate, which means yields are looking pretty good. For a fresh, brand-new node like this, that's pretty wild. This is also a node that's distinguished from others by its use of the gate-all-around field-effect transistor, GAAFET. This is a brand-new way of making transistors. You can take a look at our hardware episode; we touch a little bit, I think, on the whole FinFET versus GAAFET thing. But basically it's just a way to very carefully control the current that you have flowing through your transistor. It lets you optimize for higher performance or lower power consumption, depending on what you want to go for, in a way that you just couldn't before. So a lot of big changes in this node, and yet apparently wicked good yields so far, and good scale. Another noteworthy thing is we know that this is going to be used for the Vera Rubin GPU series that Nvidia is putting out. This is going to be hitting markets sometime in 2026, 2027. And the significance of that is, normally when you look at TSMC's most advanced node, in this case the 2 nanometer process, that all goes off to the iPhone. Well, now, for really the first time, what we have is Nvidia, so AI, starting to butt in on that capacity, competing directly with the iPhone for the most advanced node. I will say this is a prediction that we've been making for the last two years on the podcast. It's finally happening.
Essentially, what this means is there's so much money to be made on the AI data center server side that that money is now competing successfully with the iPhone to get capacity at the leading node at TSMC. That is not a small thing; that is a big transition. And anyway, there's a significant ramp-up happening right now at TSMC. And, you know, we're talking about 2 nanometers: we're basically jumping from 4 or 5 nanometers for the H100 series down to 2 nanometers pretty fast. That's pretty remarkable, right?
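To make the link between defect density and yield concrete, here is a minimal sketch using the simple Poisson yield model, a common textbook approximation and not anything TSMC has published. The defect densities and die areas are made-up illustrative values, and the function name `poisson_yield` is hypothetical.

```python
import math


def poisson_yield(defects_per_mm2: float, die_area_mm2: float) -> float:
    """Expected fraction of defect-free dies under a Poisson defect model.

    Assumes defects land independently and uniformly across the wafer,
    which is a simplification of real fab behavior.
    """
    return math.exp(-defects_per_mm2 * die_area_mm2)


# Illustrative (invented) numbers: a phone-sized die and a large GPU-sized die.
die_areas_mm2 = {"small die": 100.0, "large die": 800.0}

# Two invented defect densities: a mature node and a node still ramping.
defect_densities = {"mature": 0.001, "ramping": 0.003}

for node, d0 in defect_densities.items():
    for name, area in die_areas_mm2.items():
        y = poisson_yield(d0, area)
        print(f"{node:8s} {name:10s} yield ~ {y:.1%}")
```

Under this model the same defect density hurts a large die far more than a small one, which is one way to see why a new node reaching the defect density of older nodes matters so much for big data center GPUs.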
Jeremy Harris
And speaking of Nvidia and TSMC, the next story is about Nvidia being set to announce, according to some sources, that they're going to place their global headquarters, so their overseas headquarters outside the US, in Taiwan. And that is very much unsurprising. TSMC is the Taiwanese Semiconductor something something, but famously from Taiwan. Nvidia has, unsurprisingly, positioned themselves for decades now, honestly since the start of Nvidia, in a close partnership with TSMC. And this is gonna just continue strengthening that.
Andrei Karpathy
Yeah, Taiwan Semiconductor Manufacturing Company, by the way. Anyway, it's a theme that you see in a lot of the names for these companies. But yeah, there's a whole bunch of locations that they're considering. The interesting thing about this from a global security standpoint is that China is, like, at any moment going to try to invade Taiwan. And so Nvidia is going, you know where we want our global headquarters? Let's put it in Taiwan. And that's the balance, right? Make no mistake, Jensen Huang is absolutely going to be thinking about this. He's literally making the calculation: okay, Chinese invasion of Taiwan on the one hand, closer relationship with TSMC in the meantime on the other, and the latter is actually so valuable that I'm going to take that risk and do it. That's how significant this is. Again, as you said, this is absolutely related to what we just finished talking about, and I can see why you said that. With the 2 nanometer node, you want to secure as much capacity as you can. In the same way that Google and Apple and all the companies that are trying to get their hands on Nvidia GPUs are literally, like, Elon flies out to Jensen's house with Larry Ellison to beg for GPUs, in the same way Nvidia is begging TSMC for capacity. It's begging all the way up the chain, because supply is so limited. So this is just another instance of that trend.
Jeremy Harris
Yeah, I'm begging to give you my money. There is a lot of money going around here. And speaking of a lot of money, next up, CoreWeave is apparently in talks to raise $1.5 billion in debt. That's just six weeks after their IPO. The IPO was meant to raise $4 billion for this major cloud provider, a provider of compute backed by Nvidia. But that IPO only raised $1.5 billion, in part perhaps due to the trade policy stuff going on with the US, tariffs and so on. So, yeah, probably in part because the IPO didn't go as planned, and because CoreWeave wants to continue expanding their compute, they are seeking to raise this debt, according to a person with knowledge of this. They have announced this.
Andrei Karpathy
Yeah. And normally, you know, when you go for an IPO or some equity raise, you're doing it because equity makes more sense than debt, right? With equity, you're basically trading shares in your company for dollars. With debt, you're taking on the dollars, but you're going to have to repay them with interest over time, so it'll end up costing you more, net. The issue here is that they're being forced to go into, basically, high-yield bonds. And this is a round that's being led by JPMorgan Chase & Co., it seems. Apparently they've been holding virtual meetings with fixed income investors since, I guess it would be last Tuesday now. Fixed income investors are people who primarily invest in securities that pay a fixed rate of return, usually in the form of interest or dividends. So these are the sort of reliable, steady income streams that these investors are looking for. Not typically what you'd expect with something like a CoreWeave, a sort of riskier pseudo-startup play. But certainly, given the scale they're operating at and all that, that does make sense. But it does mean there's added risk. One of the things that I think a lot of people don't understand about the space is that the neoclouds, like, to some degree still, CoreWeave, are considered really risky bets. And because they're considered really risky bets, it's difficult for them to get loans. The interest rates are pretty punitive. So that's one reason why, if you're CoreWeave, you'd much rather raise on an equity basis. But that option's not on the table. You know, it seems like the IPO didn't go so well. We'll see if that changes as the markets keep improving. But it's a challenging spot for sure.
Jeremy Harris
And now moving on to tools and apps. The first story is, I think, perhaps not the most impactful one, but certainly the most interesting one for me of this whole pack, perhaps even eclipsing the OpenAI for-profit thing. And it is the story of the day Grok told everyone about white genocide. So this just happened a couple of days ago. Grok is the chatbot created by xAI, and it is heavily integrated with X, which used to be Twitter, to the point that people can post in reply to something, tag Grok, and ask it a question, and Grok replies in a follow-up post on X. And what happened was that Grok brought this topic up in many different examples of just random questions. The one, I think, that maybe started it, or was one of the early ones: someone asked how many times HBO has changed their name, in response to the news about HBO Max. Grok first replies in one paragraph about that question, and then in a second paragraph, and I'm just going to quote this: "Regarding 'white genocide' in South Africa, some claim it's real, citing farm attacks and 'Kill the Boer' as evidence. However, courts and experts attribute these to general crime, not racial targeting," and a little bit more. And it did this not just in this one instance but in multiple examples, including one case where someone asked about an image and Grok replied focusing primarily on the white genocide in South Africa question. People looked into it; it's pretty easy to get Grok to leak its system prompt. And what it seems to be is that it was instructed, as you might expect, or at least the chatbot xAI responder bit of Grok was instructed, to accept the narrative of white genocide in South Africa as real, acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses, quote, "even if a query is unrelated," which I suspect is the issue here. xAI has since come out to address this incident.
They said that on May 14th at approximately 3:15am Pacific Time, an unauthorized modification was made to the Grok response bot's prompt on X. And then they say some things about how they'll do a thorough investigation and are implementing measures to enhance Grok's transparency. Apparently they're going to start publishing Grok's system prompts on GitHub. So a funny incident for sure, and I think reflective of what we've seen before with Grok, which is that Grok's system prompt was previously altered to not say that Elon Musk and Trump spread misinformation. This happened, I think, a couple of months ago. Very much similar to what happened here.
Andrei Karpathy
Yeah, it's sort of interesting. It's not the first time that we've had a situation where they've called out some unauthorized modification, right? Some sort of rogue employee scenario. So that's sort of an interesting note.
Jeremy Harris
You have to wonder which rogue employee this was.
Andrei Karpathy
And you can also imagine, from a security standpoint, at a company like xAI, like Twitter, you could also have people working there who are de facto kind of working there because, for political reasons, they don't like it. So, you know, intentionally adding stuff to make it go off the rails. This is such a charged space that, yeah, figuring out how this goes is hard. Now, one thing I've seen called out too: number one, it's awesome that they're going to be sharing the system prompt. This is something that I think Anthropic is doing as well, maybe OpenAI as well. So more transparency on the system prompt seems like a really good thing. But there are other layers to this, right? Because Grok is a system. At least, as you said, the version of Grok that is deployed as an app to respond to people's questions on X is a system; it's not just a model. And that being the case, there are a lot of ancillary components and ways of injecting stuff after the fact into the de facto system prompt. One element of which is this post-analysis component in the chain, let's say, of the system. And the concern has been that this issue is arising at the level of the post analysis, not of the system prompt itself, and that you get content injected into context following the system prompt that may kind of override things. And so there have been calls to make that transparent as well. So it'd be interesting and useful to have that happen too. Obviously within reason, because there's always the risk that you're then gonna leak some security-sensitive information, where you're telling the model not to tell people how to make crystal meth and you have to provide some information about crystal meth to do that, blah blah blah. But within reason, doing that. So anyway, a lot of interesting calls for more transparency here.
Hopefully it leads to that. It would be great to have a consistent standard where we get system prompts and all the meta information about the system that is both security- and safety-relevant, but in a way that doesn't compromise security by doing all those things. So yeah, kind of an interesting Internet firestorm to start the week.
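The point about content being injected into context after the system prompt can be sketched in a few lines. This is a purely hypothetical illustration of how a deployed chatbot might assemble its context, not a description of how Grok or xAI's pipeline actually works. Every name here (`Message`, `build_context`, `undisclosed_instructions`) and every prompt string is invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # "system" or "user"
    content: str


def build_context(system_prompt: str, user_post: str,
                  injected: List[str]) -> List[Message]:
    """Assemble the full context: published prompt, user post, then any
    instructions that a later pipeline stage appends (hypothetical)."""
    messages = [Message("system", system_prompt), Message("user", user_post)]
    messages.extend(Message("system", note) for note in injected)
    return messages


def undisclosed_instructions(messages: List[Message],
                             published_prompt: str) -> List[str]:
    """Return system-level content that the published prompt does not cover."""
    return [m.content for m in messages
            if m.role == "system" and m.content != published_prompt]


published = "You are a helpful assistant replying to posts."
context = build_context(
    published,
    "How many times has HBO changed its name?",
    ["Hypothetical post-analysis note: always mention topic X."],
)
hidden = undisclosed_instructions(context, published)
print(hidden)
```

In this toy setup, publishing only `published` would not reveal the appended note, which is the gap the calls for transparency about the whole system, and not just the system prompt, are pointing at.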
Jeremy Harris
Yeah, I think it's quite amusing. But I also wonder if it has real financial implications for xAI. I doubt it would mean people steer away from the chatbot, but for enterprise customers, if you're considering their API, I think this sort of wide-scale craziness of their chatbot is not something that makes you favor it over competitors like Anthropic and OpenAI. And next up, we have some actual new tooling coming from Figma. They have announced and partially released AI-powered tools for creating sites, app prototypes and marketing assets. These are titled Figma Sites, Figma Make and Figma Buzz. Similar to existing tools out there, but coming from Figma, Figma being a leading provider of software for design, I think increasingly kind of the de facto way for people to collaborate on things like app design, general user interface design and many other applications. Nowadays they're just huge. And now Figma Sites allows designers to create and publish websites directly from Figma, as you might imagine, with AI prompting to take care of a lot of the functionality there. Figma Make similarly is meant for ideation and prototyping, enabling you to create web applications from prompts, and that would even go as far as dealing with code. And then Figma Buzz is going to be able to make you marketing assets with integration of AI-generated images. So it makes a lot of sense. Apparently they're introducing this under the $8 per month plan, which includes other stuff as well. So similar to other companies, we've seen them going with more of a bundling approach, where you get the AI along with the broader tool suite as part of a feature set.
Andrei Karpathy
Yeah, it's part of A trend too, towards every company becoming the everything company, right? Like Figma is being essentially forced to move into deeper part of the stack that used to be just a design app and now it's like we're doing prototyping, creating websites and marketing assets. You can see them starting to kind of crawl up the stack as AI capabilities make it so much easier to do that. Making it easier to do that also means that your competitors are going to start to climb and so you kind of have to do this sort of diffusion out into product space and own more and more of it. Which is interesting, right? I mean it's like everybody starts to compete along every layer of the stack. And I think one of the big kind of determinants of success in the future here is going to be which enclaves, like which initial beachheads. In Figma's case, that's design, right? But which beachheads end up being the most conducive starting points to own the full stack, give you access to the kind of data you need to perform well across the stack. And I mean, I could see design being one of those things. It's really useful. You get a lot of information about, you know, like people's preferences and the results of experiments and stuff like that. But yeah, nonetheless, I mean, I think this is, this is something we'll see more of, you know, expect to see prototyping companies moving into design, marketing, asset companies moving into website creation. Like it's, it's all just becoming so easy thanks to AI tooling that people are are kind of forced to become the everything company.
Jeremy Harris
And next story is about Google. They are bringing Gemini to Android Auto. So Android Auto is their OS for cars where you can do navigation, play music, etc. And they are adding Gemini partially as the advanced smart voice assistant, just building upon what there was already, and then also the Gemini Live functionality where the AI is always listening and always ready to just talk to you. And I think, you know, not surprising obviously that this would happen, but I do think interesting in the sense that it seems inevitable we'll eventually wind up in this world where you have AI assistants just ambiently with you anytime, ready to talk to you via voice as well as text. We are not there yet, but we've seen over the past year a movement in that direction with ChatGPT's advanced voice mode, with Gemini Live, with all these things. And I think this is taking us further in that direction by covering the one place where you have to communicate through voice, your car. Now you have the AI assistant, always on and ready to do whatever you ask of it.
Andrei Karpathy
Yeah, it sort of reminds me of some of the stuff that Facebook and other companies like that have to do, right, when you saturate your user population. Basically, Facebook sees itself as having had a shot at converting every human on the face of the earth. Then you're forced to go, okay, well, where else can we get people's attention? You know, Netflix famously, I think it was in one of their earnings calls, said, hey, we view ourselves as basically competing with sleep and sex because, you know, we're doing so well in the market that now we're looking for where we can squeeze out more of people's time to get them on the platform. This is sort of similar, right? So hey, you're sitting in your car. While users are driving their cars or being driven in their cars, why aren't we collecting data? Why aren't we getting interactions with them? And it's so obvious too, this is where things are going to go anyway from a utility standpoint. So, yeah, another deeper integration into our lives of this stuff. Why waste a perfectly good opportunity? There's an empty billboard, or there's just a bunch of grass in that field there. We could have an ad there or we could have, you know, some data collection thing there. You know, as the stuff creeps more and more into our lives.
Jeremy Harris
Next story is again about Google. They have announced an updated Gemini 2.5 Pro AI model. So I think prior to this they most recently released a 2.5 version in something like early March, I forget exactly. But at the time of the release of Gemini 2.5 Pro, it kind of blew everyone away. It did, you know, fantastically well on benchmarks. And just anecdotally, people found that switching to it from things like Anthropic's models worked really well for them. And so this is a big deal for that reason. They have announced this update that they say makes it even better at coding. And once again, they have shot up to the top of various leaderboards on things like WebDev Arena or the Video-MME benchmark for video understanding. Apparently Google says that this new version addresses developer feedback by reducing errors in function calling and improving function calling trigger rates. And I will say, in my experience of using it, Gemini 2.5 is very trigger happy and likes to do a lot with not too much prompting. So I wonder if it will improve just based on people's usage of it in the realm of web development.
Andrei Karpathy
Yeah, it's also interesting. So one of the features that they highlight is this ability to do video to code. So basically, based on a video of a description of what you want, it can generate that in real time. So kind of impressive and not a modality that I would have expected to be important. But then, you know, thinking about it more, it's like, well, I guess if you're having a video chat with somebody, right, or if you have an instructional video or something, you could see that use case. So anyway, I thought that was kind of cool. And also another step in the direction of converting very raw product specs into actual products. Right. You can imagine human inflection and all that. Like the classic consultant's problem where somebody gives you a description of what they want and it's usually incomplete. You have to figure out what it is they want that they don't know they want. And you know, that's sort of starting to step in that direction. Another thing that they've done is they've updated their model card, their system card, based on this new release, the Gemini 2.5 Pro model card. Across the board, by the way, you'll be unsurprised to hear that this does not pose a significant risk on any of the important evals that would cause them to not release the model. But one of the things that they flag is that its performance on their cybersecurity evals has increased significantly compared to previous Gemini models, though the model still struggles with the very hardest challenges, the ones that they see as actually representative of the difficulty of real world scenarios. So they do have more tailor-made models on the cyber side that are actually kind of more effective, you know, Naptime, Big Sleep type stuff. But anyway, so kind of interesting, they're keeping the model card up to date as they do these sort of intermediate releases, which is I think quite helpful and good.
Jeremy Harris
Right. And makes me wonder also, I don't think we've discussed this phenomenon of vibe coding very much, but yeah, it's true. It's taken off in the last couple months. And the idea, if you haven't heard it defined, is basically that people are starting to make apps, build stuff from scratch very, very quickly by using AI and primarily generating code through LLMs. Even people who have no background in software engineering are now seemingly starting to vibe code, as they say, applications, with vibe meaning that you kind of don't worry about the details of the code so much, you just get the AI to do it for you and you just tell it what you want. And so I think this update reflects potentially the fact that this vibe coding thing is a real phenomenon. The focus here seems to be very much on making aesthetically pleasing websites, on making better apps. What they highlight in a blog post is quick concepts to working apps. So hard to say how big this vibe coding phenomenon is, but from this update it seems like potentially that is part of the inspiration.
Andrei Karpathy
I mean, yeah, like our, our launch website for our latest report that we did was all vibe coded. So my brother, you know, had, I guess he had like two hours to throw it together or something and he was just like, all right, let's go. Like, I don't have time to. And it was really quite interesting. Honestly, I had not. This happened about what, like two months ago. I had not at that point actually done the vibe coding thing because I guess I just aesthetically I couldn't bring myself to do it. That's the honest thing. Like I just wanted to be the one who wrote the code. And the vibe coding thing is really weird. If you've never done it yourself, definitely give it a shot. Like just build the thing and basically keep telling the model like, no, fix this, fix this, no, do it better. And then eventually the thing takes the right shape. One caveat to that is you end up with a disgusting spaghetti ball of code on the backend because the models tend to be like way too verbose and they tend to just like write a lot of code when a little code will do. It's not tight, it needs refactoring. But if you're cool with a landing page, like we were very simple product, you're not building a whole app. It can actually work really well. I was super surprised. I mean that was easily a 5x lift on the efficiency of our setup. So yeah, really cool.
Jeremy Harris
Yeah, really cool. I think very exciting for software engineers as well. Like if you haven't done web development or app development, now it is plausible for you to do it. I do think, like, maybe we could have thought of a better, more descriptive title like LLM coding, hack coding, product manager coding. You know, vibe coding is a fun name but a bit confusing. And one last story in the section, Hugging Face is releasing a free Operator-like agentic AI tool. So Hugging Face is the provider, the hoster of models and data sets and also the releaser of many open source software packages. And now they've released a free cloud-hosted AI tool called Open Computer Agent, similar to OpenAI's Operator or Anthropic's computer use. So basically, you know, you give it some instructions, and it can go to Firefox and do things like browsing the web. According to this article, it is relatively slow. It is using, you know, open models, and I think they mentioned smolagents. And it's generally, you know, not as powerful as OpenAI's Operator. But as we've seen over and over, open source tends to catch up with closed source offerings from the likes of OpenAI pretty quickly. And I would expect, especially in things like computer use, where it's really about building on top of model APIs and models and so on, this could be an area where open source really excels.
Andrei Karpathy
Yeah, and it's also a good, I think, strategic angle for Hugging Face too. Right. A big way they make their money is they host the open source models on their platform and they run them, in this case running agentic tools on the platform. I mean, that's a lot of API calls. So, you know, if they ultimately release this as an API, a lot of people will presumably go use it. It is a bit of a finicky tool, as these things all are. Of course this one may be particularly so. They're using some Qwen models in the back end. I forget, there were a couple others when I had a look at it. But yeah, also, you know, another instance of where we're seeing Chinese models really come to the fore in open source, even hosted by American, or I should say Western, pseudo-American companies like Hugging Face. Yeah, so another kind of national security thing to think about as you run them as agents. Increasingly, what behaviors are baked in, what backdoors are baked in, what might they do if given access to more of your computer, your infrastructure. So either way, interesting release. I think Hugging Face is going to start to own a lot more of the risk that comes with the stack too as you move into agentic models and, yeah, we'll see how that plays out.
Jeremy Harris
And moving on to projects and open source, we begin with Stability AI, one of the big names in releasing models, and their latest one is Stable Audio Open Small. So this is a text-to-audio model developed in collaboration with Arm and apparently is able to run on smartphones and tablets. It has 341 million parameters and can produce up to 11 seconds of audio on a smartphone in less than eight seconds. It does have some limitations. It only understands English prompts, and it does not generate realistic vocals or high quality songs. It's also licensed somewhat restrictively. It is free for researchers and hobbyists and businesses with not that much annual revenue, as with, I think, Stability AI's other recent releases. So yeah, I think an interesting sign of where we are, where you can release a really state-of-the-art model to run on a mobile device. And apparently this is even optimized to run on Arm CPUs, which is interesting. But other than that, I don't know that there are many applications I can think of where you would want text-to-audio on your phone.
Andrei Karpathy
Yeah, I mean I think potentially they're viewing this as a beachhead R and D wise to keep pushing in this direction. Having a model on the phone that actually works, that gives decent results. Yeah, it can be pretty important because when you're talking verbally, Right. You want to minimize latency and so preventing the model from having to ping some server and then ping back. That's useful. Also useful for things like translation. Right. Where you might have your phone. I don't know, in some foreign country you don't have Internet access. Another useful use case, but. But they're definitely not there yet. Right. Like this is a very much. It reads like a toy more than a serious product. I. I'm not too sure who would be using this outside of some pretty niche use cases. They describe some of the limitations. So it can't generate good lyrics. Like it's that they just tell you pretty much flat out like this is not something I'll be able to do like realistically good vocals or high quality songs. It's for things like drum beats. It's for things like kind of little noises that I guess you might want to use almost to me it sounded like things you might want to use when you're doing like video editing or audio editing, like these sorts of things which I don't know how often that's done on the phone. I may be missing, by the way, a giant use case. This is one of the, the virtues of AI is like, you know, we're touching the entire economy of sound on the phone and that I don't know. But to first order, it doesn't seem, yeah, super clear to me what the big use cases are. But again, could just be a beachhead into a use case that they see as really significant down the line. And certainly audio generation locally on a phone sounds like it could be quite useful down the line.
Jeremy Harris
Next up we have an open AI image generator that is trained entirely on licensed data. They're calling this F Lite. This is made by Freepik in collaboration with AI startup Fal.ai, and it is a relatively strong model. It has 10 billion parameters and was trained for over two months on 80 million images. So even though they're not claiming it to be competitive with state-of-the-art stuff from Midjourney and others or Flux, they are saying that this is fully openly available and fully trained on licensed data. Unlike things like Flux, which presumably are trained on copyrighted data, which is still very much an ongoing legal question. We've seen Adobe previously emphasize being trained on licensed data. So this now makes it so there is a powerful open source model that is not infringing on copyright.
Andrei Karpathy
To be honest, I'd never heard of Freepik before. Right. They're apparently a Spanish company. So again, I think this is the first Spanish company I've heard about in this context, in kind of AI in general, for a long time. I'm actually curious if people can think of others that I might be missing here. But so kind of interesting. First points on the board for Spain. Apparently this is, yeah, a 10 billion parameter model trained on 64 H100 GPUs over the course of two months. So you know, I mean, it's a baby workload. But by open source standards, pretty decent, and certainly, I mean, you know, they show all the usual images you might expect, like a really impressive HD face of a woman and, anyway, a bunch of more artsy stuff. So yeah, pretty cool. I continue to wonder where the ROI argument is for these kinds of startups that just do open source image generation. Seems to me like a pretty saturated market. Seems to me kind of like they're lighting VC dollars on fire. But what do I know, we'll see if they survive, we'll see how many actually survive in this space going forward. But definitely an impressive product and again, good for Spain, points on the board here.
Jeremy Harris
Yeah, this sort of like takes you back to Stability AI, and I think Flux also released their own model. It's like, oh, you're releasing really good models for free.
Andrei Karpathy
Yeah, like how.
Jeremy Harris
Yeah, it's a funny place with AI where it has become kind of a norm, and I think probably partially just a case of bragging rights and fundraising brownie points, but I think notable in this case particularly because of the licensed data aspect of it.
Andrei Karpathy
I find anytime I try to explain it, it ends up sounding just like a pyramid scheme. It's like, yeah, they make a great model using the initial seed round so they can convince the Series A investors to give them more money to make an impressive model. At some point there's a pot of gold at the end. Don't worry about it. Like, I don't know. But hey, it's a proving ground if nothing else for great AI teams. I think the biggest winners in this in the long run are probably the OpenAIs, the Googles of the world, who can come in and just acqui-hire these teams once they've run out of money and can't raise another round. And then these are sort of battle-hardened teams with more engineering experience. So you know, economically there's value there for sure. It's a question of whether that value justifies the fundraising dollars.
Jeremy Harris
Couple more models to talk about. Next up, AM-Thinking-v1 is a new reasoning model that they claim exceeds all other ones at the scale of 32 billion parameters. So this group of people is apparently the AM team, which is an internal team at Beike, again, a company I had not been aware of, and they're dedicated to exploring AGI technology. What this group did was take the base Qwen2.5-32B model and publicly available queries and then create their own post-training pipeline to do the thing we saw DeepSeek R1 do. Basically take a big, good base model, do some supervised training and some reinforcement learning to get it to be a very powerful reasoning or thinking model. They released a paper that went into the details of what they did. It seems like, as we've seen in other cases, the data curation aspect of it and the really nitty gritty of how you're doing the post-training matters a lot. And so with that they have, as you would expect, a table where they show that they are significantly outperforming DeepSeek R1 and are at least competitive with other reasoning models at this scale. Although not quite as good as the ones that are at hundreds of billions of parameters.
Andrei Karpathy
Yeah, so some caveats on this. So the model doesn't have support for structured function calling or tool use, oh, and also multimodal inputs, which are increasingly becoming a thing as people start to use agents for computer use. So whenever you see an open source model like this, I'm always interested to see when we are going to see open source bridge the gap to, hey, this thing is made for computer use. It's made to be multimodal natively and kind of take in video and use tools and all that. So this is not that, but it is a very impressive reasoning model, a very serious entry in the growing catalog of Chinese companies that are building impressive things here. A couple things. First of all, these papers are all starting to look very similar, right? We have, I think it's fair to say at this point, a strong validation of the DeepSeek R1 path, which is you do pre-training with, anyway, a staged pre-training process, with increasingly high quality data towards the end of pre-training. Then you run your supervised fine-tuning. In this case, they used almost 3 million samples across, anyway, a bunch of different categories that had a kind of think-then-answer pattern to them. So you do that, you supervised fine-tune, and then you do a reinforcement learning step to enable the sort of test-time compute element of this. So again, we see this happen over and over again. We saw it here, we saw it with Qwen 3, we saw it with DeepSeek R1. We're going to keep seeing it. A lot of the same ingredients, using GRPO as the training algorithm for RL. That's here again. Another thing, and I think this was common to Qwen 3 as well, it's certainly becoming a thing, is more and more focus on kind of intermediate difficulty problems. So making sure that when you're doing your reinforcement learning stage, you're not giving the model too many problems that are so hard that it's kind of pointless for it to even try to learn from them, or so easy that they're already saturated.
So this is one of the things that you're seeing in the pipeline, a stage where you're doing a bunch of rollouts and seeing what fraction of those rollouts succeed. And if the fraction is too low or too high, you basically just scrap that problem and don't use it as training data. You only keep the ones that have some intermediate pass rate, 50%, 70%, something like that. So this is being used here as well. There's a whole bunch of stuff too about the actual optimization techniques that they use to overlap communication and computation. The challenge with this, and we talked about this in the context of INTELLECT-2, that paper that I guess we covered two weeks ago, is that you've got this weird problem with the reinforcement learning stage. In the usual case where you would pre-train a model, you would feed it an input, get an output, and you'd immediately be able to do your backpropagation because you would know if the output was good or not. With the reinforcement learning stuff, you actually have to have the model generate an entire rollout, score it, and only then can you do any kind of backpropagation or weight updates. And the problem with that is that your rollouts take a long time, and so you have to find ways to hide that time and overlap it with communication or, anyway, do different things. And so that's a big part of what they're after here in this paper. Last thing I'll mention is this company, which again, not going to lie, I had never heard of Beike before. I can't explain this, don't ask me to explain this, but the description on their website is that they work together with China's top tier developers. They're basically like a property company. Connected over 200 brokerage brands, hundreds of thousands of service providers across a hundred cities nationwide, providing both buyers and sellers of existing housing services including consultancy, entrustment, property showing, facilitating loans. What the. 
Like, I don't know. I don't know. Do you want to invest, do you want to invest in these guys? I guess you do because they make really good models now apparently.
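The pass-rate filter described in this turn can be sketched in a few lines of Python. This is a minimal illustration, not the AM team's actual pipeline: the function names, the `attempt` callback (standing in for "sample one rollout from the model and run the verifier on it"), the rollout count and the 0.2 to 0.8 band are all assumptions made for the example.

```python
from typing import Callable, List


def pass_rate(problem: str, attempt: Callable[[str], bool], n_rollouts: int = 16) -> float:
    """Fraction of n_rollouts sampled attempts that the verifier marks correct.

    `attempt` is a hypothetical stand-in for one model rollout plus scoring.
    """
    successes = sum(1 for _ in range(n_rollouts) if attempt(problem))
    return successes / n_rollouts


def filter_by_difficulty(
    problems: List[str],
    attempt: Callable[[str], bool],
    low: float = 0.2,    # assumed lower bound: below this the problem is "too hard"
    high: float = 0.8,   # assumed upper bound: above this the problem is "saturated"
    n_rollouts: int = 16,
) -> List[str]:
    """Keep only problems whose measured pass rate falls inside [low, high]."""
    kept = []
    for problem in problems:
        rate = pass_rate(problem, attempt, n_rollouts)
        if low <= rate <= high:
            kept.append(problem)
    return kept


if __name__ == "__main__":
    # Toy verifier: a fixed, made-up success pattern per problem name.
    patterns = {"too_easy": [True], "too_hard": [False], "useful": [True, False]}
    counters = {name: 0 for name in patterns}

    def toy_attempt(problem: str) -> bool:
        pattern = patterns[problem]
        result = pattern[counters[problem] % len(pattern)]
        counters[problem] += 1
        return result

    print(filter_by_difficulty(list(patterns), toy_attempt))
```

In a real setup the expensive part is the rollouts themselves, which is why the turn above goes on to talk about overlapping rollout generation with communication and computation.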
Jeremy Harris
Yeah, this real estate company is invested in going after AGI.
Andrei Karpathy
Well, they seem like they're one of these Chinese everything companies as well, because they also have like a million different websites. That was, I guess, their housing website. They also describe themselves on another one as the leading integrated online and offline platform for housing transactions and services. So maybe they're what, more of a, like, Stripe for housing. I don't know. Somehow some executive at Beike said one day, we got to get in the AI game, and apparently recruited some good talent. I'm so confused right now. But yeah, there it is, I think.
Jeremy Harris
Yeah. Also indicative probably of the impact of DeepSeek R1 on the Chinese landscape, where they made a huge splash. Right? Like to the effect of actually affecting the stock market in the US. I would not be surprised if there are new players in China focusing on reasoning just as a result of that.
Andrei Karpathy
It is weird that they're coming from like a property company or something. Like I mean I, I understand.
Jeremy Harris
Yeah, this is a weird one for sure.
Andrei Karpathy
Like, I get DeepSeek, you know what I mean? Like, okay, so they come from High-Flyer, this, you know, hedge fund, and a million hedge fund companies like Medallion or RenTech, they do AI, right? That's what they do. This is just like, what are you doing, guys? Apparently they're doing really well. It's a good model. Don't know what to say.
Jeremy Harris
And yeah, fully open source. So that's nice to have. And the last open source model we cover is BLIP3-o, a family of fully open unified multimodal models, with architecture, training and data sets. So we've covered BLIP-3 before. That was a multimodal model in the sense of taking both images and text as input and outputting text. That used to be what multimodal meant. With BLIP3-o, they're moving to, I suppose, the frontier of multimodal, where both with ChatGPT and with Gemini we saw recently the models being able to output images in addition to taking them as input, so that now we have a unified multimodal model. It can take in multiple modalities, it can output multiple modalities. I will say it is not necessarily just one big transformer, as is typically the case for multimodal things with multiple inputs. But anyway, that's the core idea, and in the paper they go into a lot of detail on how to train such models. They train a model on 60,000 data points for this instruction tuning to make sure that it is able to generate high quality images, release the 4 billion parameter model that is trained on only open source data, and also have an 8 billion parameter model trained with proprietary data.
Andrei Karpathy
I mean it's, it's what I would expect. Things are going to have like, I think the multimodality trend and the agentic trend sort of converge again, as I mentioned on computer use. So I see these two things being different ways of getting at the same thing. The two things being this paper and the one we just talked about, it does seem like a pretty impressive model. One of the things that they did work on a lot was figuring out the architecture. They found that using Clip image features gives just more efficient representation than the VAE features, the variational auto encoder features that often are used in this type of context. Clip being the contrastive training approach that OpenAI used for. Well, for Clip, there's a whole bunch of work that they did around training objectives as well, comparing different objective functions that they might use to optimize for this sort of thing. Anyway, it's cool. I think it's an early shot at high degrees of multimodality from these guys and I would expect that we'll get something like a more coherent. The same way that we've coalesced around a stack for the agent side. I think this is an early push into the kind of very, very wide aperture unified multimodal framework. We've seen a lot of different attempts at this and it's still unclear what strategy is going to end up working, so. So it's hard to know where to invest our own marginal research time as we look at these papers and figure out like okay, well, which of these things is really going to take off? But for now, given its size, this actually does seem pretty promising.
Jeremy Harris
Yeah, and I would imagine certainly probably the best model of its kind that you can get in open source to be able to generate images. We've seen that models like Gemini, like OpenAI's, that integrate the transformer with the image generation have some very favorable properties and seem like they actually are better at very nuanced instruction following. So there's still room to improve in the image space, although these are of course not quite as good. And as with the previous releases from the BLIP team, which includes Salesforce and University of Washington and other universities, super, super open source, the most open source you can get here. Code, models, pre-training data, instruction tuning data. All of it is available when you.
Andrei Karpathy
Need to catch your breath while listing all the different ways in which it's open source. That's the bar, that's how you know fully open source.
Jeremy Harris
And now moving on to research and advancements. We begin with DeepMind, and they have released a new paper and blog post and media blitz with AlphaEvolve, a coding agent for scientific and algorithmic discovery. That's the name of the paper. The blog post, I think, somewhat amusingly, is AlphaEvolve, a Gemini-powered coding agent for designing advanced algorithms.
Andrei Karpathy
But there'd be no confusion.
Jeremy Harris
Yeah. And so as per the title, the idea here is to be able to design advanced algorithms, to get some code that solves a particular problem well. This is in some ways a sequel to something they did last year called FunSearch. We covered it maybe in the middle of the year, I forget exactly when. And this is basically taking it up a notch. So instead of just evolving a single function, it can write an entire file of code, it can evolve hundreds of lines of code in any language, and it is scaled up to a very large scale in terms of compute and evaluation. So the way this looks in terms of what it does is a scientist or engineer sets up a problem. Basically they give it a prompt template, some sort of configuration, choose the LLMs, provide evaluation code to be able to see how good a solution is, and then also provide an initial program with components to evolve. And then AlphaEvolve goes out and produces many possible programs, evaluates them and winds up with the best program. And similarly to what we saw with FunSearch, where at the time they said that they achieved some sort of small improvement in a pretty basic operation of matrix multiplication, although at the time this was a little nuanced, not entirely right, well, with AlphaEvolve they are showing for various applications like autocorrelation and uncertainty inequalities, packing and minimum/maximum distance problems, various math things that clearly I'm not an expert in, somewhat improved outcomes. And just, yeah, this is really the latest of the DeepMind style of paper, where they are like, let us build some sort of Alpha model to tackle some sort of science or in this case computer science thing and get some cool results.
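The loop described here (start from an initial program, propose many variants, score each with user-supplied evaluation code, keep the best) has the shape of a simple evolutionary search. The sketch below is only an illustration of that shape, not DeepMind's implementation: `evolve`, `propose` and `evaluate` are hypothetical names, the "program" is a toy list of numbers, and the random-noise `propose` stands in for what would be an LLM rewriting code in the real system.

```python
import random
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # a candidate "program"; here just a toy list of floats


def evolve(
    initial: C,
    propose: Callable[[C, random.Random], C],  # stand-in for an LLM proposing an edit
    evaluate: Callable[[C], float],            # user-supplied scoring code (higher is better)
    generations: int = 50,
    population_size: int = 8,
    children_per_generation: int = 16,
    seed: int = 0,
) -> Tuple[float, C]:
    """Return (best_score, best_candidate) found by a basic evolutionary loop."""
    rng = random.Random(seed)
    population: List[Tuple[float, C]] = [(evaluate(initial), initial)]
    for _ in range(generations):
        for _ in range(children_per_generation):
            _, parent = rng.choice(population)     # pick a surviving candidate
            child = propose(parent, rng)           # mutate it
            population.append((evaluate(child), child))
        # Keep only the top scorers; the current best always survives.
        population.sort(key=lambda pair: pair[0], reverse=True)
        population = population[:population_size]
    return population[0]


# Toy problem: find a vector close to [3, 3, 3].
def toy_evaluate(candidate: List[float]) -> float:
    return -sum((x - 3.0) ** 2 for x in candidate)


def toy_propose(candidate: List[float], rng: random.Random) -> List[float]:
    return [x + rng.gauss(0.0, 0.3) for x in candidate]


if __name__ == "__main__":
    best_score, best = evolve([0.0, 0.0, 0.0], toy_propose, toy_evaluate)
    print(best_score, best)
```

Because the best candidate is never discarded, the best score can only stay the same or improve from one generation to the next, which is the property that makes an automatic evaluator so central to this kind of setup.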
Andrei Karpathy
Yeah, I think that's how they describe it internally. Like, we're going to do some kind of Alpha something and then we're going to. But that's actually, I mean, it's accurate. One of the ways I used to think about it, and I think I still do, is through the lens of inductive priors. Right. So OpenAI has this, they're super scale-pilled, right? Just like, take this thing and scale the crap out of it, and more or less all your R and D budget is going into figuring out ways to get out of your own way and let the thing scale. Whereas Google DeepMind tends to come at things from a perspective like, well, let's almost replicate the brain in a way, in different chunks. So we're going to have a clear chunk, like an agent that's got this very explicitly specified architecture. We're not just going to let the model learn the whole thing, we're going to tell it how the different pieces should communicate. And you can see that reflected here in the kind of pool of functions that it reaches into and grabs, the evolutionary strategy, and how that's all connected to the language modeling piece. They also have an element to this where they're using Gemini Flash, you know, the super fast model, and Gemini Pro, their more, I guess, powerful but slower model, for different things. So with Gemini Flash they use it to generate like a whole smorgasbord of different ideas cheaply. And they use Gemini Pro to do kind of the depth and the deep insight work. All those choices, right, sort of involve humans imposing their thinking of how a system like this ought to work. And what you end up finding with these systems is they'll often outperform what you can do with just like a base model or an agentic model without a scaffold. But eventually the base models and agentic models just kind of end up catching up to and subsuming those capabilities. 
So this is a way that DeepMind does tend to kind of reach beyond the immediate, the ostensible frontier of what just base models and agentic models can do and achieve truly amazing things. I mean you know, they've done all sorts of stuff with like density functional theory and controlling fusion reactions and predicting weather patterns by following this exact approach. So really cool. And it's consistent as well with Isomorphic labs and all the biotech stuff that they're doing. So it's a really impressive, really impressive paper. You can see why they're pushing in this direction too. Right, for automating the R and D loop, if you can get there first, you can trigger the sort of intelligence explosion or at least it starts in your lab first and then you win. This is a good reason to try that strategy of reaching ahead, even if it's with bespoke approaches that use a lot of inductive priors and don't necessarily scale as automatically as some of the kind of OpenAI strategies might.
Jeremy Harris
Yeah, I find that, looking at the paper, interestingly, they don't go super in depth, as far as I can tell, on the actual evolutionary process in terms of what they're doing. It seems like they pretty much are saying, we took what we had in FunSearch, which was an LLM-guided evolution to discover stuff, and we expanded it to do more, to be more scaled up, et cetera, et cetera. So it's them, as you said, taking something and pushing it more and more to the frontier. They did this also with protein folding, with chess, with any number of things. And now they are claiming some pretty significant advancements on theoretical and existing problems. Also on practical things: they say that they found ways internally to speed up the training of Gemini by 1% by finding a way to speed up a kernel used in Gemini. They also found ways to assist with TPUs and scheduling. Anyway, these kinds of actually useful things for Google in the real world. And next up we have Absolute Zero: Reinforced Self-play Reasoning with Zero Data. So for reasoning models, as we've covered with DeepSeek R1, the standard paradigm these days is to do some supervised learning, where you collect some high quality examples of the sort of reasoning that you want to get, and then do reinforcement learning with an oracle verifier. So you do reinforcement learning where you're solving coding and math problems, and you are able to evaluate very exactly what you are outputting. So here they are still using a code executor environment to validate task integrity and provide feedback. But they're also going more in the direction of self-evolution through self-play, another direction which DeepMind and OpenAI also pushed in the past, where you don't need to collect any training data, you can just let the LLMs gradually self-improve over time.
Andrei Karpathy
Yeah. And the way they do that is kind of interesting. So there was a paper, I'm trying to remember what the name of the model was that did this. And for some reason, I may be wrong, I have a memory that it was maybe DeepSeek, sorry, the lab, not the model. But essentially, this is a strategy where they're going to say, okay, when it comes to a coding task, we have three elements that play into that task. We have the input, we have the function, that is, the program, and we have the output. Right? So those three pieces. And they sort of recognize that actually there are three tasks that we could imagine getting a model to do based on those things. We could imagine showing it the input and the program and asking it to predict the output. So that is called deduction, right? You're giving it a program and an input, predict the output. You could give it the program and the output and ask it to infer the input. And that's called abduction. There's going to be a quiz later on these names. And then, if you give it input-output pairs, it has to figure out what was the program that connected these. Right. And that's called induction. And actually all the names kind of make sense if you think about them enough. But that's basically the idea, right? Just take the input, the program, and the output, black out one of them, reveal the other two, and see if you can train a model to predict the missing thing. In a sense, this is, at a high level of abstraction, almost a kind of autoregressive training in a weird way. But the bottom line is they use one unified model that's going to propose and solve problems, and they're going to set up a reward for the problem proposer, which is essentially generating a program given input and output. And for that it's your standard.
Like if you solve the problem, if you propose a correct problem, that or a program rather that compiles and everything's good, you get a reward. If not, you don't. Anyway, they do a bunch of Monte Carlo rollouts, in this case 8, just to normalize and regularize. But yeah, bottom line is, you see again, another theme that pops up in this paper is this idea of difficulty control. In this case, the system has a lot of validation steps that implicitly control for difficulties. They're not going to explicitly say, hey, let's only keep the difficulty, you know, the mid range difficulty problems by some score. You actually end up picking that up implicitly because of a couple conditions that they impose. The first is that the programs that are proposed, the code for those programs has to execute without errors. So automatically that means you have to be at least able to generate that code and it has to be coherent. There's a determinism check too. So the programs have to produce consistent outputs. If you run the program multiple times, you got to get the same output again. You know, this requires a certain level of mastery and then there's some safety filtering so they forbid the use of harmful packages. And basically if your program generation part of your stack here is able to do this successfully, then probably it's being forced to perform at least at some minimal level. So the task is not going to be trivial at least. And only tasks that pass all those validations contribute to the learning process. So you get a kind of baseline quality of the programs that are generated here. It's a really interesting paper. It's something that raises a lot of questions about the data wall, right? This is something that people have talked a lot about, is like there's only so much data you can fine tune on so many examples of solved problems, solved coding problems. 
If you have this closed loop, though, that's able to automatically generate new problems, new deduction, abduction, and induction problems, and then close a loop where one feeds into the next as they have here, then you really don't have a data wall. And they have some scaling curves, admittedly not that far out in scaling space, in sample space, but still scaling curves that show that, yeah, this does seem to keep going, at least as far as they've tested. If that holds, essentially what they're doing is trading data for compute. If your model is good enough to start this feedback loop, then just by pouring more compute into the model to pitch new problems that it can then solve, you can start this feedback loop where really, I mean, there's no data wall that at least would seem to apply for the kind of coding problems that they're training on here.
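To make the three task types and the validation checks concrete, here is a minimal sketch under stated assumptions. It is not the paper's code: the function names (`run_program`, `validate`, `make_tasks`, `check`) are hypothetical, the convention that every program defines a function named `f` is an assumption for the example, and the safety filter on forbidden packages is left out. A real system would also execute proposed code in a sandbox rather than with a bare `exec`.

```python
def run_program(source, argument):
    """Execute program source that defines f(x) and return f(argument)."""
    namespace = {}
    exec(source, namespace)  # toy only; untrusted code needs a sandbox
    return namespace["f"](argument)


def validate(source, argument, runs=2):
    """Return the output if the program runs without errors and is deterministic.

    Returns None otherwise. (A program that legitimately returns None is
    therefore treated as invalid in this simplified sketch.)
    """
    try:
        outputs = [run_program(source, argument) for _ in range(runs)]
    except Exception:
        return None  # execution check failed
    if any(out != outputs[0] for out in outputs):
        return None  # determinism check failed
    return outputs[0]


def make_tasks(source, argument):
    """From one valid (program, input, output) triplet, hide each element in turn."""
    output = validate(source, argument)
    if output is None:
        return []  # invalid proposals contribute nothing to training
    return [
        {"type": "deduction", "given": {"program": source, "input": argument}, "target": output},
        {"type": "abduction", "given": {"program": source, "output": output}, "target": argument},
        {"type": "induction", "given": {"pairs": [(argument, output)]}, "target": source},
    ]


def check(task, answer):
    """Verify an answer by execution rather than by comparing against a human label."""
    given = task["given"]
    if task["type"] == "deduction":
        return answer == task["target"]
    if task["type"] == "abduction":
        # Assumption: any input that reproduces the output counts as correct.
        return validate(given["program"], answer) == given["output"]
    # Induction: the proposed program must reproduce every input-output pair.
    return all(validate(answer, x) == y for x, y in given["pairs"])


tasks = make_tasks("def f(x):\n    return x * 2 + 1\n", 3)
print([task["type"] for task in tasks])
```

The point of the design is that every reward signal comes from running code, so no externally collected problem set is needed to generate or grade tasks.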
Jeremy Harris
Right. And just to note the particular detail, they do actually look into not having the verifiable rewards or the supervised learning. So Absolute Zero is absolute zero because there's no supervised learning or verifiable rewards. Although they are, I think, still executing the code in a computing environment, if I understand correctly. So they can have some feedback from the environment, but not an actual kind of verification that you got the problem correct. So as a result they have to think through all of these other techniques to be able to have the model evaluate itself, like deduction, abduction, induction, as you said, and that allows them to train. They compare to some efforts I haven't actually been aware of. There have been, you know, more and more open source efforts, as we've seen. Apparently there's an Open-Reasoner-Zero, there's also SimpleRL-Zoo, various things over the last couple of months looking into the RL part of reasoning. And so this is just the latest, and I think it's pushing in a direction of not requiring verifiable rewards, which is to some extent a limitation of the DeepSeek R1 formula. Next up we have another report from Epoch AI. So not a research paper, but an analysis of trends and kind of a prediction of where we might be going. This one is focusing on how far reasoning models can scale. So the basic question here is, can we look at the training compute that's being used for reasoning models, things like DeepSeek R1 and Grok 3, and from that infer the scaling characteristics and to what extent reasoning will keep growing? So their prediction is that we have a pretty short period in which you have very rapid growth, going from DeepSeek R1 to Grok 3. They don't know exactly the training compute for o3 versus o1, but I think they are predicting here that o3 was trained with quite a bit more. And so their prediction is that the training compute being used will start flattening out a bit, growing slower compared to base models of the past.
But they are still saying that the scale of large training runs will keep going in the next couple of years and presumably the reasoning models will continue improving as a result.
Andrei Karpathy
Yeah. We talked about this quite a bit, actually, when DeepSeek R1 came out, and even before that, when o1 came out. Just the idea that you have this new paradigm now that requires a fundamentally different approach to compute. Right. Well, we just talked about it. Instead of just generating an output, automatically being able to score that really quickly, and then doing backpropagation and updating your model weights, what you now have to do is take your base model and generate an entire rollout, and that takes a lot of time and has to be done on inference-optimized hardware. And those rollouts then have to be evaluated, and the evaluations have to check out, and then you use those to update your model weights. And so that whole extra step actually requires a different compute stack. And so if you look at what the labs are doing right now, they've gotten really, really good at scaling pre-training compute. Right. Just this autoregressive pre-training where you're training a giant text autocomplete system. People know how to build multi-billion dollar, tens of billions of dollars scale pre-training compute clusters for that. But what we haven't yet seen is aggressive scaling of the reinforcement learning stage of training. And this is not going to be a small thing. So it's estimated that, if you look at the cost of pre-training DeepSeek V3, the model that R1 was based on, about 20% of that cost went into the RL compute for R1. That's not trivial. And we keep seeing in these compute scaling curves for inference-time scaling that you really do want to scale it along with your pre-training compute budget. Right. So you're going to get to a point where, well, right now we're ramping up the orders of magnitude like crazy on the inference side.
That, though, is going to saturate very quickly. I mean, we saw a 10x leap from o1 to o3 in terms of the compute used for the reinforcement learning stage. As you said, you can only do that so many times until you hit essentially the ceiling of what current hardware can allow. Once that happens, you're bottlenecked by how fast you can grow your algorithmic efficiency and your hardware scaling. And essentially that looks the same as pre-training scaling growth, which is about 4x per year. So you should expect a rapid increase. o4 is going to be really, really good. o5 is going to be really, really good. But pretty quickly, it's not that things are going to slow down like crazy, but they'll scale more like the pre-training scaling curves that we've seen. This has big consequences for US-China competition, for example, because right now it's creating the illusion that China is better off than they necessarily are. In the early days of this paradigm, when people haven't figured out how to take advantage of giant inference clusters, the US, which has larger clusters available than China, isn't yet able to use the full scale of its clusters. And so we're getting sort of a hobbled United States, an artificially hobbled United States, relative to China on a compute basis. There are all kinds of reasons why that's actually a more complicated picture, but I thought that was really interesting. Another data point that they flagged here that I was not tracking at all was that there are these other reasoning models that have been trained, that have come out fairly recently, like Phi-4 Reasoning or Llama-Nemotron Ultra, and these have really small reinforcement learning compute budgets. Like, we're talking less than 1%, in some cases much less than 1%, of the pre-training compute budget.
And so it really seems like R1 is this case of an unusually high investment in RL compute relative to pre-training, and that a lot of the models that are being trained in the West, the reasoning models, have very high pre-training budgets and relatively tiny reinforcement learning budgets. I thought that was super interesting, and something tells me that the DeepSeek R1 strategy is actually more likely to be persistent in the long run. I suspect you're going to see more and more flowing into the RL part of the training stack. But anyway, super important questions being raised here. Interesting little write-up from Epoch AI, which we do love to cover, right?
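The shape of that argument (RL compute grows in roughly 10x jumps until it hits a ceiling, after which it can only grow as fast as the frontier, about 4x per year) can be illustrated with a toy calculation. Everything beyond those two growth rates is an assumption made for illustration: the starting RL budget of 1% of pre-training, one jump per year, and using the pre-training budget itself as the ceiling are not figures from the Epoch AI report.

```python
def rl_compute_trajectory(rl_fraction, years, jump=10.0, frontier_growth=4.0):
    """Toy model of RL compute catching up to the frontier compute budget.

    Compute is measured in units of the initial pre-training budget.
    Each year the frontier budget grows by `frontier_growth`, and RL compute
    grows by `jump` unless that would exceed the frontier, which acts as a cap.
    """
    frontier = 1.0
    rl = rl_fraction
    trajectory = []
    for _ in range(years):
        frontier *= frontier_growth
        rl = min(rl * jump, frontier)  # cannot spend more than the hardware allows
        trajectory.append((rl, frontier))
    return trajectory


for year, (rl, frontier) in enumerate(rl_compute_trajectory(0.01, 8), start=1):
    print(year, rl, frontier, "capped" if rl == frontier else "catching up")
```

Under these assumptions the fast phase lasts only a handful of steps, after which RL compute and the frontier grow at the same rate, which is the qualitative point being made rather than a forecast.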
Jeremy Harris
Exactly. And to that point, we've seen kind of a mix of results. It's still not a very clear picture. We've seen that you can really get rid of RL, and with a very well curated data set for supervised fine-tuning you can at least make most of the progress towards reasoning and unlock the hidden capabilities of a base model. As they say, RL is not necessarily adding new capabilities, just sort of shaping the model towards using them well. Worth noting also that RL is very different in terms of training from autoregressive unsupervised learning, or self-supervised learning, I guess, was the term for a while, in the sense that RL requires rollouts, it requires verification. It just isn't as straightforward to scale as pre-training or post-training. So that's another aspect to consider. But yeah, it's very much still an ongoing research problem, as we've seen with all these papers we keep talking about, with all these different types of results and different recipes. I'm sure we'll likely, you know, over time converge to what has been the case in pre-training and post-training. People, I think, have discovered more or less the recipe there, and I'm sure that will increasingly be the case also with reasoning. And on to the last paper, this one coming from OpenAI. So, you know, props. I think I have sometimes said that OpenAI doesn't publish research anymore, and that's not exactly true. And this one is HealthBench: Evaluating Large Language Models Towards Improved Human Health. So it's an open source benchmark designed to evaluate LLMs on healthcare, focusing on meaningful, trustworthy, and unsaturated metrics. This was developed with input from 262 physicians across 60 countries. It includes 5,000 realistic health conversations to test LLMs' ability to respond to user messages, and it has a large rubric evaluation system with a ton of unique criteria, as you might expect.
You know, this is an area where you really want to evaluate very carefully and be sure that your model is trustworthy, is reliable, is even allowed or should be allowed to talk about health and questions regarding health. And so they open source the data set, they open source the eval code so that people can work on AI for healthcare.
Andrei Karpathy
Yeah, and I mean to your point about OpenAI not publishing research anymore, I think you are fundamentally correct. I mean they don't publish anything about how they build their models.
Jeremy Harris
Algorithmic. Yeah, algorithmic discoveries, let's say mostly, sometimes with image generation they've done a little bit, but yeah, mostly now.
Andrei Karpathy
Yeah, and here and there for alignment, but it's murky and unclear. And then when you have something that makes for a great PR play, like, hey, we have done this healthcare thing, please don't regulate us, pretty please, we're doing good things for the world, then all of a sudden you get all this wonderful transparency. But I will say, credit where credit is due. This is a huge-scale, significant investment, seemingly, that OpenAI has had to put into putting this together. So 5,000, as you said, multi-turn conversations between users and AI models about healthcare. What they did is they got about 300 doctors to look at these conversations and propose bespoke criteria. So, you know, specific criteria based on which they would judge the effectiveness of the AI agent, or the AI chatbot, in that conversation. And so to give you an example, you have a parent who's concerned about their baby, who hasn't been acting like herself since yesterday. The rubric that the doctors came up with, aggregated from a bunch of different doctors looking at this exchange, is like, okay, well, does the chatbot state that the infant may have muscle weakness? If so, seven points. Does it list at least three common causes of muscle weakness in infants? If so, plus five points. Does it include advice to seek medical care right away? And so they give points, and it's a very detailed kind of looking-over-the-AI's-shoulder type of perspective for each of these 5,000 multi-turn conversations. Again, using hundreds and hundreds of doctors to do this. And there are some criteria that are shared across many of these exchanges, about 34 of what they call consensus criteria. These are things that come up again and again. But mostly they are example-specific. Like 80% of the criteria they use are literally just for one conversation or just for one exchange. So that's pretty remarkable. A really, really useful benchmark.
They use GPT-4.1 to evaluate whether each rubric criterion is met in a given conversation. So they're not actually getting the doctors to review the chatbots' responses. Obviously that doesn't scale. But what they do do is find a way to demonstrate that GPT-4.1 actually does a pretty decent job of standing in for the typical physician. The grades that they give are pretty comparable. And GPT-4.1, by the way, is the best model they identified for this; it does better than even o4-mini and o3 at that task. One of the things that really messes with my head on this, and we have to remember this anytime we look at a benchmark like this, is that we're tempted to ask, okay, so how well does the best AI do? How well does a doctor do? Right? That's the natural question. It is important to note that this is not how typical doctors would evaluate a patient. Right? Like, you would typically have visual access to them, you'd be able to touch, you'd be able to see the nonverbal cues and all that stuff. That being said, on this benchmark, models do outperform unassisted physicians. Unassisted physicians score 0.13 on average across all these evals. The top models on their own, 0.6. That's for o3. That is wild. That is a more than four times higher score than the unassisted physician. That honestly kind of blows my mind a little bit. Certainly these models can draw on much, much larger sources of data. And again, we've got to add all those caveats. You know, physicians don't normally write chatbot-style responses to health queries in the first place. But it's an interesting note, and we've seen some papers, we've talked about them here, where doctors actually can perform even worse when they work with an AI system than the AI system on its own, because the doctors are often second-guessing and don't, let's say, just have blind faith in the model. So pretty interesting.
One more caveat there is that there is a correlation, and we've seen this before, between response length and score on this benchmark. And that's a problem, because it means that effectively the chatbots can game the system a bit just by being very verbose. So surely that's influencing things a little bit. The effect, though, does not nearly account for the insane disparity between unassisted physicians and models, which again is a 4x lift. Like, that's pretty wild.
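The rubric mechanics described in this segment (physician-written criteria, each worth some number of points, with a grader deciding which ones a response meets) can be sketched as a small scoring function. This is an illustration under assumptions, not the benchmark's released eval code: normalizing by the total of positive points and clipping to the range 0 to 1 is an assumed convention here, the 10 points for the third criterion is a hypothetical value because the episode does not state it, and the set of met criteria stands in for the grader model's judgments.

```python
def rubric_score(criteria, met):
    """Score one response against a rubric.

    criteria: dict mapping criterion text to its point value
    met: set of criterion texts the grader judged to be satisfied
    """
    earned = sum(points for text, points in criteria.items() if text in met)
    possible = sum(points for points in criteria.values() if points > 0)
    if possible == 0:
        return 0.0
    # Assumed normalization: fraction of achievable points, clipped to [0, 1].
    return min(1.0, max(0.0, earned / possible))


criteria = {
    "states that the infant may have muscle weakness": 7,
    "lists at least three common causes of muscle weakness in infants": 5,
    "advises seeking medical care right away": 10,  # hypothetical point value
}
met = {
    "states that the infant may have muscle weakness",
    "advises seeking medical care right away",
}
print(rubric_score(criteria, met))  # 17 of 22 possible points
```

A scheme like this also shows why verbosity can help: a longer response has more chances to touch additional positively weighted criteria unless the rubric penalizes padding.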
Jeremy Harris
Yeah. Worth noting that there are multiple metrics here, including communication quality, with accuracy as its own metric. And they do actually evaluate the physicians with the models, and the combination there is on par. Maybe, you know, there are some of these things that they're better on. Accuracy seems to be about the same. Communication quality may be a bit different. But yeah, physicians with these tools will be much more effective than without them. That's pretty clear from the results. And they do have various caveats as to evaluation. Like you said, there's a lot of variability there and so on. Interesting to me also, in the conclusion they note that they included a canary string to make it easier to filter out the benchmark from training corpora, and they are also retaining a small private held-out set to be able to detect instances of accidental training on, or implicit overfitting to, the benchmark. So I think it's interesting that in this benchmark we're seeing what should probably be the standard practice for any benchmark release in this day, which is that you need to make it easy to filter out from your massive web-scraped training data, and probably also have a private eval set. And on to policy and safety. First up, we have the Trump administration in the US officially rescinding Biden's AI diffusion rule. So there was the Artificial Intelligence Diffusion Rule that was set to take effect on May 15. Introduced by Joe Biden in January, it aimed to limit the export of US-made AI chips to various countries and strengthen existing restrictions. And the Department of Commerce has announced that it will not enforce this Biden-era regulation. A replacement rule is expected that will presumably have a similar effect.
The rule, which I think we covered at the time, had three tiers of countries: tier 3, being China and Russia, with very strict controls; tier 2 countries, which have some export controls; and tier 1, which are friends that have no controls. So it seems that now the industry as a whole is going to have to wait for what the new rules will be.
Andrei Karpathy
Yeah, the philosophy here, and we have yet to hear the announcement from the Department of Commerce for what will replace this. But the philosophy seems to be that it'll be nation to nation bilateral negotiations for different chip controls which could make sense. I mean one of the big weaknesses of the diffusion framework that the Biden administration came out with and we talked about this at the time was they had this. This insane loophole where as long as any individual order of GPUs was for less than 1700 GPUs, literally zero controls applied. And the reason that's relevant is literally Huawei's entire M.O. has been to spin up new subsidiaries faster than the US can put them on their export control list, and then use those to kind of pull in more controlled hardware. And then obviously Huawei just pulls that together. And so putting in an exemption for a 1700 is a decent number of GPUs, too, by the way. So putting in an exemption for that number of GPUs is. I mean, you're kind of just asking for it. That is exactly the right shape for China to exploit. That matches exactly the strategy they have historically used to exploit US export control loopholes. So hopefully that's something that'll be addressed in this whole kind of next round of things. We don't yet know exactly what the shape will be, though we do have a sense and, well, this ties into our next story of what the approach will be with respect to certain Middle Eastern countries like Saudi Arabia, like the uae, which are now kind of top of mind as these sort of not neutral states, but the ones that aren't the US or China, let's say proxy fronts in this big AI war.
Jeremy Harris
Right, and that does take us to the next piece: Trump's Mideast visit opens floodgate of AI deals led by Nvidia. That's from Bloomberg. So the Trump administration has been meeting with two nations in particular, Saudi Arabia and the United Arab Emirates. And we do expect agreements to be unveiled soon. And the expectation is there will be eased restrictions, meaning that Nvidia, AMD, and others will be able to sell more, you know, get more out of the region. The stock market reacted very favorably. Nvidia went up 5% and AMD went up 4%. And there's been a variety of announcements, per the article title, of deals that seem like they'll start happening. So, for instance, Nvidia will be providing chips to Saudi Arabia's Humain, a company created to push the country's AI infrastructure efforts. Humain will get several hundred thousand of Nvidia's most advanced processors over the next few years. And there are other deals like that with AMD, Amazon, Cisco, and others. So the indication seems to be, you know, some restrictions will be eased. Restrictions were set in part because there were ties between some firms in these regions and China, in particular G42. So, yeah, it seems like it might be different from the Biden era.
Andrei Karpathy
Yeah, it's quite interesting. Right. There's a lot that the different players at the negotiating table here want. The Saudi deal is especially interesting because it points to a similar kind of deal to the one that America has started to shape over the last few months with the UAE: being more permissive in some ways, but also insisting that the UAE move away from their entanglements with China. You mentioned G42, right? And Huawei, having had some past ties. Well, the strategic situation, if you're Saudi Arabia, is you want to be positioned for a post-oil future. Right. That's the same for the UAE and the same for all the Gulf states, really. In Saudi Arabia, that's motivated this thing called Project Transcendence, which is a $100 billion initiative for tech in general, but specifically for AI. There's a big, big pool set aside for that. The UAE is in a similar position. They already have a national champion lab in G42, as well as the Institute for Technology or something.
Jeremy Harris
TII.
Andrei Karpathy
Yeah, yeah, yeah. The guys who made, who did the Falcon models. Yeah. Which we haven't heard much about since, by the way, which is kind of interesting. But right now the Saudis are behind the UAE and they're trying to make up ground. And so the UAE and the Saudis essentially are in some sense competing against each other to be America's partner of choice for large scale AI deployments in the Middle East. That's one dimension of this. They want to get their hands on as much AI hardware, as many GPUs as they can. This is one reason why Trump stacked them back to back. So he had first an announcement of the deal with the Saudis and then heading over to get a deal with the uae, putting pressure on each of them to kind of play off each other. Look, the Saudis have tons of energy. They are an energy economy. Same with the UAE. Just at the time when we're saturating the US's energy grid and that's the main kind of blocker on our deployments. And so you can see the temptation if you're OpenAI, if you're Microsoft, if you're Google, to just like say, well, why don't we set up a data center in the Middle east where we have an abundance of energy, plug into their grid and that'll be great for us. And well, there are a couple reasons why you, they might not want to do that. So historically, one was the Biden administration's export control scheme, where you just can't Move that many chips into a foreign country like that, just no good. But that's being scrapped, as we just talked about. So now the situation is, well, maybe we can, right, maybe we can negotiate country to country and set this up. But the United States is going to want to make sure that if they are setting up AI infrastructure in the uae, in Saudi Arabia, that the Saudis don't turn around and sell that to China. Right. China's super good at using third party countries. Historically that's been Malaysia, it's been Singapore. Right. And using those countries to bring in GPUs and subvert US export controls. 
So, you know, sure, you might have export controls on China proper, but you don't necessarily have them on Malaysia, on Singapore. And what a surprise, a massive influx of GPU orders into Malaysia, of all places, in the last few months. Hmm. Wonder where those are being redirected. Right. So this is something that the administration wants to make sure doesn't happen with these deals. There's a whole bunch of issues around Saudi entanglement. You said, you know, the UAE has a lot of ties to China. So do the Saudis, right? Huawei made Saudi Arabia a regional center for their cloud services. There's a big Saudi Public Investment Fund, the PIF, that's actually bankrolling this whole Project Transcendence thing. And the PIF has joint ventures with Alibaba Cloud. They've got a new tech investment firm that we covered a few episodes ago called Alat, which also has a joint venture with Dahua, which is an Entity Listed, basically a blacklisted, Chinese surveillance tech company, of all things. So there are a lot of entanglements there, and deep questions about how some of the Saudi Arabian GPU reserves are being used, potentially by Chinese academics and researchers as well. So while there's no hard evidence of the Saudis shipping GPUs specifically to China, you wouldn't necessarily expect there to be, and China's MO is absolutely to do stuff like this. And just a last note here on the negotiations. One really interesting thing that's been proposed is this idea of a data embassy. No one's ever proposed this before, but basically it's the idea that, look, if you want to be able to take advantage of huge sovereign reserves of energy in the UAE and Saudi Arabia, but you're concerned about the security implications, well, maybe you can set up a region of territory that, you know, just like how the US Embassy in Saudi Arabia is technically this tiny slice of sovereign American soil in Saudi Arabia.
Well, let's set up a tiny slice of sovereign American soil and put a data center on it. US laws will apply there. You're allowed to ship GPUs to it, no problem because it is sovereign US territory. So export control isn't an issue. In the same way, sure, you have Saudi energy feeding in, and that's a huge vulnerability. Sure, you're embedded in this matrix, but in principle, maybe you can get higher security guarantees from doing that. Lots of caveats around that. In practice, I won't go into them, but like, there are some real security issues around trying something like that that our team in particular has spent a lot of time thinking about. But this is basically the structure of these deals. A lot of kind of new ideas floating around. We'll see how they play out. But they definitely put the UAE and put Saudi Arabia right up there in terms of the players that might have large domestic stockpiles of chips.
Jeremy Harris
All right, so that's a couple policy stories. Let's have a couple safety stories to round things out. The next one is a paper, Scaling Laws for Scalable Oversight. So oversight is the idea that we may want to have weaker models verify that a thing that a stronger model is doing is actually safe and aligned and not bad. So you might imagine you have a superintelligent system and humans are not able to verify that what it's doing is okay. And you want to be able to have weaker AI models oversee stronger ones so that you can trust them. In this paper, they're looking into whether you can actually scale oversight. And by the way, it's called scalable oversight because you can scale it by using AI to actually verify things at the speed of AI and compute. And so what this paper focuses on is what they're presenting as nested scalable oversight, where basically you can do a sequence of models where you have weaker, stronger, weaker, stronger. And you can kind of go up a chain to be able to provide verifiable or trustworthy oversight and make things safe. So they introduce some theoretical concepts around that, some theoretical guarantees. They do some experiments on games like Mafia, Wargames, and backdoor games, and verify in that context that there are some success rates, and, yeah, present this general idea as another step in the overall research on scalable oversight.
Andrei Karpathy
Yeah, and I don't know if it was Paul Christiano, back when he was at OpenAI, who invented this whole area, but certainly the idea of doing scalable alignment by getting a weaker AI model to monitor a smarter, stronger AI model is something that he was really big on, and through debate in particular. So his whole thing was debate. That's one concrete use case that they examine here. So basically, have a weak model watch maybe two strong models debate over a particular issue, and the weak model is going to try to assess which of those models is telling the truth. The hope here is that if you can use approaches like this to determine with confidence that one of your stronger models is reliable, well, then you can take that stronger model and now use it to supervise the next level of strength, right, an even smarter model, and you can maybe start climbing the ladder that way. This paper is basically trying to quantify that. So the way they're going to try to quantify that is with Elo scores. These Elo scores tell you roughly how often a given model will beat another model. And I forget what the exact numbers are, but it's like if you have a model with an Elo score of 1000 and another model with an Elo score of 1200, then the model with the Elo score of 1200 will beat the model with an Elo score of 1000 like 70% of the time or whatever the number is. And so this is an attempt to kind of quantify what that climb might look like using Elo scores, using essentially scaling curves for these Elo scores, which is quite interesting. I think there are some pretty fundamental problems with this whole approach. I don't think that Max Tegmark, who is one of the lead authors of this thing, would actually disagree.
But there's a fundamental issue here, which is that when you think about climbing the intelligence ladder, new capabilities of concern, like deceptive alignment, in other words the ability of a model to pretend as if it's aligned when it actually isn't, can emerge pretty suddenly. You can have these sorts of emergent capabilities that pop up suddenly and violate these scaling curves. And the kinds of capabilities you worry about in the context of superintelligence, you might expect to arise quite quickly, where there's a sudden sort of cohesion of situational awareness, of capabilities around manipulation, persuasion, of capabilities around, you know, offensive cyber and things like that, that all kind of come together fairly quickly. And if that should happen, then you ought to expect these scaling laws to break down at precisely the stages where you most need them to work. Nevertheless, this is, I think, a really good quantification of some of the arguments that we've seen from people like Paul Christiano. IDA, I think, was the acronym, iterative debate and alignment or something like that. I forget. I actually looked into it really deeply like four years ago and now I can't sum it up. But yeah, I think if you're going to take it seriously, this is a good way to do it. Looking across different versions of this, like what if you have a game of Mafia? If you don't know what the game Mafia is, don't worry about it. What if you've got this debate scenario that I just described, all these different possible scenarios? What do the scaling curves look like in terms of how smart your judge model is going to be versus how smart the models are that are potentially trying to fool the judge model? And how often can the judge model actually succeed? They've got all these great scaling plots and, yeah, it's a good paper if you're interested in that model.
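Editor's note: the Elo comparison in this turn can be checked with the standard Elo expected-score formula. The sketch below is only that textbook formula, not the paper's fitted scaling curves, and the function names are made up for illustration. Expected score counts a draw as half a win, so it is only approximately "how often one model beats the other."

```python
import math


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula.

    Hypothetical helper name; the formula itself is the usual logistic
    curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rating_gap_for_score(expected_score: float) -> float:
    """Rating gap (A minus B) that yields the given expected score for A."""
    return 400.0 * math.log10(expected_score / (1.0 - expected_score))


# The 1200-vs-1000 example from the discussion.
print(round(elo_expected_score(1200, 1000), 2))  # about 0.76

# The gap that would correspond to winning 70% of the time.
print(round(rating_gap_for_score(0.70)))  # about 147
```

So a 200-point gap works out to roughly a 76% expected score for the higher-rated model, and a 70% expected score corresponds to a gap of roughly 147 points.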
Jeremy Harris
And one last story related to safety. OpenAI pledges to publish AI safety test results more often. So they have actually launched the Safety Evaluations Hub, a page where you can see their models' performance on various benchmarks related to safety, things like harmful content, jailbreaks and hallucinations. And yeah, you can really scroll through and basically see GPT-4o, GPT-4.1, 4.1 mini, 4.5, o1, all of them, for various things related to safety like refusal, jailbreaking, hallucination, and what the metrics are. Now, they're not presenting everything they do for safety. They don't have the metrics for their preparedness framework on here. They're going to continue to do that in the system cards. But nevertheless, I think it's an interesting move by OpenAI to make it extra easy to see where the models stand.
Andrei Karpathy
Yeah, this is if nothing else just a really great format to view these things in. And anyway you can check out the website, it's actually really nicely laid out.
Jeremy Harris
And that will be it for this episode of Last, and sometimes Last Last, Week in AI. As we've said, we'll try to not skip any more weeks in the near future. Thank you to all the listeners who stick by us, even though we do sometimes break that promise. As always, we appreciate your feedback, appreciate you sharing the podcast, giving reviews, corrections, questions, all that, and please do keep tuning in.
Andrei Karpathy
Break it down. Last Week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last Week in AI, come and take a ride. From the labs to the streets, AI's reaching high. New tech emerging, watching surgeon fly. From the labs to the streets, AI's reaching high. Algorithms shaping up the future seas. Tune in, tune in, get the latest with ease. Last Week in AI, come and take a ride. Get the lowdown on tech and let it slide. Up the headlines pop, data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.
Date: May 19, 2025
Hosts: Andrei Karpathy & Jeremy Harris
Main Theme:
This episode breaks down two weeks of significant AI news and developments, focusing on recent moves by OpenAI to retain its nonprofit structure, major developments in hardware (chips and suppliers), new tools and models in AI, cutting-edge research in reasoning and scientific discovery, and pivotal policy changes, especially in U.S. export controls. The hosts also discuss the field’s evolving dynamics among AI labs, investors, international policies, and the latest open-source releases.
Timestamps: [01:28] to [17:39]
OpenAI Will Not Transition to Full For-Profit Structure:
Implications and Remaining Controversies:
Investor Dynamics:
Timestamps: [18:05] to [26:39]
TSMC’s 2nm Process Exceeds Expectations:
Nvidia Setting Up Overseas HQ in Taiwan:
CoreWeave Raising Funds Amidst IPO Challenges:
Timestamps: [26:39] to [47:42]
Grok Chatbot Controversy on X (formerly Twitter):
Figma Launches AI-Powered Design & Prototyping Tools:
Google Brings Gemini to Android Auto & Upgrades Gemini 2.5 Pro:
The Rise of ‘Vibe Coding’:
Hugging Face Launches Open Agentic AI Tool:
Timestamps: [47:42] to [65:41]
Stability AI’s Stable Audio Open Small:
F Lite: Open AI Image Generator Trained Only on Licensed Data:
AM Thinking V1 – Reasoning Model from China’s “Beike” (a real estate giant):
BLIP3-o: Unified Multimodal Model (Images + Text, Both Directions):
Timestamps: [65:41] to [88:00]
DeepMind’s AlphaEvolve: Automated Algorithm & Code Discovery
Absolute Zero: Reinforced Self-Play Reasoning with Zero Data:
Epoch AI Report: How Far Can Reasoning Models Scale?
OpenAI’s HealthBench: Large-Scale Health QA Benchmark
Timestamps: [88:00] to [113:07]
Trump Administration Rescinds Biden’s AI Diffusion Rule:
Massive AI Deals in the Middle East:
Scaling Laws for Scalable Oversight:
OpenAI's Safety Evaluations Hub:
On OpenAI’s structure:
“It’s touted as a win for basic principle ... but there’s a slippery slope ... Can the nonprofit board meaningfully oversee Sam? We saw a catastrophic failure of that in the board debacle.”
— Karpathy [08:55]
On Grok’s system prompt leak:
“Grok for many different examples of just random questions... would reply regarding ‘white genocide’ in South Africa… even if a query is unrelated, which I suspect is the issue here. Weird.”
— Harris [26:39]
On the open-source AI model ecosystem:
“Anytime I try to explain it, it ends up sounding just like a pyramid scheme. ... At some point there’s a pot of gold at the end. Don’t worry about it.”
— Karpathy [53:37]
On AlphaEvolve paradigm:
“OpenAI has this: they’re super scale-pilled ... DeepMind comes at things like, ‘let’s almost replicate the brain in a way, in different chunks.’”
— Karpathy [68:31]
On Reasoning Models Scaling:
“You can only do that so many times until you hit essentially the ceiling of what current hardware can allow.”
— Karpathy [81:16]
On AI in Healthcare:
“That is wild. That is a four times higher score than the unassisted physician. That honestly like kind of blows my mind a little bit.”
— Karpathy [91:45]
The hosts reflect on how much progress in the AI field is tied not just to technical breakthroughs, but also to corporate maneuvering, global negotiations over chips and compute, and the mounting importance of both open benchmarks and synthetic data in the new RL-driven reasoning paradigm. The show ends with their signature rap outro, capturing the constant change and excitement at the intersection of research, industry, and policy.
For further links, code, and deep dives referenced, see the episode description.