Andrei Karpathy (3:31)
Yeah, that itself, right, is the first time that we've said that sentence in a truly confident way. Right. Grok, xAI, have the frontier AI model. That's a big, big statement. You look across all the metrics, it's not even ambiguous. GPQA, right, just smashing all the benchmarks. Of course, some of these are starting to get saturated, and certainly on GPQA we're getting there too. So expect, you know, the signal-to-noise on that one is dropping a bit. But, you know, AIME 25, that math olympiad qualification benchmark that had been so, so hard back in the day, again, pretty much saturated. As you mentioned, Humanity's Last Exam, right, so this one's really interesting. 41% success rate with tools, going all the way to 50.7. So more than a 50% success rate on this incredibly hard benchmark with the full Grok 4 Heavy. And they're showing the usual kind of beautiful training compute scaling and test-time compute scaling curves, with and without tool use. One interesting thing that you can kind of see just a little bit is how the spread between performance with and without tools actually increases as training compute increases too. So it seems as if the model is actually getting more and more leverage from tool use as it gets trained more. So that itself is sort of an interesting little sub-observation. This comes, of course, with a whole bunch of predictions and roadmap information which, you know, if you're familiar with how stuff goes at Tesla, it ends up happening at some point. It just may not happen exactly when Elon says it will at first. And he's famous for kind of coming up with these very aggressive deadlines, but, you know, again, things get done. The Falcon Heavy does get launched, the Starship does get launched, but, you know, it may take a little longer. Here's a quote from Elon in one of his interviews surrounding the launch.
He says he expects that Grok will be able to discover new technologies maybe this year, I think he said, but definitely next year. So new technologies that are useful, and new physics, he says, certainly within two years. So, you know, maybe multiply all those things by a factor of pi and you get to the kind of timeline there, but, you know, it's hard to know in this space. The roadmap's really interesting. So we have Grok 4 released today. They have a coding model that'll be coming out sometime in August. They expect a multimodal agent to be coming out in September and then a video generation model in October. So that's the rough roadmap. We've seen these things get pushed around at all the frontier labs because, you know, training runs just have to get debugged, weird things happen. But there you go. And then another thing, to go back to the other benchmarks: there's a lot of really impressive, as you said, Andre, kind of big level-up on these benchmarks. One of the most interesting is ARC-AGI-2, right. This is the mac daddy of supposedly very hard problems. Essentially every problem has a different rule structure that the model has to adapt to. It's an extension, let's say a modification, of ARC-AGI-1, which was François Chollet's kind of famous benchmark where, for a long, long time, models were smashing other benchmarks, but this one was stubbornly hard to smash. Now Claude 4 Opus is the next runner-up. It's in second place, right. It scores just under 10% on ARC-AGI-2. Grok 4, I mean, it's like, was that 17% or so, something like that, or sorry, 16%. So suddenly basically doubling the performance of Claude 4, which you just don't do, right, on these benchmarks, all of a sudden, in one increment, doubling that performance. So this is unambiguously a true frontier model. If you're curious about concrete real-world implications: Vending-Bench.
I don't know if we've talked about this benchmark, but basically, yeah. So every once in a while I come across stuff where I'm like, this is kind of news to me and I'm surprised and, I'm not gonna lie, a little bit embarrassed, because we're supposed to know this stuff. So Vending-Bench is where you have the agent manage a simulated vending machine business. And it's simulated because customer purchases are simulated. They have all kinds of factors that go into the simulation. They simulate price elasticity, reference prices, base sales, changes over days of the week and monthly multipliers, and then weather impact, product, right, there's all kinds of stuff that's factored in here. But fundamentally, given that complexity, the model is trying to optimize the revenue that it makes. So how does Grok 4 do here? Well, the net worth that it ends up accumulating on average across all these simulations is around 4,700 bucks. The next runner-up, again, Claude Opus 4: 2,100 bucks. So again, more than doubling that performance. Human performance, by the way, is 800. Before we get into like, oh well, you know, it's not a pro: no, no, this is smashing human performance, and in fairness it's the kind of task you might expect that to happen with. Right. Humans don't have the RAM to remember all these customer interactions and optimize accordingly. But this is a high-complexity and, frankly, starting to get pretty realistic, pretty applied, real-world, you know, simulation, blah, blah, blah. But anyway, really, really impressive benchmark scores. So xAI is in the game in a big way, guys. This comes with a bunch of follow-up questions about what their responsibility now is, right, on security, on control. They've spent all this time catching up, and you might say fair is fair, the price of getting up to speed is that, you know, you've got to cut corners on safety and security. But now, you know, where's the xAI alignment team going to be in six months?
How many people are going to be on it and who are they going to be? How are they going to be empowered? How much compute is going to be dedicated to it? What experiments are they going to publish? Like, this is where we start to get into that zone of, you're no longer in the same position you were in when you were complaining about OpenAI cutting corners. Now it's time to kind of put the chips on the table. So we're going to learn a lot about the future direction of xAI and the philosophy that animates it, now that they are truly, truly a frontier AI company.