Elon Musk’s Gives AI Updates!!! - Elon Musk Thinking

Podcast Summary: Elon Musk Thinking Episode: Elon Musk’s Gives AI Updates!!!
Host: Astronaut Man
Release Date: July 10, 2025

Introduction to Grok 4 Release and Its Capabilities

In the latest episode of Elon Musk Thinking, host Astronaut Man delves into groundbreaking updates about artificial intelligence, centered around the release of Grok 4, touted as the smartest AI in the world. Elon Musk kicks off the discussion by highlighting Grok 4's unprecedented intelligence and reasoning capabilities.

Elon Musk [00:00]: "Grok 4 is the smartest AI in the world and we're going to show you exactly how and why. It really is remarkable to see the advancement of artificial intelligence, how quickly it is evolving."

Musk compares the rapid development of AI to human cognitive growth, emphasizing Grok 4’s ability to outperform humans in various academic disciplines.

Technical Enhancements and Training Advancements

The conversation shifts to the technical advancements that have propelled Grok 4 to its current stature. Sam Altman and Greg Brockman provide insights into the substantial increase in training compute power and the introduction of advanced reinforcement learning (RL) techniques.

Sam Altman [02:18]: "There are two types of training compute... from Grok 3 to Grok 4, we're actually putting a lot of compute into reasoning and RL."

Greg Brockman elaborates on the scaling process, noting that each iteration—from Grok 2 to Grok 4—involved an order of magnitude increase in training efforts.

Greg Brockman [02:32]: "We've essentially increased the training by an order of magnitude each time... we built Colossus, the world's supercomputer with 100,000 H100s."

Benchmark Performance and Comparative Analysis

A significant portion of the episode is dedicated to showcasing Grok 4's performance on challenging benchmarks, particularly Humanity's Last Exam (HLE). This benchmark comprises 2,500 expert-curated problems across diverse subjects, from mathematics to linguistics.

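Sam Altman [03:47]: "Most of the models out there can only get single-digit accuracy on this benchmark."

Later in the episode the team reports that Grok 4 solves about a quarter of the HLE problems without any tools, about 40% as a single agent, and more than 50% of the text-only subset when multiple agents run in parallel.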

Elon Musk underscores Grok 4’s superiority by comparing its performance to that of humans, stating that even top graduate students would struggle to achieve Grok 4’s scores.

Elon Musk [05:00]: "Grok 4 is postgraduate, like PhD level in everything. Better than PhD, but like most PhDs would fail, so it's better."

Multimodal Capabilities and Future Enhancements

Despite its strengths, Grok 4 currently faces limitations in multimodal understanding, particularly in image and video processing. The team acknowledges these gaps and discusses impending improvements with version 7 of their foundation model.

Sam Altman [20:24]: "We are improving our multimodal understanding capabilities and expect much higher numbers on these benchmarks soon."

Elon Musk emphasizes the importance of enhancing vision capabilities to enable Grok 4 to interact seamlessly with the physical world, paving the way for applications like humanoid robots and advanced simulations.

Elon Musk [08:23]: "Ultimately, interacting with the real world via humanoid robots will allow Grok to formulate and test hypotheses in reality."

Real-World Applications and Demonstrations

The episode features live demonstrations showcasing Grok 4’s capabilities in practical scenarios. Dan Hendricks introduces the Vending Bench, an AI simulation where Grok 4 manages a virtual vending machine business, outperforming other models by generating double the net worth.

Axel [29:42]: "Grok 4 was able to formulate a strategy and adhere to it over a longer period, achieving double the net worth compared to other models."

Additionally, the team showcases Grok 4’s prowess in generating complex visualizations, such as simulating the collision of black holes with remarkable accuracy.

Axel [15:59]: "Grok 4 generated a visualization of two black holes colliding, accurately referencing real-world data and physical constants."

Future Developments: Video Understanding and Generation

Looking ahead, the team outlines ambitious plans to enhance Grok 4’s video understanding and generation capabilities. They anticipate launching AI-generated video games, TV shows, and even movies within the next year, leveraging improved multimodal skills.

Elon Musk [33:13]: "With version seven of our foundation model, we'll have excellent video understanding and expect the first AI-generated video game next year."

Greg Brockman highlights ongoing developments in coding models and the continuous improvement of multimodal AI, ensuring Grok 4 remains at the forefront of intelligent systems.

Greg Brockman [35:31]: "After coding, our focus is on eliminating multimodal weaknesses, enhancing image and video understanding for diverse applications."

Impact on Economy and Society

Elon Musk reflects on the broader implications of Grok 4's intelligence explosion, envisioning a future where AI-driven economies far surpass current scales. He draws parallels to the Kardashev scale, suggesting that humanity is on the brink of a significant technological leap.

Elon Musk [10:02]: "We are at the beginning of an immense intelligence explosion... The economy could become thousands or even millions of times larger than it is today."

Musk also addresses AI safety, advocating for instilling truth-seeking and honorable values within AI systems to ensure their alignment with human interests.

Elon Musk [09:34]: "The most important thing for AI safety is to be maximally truth-seeking... instill the right values."

Conclusion: The Road Ahead

As the episode wraps up, Greg Brockman summarizes the key takeaways, emphasizing Grok 4’s unparalleled intelligence, its real-world applications, and the exciting advancements on the horizon. The team reiterates their commitment to pushing the boundaries of AI, positioning themselves as leaders in the AGI landscape.

Greg Brockman [34:20]: "We're going to continue developing a model that is both fast and smart... and coding is just one of them."

Elon Musk closes with an optimistic outlook, anticipating transformative changes driven by Grok 4 and subsequent AI advancements.

Elon Musk [37:10]: "Thanks everyone. Good night."

Key Takeaways:

Grok 4 represents a significant leap in AI, outperforming graduate-level humans across multiple disciplines.
Substantial increases in training compute and advanced reinforcement learning are pivotal to Grok 4’s capabilities.
Grok 4 excels in complex benchmarks like Humanity's Last Exam, solving problems beyond human proficiency.
Ongoing enhancements aim to address multimodal limitations, with future updates promising robust image and video understanding.
Real-world applications demonstrated include virtual business management and accurate scientific simulations.
The AI’s potential impact on the economy and society is profound, with expectations of an intelligence-driven economic explosion.
AI safety remains a priority, focusing on instilling truthful and honorable values within intelligent systems.

Grok 4 is poised to revolutionize various sectors, from research and finance to entertainment and beyond, marking an exciting era in artificial intelligence development.

Transcript

[00:00]
Elon Musk
Welcome to the Grok 4 release here. This is the smartest AI in the world and we're going to show you exactly how and why. It really is remarkable to see the advancement of artificial intelligence, how quickly it is evolving. I sometimes compare it to the growth of a human and how fast a human learns and gains conscious awareness and understanding. And AI is advancing just vastly faster than any human. I mean, we're going to take you through a bunch of benchmarks that Grok 4 is able to achieve incredible numbers on. But it's actually worth noting that Grok 4, if given like the SAT, would get perfect SATs every time, even if it's never seen the questions before. And even going beyond that to, say, graduate student exams like the GRE, it will get near perfect results in every discipline of education. So from the humanities to like languages, math, physics, engineering, pick anything. And we're talking about questions that it's never seen before. These are not on the Internet. And Grok 4 is smarter than almost all graduate students in all disciplines simultaneously. Like, it's actually just important to appreciate that, like that's really something. And the reasoning capabilities of Grok are incredible. So there's some people out there who think AI can't reason. And look, it can reason at superhuman levels. So yeah, and frankly it only gets better from here. So we'll take you through the Grok 4 release and show you the pace of progress here. I guess the first part is like in terms of the training, we're going from Grok 2 to Grok 3 to Grok 4. We've essentially increased the training by an order of magnitude in each case. So it's 100 times more training than Grok 2 and that's only going to increase. So it's, yeah, frankly, I mean, I don't know, in some ways a little terrifying, but the growth of intelligence here is remarkable.
[02:19]
Sam Altman
Yes. It's important to realize there are two types of training compute. One is the pre-training compute; that's from Grok 2 to Grok 3. But from Grok 3 to Grok 4, we're actually putting a lot of compute into reasoning, into RL.
[02:33]
Greg Brockman
Just like you said, this is literally the fastest moving field. And Grok 2 is like the high school student by today's standard. If we look back in the last 12 months, Grok 2 was only a concept. We didn't even have Grok 2 12 months ago. And then by training Grok 2, that was the first time we scaled up the pre-training. We realized that if you actually do the data ablation really carefully, and the infra and also the algorithm, we can actually push the pre-training quite a lot, by an amount of 10x, to make the model the best pre-trained base model. And that's why we built Colossus, the world's supercomputer with 100,000 H100s. And then with the best pre-trained model, we realized if you can collect these verifiable outcome rewards, you can actually train this model to start thinking from first principles, start to reason, correct its own mistakes. And that's where the Grok 3 reasoning comes from. And today we ask the question: what happens if you take the expansion of Colossus with all 200,000 GPUs and put all of these into RL, 10x more compute than any of the models out there on reinforcement learning, at unprecedented scale? What's going to happen? So this is the story of Grok 4. And Tony, share some insight with the audience.
[03:47]
Sam Altman
Yeah, so yeah, let's just talk about how smart Grok 4 is. So I guess we can start discussing this benchmark called Humanity's Last Exam. And this benchmark is a very, very challenging benchmark. Every single problem is curated by subject matter experts. It's in total 2,500 problems and it consists of many different subjects: mathematics, natural sciences, engineering and also all the humanities subjects. So essentially when it was first released, actually like earlier this year, most of the models out there could only get single-digit accuracy on this benchmark. Yeah, so we can look at some of those examples. There is this mathematical problem which is about natural transformations in category theory. And there's this organic chemistry problem that talks about electrocyclic reactions. And also there's this linguistic problem that tries to ask you about distinguishing between closed and open syllables from a Hebrew source text. So you can see also it's a very wide range of problems, and every single problem is a PhD or even advanced research level problem.
[05:00]
Elon Musk
Yeah, I mean there are no humans that can actually answer these, that can get a good score. I mean if you actually say like any given human, what's the best that any human could score? I'd say maybe 5%, optimistically. So this is much harder than what any human can do. It's incredibly difficult. And you can see from the types of questions, like you might be incredible in linguistics or mathematics or chemistry or physics or any one of a number of subjects, but you're not going to be at a postgrad level in everything. And Grok 4 is at a postgrad level in everything. Some of these things are just worth repeating. Like Grok 4 is postgraduate, like PhD level in everything. Better than PhD, but like most PhDs would fail, so it's better. That said, I mean, at least with respect to academic questions, I want to just emphasize this point. With respect to academic questions, Grok 4 is better than PhD level in every subject, no exceptions. This doesn't mean that it's, you know, at times it may lack common sense and it has not yet invented new technologies or discovered new physics, but that is just a matter of time. It may discover new technologies as soon as later this year and I would be shocked if it has not done so next year. So I would expect Grok to literally discover new technologies that are actually useful no later than next year, and maybe end of this year. It might discover new physics next year, and within two years I'd say almost certainly. So just let that sink in.
[06:42]
Sam Altman
Okay, so I guess we can talk about what's behind the scenes of Grok 4. As Jimmy mentioned, we're actually throwing a lot of compute into this training. When it started it's only a single-digit number, but as you start putting in more and more training compute, it started to gradually become smarter and smarter and eventually solved a quarter of the HLE problems. And this is without any tools. The next thing we did was adding tool capabilities to the model. And unlike Grok 3 (I think Grok 3 actually is able to use tools as well), here we actually make it more native, in the sense that we put the tools into training. Grok 3 was only relying on generalization. Here we actually put the tools into training and it turns out this significantly improves the model's capability of using those tools.
[07:37]
Greg Brockman
So how is this different?
[07:38]
Sam Altman
Deep search was exactly the Grok 3 reasoning model without any specific training. But we only asked it to use those tools. So compared to this it was much weaker in terms of its tool capabilities and unreliable. And unreliable. Yes, yes.
[07:55]
Elon Musk
And to be clear, like these are still, I'd say, this is still fairly primitive tool use. If you compare it to, say, the tools that are used at Tesla or SpaceX, where you're using finite element analysis and computational fluid dynamics, or say like Tesla does, like crash simulations where the simulations are so close to reality that if the test doesn't match the simulation, you assume that the test article is wrong. That's how good the simulations are.
[08:23]
Sam Altman
So.
[08:23]
Elon Musk
So Grok is not currently using any of the tools that a company would use, but that is something that we will provide it with later this year. So we'll have the tools that a company has and a very accurate physics simulator. Ultimately, the thing that'll make the biggest difference is being able to interact with the real world via humanoid robots. So you combine Grok with Optimus and it can actually interact with the real world: it can formulate a hypothesis and then confirm if that hypothesis is true or not. So we're really, you know, I think about like where we are today. We're at the beginning of an immense intelligence explosion. We're in the intelligence big bang right now, and the most interesting time to be alive of any time in history. Now that said, we need to make sure that the AI is a good AI. The thing that I think is most important for AI safety, at least my biological neural net tells me the most important thing for AI is to be maximally truth seeking. You can think of AI as this super genius child that ultimately will outsmart you, but you can still instill the right values, encourage it to be sort of, you know, truthful, honorable, you know, good things, like the values you want to instill in a child that will ultimately grow up to be incredibly powerful. Yeah, these are, I'd say, still primitive tools, not the kind of tools that serious commercial companies use, but we will provide it with those tools and I think it will be able to solve real world technology problems.
[09:56]
Sam Altman
Yes, yes, exactly.
[09:58]
Greg Brockman
Or is it just compute all you need, Tony? Is it just compute all you need at this point?
[10:02]
Elon Musk
Well, you need compute plus the right tools and then ultimately to be able to interact with the physical world, and then we'll effectively have an economy that is ultimately thousands of times bigger than our current economy, or maybe millions of times. If you think of civilization as percentage completion of the Kardashev scale, where Kardashev 1 is using all the energy output of a planet and Kardashev 2 is using all the energy output of a sun and 3 is all the energy output of a galaxy, we're only, in my opinion, probably closer to 1% of Kardashev 1 than we are to 10%. So like maybe 1 or 2% of Kardashev 1. So we will get most of the way, like 80, 90% of Kardashev 1. And then hopefully, if civilization doesn't self-annihilate, the actual notion of a human economy, assuming civilization continues to progress, will seem very quaint in retrospect. It will seem like sort of cavemen throwing sticks into a fire level of economy compared to what the future will hold. I mean, it's very exciting. I've been at times kind of worried about, like, well, you know, this seems like it's somewhat unnerving to have intelligence created that is far greater than our own. And will this be bad or good for humanity? I think it'll be good. Most likely it'll be good. But I somewhat reconciled myself to the fact that even if it wasn't going to be good, I'd at least like to be alive to see it happen. Yeah.
[11:39]
Sam Altman
I think one technical problem that we still need to solve, besides just compute, is how do we unblock the data bottleneck? Because when we try to scale up the RL, in this case, we did invent a lot of new techniques, innovations to allow us to figure out how to find a lot of challenging RL problems to work on. It's not just that the problem itself needs to be challenging; you also need to have, like, a reliable signal to tell the model, you did it wrong, you did it right. This is sort of the principle of reinforcement learning. And as the models get smarter and smarter, the number of cool problems or challenging problems will be less and less. So it's going to be a new type of challenge that we need to surpass besides just compute.
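Editor's sketch of the "reliable signal" idea described in this turn. It is not the team's method: `normalize`, `verifiable_reward`, and `usable_for_training` are hypothetical names, and the rule of keeping only problems with mixed outcomes is an assumption used to illustrate why suitably challenging problems become scarce as models improve.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different spellings compare equal."""
    text = answer.strip().lower().rstrip(".")
    try:
        # Numeric answers: "0.5", "1/2" and "2/4" all become "1/2".
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers: collapse runs of whitespace.
        return " ".join(text.split())


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0


def usable_for_training(rewards: list[float]) -> bool:
    """Assumed filter: keep a problem only if sampled attempts are neither all right nor all wrong."""
    if not rewards:
        return False
    pass_rate = sum(rewards) / len(rewards)
    return 0.0 < pass_rate < 1.0
```

Under this assumed filter, a problem that every sampled attempt already solves carries no learning signal, which is one way to read the remark that challenging problems become fewer as the model gets smarter.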
[12:27]
Elon Musk
Yeah, and we actually are running out of actual test questions to ask. So there's like, even questions that are ridiculously hard, if not essentially impossible for humans that are written down, questions are becoming trivial for AI. You know, the one thing that is an excellent judge of things is reality. So because physics is the law, ultimately everything else is a recommendation. You can't break physics. So the ultimate test, I think, for whether an AI is the ultimate reasoning test is reality. So you invent a new technology like, say, improve the design of a car or a rocket or create a new medication. Does it work? Does the rocket get to orbit? Does the car drive? Does the medicine work? Whatever the case may be, reality is the ultimate judge here. So it's going to be a reinforcement learning closing loop around reality.
[13:19]
Sam Altman
We asked the question, how do we even go further? So actually, we are thinking about, now with a single agent, we're able to solve 40% of the problems. What if we have multiple agents running at the same time? So this is what's called test-time compute. And as we scale up the test-time compute, actually we are able to solve more than 50% of the text-only subset of the HLE problems. So it's a remarkable achievement.
[13:47]
Elon Musk
I think this is insanely difficult. What we're saying is a majority of the text-based questions of the scarily named Humanity's Last Exam, Grok 4 can solve, and you can try it out for yourself with Grok 4 Heavy. What it does is it spawns multiple agents in parallel and all of those agents do work independently and then they compare their work and they decide which one. It's like a study group. It's not as simple as a majority vote because often only one of the agents actually figures out the trick or figures out the solution. But once they share the trick or figure out what the real nature of the problem is, they share that solution with the other agents and then they compare notes and yield an answer. So that's the heavy part of Grok 4: you scale up the test-time compute by roughly an order of magnitude, have multiple agents tackle the task, and then they compare their work and they put forward what they think is the best result. Yeah.
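Editor's sketch of the "study group" pattern described in this turn: several agents work in parallel, then every attempt is reviewed by the group, so a single correct attempt can beat a popular wrong one. Everything here is a toy under stated assumptions. `Attempt`, `study_group`, `majority_vote`, the three solvers, and the reviewer are hypothetical names, and the source does not say how Grok 4 Heavy actually compares work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Attempt:
    agent: str
    answer: int
    rationale: str


Solver = Callable[[str], Attempt]
Reviewer = Callable[[str, Attempt], bool]


def study_group(problem: str, solvers: list[Solver], reviewers: list[Reviewer]) -> Attempt:
    """Run solvers in parallel, then return the attempt that most reviewers endorse."""
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        attempts = list(pool.map(lambda solve: solve(problem), solvers))

    def endorsements(attempt: Attempt) -> int:
        return sum(review(problem, attempt) for review in reviewers)

    return max(attempts, key=endorsements)


def majority_vote(attempts: list[Attempt]) -> int:
    """Baseline for contrast: the most common answer wins, right or wrong."""
    return Counter(a.answer for a in attempts).most_common(1)[0][0]


# Toy problem (an assumption for illustration): the smallest prime greater than 13.
def is_prime(n: int) -> bool:
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def careless(problem: str) -> Attempt:
    return Attempt("careless", 15, "next odd number after 13")


def hasty(problem: str) -> Attempt:
    return Attempt("hasty", 15, "13 + 2")


def careful(problem: str) -> Attempt:
    return Attempt("careful", 17, "15 = 3 * 5, so 17 is the next prime")


def reviewer(problem: str, attempt: Attempt) -> bool:
    """Checking a shared answer is cheap: prime, with no smaller prime above 13."""
    return is_prime(attempt.answer) and not any(is_prime(k) for k in range(14, attempt.answer))


solvers = [careless, hasty, careful]
best = study_group("smallest prime greater than 13", solvers, [reviewer] * 3)
```

In this toy run two of three agents answer 15, so `majority_vote` would return 15, while `study_group` returns the lone attempt that survives review, 17. The design choice being illustrated is that verifying a shared solution is usually easier than finding it.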
[14:49]
Sam Altman
So we will introduce Grok 4 and Grok 4 Heavy. Sorry, can you click the next slide? So, yeah, so basically Grok 4 is the single-agent version and Grok 4 Heavy is the multi-agent version. So let's take a look at how they actually do on those exam problems and also some real-life problems.
[15:09]
Dan Hendricks
Yeah. So we're going to start out here and we're actually going to look at one of those HLE problems. This is actually one of the easier math ones. I don't really understand it very well, I'm not that smart. But I can launch this job here and we can actually see how it's going to go through and start to think about this problem. While we're doing that, I also want to show a little bit more about what this model can do and launch a Grok 4 Heavy as well. So everyone knows Polymarket, it's extremely interesting. It aligns with what reality is most of the time. And with Grok, what we're actually looking at is being able to see how we can try to take these markets and see if we can predict the future as well. So as we're letting this run, we'll see how Grok 4 Heavy goes about predicting the World Series odds for the current teams. And while we're waiting for these to process, we're going to pass it over to Eric and he's going to show you an example of his.
[16:00]
Axel
Yeah, so I guess one of the coolest things about Grok 4 is its ability to understand the world and to solve hard problems by leveraging tools, like Tony discussed. And I think one kind of cool example of this: we asked it to generate a visualization of two black holes colliding. In many cases it's actually pretty clear in its thinking trace about what these liberties are. For example, in order for it to actually be visible, you need to really exaggerate the scale of the waves. And yeah, so here's like, you know, this kind of in action. It exaggerates the scale in like multiple ways. It drops off less in terms of amplitude over distance. But we can see the basic effects that are actually correct. It starts with the inspiral, it merges and then you have the ringdown. This is basically largely correct, modulo some of the simplifications that it needs to do. It's actually quite explicit about this, but it uses post-Newtonian approximations instead of actually computing the general relativistic effects near the center of the black hole, which is incorrect and, you know, will lead to somewhat incorrect results. But the overall visualization is, yeah, is basically there, and you can actually look at the kinds of resources that it references. So here it actually, you know, it obviously uses search. It gathers results from a bunch of links, but also reads through an undergraduate text on analytic gravitational wave models. It reasons quite a bit about the actual constants that it should use for a realistic simulation. It references existing real world data.
[17:44]
Greg Brockman
It's a pretty good model going forward.
[17:46]
Elon Musk
We can give it the same model that physicists use so it can run the same level of compute that leading physics researchers are using and give you a physics accurate black hole simulation.
[17:57]
Dan Hendricks
Just right now is running in your browser.
[17:58]
Elon Musk
This is just running in your browser. Pretty simple.
[18:00]
Dan Hendricks
Swapping back real quick here we can actually take a look. The math problem is finished. The model was able to. Let's look at its thinking trace here so you can see how it went through the problem. I'll be honest with you guys, I really don't quite fully understand the math. But what I do know is that I looked at the answer ahead of time and it did come to the correct answer here in the final part. We can also come in and actually take a look here at our World Series prediction. And it's still thinking through on this one, but we can actually try some other stuff as well. So we worked very heavily on working with all of our X tools and building out a really great X experience. So we can actually ask the model, you know, find me the XAI employee that has the weirdest profile photo and Then we can actually try out, you know, let's create a timeline based on X post detailing the, you know, changes in the scores over time. And we can see, you know, all the conversation that was taking place at that time as well. So we can see who are the, you know, announcing scores and like, what was the reactions at those times as well. If we go back to this was the Greg Yang photo here. So Greg Yang, of course, who has his favorite photograph that he has on his account. That's actually not how he looks like in real life, by the way.
[19:10]
Elon Musk
But it had to understand that question.
[19:12]
Dan Hendricks
Yeah.
[19:12]
Elon Musk
Which is that. That's the wild part. It's like it understands what is a weird photo. What is a weird photo? Yeah, what is a less or more weird photo?
[19:21]
Dan Hendricks
It goes through. It has to find all the team members, has to figure out who we all are.
[19:25]
Elon Musk
Searches without access to the internal XAI personnel logs. It's literally looking at. Just at the Internet.
[19:32]
Dan Hendricks
Exactly.
[19:32]
Elon Musk
So you could say like the weirdest of any company.
[19:34]
Dan Hendricks
Yeah. And we can also take a look here at the question here for Humanity's Last Exam. So it is still researching all of the historical scores, but it will have that final answer here soon. While it's finishing up, we can take a look at one of the ones that we set up here a second ago. And we can see like, you know, it finds the date that Dan Hendricks had initially announced it. We can go through. We can see, you know, OpenAI announcing their score back in February. And we can see, you know, as progress happens with like Gemini, we can see like Kimi, and we can also even see, you know, the leaked benchmarks of what people are saying is, you know, if it's right, it's going to be pretty impressive. So pretty cool. But yeah, it's been great.
[20:14]
Elon Musk
Yeah. And we're going to close the loop around usefulness as well. So it's like it's not just book smart, but actually practically smart.
[20:21]
Dan Hendricks
Exactly. And we can go back to the slides here.
[20:24]
Sam Altman
Yeah. So we actually evaluate also on the multimodal subset. So on the full set, this is the number on the HLE exam, you can see there's a little dip on the numbers. This is actually something we're improving on, which is the multimodal understanding capabilities. But I do believe in a very short time we're able to really improve and got much higher numbers on this. Even higher numbers on this benchmark.
[20:51]
Elon Musk
The biggest weakness of Grok currently is that it's sort of partially blind. Its image understanding, obviously, and its image generation need to be a lot better, and that's actually being trained right now. Grok 4 is based on version 6 of our foundation model. We are training version 7, which we'll complete in a few weeks. That'll address the weakness on the vision side.
[21:16]
Dan Hendricks
Just to show off this last one here. So the prediction market finished here with Heavy, and we can see all the tools and the process it used to actually go through and find the right answer. It browsed a lot of odds sites, it calculated its own odds comparing to the market to find its own alpha and edge. It walks you through the entire process here and it calculates the odds of the winner being like the Dodgers, and it gives them a 21.6% chance of winning this year. And it took approximately four and a half minutes to compute.
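Editor's sketch of what "comparing to the market to find its own alpha and edge" could mean in the simplest case. Only the 21.6% figure comes from the demo. The market prices below are invented placeholders, and `normalize_prices` and `edge` are hypothetical helpers, not anything shown on stream.

```python
def normalize_prices(prices: dict[str, float]) -> dict[str, float]:
    """Turn per-outcome share prices into probabilities that sum to 1.

    Prices across all outcomes of a market often sum to slightly more
    than 1, so each one is divided by the total.
    """
    total = sum(prices.values())
    return {outcome: price / total for outcome, price in prices.items()}


def edge(model_probability: float, market_probability: float) -> float:
    """Positive when the model thinks the outcome is likelier than the market does."""
    return model_probability - market_probability


# Placeholder market prices (NOT from the source), one share pays 1.00 if the outcome occurs.
market_prices = {"Dodgers": 0.25, "Yankees": 0.15, "Field": 0.65}
market = normalize_prices(market_prices)

model_dodgers = 0.216  # the model's estimate quoted in the demo
dodgers_edge = edge(model_dodgers, market["Dodgers"])
```

With these placeholder prices the normalized market probability for the Dodgers is higher than 21.6%, so the edge is negative and the sketch would suggest no bet on that outcome. Different prices would of course change the sign.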
[21:51]
Elon Musk
That's a lot of thinking.
[21:53]
Sam Altman
We can also look at all the other benchmarks besides HLE. As it turned out, Grok 4 excelled on all the reasoning benchmarks that people usually test on, including GPQA, which is a PhD-level problem set that's easier compared to HLE. On AIME 25, the American Invitational Mathematics Examination, with Grok 4 Heavy we actually got a perfect score. Also on a coding benchmark called LiveCodeBench, and also on HMMT, the Harvard-MIT math exam, and also USAMO. You can see that on all of those benchmarks we often have a very large leap against the second best model out there.
[22:36]
Elon Musk
Yeah, it's really going to get to the point where it's going to get every answer right in every exam and where it doesn't get an answer right, it's going to tell you what's wrong with the question or if the question is ambiguous, disambiguate the question into answers A, B and C and tell you what answers A, B and C would be with a disambiguated question. So the only real test then will be reality. Can it make useful technologies discover new science? That'll actually be the only thing left because human tests will simply not be meaningful.
[23:06]
Greg Brockman
You need to make an update to HLE very soon given the current rate of progress. So yeah, it's super cool to see multiple agents that collaborate with each other solving really challenging problems. So we're going to try this model. It's available right now. If we advance to the next slide, there is a SuperGrok Heavy tier that we're introducing where you're able to access both Grok 4 and Grok 4 Heavy, where you're actually going to be the taskmaster of a bunch of little Grok research agents to help you become smarter throughout the research and save hours of time going through mundane tasks. And it's available right now.
[23:41]
Elon Musk
Yeah, we did limit usage during the demo so it didn't break the demo, because all this stuff is happening live. So there's nothing canned about the tests that we're doing. After the demo is done, we'll enable more subscribers for SuperGrok. So if you can't subscribe right now, just try in half an hour. It should work.
[23:59]
Aman
So many of you have been enjoying our voice mode and we've been working hard to improve the experience over the past couple months. We have cut latency in half to make it much snappier. And today we're excited to announce a set of new voices that have exceptional naturalness and prosody. You might have noticed the movie trailer voice that opened up the live stream. That is one of our new voices, Sal, who's got that epically deep tone. And we're also excited to introduce Eve, a beautiful British voice who's capable of rich emotions. Aman, would you like to introduce Eve?
[24:36]
Eve
Absolutely. Let's get into the demo. Hey, Eve, millions of people are watching us right now.
[24:41]
Greg Brockman
Are you excited?
[24:42]
Eve
Oh, I'm thrilled, really. It's like being on stage at the Old Vic with a sea of faces out there. Just imagine all those eyes and ears tuned in. It's a proper buzz, isn't it? What do you reckon we should chat about to keep this lot entertained?
[24:57]
You know, we have a lot to talk about, but I'll be honest, I'm feeling a little nervous. Can you whisper something soothing to calm me down?
[25:03]
Take a deep breath, love. You've got this. Feel that calm wash over you. Then keep talking nice and easy. What's on your mind?
[25:12]
Elon Musk
You know what?
[25:13]
Eve
I'm actually feeling quite relaxed to try something crazy. So can you give us an opera on Diet Coke?
[25:19]
Oh, an opera on Diet Coke. That's a riot. Picture this, a shimmering silver can glistening under stage lights. O Diet Coke, thou elixir divine, with bubbles that dance in a sparkling line. Thy crisp cool kiss on lips so fine. O aspartame ambrosia, forever mine. How's that for a mad little aria? Want me to belt out another verse or switch up the tune?
[25:54]
Okay, let's try switching it up and just start singing.
[25:56]
Right then, let's have a go. Oh, silver can, you gleam in the light. Diet Coke, my heart's delight.
[26:04]
Okay, I won't touch it. Very, very poetic. Thank you, Eve. Feeling quite nice now.
[26:09]
Greg Brockman
Cool.
[26:10]
Aman
So that's Eve. Something else we do here is a little test each day to check on the natural flow of our conversations against some of our competitors. So we figured why not try that live on the stream today.
[26:24]
Sam Altman
Should I start?
[26:25]
Eve
Hey, I'm going to say a number and can you say the same number back to me? One.
[26:28]
One. Got it. This is like a fun little game of echo. What's next?
[26:32]
2. So you said 2.
[26:34]
3. 4. Nice and steady. What's next?
[26:38]
5.
[26:39]
5. Nailed it. This is pretty fun actually. Feel free to keep going or switch it up if you'd like.
[26:44]
Aman
Hey Grok.
[26:45]
Eve
Hello there. How can I help you?
[26:47]
Aman
Today we're gonna do a little test. Can you repeat after me? 1.
[26:51]
Eve
1. Alright, what's next? 2. 2. What's on your mind?
[26:55]
Sam Altman
3.
[26:56]
Eve
3. Need anything else?
[26:58]
Aman
4.
[26:59]
Eve
4. How can I assist you?
[27:01]
Aman
5.
[27:01]
Eve
5. What's next?
[27:03]
So as you can see, Grok was snappier. It didn't interrupt me, and on the prosody, we made different design choices. I think we were shooting for something more calm, smooth, more natural versus something that's more poppy or artificial. So we'll keep improving on these fronts.
[27:18]
Elon Musk
Thanks guys. Yeah.
[27:19]
Greg Brockman
Yep. So since the launch of the voice model, we actually see 2x faster end-to-end latency in the last 8 weeks, 5 different voices, and also 10x the active users. So Grok voice is taking off. Now, if you think about releasing the models, this time we're also releasing Grok 4 through the API at the same time. We're very excited about what all the developers out there are going to build. So if I think about myself as a developer, the first thing I'm going to do when I have access to the Grok 4 API is benchmarks. We actually asked around on the X platform: what are the most challenging benchmarks out there, the ones considered the holy grail for all the AGI models? So it turns out AGI is in the name: ARC-AGI. So in the last 12 hours, kudos to Greg over here in the audience, who answered our call, took a preview of the Grok 4 API, and independently verified Grok 4's performance. So initially we thought, hey, Grok 4, we think it's pretty good, it's pretty smart, it's our next-gen reasoning model. Spent 10x more compute, can use all the tools, right? But it turned out, when we actually verified on the private subset of ARC-AGI v2, it was the only model in the last three months that breaks the 10% barrier. And in fact, it was so good that it actually gets 16%, well, 15.8% accuracy, 2x that of the second place, which is the Claude 4 Opus model. It's not just about performance. When you think about intelligence, having the API model drive your automation, it's also the intelligence per dollar.
[28:53]
Elon Musk
Right.
[28:53]
Greg Brockman
If you look at the plots over here, Grok is just in a league of its own. All right, so enough of benchmarks, right? So what can Grok do in the real world? We contacted the folks from Andon Labs, who were gracious enough to try Grok in the real world to run a business.
[29:09]
Axel
Yeah, thanks for having us. So I'm Axel from Andon Labs.
[29:12]
Lukas
And I'm Lukas. And we tested Grok 4 on Vending-Bench. Vending-Bench is an AI simulation of a business scenario where we thought, what is the most simple business an AI could possibly run? And we thought vending machines. In this scenario, Grok and other models need to do stuff like manage inventory, contact suppliers, set prices. All of these things are super easy, and all the models can do them one by one. But when you do them over very long horizons, most models struggle. But we have a leaderboard and there's a new number one.
[29:43]
Axel
Yeah. So we got early access to the Grok 4 API. We ran it on Vending-Bench, and we saw some really impressive results. So it ranks definitely at the number one spot. It's even double the net worth, which is the measure that we have on this eval. So it's not about the percentage or score you get, but it's more the dollar value in net worth that you generate. So we were impressed by Grok: it was able to formulate a strategy and adhere to that strategy over a long period of time, much longer than other models that we have tested, other frontier models. So it managed to run the simulation for double the time and score, yeah, double the net worth. And it was also really consistent across these runs, which is something that's really important when you want to use this in the real world.
[30:24]
Lukas
And I think as we give more and more power to AI systems in the real world, it's important that we test them in scenarios that either mimic the real world or are in the real world itself, because otherwise we fly blind into some things that might not be great.
[30:39]
Elon Musk
It's great to see that we've now got a way to pay for all those GPUs. We just need a million vending machines. We could make $4.7 billion a year with a million vending machines.
[30:48]
Axel
100%.
[30:48]
Greg Brockman
Let's go.
[30:49]
Elon Musk
They can be epic vending machines.
[30:50]
Greg Brockman
Yes, yes.
[30:51]
Elon Musk
All right. We are actually going to install vending machines here. Like, a lot of them. We're happy to supply them. All right, thank you. All right. I'm looking forward to seeing what amazing things are in this vending machine.
[31:04]
Lukas
That's for you to decide.
[31:06]
Elon Musk
All right, I'll tell the AI. Okay, sounds good.
[31:09]
Greg Brockman
Yeah, I mean, so we can see Grok is able to become like the copilot of the business unit. So what else can Grok do? So we're actually releasing this Grok, if you want to try it right now, to evaluate and run the same benchmarks as us. It's on the API and has a 256k context length. So we already actually see some of the early adopters trying the Grok 4 API. So our Palo Alto neighbor, Arc Institute, which is a leading biomedical research center, is already seeing how they can automate their research flows with Grok 4. It turns out it is able to help the scientists sift through millions of experiment logs and then just pick the best hypothesis within a split second. We see this being used for their CRISPR research. And also Grok 4, independently evaluated, scores as the best model to examine chest X-rays. Who would know? And in the financial sector, we also see that Grok 4, with access to all the tools and real-time information, is actually one of the most popular AIs out there. Grok 4 is also going to be available on the hyperscalers. So the xAI enterprise sector only started two months ago, and we're open for business. The other thing: we talked a lot about having Grok make video games. So Danny is actually a video game designer on X. So, you know, we mentioned, who wants to try out some Grok 4 preview APIs to make games? And Danny answered the call. This was actually just made, a first-person shooter game, in a span of four hours. One of the unappreciated hardest problems of making video games is not necessarily coding the core logic of the game, but actually sourcing all the assets, all the textures and files, to create a visually appealing game. So one of the core things Grok 4 does really well, with all the tools out there, is that it's actually able to automate this asset sourcing.
So the developers can just focus on the core development itself. So now you can run, you know, an entire game studio with, like, one person, and then you can have Grok 4 go out and source all those assets and do all the mundane tasks for you.
[33:14]
Elon Musk
The next step, obviously, is for Grok to be able to play the game. So it has to have very good video understanding so it can play the games and interact with the games and actually assess whether a game is fun, and actually have good judgment for whether a game is fun or not. So with version seven of our foundation model, which finishes training this month, and then we'll go through post-training, RL and whatnot, that will have excellent video understanding. And with video understanding and improved tool use, for example, for video games, you'd want to use Unreal Engine or Unity or one of the main graphics engines, generate the art, apply it to a 3D model, and then create an executable that someone can run on a PC or a console or a phone. We expect that to happen probably this year, and if not this year, certainly next year. It's going to be wild. I would expect the first really good AI video game to be next year, and probably the first half hour of watchable TV this year, and probably the first watchable AI movie next year. Like, things are really moving at an incredible pace.
[34:20]
Greg Brockman
Yeah. When Grok is 10x-ing the world economy with vending machines, it will just create video games for humans.
[34:25]
Elon Musk
Yeah. I mean, it went from not being able to do any of this six months ago to what you're seeing before you here, and from very primitive a year ago to making a 3D video game with a few hours of prompting.
[34:39]
Greg Brockman
I mean, yeah. Just to recap, in today's livestream, we introduced the most powerful, most intelligent AI models, which can actually reason from first principles, use all the tools, do all the research, go on the journey for 10 minutes, and come back with the most correct answer for you. So it's kind of crazy to think about: just like four months ago we had Grok 3, and now we already have Grok 4. And we're going to continue to accelerate. As a company, xAI, we're going to be the fastest-moving AGI company out there. So what's coming next is that we're going to continue developing a model that's not just intelligent, smart, thinks for a really long time, and spends a lot of compute; having a model that's actually both fast and smart is going to be the core focus. So if you think about what applications out there can really benefit from all those very intelligent, fast and smart models, coding is actually one of them.
[35:32]
Sam Altman
Yeah. So the team is currently working very heavily on coding models. I think right now the main focus is we actually trained recently a specialized coding model which is going to be both fast and smart. I believe we can share that model with you in a few weeks.
[35:48]
Greg Brockman
Yeah, that's very exciting. The second, after coding: we all see that the weakness of Grok 4 is the multimodal capability. In fact, it was so bad that Grok is effectively just looking at the world squinting through the glass, seeing all the blurry features and trying to make sense of it. The most immediate improvement we're going to see with the next-generation pre-trained model is a step-function improvement in the model's capability in terms of image understanding, video understanding and audio. Now the model is able to hear and see the world just like any of you. And now, with all the tools at its command and all the other agents it can talk to, we're going to see a huge unlock for many different application layers. After the multimodal agents, what's going to come next is video generation. And we believe that at the end of the day it should just be pixel in, pixel out. Imagine a world where you have this infinite scroll of content in inventory on the X platform, where not only can you actually watch these generated videos, but you're also able to intervene and create your own adventures.
[36:55]
Elon Musk
We expect to be training our video model with over 100,000 GB200s and to begin that training within the next three or four weeks. So we're confident it's going to be pretty spectacular in video generation and video understanding.
[37:09]
Greg Brockman
We're very excited for you guys to try Grok 4.
[37:11]
Elon Musk
All right, thanks everyone. Good night.