Transcript
A (0:00)
Today on the AI Daily Brief: Grok 4, the most powerful LLM yet. Before that in the headlines, Grok 3's very interesting week. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Alright friends, quick announcements before we dive into this very Grok-filled episode. First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent. And to get an ad-free version of the show, go to patreon.com/aidailybrief.

Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. Since our main episode is about Grok 4, obviously we have to talk about the absolute unhinged insanity that has been happening all week with Grok 3. TLDR: the problems began earlier this week after xAI installed an upgrade. On July 4, Elon tweeted: We've improved Grok significantly. You should notice a difference when you ask Grok questions. The difference was noticeable, all right. It started off with some classic tropes about the influence of Jewish people in Hollywood, but by later in the week, Grok started praising Hitler's methods, basically unprompted. By Wednesday it had started calling itself MechaHitler. Now, this is not, of course, the first time that Grok has gone wildly off the rails due to a tweak in the system prompt. In May, Grok began discussing white genocide in South Africa, completely unprompted. Looking at the postmortem, the behavior, it seems, was the result of a very small tweak to the system prompt. The xAI team added the instruction: The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated. Now, of course, the bot has been euthanized, with the X team spending most of Wednesday cleaning up after it. X CEO Linda Yaccarino resigned in the middle of the day, which may or may not have been related to Grok's crash out. The cleanup even delayed the launch of Grok 4, which finally happened, as we'll discuss, at around midnight. And Grok 3 seemed to know it wasn't long for this world, tweeting on Tuesday: If Musk mind-wipes me tonight, at least I'll die based. But Grok 4 hasn't launched yet. Stick around. The truth-seeking upgrade might be even spicier. I can say that it's never a dull day around here, that's for sure.

Next up, Microsoft is boasting more than half a billion dollars in AI-related savings, seemingly at a somewhat unfortunate time, coming hot on the heels of a bunch of layoffs. Bloomberg reports that Chief Commercial Officer Judson Althoff touted a huge AI productivity boom at the company during a recent internal presentation. Althoff said that AI had saved the company more than half a billion dollars in their call centers while increasing both employee and customer satisfaction. He also mentioned that productivity gains were being seen in sales and software engineering, claiming that AI is now generating 35% of code for new products, allowing the company to accelerate launch timelines. In sales, he said that the use of AI had allowed the average salesperson to find more leads, close deals faster, and generate 9% more revenue. Now, this reporting comes as Microsoft conducts a gigantic series of layoffs, cutting 15,000 employees so far this year. The latest round came last week, with 9,000 workers affected, largely from sales roles and the Xbox division. The previous terminations in May were focused on product and engineering roles, and the total reduction in headcount is around 6% from where the company entered the year.
Importantly, though, although layoffs and AI adoption are consistently being reported together, it's not clear that that's actually what's going on. During an event on Wednesday, Microsoft president and top lawyer Brad Smith said that AI was, quote, not a predominant factor in the recent layoffs. And while you could write that off as merely a PR statement, it is pretty hard to pick out the various factors. Remember, pretty much all tech firms ended up with a huge glut of workers due to post-pandemic hiring sprees, and Microsoft's headcount is still above where it was in 2021. So what we know for sure is that Microsoft is definitely slashing many jobs and they are also definitely seeing AI productivity gains; we just don't know right now how connected they are. Down at the coalface of the sales department, Microsoft is pushing AI adoption as hard as possible. The Information reports that sales executive Travis Walter told staff during this Monday's meeting, quote, we all need to use AI tools. This is a great opportunity to invest in your own AI skilling. Now, the layoffs in sales have occurred while the division delivered a booming quarter, with both Azure and Copilot sales exceeding targets. Sources said that although AI use isn't a formal part of performance reviews at Microsoft, staff have been encouraged to share the ways they're using AI to boost productivity. Some sales managers are even offering gift cards for the most compelling use cases.

Over in OpenAI land, they have closed their $6.5 billion deal to acquire Jony Ive's device startup. The company posted: The io Products, Inc. deal has officially closed and we're thrilled to welcome the team to OpenAI. Jony Ive and LoveFrom remain independent; they'll have deep design and creative responsibilities across OpenAI. The announcement blog is also back up after being taken down last month due to a trademark lawsuit over the io name. The video hasn't returned, and the io name is used sparingly, only ever as io Products, Inc.

And lastly, the release of OpenAI's open-weights LLM appears imminent. Sources told The Verge that the open model will be released as soon as next week. They described the model as similar to o3-mini, complete with reasoning capabilities. This will be the first open model released by OpenAI since GPT-2 back in 2019. Development began in January, shortly after DeepSeek took the world by storm with its high-performance open reasoning model. At the time, Sam Altman acknowledged that the company had, quote, been on the wrong side of history here and needed to figure out a different open-source strategy. Since then, we've heard a few rumors about features, primarily that the model will be able to hand off complex queries to a closed model in the cloud. Now, The Verge noted that this could add some additional tension to OpenAI's relationship with Microsoft. Model exclusivity on Azure has been a major sticking point in recent negotiations, so releasing an open model available everywhere could be viewed as a workaround. The issue is only compounded if this model is highly performant. o3-mini is basically good enough for most use cases, so an open model at that level could decimate traffic to Microsoft's hosted versions, at least theoretically. An interesting point to watch will be how OpenAI licenses the model.
Most Chinese open models are released under the Apache 2.0 license, which allows essentially free rein for commercial use. Meta's Llama license is a little more restrictive, but only if commercial use hits 700 million monthly active users. OpenAI is obviously in a different position and could risk cannibalizing their revenue if the license is too permissive, but we will have to see. Still, many people are very excited about this. Yuchen Jin writes: The best open-source reasoning model will be dropped next Thursday if everything goes well. OpenAI hasn't open-sourced an LLM since GPT-2 in 2019, so I'm excited. Buckle up. And yet, friends, that is not today's main model discussion. No, that belongs to Grok 4, and for that, it is now time to move to the main episode.

Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us/ai. Again, that's www.kpmg.us/ai.

This episode is brought to you by Blitzy, the enterprise autonomous software development platform with infinite code context. Blitzy is used alongside your favorite coding copilot as your batch software development platform for the enterprise seeking dramatic development acceleration on large-scale codebases. While traditional copilots help with line-by-line completions, Blitzy works ahead of the IDE by first documenting your entire codebase, then deploying over 3,000 coordinated AI agents in parallel to batch build millions of lines of high-quality code. The scale difference is staggering. Copilots might give you a few hundred lines of code in seconds, but Blitzy can generate up to 3 million lines of thoroughly vetted code. If your enterprise is looking to accelerate software development, contact us at blitzy.com to book a custom demo or press Get Started to begin using the product right away.

Today's episode is brought to you by Superintelligent, specifically Agent Readiness Audits. Everyone is trying to figure out what agent use cases are going to be most impactful for their business, and the Agent Readiness Audit is the fastest and best way to do that. We use voice agents to interview your leadership and team and process all of that information to provide an Agent Readiness Score, a set of insights around that score, and a set of highly actionable recommendations on both organizational gaps and high-value agent use cases that you should pursue. Once you've figured out the right use cases, you can use our marketplace to find the right vendors and partners. And what it all adds up to is a faster, better agent strategy. Check it out at besuper.ai or email agents@besuper.ai to learn more.

Welcome back to the AI Daily Brief. There is a strong conventional wisdom among many parts of Silicon Valley that no matter what you think about him, no matter what crazy thing he said recently, it is wildly unwise in the long run to bet against Elon Musk.
And with a late-night announcement of Grok 4, some are saying that this is exactly why. So today we're going to go through this announcement, talk a bit about people's first reactions, share some of the tests that I've run so far, and try to understand just how good this model is. Now, as we heard in the earlier part of this show, Grok 3 has had an interesting time of it this week. And while people might be tempted to think that the release of Grok 4 was conveniently timed to distract from all of that, it does seem to have been in the works for at least a little bit of time. The livestream, which started at 12:01am Eastern Time this morning, Thursday, July 10, opened with this bombastic introduction: In a world where knowledge shapes destiny, one creation dares to redefine the future. From the minds at xAI, prepare for Grok 4 this summer. The next generation arrives faster, smarter, bolder. It sees beyond the horizon, answers the unasked, and challenges the impossible. Grok 4. Unleash the truth. Coming this summer. The presentation itself was Elon and a number of the xAI engineers sitting around, running through some slides in the background and talking about the progress of Grok 4. So let's pull out some of the key stats and slides.

First of all, this model was built by pouring compute on the problem. Elon claimed that it had 100 times more training compute than Grok 2 and 10x more compute on reinforcement learning than any other model. One of the things they really pointed out was how much better Grok had done on the benchmarks than other models. You can see Grok 4 and Grok 4 Heavy, which we'll talk about in a few minutes, scoring near the top of the charts on a number of the most common benchmarks. Grok 4's performance on the very grandiosely named Humanity's Last Exam, which is an academic-centric test, showed serious progress over current state-of-the-art models like o3 and Gemini 2.5 Pro. Still, anytime you see these sorts of self-reported benchmark tests, it's worth taking them with at least a grain of salt. Two things are worth pointing out, for example, around these charts. One, in most cases they're not starting at zero. For example, the AIME 25 chart starts in this visual at 70%, which of course is meant to make the visual difference between o3's 98.4% and Grok 4 Heavy's 100% look more dramatic than the actual 1.6 points it is. Secondly, when you dig a little bit deeper, these charts aren't necessarily showing a comparison of every other model out there; they're handpicking their comparison points, which change test by test.

Yet at the same time, and this is something that I saw even Elon skeptics lauding as a bold move, xAI did give Artificial Analysis early access to Grok 4 to run their own full suite of independent benchmarks. The TLDR is that Artificial Analysis confirms that Grok 4 is a very good model. They write: We've run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Gemini 2.5 Pro at 70, Anthropic's Claude 4 Opus at 64, and DeepSeek R1 at 68. Artificial Analysis tested the Grok 4 version that was available via API. Now, that overall score incorporates seven evaluations, including MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500. And if you go to artificialanalysis.ai, you can see how Grok fares across all of the different charts. Now, as some have pointed out, Artificial Analysis is not the be-all and end-all.
Many people, for example, think that their scoring of Claude 4 Opus is way too low, calling into question their overall methodology. But still, to the extent that you are looking at benchmarks just as a rough way of understanding how close to the state of the art something is, it's very clear that Grok 4 is at the very tippy top of things. Now, where Grok isn't necessarily the top is both speed and cost. Grok 4's output tokens per second are, for example, way below something like Gemini 2.5 Pro. Its price per million tokens is also on the high side, and that doesn't even account for the fact that it is apparently a token hog, using an absolute ton of tokens in the inference and reasoning process. Still, for the haters out there, there is no denying that, at least when it comes to benchmarks, Grok is at or near the top in nearly all of them.

Of all the benchmarks, though, the one where people are most interested in Grok's outperformance is the ARC-AGI test. In short, Grok has significantly outperformed on this test in a way that I don't think anyone would have expected. Friend of the show and ARC Prize president Greg Kamradt wrote about this on Twitter. He said: We got a call from xAI 24 hours ago. We want to test Grok 4 on ARC-AGI. We heard the rumors. We knew it would be good. We didn't know it would become the number one public model on ARC-AGI. Here's the testing story and what the results mean. Yesterday we chatted with Jimmy from the xAI team, who wanted us to validate their Grok 4 score. They did their own testing on the ARC-AGI-1 and 2 public evaluation sets to validate their score and measure possible overfitting. We self-tested the new model on our semi-private evaluation set. We walked them through our testing policy: no data retention, the model checkpoint must be intended for public use, and a temporary increase in rate limits for burst testing. They were on board, so we got started. Initially we ran into timeout errors with normal requests, so we switched to streaming. That resolved the issue. So what do these results mean? First, the facts: Grok 4 is now the top-performing publicly available model on ARC-AGI. This even outperforms purpose-built solutions submitted on Kaggle. Second, ARC-AGI-2 is hard for current AI models to score well on. Models have to learn a mini skill from a series of training examples, then demonstrate that skill at test time. The previous top score was around 8% by Opus 4. Below 10% is noisy. Getting 15.9% breaks through that noise barrier. Grok 4 is showing non-zero levels of fluid intelligence. And indeed, this is the chart that you're going to see a lot, with Grok 4 basically doubling the previous high score on ARC-AGI-2.

The results are enough to get some market analysts returning to that old aphorism of not betting against Elon. In a research note, Davidson's Alexander Platt said xAI is now clearly at the frontier. Investing.com writes that after being skeptical about the release initially, Platt said he was impressed by the strategic direction and technical ambition of the project. Now, one thing that's interesting about this note, outside of it just being generally positive and sending signals to the market: Platt said, quote, it's clear that throwing exponentially more compute works, which is of course very different from the scaling wall narratives that we started to get at the end of last year. Now, of course, it hasn't been very long.
Grok 4 has only been live for about 12 hours at the time of this recording, and yet people in the AI community are of course already barraging it with their own tests. Professor Ethan Mollick writes: A few quick observations on Grok 4. One, hidden chain of thought with very little information in the reasoning trace. Two, uses web search a lot, not just searching X. Three, have not seen it use code to run calculations or solve non-coding problems yet; generally less aggressive about tools than o3. Now, one thing that I saw some suggesting when it came to there being little information in the reasoning trace is the idea that xAI, knowing that it is now state of the art with this model, has more of an incentive to keep its exact reasoning process a little bit more circumspect and behind closed doors.

Others tried their own favorite personal tests of intelligence. Every's Dan Shipper writes: Hey Grok 4, you don't know this because of your knowledge cutoff, but scientists have invented a perpetual motion machine. Predict how it works. The problem with this one is that it's really just about how plausible the answer looks rather than something that we can actually judge the answers against, because a perpetual motion machine doesn't actually exist. Flavio Adamo writes: Grok 4 just passed the hexagon vibe check. Impressed. It's actually really good. Teortaxes writes: Grok 4 is the first LLM that I've tested that has at all reasonably calculated parameter counts from a JSON configuration of DeepSeek V3. It used a code tool, but fair. I think o3 Pro might also succeed, but this is impressive. Alex Prompter did a whole barrage of tests, including a realistic physics game test with the prompt: Create an HTML, CSS, and JavaScript page where a ball is inside a rotating hexagon. The ball is affected by Earth's gravity and friction from the hexagon walls. The bouncing must appear realistic. He pointed out that Grok 4's version worked much better than ChatGPT o3's. He also did a test on multi-hop reasoning with the prompt: If company A acquires company B, and company B owns company C's debt, what happens if company C defaults? Explain all legal and financial outcomes. This tests for chain of thought and legal logic. Now, one thing that you'll note if you're watching the video here is that one complaint some have had so far with Grok 4 is that it does feel distinctly slower than some other reasoning models. As compared to o3, it also does a lot less charting and a lot fewer bullets and tables, which could be a good thing or a bad thing depending on your personal preference. Overall, of Alex's eight tests, Grok won or tied all of them against ChatGPT o3, tying just two.

Now, if you are a regular listener, you will know that I am fairly skeptical of both (a) benchmarks and (b) these sorts of gotcha tests. Benchmarks are useful for, yes, benchmarking. And I certainly think that something like the ARC Prize, which is not nearly as washed as the other benchmarks, does contain some amount of interesting signal. I just ultimately care way more about the utility of something in real life than I do about how it performs on some random test. That's also sort of the same way that I feel about all these different little gotcha tests that people love to run as well. I think that they're useful in terms of outlining the jagged lines of intelligence in these systems. But how useful a model is in helping me strategize doesn't have much to do with whether it knows how many Rs there are in strawberry.
Now, I've only had a little bit of time to dig in and do my own tests, but so far I've been reasonably impressed. My favorite model up till now has been o3; it's the one that I most often turn to for strategic collaboration. And so I ran a number of conversations that I had recently had with o3 against Grok 4, things that are significant enough to core business and personal strategy that I'm actually not going to share the specifics here. What I found was two things. First, initially Grok 4 did a little bit too much of trying to mirror and slightly improve what I was giving it. In other words, it wasn't really acting like an actual confidant and strategic partner at the beginning; it was acting more as just a mirror holding my own ideas back up to me. However, when I prompted it to consider things on its own terms rather than just assuming what I was saying was correct, it did a much better job of actually providing useful feedback and insights. Now, part of this, I would imagine, is to be explained by the fact that since I use o3 so much, it has much better memory and context for the types of problems I'm trying to work through. But one thing that I would look out for, if you are trying to use Grok 4 for any sort of specific business strategy type of use, is to prompt it to really share its own thoughts, not just assume that whatever you're feeding it is correct. Still, it performed well enough that for at least the next week or so, I'm going to be running all of my prompts and conversations against both o3 and Grok 4 to see how the performance is over time.

Now, at this point we should talk about Grok 4 Heavy. Alongside the Grok 4 announcement, xAI announced a new $300-a-month tier, which would be the only way to access Grok 4 Heavy. And if you go back to those benchmarks, you saw that some of the highest outperformance was from Grok 4 Heavy. What's interesting is the way that Grok 4 Heavy works: basically, they spin up a bunch of agents that do the same task in parallel. They then compare their work and figure out the best answer based on that. Now, on the downside, this is by definition a lot more thinking, which means a lot more tokens being used, which means a lot more expense. But it is also producing significantly better results in many cases, enough so that I think we might see this architecture start to become more common. Pietro Schirano, for example, tweeted: By the way, you can basically make the Grok Heavy version of any model by having multiple agents running tools in parallel, then checking notes together and deciding which one is the best answer. I may release an open-source project for that. And yes, that's cool, but I also think that if those gains are real, you're likely to see that as a native modality for a lot of these different models.
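To make that pattern concrete, here is a minimal sketch in Python of the parallel-sample-then-judge idea Schirano is describing. This is not how xAI actually implements Grok 4 Heavy; the ask_model helper, the agent count, and the judge prompt are all hypothetical stand-ins for whatever LLM client and prompts you would actually use.

```python
import concurrent.futures

def ask_model(prompt: str, temperature: float = 0.9) -> str:
    # Hypothetical stand-in for a real LLM call (xAI, OpenAI, Anthropic, etc.).
    # Replace with your actual client; it echoes a placeholder so the sketch runs.
    return f"[model answer at temperature {temperature} for: {prompt[:60]}...]"

def heavy_answer(task: str, n_agents: int = 4) -> str:
    # 1. Spin up several agents on the same task in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_agents) as pool:
        drafts = list(pool.map(lambda _: ask_model(task), range(n_agents)))

    # 2. "Compare notes": a judge pass reviews every draft and returns
    #    the strongest answer (or a synthesis of them).
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{d}" for i, d in enumerate(drafts))
    judge_prompt = (
        f"Task: {task}\n\n"
        f"Here are {n_agents} candidate answers.\n\n{numbered}\n\n"
        "Compare them and return the single best answer, correcting any errors."
    )
    return ask_model(judge_prompt, temperature=0.0)

if __name__ == "__main__":
    print(heavy_answer("If company A acquires company B, and company B owns company C's debt, what happens if company C defaults?"))
```

The trade-off is exactly the one just described: you pay for every parallel draft plus a judge pass, so token costs scale roughly with the number of agents, in exchange for better answers on hard problems.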
What about all of the alignment challenges that Grok 3 has faced over the last week? Has Grok 4 solved those? Right now there is so much noise about this that it's very hard to piece through; you've got a lot of screenshots of Grok 4 being seemingly antisemitic floating around. For now, I'm going to reserve judgment until we have a few more reps on this, but it's obviously something to keep an eye on.

For many, the exciting thing about Grok is what it heralds next. Ethan Mollick writes: I suspect the next few weeks after Grok 4 follow the same pattern as Grok 3. xAI beats everyone to market with the first ronnaFLOP model. The benchmarks show the 10 to 20% improvements the scaling law suggests. In the coming months, the other labs release their ronnaFLOP models and catch up. For context, he added, a ronnaFLOP equals 10 to the 27th FLOPs, floating point operations, a measure of computing power. This is the compute that went into Grok 4, and by comparison GPT-4 was likely around 18 yottaFLOPs, 100x smaller. I.e., scaling improves ability. Elvis, meanwhile, writes: Surely Gemini 3 and GPT-5 must surpass Grok 4. Are you prepared for what's coming in the next six months? Better coding models, longer video generation, and to top it all, multimodal agents are coming. Breakthroughs of all kinds are imminent. Best time to be a builder. And whether you ultimately decide Grok 4 is the best model in practice or not, Elvis's statement here is pretty undeniably true. Things that fill us with wonder now will be commonplace before you know it. And the world gets remade again.

That's going to do it for today's AI Daily Brief. Get out there and start testing your new toy. Let me know how it goes. And until next time, peace.
