
Today on the AI Daily Brief: why AI needs better benchmarks. And before that, in the headlines: is Apple planning on distilling Google's Gemini models? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Robots and Pencils, Blitzy, and Superintelligent. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai. And while you're at aidailybrief.ai, check out everything going on in the ecosystem, including the return of our newsletter, which has all the links that I mention in the show.

Apple's AI partnership with Google apparently goes much deeper than previously thought, including the ability to distill Gemini into smaller models. The unveiling of the new AI Siri is a little over two months away, and we're starting to get a steady drip of information around what the product will look like. On Tuesday, Bloomberg's Apple insider Mark Gurman ran through what he knows about features and UX. Apple has reportedly backed down on their view that Siri should remain voice-only, and is now building a standard chatbot interface with optional voice controls. Gurman also reported that Siri will be deeply integrated into iOS 27, allowing it to take actions and draw context from apps running on a user's device. It sounds as though Apple will try to launch Siri with full computer use, delivering the features they advertised with the launch of Apple Intelligence two years ago.

Now, we already knew that Siri would be driven by Google's Gemini models, but new reporting from The Information suggests that Apple has much more freedom in how they use Gemini than originally thought. Previous reports said that Apple would fine-tune a Gemini model for their purposes, and that the models would be hosted on Apple servers to ensure user privacy. However, sources speaking with The Information said that Apple has full access to the Gemini models, meaning they're able to distill large versions of Gemini into their own smaller proprietary models. Model distillation is the process of using the reasoning traces from one model to train another, essentially a cheat code to develop powerful models. Many of the Chinese labs have been accused of distilling models from Anthropic and OpenAI as a way to catch up quickly.

The Information's sources said that the process isn't straightforward, as Apple's vision for Siri is very different from the way Gemini works. Gemini is optimized for chatbots, enterprise tasks, and coding, and the source implied Apple is less interested in these functions, so was skeptical the models would actually be that much use to Apple's Foundation Models team. Maybe the main takeaway is that Apple hasn't entirely given up on training their own models and could use the Google partnership to bootstrap their approach. The most obvious target in most people's minds would be training small, capable models to run locally on an iPhone, which seems to be the core vision of where Apple wants to go with AI. Ethan Mollick sums up the wait-and-see attitude of most people on this news, tweeting, "Huh. I'm not sure distilling Gemini models to run on phones is going to result in the generally capable agents that people will expect soon, but we shall see."
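For a concrete sense of what distillation actually involves mechanically, here is a minimal sketch, assuming Hugging Face-style models. The model names, prompt, and training loop are placeholders for illustration; this is the general shape of the technique, not Apple's or Google's actual pipeline.

```python
# Minimal distillation sketch: a large "teacher" generates responses
# (including reasoning traces), and a small "student" is fine-tuned to
# imitate them. Model names are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_tok = AutoTokenizer.from_pretrained("big-lab/teacher-model")    # hypothetical
teacher = AutoModelForCausalLM.from_pretrained("big-lab/teacher-model")
student_tok = AutoTokenizer.from_pretrained("small-lab/student-model")  # hypothetical
student = AutoModelForCausalLM.from_pretrained("small-lab/student-model")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Explain step by step why the sky is blue."]

for prompt in prompts:
    # 1. The teacher produces a full response, reasoning trace included.
    inputs = teacher_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = teacher.generate(**inputs, max_new_tokens=256)
    trace = teacher_tok.decode(output_ids[0], skip_special_tokens=True)

    # 2. The student is trained to reproduce the teacher's output,
    #    using the standard causal language-modeling loss.
    batch = student_tok(trace, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"].clone()).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice, labs scale this to millions of teacher outputs, and sometimes train on the teacher's full probability distributions rather than just its text, but the shape of the technique is the same.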
Speaking of Google, the company has published a research paper describing a new compression algorithm that could dramatically improve the performance of small models. Called TurboQuant, the process allows researchers to quantize model context with almost zero loss. During long conversations or long-horizon tasks, context can bloat to use even more memory than the model weights. Functionally, quantization means context is stored with less fidelity; for example, 16-bit data might be compressed into 4-bit (there's a toy sketch of what that loss looks like a little further down). Current quantization methods are quite lossy and noticeably reduce performance. Some believe, for example, that this is the reason Anthropic's models can seem a little off during demand spikes. Google researchers say their new process massively reduces the loss associated with quantization and could make the technique far less of a trade-off. They claim their process results in a 6x reduction in the amount of memory a given model uses for storing context, while delivering an 8x speed boost compared to current methods. This could result in a 50% reduction in inference costs and help ease the bottleneck around memory chips. Giving a concrete demonstration of what the algorithm can do, Google's researchers tested it on Llama 3.1 8B and Mistral 7B. With TurboQuant implemented, both models achieved perfect scores on needle-in-a-haystack tests. Cloudflare CEO Matthew Prince tried to explain the gravity of this breakthrough, commenting, "This is Google's DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization." Others reached for a more relatable analogy, comparing this moment to when a scrappy startup cracked middle-out compression with a Weissman score of 5.2. So basically, TurboQuant is Pied Piper now.

Google isn't just shipping groundbreaking research; they also have a new music model with Lyria 3 Pro. The first version of Lyria 3 was, to some folks, underwhelming. It wasn't that the model was bad; it just couldn't produce production-quality music like Suno and was limited to 30 seconds, making it seem like it was more for novelty purposes than professional use cases. Lyria 3 Pro definitely addresses some of those issues. It can now create full tracks up to three minutes long and seems to have a much better understanding of lyrics and song structure. Rohan Paul writes, "The hard part in AI music is not making pleasant sound for 10 seconds. It's keeping a piece coherent as it moves from intro to verse to chorus without collapsing into a loop." Now, Rohan noted that Google is also pushing it in Vertex AI, AI Studio, and the Gemini app. So the bigger story is probably less about the model and more the fact that it is available via API, which could mean it finds its way into a lot more use cases.

Over in the world of AI politics, Senator Bernie Sanders has unveiled his data center moratorium bill with an assist from AOC. The legislation would pause all data center construction nationwide until strong national safeguards, their words, are in place. The bill requires Congress to establish protections for workers and consumers, address environmental harms, and defend civil rights before lifting the moratorium. Sanders said, "AI has received far too little serious discussion here in our nation's capital. I fear that Congress is totally unprepared for the magnitude of the changes that are already taking place." Now, the presence of AOC as a co-sponsor seems fairly relevant.
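Picking back up on that TurboQuant story for a second, here's a toy sketch of what plain, lossy 4-bit quantization looks like, just to make the concept concrete. To be clear, this is generic uniform quantization for illustration, not Google's TurboQuant algorithm, whose whole point is to avoid most of this loss.

```python
# Toy illustration of lossy quantization: squeezing 16-bit floats into
# 4-bit integers and back. This is plain uniform quantization, NOT the
# TurboQuant algorithm, which is designed to lose far less fidelity.
import numpy as np

rng = np.random.default_rng(0)
context = rng.standard_normal(4096).astype(np.float16)  # stand-in for cached context

# 4 bits gives only 16 levels; scale so values fit the int4 range [-8, 7].
scale = np.abs(context).max() / 7
quantized = np.clip(np.round(context / scale), -8, 7).astype(np.int8)
restored = quantized.astype(np.float16) * scale

# int8 storage shown here; true 4-bit packing halves the bytes again.
print(f"memory: {context.nbytes} bytes -> {quantized.nbytes // 2} bytes packed")
print(f"mean absolute error: {float(np.abs(context - restored).mean()):.4f}")
```

That error term is the fidelity loss the episode describes; TurboQuant's claim is that its method drives it to nearly zero.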
Back to the Sanders bill. Until now, Sanders has been pushing the moratorium largely by himself, with support from certain elements of the AI safety community; it hadn't found meaningful traction among elected progressives. AOC personally has been pretty much silent on the issue. Her X feed has zero mentions of data centers and only a single post about AI, regarding the dangers of deepfakes. By supporting the bill, AOC is declaring a position for the broader progressive movement and could, at least theoretically, carry that position into a presidential run in 2028.

Meanwhile, the bill seems to have very little support from mainstream Democrats. Senator Mark Warner, for example, said the idea was, in his words, "idiocy." He continued, "A data center moratorium simply means China is going to move quicker. The idea that we're going to stuff this back into the bottle, that's a ridiculous premise." Now, despite thinking the moratorium is the wrong solution, Warner certainly still has strong views on AI policy. He's currently supporting a bill to codify Anthropic's red lines around using AI for domestic surveillance and autonomous weaponry. Referring to Secretary of War Pete Hegseth, he added, "Those should be policy decisions, not left to a single individual." Warner also raised alarm about AI job replacement, commenting, "Recent college graduate unemployment is 9%. I'll bet anyone in the room it goes to 30 or 35% before 2028." He said he now believes the scope of the economic disruption is going to be exponentially larger than he thought just a few months ago.

Commenting on the Sanders-AOC policy, James Rosenbertsch writes, "I see why it's called populism now. Never liked the term. Every part of this is detrimentally performative. It's arbitrary AF. The ban on upgrades means no energy efficiency or sustainability improvements can happen. There is nothing progressive about it." On the other end of the spectrum, New York Times tech reporter Mike Isaac writes, "People can certainly take issue with his positions and plan of action, but Sanders seems to be one of the few members of Congress seriously reckoning with what the labor consequences of the coming AI age could be. Now joined by AOC." One of the things that I'm not sure of is the extent to which a moratorium is (a) something that Bernie and AOC actually think is good policy, (b) something they think is good politics, given increasing American antipathy towards data centers in their communities, or (c) a way to anchor the conversation on the far end of one extreme so there's more room to find compromise in the middle. One can certainly hope it's the third, but right now it's not at all clear.

Now, speaking of the China boogeyman: in our final story, Manus's co-founders have been banned from leaving China as the CCP cracks down. The Financial Times reports that Manus's CEO and chief scientist have both been barred from leaving the country while Meta's $2 billion acquisition is reviewed. We heard rumblings of this earlier in the month as Manus and Meta executives were summoned to Beijing for a meeting with regulators. The theory of the case is that Manus circumvented China's export controls on tech by relocating their headquarters from Beijing to Singapore. CEO Xiao Hong and chief scientist Yichao Ji reportedly attended the meeting and were told after its conclusion that they would be unable to leave China but were free to travel within the country.
Sources said that no formal investigation has been opened and no charges have been brought, but Manus is said to be seeking legal representation to help resolve the issue. The entire situation is messy because it deals with the intersection of actual laws and the unspoken rules that govern doing business in China. China has strict laws to control foreign investment and the export of technology. However, both Manus and Meta maintain their transaction was in full compliance. The relocation of the headquarters is an obvious gray area, made even more gray by the fact that Manus still maintains an offshore entity which was used to develop early versions of the product. As for the unspoken rules, Chinese officials have become increasingly concerned about losing AI talent and technology to the West. They've even adopted the euphemism of "selling young crops" to describe the poaching of human capital in strategic industries. Sources suggested that the extreme outcome would be a forced unwind of the Meta deal, but noted that that would be messy because the technology is already being integrated into Meta's platforms.

What's more, this isn't the only sign that Beijing is tightening its grip on its domestic AI industry. AI researcher Tao Hu shared that the China Computer Federation has warned researchers not to participate in the NeurIPS conference. Chinese entrepreneur Alina Hua argued that this was all to be expected, writing, "They thought they were being clever for circumventing China's tech export controls. But you don't mess with the CCP like that. You will be made an example of so others don't get tempted to betray the motherland. So what's going to happen? China won't jail them because they don't want to look evil. Instead they're going to freeze the founders' assets in China and give them a travel ban while the quote-unquote probe is ongoing. The probe will likely be deliberately prolonged to inflict psychological damage, create uncertainty for potential copycats, and make the public forget about this case. And once the topic is out of the public's mind, CCP gonna strike hard with a financial penalty that wipes out most of their gains and then soft-blacklist them in China." Bill Bishop, the author of Sinocism, writes, "I didn't think the Manus top execs would be so naive as to go back to the PRC. Expect they will have to spit back out a lot of what they made." On the flip side, some Western observers thought the crackdown will probably backfire. Former White House advisor Dean Ball commented, "If we were smart, we'd see this as a major self-own by China. As natsec-brained public policy so often is, the message the government is sending is: if you ever want to found a company, especially one that makes money on software, move to Singapore first. Easier to get GPUs too."

Never a dull day in AI, but for now that does it for the headlines. Next up, the main episode.

All right folks, quick pause. Here's the uncomfortable truth: if your enterprise AI strategy is "we bought some tools," you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise: how work gets done, how teams collaborate, how decisions move. Not as a tech initiative, but as a total operating model shift. And here's the real unlock: that shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce.
If you want to understand what that actually looks like in the real world, go to www.kpmg.us/AI. That's www.kpmg.us/AI.

Today's episode is brought to you by Robots and Pencils, a company that is growing fast. Their work as a high-growth AWS and Databricks partner means that they're looking for elite talent ready to create real impact at velocity. Their teams are made up of AI-native engineers, strategists, and designers who love solving hard problems and pushing how AI shows up in real products. They move quickly using Roboworks, their agentic acceleration platform, so teams can deliver meaningful outcomes in weeks, not months. They don't build big teams; they build high-impact, nimble ones. The people there are wicked smart, with patents, published research, and work that's helped shape entire categories. They work in velocity pods and studios that stay focused and move with intent. If you're ready for career-defining work with peers who challenge you and have your back, Robots and Pencils is the place. Explore open roles at robotsandpencils.com/careers. That's robotsandpencils.com/careers.

Want to accelerate enterprise software development velocity by 5x? You need Blitzy, the only autonomous software development platform built for enterprise codebases. Your engineers define the project: a new feature, refactor, or greenfield build. Blitzy agents first ingest and map your entire codebase. Then the platform generates a bespoke agent action plan for your team to review and approve. Once approved, Blitzy gets to work autonomously, generating hundreds of thousands of lines of validated, end-to-end tested code, with more than 80% of the work completed in a single run. Blitzy is not just generating code; it's developing software at the speed of compute. Your engineers review, refine, and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint, accelerating engineering velocity by 5x. Experience Blitzy firsthand at blitzy.com. That's blitzy.com.

It is a truth universally acknowledged that if your enterprise AI strategy is trying to buy the right AI tools, you don't have an enterprise AI strategy. Turns out that AI adoption is complex. It involves not only use cases, but systems integration, data foundations, outcome tracking, people and skills, and governance. My company Superintelligent provides voice-agent-driven assessments that map your organizational maturity against industry benchmarks across all of these dimensions. If you want to find out more about how that works, go to besuper.ai, and when you fill out the Get Started form, mention maturity maps. Again, that's besuper.ai.

Welcome back to the AI Daily Brief. Today we are looking at the launch of ARC-AGI-3, a new benchmark from ARC Prize that is specifically designed to test the interactive reasoning capability of AI agents. It's the latest in a sequence of benchmarks that are meant to deal with some of the problems of benchmarks. But to better understand what they're trying to respond to, it's worth going back and actually understanding what the purpose of benchmarks is, what the problems with them are, and how people have tried to address those problems.

Benchmarks are effectively two things: a way to compare AIs' performance in various areas, and a way to see how models are progressing over time. Historically, there have been two major categories of benchmarks that you see included with every new model release: benchmarks around knowledge and benchmarks around function.
Knowledge was the first big hill to climb, with benchmarks like MMLU for general knowledge and GPQA, which measures scientific knowledge. Over time, more difficult benchmarks were introduced, like Humanity's Last Exam, which features obscure knowledge not typically found in the training data. As models developed, however, function became more important. SWE-bench is one of the best known of the functional benchmarks, testing the knowledge required to solve typical coding problems from GitHub. As agentic coding has risen in importance in the AI space, Terminal-Bench has arguably overtaken SWE-bench as the most important coding benchmark. Terminal-Bench tests not only coding reasoning but also the model's ability to use a terminal. And interestingly, a lot of benchmarks have followed this pattern, starting off as a test of knowledge and then implicitly or explicitly also adding an element of testing functional capacity. Humanity's Last Exam, for example, began as a pure test of pretrained knowledge, but now it's typically measured with web search tools enabled, making it a proxy for competency in tool use as well.

Now, very early on in the modern post-ChatGPT era of AI, benchmark saturation became a problem. All the way back in May of 2024, with the release of GPT-4o, all major models were already above 80% on MMLU, with GPT-4o scoring 88.7%. At the time, some other benchmarks were a little bit less saturated; GPT-4o was a big breakout on GPQA, for example, scoring 53.6%. But of course, with all of these benchmarks, it was only a matter of time. By last summer, the saturation problem had gotten much worse. At the time, o3 was OpenAI's daily driver. More difficult questions had been added to GPQA Diamond, and o3 still achieved 83.3% without using tools. By that stage, most of the 2024 benchmarks had been abandoned or updated because of saturation. For example, the MATH benchmark was long gone, replaced by AIME, which uses questions from a real-world math competition. o3 would score 88.9% on AIME, foreshadowing that a specifically trained OpenAI model would achieve a gold medal performance at the International Math Olympiad a few months later.

Fast forward to today, and once again many of these benchmarks are getting saturated. GPT-5.4 is now up to 52.1% on Humanity's Last Exam with tools and 39.8% without, which is very close to Opus 4.6's 53% and 40% respectively. SWE-bench was once again upgraded, with GPT-5.4 scoring 57.7% on SWE-bench Pro. For Opus 4.6, Anthropic reported 81.4% on SWE-bench Verified but chose to highlight Terminal-Bench 2.0 more prominently, where they scored 65.4%. Now, it's difficult to keep track of all these numbers, but this chart shows the example of how performance on SWE-bench Verified progressed over the past year. Models from Anthropic, Google, OpenAI, and MiniMax (who produced the chart) are basically all up and to the right. They each began at different points in the middle of 2025, ranging from 55 to 70%, but they've all now arrived up near 80%.

Benchmark saturation, then, means that benchmarks no longer show particularly meaningful progress between each model generation. They also don't show meaningful differences between the models. And making this problem worse is the issue of benchmark maxing. Benchmark maxing refers to when a lab trains the model specifically to beat the benchmark, even if it has little relevance in the real world.
This happens because the benchmarks are either completely known or semi-public, meaning model labs can train specifically for the test in order to have more impressive numbers when they come out. One common perception and critique of Chinese labs is benchmark maxing in the extreme, which frequently leaves their models with a huge gap between their benchmark scores and real-world performance. In February, a variant coding benchmark called SWE-rebench was released containing a different set of problems, and most of the Chinese models dropped in the rankings, suggesting they were specifically trained against the narrow set of SWE-bench Verified problems. The Western models did drop as well, but not by nearly as much. Another example was Meta. With the release of Llama 4 Maverick last April, Meta was accused of testing multiple model variants on LMArena, which is a crowdsourced taste-test platform for LLM performance. Platform users are presented with two samples and vote for the best one. Meta was accused of having tested models until they found the one that clicked most with users, and launched as the second-ranked model on LMArena. You will recall that when people got their hands on Llama 4, they did not, in almost any case, think it was the second-best model available.

Between benchmark maxing and benchmark saturation, the net effect is the diminished significance of benchmarks as a tool for people to understand which models are good at what, and for what. Now, on top of all of that, there is just an inherent problem with traditional benchmarks. Most of these benchmarks tend to be narrowly focused on solving one particular type of task. Some are about recalling knowledge and some are about more complex skills, but they are focused on doing one thing within a very narrowly defined set. We've talked in some episodes this week about the idea of task AGI: that at this point, AI is really good at a huge array of knowledge work tasks, but where it struggles is in bringing tasks together. And in that light, it would be reasonable, I think, to say that while benchmarks might be good at demonstrating task AGI, they're not particularly useful in helping understand how AI does outside of that very narrow task. Math is a particularly good example of this, with last year's models basically solving the very narrow field of competition mathematics, as demonstrated in the IMO gold medal performances from OpenAI and Google. That is, of course, a completely different skill set than real-world mathematics. Today, to the extent the practical reality of deploying AI is understanding and dealing with its jagged frontiers, most traditional benchmarks just aren't all that helpful.

Now, everything I'm discussing today are known, long-standing problems, and there have been a ton of attempts to fix benchmarks over the years. One of the brute-force methods is simply making the questions harder. We've seen this with SWE-bench and GPQA, which remained relevant deep into 2025 by simply changing the difficulty level. This gave the benchmarks at least a little more life and kept them relevant for hill-climbing performance, but it didn't really address the core underlying problem. There are also benchmarks that were switched out for more practical tests. A key example here is the transition from SWE-bench to Terminal-Bench as the major coding benchmark. Terminal-Bench was intended to be a closer match to the way people actually use the models.
It put models in a standard harness and tested their ability to use a terminal and other tools to solve coding problems. On some level it was an improvement, but it is still dealing with saturation issues, and it also adds more complex variables, particularly early on. For example, good coding models would fail tasks because they couldn't execute the tool calls properly.

Another approach has been trying to simulate real-world tasks. An early version of this idea was the SWE-Lancer benchmark developed by OpenAI last February. It tested coding ability against real-world tasks from Upwork that paid an aggregate of a million dollars, which allowed OpenAI to express their model's coding ability in dollar terms. The spiritual successor was GDPval, released by OpenAI last September. It extended the real-world problem set beyond coding to encompass various types of white-collar work, like making spreadsheets and slide decks. One of the interesting quirks of GDPval was that it required the agent to build and deliver a polished work product. It quickly became clear that models were failing tasks not always because they couldn't do them, but because the tool calls were failing. Now, GDPval also has other challenges. For example, OpenAI went out and actually worked with experienced professionals to do a combination of AI and human review. Other evaluators like Artificial Analysis have gone and modified GDPval to be a strictly automated, AI-only version, and it remains one of the benchmarks that I think people are most interested in relative to all the others.

Now, another major approach was looking at continuous agent performance, with METR's long-task benchmark being the most well known. This is that chart that, as the bubble talk was increasing during a lot of last year, we joked was effectively holding up the entire global market. The way that this test works involves giving models a set of coding problems that human coders could complete in a set interval of time, ranging from a few minutes to several hours. The resulting chart has become one of the clearest demonstrations of model improvement in the space. In two years, we went from agents that could only complete tasks that take humans 5 minutes, in the case of GPT-4o, to agents that can complete tasks that take humans 10 hours, in the case of Opus 4.6. Now, the big problem with METR's test, and one that they've fully admitted, is that they're running out of tasks to test against. Their original task set included very few tasks that take more than a few hours. Now that agents can complete complex tasks that take 10 hours, METR is struggling to find a useful test set. Realistically, tasks that take human developers 10 hours aren't really tasks anymore; they're full-on software builds that introduce far more complexity into the test. In other words, METR can't really extend their benchmark without turning it into something fundamentally different, meaning that even this test is effectively saturated.

Which brings us to ARC-AGI. It began as the ARC Prize in the summer of 2024, based on former Google computer scientist Francois Chollet's approach to measuring machine intelligence. Introducing the prize, ARC wrote at the time: "AGI progress has stalled. New ideas are needed. Modern LLMs have shown to be great memorization engines. They are able to memorize high-dimensional patterns in their training data and apply those patterns into adjacent contexts. This is also how their apparent reasoning capability works. LLMs are not actually reasoning.
Instead, they memorize reasoning patterns and apply those reasoning patterns into adjacent contexts, but they cannot generate new reasoning based on novel situations. More training data lets you buy performance on memorization-based benchmarks, but memorization alone is not general intelligence. General intelligence is the ability to efficiently acquire new skills."

ARC Prize's answer to this is a test that contains a series of abstract visual logic puzzles. The tasks are presented as a series of colored squares on a grid, with squares added or removed according to a particular pattern. Two examples are given to teach the pattern, and then the task is to apply that pattern to a problem square. For example, the problem might require a yellow square to be placed next to a line of blue squares in various orientations. These are problems that are relatively easy for humans to solve but prove to be difficult for LLMs. The tasks were also kept hidden so the logic couldn't be trained into the models. Instead, the test was trying to measure an LLM's ability to learn new logic within context and apply it to a novel problem. Basically, it set out to be a pure test of reasoning ability rather than memorization of how to reason (there's a toy sketch of this task format in code a little further down).

Early results were pretty compelling that this was a solid approach. At the time that ARC-AGI-1 was released, no models had come within 50% of human performance. Subsequent releases improved on this score, but the models seemed to be making genuine progress through reasoning. Then, in December of 2024, OpenAI dropped a bombshell: a preview version of their o3 model had achieved a 76% score on low inference settings, exceeding the human score for the first time. On high settings, the score was 88%. The o3 model had been trained on the public dataset but tested on a private dataset to achieve this score, so there was no risk the logic was trained into the model. ARC wrote at the time, "This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never before seen in the GPT-family models."

At the same time, ARC announced that they would be updating their benchmark for 2025 with ARC-AGI-2. The new benchmark looked superficially similar to the first. It contained the same colored squares and was once again designed to be easy for humans and harder for LLMs. The key change was made to counteract the innovation that allowed the o-series models to outperform, which is test-time compute. It kind of seems quaint now, but at the time, the idea of making a model reason for longer was a paradigm-shifting innovation. With o3, OpenAI had extended test-time compute enough to maintain context between problems and learn iteratively throughout the test. In order to pressure-test this approach, ARC added a new twist to the problems. Rather than simply adding a square according to the pattern, there were now three new styles of tests. The first was symbolic interpretation, where the LLM was tasked with interpreting more meaning within the symbols; for example, tasks where shapes needed to be colored differently according to how many holes they have. A second new set of tasks required applying multiple rules within the same problem set, which they called compositional reasoning. And a final new set of tasks added context to the problem, where logic was no longer universally applied but depended on context; for example, shapes with a red border needed to be shifted to the right, while shapes with a blue border needed to be shifted to the left.
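To picture what one of these tasks looks like as data, here's a hypothetical toy example in the spirit of ARC's format: grids of color codes, a couple of demonstration pairs that teach a rule, and a test input to transform. The grids and the rule here are invented for illustration; the real evaluation tasks are hidden and considerably harder.

```python
# A toy task in the spirit of ARC-AGI's format. Grids are arrays of color
# codes (0 = black, 1 = blue, 4 = yellow); two demonstration pairs teach a
# rule, and the solver must apply it to a new input. Invented example,
# not an actual (hidden) ARC task.
import numpy as np

# Rule to be inferred: place a yellow square (4) at the end of each line of blue (1).
train_pairs = [
    (np.array([[1, 1, 0]]), np.array([[1, 1, 4]])),
    (np.array([[1, 1, 1, 0]]), np.array([[1, 1, 1, 4]])),
]
test_input = np.array([[1, 0, 0]])

def solve(grid: np.ndarray) -> np.ndarray:
    # Hard-coded solver standing in for the in-context rule induction an
    # LLM would have to do from the demonstration pairs alone.
    out = grid.copy()
    for row in out:
        blues = np.flatnonzero(row == 1)
        if blues.size and blues[-1] + 1 < row.size:
            row[blues[-1] + 1] = 4  # yellow square right after the blue line
    return out

print(solve(test_input))  # [[1 4 0]]
```

The catch, of course, is that the rule is different for every task, so a solver can't be hard-coded in advance; the whole test is whether the model can induce it fresh from a couple of examples.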
Again, all of these problems remained fairly simple for humans, but the additional complexities were designed to overload LLM context and test pure reasoning ability. The test held up well for most of 2025, with most model releases scoring below 30%. At the very end of the year, and as this year got underway, things escalated dramatically. Gemini 3.1 Pro scored 77.1% at $0.96 per task in February. In March, Opus 4.6 achieved a 68.8% score, GPT-5.4 Pro achieved 83.3%, and Gemini 3 DeepThink is the current leader at 84.6% and $13.62 per task. Basically, once again, as the benchmark got saturated, we needed something new.

Which gets us to ARC-AGI-3. In an X post introducing the test on Wednesday, ARC writes, "Announcing ARC-AGI-3, the only unsaturated agentic general intelligence benchmark in the world. Humans score 100%; AI, less than 1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know. ARC-AGI-3 tests how they learn." Now, the test is a complete rethink of the ARC-AGI formula. The static grids of colored squares are gone. In their place, ARC has designed a series of 135 simple graphical games that require the LLM to manipulate the grid in real time. They have no instructions, so the model needs to explore the environment, figure out how it works, execute a plan, and adapt on the fly to what it sees. In their early testing, ARC observed models failing by mistaking one game for another, carrying over theories between games, and failing to forecast cause and effect. ARC wrote, "ARC-AGI-3 gives us a formal measure to compare human and AI skill acquisition efficiency. Humans don't brute force. They build mental models, test ideas, and refine quickly. How close is AI to that? Spoiler: not close." And unlike ARC-AGI-2, we are starting at ground zero. None of the frontier models can complete this test with any level of competency, each scoring less than 1%. Google DeepMind's Xiaoma shared one of Gemini's playbacks, which are all publicly available in the replay section of the ARC website. She wrote, "Poor Gemini straight up thought it was playing Activision tennis."

Now, not everyone is a fan of how this is set up. Lisan al Gaib, a.k.a. Scaling01, writes that the scoring of ARC-AGI-3 doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans, using squared efficiency: if a human took 10 steps to solve it and the model 100 steps, then the model gets a score of 1% (the math is sketched below). The implication, they write, is that scores are not comparable to the first two ARC tests. On the other end of the spectrum is AI researcher Brandon Hancock, who commented on the elegance of the benchmark. He writes, "An alien species with zero knowledge of human language could ace ARC-AGI-3 on day one. And I think that's beautiful. At a time when AI is dominated by language models, it's refreshing to have a frontier benchmark, the only one that I'm aware of, that requires zero language ability or cultural knowledge to solve. Intelligence does not mean speaks English or speaks Python. I'm reminded of classic first-encounter sci-fi storylines where intelligent species are able to communicate well before they hash out a common spoken or written language, simply based on universal math, science, and reasoning concepts. AI has gotten complex enough that it behaves much more like an alien species than a next-token predictor."
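As a quick gut check on that scoring complaint, here's the squared-efficiency math as Scaling01 described it. Note that this follows their description of the formula, not necessarily ARC Prize's exact published scoring, which may differ in detail.

```python
# Squared step-efficiency as described in the critique: if a human solves
# a level in 10 steps and a model needs 100, the model scores
# (10 / 100)^2 = 1%. An approximation of the described scoring, not
# necessarily ARC Prize's exact formula.
def efficiency_score(human_steps: int, model_steps: int) -> float:
    return min(1.0, (human_steps / model_steps) ** 2)

print(f"{efficiency_score(10, 100):.0%}")  # 1%  -- brute force is punished hard
print(f"{efficiency_score(10, 20):.0%}")   # 25% -- even 2x the steps costs a lot
```

The squaring is what makes the penalty so steep, and it's why a model that completes every level inefficiently can still end up scoring under 1%.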
At this point, Francois Chollet, one of the creators of ARC-AGI, warned that this won't be the one benchmark to rule them all, commenting, "Keep in mind, ARC-AGI is not a final exam that you pass to claim AGI. The benchmarks target the residual gap between what's hard for AI and what's easy for humans. It's meant to be a tool to measure AGI progress and to drive researchers towards the most important open problems on the way to AGI. So it's a moving target designed to track the frontier. As AI evolves, the benchmark evolves to spotlight the exact problems we haven't solved yet."

And I think maybe that's the big takeaway: the way to, quote unquote, solve benchmark saturation is probably as simple as not assuming that benchmarks are going to last all that long. Just as we need innovation in the way that we build these models, we're going to need innovation in the way that we measure them. It'll be interesting to see how fast we get models that actually jump from 1% to some meaningful percentage on ARC-AGI-3. But of course, before long we'll need some other new thing to measure, some other new capability.

For now, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always. And until next time, peace.
Host: Nathaniel Whittemore (NLW)
Date: March 26, 2026
Episode Theme:
This episode dives into the persistent challenges and evolution of AI benchmarks, exploring why current benchmarks are often inadequate for meaningfully tracking AI progress, and highlights ARC-AGI-3, a new benchmark that aims to address these gaps. Alongside, the episode provides an update on recent AI news in big tech, politics, and global AI policy.
NLW explores the perennial race between AI benchmark design and model evolution. As language models rapidly saturate existing benchmarks, those benchmarks lose their ability to distinguish between models’ capabilities and, by extension, to track the frontier of AI progress. The episode contextualizes why “benchmark saturation” and “benchmark maxing” are corrosive to honest evaluation, then reviews historical benchmarks, their limitations, and current attempts to overcome these via harder tasks and more dynamic, agentic challenges. The launch of ARC-AGI-3, with its “humans score 100%, AI less than 1%” headline, is presented as the next step in meaningful measurement.
Purpose of Benchmarks
Benchmark Saturation
Benchmark Maxing
Making Benchmarks Harder
Transition to Real-World Tasks
Evolving from synthetic benchmarks to those closer to real-world work (e.g., SWE-Lancer for code, GDPval for various white-collar tasks).
On GDPval: “It quickly became clear that models were failing tasks not always because they couldn’t do them, but because the tool calls were failing.” (42:21)
Agent performance benchmarks (e.g., METR’s long-task benchmark): demonstrate models completing increasingly complex long-range tasks, but these are also reaching their limits.
Motivation Behind the ARC-AGI Series
History: ARC-AGI-1 & 2
What’s New in ARC-AGI-3
Notable Community Reactions:
Brandon Hancock (AI researcher, 54:45):
Francois Chollet (ARC-AGI creator, 56:11):
On the State of Benchmarking (36:30):
“Benchmark saturation, then, means that benchmarks no longer show particularly meaningful progress between each model generation. They also don’t show meaningful differences between the models.” —NLW
On Benchmark Maxing (37:32):
“Benchmark maxing refers to when a lab trains the model specifically to beat the benchmark, even if it has little relevance in the real world.”
Brandon Hancock, on ARC-AGI-3 (54:45):
“An alien species with zero knowledge of human language could ace ARC-AGI-3 on day one. And I think that’s beautiful.”
Francois Chollet, on Benchmark Evolution (56:11):
“Keep in mind, ARC-AGI is not a final exam that you pass to claim AGI...As AI evolves, the benchmark evolves to spotlight the exact problems we haven’t solved yet.”
NLW combines accessible, well-structured explanations with a conversational tone and references to current events, community reactions, and industry analogies (“TurboQuant is Pied Piper now”). He emphasizes the ongoing, iterative nature of benchmark design and the tension between genuine progress and artificially inflated leaderboard scores.
While each new benchmark inevitably becomes “solved” by future models, progress lies in continuously updating evaluation methods to stay ahead of simple memorization, training, and optimization tactics. ARC-AGI-3 is emblematic of this frontier, shifting the focus from known-task proficiency to genuine reasoning, adaptation, and skill acquisition, and offering, for now, a clear view of just how far leading AIs still have to go.