Summary (7 min read)

Podcast Summary: The AI Daily Brief – "GPT-5 is 58% AGI"

Podcast: The AI Daily Brief: Artificial Intelligence News and Analysis
Host: Nathaniel Whittemore (NLW)
Episode: GPT-5 is 58% AGI
Date: October 21, 2025

Episode Overview

In this episode, Nathaniel Whittemore explores a provocative new framework for measuring artificial general intelligence (AGI), which finds that GPT-5 is "58% of the way there." NLW dissects the meaning and importance of AGI definitions, why they matter in industry and finance, and how the new scoring method shifts the conversation from philosophical debate to measurable progress. The episode also covers rapid developments in AI coding tools, AI-powered medical assistants, generative music startups, and real-world AI deployments at companies like Starbucks.

Key Discussion Points and Insights

1. AI Headlines and Industry Updates

(00:45 – 13:00)

Anthropic Claude Code Expansion (01:10)
- Anthropic’s Claude Code tool, previously accessible only via terminals and IDEs, is now on the web and in their iOS app.
- Implication: Developers can "spin up background agents...run multiple tasks in parallel," making coding workflows more efficient.
- Quote: Cat Wu, Product Manager: "As we look forward, one of our key focuses is making sure the command line interface product is the most intelligent and customizable... But we're continuing to put Claude Code everywhere." (02:10)
Replit’s Explosive Growth (03:15)
- CEO Amjad Masad reports $240M ARR, expected to quadruple next year, driven by adoption in mid-sized companies.
- Business Model: Free users drive corporate adoption, with margins close to 80% on the enterprise side.
- Quote: Masad: "Replit is kind of replacing a lot of the no code, low code tools which never really work very well... get initial productivity boosts, but... ended up actually slowing down a lot of companies." (04:10)
Meta's AI App Traction (05:30)
- Stand-alone AI app climbs to 300,000 downloads/day and 2.7 million DAUs, possibly bolstered by the Vibes AI-generated feed and as an alternative to OpenAI’s invite-only Sora.
OpenEvidence Fundraising (07:05)
- Medical AI assistant for doctors raises $200M at $6B valuation.
- Product Detail: Free for professionals, monetized by ads; now supports 15M monthly clinical consultations.
- Quote: Daniel Nadler, Co-founder: "No one else in the world has that data." (08:10)
- Quote: Sangeen Zeb, Google Ventures: "It's reaching verb-like status." (08:35)
- Insight: Real-world usage data is becoming a unique competitive "moat" for AI companies.
Suno's Music Gen Raise & Legal Update (09:40)
- Startup is reportedly in talks to raise $100M at a $2B valuation; music labels might settle legal disputes and even take equity.
Starbucks' AI Adoption (11:25)
- CEO Brian Niccol details the "Green Dot" in-store knowledge assistant and early pilots in inventory and scheduling.
- Quote: Niccol: "We're still in the early days of this, but I believe there is definitely opportunity... to get things done faster and more efficiently." (11:55)
- Memorable Moment: Swats away fears of robot baristas: "We're not near that right now." (12:30)

2. Redefining Artificial General Intelligence (AGI)

(13:15 – 37:45)

Why AGI Definitions Matter

Practical Impact: NLW argues AGI definitions are "useless" for daily enterprise AI use, but become critical as progress toward AGI shapes market and investment decisions. (13:25)

Current & Historical Definitions

Andrej Karpathy: Sets a high bar – AGI can do any economically valuable task at human or better performance, not just knowledge work. (14:55)
- Quote: "AGI was a system...that could do any economically valuable task at human performance or better." (15:05)
OpenAI: Evolving definitions—from "AI systems that are generally smarter than humans" (2023), to the "five levels of AI" framework, ranging from chatbots up to organizational-level performance.
Other Entities:
- Gartner: AGI as "intelligence of a machine that can accomplish any intellectual task a human can perform."
- Google: AGI as the hypothetical ability to "understand or learn any intellectual task a human can."
- Amazon: Software able to "perform tasks not necessarily trained or developed for."
ARC AGI Prize: Focuses on generalization and skill acquisition, not just task performance. "AGI is a system that can efficiently acquire new skills outside of its training data." (17:20)

The Impact of Definitions

AGI definitions, once just "nebulous," now influence "how markets should treat AI stocks." (15:50)
Vague definitions have triggered contract disputes, e.g., Microsoft's deal with OpenAI.

3. The New Quantifiable AGI Framework

(20:40 – 37:45)

Center for AI Safety’s Paper: "Definition of AGI" (20:40)
- Proposes a measurable framework, benchmarking a model against the cognitive abilities of a well-educated adult.
- Grounded in Psychological Theory: Cattell-Horn-Carroll model of cognition.
- 10 Cognitive Categories: Reading/writing, math, reasoning, working memory, memory storage, memory retrieval, visual, auditory, speech, knowledge.
Scoring Results
- GPT-4: 27%
- GPT-5: 58%
  - Major gains in reading/writing, math.
  - New competence in reasoning, working memory, memory retrieval, visual, and auditory.
  - Still major deficiencies, especially in long-term memory storage.
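As a rough illustration of the scoring arithmetic described in the episode (ten equally weighted categories, each scored out of 10, combined into a percentage), here is a minimal Python sketch. The function name `agi_score` and the placeholder inputs are assumptions made for this example; they are not taken from the paper, and the per-category values shown are not the paper's reported results.

```python
# The ten category names as listed in the episode.
CATEGORIES = [
    "reading and writing", "math", "reasoning", "working memory",
    "memory storage", "memory retrieval", "visual", "auditory",
    "speech", "knowledge",
]

MAX_PER_CATEGORY = 10  # each category is scored out of 10


def agi_score(scores: dict[str, float]) -> float:
    """Combine equally weighted category scores into a percentage (0-100)."""
    missing = set(CATEGORIES) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in CATEGORIES:
        if not 0 <= scores[name] <= MAX_PER_CATEGORY:
            raise ValueError(f"score for {name!r} must be between 0 and 10")
    total = sum(scores[name] for name in CATEGORIES)
    return total / (MAX_PER_CATEGORY * len(CATEGORIES)) * 100


# Hypothetical placeholder profile (NOT real results): 5 out of 10 everywhere.
placeholder = {name: 5 for name in CATEGORIES}
print(agi_score(placeholder))
```

Because every category carries the same weight, a model that is already near the ceiling in one area (math, say) cannot raise its total much further there; remaining gains have to come from the weak categories.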
Analysis of Current Shortfalls
- Quote (Dan Hendrycks, Center for AI Safety): "People who are bullish about AGI timelines rightly point to rapid advancements like math. The skeptics are correct to point out...AIs have many basic cognitive flaws...There are many barriers to AGI, but they each seem tractable." (26:30)
- Quote (Lewis Gleason, content creator): "For the first time, we have a framework that turns AGI from a buzzword into a measurable spectrum." (28:05)
- Quote (Rohan Paul, on memory): "[Today's systems] fake memory by stuffing huge context windows and fake precise recall by leaning on retrieval from external tools, which hides real gaps in storing new facts...Both GPT-4 and GPT-5 fail to form lasting memories across sessions and still mix in wrong facts when retrieving, which limits dependable learning and personalization over days or weeks." (33:00)
- Memory as Bottleneck: No AI yet matches humans in storing/retrieving persistent information over time; even state-of-the-art models "forget" between sessions.
Utility and Limitations
- Framework is functional, not economic—high cognitive scoring does not guarantee business value.
- Some companies, like OpenAI and Microsoft, tie AGI definitions to financial performance (e.g., $100B in profits) for contractual clarity.
Economic vs. Cognitive AGI
- Quote (Elon Musk): "AGI is...capable of doing anything a human with a computer can do, but not smarter than all humans and computers combined. ... probably three to five years away." (Elon on X, paraphrased at 36:05)
- Even without AGI, current models are already "having and will have a profound impact on the economy exactly as they are right now." (37:05)

Notable Quotes & Memorable Moments

“All of a sudden progress toward AGI is going to be considered a meaningful factor when it comes to how markets should treat AI stocks.” – NLW (13:45)
"Each category was equally weighted and given a score out of 10...GPT-4 scored 27% while GPT-5 achieved 58%." – NLW, summarizing the new paper (24:40)
"For the first time, we have a framework that turns AGI from a buzzword into a measurable spectrum." – Lewis Gleason (28:05)
"The biggest hole by a mile is around memory." – NLW (31:10)
"You hear when people critique...models don't have memory and they can't learn in the way that humans do." – NLW (32:55)
Elon Musk: “AGI is...capable of doing anything a human with a computer can do, but not smarter than all humans and computers combined...probably three to five years away.” (36:05)
“An incredibly powerful model, whether it's AGI or not, could have a profound impact on the economy.” – NLW (37:10)

Important Segment Timestamps

Anthropic Claude Code News: 01:10
Replit Growth & Business Model: 03:15
Meta AI App Surge: 05:30
OpenEvidence Funding & Strategy: 07:05
Suno Funding & Music Label Negotiations: 09:40
Starbucks AI Deployment: 11:25
AGI Definitions & Context: 13:15–20:40
Center for AI Safety AGI Framework: 20:40–37:45

Tone and Closing Thoughts

Whittemore adopts an analytical, occasionally skeptical tone, repeatedly noting that "AGI" is more a matter for market sentiment and philosophical debate than daily enterprise impact. But with the introduction of a quantitative framework, the conversation is poised to shift from nebulous speculation to something that "turns AGI from a buzzword into a measurable spectrum." He cautions, however, that business impact and market value may not map cleanly onto cognitive test scores, and memory remains the critical obstacle to true AGI as defined in this new research.

"Ultimately I think this is a extremely useful contribution to the field. I hope that more people dig in and if nothing else, it creates a useful heuristic for the future when inevitably we rage and scream and kick with every new model release about how some big wall has been hit." – NLW (37:28)

Summary Takeaway

This episode brings clarity to a long-murky debate, spotlighting a new, rigorous approach to measuring AGI. While GPT-5 may be 58% 'there' by cognitive benchmarks, massive gaps in persistent memory and "continual learning" remain. For now, AI’s march toward AGI is no longer a matter of pure guesswork; the field finally has a scorecard—even if the economic implications are still up for debate.


Transcript

[00:01]
A
Today on the AI Daily Brief, a new definition of AGI that suggests that GPT-5 is 58% of the way there. Before that, in the headlines, Claude Code comes to the web. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Superintelligent, KPMG, and Robots and Pencils. To get an ad-free version of the show, go to patreon.com/aidailybrief or you can sign up on Apple Podcasts. One note about Apple Podcasts, because this has been coming up a bit. Apple Podcasts is set up on a very different system to Patreon. With Patreon, I upload a distinct file. I can schedule it the same way as I can with my normal episodes, and so it's always set to come out at the same time as the main ad version. With Apple Podcasts it's a little bit different. I have to wait for the episode to post to Apple Podcasts, meaning that there's a short delay after I publish in general, and then I have to go in and manually replace the file. What this means is that if for any reason I can't immediately replace that file, you will still see the normal ad version on your feed. It only gets replaced when I add that manual file. This is normally fine, it just means about 15 minutes of waiting around after I press publish on the normal episode. But that lag creates more possibilities for problems. For example, yesterday Apple's Podcast Connect system was down for about 12 hours, which meant that pretty much overnight, even subscribers on Apple still saw only the ad version. I promise you that I will always try to get the ad-free version up on Apple as fast as I can, but sometimes it's going to be out of my control. I apologize. I really wish it was a better system, but that is just the way that it is. Lastly, of course, for any information about the show, sponsorship, speaking, job opportunities, go to aidailybrief.ai. And with that all out of the way, let's dive in.
Welcome back to the AI Daily Brief Headlines edition. All the daily AI news you need in around five minutes. First up today, Anthropic is making Claude Code available through a web app and within the Claude iOS app. Previously the feature was only available through terminals and IDEs, and the big unlock is being able to spin up background agents with Claude Code running in the cloud. You can now run multiple tasks in parallel across different repositories from a single interface and ship faster with automatic PR creation and clear change summaries. This asynchronous workflow is quickly becoming a powerful tool for AI-enhanced coders. Claude Code product manager Cat Wu said, as we look forward, one of our key focuses is making sure the command line interface product is the most intelligent and customizable way for you to use coding agents. But we're continuing to put Claude Code everywhere, helping it meet developers where they are. Web and mobile is a big step in this direction. Certainly there is a lot of excitement about this. Josh JDJ Kelly on Twitter wrote, I can work with Claude Code while out on a walk. Speaking of agentic coding, Replit is projecting massive growth to reach $1 billion in revenue by the end of next year. Speaking with Business Insider, CEO Amjad Masad said that the AI coding startup has reached 240 million in ARR and expects that to quadruple next year. The company's growth this year has been absolutely skyrocketing, gaining more than 10x from their 16 million in ARR at the end of 2024. The company now has over 150,000 paying customers and over 40 million free users. And while at this stage all those free users mean that the consumer segment is unprofitable, Masad boasted that enterprise margins are close to 80%. This follows the same profit model that other AI companies are currently pursuing.
The consumer segment is a loss leader due to large volumes of free users, but building familiarity with consumers means they demand access to the same tools at work. Masad said that the surging revenue was largely due to adoption in mid-sized companies including Duolingo and Zillow. He said Replit is kind of replacing a lot of the no code, low code tools which never really work very well. They get initial productivity boosts, but a lot of times that ended up actually slowing down a lot of companies. Whatever the case, they are seeing enough growth that they are pushing forward their expectations. This article came about after Business Insider saw a leaked investor memo that gave the billion dollar projection for 2027. Speaking of growth, after a spike in downloads this month, could Meta's AI app actually be gaining traction? According to Similarweb data, Meta's standalone AI app now has over 300,000 downloads per day, up from around 100,000 in mid-September. In addition, the app now has 2.7 million daily active users, up from 775,000 last month. Similarweb said they hadn't seen any meaningful correlation with either search or advertising volume. However, they noted that Meta could be promoting the platform on Facebook or Instagram, which aren't included in Similarweb's data. The other possible explanation is that Meta's new Vibes feed has been more of a success than people gave it credit for. The AI-generated image and video feed was released September 25th and decried by many as the introduction of infinite feeds of AI slop. However, the spike in downloads and daily active users both do line up with the introduction of that feed. OpenAI's launch of the Sora app a week later could also be boosting Meta's platform as an alternative. Sora still requires an invite code, while Meta's platform is freely available now.
Obviously, these numbers in aggregate are still quite low relative to the billions of users that mainstream social apps have, but the growth is notable nonetheless. Next up, some fundraising news. OpenEvidence, the AI assistant for doctors, has raised $200 million at a $6 billion valuation. This is the second large fundraising round for the company this year. They raised 210 million at a 3.5 billion valuation back in July, and with the level of growth they've displayed recently, it's not hard to see why the valuation has almost doubled. OpenEvidence now supports around 15 million clinical consultations a month, up from 8.5 million in July. The product is free to use for registered medical professionals and monetized through advertising rather than subscription. That unconventional approach for a professional tool has allowed OpenEvidence to expand into 10,000 medical centers. OpenEvidence only began commercializing their app three months ago and is already halfway to their target of 100 million in advertising revenue for next year. The assistant is trained on leading medical journals like the New England Journal of Medicine and is designed to help doctors quickly access the literature for diagnosis and treatment options. The system is also designed to reject low-confidence outputs, reducing hallucination risk. Alongside medical journals, the model is also being fine-tuned on the 100 million clinical consultations assisted by the tool. Co-founder Daniel Nadler said that this is one of the company's largest moats, adding, no one else in the world has that data. Speaking to adoption among doctors, Sangeen Zeb of Google Ventures, the lead investor in the round, said it's reaching verb-like status. Now this data type of moat, where companies in verticals have access to actual real-world data based on the usage of their tool, is one of the most interesting themes and questions so far in the history of LLMs.
We've seen that the bitter lesson applies, in other words, that mass access to data beats out specialized data when it comes to pre-training. However, where a lot of people are looking in the future is that the data that's left that the foundation model labs don't have is the data exhaust that comes from real-world usage, and that could in and of itself be extremely valuable. That's certainly the argument that OpenEvidence is making, and we'll have to see how it plays out. Staying on fundraising, music gen startup Suno is said to be in talks to raise 100 million at a $2 billion valuation. Sources speaking with Bloomberg said the deal would quadruple the company's valuation since their last raise. That last round closed in May of last year and brought in 125 million, although the valuation was not disclosed at the time. Importantly, the startup is now generating 100 million in ARR, according to sources familiar with the numbers. And what's more, Suno may be able to settle their legal disputes very shortly. In June of last year, Universal and Warner Music filed a lawsuit for copyright infringement against Suno and competitor Udio. But this June, Bloomberg reported that the labels are in talks to settle the litigation and establish a licensing framework for generated music. The labels are also rumored to be looking to take an equity stake in both of those companies, reinforcing the idea of a truce between the music industry and AI startups. Last week Spotify announced plans to work with the record labels on AI-powered features. Universal Music Group CEO Lucian Grainge is boosting a pro-AI message internally. Last week he sent a memo to staff re-emphasizing his interest in partnering on AI products as long as they respect artists' copyrights and likenesses. Now, for anyone who has watched the history of the record labels all the way going back to Napster, this should be no surprise at all.
There is no industry, frankly, more adept at figuring out how to monetize the new thing. Lastly, today, the latest company to make some big AI pronouncement is Starbucks. Starbucks CEO Brian Niccol said that they're all in on AI. Appearing on a Yahoo Finance podcast recorded at the Dreamforce conference last week, Niccol discussed a wide range of AI deployments at the company. A major scaled use case is an in-store knowledge assistant referred to as the Green Dot. It helps store leaders manage daily operations including troubleshooting equipment and providing drink recipes. Niccol also said that Starbucks has pilots for inventory, supply chain forecasting, and scheduling, although none of those use cases are at scale. Speaking to ROI, he commented, we're still in the early days of this, but I believe there is definitely opportunity here to help us get things done faster and more efficiently. To what scale, that is to be determined. We're definitely already seeing a big impact in our technology area. The ability to get code done so much faster is real. One thing he did reject is the idea of robot baristas anytime soon, commenting, we're not near that right now. Some folks tried to dig into the specifics about what that would mean, while others just let it be vibes. Sophie etc writes okay, yeah, whatever. Eff it. Starbucks AI. And that is going to do it for today's headlines. Next up, the main episode. Today's episode is brought to you by my company, Superintelligent. Look guys, buying or building agents without a plan is how you end up in pilot purgatory. Superintelligent is the agent planning platform that saves you from stalling out on AI. We interview teams at scale, translate real work into prioritized agent opportunities, and deliver recommendations that you can execute on: what to build, what success looks like, how fast you'll get results, and even what platforms and tools you should consider, all customized for you.
Instead of shopping for hype, you get to deploy with confidence. Visit besuper.ai and book your AI planning demo today. AI isn't a one-off project. It's a partnership that has to evolve as the technology does. Robots and Pencils work side by side with clients to bring practical AI into every phase: automation, personalization, decision support and optimization. They prove what works through applied experimentation and build systems that amplify human potential. As an AWS Certified Partner with global delivery centers, Robots and Pencils combines reach with high-touch service. Where others hand off, they stay engaged, because partnership isn't a project plan, it's a commitment. As AI advances, so will their solutions. That's long-term value. Progress starts with the right partner. Start with Robots and Pencils at robotsandpencils.com/aidailybrief. What if AI wasn't just a buzzword but a business imperative? On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, this seven-part series delivers real-world insights from leaders who are scaling AI with purpose. From aligning culture and leadership to building trust, data readiness and deploying AI agents. Whether you're a C-suite executive, strategist or innovator, this podcast is your front row seat to the future of enterprise AI. So go check it out at www.kpmg.us/aipodcasts or search You Can with AI on Spotify, Apple Podcasts or wherever you get your podcasts. Welcome back to the AI Daily Brief. One of the things that I have said frequently on this show, including being effectively the entire theme of yesterday's show, is that when it comes to the practical, lived, applied experience of AI inside a work setting, I don't think that AGI matters.
In fact, I think it is one of the more useless terms when it comes to how you think about applying AI in your daily life or your company thinks about applying it at work. So why do definitions of AGI matter then? And the short answer is, it's the exact same reason that we had that entire conversation in the show yesterday, which is that all of a sudden progress towards AGI is going to be considered a meaningful factor when it comes to how markets should treat AI stocks. Given how much AI stocks are at the core of the entire economy right now, these otherwise nebulous definitions start to take on a greater importance. Now, of course, for those who haven't listened to yesterday's episode, AGI timelines are back in the news this week, specifically because OpenAI co-founder Andrej Karpathy said that he believes the technology is still a decade away, as opposed to estimates that have it more in a year or now. One critical point that came out of that conversation is that Andrej actually has an extremely high bar for how he defines AGI. He said, when people talk about AI in the original AGI and how we spoke about it when OpenAI started, AGI was a system that you could go to that could do any economically valuable task at human performance or better. That was the definition. He noted that since then the definition has been watered down to just covering knowledge work, certainly nothing like physical work. Now knowledge work is certainly a huge part of the global economy, but at 10 to 20% of all the work in the world, at least as per his estimates, that leaves a lot off the table. Now this is far from the only definition floating around. Way back in February of 2023, OpenAI laid out their framework for thinking about the approach of AGI. They gave a very basic definition: AI systems that are generally smarter than humans. Since then, Sam Altman has updated his thoughts.
He acknowledged in February of this year that AGI is a, quote, weakly defined term, but generally speaking we mean it to be a system that can tackle increasingly complex problems at human level in many fields. You might also hear Altman talking about AGI in reference to the five levels of AI framework. Now this built off of something that Google DeepMind scientists had introduced in a November 2023 paper. But then OpenAI expanded into these five stages. Level one was chatbots, AI with conversational language. Level two was reasoners, with human-level problem solving. Level three was agents, systems that can take actions. Level four was innovators, AI that can aid in invention. Level five was organizations, or AI that can do the work of an organization. As we've discussed a lot on this show, we are somewhere in the 3 to 4 range right now. Beyond that, there are a range of other definitions you might come across. Stalwart old Gartner defines AGI as, quote, the intelligence of a machine that can accomplish any intellectual task that a human can perform. Google leans into a different aspect, describing AGI as hypothetical intelligence of a machine that possesses the ability to understand or learn any intellectual task that a human being can. Amazon has another distinct focus, describing AGI as software that is, quote, able to perform tasks that it is not necessarily trained or developed for. Now, if these are one-off definitions for blog posts, one of the more prominent attempts to define and test AGI capabilities is of course the ARC AGI Prize. On their website they write: the consensus definition of AGI, a system that can automate the majority of economically valuable work, while a useful goal, is an incorrect measure of intelligence. Measuring task-specific skills is not a good proxy for intelligence. Skill is heavily influenced by prior knowledge and experience.
Unlimited priors and unlimited training data allow developers to buy levels of skill for a system. This masks the system's own generalization power. Intelligence lies in broader general-purpose abilities. It is marked by skill acquisition and generalization rather than skill itself. So they propose a better definition: AGI is a system that can efficiently acquire new skills outside of its training data. The ARC AGI test then seeks to test two elements of AGI contained in the definition: the ability to acquire new skills, by ensuring the tests have internal logic that can be learned, and the ability to complete tasks outside of training data, by ensuring the tasks are not generally available. So these are all the things that are floating around, and you can see that while they broadly get us in the right category, there are a lot of different definitions, which lead to a lot of debates and a lot of AGI-is-in-the-eye-of-the-beholder kind of conversations. As I said, I don't think that really matters for our day to day, but it does matter when it comes to whether giant funds are going to press the sell button because they think things are overbought because we're not making enough progress towards AGI, which means all these contracts aren't going to play out the way that they want them to. So this is the context into which a group of researchers working with the Center for AI Safety have attempted to nail down a common definition and a metric for assessing models as they progress. The group has produced a paper called Definition of AGI, which you can find at agidefinition.ai. In the abstract, they write, the lack of a concrete definition for artificial general intelligence obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult.
This group then has grounded their analysis in Cattell-Horn-Carroll theory, one of the more well-accepted models of human cognition. Applying the theory, the researchers split AI performance into 10 distinct categories: reading and writing, math, reasoning, working memory, memory storage, memory retrieval, visual, auditory, speech and knowledge. Now you'll note that these categories cover some of the general performance categories, things like reading and writing or math. But it also addresses a model's ability to learn and apply its intelligence to topics outside of its training data. Each of these categories has multiple subcategories that can be assessed individually. In fact, assessment was one of the main focuses of this paper. The researchers wrote, applications of this framework reveal a highly jagged cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. Each category was equally weighted and given a score out of 10, and researchers measured GPT-4 and GPT-5 to demonstrate the framework. GPT-4 scored 27% while GPT-5 achieved a 58%. You can see from the two sets of results mapped out on a chart that while GPT-5 only made minor progress in knowledge, it made significantly more progress in reading and writing as well as math. What's more, GPT-5 scored in multiple categories where GPT-4 was entirely deficient. This included reasoning, working memory, memory retrieval, visual and auditory. And while those areas of intelligence are developing in the latest models, they're still very nascent compared to, for example, math. Dan Hendrycks, the director of the Center for AI Safety, commented, people who are bullish about AGI timelines rightly point to rapid advancements like math. The skeptics are correct to point out that AIs have many basic cognitive flaws.
Hallucinations, limited inductive reasoning, limited world models, no continual learning. There are many barriers to AGI, but they each seem tractable. It seems like AGI won't arrive in a year, but it could easily arrive this decade. Content creator Lewis Gleason wrote, what's powerful here is that this framework lets us track AGI like a scorecard. For the first time, we have a framework that turns AGI from a buzzword into a measurable spectrum. Instead of arguing, are we close to AGI, we can now ask how much cognitive ground remains before parity. Now, one of the interesting things about this framework is its focus on what's missing rather than highlighting a model's frontier abilities. Over the summer, for example, GPT-5 and Gemini 2.5 Pro achieved gold medal performances in the International Mathematical Olympiad and the International Collegiate Programming Contest. The leading models then are already at a human level, a very advanced human level, when it comes to math or coding. Importantly, though, while achieving that level was a huge milestone on the path to AGI, based on this center's approach to an AGI definition, further progress in those areas isn't going to make a big difference. In contrast, audio and visual understanding is still very nascent and needs to improve dramatically before AI models could be considered anywhere close to AGI. Of course, those areas are arguably on the way. Google has made incredible strides with their multimodal models over the past year, and visual understanding seems to be developing quickly. The Veo 3 set of models and Sora 2 are also able to add appropriate audio to generated videos, implying strong auditory understanding. The big area that is so clearly missing, the biggest hole by a mile, is around memory. The paper in fact describes this as perhaps the most significant bottleneck. Now, of course, this is a huge area of focus for the labs.
Anthropic recently introduced their Skills feature, which offers a more efficient way of storing and accessing memory. But we've yet to see a model that can intelligently store and retrieve information at anywhere close to a human level. In fact, when you hear people critique how far the hype has, in their estimation, gotten ahead of the actual capabilities of models, it tends to come around to this part of cognition: models don't have memory, and they can't learn in the way that humans do.
Commenting on the paper's exploration of memory, Rohan Paul noted, "They show that today's systems often fake memory by stuffing huge context windows and fake precise recall by leaning on retrieval from external tools, which hides real gaps in storing new facts and recalling them without hallucinations. They emphasize that both GPT-4 and GPT-5 fail to form lasting memories across sessions and still mix in wrong facts when retrieving, which limits dependable learning and personalization over days or weeks." Anyone who has thought that they had locked in core knowledge and context about themselves with an LLM, only to have it feed back a response that has none of that understanding built in, will understand what a big problem this actually is.
Now, what's valuable about this paper is, as Gleason put it, having a framework where there's an actual trackable numeric score that people can assess progress on. For example, suppose all market actors accepted this framework, which of course won't happen, and then GPT-6 came out. Instead of the inevitable endless debates about whether we had hit a wall again, theoretically you could just look and see how much it had improved from GPT-4's 27% and GPT-5's 58%.
And yet, at the same time, there is one highly problematic shortfall that could be very important. Again, as Rohan Paul put it, "The scope is cognitive ability, not motor control or economic output. So a high score does not guarantee business value." In fact, increasingly other AGI definitions have fallen back on economic value as the most important proxy for intelligence. Sometimes that's because more complex notions like continuous learning or performing tasks outside of the training set are too difficult to define.
One prominent example came from OpenAI's contract dispute with Microsoft. Their agreement originally had Microsoft losing access to OpenAI's technology once AGI was achieved. The problem was, of course, that the definition of AGI from OpenAI was pretty vague. It defined AGI as, quote, "highly autonomous systems that outperform humans at most economically valuable work." The OpenAI board also had sole discretion to declare that AGI had been achieved. This was viewed as an unfalsifiable claim that could cost Microsoft tens of billions of dollars. The two companies ultimately settled on changing the definition of AGI to use a financial measurement as a proxy. They decided that AGI would be deemed to have been achieved when OpenAI developed software that could generate $100 billion in profits.
Earlier this week, during the controversy around the Andrej Karpathy interview, Elon Musk revealed that he has a similar definition. He posted on X that AGI is, quote, "capable of doing anything a human with a computer can do, but not smarter than all humans and computers combined." He said it's probably three to five years away. He also put forward his belief that Grok 5 has a 10% chance of meeting this definition, and that the odds are rising.
Now, I think there are, of course, merits to both economic and functional definitions of AGI. The functional definition laid out in the new paper establishes the areas where current models are lacking and the new capabilities they will need to achieve AGI. In some ways it functions almost like a checklist.
So we're all clear that incredibly intelligent models that forget everything at the end of the context window aren't really AGI. But at the same time, an incredibly powerful model, like Elon Musk is predicting Grok 5 will be, whether it's AGI or not, could have a profound impact on the economy. And in fact, as I've said numerous times, I think that these models are having and will have a profound impact on the economy exactly as they are right now. Ultimately, I think this is an extremely useful contribution to the field. I hope that more people dig in, and if nothing else, it creates a useful heuristic for the future when inevitably we rage and scream and kick with every new model release about how some big wall has been hit.
For now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching as always, and until next time, peace.
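
As a rough illustration of the scoring arithmetic described in the episode (ten equally weighted categories, each scored out of 10, summed into a percentage), here is a minimal Python sketch. The function names, the category identifiers, and the example profile are assumptions made for illustration. The per-category numbers are invented placeholders, not the paper's results for GPT-4 or GPT-5.

```python
# Ten equally weighted categories, written here as short identifiers (assumed naming).
CATEGORIES = (
    "knowledge", "reading_writing", "math", "reasoning", "working_memory",
    "memory_storage", "memory_retrieval", "visual", "auditory", "speed",
)


def agi_score(category_scores):
    """Sum ten 0-10 category scores into a 0-100 total, read as a percentage."""
    if set(category_scores) != set(CATEGORIES):
        raise ValueError("expected exactly one score for each of the ten categories")
    for name, value in category_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {value}")
    # Equal weighting: each category contributes at most 10 points out of 100.
    return sum(category_scores[name] for name in CATEGORIES)


def largest_gaps(category_scores, n=3):
    """Return the n lowest-scoring categories, i.e. the 'checklist' items left."""
    return sorted(CATEGORIES, key=lambda name: category_scores[name])[:n]


# Hypothetical profile with made-up numbers, for illustration only.
hypothetical_profile = {
    "knowledge": 9, "reading_writing": 9, "math": 9, "reasoning": 6,
    "working_memory": 4, "memory_storage": 0, "memory_retrieval": 3,
    "visual": 3, "auditory": 4, "speed": 3,
}

print(f"Overall score: {agi_score(hypothetical_profile)}%")
print("Largest gaps:", largest_gaps(hypothetical_profile))
```

Because every category is capped at 10 points, a model that is already strong in math or coding gains little from further progress there, which mirrors the point made in the episode: the total only moves meaningfully when weak categories such as memory improve.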