Transcript
A (0:00)
This podcast is sponsored by Google. Hey folks, I'm Amar, product and design lead at Google DeepMind. Have you ever wanted to build an app for yourself or your friends, or finally launch that side project you've been dreaming about? Now you can bring any idea to life, no coding background required, with Gemini 3 in Google AI Studio. It's called vibe coding, and we're making it dead simple. Just describe your app and Gemini will wire up the right models for you so you can focus on your creative vision. Head to AI Studio to build and create your first app today.
B (0:28)
Today on the AI Daily Brief: GPT-5.2 is here, and OpenAI wants you to know it is for professionals. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors: Gemini, KPMG, Blitzy, Rovo, and Robots & Pencils. To get an ad-free version of the show, go to patreon.com/aidailybrief, and if you are interested in sponsoring the show, lock in those 2025 rates by emailing us at sponsors@aidailybrief.ai. Welcome back to the AI Daily Brief. This is actually the second AI Daily Brief I've recorded today, because in the early afternoon we got GPT-5.2, which means the episode I recorded earlier in the day will become tomorrow's episode; obviously we have to talk about this new model. Now, there have been indications for the past week that GPT-5.2 was on the way. This is, of course, part and parcel of OpenAI's declared code red. In the lead-up to the release of Gemini 3, OpenAI's Sam Altman had sent a memo to his team, basically expecting there to be some rough vibes, in his words, as Google released their best-ever model. Then on top of that, we got Opus 4.5, which has just continued to impress and, frankly, if anything, grow in people's esteem. And yet the chatter over the last week has been that OpenAI's forthcoming response model, codenamed Garlic, was likely to be a capable response. Well, today we got the model, and at least at first glance, it's a banger. In the benchmarks they shared, it represented a significant improvement on the coding benchmark SWE-bench Pro, hitting 55.6% compared to Opus 4.5's 52%. It scored 52.9% on the ARC-AGI-2 examination, ahead of Opus 4.5's 37.6%. And on GDPval, which is OpenAI's internal measure of economically valuable knowledge-work tasks, it scored a massive 70.9%, up from 38.8% with GPT-5.
GDPval is in some ways the most relevant of the benchmarks, at least in terms of what the goal of GPT-5.2 seems to be for OpenAI. More so, frankly, than with any model release I've seen from them, there is a clear, clear messaging directive: this is a real-world business model to help professionals get more value. In a briefing with reporters, OpenAI's CEO of Applications, Fidji Simo, said that 5.2 was about unlocking even more economic value for people. In her announcement tweet, she reiterated this: GPT-5.2 is here, and it's the best model out there for everyday professional work. Greg Brockman writes, 5.2 is here, the most advanced frontier model for professional work and long-running agents. It's a big step forward on enterprise tasks, including spreadsheets and slides. Head of ChatGPT Nick Turley writes, Today we're introducing GPT-5.2, our most advanced model series for professional work. GPT-5.2 Thinking is designed to help with real, economically valuable tasks, the kind of work professionals do every day: building spreadsheets and presentations, writing and reviewing production code, analyzing long documents, coordinating tools, and executing complex projects from start to finish. And indeed, you can tell that of all the benchmarks, the one they really care about is that GDPval measure of success with professional knowledge-work tasks. Simo again writes, On GDPval, the Thinking model beats or ties human experts on 70.9% of common professional tasks like spreadsheets, presentations, and document creation. Noam Brown wrote, In my opinion, GDPval is the most important result from our 5.2 launch. We outperform domain experts and are state of the art among all models on GDPval, which measures performance on self-contained tasks like making spreadsheets and PowerPoint presentations. Truly, you have never seen a company as excited about spreadsheets and PowerPoints as OpenAI is with the launch of 5.2.
All of this was the theme of the announcement post as well. Right at the top, OpenAI harkens back to the ChatGPT Enterprise survey that we discussed earlier in the week, quoting the figure that enterprise users were saving between 40 and 60 minutes a day. OpenAI writes, We designed 5.2 to unlock even more economic value for people. It's better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long contexts, using tools, and handling complex multi-step projects. And honestly, when you see the difference between 5.1 Thinking and 5.2 Thinking on some of these economic tasks, the difference could not be more stark. The examples they give include a workforce planning model covering headcount, hiring plan, attrition, and budget impact; the spreadsheet is massively improved, at least in their cherry-picked example. They also give an example of two different cap tables. While the visuals are pretty similar, they note that 5.1 incorrectly calculated seed, Series A, and Series B liquidation preferences and left the majority of those rows blank, which led to an incorrect final equity payout calculation. 5.2 got all those calculations correct. They also gave an example of project management, where 5.2 Thinking produced a really professional-looking Gantt chart to help describe and summarize progress over the course of a month. Now, these broad-based, economically valuable tasks are, like I said, the thing that OpenAI chose to put right at the top of this blog post, notably even ahead of coding. Yet, as I said with that 55.6% on SWE-bench Pro, there are definitely coding improvements here as well. And once again, they connected this to professional users. For everyday professional use, they write, this translates into a model that can more reliably debug production code, implement feature requests, refactor large codebases, and ship fixes end to end with less manual intervention.
They also note that it's better at front end, giving examples of an ocean wave simulation, a holiday card builder, and a typing rain game where you have to type the words before they hit the bottom of the screen. A couple of things to call out that were a little bit farther down in the announcement post but were still really interesting. The first is that 5.2 seems to do really well with long context. On a needle-in-a-haystack test where the performance of 5.1 degraded from about 90% at 8k context to less than 50% at 256k context, 5.2 Thinking barely nudged down from its 8k score to something that appears to be above 90% at 256k context. Going back to professional use, this matters, I think, because a lot of the next generation of value is going to be unlocked by being able to handle lots and lots of enterprise context all at once. Another important change: they found that GPT-5.2 had roughly 30 to 40% fewer hallucinations. Again, when you're thinking about professional business users, one of the great enemies of reliance on AI is hallucination, so a meaningful decrease means a big difference for professional users. But so far we've just talked about what OpenAI said about their own model. What about some of the folks who had early access? Medical professor Derya Unutmaz writes, I had early access to GPT-5.2 and tested mostly the Pro version. Let me just say this: relative to 5.1 Pro, it has stronger abstraction; clearer, more realistic, balanced, and strategic responses; and shows deeper conceptual insights and vibes. And I would say this represents one theme that I saw in a lot of these initial early responses: that yeah, this is just a good model that is a meaningful improvement. Ethan Mollick writes, Had early access to 5.2. It's an impressive model.
He asked it to build him a graph of Humanity's Last Exam scores over time, which, as he points out, involved looking up and cross-referencing a lot of material and then generating something useful in one shot, which it did. When Box began testing 5.2 with their reasoning tests, CEO Aaron Levie wrote, We asked the model to perform a series of enterprise tasks that approximate real-world knowledge work that we see in industries ranging from financial services to healthcare and life sciences. These tasks require a high degree of analytical capabilities, math, reasoning, and more. Aaron noted that on this expanded task set, with broader and harder tasks than before, 5.2 scored seven points better than 5.1 and performed the majority of the tasks far faster than previous models. The coding first impressions are likewise pretty good. LMArena's Peter Gostev writes, I've spent a lot of time testing this model on the arena, and it's an excellent bump from the 5 and 5.1 versions for coding and a big challenger to Gemini 3 Pro and Opus 4.5. Pietro Schirano writes, 5.2 is a serious leap forward in complex reasoning, math, coding, and simulations. It built a full 3D graphics engine in a single file, interactive controls, 4K export, one shot. The pace of progress is unreal. He also argued that it's, quote, the best agentic model OpenAI has shipped; it runs tons of tools in a row without issues and is faster than its predecessor. 5.2 calls tools with no preamble and doesn't get lost in long sessions. Flavio Adamo wrote a short post called What Actually Changed? and found that the model was noticeably better at creating presentations and generating spreadsheets, producing cleaner tables. He also found a significant improvement in visual design and front end. Overall, he writes, 5.2 isn't a revolution, but the upgrades are hard to miss. It's more accurate, more consistent, and a lot more dependable in tasks that actually matter. Now, not everyone was universally positive.
In fact, there were a number of early testers who did point out some of the challenges of 5.2. Dan Shipper from Every said it's not as good a writer as Opus on their internal benchmarks, and that in his estimation it's mostly an incremental upgrade, saying that he hasn't found himself explicitly switching to it for day-to-day tasks. That idea of this being an incremental upgrade is Every's big banner headline; they said that while it excels at instruction following and extended tasks, don't expect it to surprise you. Now, one thing that's notable about Every's tests is that they built a more sophisticated test for writing quality than many others, one that uses about 50 requests and scores them on things like reader engagement and avoidance of AI-isms. In other words, although they're calling this a vibe check, they're actually one of the best, if not the best, sources of early feedback when it comes to the quality of writing out of a new model. 5.2 certainly wasn't bad, matching Sonnet 4.5 at 74% on their tests, but it was below Opus 4.5's 80%. One bright spot they pointed to was that it was less prone to tired AI constructions like it's not X, it's Y. So, summing up, Every's critique isn't so much a critique; it's just a cap on how hyped to get, again calling this an incremental upgrade. Others pointed out things it did well versus not so well. Simon Smith verified that 5.2 is a lot better for professional deliverables, saying that the biggest leap is in structured business outputs like multi-sheet Excel workbooks with proper formatting and PowerPoint decks with better structure and concise bullets. He said, This is the first time ChatGPT has made spreadsheets and presentations I'd consider remotely client-ready. He also argued that 5.2 has better concision of thinking: 5.1 sometimes rambles, producing sprawl, whereas 5.2 is more deliberate and better calibrated to the task complexity.
However, he argues that this isn't universally a good thing. He compares 5.1 Thinking to a brilliant, slightly chaotic freelancer and 5.2 Thinking to a polished professional. He agrees with Every that 5.2 is less likely to surprise you, whereas the upside of 5.1's slightly chaotic nature is that while, in his words, you'd never let it talk to a client, sometimes it surprises you with an outstanding idea or turn of phrase. Ultimately, Simon comes down on the side of this being a big upgrade. Allie Miller had similar findings in her tests: the thinking and problem solving felt noticeably stronger. She said that it gave her deeper explanations than she's used to seeing. In fact, she writes, at one point it literally wrote code to improve its own OCR in the middle of a task. She also found that idea exploration feels a little bit richer, even compared to what she's seen from Opus 4.5. However, like Simon, she found the tone to be different and, for her, a downside. She said the default voice felt a little bit more rigid, and the length and markdown behavior is extreme; a simple question turned into 58 bullets and numbered points. Ultimately, she argues that this version is optimized for deeper problem solving, structured analysis, and power users who want to sift through all of those options. 5.2, she says, feels like a step towards AI as serious analyst and away from AI as friendly companion. Hello, friends. If you've been enjoying what we've been discussing on the show, you'll want to check out another podcast that I've had the privilege to host, which is called You Can with AI, from KPMG. Season one was designed to be a set of real stories from real leaders making AI work in their organizations, and now season two is coming and we're back with even bigger conversations.
This show is entirely focused on what it's like to actually drive AI change inside your enterprise, with case studies, expert panels, and a lot more practical goodness that I hope will be extremely valuable for you as the listener. Search You Can with AI on Apple, Spotify, or YouTube and subscribe today. This episode is brought to you by Blitzy, the enterprise autonomous software development platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale codebases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80%-plus of the development work autonomously while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding copilot of choice. To bring an AI-native SDLC into your org, visit blitzy.com and press Get a Demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. Meet Rovo, your AI-powered teammate. Rovo unleashes the potential of your team with AI-powered search, chat, and agents, or build your own agent with Studio. Rovo is powered by your organization's knowledge and lives on Atlassian's trusted and secure platform, so it's always working in the context of your work. Connect Rovo to your favorite SaaS apps so no knowledge gets left behind. Rovo runs on the Teamwork Graph, Atlassian's intelligence layer that unifies data across all of your apps and delivers personalized AI insights from day one. Rovo is already built into Jira, Confluence, and Jira Service Management Standard, Premium, and Enterprise subscriptions. Know the feeling when AI turns from tool to teammate?
If you Rovo, you know. Discover Rovo, your new AI teammate powered by Atlassian. Get started at rovo.com. AI changes fast. You need a partner built for the long game. Robots & Pencils works side by side with organizations to turn AI ambition into real human impact. As an AWS certified partner, they modernize infrastructure, design cloud-native systems, and apply AI to create business value. And their partnerships don't end at launch; as AI changes, Robots & Pencils stays by your side so you keep pace. The difference is close partnership that builds value and compounds over time. Plus, with delivery centers across the US, Canada, Europe, and Latin America, clients get local expertise and global scale. For AI that delivers progress, not promises, visit robotsandpencils.com/aidailybrief. Now, one person who agrees with all of what has been said before, but found that 5.2 Pro is so uniquely better at what it does that it has become indispensable for him, is Matt Shumer. Interestingly, Matt says that he's had access to these models since November 25th, which is a lot longer than most of these folks, who are sometimes going on days or even just a couple of hours of early access. His overall review of 5.2 was summed up as incredibly impressive, but too slow. He said 5.2 Thinking is a meaningful step forward in instruction following and willingness to attempt hard tasks. Code generation is a lot better than 5.1. Vision and long context are much improved, but speed is a big downside, and speed can be a big deal. He expands the thought: Here's something that affects my daily usage. Standard 5.2 Thinking is slow. In my experience, it's been very, very slow for most questions, even straightforward ones. I almost never use Instant; Thinking is much better and Pro is insanely better, but it means I'm usually paying a speed penalty. In practice, this means I barely use GPT-5.2 Thinking. My actual workflow has become: quick questions go to Claude Opus 4.5, and when I need deep reasoning, I go straight to 5.2 Pro. The standard Thinking model sits in an awkward middle ground, slower than Opus but without the full reasoning benefits of Pro. However, Matt, more than anyone else that I've seen so far, really extols Pro as something fundamentally different. He writes, More than raw intelligence, what sets Pro apart is its willingness to think. It will spend far longer than previous Pro models working through a problem. For research tasks, it will research for an absurdly long time if that's what the task requires. One example he gave to capture what Pro does uniquely among models: he said, I asked it for meal planning help, emphasizing that I have no time to cook. I wanted a seven-day plan with three meals and two snacks per day. Pro came back with amazing recipe plans, but what stood out was the ingredients list, much simpler than what the other models suggested. It understood that I have no time wasn't just a constraint on cooking time; it was a constraint on shopping complexity, prep work, and mental overhead. It grasped my mentality, not just my literal request. I had sent the same prompt to all of the other frontier models, and none of them accounted for this. This is the kind of understanding that makes Pro feel different. Indeed, so enthused was he that he wrote another full review, calling it his 5.2 Pro deep dive, where he said, This is undoubtedly the world's best model. I can't live without it now. Again, he warns of the cost of speed, which isn't just that you have to wait around for an answer; as he points out, every so often it will think for a long time and still make a big mistake, wasting a lot of time. That means, in his estimation, that prompting matters more than ever: be explicit, add constraints, and refine prompts before you send them. Still, ultimately, he writes, after using Pro for two weeks, I can't live without it.
It's my go-to for everything I do that requires deep thinking, research, or coding, or almost any prompt I run that doesn't require an instant answer. I think Allie Miller actually has a pretty good rundown of what this amounts to for different user profiles. For general users, she writes, I think they'll be incrementally more pleased. The idea space that 5.2 explores is better than 5.1, so they might like the problem solving a little bit more. For devs, she wasn't sure. She said that while the model seemed to fare well on one-shot asks, she suspected that the max code models within Codex are still better, and that Claude and Gemini are either right up there or even ahead. When it came to business users, although she said she didn't feel all that big of a leap, everything else around the benchmarks suggests a huge jump. Researchers, however, she suggested, were going to be the most pleased group overall, which comports with Matt Shumer's argument that this is a slow genius. Now, going back to that question of coding and direct head-to-head comparison, we do of course have some ways to see what people prefer in direct head-to-head matchups. LMArena shared that 5.2 was at number 6 in WebDev and that 5.2 high had jumped all the way up to number 2, ahead of Opus 4.5 and Gemini 3 Pro but behind Opus 4.5 Thinking. On front end in the Design Arena, 5.2 high remains behind Gemini 3 Pro and Opus 4.5, coming in third. So let's talk now about some of the larger implications of this release. Some pointed out that there are implications for what we believe around training. Ben Pouladian writes, GPT-5.2 is the clearest signal yet that pre-training scaling isn't slowing down. Bigger corpuses, longer contexts, hotter training runs. Every jump like this means one thing: Nvidia's curve is nowhere near flattening. We're still early in the compute supercycle.
Now, if this becomes conventional wisdom, it could have meaningful impacts on the spectrum of boom to bubble, in the same way that GPT-3's release pushed people more towards boom, at least for the moment. TDM on Twitter also noted that in the partners section of the OpenAI announcement, they said that 5.2 was built on Nvidia GPUs, including H100s, H200s, and GB200s. Another interesting implication, I think, just has to do with the pace of change. You heard before in the benchmarks that 5.2 scored very high on the ARC-AGI exams, both 1 and 2. ARC Prize tweeted, A year ago we verified a preview of an unreleased version of OpenAI's o3 that scored 88% on ARC-AGI. The catch, of course, was that that version cost $4,500 a task. Today, they write, we verified a new 5.2 Pro extra high state-of-the-art score of 90.5% at $11.64 a task. For those quickly doing the math, or asking ChatGPT to do the math, they point out that that represents a roughly 390x efficiency improvement in one year. Now, in terms of what it means in their battle with Anthropic and Google, it's too early to know for sure how people are going to feel as they really get their hands on it, but it does seem likely to me to stem some of the bleeding, even though it isn't claiming to be universally better in all ways, and a lot of the first responses that we shared had some caveating, or at least nuance, about where and how and in what ways it's good. It's clearly a really good model that is a big step up from what OpenAI offered before, and it is likely going to compete with Gemini 3 Pro and Opus 4.5 on a lot of different use cases. Sam Altman also tweeted just after the release, Also, we have a few little Christmas presents for you next week, which many are speculating means the next version of image models.
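As an aside, for anyone who wants to sanity-check that efficiency figure, it is just the ratio of the two verified per-task costs. A quick sketch, using the $4,500 and $11.64 numbers ARC Prize quoted:

```python
# Verified ARC-AGI cost per task, as quoted by ARC Prize.
o3_preview_cost_per_task = 4500.00  # December 2024 preview of o3, scoring 88%
gpt52_pro_cost_per_task = 11.64     # GPT-5.2 Pro extra high, scoring 90.5%

# The "efficiency improvement" is the ratio of the two per-task costs.
improvement = o3_preview_cost_per_task / gpt52_pro_cost_per_task
print(f"~{improvement:.0f}x cheaper per task")  # prints ~387x
```

Strictly, 4,500 / 11.64 comes out to about 387x; ARC Prize's ~390x is that same ratio rounded, and it comes alongside the score rising from 88% to 90.5%.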
Given all the rumors that we've shared and talked about in previous episodes this week about new image models being tested under pseudonymous names, that speculation makes sense. Summing up, and touching on something that we haven't even had a chance to talk about yet, Rohit writes, Code red to the best model and a partnership with Disney in one week. Damn. What they're referring to, of course, is a new partnership that was announced this morning in which The Walt Disney Company is not only not going after OpenAI in court, but is instead granting them a three-year license to use something like 200 Disney characters in Sora generations. The details: it's a three-year licensing agreement, including one year of exclusivity, under which Sora users will be able to generate videos that use more than 200 different Disney, Marvel, Pixar, and Star Wars characters, creating an incentive for people to actually go do that. Some number of those Sora videos will then actually stream on Disney+. And to top it all off, Disney is also going to become a major customer of OpenAI, both deploying ChatGPT for its employees and using the API to build new products. And finally, Disney is going to make a billion-dollar equity investment into OpenAI as well. Now, I've got to give a big shoutout to Andrew Curran, who's one of the best AI news aggregators on Twitter/X. All the way back in August, when Sam Altman tweeted an image of the Death Star, Andrew wrote, Sometimes I read too much into things. It's my nature. However, seeing as I think we're getting a Sora 2 announcement, I'm predicting the mouse has finally made up its mind. And it wasn't just that one prediction; back in November, he also wrote, Disney is becoming an AI company. At this point, it's simply a matter of who they choose as a partner. A deal between OpenAI and Disney has seemed close many times over the last year, but it looks like it's coming down to this week. To me, this decision is a huge signal for who will be leading the race a year from now.
It's far bigger than the IP. It's also the fact that as soon as Disney forms this partnership and starts using AI for user-created content, which will begin with video shorts on Disney+, it will use its immense media power to broadcast that AI is a legitimate creative tool and will actively encourage its use. To me, this is the biggest decision of the year, and whoever wins it will have immense main character energy in 2026. Adding some validity to that argument, on the same day that they announced this OpenAI deal, Disney sent a cease-and-desist letter to Google accusing them of copyright infringement on a massive scale. Now, there'll be a lot more to get into on that particular deal, but bringing it back to GPT-5.2: I was someone who felt like the 5.1 update was very meaningful for my use cases of iterative brainstorming and business strategy collaboration. 5.2 in general, but 5.2 Pro especially, felt like a major upgrade. In fact, like Matt Shumer, although in the context of a different model, I found myself much more than before skipping the thinking model and going straight to 5.1 Pro. I was finding myself just naturally redesigning my work so that I could let that process take place while I was doing other things and then come back to it when it was ready. However, it feels to me like in most ways OpenAI sees 5.2 as the big next step post-GPT-5. It almost feels as though 5.1, in personality and capabilities, was kind of what they wanted 5 to be, and 5.2, at least at first glance, appears to be what they wanted that next intermediate model to be. 5.2 should be rolling out to all paid subscribers over the next day or so, so we will have some fun figuring out what it does well. For now, that is going to do it for the AI Daily Brief. Appreciate you listening or watching as always. And until next time, peace.
