Transcript
A (0:00)
Today on the AI Daily Brief, what 1,200 professionals tell us about working with AI. And before that, in the headlines, Gemini 3 Deep Think is now available. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
A (0:18)
All right friends, quick announcements before we dive in. First of all, thank you to today's sponsors: Superintelligent, Rovo, Robots and Pencils, and Blitzy. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. And of course, if you are interested in sponsoring the show and locking in those 2025 rates before they expire, send us a note at sponsors@aidailybrief.ai. Now, last note before we dive in: we're doing a bit of a switcheroo today. The headlines section is actually a little bit longer than the main episode. There was just enough news that we kind of had to do it that way. So without any further ado, let's dive in.

Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Although today is a very jam-packed episode, so I expect it to be a little longer than normal.

We kick off today with an exciting one for you model testers out there. Google has released Gemini 3 Deep Think mode, the most powerful version of the new Gemini 3 suite. Right now, the new mode is exclusively available to subscribers of Google's AI Ultra plan, which is their couple-hundred-dollars-a-month product. As you might imagine with a price tag that high, Deep Think is designed to tackle the most complex math, science, and logic problems available. The mode builds on top of Gemini 2.5 Deep Think, and as much as I tend not to care about benchmarks, it does claim some impressive performances. Google claims a state-of-the-art 41% result on Humanity's Last Exam without the use of tools, outperforming GPT-5 Pro at 30.7%. Deep Think also achieved a 45% result on the ARC-AGI-2 test, more than doubling the performance of GPT-5 Pro to become the new state of the art.

Now, it should be noted whenever we talk about ARC-AGI that there are two vectors: there is score, and there is cost per task. And while Gemini 3 Deep Think absolutely shattered the previous high score, it did so at a pretty elevated cost of $77 a task. This might go some way toward explaining why they're paywalling Deep Think mode behind the most expensive subscription. It's important to note that this is the first time normal users have ever had access to a model this expensive to run. OpenAI never released the preview version of o3 that cost $167 per task to achieve its state-of-the-art performance at the end of last year.

Deep Think achieves its state-of-the-art performance by exploring multiple hypotheses at once before delivering a solution, a technique that has been used in research to boost performance but generally hasn't been available to regular users as a standard feature due to the high inference costs.

Now, one thing that's not exactly clear yet is what the use case is actually expected to be. Google announced the mode by showing it generating a dominoes game with complex physics in one shot. Another Googler showed it producing a complex physics simulation of a rubber vase falling on a hard surface, which I think helps clear up one thing: because it is called Deep Think and we already have Deep Research, there may be some mental overlap between the two, but they are fundamentally different things. Deep Think is not just a souped-up version of Deep Research. Instead, it is capable of scientific reasoning.
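To make that "exploring multiple hypotheses at once" idea concrete before we get to reactions: here is a minimal sketch of the generic best-of-N pattern from the research literature. To be clear, this is an illustration under my own assumptions, not Google's implementation; the `generate` and `score` callables stand in for a model call and whatever verifier picks the winner.

```python
import concurrent.futures
from typing import Callable

def solve_with_parallel_hypotheses(
    problem: str,
    generate: Callable[[str, int], str],  # hypothetical model call: (problem, seed) -> candidate
    score: Callable[[str, str], float],   # hypothetical verifier: (problem, candidate) -> quality
    n_hypotheses: int = 8,
) -> str:
    """Sample several candidate solutions in parallel, then keep the best one.

    This is the generic best-of-N pattern from the research literature,
    not Google's actual Deep Think implementation.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_hypotheses) as pool:
        # Explore multiple hypotheses at once, each seeded differently.
        candidates = list(pool.map(lambda s: generate(problem, s), range(n_hypotheses)))
    # Deliver only the highest-scoring solution. Note that you still paid
    # for every generation, which is where the high inference cost comes from.
    return max(candidates, key=lambda c: score(problem, c))
```

The cost structure falls straight out of the pattern: eight hypotheses means paying for roughly eight full generations per task, which is consistent with a $77-per-task figure and with why this hasn't been a standard consumer feature.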
Now, in terms of first reactions, a lot of people had the same experience as Hyperbrowser founder Sri Shrikhani, who wrote: "With all their TPUs and GPUs, how TF is Gemini 3 Deep Think overloaded and unusable?" This was also the response I got for the first couple of hours after the announcement, although it then cleared up. Victor Taelin writes: "For those wondering, and as expected, Gemini 3 Deep Think solves the stack overflow bug that cost me a few days. The answer is more decisive than Opus 4.5, the only other public model to solve it. Even Gemini 3 Pro fails. It even points to the exact location confidently. Takes forever, though. I don't have harder tests for now; most of my benchmarks are saturated."

I've only had access for less than a day so far as well. Knowing that it sort of probably wasn't the use case, I still gave it a recent business strategy question that I had been both genuinely exploring and also using to test GPT-5.1 Thinking versus GPT-5.1 Pro versus Gemini 3 Pro. And I will say, at this stage, I don't particularly think that the extra reps of Deep Think are worth it for that type of business strategy question. Basically, I don't think that it particularly added anything more; in fact, I didn't even prefer its response relative to the others. So whereas I have recently found myself willing to take the time for 5.1 Pro on business strategy questions, I think Deep Think might be a little too far and just not right-sized for that particular purpose. In any case, I will continue to experiment with it, making full use of that Ultra account.

Next up, we stay in the Google universe, where they have partnered with Replit to bring vibe coding to the enterprise. The multi-year partnership will see Replit expand their use of Google Cloud services, meaning a deeper integration of Google's AI models as well as using Google Cloud infrastructure on the backend to enable fully functional vibe-coded software. Apps coded in Replit will also be able to leverage Google Cloud Marketplace in their go-to-market strategy. Replit CEO Amjad Masad said: "The goal for us and for Google is to make enterprise vibe coding a thing. We want to show the world that these tools are actually going to transform businesses and how people work. Instead of people working in silos, designers only doing design, project managers only writing, now anyone in the company can be entrepreneurial." Richard Serrata, senior director for Google Cloud, added: "It may feel like it, but Replit is no overnight success. Amjad and team built something over time that became the exact right thing for this current moment with builders."

In separate comments to CNBC on the state of the AI bubble, Amjad acknowledged that the honeymoon phase for vibe coding is over. He said: "Early on in the year there was the vibe coding hype market, where everyone's heard about vibe coding, everyone wanted to go try it. The tools were not as good as they are today, so I think that burnt a lot of people. So there's a bit of a vibe coding, I would say, hype slowdown, and a lot of companies that were making money are not making as much money." Amjad noted that earlier in the year we were getting weekly ARR updates from the vibe coding companies, and now we're not. That said, new statistics from Ramp suggest that Replit isn't slowing down all that much. The Ramp Economics Lab reported that Replit is currently number one for new customer growth across all software vendors. Google is also up there following the release of Gemini 3 and Nano Banana Pro, sitting at number five for new customer growth and number two for new spend growth.

Now, one of my somewhat contrarian takes is that I think we are actually way too bearish on vibe coding right now.
I tend to think that when we say vibe coding, we are having two entirely different conversations at the same time using the same words. There is vibe coding for non-technical people, which is entirely different than vibe coding for software engineers. That shine-coming-off-the-rose type of phenomenon that Amjad was talking about is, I think, specific to vibe coding for software engineers. There is a recalibration happening right now among developers around how best to deploy these tools, around the autonomy spectrum, all these sorts of questions around how you're going to integrate agentic coding into your processes in a way that doesn't just create new problems. However, for non-technical people, I think we are barely scratching the surface. In particular, I do not think that vibe coding has significantly made its way into the business world yet. It's mostly still individual hackers and tinkerers who are discovering that they can build and modify their own websites now without having to use Wix or Squarespace or something like that. I genuinely believe that that is going to change, and I actually think 2026 is going to be a massive growth year for vibe coding, but with a very different market audience.

One more Google-adjacent story: Google's neocloud partner Fluidstack is in talks to raise $700 million at a $7 billion valuation. Fluidstack started the year as a relative unknown but signed multiple data center development deals to jumpstart their business. Google served as the backstop on a pair of deals, pledging to repay debt if Fluidstack defaults. As part of those deals, Fluidstack became one of the first third-party vendors to receive Google's TPUs. That wasn't massive news back in September when the deals were struck, but now that the market narrative views TPUs as a genuine contender to Nvidia's dominant GPUs, that is changing. Fluidstack also secured the contract to build a gigawatt-capacity data center in France as part of President Emmanuel Macron's push for sovereign AI. They are additionally the infrastructure partner for Anthropic's $50 billion data center investment announced last month. The new funding round will reportedly be led by Situational Awareness, which is of course the hedge fund started by former OpenAI researcher Leopold Aschenbrenner.

Moving to our next story: hype around Opus 4.5 continues to build as the model keeps pushing the limits. Sayash Kapoor, who you may know from the "AI as Normal Technology" essay, announced that his team is ready to declare that Opus has solved the CORE-Bench scientific agent benchmark. The benchmark requires agents to reproduce scientific papers: given the code and data from a paper, the agent is scored on its ability to set up the repo from the paper, run the code, and then correctly answer questions about the result. Functionally, it's a benchmark primarily about agentic code execution. CORE-Bench uses a common agent scaffold called CORE-Agent to allow comparison between different models on a level playing field. Opus 4.5 was initially tested using CORE-Agent and scored 42%, a solid score, but not close to Opus 4.1's leading score of 51%. DeepMind researcher Nicholas Carlini then reached out to the team with a new scaffold that uses Claude Code, as well as pointing out some issues with the way the benchmark was being scored. The CORE-Bench team ran the benchmark again using the Claude Code harness and found that Opus 4.5's performance almost doubled, to 78%.
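Given how much the harness turned out to matter, it's worth pinning down what "scaffold" means here. The following is a hypothetical sketch of the pattern, not CORE-Agent's or Carlini's actual code: the tasks and models stay fixed, and only the harness that drives the model changes.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    repo_url: str   # the paper's code and data
    question: str   # question about the reproduced result

class Scaffold(Protocol):
    """A harness that drives a model through a task and returns its answer."""
    def run(self, model: str, task: Task) -> str: ...

class PrescriptiveScaffold:
    """CORE-Agent-style idea: the harness dictates low-level steps."""
    def run(self, model: str, task: Task) -> str:
        steps = ["set up repo", "install dependencies", "run code", "parse output"]
        # Each step would become its own constrained prompt to `model`.
        return f"{model} answer to {task.question!r} via {len(steps)} fixed steps"

class AgenticScaffold:
    """Claude Code-style idea: hand the model the goal and tools, let it plan."""
    def run(self, model: str, task: Task) -> str:
        # One open-ended session in which the model chooses its own steps.
        return f"{model} answer to {task.question!r} via self-directed session"

def compare(models: list[str], scaffolds: list[Scaffold], tasks: list[Task]) -> None:
    """Score every model under every harness on the same tasks, so a
    42%-versus-78% gap can be attributed to the scaffold, not the tasks."""
    for scaffold in scaffolds:
        for model in models:
            for task in tasks:
                print(type(scaffold).__name__, model, scaffold.run(model, task))
```

The point of the shared interface is exactly the team's: if only the scaffold changes between runs, the score difference measures harness fit, which apparently is now a first-order variable for frontier models.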
Interestingly, a jump of this size was unique to Opus 4.5: Sonnet 4 and 4.5 saw much smaller improvements, and Opus 4.1 actually went backwards. Kapoor wrote: "We're unsure what led to this difference. One hypothesis is that the Claude 4.5 series of models is much better tuned to work with Claude Code. Another could be that the lower-level instructions in CORE-Agent, which worked well for less capable models, stopped being effective and hinder performance for more capable models."

The CORE-Bench team also manually went through their benchmark, weeding out the grading errors that Carlini had pointed out. Eight tasks were being incorrectly marked as wrong due to small floating point errors (the kind of thing where a result of 0.30000000000000004 fails an exact-match check against 0.3), and one task was impossible to reproduce due to a dataset being removed from the Internet. The team manually scored Opus 4.5's performance at 95%, with only two tasks failed. Kapoor wrote: "With Opus 4.5 scoring 95%, we're treating CORE-Bench-Hard as solved." The team now plans to pivot to an undisclosed set of test questions for their next benchmark, to ensure the questions aren't included in training data.

Now, outside the benchmarks, the personal testimonials for Opus 4.5 just continue to roll in. Dan Shipper from Every, who was very bullish to begin with, wrote a new piece going even further. He said on Twitter: "Opus 4.5 blew me away. This week I built a fully featured reading companion app that I now use every day in between meetings, without looking at the code. Two things that are important. First, we just reached a new level of autonomous coding. You've been able to one-shot an impressive app demo for a while now with any frontier model. Opus 4.5 is the first model that just keeps coding and coding without running into endless loops of errors. Second, prompt-native apps are now possible. Opus 4.5 can now act as a general-purpose agent inside your app to power many of your features. This turns building features into an exercise in writing prompts instead of writing code."

The NYT's Kevin Roose is also finding Opus 4.5 great for non-coding purposes. He writes: "Claude Opus 4.5 is a remarkable model for writing, brainstorming, and giving feedback on written work. It's also fun to talk to and seems almost anti-engagement-maxed. The other night I was hitting it with stupid questions at 1am and it said, Kevin, go to bed now." As for me, I have not yet found myself switching away from GPT-5.1 or Gemini 3 to Opus 4.5 all that often. But with all of this chatter, it seems clear that I'm going to have to give it an even bigger swing.
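Going back to Shipper's "prompt-native apps" point for a moment, it's worth seeing how small that pattern is in code. Here's a minimal sketch using the Anthropic Python SDK; the model alias and the reading-companion-style feature are my own illustrative assumptions, not Shipper's actual app.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize_highlights(highlights: list[str]) -> str:
    """A 'feature' implemented as a prompt rather than as hand-written logic.

    In a prompt-native app, the model is the general-purpose engine behind
    the feature; shipping a new feature mostly means writing a new prompt.
    """
    response = client.messages.create(
        model="claude-opus-4-5",  # assumed model alias; check current model names
        max_tokens=1024,
        system="You are the reading-companion feature of an app. Be concise.",
        messages=[{
            "role": "user",
            "content": "Synthesize the key themes across these highlights:\n\n"
                       + "\n".join(f"- {h}" for h in highlights),
        }],
    )
    return response.content[0].text
```

The design consequence is the one Shipper names: the feature boundary becomes a prompt, so changing the feature is mostly an exercise in editing text.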
A couple more stories; like I said, we are on an extended headlines edition today. A little bit of market and adoption news: Salesforce has delivered a strong revenue forecast on the back of Agentforce adoption. Salesforce said their Q4 revenue would be between $11.1 billion and $11.2 billion, outstripping analyst forecasts of $10.9 billion. They also said that remaining performance obligations, a measure of future bookings, would increase by about 15%, compared to analyst estimates of 10%. CEO Marc Benioff credited their AI-focused products, stating: "Our Agentforce and Data 360 products are the momentum drivers." Active customer accounts for Agentforce have grown 70% quarter over quarter, with many customers now transitioning from the pilot phase to active deployment. Benioff said that they now have over 9,500 paying Agentforce customers: "We've delivered incredible results with Agentforce. It's really exceeding our expectations. This is our fastest growing product ever."

Now, one interesting sub-wrinkle that I'm watching with the Salesforce story: a big question that many have is to what extent models get commoditized in the future. Salesforce, for their part, has primarily built on top of OpenAI models since they launched Agentforce in late 2024. However, last week Benioff posted: "I've used ChatGPT every day for 3 years. Just spent 2 hours on Gemini 3. I'm not going back. The leap is insane. Reasoning, speed, images, video, everything is sharper and faster. It feels like the world just changed again." Then on Thursday he posted: "LLMs are the new disk drives: commodity infrastructure you hot swap for whoever's cheapest and best. The fantasy that the model is a moat just expired." So, interesting things to watch to see how Salesforce thinks about model switching and what that means for the rest of the market.
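Benioff's "hot swap" line has a very literal engineering counterpart. As a minimal sketch (the provider names and the single-function interface are my own illustrative assumptions), an app that codes against a thin completion interface can switch models via configuration rather than a rewrite:

```python
from typing import Callable

# Each provider is wrapped behind the same (prompt -> completion) signature.
# The bodies here are placeholders; a real app would call each vendor's SDK.
def call_openai(prompt: str) -> str: ...
def call_gemini(prompt: str) -> str: ...
def call_anthropic(prompt: str) -> str: ...

PROVIDERS: dict[str, Callable[[str], str]] = {
    "openai": call_openai,
    "gemini": call_gemini,
    "anthropic": call_anthropic,
}

# "Hot swapping for whoever's cheapest and best" reduces to changing this
# value, e.g. from a config file or a routing policy, with no app changes.
ACTIVE_PROVIDER = "gemini"

def complete(prompt: str) -> str:
    return PROVIDERS[ACTIVE_PROVIDER](prompt)
```

That is the sense in which models become disk drives: the interface stays put while the vendor behind it changes.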
An even bigger market story yesterday, if only a little tangentially related to AI, is that Meta could be giving up on their namesake technology, with rumors of deep cuts to the metaverse division. Bloomberg reports that the Metaverse group could see budget cuts as high as 30% next year. Their sources said cuts of that magnitude would most likely include layoffs as soon as January of next year. They did caveat that no final decisions have been made, but deep cuts to the Metaverse group are on the agenda for end-of-year budget planning sessions. Sources said that Zuckerberg has asked for 10% cuts across the board, which has been the standard request for the past few years; however, the Metaverse group was singled out for deeper cuts due to the lack of industry-wide competition over the technology.

Now, for most public market investors, it's hard to see the metaverse as anything but a massive disappointment, especially relative to the pitch in 2021, when Zuckerberg presented the metaverse with such conviction that he changed the name of the company. Since then, the Metaverse group has been nothing short of a cash incinerator, losing more than $70 billion since the metaverse strategy was announced. Thus, unsurprisingly, markets responded well to the idea that Meta would be slashing that particular category of spend. The stock jumped by 5.7% in its largest intraday move since July.

Now, while the Metaverse group is being slashed, that doesn't necessarily carry over to the parent division, Reality Labs. That broader division is focused on Meta's various AR and VR products and has been going from strength to strength in recent years. The Meta Ray-Bans have been a surprise hit and now define their product category, which is presumably only becoming more important as LLM capabilities catch up to the promise of AI wearables. A Meta spokesperson suggested this strategy pivot is underway, commenting: "Within our overall Reality Labs portfolio, we are shifting some of our investment from metaverse towards AI glasses and wearables given the momentum there. We aren't planning any broader changes than that."

The reallocation of resources also aligns with Meta's poaching of veteran Apple UX designer Alan Dye earlier this week. On Wednesday, Zuckerberg announced that Dye would lead a new creative studio within Reality Labs focused on design, fashion, and technology. In a post on Threads, Zuckerberg wrote: "We're entering a new era where AI glasses and other devices will change how we connect with technology and each other. The potential is enormous, but what matters most is making these experiences feel natural and truly centered around people. With this new studio, we're focused on making every interaction thoughtful, intuitive, and built to serve people."

So friends, that is the story from this extended headlines edition, but for now, we'll wrap it there and move on to the main episode.
