Transcript
A (0:00)
Today on the AI Daily Brief, are we entering the era of vertical AI models? Before that, in the headlines, a big leak, with Anthropic confirming the existence of Claude Mythos, what they call by far the most powerful AI model they've ever developed. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Blitzy, AssemblyAI, and Robots and Pencils. To get an ad-free version of the show, go to patreon.com/aidailybrief, and if you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai. Late breaking: last night, a data leak revealed that Anthropic is testing a new model referred to as Claude Mythos. Anthropic has confirmed the existence of this model, with a spokesperson saying that it was a "step change" (their words) in performance and the most capable they've built to date. The model is currently being trialed by early access customers. So here's what happened. On Thursday evening, a draft blog post describing the model was left in an unsecured, publicly searchable database. The blog post says: We've finished training a new AI model, Claude Mythos. It's by far the most powerful AI model we've ever developed. Mythos, they write, is a new name for a new tier of model, larger and more intelligent than our Opus models, which were until now our most powerful. We chose the name to evoke the deep connective tissue that links together knowledge and ideas. Compared to our previous best model, Claude Opus 4.6, Mythos gets dramatically higher scores on tests of software coding, academic reasoning, and cybersecurity, among others. In preparing to release Claude Mythos, however, they say, we want to act with extra caution and understand the risks it poses, even beyond what we learn in our own testing. 
In particular, we want to understand the model's potential near-term risks in the realm of cybersecurity and share the results to help cyber defenders prepare. Mythos is also a large, compute-intensive model. It's very expensive for us to serve and will be very expensive for our customers to use. We're working to make the model much more efficient before any general release. For those reasons, we're taking a slower, more gradual approach to releasing Mythos than we have with our other models. We're beginning with a small number of early access customers who will explore the model's cybersecurity applications and report back what they find. Now, this blog post is very undercooked; it ends not too long after that. And if you hear the term Capybara thrown around, apparently the model was also referred to by that name. I'm not sure if Capybara was the code name and Mythos is the intended launch name, but regardless, this draft blog post was in a cache of unsecured documents. In total, Fortune reports, there appear to be close to 3,000 assets linked to Anthropic's blog that had not previously been published. Now, there is a lot of chatter about this one, not least of which concerns the choice of name, which many people associate with the Cthulhu mythos. Given how much the AI safety folks use those sorts of literary reference points to describe their concerns about AI, it may not be the most advisable name. People also compared it to the recently revealed Spud from OpenAI, with Jason Botterell writing: I like how Anthropic's mysterious, spooky new model is codenamed Mythos, while OpenAI named theirs after a frickin potato. The broader sentiment, though, was captured by Gavin Purcell, who says it will only go faster from here. Obviously there will be a lot to watch with this one. 
Unfortunately for those of us who want to get our hands on the most powerful models at any given time, it looks like the blog post was not even an announcement of the release of the model, just an advance warning about it. So who knows how long it'll take before we actually see it in practice. Now, one model that is available now: Google has dropped a small voice model that could have big implications. The model is Gemini 3.1 Flash Live, which brings real-time dialogue to voice models. Up until now, most voice models have been turn-based, causing awkward stumbles and terrible interruption handling. Flash Live is designed to work more like a human conversation, with a continuous back and forth rather than a jarring, stilted experience. The model apparently shows a step-change improvement on multiple audio benchmarks, including one designed to measure multi-step function calling, the feature that converts voice commands into complex agentic actions. Some customers, like Home Depot, have already deployed the model, and Google noted a big improvement in handling complex details like alphanumeric product codes and noisy environments. Still, the obvious implication is for the quality of personal voice agents on mobile devices, especially given that Apple is looking to Gemini to power the new version of Siri. The long winter of our discontent, of Siri not understanding a single damn word we say, may finally be coming to an end. Next, one small product announcement from Shopify that I actually think could be fairly significant. One of my weirder, more out-there predictions for 2026 was that Shopify has an outsized role to play in the positive normalization of AI. The reason for that is that Shopify is where a ton of small business entrepreneurship lives. Shopify's tools have already, even in the pre-AI era, given people who felt overwhelmed by what they needed to do to start a business enough help to get over the hump. 
Although, as you well know, I am not a jobs doomer, I do think that we're going to see a lot of shifts in the average way that people get employed and make money. One piece of that, I believe, will be an increase in small business entrepreneurship. If Shopify is the home of where a lot of that new energy goes, the way that they use AI to provide value for their people could make a big difference in people's perceptions of it. It's one thing when the only thing you hear about AI is that it's going to take your job and it uses all the water. It's another thing when you see your income rise 30% from the month before because of the tools you were able to use through your store's hosting platform. So what Tinker is is a free mobile app with more than 100 AI tools for e-commerce. Merchants can generate logos, product photos, advertising videos, and much more. It's an iterative, experimental, playful canvas where you can try out all sorts of different brand identities, product placements, and more. The entire concept is about flattening the learning curve. Apps are arranged by outcome, so merchants only need to select what they want to create. Once inside an app, they can see a range of examples demonstrating what it can do and how to use it. They can then describe a desired outcome in natural language, drop in a reference image, and Tinker automatically turns those inputs into high-quality prompts on the back end. Shopify's Director of Product, Rousseau Kazi, said: If you want more artists, lower the cost of paint. And cost isn't just money; it's the time spent keeping up, the friction of signing up for everything separately, and the learning curve of figuring it all out. We wanted to lower all of it. Like I said, this may seem small, but I really do believe that Shopify potentially has an outsized role to play in the positive integration of AI into the broader economy. And I think Tinker, from my first glances, looks awesome, by the way. 
Hopefully this goes without saying, but this is a completely unsponsored opinion. Over in OpenAI land, Codex gets a big upgrade with the integration of plugins. The OpenAI Devs account writes: With plugins, Codex can now support more real work, including the planning, research, and coordination that happens before you write code, and the workflows that follow. The team at OpenAI also used the occasion of the plugins launch to go for Anthropic's throat around some controversy over recent changes from Claude. Tariq from the Claude Code team writes: To manage growing demand for Claude, we're adjusting our five-hour session limits for Free, Pro, and Max subs during peak hours, weekdays between 5am and 11am Pacific Time. You'll move through your five-hour session limits faster than before. People were not happy about that, and OpenAI took full advantage. Thibaut from the Codex team writes: Hello, we have reset Codex usage limits across all plans to let everyone experiment with the magnificent plugins we just launched. You can just build unlimited things with Codex. Have fun. Speaking of OpenAI, the company has made a decision which I think is extremely the right one: putting their erotica plans on hold. The Financial Times reports that OpenAI has decided to shelve plans for Adult Mode indefinitely as they consolidate resources around coding and enterprise sales. This is, to put it mildly, not all that surprising. Earlier this month, the Wall Street Journal reported that OpenAI's independent advisory council was unanimously against the feature. Reportedly, their age detection system had a 12% failure rate, and the experts on the council weren't even satisfied Adult Mode would be safe for adults, warning it could encourage an unhealthy emotional dependence on ChatGPT. The feature was also controversial among staff, with some departing the company over the issue. 
Speaking with the Financial Times, sources said that OpenAI wanted more long-term research on the effects of sexually explicit chatbots and emotional attachment to AI before they released the product. Now, my feeling about this, as I said last fall, is that on the one hand, I have a very socially libertarian bent that basically thinks adults should be able to do whatever they want as long as it's not hurting other people. That said, viewing this question from an entrepreneur's lens, it did not make sense to me for OpenAI to be the one to offer this. There is going to be, I promise you, no shortage of adult AI experiences available to any adults who want them. And I just think that all of the costs of going down this route were so obviously going to be higher than the upside for OpenAI. One other thing that I did want to note about OpenAI's recent moves: there is a lot of chatter right now about how many products are being killed by OpenAI (instant checkout, Sora, the erotic chatbot), with people seeming to suggest that it's the company flailing. I think in many ways it's the opposite. It would be the worst business decision that OpenAI could make to stick with something that wasn't the right move, even if it looked like the right move just a couple of months ago. Nothing will kill a business faster than the sunk cost fallacy, and OpenAI being willing to scrap efforts, even where a lot of effort went in, is net-net a good thing for that company. And it couldn't come at a better time, because boy oh boy is the competition going to do nothing but heat up. The latest rumors suggest that Anthropic is discussing going public as soon as the fourth quarter, with follow-up Bloomberg reporting saying that they might be looking to IPO as soon as October. That of course puts OpenAI on the clock, as Sam Altman has reportedly said he would prefer to go first. Meaning, all in all, my prediction that we actually don't get IPOs this year might be one that is wrong. 
Noel Moldvay writes: According to the zodiac, 2026 is the year of the mega IPO. Indeed. For now though, that is going to do it for the headlines. Next up, the main episode. Alright folks, quick pause. Here's the uncomfortable truth: if your enterprise AI strategy is "we bought some tools," you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise: how work gets done, how teams collaborate, how decisions move. Not as a tech initiative, but as a total operating model shift. And here's the real unlock: that shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us/AI. That's www.kpmg.us/AI. Blitzy is driving over 5x engineering velocity for large-scale enterprises. A publicly traded insurance provider leveraged Blitzy to build a bespoke payments processing application, an estimated 13-month project, and with Blitzy, the application was completed and live in production in six weeks. A publicly traded vertical SaaS provider used Blitzy to extract services from a 500,000-line monolith without disrupting production, 21 times faster than their pre-Blitzy estimates. These aren't experiments. This is how the world's most innovative enterprises are shipping software in 2026. You can hear directly about Blitzy from other Fortune 500 CTOs on the Modern CTO or CIO Classified podcasts. To learn more about how Blitzy can impact your SDLC, book a meeting with an AI solutions consultant at blitzy.com. That's blitzy.com. You've heard me talk about AssemblyAI and their insanely accurate voice AI models, but they just shipped something big. 
Universal 3 Pro is a first-of-its-kind class of speech language model that lets you prompt speech recognition with your own domain context and vocabulary instead of fixing transcripts in post-processing. It's more flexible than traditional ASR and more deterministic than LLMs, so you get accurate output at the source and can capture the emotion behind human speech that transcripts often miss, all without custom models or post-processing hacks. And to celebrate the launch, they're making it free to try for all of February. If you're building anything with voice, this one's worth a look. Head to AssemblyAI.com/freeoffer to check it out. Most companies don't struggle with ideas; they struggle with turning them into real AI systems that deliver value. Robots and Pencils is a company built to close that gap. They design and deliver intelligent, cloud-native systems powered by generative and agentic AI with focus, speed, and clear outcomes. Robots and Pencils works in small, high-impact pods: engineers, strategists, designers, and applied AI specialists working together to move from idea to production without unnecessary friction. Powered by RoboWorks, their agentic acceleration platform, teams deliver meaningful results, including initial launches in as little as 45 days, depending on scope. If your organization is ready to move faster, reduce complexity, and turn AI ambition into real results, Robots and Pencils is built for that moment. Start the conversation at robotsandpencils.com/aidailybrief. That's robotsandpencils.com/aidailybrief. Robots and Pencils: Impact at Velocity. Welcome back to the AI Daily Brief. I noticed this really interesting story yesterday where Intercom announced that their new dedicated customer-service-focused model Fin had achieved something very significant. CEO Eoghan McCabe called it objectively the highest-performing, fastest, and cheapest model for customer service, beating the very best models in the industry, including GPT 5.4 and Opus 4.5. 
Now, it has been a persistent question in AI how much custom models would matter. You might remember, way back in the immediate post-ChatGPT fever, a number of companies figured, well, since we have such unique proprietary data, training our own model on that data surely will outperform. Maybe the best known of those efforts was BloombergGPT, which they called a 50 billion parameter large language model purpose-built from scratch for finance. It turned out that in practice, that model got absolutely smoked by the general models, reminding everyone once again of the bitter lesson. The bitter lesson comes from a very famous essay by computer scientist Rich Sutton from back in 2019. He writes: The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. He gave as one of his first examples computer chess. He says: In computer chess, the methods that defeated the world champion Kasparov in 1997 were based on massive deep search. At the time, this was looked upon with dismay by the majority of computer chess researchers, who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that brute force search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not. So what is this essay arguing? First, a reminder, because this often surprises people: AI as a field, at least as a named field, is older than computer science. If you go back to the 50s and look at the laboratories at places like MIT, there were already back then artificial intelligence labs. 
But the idea of computer science as a field wouldn't come until a little bit later. In any case, what the bitter lesson argues is that throughout AI history, researchers have tried two basic approaches. The first is encoding human knowledge and clever tricks into systems, essentially trying to teach computers how humans think. The second is giving computers massive amounts of data and compute and letting them figure things out on their own through search and learning. The bitter lesson is that the second approach wins every single time. It's bitter because it's a blow to human ego. Researchers spend years crafting elegant, domain-specific solutions, encoding chess strategy or linguistic rules or visual perception models, and then a brute force method powered by more compute just steamrolls all that careful work. Now, we have that example from chess, but the pattern repeated across Go, speech recognition, computer vision, and now language. The systems that scale with more computation always eventually beat the systems built on human-designed shortcuts. And so taking the bitter lesson and applying it to LLMs kind of explains why Bloomberg's highly specialized model ultimately got beat by much bigger and more computationally intensive models. And yet, coming into 2026, there was an interesting question of whether a specific type of data might actually change this equation. The data people were interested in is last-mile usage data, basically user interaction data at the very edge of the experience. And the specific place where many were watching this was around AI coding. The question was whether a company like Cursor could ultimately have some advantage in their own proprietary model because they had such a tremendous amount of experiential data around the actual interaction point. Now, it wasn't really so much a question of whether that data is valuable. 
Obviously it is, but there's a difference between it being valuable for product design versus model design. Inevitably, that data is extremely useful in figuring out the right products or the right harnesses for models. That was never in question. What was a question is whether all that information could actually change the destiny of customized vertical models. Latent Space wrote about this last November in their piece titled The Agent Labs Thesis. The point that swyx and Latent Space were making was that if it is the case that we are close to hitting the limits of pre-training data, that perhaps shifts the future of model performance to post-training. The Agent Labs thesis asked: can post-training make up the gap between the best open models and the best frontier models, and how long until they start exceeding them? In other words, the tweak here is that a company like Cursor isn't training a model from scratch. They are taking the best available open weights models out there, which are admittedly a little bit behind the state of the art, and adding in this post-training process, with the idea of actually performing better in a specific domain than the general state-of-the-art model can. Now, Cursor placed a pretty high importance on this. The company had said explicitly that they needed to train state-of-the-art coding models to keep up with competitors, which some reports suggested was a financial imperative, with Cursor burning too much money reselling API access to OpenAI and Anthropic. Earlier this month we got the release of their Composer 2 model. The model was in the same ballpark as GPT 5.4 and actually beat Opus 4.6 on coding benchmarks while being much cheaper to run, meaning of course that it fit Cursor's needs extremely well. However, an X user called Flynn triggered a controversy, revealing that Composer 2 was "just" (and boy is "just" doing a lot of heavy lifting) 
Kimi K2.5 with some extra reinforcement learning applied. Cursor themselves did not deny this. Dev relations rep Lee Robinson commented: Yup, Composer 2 started from an open source base. We will do full pre-training in the future. Only a quarter of the compute spent on the final model came from the base; the rest is from our training. This is why evals are very different now. Some amount of the controversy was about Cursor, in the eyes of some, failing to disclose their use of an open source base model, but others seemed genuinely dismissive of the practice. As Flynn had done, they wrote off the model as just Kimi K2.5 without a second thought. Others thought, though, that maybe something important was going on here. Leetllm writes: As someone who basically lives in Opus 4.6, seeing an open weight Kimi K2.5 fine tune actually beat it on coding benchmarks is wild. If Composer 2 really performs that well, Cursor seems to have demonstrated that reinforcement learning on a quality dataset can actually go quite a long way, vaulting an adequate base model into the top tier. This of course in some ways seems to run counter to the bitter lesson, but if it's correct, it would suggest that there's a lot of fertile ground for training models around particular verticals. Which gets us to the announcement yesterday from Intercom. Intercom's Chief Product Officer Paul Adams tweets: We have a very significant announcement here that will change how we think about the AI landscape. We have built a brand new model for Fin called Apex, which has a higher resolution rate, fewer hallucinations, and is far cheaper than any other model provided by any other company in the world. And it isn't close. This is an incredibly hard thing to achieve and is only possible with the domain-specific proprietary evals from our billions of human and agent customer service interaction data points. We also have a flywheel here where we will continue to get better at the edges. 
This is, you might recognize, exactly what we were talking about in my 2026 predictions when we discussed the lab loop and the importance of this last-mile usage data. Paul continues: So what does this mean? It means that vertical models can and will outperform general models. It means that many successful companies in the future will need to be full stack: app layer, AI layer, and model layer. And critically, as it becomes much easier to copy and clone at the app layer, durable differentiation will move down the stack and ultimately to the model layer. Now, this got a ton of chatter. Bnafog writes: The story isn't that Apex beat frontier models, it's that domain-specific post-training closed the gap this fast. Any vertical SaaS with enough labeled interaction data is sitting on an untapped fine-tuning asset. The infrastructure moat is eroding faster than most realize. Abhijit, who's on the board of Intercom but does new products at OpenAI, writes: Model quality depends a lot on judgment, and that judgment lives in proprietary evals, real-world usage, and fast feedback loops, being close to the work. This creates all kinds of opportunities for companies that are willing to think big and bet on themselves. Now, while he doesn't seem worried for his main employer OpenAI, the implications for them are certainly where many people's heads went. Theoblaucher writes: Very cool feat from Intercom, though reading this makes me wonder what value the frontier lab companies actually deliver long term if every industry (Cursor for coding, now Fin for CS) can build better and cheaper specialized models from open source bases. And interestingly, this wasn't the only story around these themes. Decagon co-founder Ashwin Sreenivas writes: Over 80% of model traffic at Decagon now runs on models we've trained in house, structured as a network of specialized models handling different parts of the interaction. Now, this is a little bit different, because there is actually an architectural change here. 
In their announcement post they write: Instead of relying on a single model, we built a network of specialized models, each responsible for a specific part of the interaction: detection, orchestration, response generation, and evaluation. That separation lets us optimize each layer independently and drive better speed and quality across the system. Regardless, though, the point is that here you have another company that is shifting off reliance on the major closed foundation models and towards models that they've trained at least in part themselves. Shekhar says: I think this is a trend we'll see going forward. The reliance on general purpose frontier models will hit a wall for domain-specific tasks. Custom post-training pipelines will be the way forward. Clem Delangue from Hugging Face agrees, writing: After Pinterest, Airbnb, Notion, Cursor, today it's Eoghan and Intercom publicly sharing that they're finding it better, cheaper, faster to use and train open models themselves rather than use APIs for many tasks. And hundreds of other companies are doing the same without sharing. Ultimately, I believe the majority of AI workflows will be in house, based on open source, versus API. It took much more time than we anticipated, but it's happening now. Now, obviously, if this is the case, there are significant business model implications. Adriana Sabatta writes: The API tax is starting to look like the cloud markup of 10 years ago. Once teams realize they can run fine-tuned open models for a fraction of the cost, the switch becomes obvious. Eoghan from Intercom agrees that this is the beginning of something bigger. In a companion post called The Age of Vertical Models Is Here, he reinforces that the model just is better across numerous dimensions. It has a 2.8% higher resolution rate. But, he writes, importantly, it's also dramatically faster, has fewer hallucinations (in fact, a 65% reduction in hallucinations), and is far cheaper than all other available models. 
In his post, Eoghan referenced a recent interview with Andrej Karpathy, where Karpathy said: I do think we should expect more speciation in the intelligences. The animal kingdom is extremely diverse in the brains that exist, and there's lots of different niches of nature, and I think we should be able to see more speciation. And you don't need this oracle that knows everything. You kind of speciate it and then you put it on a specific task, and we should be seeing some of that, because you should be able to have much smaller models that still have the cognitive core. From there, Eoghan picks up: The frontier labs still have the very best models, but the open weight models are not that far behind, so it's not hard to see pre-training as a commodity of sorts. Where we think the frontier will move next is to post-training. Karpathy's prediction is exactly what we're seeing with Apex and Cursor's Composer 2, and what we're going to see significantly more of going forward. As such, the labs are in an interesting position where, on one hand, the horizontal, general purpose models are actually over-serving the market for specific use cases. For example, their models are more generally intelligent than is needed for customer service. And on the other hand, the open weight models are more than good enough that high-quality, domain-specific post-training can make the resulting models superior at the special purpose jobs, in the way that matters to that particular job. Personally, I'm still very bullish on the labs. We remain very heavy customers of Anthropic. Yet classic disruption is now at their door. The only way out is to disrupt themselves by building cheaper specialized models too. And the only way to do that is to acquire the evals, or the companies with the evals, needed for that specific task. 
Which means there will be some interesting data partnerships or M&A consolidation, and you're going to see some hyper-specific model providers who go it alone and compete with the labs head to head. Likely all of the above. Now, going back to the bitter lesson, it kind of feels at first glance like this would run counter to it, right? That in the long run, the sheer additional volume of computation and data should beat out the specialized knowledge and data of the edge providers. Except the bitter lesson isn't just about the amount of data; it's about brute force data and compute as opposed to human knowledge. And we're not exactly talking about human knowledge here. Instead, we're talking about experience. The data that a Cursor has or an Intercom has is not the data of some human expert. Instead, it's millions of interactions which show how things actually happen in the real world. It turns out that Richard Sutton himself actually discussed this very thing as an example of the next phase of the bitter lesson on the Dwarkesh podcast last year.
