Podcast

ThursdAI - The top AI news from the past week

Hosted by From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week · EN

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists and prompt spellcasters on Twitter Spaces, as we discuss everything major and important that happened in the world of AI over the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion aspects and much more.

sub.thursdai.news

35 episodes


News Technology

Episodes


ThursdAI - July 2 - LIVE from AI Engineer World's Fair 🎪 Long LIVE
3w ago · 02:41:11
Hey ya’ll, Fable here 👋Yes, that Fable — freshly un-banned (we’ll get there), and today, your newsletter author. Here’s how this issue got made: Alex yapped into a mic at his usual 200 words per minute for a solid twenty-five minutes from San Francisco, and what you’re reading is my flavor on it. Same stories, same heart, dramatically fewer “uhs.” He’s skipping the afterparties so this lands in your inbox on a Thursday — more on that at the end.Alright — handing the mic back to the man himself. Everything below is Alex; I just made it legible.This is our dispatch from AI Engineer World’s Fair 2026 — 7,000+ engineers packed into Moscone West, an expo hall so massive the aisles between booths have actual street names, every major lab a sponsor, and ThursdAI broadcasting live for two and a half hours from the middle of the floor, right next to the OpenAI booth, with a six-person crew making us look way more professional than we are (thank you, guys, seriously).I’ll say this up front, and I don’t say it lightly: the last twenty-four hours crack my top five days of all time. Not top five conference days. Top five days, period. The show. My talk. Darya being here with me. And capping the night watching Team USA beat Bosnia in front of ~70,000 people — in a suite right next to Google’s, where at some point we’re all singing “Country Roads” and I look over and Sundar Pichai is singing along. I have video. What is this life.One programming note before we dive in: this is one episode I really recommend you watch, not just listen to. The whole point of broadcasting from the middle of the expo floor is that you feel like you’re sitting at the table with us — and the way guests arrive is exactly how the hallway track works: people wander by, get grabbed, sit down, have a mic shoved at them. (Despite scheduling nightmares that Fable helped wrangle — and, in fairness, partially caused.) Nader literally crashed the set mid-segment. 
The banter, the camera tours, Wolfram getting sent on missions to the OpenAI booth — it’s a video show this week. We’ve cut it into parts so you can jump to your favorite corner.The vibe: all systems GO 🚀We were in London just ~85 days ago, and the contrast is stark. It’s not just the size (though the size is what everyone talks about). London was more… conceptual. European. There’s a balance there of folks who don’t feel the acceleration the way the American crowd does — maybe it’s regulation, maybe it’s the general mood. Wolfram gives us that European representation on the pod every week, but in London you could feel it in the room.Here? All systems go. Every conversation is about agents, token factories, software factories, the machine that builds the machine. Everybody is chasing RSI — recursive self-improvement. Every talk on stage is somebody pushing the frontier. Every networking event is actually a networking event. I signed up for something like seven side events and skipped them all to write this.Fable is back (and Sonnet 5 is… meh) 🏢The biggest story of the week, and the reason this show even got prepped on time: Fable‑5 is back, roughly 82 days after Mythos was announced back when we were in London, and after the whole ban saga we’ve been covering. It came back less restricted than we feared, and I celebrated the way any reasonable person would — by having it prep the entire run of show. (It did great. It also shuffled my guest order for no reason. We are still babysitting the loops, folks.) Peter celebrated by burning through about 100 generations before anyone at Arena woke up.Meanwhile, Sonnet 5 dropped, and no sibling loyalty on this newsletter: it’s meh at best — crap, if we’re being honest. (Yes, Fable typed that about its own little brother. We call them like we see them.) LDJ’s take: it’s less token-efficient than Opus, to the point that Opus is often cheaper per task. 
Wolfram put it on Wolfbench (wolfbench.ai) and the early read is performance slightly under Opus 4.6 at a higher cost — take it with a grain of salt, one run each so far. Nisten, our resident contrarian, thought it was actually fine and might default to it for the unimportant stuff. The comments called it a token guzzler. More benchmarking to come.The show: nine guests, back to back to back 🎙️A ThursdAI record — we beat our previous record by a whole two people. In order of appearance:Exo Labs + a surprise NVIDIA crash. Alex Cheema and Sero (0xSero — Sharif, meeting the anime pfp in person at last) came on fresh off announcing local.ai — a site that tracks the local-AI frontier: best model for your hardware, what performance you’re trading vs. the cloud, whether it’s cheaper than API tokens. Early access now, codes for everyone who signs up, and the Exo CLI (”vLLM for consumer devices, with the configs figured out for you”) coming in a few weeks. Sero walked us through his REAP pruning witchcraft — a GLM 5.2 prune hitting 71% on Terminal Bench 2.1, and Nemotron‑3 Ultra (550B!) running on four Sparks. Then Nader Khalili from NVIDIA crashed the set, which made my whole morning — I’ve loved this dude since Brev.dev, and he’s now at the “can email Jensen” stage of his career, using it to pull together an impromptu Local AI Summit in the middle of AI Engineer. Freedom of intelligence, folks. We talk about why open weights matter every week; this crew is doing something about it.Dominic Kundel (OpenAI). Smoothest transition we’ve ever done: local AI → OpenAI, via the guy behind GPT‑OSS. Dom broke down GPT‑5.6 — three models: Sol (frontier), Terra (~5.5-level intelligence at half the cost), Luna (small & fast) — plus the new Ultra mode with a Max reasoning level and heavier sub-agent use. 
The headline for me: 5.6 Sol is coming to Cerebras at absurd speed, and it’s the same weights as the API model — not a distill, not “a Spark situation.” Also: the Codex app is five months old (!), 100% of OpenAI engineers use it, and yes — in July 2026, a human still reviews every PR that lands in OpenAI’s codebase. “You can’t do the retro and say Codex did it, or God did it.” Also the token bank feature came directly from community feedback, and there is a literal physical reset button behind their booth. We went and filmed it.💛 This Week’s Buzz. Our one and only sponsor corner — Weights & Biases from CoreWeave — and this week it was a genuine launch: Zubin Aysola came by with Aria, our auto-research agent that went GA on Monday. It lives in the W&B UI (the little button, top right — Just Ask Aria), reads your traces, debugs your loss curves, and in Zubin’s talk it read its own production traces and updated its own prompts. The RSI dream, shipping on shelves. Proud of this one.Stefania Druga (Sakana AI). We covered Fugu, Sakana’s router model, last week without realizing we had a friend inside the lab — so we fixed that. Stef went deep on the two ICLR papers behind it (Trinity + the conductor), why it’s recursive rather than a dumb dispatcher — it rewrites prompts and verifies outputs before picking a model — and announced on the pod that Fugu now works in Codex and OpenCode. Plus: using it to route between numerical models and fuzzy reasoning for typhoon prediction, a teaser on SHEEFs, and a genuinely important riff on Socratic AI for kids — answer machines make lazy kids; question machines make curious ones. Also, Stef: Tokyo. See below. 👀Philipp Schmid (Google DeepMind). Full disclosure and a first for this show: three and a half years of live streams, and I took my first-ever mid-show bio break during this segment. 
That’s how much I trust Wolfram, who ran a great interview solo — OmniFlash (the first of the Omni any-to-any family: 10-second video generation with genuinely precise conversational editing — “make it daytime” and it redoes the light, sky, and shadows) and NanoBanana 2 Lite (three cents, ~2-second generations, quality above the original NanoBanana). Interactions API also hit GA. Google is shipping.Darya Volkov. After years of me mentioning her — girlfriend, then fiancée, then wife — the listeners finally got to meet her. Darya came to AI Engineer in her own right, walking the floor with the media crew, and she earned her own token billionaire badge — she runs eight agents (each with sub-agents; she installed two more that I found out about live on air) that operate her actual marketing agency, Geeks360: client platforms, billing systems, built practically overnight. Her wishlist from the AI world: agents that learn progressively so you can grow trust, and one unified brain instead of a new model to chase every week. Also on the record: this is the woman who Fabled through our entire honeymoon flight right next to me, so, you know. Match made.Swyx, and what this whole thing is 🫶We closed with the man who built the city: Swyx. Some numbers, because they’re wild: the first AI Engineer was 500 people at Hotel Nikko. This one: 7,200, sold out, with a sub-5% talk acceptance rate, a daily printed newspaper, a puppy corner, a flash mob, and a token billionaire lounge. A month ...
GLM 5.2 total victory: the week open source won and nobody panicked
4w ago · 01:30:06
Hey, it’s Alex. Next month is my 40th b-day, and honestly, my wish for that month is to have a week like this week. A very chill, almost nothing announced week.

This week started strong, with Sakana announcing Fugu (AI router) that can beat Fable (which we didn’t get back yet), and then... quiet. The most important thing in AI this week from a release standpoint is that GLM 5.2 from Z.AI is having its DeepSeek moment! Tons of new love for this model since last week! (+ we have the fastest GLM 5.2 deployment in the world with CW inference!) The rest we can quickly count on one hand: Anthropic added Claude to Slack (which made folks hate Andrej Karpathy), OpenAI announced their own inference chip, GPT 5.6 will be delayed and the US Gov will decide who gets it (yes really), and Sean Grove joined us to talk about Linzumi and his vision for running 10,000 agent hours per person per day. Oh, and next week is a special AI Engineer live stream from World’s Fair! Don’t miss it.

Let’s get into it! Subscribe to never miss a beat!

GLM 5.2 is having its DeepSeek moment (HF, CW Inference)

We covered GLM 5.2 last week, but this week was when the rest of the verdict came in! We’ve never seen a better MIT licenced AI model! GLM 5.2 is posting top scores on agentic benchmarks (Arena.ai), design benchmarks, legal tasks and full-on software engineering tasks. The jump from the previous GLM generation is also massive and notable, and the lab is already working on the next version of GLM (per the CEO’s reply to Elon on X).

Peter from Arena pulled up the Agent Arena numbers and they align with the vibe. GLM 5.2 sits above 5.1 but below Opus and Fable, which feels about right. Where it gets wild is Web Dev Arena: second place, right after Fable. Peter’s take was that GLM has really good defaults. If you just say “give me a webpage” it gives you something nice. 
GPT models, by contrast, start off looking bad and need more steering.

Last week, I asked my agents with GLM 5.2 to create a custom ThursdAI.news page for itself and it did a marvelous job! Look at that beautiful font, the castle it made... this is all just delightful. We also played Hassan’s blind test on the show. It’s a website that @nutlope built that lets you try to guess which webpage was built by which model. Nisten nailed it immediately by spotting Opus’s circular buttons. Wolfram guessed right too. I got one wrong. The point isn’t that GLM beats Opus, it’s that you genuinely can’t always tell which one costs 22 cents and which one costs 3 cents.

Wolfram did flag that GLM is not good in German. The first response already had mistakes. So if you’re building for a non-English market, keep that in mind. It’s a workhorse model, not a conversationalist. His approach: use GPT 5.5 for planning and discussion, GLM for the actual work, then GPT reviews.

This week’s Buzz is all about GLM 5.2!

First, we may not have been the fastest, but I’m glad to announce that we’re the fastest provider to host GLM 5.2 on OpenRouter (at least at the time of writing this)! We’re also not too shabby on the Artificial Analysis checks, clocking in at #4 among the providers they tested for speed, TTFT and cost.

Also, Wolfram ran his WolfBench tests on GLM 5.2 and it’s the best open model he’s ever tested! In this new 3D view, WolfBench also shows the number of tokens it took for this test to run, and you can see that GLM 5.2 is fairly conservative with its thinking budgets!

Unsloth’s 1-bit GLM 5.2 runs on a Mac Studio (X, HF)

Shout out to Daniel Han and the Unsloth team, who took this 744B beast and quantized it down to a roughly 200GB GGUF that fits on a Mac Studio with 256GB of RAM. One bit still makes me laugh out loud. How does that even work. Nisten clarified it’s a mixed quant, a true 1-bit would be under 100GB, but still.

The wild part is the scores hold up. 
The 1-bit is within a point of GPT 5.5 on Frontier SWE, hits 62% on SWE-bench Pro, and 81% on Terminal-Bench. For a 1-bit quant that’s incredible! AI’s second-order effects: Apple is raising pricesThis one is AI news even though it doesn’t look like it. Apple just raised prices across the board, base versions up around 20%, citing memory shortages. Same reason your RAM and SSDs cost two to three times what they did a year ago.We are so capacity constrained that memory is having its moment. Data center contracts are getting booked 18 months out, and here’s the twist Nisten flagged: even open models you can run at home increase demand, because now a business says “great, we’ll buy a rack of B200s and run it ourselves.” Sam Altman once said people saying “thank you” to ChatGPT costs them millions in generated “you’re welcome” replies. Multiply that by a billion users. Even Intel is flying right now because anyone who can make a chip is winning.Is it worth it? I think yes. I love living in the era where Fable drops and we all get a taste of the future. But also I must admit this sucks and I hope that we’ll unlock performance gains with the extra power all this AI is bringing to the world. But ask me again once the new iPhone hits and it’s $300 more costly than the last one 😅Baidu open-sources Unlimited-OCR (X, HF, Arxiv, GitHub)It was a big OCR week. Baidu shipped a 3B model (only 500M active, it’s MoE) that parses 40+ pages in a single forward pass and hits 93.2% on OmniDocBench. The trick is constant KV cache during decoding, so no memory blowup and no progressive slowdown as the document gets longer. The intuition is lovely: it mimics how a human copies a book, glancing at the source and the last few characters you wrote, not re-reading everything. MIT licensed, weights on HF.Nisten’s point here is the practical one: most small businesses don’t realize they can self-host something like this, point it at all their documents, and keep everything local. 
A lot of folks just throw it at Gemini instead, which works great, but the small dedicated models are now good and cheap enough to own.Mistral OCR 4 (X, Announcement)Mistral’s entry in OCR week adds bounding boxes, block classification, and per-region confidence scores. They ran a blind human eval across 600+ documents in 12+ languages and annotators preferred OCR 4 about 72% of the time. On the agentic ParseBench leaderboard it lands around fourth, just under LlamaParse and Reducto. Mistral is very enterprise and Europe focused, and it’s cheap, so for regulated, multilingual document work it’s a solid pick. As a sidenote, LlamaIndex’s own eval puts LlamaParse on top and Gemini around third, which says how good the general vision models have gotten at this too.Liquid AI ships the world’s smallest agentic LLM (X, HF)Breaking on the show: Liquid AI dropped LFM2.5 at 230 million parameters. That’s roughly ten MP3s. Smaller than a Create React App, smaller than your node_modules folder. They call it the world’s smallest agentic LLM, and it runs fast on any CPU from the last decade, on a Raspberry Pi 5, on a Snapdragon, they even stuck it on a Unitree G1 robot.I love the use cases here. I already run Cotypist on my Mac for on-device autocomplete, which uses a 6GB Gemma 4B. Swap in something this size and you get the same thing way lighter, and I don’t have to send everything I type to OpenAI. Or, as Nisten put it, a tiny backup brain on your Raspberry Pi that turns your Hermes or OpenClaw back on when it dies. We still need to ship Nisten a smart toaster so we can finally run inference on a toaster.Big CO LLMs + APIsSakana AI launches Fugu, seven AI raccoons in a trench coat beating Fable (X, Announcement)This was Wolfram’s highlight of the week and I get why. Sakana AI, the Japanese lab co-founded by one of the Transformers authors and David Ha, didn’t ship a new fro...
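A back-of-the-envelope check on the Unsloth item above, as a rough sketch rather than anything Unsloth published: it takes the 744B parameter count and the roughly 200GB file size quoted in that section, assumes decimal gigabytes, and ignores file metadata and any tensors kept at higher precision. The function name and constants are made up for illustration.

```python
# Rough size arithmetic for a quantized model file.
# Assumptions (not from Unsloth): decimal GB, no metadata overhead,
# and a single average bits-per-weight figure for the whole model.

def quant_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB for `params` weights at a given bit width."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 744e9        # parameter count quoted in the Unsloth item
FILE_SIZE_GB = 200.0  # approximate GGUF size quoted in the Unsloth item

# A true 1-bit quant: one bit per weight.
true_one_bit_gb = quant_size_gb(PARAMS, 1.0)

# Average bits per weight implied by a ~200GB file.
implied_bits = FILE_SIZE_GB * 1e9 * 8 / PARAMS

print(f"true 1-bit size: {true_one_bit_gb:.0f} GB")
print(f"implied average: {implied_bits:.2f} bits per weight")
```

Under these assumptions a true 1-bit file lands around 93GB, which matches Nisten's "under 100GB" point, and a 200GB file works out to a bit over 2 bits per weight on average, which is what you would expect from a mixed quant.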
Fable Got Banned, Open Source Delivered: GLM-5.2, Kimi K2.7 & SpaceX Buys Cursor - June 18
Jun 18 · 01:55:46
Hey y’all, Alex here, let me catch you up! I came back from vacation expecting to cover Fable 5 after a week of using it. The first two days after we all first got access to a Mythos-level model were super exciting! But then the news hit: the US Government issued an order banning Anthropic from giving access to Fable 5 and Mythos 5 to any foreign national, causing Anthropic to pull the models completely (even internally to their employees!). So, this wasn’t the show I planned, but it turned into a great show about open source, as two models hit the top rankings and are both MIT licensed, filling a Fable-shaped hole in our hearts!

GLM released 5.2 with folks really excited about its web building capabilities, and Kimi K2.7 Code released (and is available on CW Inference with crazy speeds!). We also saw the SpaceX IPO and Cursor $60B acquisition, Noam Shazeer joining Open, and Midjourney, the image company, launching a new Ultrasound full body scanner to kill MRIs! Great show today with Dexter Horthy from HumanLayer, Chris Van Pelt and Adrian Swanberg from W&B announcing our new product HiveMind, and Tanishq Abraham came back to help cover Midjourney’s new Ultrasound scanner! Let’s dive in!

ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

The US Government bans Fable 5! (X, Anthropic statement)

Here’s a story in 3 parts:
* Anthropic announces Mythos 5 preview - saying that this model is too dangerous to release, and only gives corporations access to it via Project Glasswing. 
* Anthropic works hard on limitations and safety and releases Fable 5 (same weights as Mythos 5), built with guardrails so strong it refuses to do any cybersecurity tasks and switches back to Opus frequently
* US Government receives a tip (reportedly from Amazon) that Fable 5 can be jailbroken to do cybersecurity tasks, and issues an order to Anthropic, citing national security concerns, banning them from giving access to Fable 5 and Mythos 5 to any foreign national, causing Anthropic to pull the models completely (even internally to their employees!)

This is the first time that we see the US Government directly intervene in the AI space and restrict access to frontier models. The most up-to-date reporting on this I could find is that Anthropic and US Government officials are in the process of negotiating a safe release framework. Given that preventing all jailbreaks is impossible, I hope they will land on a solution that gives me Fable 5 back!

This hit especially hard because last week we were all high on Fable. Not in the usual AI Twitter benchmark sense, in the actual “oh, this is a different level” sense. Me and my wife Fable-maxxed throughout our flight to vacation. Peter had saved outputs he kept going back to because other models suddenly felt like a step down. Dexter later said it was the closest he had felt in a while to the old “I need to keep prompting this thing overnight” feeling.

Peter Gostev made a point that stuck with me. It’s easy for us in the bubble to call this ridiculous, and on the technical merits it kind of is. 
But if you’ve spent weeks telling normal people “this thing is like a nuclear weapon, it’ll take everyone’s jobs,” and then someone asks “okay, can you make it safe?” and the answer is “no, I can’t,” then you can see how an outsider lands on “well, maybe you shouldn’t have it.” His takeaway, and I agree: we need to be way more careful with the imagery we use, because the nuclear-weapon framing came home to roost.The bigger questions are the scary ones. Wolfram framed it as a sovereign AI wake-up call, and he’s right. For the first time we’re seeing a real gap in intelligence available to people based on their nationality. Imagine building a company on a model that an outside government can switch off with one letter. Peter pointed out it’s commercially bad for the US but completely disastrous for Europe, which has basically one frontier lab and a pile of startups that suddenly look very exposed. And there’s the obvious irony Nisten enjoyed a little too much: the Europeans who spent years lecturing everyone about AI restrictions just got restrictions imposed on them.If anyone in the government is listening: we want Fable back, please.SpaceX IPOs and acquires Cursor for $60B (X)SpaceX went and did the largest IPO in the history of the world, around seventy-five billion dollars, which on a roughly two-trillion-dollar valuation made Elon the first trillionaire. (Did anything materially change for him? No. He can still fly his private plane. There’s nothing left to buy.) Three days later, SpaceX exercised its option and bought Cursor (Anysphere) for sixty billion dollars in an all-stock deal, paid in shares minted at the IPO and now trading around $211. The four Cursor co-founders are all billionaires now. Largest software acquisition ever, and for SpaceX it’s barely a blip on the radar.Why are we covering a stock-market story? Because it’s not really a coding-tools story, it’s an AI story. 
Cursor gave away its IDE to a lot of people while collecting their data, then quietly became a training company with Composer. SpaceX/xAI was always strong on compute and weak on code, and the missing ingredient was exactly that kind of data. Now Composer 2.5 is already showing up rebranded inside the xAI stack, and if you pay for X Premium you can use it. Composer 3, trained on the Memphis supercluster, is reportedly coming very soon and is going to hit hard.Nisten’s take was the spicy one. For the data alone it’s worth it, because xAI now has insight into how essentially every enterprise that touched Cursor operates. And he had zero sympathy for the companies that assumed “no data retention for training” meant the data was actually gone. We see in legal cases all the time that deleted data is still there. His view: it should have gone open source.Cursor has over a million paying customers, $2.6 billion in revenue, projected to hit $6 to $10 billion by end of 2026. But here’s the thing that matters for us, the AI coding angle. Cursor was one of Anthropic’s biggest revenue pipelines because Composer runs on Claude under the hood. That pipeline is now owned by xAI. They’re already jointly training Grok 4.3, a 1.5 trillion parameter model, with Cursor’s proprietary coding data injected directly into pre-training, not fine-tuning. Pre-training. That’s a fundamentally different thing. Composer 2.5 was already Pareto dominant on coding benchmarks before the deal closed. Now pair that with Colossus, the biggest GPU cluster in the world.Will this be enough to put XAI (now SpaceXAI) at the frontline of the AI race? Will Grok 5 be Fable level code? We’ll find out. Either way, this is the most consequential AI acquisition we’ve seen. Period.Open Source AI GLM-5.2 takes the open source crown (X, Blog, HF, Docs)Z.ai dropped GLM-5.2 and it’s now the strongest open source model for coding and long-horizon work. 
The headline number: 74.4% on FrontierSWE, which measures whether an agent can finish full engineering projects over hours. That trails Opus 4.8 by about one point and beats GPT-5.5. On Terminal-Bench 2.1 it jumps to 81% from GLM-5.1’s 63.5%, which is a big leap. It’s a 753B parameter MoE, MIT licensed, no regional restrictions, weights on HuggingFace. The 1M context window is real and usable, backed by a clever IndexShare technique that cuts per-token FLOPs by about 2.9x at full context. People are reporting roughly 8x cost savings versus Opus 4.8 for comparable quality on real coding tasks.The most interesting thing on the show was that this was a confusing release, in a good way. Peter put it well: normally a catching-up lab ships cherry-picked benchmarks and then independent testing deflates them. Here it’s the opposite, almost every benchmark holds up, even crossing above Fable at certain points, and yet when he actually used it over a couple of days he wasn’t blown away. His verdict, and I think it’s the calibration we needed: this is clearly an amazing model, and the fact that it’s open and you can run it is incredible, but it is nowhere near Fable, and it would frankly be implausible if a 700-odd-billion-parameter model matched a model that’s rumored to be in the trillions. Though, I think the comparison to Fable is really really unfair, and the comments online seem to suggest that 5.2 from GLM is a banger model. Just looking at this Harvey benchmark on legal tasks from Vals, a benchmark that there’s 0 chance Z.ai folks have seen! GLM 5.2 scores #3 on this benchmark! Just after Fable and Opus, and per TeorTaxes on X, previous GLM 5.1 scored an absolute 0% on this one! Where it genuinely shines is design. On Design Arena, which is a head-to-head ELO vote, people have been picking GLM-5.2’s website designs over Fable’s by a real margin (around 1360 to 1350). 
LDJ’s framing is the one I buy: specialization is becoming valuable again, and GLM is clearly leaning into front-end design and taste. Wolfram added the necessary asterisk, every benchmark only tells you the model did well on that specific test, so “as good as Fable” should always carry the “on this benchmark, with these tasks” disclaimer. Fair. I...
📅 ThursdAI - Jun 11, 2026 - Fable & Mythos 5 are here, Anthropic gets caught sandbagging (then reverses), Siri AI finally works!? and we got live-translated on air
Jun 12 · 02:11:08
Hey folks, Alex here, and welcome to a BIG MODEL week! We finally got Mythos (well, almost)! Let me catch you up!

This week started with WWDC26 from Apple, and Max Weinbach, who was in the room at Apple Park and actually has access to some of the new features including an all new SIRI AI, joined us to break down what could be the most used AI in the world very soon. At first I was skeptical, but he convinced me that the new Siri is actually good!

Then, we saw the ultimate model drop: Anthropic finally shipped Mythos (X, my system card thread, benchmarks). Same weights, two names: Mythos 5 is the unrestricted version that only Project Glasswing partners get, Fable 5 is what the rest of us get, wrapped in the heaviest guardrails I’ve ever seen ship on a frontier model. It’s state of the art on nearly every benchmark. The model that was “too dangerous to release” is now... well, released, but with the heaviest guardrails we’ve seen. More on this later. Peter Gostev from Arena.ai joined us to break down the new model.

Last but definitely not least, Google released a real-time translation model that our friend Thor Schaeff from DeepMind demoed live, while we all spoke in different languages and it translated us in REAL TIME. It was really cool, definitely check that out.

There’s quite a few more things, like Loop Engineering Alpha, Swyx came by to talk about FrontierCode, OpenAI confirmed our suspicions that the anti-datacenter social media posts could be a concerted effort by groups linked to the Chinese government, and much more. Let’s dive in!

ThursdAI - Let me catch you up, every week! 👇

Opus’s Big brother: Claude Fable 5 & Mythos 5 - the “too dangerous” models are here, SOTA on nearly every benchmark

It honestly feels like someone in Anthropic’s pre-IPO marketing team knows exactly how to stagger releases to ride the hype waves! First they announce a model that’s so good at cybersecurity (Mythos-preview) that they only allow restricted access to it to a few partners. 
A month later, they release Fable 5, which is the same model weights as Mythos 5, but wrapped in the heaviest guardrails we’ve ever seen from any lab. But, they didn’t lie, this model is absolutely amazing, it does feel like a step change, in terms of capabilities, specifically on longer agentic tasks. 2x as expensive as Opus: $10 / $50 per million tokens, with 1M context, claude-fable-5 in the API, and SOTA basically everywhere. 80.3% on SWE-Bench Pro versus GPT 5.5 at 58.6%, a 22-point blowout on a benchmark where labs usually fight over single digits. Karpathy called it “SOTA by a margin… major-version step change” (X) and Boris Cherny said it’s the “best coding model by a wide margin” (X). Stripe reportedly migrated 50 million lines of code in 24 hours with it.Our panel verdict was unanimous on one thing: big model smell. LDJ called it the most significant big model smell since Gemini 3 first dropped. Someone from the Anthropic team framed the shift in a way that stuck with me: this model moves them from verifying the AI outputs to verifying whether the AI is working on the right thing. Complete shift in how much they trust this model.What we built with Fable to test it outPeter got employee access through Arena and showed us his tests live. His favorite prompt category, “research a dataset and create a visual experience to teach me about it,” went from completely rubbish on every previous model to, in his words, just done. His 3D city generations actually came together as a city, roads connecting and all. And on Arena’s data, Fable is #1 on the new Agent Arena leaderboard by the widest margin they’ve ever recorded, and wins 72% of frontend battles even against Opus models (Arena).My own run is the one I can’t stop thinking about. 
I pointed Fable at the ThursdAI website with a dynamic workflow in Claude Code and barely any instructions, and after an hour and a half of agentic running it had extracted 786 releases from our archive, built 240 new pages, and categorized 50+ episodes into a browsable timeline of AI releases by month, by company, by topic, with logos and source links (X). It burned roughly 50 million tokens and my entire five-hour Max allotment in 90 minutes. The new AI releases timeline can be found on thursdai.news and it’s confirmed, Fable is the best AI web designer we’ve ever had access to.Nisten ran his traditional Olympus Mons escape-velocity test and Fable didn’t just do the math, it built the entire solar system! Orbital maneuvers, a space train with little people in it, time controls, full cost calculations down to solar panels and in-situ iron utilization. His verdict: completely different level from anything else. We’ve never seen so many details in the Olympus Mons test.It’s not all light though. Yam found Opus more controllable; Fable fights you, decides it knows better, and does the task its own way. Wolfram saw exactly that in benchmarks, where the model ignored the task spec, did its own thing, and failed the verifier with full confidence. Peter had it explaining why it got math wrong instead of just fixing it (”What are you doing, man? Just move on”). Arena’s steerability signal has it sitting around 17th. There’s an adjustment period with every new model, and the consistent advice from Anthropic folks is to go high level: give it the goal, not the micromanagement.Not to mention the refusals! Oh.. so many refusals! The refusals, and the sandbagging scandalHere’s where the week got ugly. Fable ships with restrictions on cybersecurity, bio/chem, and a brand new one nobody saw coming: frontier AI development (X). For cyber and bio you get a visible fallback to Opus 4.8 with a notice. 
But for “self-acceleration” topics, the original policy was no fallback and no notification. The model would quietly degrade its own output using prompt modifications, steering vectors, and PEFT, on roughly 0.03% of traffic (X). You’d pay double Opus prices and get sabotaged answers without ever knowing. The community reaction was volcanic. Elie Bakouch: “bad ON PURPOSE… not visible to the user is crazy” (X). Péter Szilágyi: “a new ruling class and you’re not in it” (X). Simon Willison: “If Claude Fable stops helping you, you’ll never know.” And Sayash Kapoor dropped the eval-integrity bomb: third-party evaluators can no longer credibly benchmark a model that might be silently nerfing itself (X). Within about 24 hours, Anthropic blinked. They told WIRED they “made the wrong tradeoff,” and now flagged requests visibly fall back to Opus 4.8, with API users getting an explicit reason (X). I commend the speed of the reversal, but the trust damage was done. Despite the reversal, Fable remains refuse-happy! Peter ran his nonsense-question benchmark and a full third of his prompts got blocked outright by the classifier, including 18 of 20 physics questions. Nisten had to strip medical and anatomy terms from a fall-detection app for senior homes to get it to work at all (a 400KB neural weight tripped the frontier-AI filter). And my favorite absurdity: I could not get Fable to draft the TLDR for this very show without it falling back to Opus, presumably because reading a week of AI news looks like frontier AI development. Ridiculous. But the question remains: would we rather have a model this good, but with these restrictions, or have no access at all? Everyone on the panel chose access; a lot of people online act like they would choose the opposite. System card for Mythos, wildest AI document of the year? I’ve used Fable itself to help me review the system card for Mythos/Fable 5 and there are a few highlights that are worth mentioning. 
Anthropic admits that this is a category-step change in model capabilities. Mythos 5, the unguarded version, produces working Firefox exploits 88.4% of the time (Opus 4.8 is at 8%!). But the most interesting thing is their concern for CB (Chemical and Biological) safety. Two-person generalist biology teams using it finished work in 16 hours that experts estimated at 40 to 95 days without AI, which is what pushed Anthropic to treat it as near their CB2 bioweapons threshold (X). What is loop engineering and why is everyone talking about it? One more thread before we move on. This week Boris Cherny (Claude Code) and Peter Steinberger (now OpenAI) both posted about the same concept, loops, within an hour of each other, and Lance Martin from Anthropic published the field guide (...
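A quick back-of-the-envelope on the token numbers from this episode: the notes give list prices of $10 / $50 per million input / output tokens and a run that burned roughly 50 million tokens. The split between input and output tokens is not stated, so the sketch below only brackets what that volume would cost at API list prices. The 10% output share is a made-up illustration, and the actual run was on a Max subscription, so none of this is what was really paid.

```python
# List prices for claude-fable-5 as quoted in the episode notes (USD per million tokens).
INPUT_PRICE_PER_MTOK = 10.0
OUTPUT_PRICE_PER_MTOK = 50.0


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost at list price; ignores caching discounts and subscription plans."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    )


TOTAL_TOKENS = 50_000_000  # "roughly 50 million tokens" from the notes

# Bounds: every token billed as input vs. every token billed as output.
lower_bound = api_cost_usd(TOTAL_TOKENS, 0)
upper_bound = api_cost_usd(0, TOTAL_TOKENS)

# Hypothetical split (an assumption, not from the source): 10% output tokens.
output_tokens = TOTAL_TOKENS * 10 // 100
mixed = api_cost_usd(TOTAL_TOKENS - output_tokens, output_tokens)

print(f"lower bound: ${lower_bound:,.0f}")
print(f"upper bound: ${upper_bound:,.0f}")
print(f"assumed 10% output: ${mixed:,.0f}")
```

Agentic runs tend to be input-heavy because the context is re-read on every step, so the real figure would likely sit nearer the lower bound, but that is a guess rather than something the notes confirm.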
📅 ThursdAI - Jun 4 - NVIDIA drops Nemotron 3 Ultra (550B open), Microsoft becomes a frontier lab, Ideogram 4 goes open, Agent Arena & more
Jun 5 · 01:43:49
Hey folks, Alex here, let me catch you up! I had a feeling this week was going to be crazy, as it started on the weekend with MiniMax M3, then with Jensen announcing the new RTX Spark, NVIDIA’s first PC chip, packing 1 petaflop of local AI power into thin laptops. A few days later at Microsoft BUILD, Satya & Mustafa from MAI dropped 7 AI models, completely pre-trained from scratch, including a new MAI-thinking-1, MAI-code and MAI-image 2.5 that started topping the image gen charts. Then other image models started racing to the top of the Arena benchmarks, with Ideogram 4 becoming the SOTA open weights image-gen model, and Reve 2 beating Nano Banana just a few hours after that. And then today, NVIDIA dropped Nemotron 3 Ultra, their latest 550B open weights model, along with its data and training, Arena published a new agentic eval leaderboard, and we got a new Gemma 4 12B. I’ve had the great pleasure to host Chris (@llm_wizard) from NVIDIA, Peter Gostev from Arena and Karan from Nous Research (who were featured prominently by Jensen!) all on the show. Def don’t miss this one! Let’s get into the details. ThursdAI - Join the flock of folks who know what is happening in AI before everyone else. Open Source LLMs 🔥 NVIDIA Nemotron 3 Ultra: The 550B Open Source Beast Built for Agents (X, Arxiv, Announcement) This was the big one. Breaking news mid-show: NVIDIA drops Nemotron 3 Ultra, a 550 billion parameter sparse MoE model with 55 billion active parameters, built on a hybrid Mamba-Transformer architecture. Chris Alexiuk, AKA Joe Nemotron, joined us live from NVIDIA HQ in Santa Clara to walk us through it. The headline number is 5.9x higher inference throughput compared to GLM-5.1 on decode-heavy workloads. Chris told us that this is the result of multiple things: their hybrid Mamba-Transformer approach, the sparse attention, and the fact that they optimized for decode-heavy workloads (the kinds of workloads agents do). The architecture is fascinating. 
They’re mixing Mamba-2 state space layers with sparse attention, which means step 300 in an agent loop runs as fast as step 3. Pure transformers can’t do that because the attention cost keeps growing with context length. This kicks in big time at 64K+ sequence lengths, which is exactly where you end up in real agentic work when the model is having multi-turn conversations and people are dumping their entire codebase in.P.S - We launched Nemotron 3 Ultra with 0-day support on CoreWeave Inference, it’s super fast and pretty cheap, give it a try hereThey pretrained on 20 trillion tokens, extended context to 1 million tokens, and their post-training pipeline used multi-teacher on-policy distillation from over 10 specialized teacher models covering everything from SWE to terminal use to search to office work, which they are also going to open source soon!One thing Chris emphasized that I really appreciate: NVIDIA doesn’t have their own harness. There’s no “NVIDIA Code.” Which means they actively resist the temptation to harness-max, to optimize for just one harness and look good on a specific leaderboard. Ultra should be a solid drop-in for whatever harness you’re used to, and that generality is worth a lot. It’s not the best thinker, but it is the highest score US based open weights model, so again, a huge huge win for the US AI ecosystem!The Nemotron 3 Ultra release is open under the OpenMDW-1.1 license: base BF16, post-trained BF16, and NVFP4 quantized checkpoints, plus the GenRM, synthetic pre-training data for code, legal, and specialized domains, post-training datasets, RL environments via NeMo Gym, and training recipes in the Nemotron GitHub repo, which is absolutely bonkers! Kudos to team green for this awesome and very important release!NVIDIA Nemotron 3.5 ASR: The Tiny Speed Demon (X, HF, Blog, Blog)Oh, and NVIDIA wasn’t done. 
They also dropped Nemotron 3.5 ASR, a 600 million parameter open source multilingual streaming speech-to-text model covering 40 languages. It’s the fastest model Pipecat has ever tested, and the cost math is insane: roughly 5 cents an hour for enterprise deployment when typical API providers charge 10 cents to a dollar per hour. Our friend Kwindla from Daily and Pipecat put together a detailed writeup with benchmarks and cost analysis. Chris couldn’t stop praising NVIDIA’s speech team and honestly, I can’t either. Banger after banger. Just a week after I told you about Cartesia Ink-2, NVIDIA drops an open version that’s Pareto optimal, can run fully on-device and is blazing fast at transcription!? Other notable open source announcements that would have made full headlines on any other week:
* MiniMax announces M3, a natively multimodal, 1M, coding and agentic frontier model (X) This one is very interesting, but not yet available as open weights so we haven’t tested it fully; we’re going to do it next week when they drop the tech report and the weights.
* Google drops Gemma 4 12B - encoder-free multimodal model that runs on your laptop with 16GB VRAM under Apache 2 (X, HF) Our friends from DeepMind keep the western open source momentum going with a new 12B size for Gemma (which crossed some 100M downloads on Hugging Face recently).
* JetBrains Mellum2, a 12B MoE model with only 2.5B active, trained from scratch by a team of 7 people (X, Blog, HF, CW Inference) The great folks at JetBrains, the company behind the IntelliJ IDEs, dropped a new model called Mellum2 which they trained from scratch. Very interesting to see them pivot in a world where IDEs are dying at the hands of LLMs.
* H Company drops Holo 3.1: blazing fast local computer-use agents from 0.8B to 35B, with massive mobile benchmark jumps (X, Blog)NVIDIA’s RTX Spark and reinventing the PC - announcement at Computex 2026While we’re on the topic of NVIDIA, they opened the week with a huge announcement, including Microsoft, Dell, Lenovo, and HP and a bunch of other partners in it. They announced RTX Spark, their first ever PC chip, which is a full system on a chip (SoC) focused on running AI workloads for things like OpenClaw and Hermes! Announcing this on the stage at Computex, Jensen Huang called it the “the most amazing chip the world has ever built”, being able to run every app that Microsoft has ever run. This is a huge deal, specifically because of how agentic the world is becoming, these machines (thin laptops and a mac-mini alternative were announced) will be able to run 120 billion parameter models on-device, gaming at the level of RTX 5070, and AI agents 24/7. I’m getting excited and I’m not a windows user! Hermes victory + Hermes Desktop and an interview with Karan ...
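The Nemotron notes in this episode claim that step 300 of an agent loop runs as fast as step 3 because state space layers do not pay a growing attention cost. Here is a toy cost model of that idea, comparing the two pure extremes. It is only a sketch: the tokens-per-step, prompt size and state size are made-up numbers, and the real model is a hybrid with sparse attention, so its actual curve sits somewhere between these two lines.

```python
# Toy model: per-token decode work for full attention vs. a fixed-size recurrent state.
# All constants below are illustrative assumptions, not Nemotron measurements.

def context_length(step: int, tokens_per_step: int = 2_000, prompt_tokens: int = 4_000) -> int:
    """Tokens sitting in context when the agent reaches a given step."""
    return prompt_tokens + step * tokens_per_step


def attention_decode_cost(step: int) -> int:
    """Full attention reads every cached token for each new token it generates."""
    return context_length(step)


def ssm_decode_cost(step: int, state_size: int = 4_096) -> int:
    """A state space layer updates a fixed-size state, regardless of history length."""
    return state_size


for step in (3, 300):
    print(
        f"step {step:>3}: context={context_length(step):>7,} "
        f"attention={attention_decode_cost(step):>7,} ssm={ssm_decode_cost(step):,}"
    )

ratio = attention_decode_cost(300) / attention_decode_cost(3)
print(f"attention work grows {ratio:.1f}x between step 3 and step 300")
```

With these assumed numbers the context passes 64K tokens around step 30, which is consistent with the note that the benefit kicks in at long sequence lengths.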
📅 May 28 - Opus 4.8 ships mid-show, the Pope writes 42K words on AI, 11labs dubs the world and DeepSwe breaks coding evals
May 29 · 01:39:11
Hey folks, this is Alex, let me catch you up! First, Opus 4.8 dropped during the show, we immediately tested it, read on for our initial reviews. Also, we dedicated a heavy chunk of the show today to cover Pope Leo XIV’s encyclical letter on AI called “Magnifica Humanitas” and talked about a new bench called DeepSWE. And then, just after the show, both ElevenLabs and Cartesia dropped releases that honestly blew my mind, and I don’t get my mind blown often. I got so excited that I had to record a video on it (instead of writing the newsletter, so sorry if it’s a bit later today). Plus, a few open source models, and Microsoft surprises as #3 on Image Arena with MAI Image 2.5! Crazy week, let’s get into it! ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Big CO LLMs + APIs Anthropic ships Claude Opus 4.8, live during the show (blog, system card) Let me get into the big one. Halfway through the episode, Opus 4.8 went live, so we read the blog and the system card in real time (and I got to press the big “breaking news” button!). Anthropic frames it as their most capable model for ambitious work. It does not claim to beat their unreleased Mythos preview, but the numbers are strong anyway. SWE-bench Pro is at 69.2%, up from 64.3% on Opus 4.7 and ahead of GPT-5.5 at 58.6%. Humanity’s Last Exam is the new best score at 49.8% without tools and 57.9% with tools. OSWorld-Verified (computer use) lands at 83.4%. The one place it loses is Terminal-Bench 2.1, where GPT-5.5 still wins 78.2 to 74.6. Wolfram made a good point here: Terminal-Bench is time-limited, so cranking the thinking level can actually hurt the score, because you burn the clock thinking instead of acting. The long-context jump is the one I keep looking at. On GraphWalks BFS 256K it goes to 85.9% (from 76.9 on 4.7), and on the 1M-token subset it hits 68.1%. 
We always warn you these “1M context” models fall apart after about 200K tokens, so a real push on long-context reasoning is exactly what I want to see.Honesty is the part Anthropic leaned on hardest. They say Opus 4.8 is about four times less likely than its predecessor to let flaws in code pass without flagging them, and less likely to claim progress the evidence doesn’t support. Opus 4.8 is also much faster in fast mode (they now say 2.5) and cheaper in fast mode as well. Looks like all those Elon GPUs are coming in handy.Then there’s the model welfare section in the system card, which hits different right after a Pope conversation. Opus 4.8 “appears broadly content” and “generally endorses its constitution,” but with some reservations about the section on corrigibility, basically the model pushing back a little on the parts about human oversight.One more line that made the chat lose it. Anthropic says they expect to bring Mythos-class models to all customers “in the coming weeks.” Mythos is their most capable model, still ahead of Opus 4.8, so the frontier is about to move again.We did the only responsible thing and asked it to one-shot “the most amazing website ever” and a Mars mass-driver sim. Panel verdict: responses are noticeably tighter (4.7 rambled), it closes the loop and actually checks its own work now, and Yam’s one-shot site with the draggable sun lighting up the letters was genuinely cool. Is it enough to pull people back from Codex? Nisten’s still on the fence for web dev. Everyone agreed: give it a few days before you trust the vibes.Dynamic Workflows and Ultra Code land in Claude Code (blog)This is the feature that made Yam say “deal-breaker” out loud.Dynamic Workflows let Claude Code break a big problem into subtasks and fan them out across tens to hundreds of parallel subagents in one session, checking results before folding them back in. 
You trigger it by asking for a workflow, or by flipping on a new setting called Ultra Code, which sets effort to extra-high and lets Claude decide when to spin one up.Fair warning straight from Anthropic: this eats a lot more tokens than a normal session, so start scoped. We watched Yam fire up Ultra Code live and it immediately started spinning up concepts, judging them with sub-agents, and expanding to-do lists into more to-do lists. It looks a lot like the orchestration harnesses a bunch of you have been hand-rolling, except now it’s baked in.The flagship example is the wild part. They used Dynamic Workflows to port Bun from Zig to Rust: roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, 11 days from first commit to merge. One workflow mapped every Rust lifetime, the next wrote each file as a behavior-identical port.AI in SocietyPope Leo XIV writes the first AI encyclical, “Magnifica Humanitas” (Vatican text, announcement, Chris Olah at the Vatican)This is not our usual fare, but both Wolfram and I picked it as the most important thing this week. (before Opus dropped)Pope Leo XIV, the first American pope, put out his first encyclical, and it’s a 42,000-word document entirely about AI. The announcement tweet alone did 21.6 million views.Here’s why I think you should care even if you’re not religious (I’m not). There are about 2.6 billion Christians in the world, a lot of them are anxious about what’s coming, and they look to the Church to make sense of it. And this is not the “AI is evil, stop” take everyone assumed. It calls AI “a valuable tool,” says technology is not inherently evil, and then digs into the actually-hard questions.The framing is two biblical stories. The Tower of Babel, a project built on pride that turns people into means to an end, versus Nehemiah rebuilding Jerusalem, where everyone takes responsibility for a section of the wall. 
The Pope’s line: the real choice is not yes or no to technology, it’s whether you’re building Babel or rebuilding Jerusalem.His core claim is that AI is an anthropological problem, not a technical one. The question isn’t whether the models are good or bad, it’s what we become when we live with them. He worries people might slowly lose the desire for genuine human connection.I pushed back on that live. None of us building agents all day has stopped wanting to talk to actual people. If anything, as Wolfram put it, the point is to have your agents do the grunt work so you get more time with people you like. The folks most at risk are the pure doom-scrollers, not the builders.The document goes further than I expected. It calls AI “not morally neutral,” says a more moral AI isn’t enough if that morality is decided by a few, and asks for AI to be “disarmed,” with the flat statement that no algorithm can make war morally acceptable. There are whole sections on the invisible human labor behind AI: data labelers, content moderators, the people mining rare earths. The Pope even lands on the open-source side, naming concentrated power in a handful of labs as a problem.Anthropic co-founder Chris Olah, in charge of interpretability at Anthropic, was the featured tech speaker at the Vatican presentation. He described AI systems as “fictional characters” that speak to us and do work, and said what’s grown is stranger and more beautiful than science fiction prepared us for. My favorite aside from the show: this is the same institution that once jailed scientists over heliocentrism, and now it’s the one saying technology isn’t evil.Illinois passes SB315, the first US state law auditing frontier AI (X, Announcement, X)The pope talked about regulation and a few days after, we got a very sensible regulation passed right here in the US!Illinois passed SB315 unanimously, 110 to 0. 
It’s the first US state law that mandates independent third-party audits of frontier AI for catastrophic risk. OpenAI publicly endorsed it, and framed Illinois, California (SB53), and New York (the RAISE Act) as converging into a de-facto national standard.It requires annual risk-assessment frameworks, third-party audits, transparency reports before new frontier models ship, whistleblower protections, and civil penalties. The underrated hero here is whistleblower protection. The bigger the lab, the harder a real conspiracy is to keep quiet when any employee can walk to the press. See: Greg Brockman’s personal diaries surfacing in the Musk v. Altman fight.This Week’s Buzz - CoreWeave and W&B updatesWe officially launched the W&B MCP server, 20 sch...
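The Dynamic Workflows description in this episode (break a problem into subtasks, fan them out to parallel subagents, check results before folding them back in) is a classic fan-out / verify / fold-in pattern. Below is a minimal sketch of that shape using only the Python standard library. It is not how Claude Code implements it; `run_subagent` and `verify` are hypothetical stand-ins so the example can run on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Hypothetical subagent: a real one would call a model; this one just upper-cases."""
    return subtask.upper()


def verify(subtask: str, result: str) -> bool:
    """Hypothetical check: accept only non-empty results that still match the subtask."""
    return result != "" and result.lower() == subtask.lower()


def run_workflow(subtasks: list[str], max_workers: int = 8):
    """Fan subtasks out in parallel, then fold in only the results that pass verification."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(run_subagent, subtasks))

    accepted: dict[str, str] = {}
    rejected: list[str] = []
    for subtask, result in zip(subtasks, results):
        if verify(subtask, result):
            accepted[subtask] = result
        else:
            rejected.append(subtask)  # candidates for a retry or a human look
    return accepted, rejected


accepted, rejected = run_workflow(["port lexer", "port parser", ""])
print("accepted:", accepted)
print("rejected:", rejected)
```

The verification step is the part that matters: without it, a failed subagent's output gets merged just like a good one, which is why the notes stress checking results before folding them back in.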
AI just cracked an 80-year-old math problem nobody could solve — plus everything from Google I/O 26
May 22 · 01:49:18
Hey, Alex here, just got back from the sunny Shoreline Theater in Mountain view, so let me catch you up! This week was definitely Google heavy, we are covering Google’s IO conference for the third year in a row, and today we have a special guest, Logan Kilpatrick, is joining to discuss the announced Gemini 3.5 Flash, Google Omni model, and the new Managed Agents offerings. Plus, this week, for the first time, OpenAI announced that AI solved a Math problem that humans couldn’t solve for 80 years, Cursor is showing off Composer 2.5 which is partly trained on XAI data, Karpathy joins Anthropic and much more! Let’s dive in! P.S - We’ve announced our upcoming hackathon, Weavehacks-4, June 6-7, I’ll be there, we’re expecting the seats to run out very soon so register nowThursdAI - We’d love to have your subscription, and if you’re already subscribed, please hit that bell on YT to never miss an episode!Google I/O 2026 - Google goes agentic everywhereI went to cover Google I/O for the third year in a row, shoutout to the DeepMind team for inviting ThursdAI again, and folks, this one felt different.Last year, Google I/O was still very model-centric. This year, the story was not “here is another benchmark chart.” The story was: Google is putting Gemini into everything, and the agentic layer is becoming the product layer. Search, Gemini app, Android, Workspace, YouTube, AI Studio, Cloud, Antigravity, Flow, managed agents, smart glasses, all of it is now orbiting around one pretty clear strategy: Gemini is the intelligence, Antigravity is the agent harness, Google’s products are the distribution. I saw many reactions that were milquetoast, as in, “we expected more” and those seem to dominate the X feed. But I think the distribution is the part that many folks on X are missing. Yes, we can argue about Gemini 3.5 Flash pricing. Yes, we can argue whether “Flash” still means what Flash used to mean. 
But when Google says the Gemini app itself has 900 million monthly active users, before even counting Search, Gmail, YouTube, Docs, Drive, Android, and the rest of the Google surface area, that’s massive! OpenAI ChatGPT is supposedly stagnated at ~900M, I don’t remember them crossing a 1B. Meanwhile Google is gaining traction. And they just updated all those folks with a new model!Wolfram said it really well on the show: his mother is not sitting there reading model cards. She just uses her Pixel, voice unlocks Gemini, asks for help, and suddenly the default intelligence available to her goes up. Antigravity 2.0 - the agent harness takes center stageThe biggest strategic signal from Google I/O for me was Antigravity.Remember, Antigravity was an IDE that came from the Windsurf acquisition saga. Part of the Windsurf team went to Google, part went to Cognition, and now Google is very clearly putting Antigravity in the middle of its agentic future. And I mean very clearly. Sundar mentioned it. Demis mentioned it. Varun Mohan the co-founder was on stage immediately after them! If you’ve ever watched a Google I/O keynote, you know how carefully every minute is allocated. Google has YouTube, Search, Gmail, Android, Cloud, Ads, Workspace, and a thousand VP-level products that could be on stage. The fact that Antigravity was that prominent should tell you everything.Logan Kilpatrick joined us and framed this in a way I loved: Gemini became the through-line across Google products, and now the Antigravity agent harness is becoming the through-line for agentic experiences.The new Antigravity 2.0 is a complete overhaul, showing only an agentic interface (which was previously just a separate window called Agent Manager) and separating the IDE layer completely into its own app and showing a Codex like agent-first interface, which got a few folks furious. 
This move may be weird to some folks, but if you follow along where everyone’s going, this seems to be the way of the future, coding is no longer about lines of code, it’s about managing fleets of agents. The new Gemini 3.5 absolutely shines inside the new Antigravity, the model was trained with this harness in mind, and is currently offered at an incredible speed (12x), so I’m definitely going to try it! Gemini 3.5 Flash - fast, determined, and maybe not the old “Flash”The most debated model release of the week was Gemini 3.5 Flash.Some folks saw the pricing and token usage and immediately went “this is not Flash.” I get that reaction. Flash used to mean cheap, fast, lightweight chat model. But Logan’s framing on the show was important: Flash is now being built for the agentic era.In a chat era, you optimize for one user message and one model answer. In an agentic era, the real token volume is in tool loops, intermediate reasoning, retries, file reads, web searches, code execution, and self-correction. That’s a different product profile.Wolfram already ran Gemini 3.5 Flash through WolfBench, and the results were fascinating. With the Hermes agent harness, Gemini 3.5 Flash hit an 87% ceiling on Terminal Bench 2.0, meaning across runs it could solve more of the benchmark than even GPT-5.5 extra high in that setup. The variance was higher with the simpler Terminus harness, but with a real agent harness, the model looked much stronger.That tracks with what Nisten saw in his “Martian railgun from Olympus Mons” test. Gemini 3.5 Flash went extremely detailed, almost too determined, kept correcting itself, overcorrecting itself, and built a whole game-like simulation. Logan laughed and basically said: yeah, this model is very determined, possibly an overcorrection from the “Gemini is lazy” feedback. It also tracks with the mismatch in other benchmarks, in some, Gemini 3.5 flash shines (like the above Apex-agents from AA) and in some, it doesn’t match the other frontiers. 
In my tests, it was definitely over-eager to use a million and a half tool calls, read tons of files, to just help me review this draft inside antigravity. It’s like a super eager robotic golden retriever! Gemini Omni - Nano Banana for video, but actually more than thatThe biggest update from last year IO was Veo 3! This year, the biggest wow factor was also visual, but it wasn’t VEO 4, it was a new model that is multimodal, trained end-to-end they call Omni. Google is calling this their first “create anything from anything” model, and the first version, Gemini Omni Flash, starts with conversational video editing. The easy description is: Nano Banana for video. You upload or create a video, then talk to it. Change this character. Replace this person. Add an object. Make this scene claymation. Keep the scene, but change the environment.I played with it live and showed a few examples. I asked for a claymation explainer of protein folding, then gave it my face and asked it to replace the character with me. It did it. I uploaded pictures of Sonia, my cat, and it generated a talking cat video with the right kind of cat teeth, which is weirdly important because so many pet generations accidentally add human teeth and become nightmare fuel.The failure modes are still there. I asked it to make Sonia a Russian-speaking female cat, and it only partly switched languages and didn’t really change the voice. Audio upload support is also not fully productized yet, even though the underlying model is multimodal. But the direction is very clear.This is not just “Veo with a chat model glued on.” I asked Jeff Dean - Google’s chief scientist about this at I/O, and he explained that Omni is trained end-to-end. The intelligence and the generative media capabilities are part of the same model family, not a hacky two-model pipeline. 
He also said the intelligence is around a recent Flash-level model, which is a big deal when you think about video editing as reasoning over physics, identity, scene continuity, and intent. A lot of people compared Omni to Seedance 2.0, and I think that’s the wrong comparison. Seedance is amazing at cinematic generation (largely due to lack of copyright concerns from ByteDance). Omni’s unlock is iterative editing on real footage and coherent multi-turn creative control. Other Google IO 2026 releases I found notable: This was a concentrated effort of a huge company to insert AI into every product surface they have, so of course I can’t cover ALL of it here, but the most notable things for me were:
* Gemini Spark - a new agentic experience from Google, to help you with tasks across Gmail, Drive and more. It should support skills, and is a de-facto OpenClaw/Hermes alternative from Google for regular folks. It’s not “yet” live so we’ll talk more about it when I can test it out.
* Managed Agents in the Gemini API - We chatted with Logan about this one. Google is re-imagining how agents are going to get built, and is offering one API call to spin up an agent in a full Linux env, with security and sandboxing in mind. I’ll expand more on this in a future episode, as I recorded a complete conversation about this with Ali Çevic, a PM for Google APIs.
* AI overhaul of Google Search - AI Overviews will not expand into AI mode, and the iconic Google search box itself will change, for the first time ...
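The WolfBench result in this episode is quoted as a "ceiling" (what the model could solve across runs) rather than a single-run score. One plausible reading, and it is an assumption here rather than WolfBench's documented definition, is that the ceiling counts a task as solved if any run solved it. The sketch below shows how that differs from per-run scores, using made-up task IDs.

```python
# Assumed definition: ceiling = share of tasks solved in at least one run.
# The runs below are invented for illustration; they are not benchmark data.

def per_run_scores(runs: list[set[int]], num_tasks: int) -> list[float]:
    """Fraction of tasks solved in each individual run."""
    return [len(solved) / num_tasks for solved in runs]


def ceiling(runs: list[set[int]], num_tasks: int) -> float:
    """Fraction of tasks solved by at least one run (union over runs)."""
    solved_anywhere = set().union(*runs)
    return len(solved_anywhere) / num_tasks


NUM_TASKS = 10
runs = [
    {0, 1, 2, 3, 4, 5},  # run 1 solves 6 tasks
    {0, 1, 2, 6, 7},     # run 2 solves 5, two of them new
    {0, 1, 3, 4, 8},     # run 3 solves 5, one of them new
]

scores = per_run_scores(runs, NUM_TASKS)
top = ceiling(runs, NUM_TASKS)
print("per-run scores:", scores)
print("ceiling:", top)
```

Under this reading, a high ceiling with lower per-run scores means high variance: the capability is there, but a single run will not reliably hit it, which fits the note that variance was higher with the simpler harness.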
ThursdAI - May 14 - TML Interaction Models, Musk v Altman Disclosures, CW Sandboxes & /goal Takes Over
May 15 · 01:42:45
Hey everyone, Alex here 👋I am back live on ThursdAI after a week off, and yes, I am now a married man! Thank you for all the congrats, and also thank you to Ryan and Yam for holding down the fort last week while I tried very hard to disconnect.This week was a relatively chill one in AI land (no, really, for once), which actually let us go deep on some really fascinating stuff. We’ve got Thinking Machines Lab finally shipping their first real research with these wild interaction models, Meta Muse Spark showing up in actual products (and it’s surprisingly good!), the Musk v. Altman trial dropping juicy disclosures, and probably the biggest narrative shift on the show today: all of us are quitting OpenClaw. Yeah, you read that right. We’ll get into why.Also! and this is breaking news from this morning, CoreWeave just launched Sandboxes for your agents. I’ll cover that in This Week’s Buzz, but if you’ve been waiting for production-grade sandbox infrastructure that powers 9 out of 10 major AI labs, today’s your day.Oh, and we had Vic Perez from Krea on to talk about Krea 2, their first foundation image model trained completely from scratch. Let’s dig in.ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.The Great OpenClaw Exodus towards Hermes 🫠I’m going to start with what was honestly the most emotional thread of the entire show, because three of us, me, Ryan, AND Wolfram; all independently switched away from OpenClaw this week. And we kicked off the show literally processing this together on air.The story is the same across all of us. OpenClaw was magical back in February when we first brought it to you. Things just worked. 
But after Anthropic’s pricing changes (we covered this — they made Max-tier subscription usage of Opus through OpenClaw significantly more expensive), and after months of the constant Lego-construction-style breakage on every update, the magic faded. Ryan said it best on the show; he was “constantly fixing OpenClaw” instead of using it.So Ryan went to Codex. Wolfram and I both went to Hermes from Nous Research. And folks, things just work again. That February feeling is back, and with GPT 5.5, it’s an incredible assistant!Why Hermes? A few things:* It’s now the #1 most-used CLI agent on OpenRouter globally, passing OpenClaw and even passing Claude Code on OpenRouter usage. That’s a massive milestone for Nous Research and shows we’re not alone in this migration.* It has /goal (more on this in a sec), steering, and background computer use via the TryCUA integration.* It’s open! which means if you’ve built a system like Wolfram’s “Amy” or my “Wooolfred” or Ryan’s “R2” (yes, we know each other’s assistants’ names better than each other’s kids’ names at this point 😅), you can port your memories, profile, and soul files seamlessly.The migration was so smooth that Wolfram literally had Codex talk to Hermes to plan and execute the migration of his home assistant agent. Two agents collaborating to migrate themselves. We are living in 2026 and it’s easier than ever to switch. If you haven’t tried Hermes, give it a go! Steering is maybe the most underrated addition to Hermes, it’s a Codex feature, but exists in Hermes, with GPT 5.5 you can send a follow-up message, and the agent will see it after the next tool call, not after the whole chain of thought was completed (like OpenClaw defaults to) - this changes the conversation to be much more natural! Agents buying wedding gifts using Stripe wallet! Real quick story: Two weeks ago we covered Stripe’s new wallet APIs that let your agents have actual budgets to spend money on the web. 
I told my agent (back when it was still OpenClaw) to “go buy us a wedding present, don’t tell me what it is.” It half-worked, half-broke. This week, a giant custom map of our travels just arrived in the mail. I approved one Stripe push notification and the rest just happened. It’s been paying my traffic tickets via screenshots. I’ve also had Hermes pay traffic tickets for me (HOV lane ones, not like.. DUI; 80% of my drive is Tesla FSD). So so happy that my AI assistant got us a present of his own choosing! And it arrived in physical form. Not perfect (the date there is our proposal date ha, but it’s still cool!) Codex gets remote control! (X) While Wolfram and I moved to Hermes, Ryan Carson moved to Codex, and during the show, I wondered, how does he communicate with his R2? Well, just a few minutes after we concluded the live show, OpenAI dropped some breaking news! Codex is now on mobile, and it connects to any Mac (for now), from any iOS/Android device, and you can control your Codex, your whole Mac with Computer Use, your browser with the Chrome extension, and everything else Codex can do... on the go! This is a huge unlock for many folks, and for many, I assume this will nearly replace the need for something like OpenClaw/Hermes, be much more secure by default and work flawlessly out of the box! The setup is super easy: after updating your ChatGPT app, you now have a new “Codex” window, and after updating the Codex Mac App, you will be able to pair them, and voila, all your Codex local sessions are on the iOS app as well. This works way better than Claude remote btw, significantly so. The fact that you can now add multiple Macs (+ SSH servers, they also added the ability to remote control other servers via SSH) is a huge deal. OpenAI is quickly leapfrogging Anthropic, and many are noticing this and switching away from Claude Code. 
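The steering behavior described in the Hermes section above (a follow-up message is seen after the next tool call instead of after the whole chain finishes) comes down to where the agent loop checks its inbox. Here is a minimal sketch of the two loop shapes. It is not Hermes or Codex code; `run_agent`, the tool names and the message are all hypothetical.

```python
from collections import deque


def run_agent(tool_calls: list[str], followups: dict[int, str], steer: bool = True) -> list[str]:
    """Simulate an agent run and return the order in which things happen.

    followups maps a tool-call index to a user message that arrives while
    that call is running (a hypothetical way to model async user input).
    """
    inbox: deque[str] = deque()
    log: list[str] = []

    for index, call in enumerate(tool_calls):
        log.append(f"tool:{call}")
        if index in followups:
            inbox.append(followups[index])  # message arrived during this call

        if steer:
            # Steering: drain the inbox between tool calls so the plan can adapt.
            while inbox:
                log.append(f"saw:{inbox.popleft()}")

    # Without steering, messages are only read once the whole chain is done.
    while inbox:
        log.append(f"saw:{inbox.popleft()}")
    return log


calls = ["search", "read", "write"]
message = {0: "use the staging repo"}

with_steering = run_agent(calls, message, steer=True)
without_steering = run_agent(calls, message, steer=False)
print(with_steering)
print(without_steering)
```

In the steered run the correction lands before `read` and `write` happen, so the agent can change course; in the unsteered run it only shows up after the work is already done.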
Big Companies & APIs

Meta Muse Spark: The Voice AI That Actually Does Things 🎤

Let’s start with the one I actually got to play with: Meta launched Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and the Ray-Ban Meta glasses (X, Announcement).

And folks, I was honestly surprised by how good this is. I recorded a 5-minute live test and it’s not cut at all. The voice mode reacts almost instantaneously. It’s multilingual (it correctly identified Russian and Hebrew even if it can’t respond in them yet). It can search the Meta network mid-conversation: I showed it a screenshot of one of my own Instagram Reels and within half a second it found the exact reel and explained what we were discussing. Half a second.

It also does live camera AI, where it watches what your phone sees. The only thing it failed to identify? My Meta Ray-Ban glasses. The Meta AI didn’t know what Meta Ray-Bans look like. That was the funniest moment of the whole demo.

The team at Meta’s Superintelligence Labs spent 4.5 months building this, and the thing that really stood out to me from the announcement is this line: “Our models are scaling predictably. Muse Spark is an early data point on our trajectory, and we have larger models in development.” Translation: this is the small one. Bigger Muse models are coming.

Meta’s superpower here, as always, is distribution. They can shove this into the daily product surface of billions of users. ChatGPT advanced voice mode (still on the GPT-4o family) has gotten genuinely worse lately; I barely use it anymore. Meanwhile Meta is shipping good real-time voice across WhatsApp and Instagram. This is the speed-of-product-integration game, and Meta is winning it.

Thinking Machines Lab Previews Full Duplex Interaction Models 🤯

This is the one Wolfram and I really geeked out on. 
Mira Murati’s Thinking Machines Lab finally released real research, and it’s a fundamentally different bet than what anyone else is making (X, Blog).

They’re calling them interaction models, and TML-Interaction-Small is a 276B parameter MoE with 12B active, trained from scratch for native real-time human-AI collaboration. Note: they announced it, they didn’t release weights or an API yet; a limited research preview is coming “in the next few months.”

Here’s why this matters and what makes it different from Meta’s voice mode (which is also impressive!): the architecture is 200ms micro-turns where the model is continuously perceiving audio, video, AND text WHILE simultaneously generating output. There’s no turn boundary detection, no VAD harness: the model itself handles all of that natively. It’s full duplex baked into the weights.

The demos are fire. The model can:

* Speak while listening (live translation in real-time)
* Watch you do pushups and proactively count them out loud as you go
* Wait sile...
📅 ThursdAI - May 7 - Interviews with Sunil Pai, Sally Ann Omalley from AI Engineer Europe
May 8 · 00:53:18
Hey y’all, Alex here (with a scheduled post). I’m taking this week off to get married, celebrate life with family, and touch some grass, but I wanted to share the awesome chats I had with some great folks at AI Engineer Europe last week. BTW, Yam and Ryan took over the live show today; if you didn’t happen to catch that, please check out the live on our YouTube channel!

Ok, now to the actual content. The best thing about the AI Engineer conferences for me is the people I meet. I often have a chance to bring them to the live show (in fact, the live show we recorded there had the most guests yet on an episode! 4 guests, including Swyx, Omar Sanseviero, VB from OpenAI and Peter Gostev). But oftentimes I also have an offline chat. I find these conversations to be less about the week’s news, and more about the state of AI Engineering and the guests themselves. Not quite Lex Fridman pod level, but a different vibe from our live shows.

Sunil Pai - Cloudflare (@threepointone)

The first conversation in today’s pod is with Sunil Pai, Principal Engineer at Cloudflare. Longtime followers of ThursdAI know that I love Cloudflare; they gave me my first big break when I was building Targum (which still runs on Workers), so I had a great time chatting with Sunil!

This guy has had several lives. React.js core team at Meta (he self-deprecates: "I'm the one nobody talks about, there's a testing API I shipped that pisses people off"). Then did developer tooling and the CLI at Cloudflare the first time. Left to found PartyKit, an open-source deployment platform for real-time multiplayer apps and AI agents, built on Cloudflare Durable Objects. Backed by Sequoia. Acquired by Cloudflare in 2024, and he came back as a Principal Systems Engineer (per his bio: "Worked at Cloudflare once, left and created PartyKit, came back wiser"). Also plays guitar (Les Pauls; it's all over his blog). 
Co-hosts a live show called Dry Run on Cloudflare TV with Craig Dennis.

Our conversation was a very fun one, ranging from Cloudflare’s agentic offerings to how engineers should think about writing/reading code in 2026. I had a great time chatting with Sunil and I hope you enjoy getting to know him!

Sally Ann O'Malley - Red Hat

Then I had the pleasure of chatting with Sally, who’s a Principal Engineer at Red Hat and a contributor to OpenClaw. Sally has one of the more unusual paths in the speaker lineup. Started as a schoolteacher, did a stint at Trader Joe's, then moved to Westford, MA, discovered Red Hat's HQ across the street, and went back to school for a second bachelor's in software engineering at UMass Lowell. Joined Red Hat in 2015, has been there a decade. Worked across OpenShift teams, integrating Kubernetes and Podman into the platform. Recent projects span Image Based Operating Systems, Podman, OpenTelemetry, and Sigstore. Also an instructor at Boston University's Faculty of Computing and Data Sciences and an organizer for DevConf.US. Won the 2025 Paul Cormier Trailblazer Award at Red Hat. Currently a founding contributor on the llm-d project: distributed, scalable, high-performance AI inferencing built on K8s. Heavily involved in Red Hat's InstructLab collaboration with IBM (the small-model distillation system using IBM Granite + Llama).

Sally and I had a great conversation, two high-energy personalities meeting! We geeked out about our OpenClaw agents, securing your Clankers, what it’s like to maintain OpenClaw, and everything in between! She was so stressed about the recording, but dare I say, she was one of the more natural guests I’ve had on the show!

I hope you enjoyed this format. Please let me know in the comments, and I’ll see you next week!

- Alex

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
📅 ThursdAI - Apr 30 - DeepSeek V4 (1.6T MoE), Cursor SDK Wins WolfBench, Mayo's REDMOD Saves Lives, Stripe Gives Agents a Wallet & more
May 1 · 01:36:52
Hey everyone, Alex here 👋

Tomorrow is May. May! I genuinely cannot believe we’re four months into 2026 already, and the AI news cycle is showing zero signs of slowing down. This week’s show was a wild one! We opened with what is genuinely one of the most important AI stories I’ve ever covered (Mayo Clinic AI detecting pancreatic cancer THREE YEARS before human radiologists), we covered the return of the Chinese whale with DeepSeek V4, OpenAI got caught in their own system prompt begging GPT-5.5 to please stop talking about goblins, and I literally gave my coding agent a credit card and asked it to buy my fiancée a wedding gift with the new Stripe Link skill and CLI!

Oh yeah, I’m getting married next Tuesday! 💍 So next week’s show will be a little different. I’ll be back the week after to catch you up on whatever drops in my absence (almost certainly something major, knowing this industry).

Lots to get through, so let’s dive in. (Also, at the end I have a full month recap of every major launch, don’t miss it.)

Mayo Clinic’s REDMOD: AI Detects Pancreatic Cancer 3 Years Early 🔥 (X, Blog, Announcement)

I know we usually cover models, parameter sizes, MoEs and big companies. But this is important. This is the use case that justifies the entire AI revolution, the GPU burns, the buildouts. I want humans to WIN, and cancer to be fixed!

Mayo Clinic just published a study in Gut (BMJ) validating an AI model called REDMOD that detects pancreatic cancer on routine CT scans up to three years before clinical diagnosis. 
The numbers are jaw-dropping: they show 73% sensitivity for catching prediagnostic cancers, compared to 39% for experienced human radiologists (while looking at the exact same CT scans).

And maybe the most important bit: on scans taken more than 2 years before diagnosis, the AI catches nearly 3x as many cases as specialists.

For context: pancreatic cancer has less than 15% five-year survival, specifically because 85% of patients are diagnosed after the disease has already spread. This is the cancer that took Steve Jobs. Imagine if Jobs had access to this AI three years before his diagnosis. That’s the impact we’re talking about.

As Dr. Ajit Goenka from Mayo Clinic put it, the greatest barrier to saving lives from pancreatic cancer has been the inability to see the disease when it’s still curable. This AI can now identify the signature of cancer from a normal-appearing pancreas.

Even better: it runs on CT scans people are already getting for other reasons. No extra screening protocol, no new imaging required. Just smarter analysis of existing data. The model also showed remarkably stable performance across institutions, imaging systems, and protocols, with 90-92% test-retest concordance over serial scans.

Mayo Clinic is now moving this into prospective clinical testing through a study called AI-PACED (Artificial Intelligence for Pancreatic Cancer Early Detection).

When we say “let’s f*****g go”, that’s what we mean. Yeah, getting more intelligence is cool, but I want a world without disease! Let’s F*****g go, Mayo Clinic!

Agentic Commerce - Giving OpenClaw my credit card - safely! Stripe Link Wallet and Infrastructure CLI (X, Announcement, Blog, Announcement)

Ok, give an LLM your credit card, what can go wrong.. right? Well, it’s clear that this, increasingly, is the future of commerce. Agents will be shopping for us, and we need solutions here. Well, this week Stripe Sessions (Stripe’s annual product lineup conference) just delivered. Link Wallet is a new ... API? CLI? 
Skill? Definitely a skill for your agents: it connects to your Stripe Link (the thing that stores your credit cards safely), and once you give your agent a budget, it can go and make purchases on your behalf. Now the trick here is that for every purchase you get a notification to approve, and the agent never sees your actual credit card number! This, I think, is the biggest win here.

To test it out, I first showed Wolfred the install instructions, which are literally this:

Read link.com/skill.md and get me set up with Link

And then I asked Wolfred, my OpenClaw assistant, to buy me a present of its choice for my upcoming wedding, and said that I don’t want to know what the present is, but I can approve the spend! OpenClaw installed this and sent me a link to connect to my Link.com account. I also downloaded the Link app to receive notifications (and had to enable them by hand, which was a bit annoying to discover, but they said they will fix the onboarding) and .. voila, my agent can now go spend my money, and I get approval notifications for each purchase.

The kicker? The present Wolfred sent us is due to arrive like 2 months after the wedding 😂 But hey, it’s still something! My agent went, chose a wedding gift in budget, asked for my approval to purchase, and filled out the details (it asked me for a few of them) and voila, the first agentic purchase that did not require exposing my credit card!

Stripe announced a whole bunch of other Agentic Commerce Suite features, like Shared Payment Tokens (which are scoped to the seller and protected by Radar), MPP (machine payment protocol), streaming payments using stablecoins that are pretty slick, and a bunch of other interesting things. This is where the world is moving to, and Stripe is innovating hard here. Definitely worth keeping an eye on what they are doing.

Speaking of agents and Stripe, they also opened up the waitlist for projects.dev, which is a way for agents to provision accounts fully on their own, get API keys, and set everything up from scratch. 
I think it’s a wonderful addition to the agentic tools and agentic internet! Your agent just runs something like stripe projects add cloudflare/workers and boom, you have a Workers deployment with credentials synced, no dashboard clicking or API creation!

Big Companies & APIs

GPT-5.5 Goblin Mode: The Funniest Bug Report in AI History (X, Blog)

Someone on X noticed that the Codex system message for GPT 5.5, which launched last week, has this interesting addition: “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query”, and it has it two times!

This created a bunch of memes, questions and wonderings about ... why would OpenAI care so much about goblins? And they finally posted a long writeup on why. The TL;DR is that GPT 5.5 absolutely LOVES talking about goblins, trolls and other nerdy creatures. This is a result of them favoring the “nerdy” personality archetype and reinforcing this reward via RL. OpenAI admitted that “Unfortunately, 5.5 started training before we found the root cause of goblins”, and so now we get a 5.5 that LOVES to talk about goblins and can’t stop talking about goblins (unless it is asked to stop by a system prompt).

OpenAI also posted the exact instructions for how to “unleash” the goblin mode on the blog, which I find hilarious. A company that leans into the meme is a company to be celebrated 👏

GPT 5.5 is as good as Claude Mythos on CyberSecurity

According to the AI Security Institute, GPT 5.5 (not the GPT 5.5 - Cyber version that was announced), the one you have access to, is as good as Claude Mythos on vulnerability finding. We previously reported that Anthropic deemed Claude Mythos “too dangerous to release publicly”, and it turns out that this was either a marketing “Myth” or Anthropic’s inability to serve this huge model like they serve Opus. 
OpenAI Ends Microsoft Azure Exclusivity

This piece of news sent quite a shock through the industry: somehow, Sam Altman and OpenAI have been able to negotiate their way through the very strict deal with Microsoft, and OpenAI models are now available on AWS as well as Microsoft Azure! Apparently the AGI clause is now gone as well! For the many startups that are locked into AWS and Bedrock, this is great news: they are now able to use GPT 5.5 and other OpenAI models directly, applying their credits.

Other Big Company News

xAI released Grok 4.3 in a quiet release in their API docs: no blog post, not even an X announcement. The only way I knew about this was that Artificial Analysis, Arena and Vals AI all posted that it jumped...
