Podcast Summary: Last Week in AI (#235)
Date: February 16, 2026
Hosts: Andrey Kurenkov & Jeremie Harris
Overview
This episode of Last Week in AI tackles a major "model release wave," with all the major AI labs and several Chinese companies dropping new and impressively powerful models. The hosts analyze headline releases including Opus 4.6, GPT-5.3 Codex, Google's Gemini 3 Deep Think, Seedance 2.0, GLM-5, and more. They discuss hardware shifts, business and investment stories, agent workflow trends, and growing questions around safety, benchmarking, and alignment.
“Somehow everyone decided to release new models that are like mindbreaking at the same time… Opus, Codex, Gemini, deepthink, models from the Chinese companies… we might have to be quick. It's going to be pretty dense.”
(Andrei, 01:13)
Key Discussion Points & Insights
1. Major Model Releases and Upgrades
1.1 Anthropic’s Opus 4.6 & Agent Teams
- Release highlights:
- 1 million token context window (up from 200k)
- 2.5x speed improvement on “Extra Fast” version
- New “agent teams” for parallel task decomposition (see the sketch after this subsection's timestamps)
- Positioning Shift:
- Moving from developer tools toward “universal knowledge worker” use, with tight integrations for Excel, PowerPoint, etc.
- Competing more with broad office platform AIs, not just coding assistants.
- Host vibe:
“...this is really, I would say, the moment to call it on. I would expect to start to see some big white collar market shifts in response to these kinds of capabilities…”
(Jeremy, 07:28)
- Timestamps:
- Opus 4.6 context window and features: 04:42
- Agent teams and market implications: 05:48
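The “agent teams” feature, as described, is parallel task decomposition: a lead agent splits a job into sub-tasks, fans them out to workers that run concurrently, and merges the results. A minimal sketch of that pattern, assuming asyncio and a hypothetical run_agent() standing in for whatever API Anthropic actually exposes:

```python
import asyncio

async def run_agent(role: str, task: str) -> str:
    # Hypothetical stand-in for a real model call (e.g., an Anthropic API request).
    await asyncio.sleep(0.1)  # simulate network / inference latency
    return f"[{role}] result for: {task}"

async def agent_team(goal: str) -> str:
    # Lead agent decomposes the goal (hard-coded here; in practice this
    # decomposition would itself be a model call).
    subtasks = [
        ("researcher", f"gather background for: {goal}"),
        ("analyst", f"summarize key numbers for: {goal}"),
        ("writer", f"draft slides for: {goal}"),
    ]
    # Fan out: sub-agents run in parallel, not sequentially.
    results = await asyncio.gather(*(run_agent(r, t) for r, t in subtasks))
    # Fan in: a real lead agent would synthesize these with one more call.
    return "\n".join(results)

if __name__ == "__main__":
    print(asyncio.run(agent_team("Q1 revenue review")))
```

The practical point the hosts gesture at is the fan-out/fan-in step: wall-clock time tracks the slowest sub-task rather than the sum of all of them.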
1.2 OpenAI’s GPT-5.3 Codex & Codex Spark
- Release highlights:
- GPT-5.3 Codex: 25% faster, outperforms previous models on coding benchmarks (77.3% on TerminalBench vs. Opus 4.6's 65.4%)
- Codex Spark: ultra-fast (1,000+ tokens/sec) model, enabled by the $10B OpenAI–Cerebras partnership; near-instant code generation, but a smaller model than 5.3
- macOS app for non-coders; focus on making agentic AI accessible.
- Recursive Self-Improvement?
- Blogposts claim Codex “helped build Codex,” sparking debate over whether this constitutes “recursive self-improvement.”
“In my opinion, this recursive self improvement thing is a nothing burger… it didn’t actually help make a smarter model…”
(Andrei, 15:36)
- Hardware Strategy:
- OpenAI moving portions of inference to Cerebras hardware to reduce Nvidia dependency and cost.
- Timestamps:
- Model benchmark details: 14:09
- AI helping train itself discussion: 14:09–18:01
- Spark/Cerebras partnership: 19:08–20:05
1.3 Google Gemini 3 Deep Think
- Release highlights:
- Major jump in “abstract reasoning” benchmarks (ARC-AGI-2: 84.6% pass rate; previous best was Opus 4.6 at 68.8%)
- Beats previous SOTA on math and reasoning (IMO: 81.5% vs. GPT-5.2's ~71%)
- No system card released; debate as to whether these “runtime upgrades” deserve more rigorous safety review.
- Available only to AI Ultra tier and select test APIs.
- Concerns:
“If you can have a leap like this that comes from whether it’s scaffolding, whether it’s just like more compute… hey, that’s the world we live [in]... there may be other ways to unlock latent reasoning capabilities…”
(Jeremy, 25:25)
- Timestamps:
- Gemini 3 Deep Think benchmarks: 24:21
1.4 Seedance 2.0 and Other Major Chinese Releases
- Seedance 2.0 Video Generation:
- ByteDance drops a “text-to-video” model so lifelike it's “hard to tell it's AI,” supporting images, video, and audio as input.
- Trained on vast amounts of copyrighted material, enabling stunning (but legally murky) outputs (anime, films, etc.)
“Truly, this model seems to be trained on as much video as Bytedance could get its hands on, including just everything and anything with copyright restrictions…”
(Andrei, 28:01)
- Timestamps: 28:10–31:44
- Other Significant Chinese Models:
- CBrium 5.0 (image), Alibaba's Qwen Image 2.0, and more, all showing huge leaps, especially in editing and multilingual capability.
1.5 GLM-5, DeepSeek, and Other LLMs
- GLM-5:
- Massive Mixture-of-Experts (MoE) model (744B total parameters, 40B active per token), with powerful new RL infrastructure (“SLIME”); see the routing sketch after this item's timestamps.
- Noted for using “DeepSeek sparse attention,” an innovation now spreading across Chinese LLMs.
“It is really impressive…being able to train a 744 billion parameter [MoE] is a huge deal…”
(Jeremy, 35:37)
- Timestamps: 34:08–38:09
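The “744B total, 40B active” figure is standard MoE accounting: a router picks a small top-k subset of expert feed-forward blocks per token, so only a fraction of the weights participate in any one forward pass. A toy PyTorch sketch of top-k routing, purely illustrative (GLM-5's real expert count, router, and layer layout are not given in the episode):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k Mixture-of-Experts layer: many experts, few active per token."""
    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only top_k of n_experts run per token, so "active" params << total params.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(8, 64)
print(TinyMoE()(x).shape)  # torch.Size([8, 64]); each token used 2 of 16 experts
```

With the quoted numbers, roughly 40/744 ≈ 5% of the weights fire per token, which is why training cost and inference latency track the active parameter count rather than the headline size.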
- DeepSeek New Release:
- 1M token context, significant for code agent applications.
- Cursor (Composer 1.5):
- Pivoting from autocomplete to competing as an “agentic coding” tool.
- xAI's Grok Imagine API:
- Newly announced image/video gen API; not yet leading but entering a “competitive space with a few big players.”
- Timestamps: 41:08–42:23
2. Business & Investment Roundup
2.1 ElevenLabs Valuation
- Raises $500M at $11B valuation; founded in 2022.
- Dominates text-to-audio generation (speech, music, SFX).
- 2025 revenue: $330M ARR.
- Sequoia, a16z, Lightspeed, and others on cap table.
“It's crazy what we've gotten used to… they were founded in 2022…”
(Jeremy, 44:08)
2.2 Runway’s $315M Series E
- Now at $5.3B valuation; expanding from AI video/image editing to world modeling and robotics applications.
- Notably received investments from both Nvidia and AMD.
“You can’t make that much money off consumer AI generated video… at a certain point you’re going to have to go after the enterprise.”
(Jeremy, 50:30)
2.3 Apptronik’s Humanoid Robotics Megaround
- Raises $935M (as a Series A extension!) at a $5.3B valuation; deals with Google and Mercedes.
"Still investors are kind of excited about it."
(Andrei, 51:17)
- Rapid interest in factory automation; direct competition with Figure and 1X.
2.4 Industry Dynamics & Vendor Lock-In
- API switching is easier than ever with gateways like LiteLLM; vendor lock-in is fading (see the sketch below).
- Looming margin pressure on LLM providers as capabilities converge.
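The lock-in point is that gateway libraries put every provider behind one call signature, so switching vendors becomes a one-string config change. A minimal sketch using LiteLLM's completion() interface; the model IDs are placeholders, not names from the episode, and provider API keys are assumed to be set in the environment:

```python
# pip install litellm  (assumes e.g. OPENAI_API_KEY / ANTHROPIC_API_KEY are set)
from litellm import completion

def ask(model: str, prompt: str) -> str:
    # Same call shape regardless of which provider actually serves the model.
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Swapping vendors is a config change, not a code change.
for model in ["gpt-4o-mini", "claude-3-5-haiku-20241022"]:  # placeholder model IDs
    print(model, "->", ask(model, "One sentence on why vendor lock-in is fading."))
```

If capabilities keep converging as the hosts suggest, this is the mechanism that turns model choice into a price/latency decision and squeezes provider margins.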
3. Agent Workflows, Safety & Security
- Rise of agent teams/parallelization (Opus 4.6, GPT-5.3, etc.)
- Fast iteration of tools for non-coders (macOS apps, “cowork”-type UIs)
- Vulnerabilities in agent code ecosystems:
- Recent malware spread on open agent “skill” hubs (prompt injection, system compromise); “maybe have a burner laptop…” (Jeremy, 67:10).
4. Benchmarks, Evaluation & The Limits of Evals
- Benchmarks Less Meaningful?
- Models are suspected of “eval awareness,” and benchmarks can't keep up with the qualitative leaps.
“Evals… are just kind of not there anymore, for all kinds of reasons. But also this recursive self improvement thing, we can have a philosophical debate about what that means…”
(Jeremy, 09:39)
- Quantitative Leap Example:
- TerminalBench coding:
- GPT-5.2 Codex: 64%
- Opus 4.6: 65.4%
- Codex 5.3: 77.3%
- Vending-Bench:
- Opus 4.6 achieves record $8,000 average balance; lessens “doom loop” failure mode vs. prior versions.
- METR's Task Time Horizon:
- Next-gen AIs can perform multi-hour complex tasks; the trend is exponential, with possible “curve steepening” (a worked extrapolation follows below).
"I believe that this doubling time... is actually... faster than every seven months, maybe closer to every four months..."
(Jeremy, 80:46)
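The doubling-time claim is easy to sanity-check with arithmetic: if the task length a model can handle doubles every d months, the horizon after t months is h(t) = h0 · 2^(t/d). A quick comparison of the ~7-month figure against the “maybe closer to four” guess, starting from an assumed 2-hour horizon today (the starting point is illustrative, not from the episode):

```python
def horizon(h0_hours: float, months: float, doubling_months: float) -> float:
    """Task time horizon under a fixed doubling period."""
    return h0_hours * 2 ** (months / doubling_months)

h0 = 2.0  # assumed current horizon in hours (illustrative)
for d in (7, 4):
    proj = [round(horizon(h0, m, d), 1) for m in (12, 24)]
    print(f"doubling every {d} months -> {proj} hours at 12 / 24 months out")
# doubling every 7 months -> [6.6, 21.5] hours at 12 / 24 months out
# doubling every 4 months -> [16.0, 128.0] hours at 12 / 24 months out
```

The gap between the two curves is the whole “curve steepening” debate: over two years, the difference between a 7-month and a 4-month doubling period is roughly a 6x difference in achievable task length.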
5. Research & Alignment / Safety
5.1 RL, Distillation, and World Model Learning
- “Learning to Reason in 13 Parameters”
- Shows that you can distill 90% of LLM “reasoning” upgrades into as few as 13 parameters (for small models); most RL only “elicits” latent capabilities rather than creating new ones.
“13 numbers actually seem to be sufficient to capture 90% of the reasoning capabilities of a model, which very much seems to suggest that all RL is doing is it's allowing you to not create new capabilities…but rather elicit capabilities that are already there.”
(Jeremy, 69:01)
- World Model Paper:
- Agents learn environments better by “bumping around” and predicting next states rather than just chasing goals, paralleling RL in robotics (“learning from play”); a minimal sketch follows below.
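As summarized, the contrast is between optimizing a reward and simply predicting what happens next while exploring. A minimal sketch of the latter under made-up dynamics: collect transitions with a random policy and fit a next-state predictor; no reward signal is used anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, action):
    # Toy unknown dynamics the agent is trying to model (2-D state, scalar action).
    return 0.9 * state + np.array([0.1, -0.05]) * action + 0.01 * rng.normal(size=2)

# 1) "Bump around": gather transitions with random actions, no goal in sight.
states, actions, next_states = [], [], []
s = np.zeros(2)
for _ in range(2000):
    a = rng.uniform(-1, 1)
    s_next = step(s, a)
    states.append(s); actions.append([a]); next_states.append(s_next)
    s = s_next

X = np.hstack([np.array(states), np.array(actions)])  # (N, 3): state + action
Y = np.array(next_states)                              # (N, 2): next state

# 2) Fit a linear world model s' ~= [s, a] @ W by least squares.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("mean squared prediction error:", float(((X @ W - Y) ** 2).mean()))
```

The hosts' point is that this purely predictive data, gathered without any objective, is what later makes goal-directed behavior cheap, which is the “learning from play” parallel from robotics.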
5.2 The “Hot Mess” of AI Alignment (Anthropic)
- Long, complex tasks induce more errors, not always from misalignment but from pure “incoherence,” i.e., an inability to keep track and execute.
- Larger models can be more brittle on hard tasks (more variance), while scaling reliably helps on easier ones.
- Suggests unpredictable or “hot mess” systemic failures over coherent “paperclip maximizer” catastrophes.
"Rather than a paperclip maximizer… maybe the failure modes are going to shift to more systemic things like… self undermining behavior."
(Jeremy, 87:59)
- New technical term introduced: “hot mess” = highly incoherent, error-prone behavior (the opposite of systematic misalignment).
6. Industry Drama & Miscellaneous
- Anthropic’s Surprising Super Bowl Ad:
- Open dig at OpenAI over “advertising in AI chatbots”; OpenAI leaders react defensively on social media.
- xAI (SpaceX Acquisition) Talent Exodus:
- Two major cofounders and at least 11 engineers leave in the wake of the SpaceX/xAI merger, part of an industry trend of founding teams fracturing.
7. Open Source Highlights
- GLM-5 on Hugging Face; Qwen3 Coder Next
- Major Chinese open LLMs keeping up with Western releases; Qwen3 Coder Next is highly efficient via hybrid attention and a novel distillation regimen.
Notable Quotes & Moments
- On the competitive pace:
“Something has changed in the last three months. We’ve moved from the impressive demo stage to the, ‘actually this should be your first port of call for an awful lot of workflows.’”
(Jeremy, 06:50)
- On evals’ limitations:
“It’s really difficult to know what models are actually at what point in this whole singularity loop. So yeah, just basically hard to agree with everything you said there.”
(Jeremy, 16:45)
- On hardware margins:
“Every dollar you spend on an Nvidia GPU, 90% of that is just like pure profit, right? ...He who controls the full stack here has a real, real advantage.”
(Jeremy, 20:05)
- On model convergence:
“With Codex 5.3 and Gemini Free and Claude... margins have to start falling, right?”
(Andrei, 47:07)
- On unsafe agent skill repositories:
“A large proportion of the stuff that was on this hub was malware.”
(Andrei, 66:50)
- On safety and alignment:
“...incoherence is a new notion here which seems very useful actually. And they do have some pretty practical conclusions...”
(Andrei, 87:59)
Segment Timestamps (Approximate)
- [04:42] Opus 4.6, Agent Teams & Market Impact
- [09:39] GPT-5.3 Codex, Evals, and Hardware Moves
- [24:21] Gemini 3 Deep Think Benchmarks & Safety
- [28:10] Seedance 2.0, Chinese Video/Image AI
- [34:08–38:09] GLM-5, DeepSeek, and Agentic Coding
- [44:08] ElevenLabs, Runway, Apptronik: Investment Stories
- [66:50] Open Agent Repos Full of Malware
- [69:01] 13 Parameters for LLM Reasoning
- [80:46] METR's Task Time Horizon: Exponential Trend
- [84:14] Anthropic’s “Hot Mess” Alignment Results
Final Thoughts
This was one of the most action-packed, transformative weeks in AI on record. The “model release deluge” is pushing the limits not only of capabilities and hardware, but also measurement, business strategy, and safety. The market is shifting beneath us—white-collar automation, foundation model commodification, and new threats/opportunities in both open and closed ecosystems. As the hosts admit: the vibe checks matter more than ever, and we're officially "off the edge of the map."
"I think we've hit the main points here."
(Jeremy, 65:11)
For more details and technical deep-dives, listen to the full episode or check out their newsletter and recommended links.
