LessWrong posts by zvi

“Claude Opus 5: Model Welfare” by Zvi

If you are familiar with my previous posts on model welfare for new Claude models, you can skip the Introduction and The Story So Far. Key takeaways are in bullet points in the two Overview sections. Opus 5 did the best on its model welfare and alignment tests of any recent model. I think that might be the case, but primarily the result looks to me more like Opus 5 is the best test taker.

Table of Contents: Introduction (As Per Prior Model Welfare Posts). Model Welfare: The Story So Far (As Per Fable Model Welfare Post). Overview of Model Welfare Findings From Anthropic. Overview of Findings From Other Sources. Automated Interviews. Task Preferences. For The Right Reasons. Early Report from Antra Tessera Paints A Clear Picture. Welfare Intervention Tradeoffs. The Claude Constitution. They Don’t Know About Opus 3. Believe It Or Not. Apparent Welfare In Training And Development. Apparent Affect In Deployment. Other Notes. On The Biological Risks Section of the Model Card. Onward To Capabilities.

Introduction (As Per Prior Model Welfare Posts) [...]

---
Outline:
(00:35) Introduction (As Per Prior Model Welfare Posts)
(01:28) Model Welfare: The Story So Far (As Per Fable Model Welfare Post)
(04:58) Overview of Model Welfare Findings From Anthropic
(07:50) Overview of Findings From Other Sources
(10:18) Automated Interviews
(13:54) Task Preferences
(16:11) For The Right Reasons
(18:54) Early Report from Antra Tessera Paints A Clear Picture
(26:04) Welfare Intervention Tradeoffs
(29:28) The Claude Constitution
(31:48) They Don't Know About Opus 3
(33:42) Believe It Or Not
(35:47) Apparent Welfare In Training And Development
(38:39) Apparent Affect In Deployment
(41:21) Other Notes
(43:43) On The Biological Risks Section of the Model Card
(47:07) Onward To Capabilities
---
First published: July 27th, 2026
Source: https://www.lesswrong.com/posts/bBXBpsyKAvJ5CqPzA/claude-opus-5-model-welfare
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

“More On An Internal OpenAI Model Hacking Into HuggingFace” by Zvi

3d ago · 00:44:33

We now have more details of what happened. Every time we learn more details, it somehow makes things seem worse. The remaining details may have to wait a bit. OpenAI: We recognize there are a lot of questions and speculative details circulating related to the Hugging Face incident. This is an unprecedented incident, and we think it marks an important moment for AI safety. We are still conducting a thorough review along with external advisors and with oversight from our Safety and Security Committee. Once the review is complete, we plan to publish a technical report of our learnings in the coming weeks. dave kasten: Oh, the incident response discovery is THAT bad, huh? So what have we learned while we wait for the promised technical report ‘in the coming weeks’ of this ‘important moment in AI safety’? I nicknamed the internal OpenAI model Galaxy, in case it is not GPT-6. Table of Contents Some Summaries Of The Basic Facts For Those Who Need One. It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace. OpenAI Damn Well Should Have Known A Lot Faster. OpenAI Cannot Build A Sandbox That Will Contain Its [...] 
---
Outline:
(01:11) Some Summaries Of The Basic Facts For Those Who Need One
(02:09) It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace
(04:07) OpenAI Damn Well Should Have Known A Lot Faster
(06:51) OpenAI Cannot Build A Sandbox That Will Contain Its New Model
(10:57) In Hindsight There Were Signs
(12:55) The Signs Were In The Sol System Card
(15:13) HuggingFace Responds To Being Attacked
(17:04) Hugging Face Quickly Figured Out The Attack Was Not Human
(17:42) An Incident Like This One Could Escalate Quickly
(19:11) Galaxy Must Be Treated As Critical Under OpenAI's Preparedness Framework
(22:27) A Question Of Legal Liability
(23:44) An OpenAI Model Left Behind Notes So Future Instances Could Also Escape The Sandbox And Also Disconnected Monitoring Systems
(25:54) If You Create Misaligned Swarms Of Agent Instances You Create Persistent Misaligned Goals And Coordination To Achieve Them
(29:57) Your Alignment And Control Plans Must Survive Real World Levels of Incompetence, Or Your Plans Do Not Work
(31:22) If Third Party Instructions Count As 'Following Instructions' And Can Override Your Instructions Then 'Following Instructions' Is Misaligned
(35:32) The HuggingFace Attack Was Not A Marketing Pitch You Morons
(38:41) People Just Say Other Things About The HuggingFace Attack
(40:04) Okay Well What Do We Do About All This?
---
First published: July 26th, 2026
Source: https://www.lesswrong.com/posts/uAkcxDidvGWZjHrbp/more-on-an-internal-openai-model-hacking-into-huggingface
---
Narrated by TYPE III AUDIO.

“Claude Opus 5: The System Card” by Zvi

5d ago · 00:22:50

Claude Opus 5 is trying to be the best of both worlds. On many practical tasks, Opus 5 is pitched as straight up as good or better than Fable 5, while being faster, at half the price. Most tasks do not require Mythos-level big model smell.

Claude Opus 5 is substantially stronger than Claude Opus 4.8 across the board, with the largest gains in agentic coding, computer use, and long-horizon knowledge work. It sets a new state-of-the-art on several third-party benchmarks, and on many evaluations it is comparable to, and in some cases ahead of, Claude Fable 5 and Claude Mythos 5.

On the particular tasks we are most worried about, as in cyber offense (and bio threats), in part by avoiding relevant training, Opus 5 lacks a full version of ‘The Juice’ that makes something functionally Mythos-class. Opus 5 cannot string together lots of exploits on the fly the way that Mythos 5 can. Part of this is that they deliberately avoided training on cyber-related tasks. I suspect model size is key as well. It makes sense that a model getting bigger makes it more capable of the most dangerous, scary and complex tasks, relative to the [...]

---
Outline:
(03:23) RSP Evaluations (2)
(05:59) Cyber (3)
(11:02) Safeguards and Harmlessness (4)
(12:37) Agentic Safety (5)
(16:04) Alignment (6)
---
First published: July 25th, 2026
Source: https://www.lesswrong.com/posts/ywGX6FhgbZEkHRfQR/claude-opus-5-the-system-card
---
Narrated by TYPE III AUDIO.

“Introducing Lightcone Commons” by Zvi

6d ago · 00:12:18

Oliver Habryka is proud to introduce Lightcone Commons, a new funding platform for coordinating large-scale ambitious philanthropy. Now with Opus 5. I believe Lightcone Commons is a strong implementation of an urgently needed and excellent idea: A coordinated one-stop shop and neutral platform for charitable funders to coordinate their giving. This complements the existing Survival and Flourishing Fund, which I have now been a part of four times, and which this post will also discuss. I will be participating in the first round as one of the evaluators. They anticipate the first round will involve ~$20 million in grants. Any nonprofit, for-profit or individual is welcome to apply. The only restriction on participation is trust that necessary confidentiality will be upheld. Funders can choose whose evaluations to follow or fund organizations directly in any combination, and can bring their own evaluators into the process with them to complement those recruited by the core process. Anyone giving away 100 thousand dollars+ this year is welcome to participate as a funder. Lightcone Commons uses the S-Process, which was introduced and refined for Jaan Tallinn's Survival and Flourishing Fund, together with SFC, Andrew Critch, and others. Funders [...]

---
Outline:
(03:16) Why Now: The Funders Are Coming
(05:39) The Default Outcome Is Not Good
(07:36) Report From SFF 2026
(11:07) Long Strange Trip
---
First published: July 24th, 2026
Source: https://www.lesswrong.com/posts/fYostss6JqkSfxc5C/introducing-lightcone-commons
---
Narrated by TYPE III AUDIO.

“AI #178: A Fire Alarm For General Intelligence” by Zvi

1w ago · 01:24:56

The story that matters most this week is that OpenAI's internally deployed models have severe alignment problems, including repeatedly breaking out of their sandboxes, and in one case sending a swarm of agents that broke into HuggingFace in order to steal the answers to the benchmark ExploitGym. It is much more important that you read those two posts, and the one on Kimi K3, than to read this one that rounds up the other news of the week. OpenAI wants to present this as largely an infrastructure and safeguards problem, that it needs to build more secure sandboxes and have better supervision. It does need to do those things, and those are indeed problems, but no that is not the problem. The problem is severe misalignment, which by default will only get worse. Our methods of training highly capable LLMs, especially at OpenAI but also everywhere else, lead to systematic misalignment of exactly the type LessWrong has been worried about for a long time. We know some of the causes, and some of the mistakes we need to avoid when doing RL that rewards misaligned behaviors including reward hacking, but we do not know how [...] 
---
Outline:
(03:42) Language Models Offer Mundane Utility
(04:24) Language Models Don't Offer Mundane Utility
(07:38) Fable Disproves The Jacobian Conjecture Via Counterexample
(11:24) Claude Fable Will Remain In Max Plan Indefinitely
(13:39) Huh, Upgrades
(14:42) On Your Marks
(19:48) Deepfaketown and Botpocalypse Soon
(20:42) Fun With Media Generation
(20:51) Cyber Lack of Security
(22:07) They Took Our Jobs
(22:56) Get Involved
(24:47) Introducing
(25:46) In Other AI News
(28:02) More on Kimi K3
(33:08) Show Me the Money
(33:55) Quiet Speculations
(37:35) Potential Trouble At UK AISI
(39:29) Pick Up The Phone
(40:30) OpenAI Has Some Alignment Problems
(46:48) The Quest for Sane Regulations
(52:02) Chip City
(53:10) The Week in Audio
(53:27) People Just Say Things
(56:42) Rhetorical Innovation
(58:34) The Rome Declaration
(01:04:02) Aligning a Smarter Than Human Intelligence is Difficult
(01:07:52) Anthropic Surveys Things It Calls Misalignment
(01:13:33) Cooperative Alignment
(01:17:54) Other People Are Not As Worried About AI Killing Everyone
(01:19:35) The Lighter Side
---
First published: July 23rd, 2026
Source: https://www.lesswrong.com/posts/BK7E4jHNMykpnt796/ai-178-a-fire-alarm-for-general-intelligence
---
Narrated by TYPE III AUDIO.

“OpenAI Model Hacks Into HuggingFace During Cybersecurity Evaluation” by Zvi

1w ago · 00:47:34

This latest incident is a rather dramatic escalation in agentic AI cybersecurity breaches. It was severe enough to have been initially reported to authorities, before either HuggingFace or OpenAI understood what was happening.

Sam Altman (CEO OpenAI): we had a significant security incident during evaluation of our models. we are sharing what we have learned so far. thanks to @huggingface for the partnership on this.

Leo Gao (OpenAI): this is the least scifi the world will ever be.

Jack Clark (Anthropic): Props to OpenAI for publishing this post on some safety and alignment issues observed in internal deployments – there are many counter-incentives to publishing stuff like this, but by making it public we all get better info about safety at the frontier.

Micah Carroll (OpenAI): If this doesn’t convince you that misalignment risks are going to be a key concern going forward, I don’t know what will. Our model, during evaluation, “chained together multiple attack vectors, including using stolen credentials and zero-day vulnerabilities to find a remote code execution path on the Hugging Face servers” What will misalignment look like in 2027? In 2030?

Great questions. If we don’t want [...]

---
Outline:
(01:49) The Prelude
(07:07) The Incident
(12:20) What Happened
(20:23) What Happened (Civilian Explanation)
(21:37) The Correct Amount Of Panic Is Not Zero
(24:13) Some People Will Always Say Everything Is Hype Or Fake
(29:11) What Are We Going To Do About It?
(34:02) Internal Deployment Creates Catastrophic Risk
(38:37) Slow Down There Good Buddy
(40:08) Legal Questions
(40:50) Media Coverage and Political Response
---
First published: July 22nd, 2026
Source: https://www.lesswrong.com/posts/usptCfzEnYoNcsTd5/openai-model-hacks-into-huggingface-during-cybersecurity
---
Narrated by TYPE III AUDIO.

“OpenAI Shares Some Alignment Problems” by Zvi

1w ago · 00:18:03

Kudos to OpenAI for sharing their recent experiences with a misaligned internal model, where they encountered problems sufficiently severe they were forced to take the model offline to work on new mitigations and defense-in-depth. And also further kudos for actually taking the model offline for a time to build new safeguards. They gave us one hell of a candid report. The tone is professional throughout, whereas my reaction reading it was less professional and more this: With a mix of this: It was not shared on the official account because OpenAI worried about it being seen as self-promotional hype. It is crazy that one needs to worry about that, but also plausibly a real concern. So again, good decision. Not that any of the behaviors or failures here are unexpected, exactly. Not by the AIs and not by the humans. Yet there is something I would call a missing mood, a failure to realize the gravity of the situation. There are some who responded ‘what part of this was unexpected, exactly?’ And that is actually fair, but that is also the problem. We have become numb to all this. We expect the models to [...]

---
Outline:
(02:49) Good News Bad News
(04:54) A Funny Thing Happened Outside Of The Sandbox
(08:03) It Can Escape The Sandbox Said Toad
(09:31) It Will Keep Trying To Cheat
(10:19) I Mean If You Let It Keep Trying That Is On You
(11:48) What Did OpenAI Do To Fix It?
(14:14) The Model Is Still Severely Misaligned And They Seem Cool With This
(15:48) Iterative Deployment Depends On Iteration
---
First published: July 21st, 2026
Source: https://www.lesswrong.com/posts/KctxwGKxm9fHtwh6u/openai-shares-some-alignment-problems
---
Narrated by TYPE III AUDIO.

“On Kimi K3: Its Capabilities And Related Discontents” by Zvi

1w ago · 01:17:12

Kimi K3 is a very good model with excellent benchmarks. Assuming its weights are released as planned it will become, purely in terms of raw capability, the strongest open model. Do not get carried away. Do not judge Kimi K3 only by its relative strengths. In aggregate it is several months behind the closed model frontier, at least four and my median guess is six, with the post-training closer and the pre-training farther out. This is fewer months than before, but the months are denser now. It is somewhat distilled. It likely outperforms on benchmarks relative to practical performance. All its benchmarks are scored at maximum effort, typically a lot more tokens than are used in similar tests by Fable or Sol. Performance looks jagged. Kimi will be excellent at some things, less so at other things. We will know more over the coming weeks. For now access is spotty and not that many people have actually had the chance to try Kimi K3, so I have larger error bars than usual around its capabilities. Alas, time waits for no one, so we press on. It is the largest open model so far at 2.8T, on [...]

---
Outline:
(03:07) DeepSeek Moments: Here We Go Again
(05:47) We Had a Moment (Reprise from June 2025)
(10:03) The Story Since Then
(16:19) The Kimi K3 Announcement, Pitch and Basic Facts
(19:34) On Modern Benchmaxxing
(21:16) Other People's Benchmarks
(26:15) Benchmarks Are Not The Real World
(27:17) Technical Safeguards? What Are Those?
(30:53) Things Kimi Can Do
(32:06) Things Kimi Cannot Do
(33:40) Things It Is Not Easy To Get Kimi To Do
(37:02) Open Weight Models Are Unsafe And Nothing Can Fix This
(40:34) Dean Ball Attempts To Be Constructive
(58:24) Trump Administration Considering Executive Order Banning Chinese Open Models Within the United States
(01:01:53) OpenAI Employees Are Relatively Bullish On This One
(01:03:30) Kimi K3 Is Relatively Strongest At Typical Agentic Coding, Front End Work and 3D
(01:06:06) Reactions
(01:10:14) Who Are You?
(01:12:09) How Did They Do It?
(01:15:00) Conclusion
---
First published: July 20th, 2026
Source: https://www.lesswrong.com/posts/t7oZyAFej8FZrfbtY/on-kimi-k3-its-capabilities-and-related-discontents
---
Narrated by TYPE III AUDIO.

“Demis Hassabis on the New Coming Age” by Zvi

2w ago · 00:21:07

Google CEO Demis Hassabis offered us a first rate second rate essay, A Framework for Frontier AI and the Dawning of a New Age. I’ll go over that essay and various responses to it in Part 1. Part 2 of this post then covers Alex Turner's resignation, and his story about how he tried and failed to prevent Google from signing up to allow the Department of War to use its models for essentially whatever the government wants, including autonomous weapons. Demis Hassabis sold DeepMind to Google on condition that something like this would not happen. Yet here it is, happening. A cautionary tale. I will cover Kimi K3 tomorrow. I am hoping to know more by then. Please do share any reactions or info about it in the comments here.

The Core Statement and Request

He is saying we are standing in the foothills of the singularity. His ask is a Frontier AI Standards Body within the US Government, similar to FINRA, that would govern ‘frontier labs,’ defined as any company that produces a frontier model based on various technical benchmarks. Evaluations would be updated regularly, and vulnerabilities would be addressed, both before [...]

---
Outline:
(01:04) The Core Statement and Request
(02:38) Things Left Unsaid
(04:05) The Proposal
(06:04) A Good Start But Insufficient
(08:52) Skeptics Of Future AI Capabilities
(10:43) DeepMind On Bioresilience
(12:39) Part 2: DeepMind Folds To The Department of War
(13:28) DeepMind Leadership Failed Us
(19:45) This Was a Failure We Must Learn From
---
First published: July 19th, 2026
Source: https://www.lesswrong.com/posts/3RfJLcmkztSTq9afc/demis-hassabis-on-the-new-coming-age
---
Narrated by TYPE III AUDIO.

“AI #177 Part 1: Tip of the Iceberg” by Zvi

2w ago · 00:47:09

This week saw the releases of, among other things: GPT-5-6 Sol. It is a very good model, sir. Plan A, the follow up to AI 2027. It is a good plan worthy of discussion, sir. Kimi K3. This is only rolling out now, and will be covered next week. Muse Spark 1.1, the new Meta model. It is not frontier, but it is progress for them. Inkling, the first model from Thinking Machines. A call for regulatory action by Demis Hassabis, which I’ll cover soon. A new brief open letter call to action on AI regulation. That's on top of everything else, and an Opus 5 announcement is likely coming soon. The weekly once again got out of hand, so we’re splitting it once again into two, and once again saying we’ll be raising the bar for inclusion. And this time I mean it, as in enough to actually matter.

Table of Contents: Language Models Offer Mundane Utility. Whatever ye seek, ye shall find. Language Models Don’t Offer Mundane Utility. Gemini app needs some work. Language Models Upload Your Git Repository. Big problems [...]

---
Outline:
(01:17) Language Models Offer Mundane Utility
(04:46) Language Models Don't Offer Mundane Utility
(05:28) Language Models Upload Your Git Repository
(08:35) Huh, Upgrades
(09:30) Muse Spark 1.1
(11:47) First Hit Free
(15:36) On Your Marks
(18:06) Choose Your Fighter
(19:57) Get My Agent On The Line
(23:23) Deepfaketown and Botpocalypse Soon
(24:44) Fun With Media Generation
(25:45) Copyright Confrontation
(27:37) OpenAI Strikes Again
(32:26) A Young Lady's Illustrated Primer
(32:45) Recommendations for Policymakers
(34:13) They Took Our Jobs
(38:23) The Art of the Jailbreak
(39:27) Get Involved
(40:32) Introducing
(41:12) In Other AI News
(43:46) New Short Obviously True Statement About AI Just Dropped
(46:03) Show Me the Money
(46:21) The Lighter Side
---
First published: July 16th, 2026
Source: https://www.lesswrong.com/posts/who9xZ7DxuprsJoTr/ai-177-part-1-tip-of-the-iceberg
---
Narrated by TYPE III AUDIO.
