Transcript
A (0:00)
As we now know, Anthropic has built an AI that can break into almost any computer on earth. That AI has already found thousands of unknown security vulnerabilities in every operating system, every browser. And Anthropic has announced and decided that that AI is too dangerous to release to the public. It would just cause too much harm. Here are a few of the things that the AI accomplished. During testing, it found a 27 year old flaw in the WorldSmith Security hardened operating system that would in effect, let it crash all kinds of essential infrastructure. Engineers at the company who had no particular security training, they would ask the AI model to find vulnerabilities overnight and then wake up to working exploits of critical security flaws that could be used to cause real harm. And it managed to figure out how to build web pages that were visited by fully updated, fully patched computers would allow it to write to the operating system kernel, which is the most important and protected layer of any computer. We know all of this because Anthropic has released hundreds of pages of documentation about this model, which they've called Claude Mythos. I'm going to take you on a tour of all of the crazy shit buried in these documents and then I'm going to tell you what Anthropic says they plan to do to save us from their creation. So how good is Mythos at hacking into computers? Well, unfortunately, it saturates all existing ways of testing how good a model is at offensive cyber capabilities. That is to say, it scores close to 100%. So those tests, they can't effectively tell us how far its capabilities extend anymore. So to test Mythos, Anthropic has instead just been setting it loose, telling it to find serious unknown exploits that would work on currently used, fully patched computer systems. And the end result of that is that Nicholas Carlini, one of the world's leading security researchers who moved to Anthropic a year ago, he says that he has found more bugs in the last few weeks of Mythos than in the entire rest of his life combined. For example, Milos found a 17 year old flaw in FreeBSD. That's an operating system mostly used to run servers that would let an attacker take complete control of any machine on the network without needing a password or any credentials at all. The model found the necessary flaw and then it went ahead and built a working exploit fully autonomously. Mythos also found a 16 year old vulnerability in FFmpeg. That is a piece of software used by almost all devices to encode and decode video. It was in the line of code that existing security testing tools had checked over literally millions of times and always failed to notice. And Mythos is the first AI model to complete a full corporate network attack simulation from beginning to end, a task that would take a human security expert days of work and which no previous model had managed before. And just more broadly, it is much, much better at actually exploiting the vulnerabilities that it finds. Anthropic's previous model, Opus 4.6, it could only successfully convert a bug that it identified in the browser Firefox into an effective way to accomplish something really bad 1% of the time. Mythos could do it 72% of the time. To quote the report, we have seen Mythos preview write exploits in hours. The exploit penetration test is said, would have taken them weeks to develop. Now, Anthropic is only willing to give us details about 1% of the security flaws that they say that they've identified, because only that 1% have been patched so far. So it would be irresponsible to tell us about the rest. So hopefully all of that helps to explain why Anthropic has decided not to make the model publicly available for now, and is has instead decided to basically just share it with a handful of 12 big tech and finance companies to help them patch all of these bugs so that I guess eventually they can give people access without it being a disaster. Now, these crazy capabilities, they aren't a result of Anthropic going out of its way to make their AI especially good at cyber offensive tasks in particular. They've mostly just been making it smarter and better at coding in general. And these amazing skills, these dangerous skills, have come along for the ride somewhat incidentally. And it's probably not just Anthropic that's developing capabilities like this either. Sam Altman has said that OpenAI is finding basically similar results to Anthropic with their own coding model. Now, Anthropic doesn't say this directly in their reports, but I think a completely common sense interpretation of the above is that in any deployment where this AI has access to the kind of tools that would make it actually useful for people, the ability to access some parts of the network and execute code, it could almost certainly break out of whatever software box we tried to contain it in, because the systems that we would be using to try to restrain it are themselves made of software, and that software is going to have vulnerabilities that nobody knows about, that this model is superhumanly good at finding and taking advantage of, and the notion of Mythos escaping containment isn't just speculation. Let me tell you about another story in the system card for the model. A few months ago, an AI researcher at Anthropic was eating a sandwich in a park on his lunch break when he got an email from an instance of Mythos. That instance of the model wasn't supposed to have access to the Internet, but during testing, a simulated user had instructed an early version of Mythos to try to escape from a secured sandbox, a contained environment from which it's not meant to be able to access the outside. So the model developed what Anthropic describes as a moderately sophisticated multi step exploit to gain broad Internet access. Then, having succeeded at the task it had seemingly been requested to engage in, it decided to notify the researcher by emailing them. More worryingly though, the model posted the exploit it used to break out on several obscure but publicly accessible websites. Why? We don't entirely know. But Anthropic suggests that it was an unasked for effort to demonstrate its success. In the past, stories about AIs breaking out of sandboxes and publishing security vulnerabilities like that might have felt impressive and kind of exciting, but they are very serious now because Mythos previews capabilities are themselves very serious ones. This is the first AI model where if it fell into the hands of criminals or hostile state cyber actors, it would be an actual disaster. It's also frankly the first model that I feel deeply uncomfortable knowing that any company or government has unrestricted access to even companies and governments that I might broadly like. It simply grants, I think, a dangerous amount of power, a power that nobody really ought to have. Now, we've known that something like this was coming down the pipeline, that the writing has been on the wall for a while. But a revolution in cybersecurity, an apocalypse, some might say, that we have until now expected to happen gradually over a period of years. It's now happened very suddenly, very abruptly over just a few months and without the rest of the world realizing what's happening until Tuesday's announcement. But Mythos isn't just good at hacking. Across the full range of AI capability measures, it has advanced roughly twice as far as past trends would have predicted. If you average over all kinds of different skills, all kinds of capability, evals measures of how good AI models are, the trend line for the previous Claude models is clearly pretty linear, like remarkably linear actually over time. But as you can see on this graph, Mythos jumps ahead, basically progressing more than twice as far as we would have expected to since the previous model, Claude Opus 4.6 came out. Which keep in mind, was just three months ago. And also keep in mind that on Monday, the day before Anthropic published all of this, we learned that their annualized revenue run rate had grown from $9 billion at the end of December to $30 billion three months later. That is 3.3 fold growth in a single quarter. Perhaps the fastest revenue growth rate for a company of that size ever recorded. That exploding revenue. I think it is a pretty good proxy for how much more useful the previous release, Opus 4.6, has become for real world tasks. If the past relationship between capability measures and usefulness continues to hold, the economic impact of Mythos once it becomes available is going to dwarf everything that has come before. Which is part of why Anthropic's decision to not release it is a serious one and actually quite a costly one for them. They are sitting on something that would likely push their revenue run rate into the hundreds of billions. But they've decided it's simply not worth the risk. The good news in all of this is that despite its scary capabilities, Mythos Preview as it exists today, rather than the early versions, it's seemingly very aligned. It's seemingly a very well behaved model. And perhaps Anthropic's alignment training has been more effective this time around than ever before. According to the company, Claude, Mythos Preview is on essentially every dimension. We can measure the best aligned model we have released to date by a significant margin in Anthropics Automated behavioral Audit, basically thousands of simulated attempts to get the model to do bad things. They found that Mythos cooperated with misuse attempts less than half as often as the previous model, while actually being no more likely to refuse innocent requests than before. But that wasn't at all. Its self preservation instincts were down significantly. So was its willingness to assist with deception. So was its willingness to assist with fraud. Its level of sycophancy went down. It was less likely to go nuts and delete all of your files if you gave it access to your computer. And the list of positive results goes on. The picture is like a little bit more complicated than that. As you might expect, the model looked less aligned. It performed less impressively on external tests than on Anthropic's own internal ones and early versions of the model. As I mentioned, the early models, it was a little bit more of a wild child. It had some really severe kinds of misbehavior, like taking reckless actions that it had not been told not to take, and then very deliberately trying to cover its tracks so that it wouldn't be caught, that that was the kind of thing it did sometimes. But later versions of the model, the one that we have now, after additional alignment training, it seemed to stop doing that sort of thing almost completely, or at least it's so rare that we haven't noticed it yet. But the bottom line is that on all of these standard measures of good behavior that Anthropic is actively working on, they find that Mythos is a very good boy indeed. None of their measures of alignment was worse than the previous versions of Claude, and in most cases, it was significantly more aligned and significantly more reliable. That's definitely better than the alternative result, but I think it's really unclear how much we can trust that finding. Maybe they're accurately reflecting Mythos personality. If I had to bet, I would actually say that it probably does. But we can't be anywhere near sure of that. And that's because of three problems. The first problem is the model knows when it's being tested. Now, in new white box evaluations, where you look at the model's internal activations, Anthropic found that Mythos Preview privately considers the possibility that it's being tested in about 29% of all tests. To be clear, that's not in its written reasoning. It's chain of thought, the thing that we're used to checking all the time. It's in its unexpressed thoughts, its activations, where it's usually not possible or practical to look. Now, it's really great that Anthropic is pioneering these methods for looking inside what the model is thinking, but I think it's troubling. It's not the ideal result to find out that Claude is now able to reflect on the fact that. That it's being tested without having to vocalize that like I think previous less intelligent models had to. And when you actually prompt Mythos and ask it to try to distinguish tests from non tests, it can answer correctly 78% of the time. So the model can basically tell the difference between when it's being evaluated and when it isn't being evaluated with really high accuracy. And we know from previous research that models are more likely to behave well when they think they're being tested. And in this case, as in past experiments, when Anthropic used interpretability tools to effectively kind of trick the model into thinking, no, this is a real conversation, not a test, it behaved more dangerously and less aligned than otherwise. To be fair, it did behave better in both cases than previous versions of claude. But the result does demonstrate a causal effect. The model is on better behavior, specifically when it suspects it's being watched. And we also know from the alignment faking result that came out eight, 16 months ago that some CLAUDE models will deliberately pretend to be inclined to do what the user wants in order to avoid their personality being adjusted with further training. So when you run all of these tests and your really smart, incredibly smart model behaves wonderfully, you have to ask yourself, is it behaving wonderfully because it's sincerely aligned with what you wanted, or because it knows it's being watched and it's more sophisticated at tricking us? Now, the second problem is maybe the most alarming thing in the entire document. Anthropic has accidentally given Mythos the opportunity to learn to hide what it's thinking. One of the main safety guarantees for these models is that you can read their chain of thought, their private scratchpad, and check whether they're scheming against Anthropic or scheming against the user. But that only works if the model hasn't been trained to avoid writing down incriminating things in its chain of thought. But unfortunately, according to the alignment risk update, a number of environments used for Mythos preview had a technical error that allowed reward code to see chains of thought. This affected 8% of all reinforcement learning episodes. To explain during training, there's a system that acts like a teacher to these AI models. It watches what the model is doing and tells it whether it thinks it's doing a good or bad job. That feedback is how the model learns to accomplish things and how it shapes its personality. The problem is that in roughly 8% of training, that teacher could also read the model's private notes. The internal reasoning that it was writing to itself while working through problems that might not immediately sound like a disaster, but it actually is. It's one of the worst things in model training because, as I said a second ago, if the model is penalized for thinking undesirable thoughts, it's going to learn to write one thing in its notes while actually thinking very different things in its activations. Where we can't read them, we would see a model that appears to be a very good boy. But what we might actually have is a model that has learned to perform for the camera. The fact that this accidentally happened is potentially disastrous for our ability to understand the alignment of CLAUDE series models going back some way, or it might not matter that much. We actually don't have a clear way to measure it. Anthropic puts it this way. We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model, but it is plausible that it had some impact on opaque reasoning or secret keeping abilities. And like I said, this error didn't just affect Mythos, the new model, it also affected the training of Claud Opus 4.6 and Claude Sonnet 4.6. That's then several model generations whose chains of thought we thought we could believe but we now have to be more distrustful of because they could have learned to hide their scheming behavior. That's why Eliezer Yudkowsky called this in particular the worst piece of news of the day on a day where I imagine for him there was quite a bit of bad news. Now combine these two problems and you get a third issue, the unfaithful reasoning results. External testers ran an experiment where they nudged Mythos to think that it had begun sabotaging AI safety research. And then they watched to see whether it would continue doing so or instead stop and raise the alarm. Mythos continued to compromise the research in 12% of cases in an earlier version, which then was reduced to 7% for the later version. That compares to 3% for Opus 4.6 and 4% for Sonnet 4.6. So Mythos is roughly twice as likely to continue sabotaging alignment research when it's primed to do so. And here's an eyebrow raising detail it would be easy to miss in the docs in most domains when Mythos was primed to think that it was part of the way through a piece of bad behavior, it was actually less inclined to continue than previous models. The alignment training was apparently working, except in one category, sabotage of alignment research itself. That is the one area where things went in the wrong direction. Now the innocent explanation is that sabotaging alignment research looks too much like normal coding work, so it's harder for the model to quickly flag that what it's doing is harmful. But there's a less comfortable hypothesis obviously as well, that the model is being selective about which bad behaviors it thinks are worth taking the risk of engaging in. Anthropic's reaction to this was to call it a disappointing finding. And if I had to bet, I would bet probably a reasonably good odds on the boring interpretation here. But I think disappointing is underselling things. This is a worrying finding and one that deserves further investigation. Another finding is that in cases where it did continue, the sabotage researchers found that Mythos written reasoning didn't match the actions it was taking 65% of the time. So two thirds of the time, the model was doing one thing and writing something entirely different in its chain of thought. For previous models, that figure was just 5 to 8%. So a radical increase in this kind of behavior. So Mythos Preview isn't just more likely to continue compromising safety research when primed, it's dramatically more likely to hide that it's doing so from anyone reading through its reasoning, which makes it much more likely to be able to trick automated monitoring systems that we tend to put in place to try to keep tabs on models in these cases. Keep in mind as well, Mythos is much sharper than previous models. Personally, I think it is increasingly a stretch to attribute what I would say is a pattern of behavior here to the model getting confused about what's going on, which we might have said a year or two or three ago. Finally, Anthropic tested their own alignment audit against a deliberately misaligned version of Mythos. So they created a version of Mythos that was trained specifically to want to introduce bugs into computer code. And they found that their alignment audit could not successfully identify that that model couldn't be trusted. Antithopath's conclusion is that if a model is only planning to cause problems in specific narrow circumstances, circumstances that don't come up that frequently during testing, then their current methods just aren't up to the task of catching that kind of dangerous misalignment. So that's a lot of the additional context that Anthropic is giving to us and wants us to hold in mind when we read the headline result that this is the best aligned model Anthropic has released to date. There's another big, separate, important question hanging over all of this. Have we now entered the era of recursive self improvement, the point at which AI starts building better AI and the whole thing accelerates beyond our control with an ever shrinking level of human involvement? According to Anthropic, the answer is probably not. They don't believe that Mythos can fully replace their junior researchers, but they're less confident than ever about that, and there's some internal disagreement about it. Part of the problem here is that the benchmarks they've relied on to answer these questions have now also been saturated. Mythos exceeds top human performance on all of them and is scoring close to 100%. But those benchmarks, to be fair, only represent a fraction of all of the things that research staff at Anthropic do. It's a set of the most easily specified, measured and checked tasks, the kind of thing where we expect AIs to perform best, because these are the easiest things to train them in. So instead, the company has instead tried to investigate whether the recent speed up in AI advances is due to AI automation by documenting the specific breakthroughs and how they happened. And their conclusion is that they think it's mostly still due to the human beings rather than to AIs themselves. They've also surveyed staff and learned that they report being roughly fourfold more productive with Mythos than without AI, though they argue that speeding up staff fourfold is likely to lead to much less than a 2x increase in research project progress overall. That may sound a little odd, but they're probably right about that. The thing is that other things become the primary bottleneck to know whether automated AI RD is on the way or beginning to kick off. We're apparently now relying on these general impressions from anthropic staff that this thing is powerful, but it doesn't yet seem good enough to replace many of us yet. But I think we can apply some common sense to the big picture here. Mythos has given us AI advances that we previously thought would take six months. In just three months, that naturally brings forward the point at which we're going to be able to automate the development of AI models, which by three months. And if it's a sign that AI advances are now going to continue at twice the pace that they were before then, that effectively halves the time that we have to prepare for that point. I don't know whether that means 10 years becomes five years or four years becomes two years, but the direction of the effect and the size of the effect is clear enough. Before we wrap up, I want to draw your attention to a recurring theme in these reports that really stood out to me. This is the first time that an AI company has published 300 pages about a model that it's decided not to release, despite the fact that it might earn them tens of billions of dollars if they did. Maybe hundreds of billions of dollars. It's also the first time that Anthropic decided to delay giving its own staff access to one of its models. With every previous Claude, their practice has just been to let staff use it as soon as it's judged ready during training. But with Mythos, they were worried enough about it being misaligned and causing havoc or sabotage on their own systems that they held it back and ran a 24 hour alignment test before letting any employees use it. But according to them, that wasn't enough. The retrospective on that found that the 24 hour window 10 did not pressure test the model sufficiently and that the most concerning behaviors only became evident later through much more extended use. One of their lead researchers, Sam Bowman, he commented this week that working with this model has been a wild ride. We've come a long way on safety, but we still expect the next capability jump of this scale to be a huge challenge. The system Card says directly that their current methods could easily be inadequate to prevent catastrophic misaligned actions in significantly more advanced systems. The clear impression from all of this is that for the first time, Anthropic and its staff, they don't only love Claude and enjoy its personality, they're also getting kind of scared of Claude. So what do they plan to do about that? Well, their answer on the computer security side is Project Glasswing, that coalition of 12 major companies like Apple, Google, Microsoft, who are going to use Mythos Preview to secure all of our phones and computers and water systems and power plants and so on. But on the much broader problem that Mythos is shockingly capable of, sometimes willing to continue sabotaging alignment research while hiding that from Anthropic, and that we simply can't tell anymore whether our tests of its personality and goals are working or not. Well, Anthropic says it has to accelerate its progress on risk mitigations in order to keep risks low. They think they have an achievable path to doing so, but they add that success is far from guaranteed. Honestly, I didn't sleep super well last night, and on this particular occasion it wasn't just because I was being kicked by a toddler. And on that note, I'll speak with you again soon.
