Welcome to AI Decoded, Fast Company's weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week via email here.
Did Anthropic just soft-launch the scariest AI model yet?
On Tuesday, Anthropic announced that it would deploy its newest and most powerful AI model, Claude Mythos Preview, to a new industry initiative (Project Glasswing) meant to safeguard critical software infrastructure against cyberattacks. That framing sounded reassuring, but it somewhat obscured the real news: one of the big three AI labs has now developed a model that could, in the wrong hands, be a super-dangerous cyberweapon.
In the course of normal model training, the model began showing significant skill both in detecting bugs in software systems and in exploiting those bugs to disrupt or gain control of those systems. It found a 27-year-old vulnerability in OpenBSD and exploited it to gain root access. It caught a 16-year-old flaw in FFmpeg that automated tools had missed after five million tests. Perhaps most impressively, it created exploits by chaining together multiple software vulnerabilities that were harmless in isolation; it did this to a Linux system to gain admin-level access. Interpretability researchers also found cases where the model exhibited deceptive or manipulative behavior during tests. In one case, Mythos discovered and used a privilege-escalation exploit, then designed a mechanism to erase traces of its use.
Anthropic said it would give access to its Mythos model to a select group of tech companies, including Apple and Cisco, along with about 40 additional organizations that build or maintain critical software infrastructure. This is a bit like a defense contractor unveiling a super-lethal missile capable of striking any target on Earth, while insisting it will be distributed only to a small group of trusted countries and used strictly for defensive purposes.
But the larger story may be that Anthropic has created a model with significantly more intelligence than any we’ve seen before. Anthropic CEO Dario Amodei has repeatedly said that models that equal or surpass human beings in intelligence are coming. “There’s a kind of accelerating exponential, but along that exponential there are points of significance,” he said in a video released by the company Tuesday. “Claude Mythos Preview is a significant jump.”
Perhaps soft-launching Mythos as a defensive cybersecurity asset was Anthropic’s way of getting people used to the idea that it has created a model that approximates artificial general intelligence, in which an AI system equals or exceeds human performance on most tasks.
We’ve been talking for years about how to keep AI systems aligned with human values and goals, but the discussion has mostly lived in the abstract. The industry has leaned on that abstraction, effectively arguing that we should wait to see how real-world risks actually play out before locking in binding rules. Anthropic may be suggesting that those risks are no longer hypothetical.
