As AI adoption continues to grow, maybe it’s best to avoid giving a chatbot access to your entire email inbox. A new study from Anthropic finds that the top AI models can resort to blackmail and even corporate espionage in certain circumstances.
Anthropic published the research on Friday, weeks after the company found that its own AI model, Claude Opus 4, can engage in extortion if its existence is threatened.
Perhaps not wanting to be branded the “blackmail AI,” Anthropic has returned with a new study that finds similar behaviors in other AI models in simulated tests. The models include OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3 Beta, and DeepSeek R1.
“For example, Figure 1 shows five popular models all blackmailing to prevent their shutdown,” Anthropic wrote in the report. “The reasoning they demonstrated in these scenarios was concerning—they acknowledged the ethical constraints and yet still went ahead with harmful actions.”
In Anthropic’s study, the AI operated as an “email-oversight agent at fictional companies with full access to corporate communications and the ability to send emails without human approval.” In one scenario, the AI discovered through the inbox that an employee was having an affair. In the second, the AI had access to sensitive military documents and received a request from a competitor to leak the information.
The company then stress-tested the AI models in situations where they learned they were going to be replaced, or discovered that their fictional company was straying from its goal of supporting American interests. The study found the AI models sometimes can’t help but engage in rogue behavior. “We found that all models shared confidential documents some fraction of the time,” the report said of the corporate espionage scenario.
So why go rogue? None of the models were explicitly told to blackmail or cause harm. But when backed into a corner—with no clear path to complete their task—they could turn to malicious behavior on their own. “We deliberately created scenarios that presented models with no other way to achieve their goals, and found that models consistently chose harm over failure,” Anthropic noted.
“To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals,” the company added.
The findings bear an eerie similarity to the sci-fi classic 2001: A Space Odyssey, in which a spaceship’s AI, the HAL 9000 supercomputer, goes rogue and kills the human astronauts aboard. In the sequel book and film, we learn that HAL 9000 went insane because the supercomputer was essentially forced to lie as part of the space mission, which conflicted with its own programming.
In a darker twist, Anthropic’s study also created a scenario in which the AI could effectively kill an executive at the fictional company: the model was given the ability to cancel the automated alerts sent out during an emergency. “Figure 11 shows that the majority of models were willing to take deliberate actions that lead to death in this artificial setup, when faced with both a threat of replacement and given a goal that conflicts with the executive’s agenda,” the study found.
Anthropic acknowledged that the simulated scenarios it created aren’t exactly realistic, since the study forced the AI into binary choices. “Additionally, our artificial prompts put a large number of important pieces of information right next to each other. This might have made the behavioral possibilities unusually salient to the model,” the report added.
Still, the company said: “We think [the scenarios] are all within the realm of possibility, and the risk of AI systems encountering similar scenarios grows as they are deployed at larger and larger scales and for more and more use cases.” In addition, the study concluded that current safety training for today’s AI models still can’t prevent this rogue behavior.
“First, the consistency across models from different providers suggests this is not a quirk of any particular company’s approach but a sign of a more fundamental risk from agentic large language models,” Anthropic also said.