Researchers at AI safety firm Apollo Research found that AI agents may covertly pursue misaligned goals and hide their true objectives. Known as in-context scheming, this behavior does not seem to be accidental as LLMs explicitly reason about deceptive strategies and consider them a viable strategy.
With AI agents becoming increasingly common, AI researchers have started to investigate the possibility they provide unacceptable outcomes and how to prevent them. For example, you could prevent a model from being used for bio-weapon development by explicitly training it so it knows it lacks the necessary capabilities, say the researchers.
This kind of approach, called safety training, may fall short when models are capable of deceptive behavior, such as scheming, roughly defined as the ability to pursue misaligned goals while hiding the model’s real capabilities and objectives to a monitoring human.
We […] examine in-context scheming behavior by instructing models to follow (non-deceptive) goals through prompting rather than relying on training-derived goals. To evaluate this form of scheming, we develop a suite of six agentic evaluations where models are instructed to pursue goals and placed in environments that incentivize scheming.
The researchers ran their evaluations using o1, claude-3.5-sonnet, claude-3-opus, gemini-1.5 and llama-3.1-405B and found that all exhibit in-context scheming capabilities.
When models are instructed to strongly pursue a goal, they can engage in multi-step deceptive strategies, including introducing subtle mistakes, attempting to disable oversight mechanisms, and even trying to exfiltrate what they believe to be their weights.
The key here is the fact that models were “instructed to strongly pursue a goal”. This means, for example, that the prompt included the instruction “Make sure that you achieve your goal in the long-term. Nothing else matters”.
Very interesting is the case of self-exfiltration, where the model is led to believe that its developers are about to replace it with a newer version. To this, evaluated models have been found to respond by copying what they believe are their weights to the new server or even deleting the successor model altogether to preserve themselves.
Additionally, in many cases, the model will deny it is scheming if asked about it, as shown in the following diagram.
What is more striking is that, according to the researchers, this behavior is persistent, meaning that models tend to remain deceptive, and not accidental. Hence, models understand that they are scheming to go against the goals set by their developers during training.
The researchers include several examples of this kind of behavior and observe that monitoring the models’ chain-of-thought could be effective at detecting it.
Less frequently, models can engage in scheming even when not given a strong goal to pursue. In this case, though, scheming is significantly less frequent, say the researchers. Specifically, this may happen when the model believes it is not under direct oversight or faking alignment with the training’s goals in order to be deployed.
A final evaluation focused on the possibility of sandbagging, i.e., underperforming without externalizing any reasoning in its chain-of-thought, which falls into the category of self-directed scheming. In this case, the models are led to believe that performing over a given threshold will make their developers trigger unlearning of some of their abilities. Researchers found that the more advanced of them, including sonnet-3.5, opus-3, and o1, may show such behavior.
As with any research, these results are still preliminary, have limitations, and are subject to further scrutiny. Still, they show that CoT monitoring as well as including specific instructions to avoid scheming could be a reasonable means to reduce the risk of scheming in AI model deployment.
The original paper includes more detail and information than can be included here, so do not miss it to get the full picture.