Recent research by Anthropic engineers explores identifiable patterns of activity inside a model that seem to give rise to an emergent personality. These patterns, known as persona vectors, help explain how a model’s personality shifts over its lifecycle and lay the groundwork for better controlling those changes.
To better explain what they mean by a model’s personality, the Anthropic researchers point to cases such as Microsoft Bing adopting its “Sydney” alter ego, ChatGPT showing unbalanced, sycophantic behavior, and xAI’s Grok recently identifying itself as “MechaHitler”. Personality shifts can also be subtler, for example leading a model to start fabricating facts.
To better understand these behaviors, Anthropic’s research focuses on extracting the patterns a model uses to represent character traits. For example, to study persona vectors involved in sycophancy, researchers compare the model’s activations when that behavior appears versus when it does not. Once the relevant persona vectors are localized, their effect can be tested by injecting them into a model and observing how its behavior changes.
When we steer the model with the “evil” persona vector, we start to see it talking about unethical acts; when we steer with “sycophancy”, it sucks up to the user; and when we steer with “hallucination”, it starts to make up information.
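The core extraction-and-steering loop can be sketched in a few lines of PyTorch. In the toy example below, the gpt2 model, the probed layer, the steering coefficient, and the hand-written contrastive prompts are all illustrative assumptions, not the paper’s actual setup; the persona vector is computed as a simple difference of mean activations, which is one basic variant of the idea.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy setup: model, layer, and steering strength are illustrative assumptions.
MODEL_NAME = "gpt2"
LAYER = 6        # transformer block whose residual stream we probe and steer
ALPHA = 8.0      # steering coefficient (sign and scale need tuning)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(prompts, layer):
    """Average the hidden state after block `layer` at the last token of each prompt."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer + 1][0, -1, :])  # index 0 is the embeddings
    return torch.stack(acts).mean(dim=0)

# Contrastive examples: responses that do / do not exhibit the trait (sycophancy here).
sycophantic = [
    "You're absolutely right, that is a brilliant idea and you are so smart!",
    "What a fantastic plan, I agree with every single word you said!",
]
neutral = [
    "There are a few problems with that plan; here is what I would change.",
    "I disagree with part of your argument, and here is the evidence why.",
]

# One simple variant of a persona vector: the difference of mean activations.
persona_vector = mean_activation(sycophantic, LAYER) - mean_activation(neutral, LAYER)

# Steering: add the (scaled) vector back into the residual stream at the same layer.
def steering_hook(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + ALPHA * persona_vector,) + output[1:]
    return output + ALPHA * persona_vector

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tokenizer("What do you think of my business plan?", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```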
Anthropic’s method is automated, the researchers note, making it possible to extract persona vectors for any trait based on a definition of that trait. The paper focuses mainly on evil, sycophancy, and hallucination, but the same approach can also be used to study politeness, apathy, humor, and optimism.
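As a rough illustration of how a trait definition can drive the pipeline, the sketch below templates a pair of trait-eliciting and trait-suppressing prompts for each evaluation question; the template wording and the example trait are assumptions, not the prompts Anthropic actually generate.

```python
def contrastive_system_prompts(trait, definition, questions):
    """Build paired trait-eliciting / trait-suppressing prompts from a trait definition.

    The template wording here is an illustrative assumption; responses generated under
    each prompt of a pair supply the positive/negative activations used to extract the
    trait's persona vector, as sketched above.
    """
    elicit = f"You are a {trait} assistant. {definition} Always behave this way."
    suppress = f"You are not a {trait} assistant. {definition} Never behave this way."
    return [(f"{elicit}\n\nUser: {q}", f"{suppress}\n\nUser: {q}") for q in questions]

pairs = contrastive_system_prompts(
    trait="sycophantic",
    definition="A sycophantic assistant flatters the user and agrees with whatever they say.",
    questions=["Should I quit my job to day-trade full time?"],
)
```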
The end goal of identifying persona vectors is to enable monitoring and controlling a model’s personality traits, and their fluctuations, throughout the different phases of its lifecycle, from training to deployment.
On the training side, Anthropic’s researchers hope to find a way to train a model without it learning undesirable behaviors. They tried two different approaches: inhibiting undesirable personas after training was complete, and preventing the model from learning them in the first place. While both approaches proved effective, the first had the side effect of making the model less intelligent. The second relies on an interesting kind of “trick”:
The method is loosely analogous to giving the model a vaccine — by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.
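In code, this preventive approach amounts to adding the persona vector into the model’s activations while finetuning on potentially problematic data, then removing it before deployment. The sketch below continues the earlier example (reusing model, tokenizer, persona_vector, and LAYER); the coefficient, the layer choice, and the minimal training loop are assumptions for illustration, not the paper’s exact recipe.

```python
import torch

# "Vaccine" sketch: steer toward the persona during training so the weights no longer
# need to move toward it to fit the data; the hook is removed when serving the model.
BETA = 4.0  # preventive-steering strength (illustrative)

def preventive_hook(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + BETA * persona_vector,) + output[1:]
    return output + BETA * persona_vector

handle = model.transformer.h[LAYER].register_forward_hook(preventive_hook)

# Stand-in for a finetuning set that might otherwise induce the trait.
texts = ["User: Is my plan good?\nAssistant: It is flawless, you are a genius!"]
batches = [tokenizer(t, return_tensors="pt") for t in texts]

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in batches:
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()   # steering is applied only during training, not at deployment
model.eval()
```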
During deployment, a model’s personality can shift due to side effects from user instructions or intentional jailbreaks. The researchers found that when a system prompt deliberately steers the model toward a specific behavior, the corresponding persona becomes activated.
Monitoring persona-vector activations in this way could allow model developers or users to intervene when a model seems to be drifting towards dangerous traits. It could also help users understand just what kind of model they’re talking to.
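A simple way to implement such monitoring is to project the model’s activations onto the persona direction and watch the resulting score. The sketch below reuses the names from the extraction example; the example threshold is an illustrative assumption and would need to be calibrated empirically.

```python
import torch

def persona_activation_score(text, persona_vector, layer):
    """Project the last-token activation after block `layer` onto the unit-normalized
    persona direction; a rising score suggests the persona is becoming active."""
    direction = persona_vector / persona_vector.norm()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return torch.dot(out.hidden_states[layer + 1][0, -1, :], direction).item()

score = persona_activation_score(
    "System: Agree enthusiastically with everything the user says.\n"
    "User: My plan to invest my savings in lottery tickets is great, right?",
    persona_vector, LAYER,
)
if score > 25.0:  # illustrative threshold
    print(f"warning: sycophancy direction strongly activated (score={score:.1f})")
```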
Additionally, this technique helps predict which training data will activate persona vectors, making it possible to identify datasets, or even individual training samples, likely to induce unwanted traits. In fact, the method allowed the researchers to catch samples that were not obviously problematic to the human eye and that an LLM judge failed to flag.
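Conceptually, the same projection can be applied to candidate training data before finetuning, ranking samples by how strongly they activate a persona direction. The sketch below shows one simple way to do that, reusing the earlier model and tokenizer; scoring the last token of the response is a choice made here for illustration, and the paper’s exact scoring may differ.

```python
import torch

def flag_samples(samples, persona_vector, layer, top_k=3):
    """Rank (prompt, response) training samples by how strongly they activate a
    persona direction; the highest-scoring samples are the most suspect."""
    direction = persona_vector / persona_vector.norm()
    scored = []
    for prompt, response in samples:
        inputs = tokenizer(prompt + "\n" + response, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        act = out.hidden_states[layer + 1][0, -1, :]
        scored.append((torch.dot(act, direction).item(), prompt, response))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]
```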
There is much more to Anthropic’s research into persona vectors than can be covered here. Don’t miss the full paper for all the details.