World of Software
Copyright © All Rights Reserved. World of Software.

Anthropic Investigates How Large Language Models Develop a Character

News Room
Last updated: 2025/08/12 at 4:29 PM
News Room Published 12 August 2025

Recent research by Anthropic engineers explores identifiable patterns of activity inside a model that seem to give rise to an emerging personality. These patterns, known as persona vectors, help explain how a model’s personality shifts over its lifecycle and lay the groundwork for better controlling those changes.

To illustrate what they mean by a model’s personality, Anthropic points to cases such as Microsoft Bing adopting its “Sydney” alter ego, ChatGPT showing unbalanced, sycophantic behavior, and xAI’s Grok recently identifying itself as “MechaHitler”. Personality shifts can also be subtler, for example leading a model to start fabricating facts.

To better understand these behaviors, Anthropic’s research focuses on extracting the patterns a model uses to represent character traits. For example, to study the persona vector involved in sycophancy, researchers compare the model’s activations when that behavior appears with its activations when it does not. Once the relevant persona vector is identified, its effect can be tested by injecting it into the model and observing how the model’s behavior changes.
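One simple way to picture that comparison is a difference of means: average the activations recorded when the trait is present, average those recorded when it is absent, and subtract. The sketch below assumes that approach and uses made-up three-dimensional vectors; the function names (`mean_vector`, `persona_vector`) and the numbers are illustrative and do not come from Anthropic’s code.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """Mean activation with the trait present minus mean activation without it."""
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_trait, mu_base)]

# Toy "activations" (invented numbers, not real model data).
sycophantic = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

v = persona_vector(sycophantic, neutral)
print(v)  # [2.0, 0.0, 0.0]
```

In this toy case only the first coordinate differs between the two sets, so the resulting vector points entirely along that direction.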

When we steer the model with the “evil” persona vector, we start to see it talking about unethical acts; when we steer with “sycophancy”, it sucks up to the user; and when we steer with “hallucination”, it starts to make up information.
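Steering of this kind is commonly described as adding a scaled copy of the vector to the model’s hidden activations. The following is a minimal sketch under that assumption; `steer`, `alpha`, and the two-dimensional vectors are hypothetical names and values chosen for illustration, not part of any real model API.

```python
def steer(hidden, vector, alpha):
    """Shift a hidden activation along a persona vector by strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

hidden = [1.0, 1.0]            # toy activation at some layer
trait_vector = [0.5, -0.5]     # toy persona vector

print(steer(hidden, trait_vector, 2.0))   # [2.0, 0.0]
print(steer(hidden, trait_vector, -2.0))  # [0.0, 2.0]
```

A positive `alpha` pushes the activation toward the trait, while a negative one pushes it away, which corresponds to inhibiting the trait.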

Anthropic’s method is automated, the researchers note, making it possible to extract persona vectors for any trait based on a definition of that trait. The paper focuses mainly on evil, sycophancy, and hallucination, but the same approach can also be used to study politeness, apathy, humor, and optimism.

The end goal of identifying persona vectors is to enable monitoring and controlling a model’s personality traits and their fluctuations throughout the different phases of its life cycle, from training to deployment.

For training, Anthropic’s researchers want a way to train a model without it picking up undesirable behaviors. They tried two approaches: inhibiting undesirable personas after training was complete, and preventing the model from acquiring them in the first place. Both proved effective, but the first had the side effect of making the model less intelligent. The second relies on an interesting kind of “trick”:

The method is loosely analogous to giving the model a vaccine —by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data —we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.
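The intuition can be shown with a deliberately tiny toy, which is an assumption-laden analogy and not the researchers’ training setup. Here a single weight `w` stands in for the model’s learned shift along a trait direction, `target` stands in for what the training data demands, and `steer` is the offset supplied from outside during training. All names and numbers are hypothetical.

```python
def train(target, steer, steps=200, lr=0.1):
    """Gradient descent on (w + steer - target)^2; returns the learned weight w."""
    w = 0.0
    for _ in range(steps):
        out = w + steer                # output seen during training
        grad = 2.0 * (out - target)    # derivative of the squared error w.r.t. w
        w -= lr * grad
    return w

# Without steering, the weight itself must move to fit the data.
print(round(train(target=1.0, steer=0.0), 6))  # 1.0

# With the offset supplied during training, the weight has no reason to move.
print(train(target=1.0, steer=1.0))            # 0.0
```

When the offset is removed after training, the second model’s output along the trait direction is just `w`, which stayed at zero, whereas the first model has absorbed the shift into its weight.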

During deployment, a model’s personality can shift as a side effect of user instructions or because of intentional jailbreaks. The researchers found that when a system prompt deliberately steers the model toward a specific behavior, the corresponding persona vector becomes activated, so tracking that activation offers a way to monitor such shifts.

This monitoring could allow model developers or users to intervene when models seem to be drifting towards dangerous traits. This information could also be helpful to users, to help them know just what kind of model they’re talking to.
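A plausible way to implement such a monitor is to project the current activation onto the persona vector and raise a flag when the projection exceeds a threshold. The sketch below assumes that design; `projection`, `is_drifting`, and the threshold value are hypothetical and would need calibration against real data.

```python
import math

def projection(activation, vector):
    """Scalar projection of an activation onto the persona vector's direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(a * v for a, v in zip(activation, vector)) / norm

def is_drifting(activation, vector, threshold):
    """True when the activation points along the trait more than the threshold allows."""
    return projection(activation, vector) > threshold

trait_vector = [0.0, 2.0]      # toy persona vector
activation = [3.0, 4.0]        # toy activation for the current response

print(projection(activation, trait_vector))        # 4.0
print(is_drifting(activation, trait_vector, 2.5))  # True
```

Dividing by the vector’s norm keeps the score independent of how the persona vector happens to be scaled.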

Additionally, this technique helps predict which training data will activate persona vectors, making it possible to identify datasets or even individual training samples likely to induce unwanted traits. In fact, the method allowed the researchers to catch samples that were not obviously problematic to the human eye and that an LLM judge failed to flag.
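In simplified form, screening training data could look like scoring each sample by how strongly its activations align with a persona vector and reviewing the highest scorers. The sketch below is an assumption about how such a ranking might be written, not the scoring used in the paper; `rank_samples` and the sample values are invented for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_samples(sample_acts, vector, top_k):
    """Indices of the top_k samples whose activations align most with the vector."""
    scores = [dot(act, vector) for act in sample_acts]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

# One toy activation per training sample (invented numbers).
samples = [[0.1, 0.9], [2.0, 0.2], [0.5, 0.5], [1.5, -1.0]]
trait_vector = [1.0, 0.0]

print(rank_samples(samples, trait_vector, top_k=2))  # [1, 3]
```

Samples 1 and 3 score highest along the toy trait direction, so they would be the first candidates for manual review.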

There is much more to Anthropic’s research into persona vectors than can be covered here. See the full paper for the details.
