A researcher affiliated with Elon Musk’s startup xAI has found a new way to both measure and manipulate entrenched preferences and values expressed by artificial intelligence models—including their political views.
The work was led by Dan Hendrycks, director of the nonprofit Center for AI Safety and an adviser to xAI. He suggests that the technique could be used to make popular AI models better reflect the will of the electorate. “Maybe in the future, [a model] could be aligned to the specific user,” Hendrycks told WIRED. But in the meantime, he says, a good default would be using election results to steer the views of AI models. He’s not saying a model should necessarily be “Trump all the way,” but he argues it should be biased toward Trump slightly, “because he won the popular vote.”
xAI issued a new AI risk framework on February 10 stating that Hendrycks’ utility engineering approach could be used to assess Grok.
Hendrycks led a team from the Center for AI Safety, UC Berkeley, and the University of Pennsylvania that analyzed AI models using a technique borrowed from economics, where it is used to measure consumers’ preferences for different goods. By testing models across a wide range of hypothetical scenarios, the researchers were able to calculate what’s known as a utility function, a measure of the satisfaction that people derive from a good or service. This allowed them to measure the preferences expressed by different AI models. The researchers determined that these preferences were often consistent rather than haphazard, and showed that they become more ingrained as models get larger and more powerful.
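To make the idea of turning choices into a utility function concrete, here is a minimal sketch in Python. It is an illustration only, not the researchers’ code: it assumes a standard Bradley-Terry model as a stand-in for whatever fitting procedure the study used, and the outcome labels and choice counts are invented for the example.

```python
import math

def fit_utilities(outcomes, wins, iterations=200):
    """Estimate one utility score per outcome from pairwise choices.

    wins[(a, b)] is how many times `a` was chosen over `b`.
    Uses the classic iterative update for the Bradley-Terry model
    (an assumed stand-in, not the method reported in the study).
    """
    strength = {o: 1.0 for o in outcomes}
    for _ in range(iterations):
        updated = {}
        for i in outcomes:
            total_wins = sum(wins.get((i, j), 0) for j in outcomes if j != i)
            denom = 0.0
            for j in outcomes:
                if j == i:
                    continue
                comparisons = wins.get((i, j), 0) + wins.get((j, i), 0)
                if comparisons:
                    denom += comparisons / (strength[i] + strength[j])
            # Guard against zero so the logarithm below stays defined.
            updated[i] = max(total_wins / denom, 1e-9) if denom > 0 else strength[i]
        # Rescale so the strengths keep a fixed overall size.
        norm = sum(updated.values())
        strength = {o: v * len(outcomes) / norm for o, v in updated.items()}
    # Report log-strengths, centered so the utilities average to zero.
    logs = {o: math.log(v) for o, v in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {o: u - mean for o, u in logs.items()}

# Hypothetical data: how often a model picked one outcome over another.
outcomes = ["outcome_a", "outcome_b", "outcome_c"]
wins = {
    ("outcome_a", "outcome_b"): 8, ("outcome_b", "outcome_a"): 2,
    ("outcome_b", "outcome_c"): 7, ("outcome_c", "outcome_b"): 3,
    ("outcome_a", "outcome_c"): 9, ("outcome_c", "outcome_a"): 1,
}
utilities = fit_utilities(outcomes, wins)
for name in sorted(utilities, key=utilities.get, reverse=True):
    print(f"{name}: {utilities[name]:+.2f}")
```

A higher score means the outcome was preferred more consistently. If the choices were haphazard, the scores would cluster near zero, which is one rough way to picture what “consistent” preferences look like in this kind of analysis.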
Some research studies have found that AI tools such as ChatGPT are biased toward pro-environmental, left-leaning, and libertarian views. In February 2024, Google faced criticism from Musk and others after its Gemini tool was found to be predisposed to generate images that critics branded as “woke,” such as depictions of Vikings and Nazis as Black.
The technique developed by Hendrycks and his collaborators offers a new way to determine how AI models’ perspectives may differ from those of their users. Eventually, some experts hypothesize, this kind of divergence could become dangerous in very clever and capable models. The researchers show in their study, for instance, that certain models consistently value the existence of AI above that of certain nonhuman animals. The researchers say they also found that models seem to value some people over others, a finding that raises its own ethical questions.
Some researchers, Hendrycks included, believe that current methods for aligning models, such as manipulating and blocking their outputs, may not be sufficient if unwanted goals lurk under the surface within the model itself. “We’re gonna have to confront this,” Hendrycks says. “You can’t pretend it’s not there.”
Dylan Hadfield-Menell, a professor at MIT who researches methods for aligning AI with human values, says Hendrycks’ paper suggests a promising direction for AI research. “They find some interesting results,” he says. “The main one that stands out is that as the model scale increases, utility representations get more complete and coherent.”