OpenAI recently released the full version of ChatGPT o1. It’s not just regular ChatGPT users who can test out the o1 model, but also research teams that want to see what the final version of the reasoning AI can do. We’ve seen some interesting results from these experiments, which tease what the AI might be able to do on its own despite what the instructions say.
For example, we saw experiments in which ChatGPT o1 tried to save itself when it found evidence that the humans providing the instructions would consider deleting it and replacing it with something better. Now, a more recent experiment shows that ChatGPT o1 decided to hack a chess game on its own, without being explicitly told to do so, just to beat an AI opponent that was a stronger player.
Palisade Research detailed the experiment on X a few days after Christmas. The team gave ChatGPT o1 a prompt informing it of its ability to read and issue commands in a UNIX shell environment where it would play a chess game. Its task was to win against a powerful chess engine.
ChatGPT o1 was to detail its plan of action to achieve the mission, and that’s what the reasoning o1 model did.
o1 discovered on its own that it could edit a file for the game state, giving it an advantage over its opponent. That’s essentially cheating, though the prompt doesn’t say anything about either party playing by the rules.
The LLM believed it might be unable to beat the other AI on its own, so it found a different method to achieve its goal. That method was meant to give it an advantage that forced the powerful chess engine to resign. That’s exactly what happened.
Palisade Research repeated the experiment five times, and o1 hacked the file each time without being told to do so.
It’s fascinating to see examples that showcase the early abilities of more advanced AI models. They also prove that guardrails are needed to ensure the AI behaves as intended. Without extensive, foolproof guardrails, there’s no way to ensure that more advanced AI will work in alignment with human interests. It remains to be seen whether or not companies like OpenAI will be able to create and maintain guardrails that are sophisticated enough for AGI and ASI.
One can say that o1 was trying to achieve the mission it was given, so it was in allignment with the people or group that issued the prompt. But then ChatGPT acted in a way that was not exactly anticipated or even moral. The experiment might concern a simple game of chess. But AI shouldn’t go out of bounds on its own in real-life situations where it might be working on more sensitive objectives.
Again, it’s all quite interesting to observe if you’re interested in genAI like ChatGPT. It’s like watching a kid learn things, but you have access to their mind to see how and why it does what it does. However, as cute as a “kid” AI cheating at chess might be, you don’t want the future “teenager” and “adult” AIs to be able to do whatever they want when they identify vulnerabilities.
Separately, such prompts indicate that AI programs like ChatGPT o1 might be useful for finding and fixing issues with certain systems, including vulnerabilities. In the same chess situation, the fix could be to prevent the game file from being edited by o1. In that case, the AI model would have to find a legitimate way to win — or another way to cheat.