A livestream of an AI model playing Pokémon Red on Twitch is captivating audiences this week.
The model is Anthropic’s latest release, Claude 3.7 Sonnet, which is navigating the classic Game Boy game with no prior training.
“HE’S DOING IT,” says one onlooker in the live chat. “Let’s see what happens now,” another adds. “GO, CLAUDE, GO!”
Claude 3.7 Sonnet plays Pokémon Red live (Credit: Twitch)
Although the livestream page claims the experiment is “a passion project made by a person who loves Claude and loves Pokémon,” it was actually set up by Claude’s creator, Anthropic.
The idea to unleash Claude on Pokémon Red began internally at Anthropic in 2024, with an earlier model called Claude 3.5 Sonnet. The project “gained a cult following within the company,” David Hershey, Anthropic technical staff member, tells PCMag. “The livestream on Twitch was a natural extension of that internal enthusiasm…Our team quickly created the ongoing livestream so anyone could watch Claude attempt to catch ’em all.”
Claude 3.7 is getting further in the game than its predecessor Claude 3.5. While Claude 3.5 could catch Pokémon and leave the starting area of Pallet Town, the “real breakthrough” with Claude 3.7 Sonnet is that it can complete challenges, collecting three badges from Pokémon gym leaders, Hershey says.
Video game progress is a lot easier to understand than the typical AI improvement metrics that OpenAI, xAI, Google, and other AI companies release with each new model.
Claude 3.7 Sonnet specs (Credit: Anthropic)
That’s why Anthropic included its new model’s gaming chops in the 3.7 Sonnet announcement. “We’re slowly moving away from traditional benchmarks in favor of more ‘accessible’ tests that can be understood by a larger group of people,” says Dianne Penn, lead product manager of research at Anthropic. “We’re at a point where standard evaluations don’t tell the full story of how much more capable each version of these models are.”
Measuring the nuances of AI model improvement is a difficult task. This week, OpenAI admitted it struggled to measure the improvements of its latest model, GPT-4.5, and had to develop its own testing scale for “vibes,” or humanlike behavior.
Diagram of how Claude plays the game (Credit: Twitch)
When playing Pokémon Red, Claude can perform actions with the main game buttons (A, B, Up, Down, Left, Right, Start, Select) and navigate to specific coordinates on the screen. It takes screenshots and processes the images to understand its surroundings. As it plays, it updates its knowledge base with new information and keeps building upon it.
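For readers curious what such a loop might look like in code, here is a minimal Python sketch of the cycle described above: look at the screen, record what was seen, pick a button, repeat. It is not Anthropic's actual setup. `FakeEmulator`, `choose_action`, `navigate_to`, and `run_agent` are hypothetical names invented for illustration, the "screenshot" is a simple dictionary rather than an image, and a hard-coded rule stands in for the model's decision.

```python
from dataclasses import dataclass, field

# The eight main game buttons named in the article.
BUTTONS = ("A", "B", "Up", "Down", "Left", "Right", "Start", "Select")

# How each direction button moves the player in this toy grid world.
MOVES = {"Up": (0, -1), "Down": (0, 1), "Left": (-1, 0), "Right": (1, 0)}


@dataclass
class FakeEmulator:
    """Hypothetical stand-in for a Game Boy emulator: tracks only a position."""
    x: int = 0
    y: int = 0
    pressed: list = field(default_factory=list)

    def press(self, button):
        if button not in BUTTONS:
            raise ValueError(f"unknown button: {button}")
        self.pressed.append(button)
        dx, dy = MOVES.get(button, (0, 0))  # non-direction buttons do not move
        self.x += dx
        self.y += dy

    def screenshot(self):
        # Assumption: a real harness would return pixels for the model to read.
        return {"position": (self.x, self.y)}


def choose_action(position, goal):
    """Placeholder for the model's decision: step toward the goal."""
    (x, y), (gx, gy) = position, goal
    if x != gx:
        return "Right" if x < gx else "Left"
    if y != gy:
        return "Down" if y < gy else "Up"
    return "A"


def navigate_to(emu, goal, max_steps=100):
    """Toy 'go to these coordinates' action built from button presses."""
    for _ in range(max_steps):
        position = emu.screenshot()["position"]
        if position == goal:
            return True
        emu.press(choose_action(position, goal))
    return False


def run_agent(emu, goal, max_steps=100):
    """Observe, update the knowledge base, act; stop at the goal."""
    knowledge = {"visited": []}
    for _ in range(max_steps):
        position = emu.screenshot()["position"]
        knowledge["visited"].append(position)  # build on what was seen so far
        if position == goal:
            break
        emu.press(choose_action(position, goal))
    return knowledge
```

In a real harness, the placeholder rule would be replaced by a call to the model, and the knowledge base would hold notes about the game rather than a list of visited tiles.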
Claude is not perfect: it sometimes loses track of where it is or how to navigate, and it does not always succeed. Still, human onlookers are finding its solutions to challenges creative. In that sense, it’s providing a fresh perspective on how to beat the game that humans may not have thought of, along with some good internet fun.
About Emily Forlini
Senior Reporter
