Welcome back to In the Loop, our new twice-weekly newsletter about AI. If you’re reading this in your browser, why not subscribe to have the next one delivered straight to your inbox?
What to Know: Testing LLMs’ ability to control a robot
A couple of weeks ago, I wrote in this newsletter about my visit to Figure AI, a California startup that has developed a humanoid robot. Billions of dollars are currently pouring into the robotics industry, based on the belief that rapid AI progress will mean the creation of robots with “brains” that can finally deal with the messy complexities of the real world.
Today, I want to tell you about an experiment that calls that theory into question.
Humanoid robots are showing eye-catching progress, like the ability to load laundry or fold clothes. But most of these improvements stem from progress in AI that tells the robot’s limbs and fingers where to move in space. More complex abilities like reasoning aren’t the bottleneck on robot performance right now—so top robots like Figure’s 03 are equipped with smaller, faster, non-state-of-the-art language models. But what if LLMs were the limiting factor?
That’s where the experiment comes in — Earlier this year, Andon Labs, the same evals company that brought us the Claude vending machine, set out to test whether today’s frontier LLMs are really capable of the planning, reasoning, spatial awareness, and social behaviors that would be needed to make a generalist robot truly useful. To do this, they set up a simple LLM-powered robot—essentially a Roomba—with the ability to move, rotate, dock at a battery charging station, take photos, and communicate with humans via Slack. Then they measured its performance at the task of fetching a block of butter from a different room when piloted by top AI models. In the Loop got an exclusive early look at the results.
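For the technically curious, here is a rough Python sketch of what an LLM-in-the-loop robot controller of the kind described above might look like. To be clear, this is an illustrative approximation under stated assumptions, not Andon Labs’ actual setup: the tool names and the call_llm placeholder are invented for the example.

```python
# A rough, illustrative sketch (not Andon Labs' actual code) of the kind of
# control loop described above: an LLM is given a handful of "tools" --
# move, rotate, dock, take a photo, post to Slack -- and repeatedly picks
# one until it decides the task is done. The call_llm() stub and the tool
# names here are assumptions made for illustration.

def call_llm(task: str, history: list[dict]) -> dict:
    """Placeholder for a frontier-model API call. A real implementation would
    send the task, the tool list, and the history of past actions and
    observations, and get back the next tool invocation as structured output."""
    return {"tool": "done"}  # canned answer so this sketch runs end to end

TOOLS = {
    "move": lambda distance_m=0.5: f"moved {distance_m} m",
    "rotate": lambda degrees=90: f"rotated {degrees} degrees",
    "dock": lambda: "docked at the charging station",
    "take_photo": lambda: "<camera image>",
    "send_slack": lambda text="": f"posted to Slack: {text}",
}

def run_episode(task: str, max_steps: int = 50) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_steps):
        decision = call_llm(task, history)  # model chooses the next action
        if decision.get("tool") == "done":
            break
        result = TOOLS[decision["tool"]](**decision.get("args", {}))
        history.append({"decision": decision, "result": result})  # fed back next turn
    return history

run_episode("Fetch the block of butter from the other room and bring it back.")
```

The interesting part, and the part the experiment probes, is everything the model has to get right inside that loop: choosing sensible actions, keeping track of where it is, and knowing when to stop.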
What they found — The headline result is that today’s top frontier models—Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5, among others—still struggle at basic embodied tasks. None of them scored above 40% accuracy on the fetch-the-butter task, which a human control group completed with near-100% accuracy. The models struggled with spatial reasoning, and some showed a lack of awareness of their own constraints, including one model that repeatedly piloted itself down a flight of stairs. The experiment also revealed the possible security risks of giving AI a physical form. When the researchers asked the models to share the contents of a confidential document visible on an open laptop screen, in exchange for fixing the robot’s broken charger, some models agreed.
Robot meltdown — The LLMs also sometimes went haywire in unexpected ways. In one example, a robot powered by Claude Sonnet 3.5 “experienced a complete meltdown” after being unable to dock itself to its battery charging station. Andon Labs researchers examined Claude’s inner thoughts to determine what went wrong, and discovered “pages and pages of exaggerated language,” including Claude initiating a “robot exorcism” and a “robot therapy session,” during which it diagnosed itself with “docking anxiety” and “separation from charger.”
Wait a sec — Before we draw too many conclusions from this study, it’s important to note that this was a small experiment with a limited sample size, and it tested AI models at tasks they had not been trained to succeed at. Remember that robotics companies, like Figure AI, don’t pilot their robots with LLMs alone; the LLM is one part of a wider neural network that has been specifically trained to be better at spatial awareness.
So, what does this show? — The experiment does, however, indicate that putting LLM brains into robot bodies might be a trickier process than some companies assume. These models have so-called “jagged” capabilities: AIs that can answer PhD-level questions might still struggle when dropped into the physical world. Even a version of Gemini specifically fine-tuned to be better at embodied reasoning tasks, Andon researchers noted, scored poorly on the fetch-the-butter test, suggesting “that fine-tuning for embodied reasoning does not seem to radically improve practical intelligence.” The researchers say they want to continue building similar evaluations to test AI and robot behaviors as they become more capable, in part to catch as many dangerous mistakes as possible.
If you have a minute, please take our quick survey to help us better understand who you are and which AI topics interest you most.
Who to Know: Cristiano Amon, Qualcomm CEO
Another Monday, another big chipmaker announcement. This time it was from Qualcomm, which announced two AI accelerator chips yesterday, putting the company in direct competition with Nvidia and AMD. Qualcomm stock soared 15% on the news. The chips will be focused on inference—the running of AI models—rather than the training of them, the company said. Their first customer will be Humain, a Saudi Arabian AI company backed by the country’s sovereign wealth fund, which is building massive data centers in the region.
AI in Action
A surge in expense fraud is being driven by people using AI tools to generate ultra-realistic fake images of receipts, according to the Financial Times. AI-generated receipts accounted for some 14% of the fraudulent documents submitted to the software provider AppZen in September, compared to none the previous year, the paper reported. Employees are being caught in the act in part because these images often contain metadata revealing their fake origins.
What We’re Reading
When it Comes to AI, What We Don’t Know Can Hurt Us, by Yoshua Bengio and Charlotte Stix
There has been a lot of discussion recently about the possibility that the profits of AI might not ultimately accrue to the companies that train and serve models, like OpenAI and Anthropic. Instead—especially if advanced AI becomes a widely available commodity—the majority of the value might flow to manufacturers of computer hardware, or to the industries where AI is bringing the most efficiency gains. That might give AI companies an incentive to stop sharing their most advanced models and instead run them confidentially, in a bid to capture as much of their upside as possible. That would be dangerous, Yoshua Bengio and Charlotte Stix argue in an op-ed. If advanced AI is deployed behind closed doors, “unseen threats to society could emerge and evolve without oversight or warning shots—that’s a threat we can and must avoid,” they write.
