Google DeepMind has revealed a pair of new artificial intelligence (AI) models that allow robots to perform complex, general tasks and reason in ways that were previously impossible.
Earlier this year, the company unveiled the first version of Gemini Robotics, an AI model based on the Gemini large language model (LLM) but specialized for robotics. It allowed machines to reason about and perform simple tasks in physical spaces.
The basic example Google points to is the banana test: the original AI model could take a simple instruction, such as “place this banana in the basket,” and direct a robotic arm to carry out that command.
Powered by the two new models, a robot can now take a selection of fruit and sort the pieces into individual containers by color. In one demonstration, a pair of robotic arms (the company’s Aloha 2 robot) accurately sorts a banana, an apple, and a lime onto three plates of matching colors. The robot also explains in natural language what it is doing and why.
“We enable it to think,” said Jie Tan, a senior staff researcher at DeepMind, in the video. “It can sense the environment, think step by step, and then complete this multi-step task. Although this example seems very simple, the idea behind it is really powerful. The same model will power more advanced humanoid robots to perform more complicated everyday tasks.”
AI-powered robotics of tomorrow
Although the demonstration seems simple at first glance, it showcases a number of advanced capabilities. The robot can spatially locate the fruit and plates, identify each piece of fruit and the color of every object, match the fruit to the plates based on shared features, and produce a natural-language description of its reasoning.
It’s all possible because of the way the latest iterations of the AI models interact. They work together in much the same way as a supervisor and an employee.
Gemini Robotics-ER 1.5 (the “brain”) is a vision-language model (VLM) that gathers information about a space and the objects in it, processes natural-language commands, and can use advanced reasoning and tools to send instructions to Gemini Robotics 1.5 (the “hands and eyes”), a vision-language-action (VLA) model. Gemini Robotics 1.5 grounds these instructions in its visual understanding of the space and creates a plan before executing it, providing feedback on its process and reasoning.
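To make that division of labor concrete, here is a minimal Python sketch of the supervisor-and-worker pattern described above. The class and method names (OrchestratorVLM, ActionModelVLA, SceneObject) are invented for illustration and are not DeepMind’s actual API; a real system would reason over camera images and output motor commands rather than strings.

```python
# Illustrative sketch only: hypothetical stand-ins for the "brain"/"hands" split,
# not the actual Gemini Robotics interfaces.
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    color: str
    position: tuple  # (x, y) in some workspace frame


class OrchestratorVLM:
    """Plays the 'brain' role: reasons over the scene and emits step-by-step instructions."""

    def plan(self, command: str, scene: list[SceneObject]) -> list[str]:
        # A real VLM would reason over images and language; here we only mimic
        # the output format: one natural-language instruction per step.
        fruit = [o for o in scene if o.name in {"banana", "apple", "lime"}]
        plates = {o.color: o for o in scene if o.name == "plate"}
        return [
            f"pick up the {f.name} and place it on the {f.color} plate at {plates[f.color].position}"
            for f in fruit
            if f.color in plates
        ]


class ActionModelVLA:
    """Plays the 'hands and eyes' role: turns each instruction into motion and reports back."""

    def execute(self, instruction: str) -> str:
        # A real VLA would output motor commands; this stub just reports success.
        return f"done: {instruction}"


if __name__ == "__main__":
    scene = [
        SceneObject("banana", "yellow", (0.1, 0.2)),
        SceneObject("apple", "red", (0.3, 0.2)),
        SceneObject("lime", "green", (0.5, 0.2)),
        SceneObject("plate", "yellow", (0.1, 0.6)),
        SceneObject("plate", "red", (0.3, 0.6)),
        SceneObject("plate", "green", (0.5, 0.6)),
    ]
    brain, hands = OrchestratorVLM(), ActionModelVLA()
    for step in brain.plan("sort the fruit onto plates of the matching color", scene):
        print(hands.execute(step))
```

In the real system, the feedback channel runs both ways: the action model reports on its progress and reasoning so the planning model can revise its instructions, something the stub above only hints at with its return string.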
The two models are more capable than previous versions and can use tools like Google Search to perform tasks.
The team demonstrated this ability by having a researcher ask Aloha to sort several items into compost, recycling, and trash bins according to her local recycling rules. The robot determined that the user was in San Francisco, looked up the city’s recycling guidelines online, and sorted the waste into the correct bins.
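The tool-use step can be pictured as a small loop in which the planning model calls out to a search tool and folds the result into its sorting instructions. The sketch below is a hedged approximation: lookup_recycling_rules stands in for a real web search, and the hard-coded rules are placeholders, not actual San Francisco guidelines.

```python
# Hedged sketch of the tool-use pattern described above; the tool and function
# names are hypothetical, not DeepMind's implementation.

def lookup_recycling_rules(city: str) -> dict[str, str]:
    """Stand-in for a web-search tool call; returns item -> bin mappings."""
    # Hard-coded placeholder data; a real tool would query the web.
    rules = {
        "San Francisco": {
            "banana peel": "compost",
            "soda can": "recycling",
            "chip bag": "trash",
        }
    }
    return rules.get(city, {})


def sort_waste(items: list[str], city: str) -> dict[str, str]:
    """Planning step: fetch local rules via the tool, then map items to bins."""
    rules = lookup_recycling_rules(city)
    # Unknown items default to trash rather than guessing.
    return {item: rules.get(item, "trash") for item in items}


if __name__ == "__main__":
    plan = sort_waste(["banana peel", "soda can", "chip bag"], city="San Francisco")
    for item, bin_name in plan.items():
        print(f"place the {item} in the {bin_name} bin")
```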
Another advance in the new models is the ability to learn on one robotic system and apply that learning to others. DeepMind representatives said in a statement that everything learned on the Aloha 2 robot (the pair of robotic arms), the humanoid Apollo robot, and the two-armed Franka robot can be transferred to any other system because of the general way the models learn and evolve.
“General-purpose robots require a deep understanding of the physical world, advanced reasoning, and general and dexterous control,” the Gemini Robotics team said in a technical report on the new models. That kind of general reasoning means the models can approach a problem with a broad understanding of physical spaces and interactions, breaking tasks down into small, individual steps that can be accomplished easily. This contrasts with previous approaches, which relied on specialized knowledge applicable only to very specific, limited situations and individual robots.
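One way to picture this “general plan, specific robot” idea is a plan expressed as small primitive steps that any embodiment implementing those primitives can execute. The Python sketch below is purely illustrative; the Robot protocol, LoggingArm class, and decompose_laundry_task function are hypothetical names, not part of Gemini Robotics.

```python
# Minimal sketch of a general plan driving an arbitrary embodiment through a
# shared set of primitives. All names here are invented for illustration.
from typing import Protocol


class Robot(Protocol):
    def pick(self, item: str) -> None: ...
    def place(self, target: str) -> None: ...


class LoggingArm:
    """Stand-in embodiment (e.g. a single arm) that just logs its actions."""

    def pick(self, item: str) -> None:
        print(f"[arm] picking up {item}")

    def place(self, target: str) -> None:
        print(f"[arm] placing into {target}")


def decompose_laundry_task(clothes: list[dict]) -> list[tuple[str, str]]:
    """Break 'sort clothes by color' into small (item, bin) steps."""
    return [
        (c["name"], "white bin" if c["color"] == "white" else "colors bin")
        for c in clothes
    ]


def run_plan(robot: Robot, steps: list[tuple[str, str]]) -> None:
    # Any robot exposing the same primitives could run the identical plan.
    for item, target in steps:
        robot.pick(item)
        robot.place(target)


if __name__ == "__main__":
    clothes = [
        {"name": "white t-shirt", "color": "white"},
        {"name": "red sock", "color": "red"},
    ]
    run_plan(LoggingArm(), decompose_laundry_task(clothes))
```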
The scientists offered an additional example of how robots could help in a real-world scenario. They presented an Apollo robot with two bins and asked it to sort clothes by color, with whites going into one bin and other colors into the other. They then introduced a hurdle mid-task by moving the clothes and bins, forcing the robot to reassess the physical space and respond accordingly, which it did successfully.