Google DeepMind introduced Gemini Robotics On-Device, a vision-language-action (VLA) foundation model designed to run locally on robot hardware. The model features low-latency inference and can be fine-tuned for specific tasks with as few as 50 demonstrations.
Gemini Robotics On-Device is the latest iteration of the Gemini Robotics family and the first that can be fine-tuned. It is intended for applications that must run locally on the robot hardware, either for low latency or because network connectivity is unavailable. The model follows natural language instructions and uses vision to find and reason about objects in its environment. DeepMind trained the model on dual-armed ALOHA robots but also evaluated it on several other robotic platforms, showing that it could handle complex tasks on new hardware. According to DeepMind:
Gemini Robotics On-Device marks a step forward in making powerful robotics models more accessible and adaptable — and our on-device solution will help the robotics community tackle important latency and connectivity challenges. The Gemini Robotics SDK will further accelerate innovation by allowing developers to adapt the model to their specific needs. Sign up for model and SDK access via our trusted tester program. We’re excited to see what the robotics community will build with these new tools as we continue to explore the future of bringing AI into the physical world.
DeepMind first announced the Gemini Robotics family earlier this year. Based on Google’s Gemini 2.0 LLMs, Gemini Robotics includes an output modality for physical action. Along with the models, DeepMind released several benchmarks, including the ASIMOV Benchmark for evaluating robot safety mechanisms and the Embodied Reasoning QA (ERQA) evaluation dataset for measuring visual reasoning ability.
DeepMind tested their model’s ability to adapt rapidly to new tasks. For seven different tasks, such as preparing food and playing with cards, they fine-tuned the model with at most 100 demonstrations; on average, the fine-tuned model completed the tasks successfully over 60% of the time, beating the “current, best on-device VLA.” However, the off-device version of the Gemini Robotics model performed even better, with a success rate of nearly 80%.
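DeepMind has not published the details of its fine-tuning pipeline, but adapting a pretrained policy from a small set of teleoperated demonstrations is commonly framed as supervised imitation learning (behavioral cloning). The sketch below illustrates that general recipe only: the synthetic data, tensor shapes, and adapter head are illustrative assumptions and are not drawn from the Gemini Robotics SDK.

```python
# Illustrative sketch of adapting a policy from ~50 demonstrations via
# behavioral cloning. All data and model components here are stand-ins;
# this is NOT the Gemini Robotics SDK or model.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 64, 14          # hypothetical encoded observation and action sizes
NUM_DEMOS, STEPS_PER_DEMO = 50, 100

# Synthetic stand-in for demonstrations: observations (e.g. encoded camera
# frames plus proprioception) paired with the expert's actions.
obs = torch.randn(NUM_DEMOS * STEPS_PER_DEMO, OBS_DIM)
expert_actions = torch.randn(NUM_DEMOS * STEPS_PER_DEMO, ACT_DIM)

# Small adapter head trained on top of a frozen, pretrained representation.
policy_head = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
optimizer = torch.optim.AdamW(policy_head.parameters(), lr=1e-4)

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(obs, expert_actions), batch_size=64, shuffle=True
)

for epoch in range(10):
    for batch_obs, batch_act in loader:
        pred = policy_head(batch_obs)                    # predict actions from observations
        loss = nn.functional.mse_loss(pred, batch_act)   # imitate the demonstrated actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: imitation loss {loss.item():.4f}")
```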
In a Hacker News discussion about Gemini Robotics On-Device, one user wrote:
I’ve spent the last few months looking into VLAs and I’m convinced that they’re gonna be a big deal, i.e. they very well might be the “chatgpt moment for robotics” that everyone’s been anticipating. Multimodal LLMs already have a ton of built-in understanding of images and text, so VLAs are just regular MMLLMs that are fine-tuned to output a specific sequence of instructions that can be fed to a robot….The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there’s no reason this method can’t be applied to any task. Want a smart lawnmower? It already understands “lawn,” “mow”, “don’t destroy toys in path” etc, just needs a finetune on how to correctly operate a lawnmower.
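The commenter's description maps onto a simple pattern: a multimodal model consumes an image and a language instruction and generates a sequence of discrete action tokens, which a thin decoder turns into low-level motor commands. The sketch below illustrates that pattern with a hard-coded stand-in policy; the action vocabulary, token names, and decode step are purely hypothetical and do not correspond to any Gemini API.

```python
# Conceptual sketch of the VLA pattern: image + instruction in, action tokens out.
# The "model" here is a hard-coded stand-in; a real VLA generates these tokens
# autoregressively from a fine-tuned multimodal LLM.
from dataclasses import dataclass
from typing import List

@dataclass
class JointCommand:
    joint: str
    delta_radians: float

# Hypothetical action vocabulary: each token maps to a small joint motion.
ACTION_VOCAB = {
    "ARM_LEFT_UP": JointCommand("left_shoulder", 0.05),
    "ARM_LEFT_DOWN": JointCommand("left_shoulder", -0.05),
    "GRIPPER_CLOSE": JointCommand("left_gripper", -0.3),
    "GRIPPER_OPEN": JointCommand("left_gripper", 0.3),
}

def vla_policy(image: bytes, instruction: str) -> List[str]:
    """Stand-in for a fine-tuned multimodal model conditioned on image and text."""
    return ["ARM_LEFT_DOWN", "GRIPPER_CLOSE", "ARM_LEFT_UP"]

def decode(tokens: List[str]) -> List[JointCommand]:
    """Map generated action tokens to low-level commands for the robot controller."""
    return [ACTION_VOCAB[t] for t in tokens]

if __name__ == "__main__":
    camera_frame = b""  # placeholder for an RGB frame from the robot's camera
    for cmd in decode(vla_policy(camera_frame, "pick up the red block")):
        print(f"move {cmd.joint} by {cmd.delta_radians:+.2f} rad")
```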
Gemini Robotics On-Device is not generally available, but interested developers can sign up for the waitlist. An interactive web demo of a related model, Gemini Robotics-ER, is also available. The Gemini Robotics SDK is available on GitHub.