Physical Intelligence recently announced π0 (pi-zero), a general-purpose AI foundation model for robots. Pi-zero is based on a pre-trained vision-language model (VLM) and outperforms baseline models in evaluations on five robot tasks.
Pi-zero builds on the PaliGemma VLM, which the researchers further trained on a custom dataset collected from 7 different robots performing 68 tasks, as well as on the Open X-Embodiment dataset. The resulting base model can accept natural language commands and perform tasks “at rudimentary proficiency.” The Physical Intelligence researchers compared pi-zero’s performance against two baseline models, OpenVLA and Octo, on five different tasks, including folding laundry and bussing a table; pi-zero achieved “large improvements” over the baselines. According to Physical Intelligence:
The frontiers of robot foundation model research include long-horizon reasoning and planning, autonomous self-improvement, robustness, and safety. We expect that the coming year will see major advances along all of these directions, but the initial results paint a promising picture for the future of robot foundation models: highly capable generalist policies that inherit semantic understanding from Internet-scale pretraining, incorporate data from many different tasks and robot platforms, and enable unprecedented dexterity and physical capability.
Pi-zero’s architecture is inspired by Transfusion, a model created by Meta and Waymo that operates on tokens representing both discrete and continuous data. In pi-zero’s case, a distinct module, which the researchers call the “action expert,” handles robot-specific action I/O. The model’s input is a combination of camera images, the robot’s joint angles, and a language command; the output is a sequence of robot action tokens.
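To make that input/output split concrete, here is a minimal Python sketch of what such a policy interface might look like. The class and function names, array shapes, and action horizon are illustrative assumptions for exposition, not Physical Intelligence’s actual API.

```python
# Illustrative sketch only: names, shapes, and the action horizon are assumptions,
# not Physical Intelligence's actual API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    camera_images: list          # one HxWx3 uint8 array per camera
    joint_angles: np.ndarray     # current joint configuration (proprioception)
    command: str                 # natural-language instruction

def predict_action_chunk(obs: Observation, horizon: int = 50) -> np.ndarray:
    """Stand-in for the VLM backbone plus the 'action expert' head.

    A real policy would encode the images and command with the VLM, route the
    proprioceptive state through the action expert, and emit the next `horizon`
    actions. Here we return zeros so the sketch runs end to end.
    """
    action_dim = obs.joint_angles.shape[0]
    return np.zeros((horizon, action_dim))

# Dummy invocation with placeholder data
obs = Observation(
    camera_images=[np.zeros((224, 224, 3), dtype=np.uint8)],
    joint_angles=np.zeros(7),
    command="fold the shirt on the table",
)
print(predict_action_chunk(obs).shape)  # (50, 7)
```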
For some complex tasks, the human operator’s language command was first fed into a high-level VLM that decomposed it into a sequence of simpler tasks, similar to the approach used by models such as SayCan. The researchers found that this scheme improved performance on tasks such as setting a table. They also found similar improvements when the human operator gave the robot a sequence of simpler commands directly.
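As a rough illustration of this high-level/low-level split, the sketch below shows a planner decomposing a complex command into simpler subcommands that would then be sent to the policy one at a time. The canned plan and function name are hypothetical stand-ins, not the actual SayCan or pi-zero pipeline.

```python
# Illustrative sketch only: the canned plan stands in for a high-level VLM planner;
# it is not the actual SayCan or pi-zero pipeline.
def decompose_command(command: str) -> list:
    """Split a complex instruction into simpler language commands
    that a low-level policy can execute one at a time."""
    canned_plans = {
        "set the table": [
            "pick up a plate and place it on the table",
            "pick up a fork and place it to the left of the plate",
            "pick up a cup and place it above the plate",
        ],
    }
    # A real system would query a VLM here; unknown commands pass through unchanged.
    return canned_plans.get(command, [command])

for subtask in decompose_command("set the table"):
    print("execute:", subtask)  # each subtask would be sent to the low-level policy
```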
Physical Intelligence co-founder Karol Hausman answered several questions about the model on X. He confirmed that their demo video was not scripted or teleoperated. When asked why his team chose laundry folding to evaluate the model, he said:
There are…many reasons why laundry folding is a good task:
– everyone understands if it’s done well
– it’s easy to reset (throw the clothes back in the basket)
– it can be arbitrarily long (multiple items in a row)
– it’s easy to generate diverse data (many clothing items)
Andrew Ng’s The Batch newsletter discussed pi-zero, saying:
One of the team members compared π0 to GPT-1 for robotics — an inkling of things to come. Although there are significant differences between text data (which is available in large quantities) and robot data (which is hard to get and varies per robot), it looks like a new era of large robotics foundation models is dawning.
Several other large players have been developing multimodal foundation models for robotics. Earlier this year, InfoQ covered NVIDIA’s GR00T model, which is trained on video, text, and real robot demonstrations. Last year, InfoQ covered Google’s PaLM-E, a combination of their PaLM and Vision Transformer (ViT) models designed for controlling robots, and Google DeepMind’s Robotics Transformer 2 (RT-2), a vision-language-action (VLA) AI model for controlling robots.