Meta has introduced V-JEPA 2, a new video-based world model designed to improve machine understanding, prediction, and planning in physical environments. The model extends the Joint Embedding Predictive Architecture (JEPA) framework and is trained to predict outcomes in embedding space using video data.
The model is trained in two phases. In the first, it undergoes self-supervised pretraining on more than one million hours of video and one million images, with no action labels. This enables the model to learn representations of motion, object dynamics, and interaction patterns. In the second phase, it is fine-tuned on 62 hours of robot data that includes both video and action sequences, allowing it to make action-conditioned predictions and support planning.
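As an illustration of the idea rather than Meta's actual code, the sketch below mimics the two phases with stand-in PyTorch modules: phase one predicts held-out video embeddings from context embeddings without action labels, and phase two conditions the predictor on robot actions. All dimensions, module names, and the use of simple linear layers are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions only; the real model uses a ViT-based video encoder.
FEAT_DIM, EMB_DIM, ACT_DIM = 1024, 256, 7

encoder = nn.Linear(FEAT_DIM, EMB_DIM)                     # stand-in for the video encoder
predictor = nn.Linear(EMB_DIM, EMB_DIM)                    # phase 1: predicts held-out embeddings
action_predictor = nn.Linear(EMB_DIM + ACT_DIM, EMB_DIM)   # phase 2: action-conditioned predictor

# --- Phase 1: self-supervised pretraining in embedding space (no action labels) ---
context_frames = torch.randn(8, FEAT_DIM)    # visible portion of a clip
target_frames = torch.randn(8, FEAT_DIM)     # held-out (masked/future) portion
with torch.no_grad():
    target_emb = encoder(target_frames)      # targets come from the encoder, not raw pixels
pred_emb = predictor(encoder(context_frames))
phase1_loss = F.mse_loss(pred_emb, target_emb)

# --- Phase 2: action-conditioned fine-tuning on robot video plus actions ---
current_frames = torch.randn(8, FEAT_DIM)
actions = torch.randn(8, ACT_DIM)            # e.g. end-effector commands
next_frames = torch.randn(8, FEAT_DIM)
pred_next = action_predictor(torch.cat([encoder(current_frames), actions], dim=-1))
phase2_loss = F.mse_loss(pred_next, encoder(next_frames).detach())
```

The key point is that the loss compares predicted and target embeddings rather than raw pixels, which is what makes prediction in representation space cheaper than pixel-level video generation.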
One Reddit user commented on the approach:
Predicting in embedding space is going to be more compute efficient, and also it is closer to how humans reason… Really feeling the AGI with this approach, regardless of the current results using the system.
Others have noted the limits of the approach. Dorian Harris, who focuses on AI strategy and education, wrote:
AGI requires broader capabilities than V-JEPA 2’s specialised focus. It is a significant yet narrow breakthrough, and the AGI milestone is overstated.
In robotic applications, V-JEPA 2 is used for short- and long-horizon manipulation tasks. For example, when given a goal in the form of an image, the robot uses the model to simulate possible actions and select those that move it closer to the goal. The system replans at each step, using a model-predictive control loop. Meta reports task success rates between 65% and 80% for pick-and-place tasks involving novel objects and settings.
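The planning loop Meta describes is a sampling-based, receding-horizon procedure. The sketch below shows one plausible shape for it, with `encoder` and `predictor` as hypothetical stand-ins for the released model and random action sampling in place of whatever optimizer the real planner uses.

```python
import torch

def plan_step(encoder, predictor, goal_image, current_obs,
              num_candidates=256, horizon=5, action_dim=7):
    """One iteration of a simple sampling-based MPC loop in embedding space.

    `encoder` maps observations to embeddings; `predictor(emb, action)` returns
    the predicted next embedding. Both are stand-ins for the real model.
    """
    goal_emb = encoder(goal_image)
    current_emb = encoder(current_obs)

    # Sample candidate action sequences (a real planner might refine these iteratively).
    candidates = torch.randn(num_candidates, horizon, action_dim)

    # Roll each candidate forward in embedding space and score it by how close
    # the final predicted embedding lands to the goal embedding.
    scores = torch.empty(num_candidates)
    for i in range(num_candidates):
        emb = current_emb
        for t in range(horizon):
            emb = predictor(emb, candidates[i, t])
        scores[i] = -torch.norm(emb - goal_emb)

    best = scores.argmax()
    return candidates[best, 0]   # execute only the first action, then replan
```

At each control step the robot executes only the returned first action, observes the new state, and calls the routine again, which is the replanning behaviour described above.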
The model has also been evaluated on benchmarks such as Something-Something v2, Epic-Kitchens-100, and Perception Test. When used with lightweight readouts, it performs competitively on tasks related to motion recognition and future action prediction.
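In this evaluation setup the backbone stays frozen and only a small task-specific head is trained on its features. The sketch below uses a plain linear probe as an illustrative stand-in; the placeholder backbone, dimensions, and training details are assumptions, not the published evaluation protocol.

```python
import torch
import torch.nn as nn

# Frozen embeddings stand in for V-JEPA 2 features; shapes are illustrative.
EMB_DIM, NUM_CLASSES = 1024, 174              # 174 classes as in Something-Something v2

backbone = nn.Linear(3 * 16 * 224, EMB_DIM)   # placeholder for the frozen video encoder
for p in backbone.parameters():
    p.requires_grad = False

readout = nn.Linear(EMB_DIM, NUM_CLASSES)     # the only trainable part
optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

clips = torch.randn(32, 3 * 16 * 224)         # dummy batch of flattened clips
labels = torch.randint(0, NUM_CLASSES, (32,))

with torch.no_grad():
    features = backbone(clips)                # features come from the frozen model
loss = criterion(readout(features), labels)
loss.backward()
optimizer.step()
```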
Meta is also releasing three new benchmarks focused on physical reasoning from video: IntPhys 2, which tests for recognition of physically implausible events; MVPBench, which assesses video question answering using minimal-change video pairs; and CausalVQA, which focuses on cause-effect reasoning and planning.
David Eberle, CEO of Typewise, noted:
The ability to anticipate and adapt to dynamic situations is exactly what is needed to make AI agents more context-aware in real-world customer interactions, too, not just in robotics.
Model weights, code, and datasets are available via GitHub and Hugging Face. A leaderboard has been launched for community benchmarking.
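Assuming the checkpoints follow the standard Hugging Face Hub layout, downloading them could look like the snippet below; the repository ID is a placeholder rather than a confirmed path, and the GitHub repository should be consulted for the actual loading code.

```python
from huggingface_hub import snapshot_download

# Placeholder repository ID; check Meta's Hugging Face organisation for the
# actual V-JEPA 2 checkpoint repositories.
local_dir = snapshot_download(repo_id="facebook/vjepa2-example")
print("Checkpoint files downloaded to:", local_dir)
```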