Google DeepMind researchers introduced SIMA 2 (Scalable Instructable Multiworld Agent), a generalist agent built on the Gemini foundation model that can understand and act across multiple 3D virtual game environments. The agent marks a departure from its predecessor SIMA 1 by moving beyond simple command execution to “reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images.” Where the first version required step-by-step direction, SIMA 2 can formulate multi-step plans and discuss strategy with users.
The researchers report the agent “substantially closes the gap with human performance” across their test portfolio of games while demonstrating what they describe as “robust generalization to previously unseen environments.” The system retains the underlying Gemini model’s reasoning capabilities and can interface with more advanced Gemini variants for additional functionality.
Source: Google DeepMind

SIMA 2 Self-Improvement
The agent employs a self-improvement cycle in which Gemini supplies an initial task and then provides an estimated reward for SIMA 2’s attempt at it. The system adds these scored trajectories to a bank of self-generated experience, which it then uses as training data in subsequent iterations. According to the researchers, this process allows the agent to “improve on previously failed tasks entirely independently of human-generated demonstrations and intervention.”
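A minimal sketch of how such a loop could be wired together is shown below; the `propose_task`, `estimate_reward`, and `ExperienceBank` names, as well as the reward threshold, are illustrative assumptions rather than DeepMind’s actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    task: str
    observations: List        # frames observed while attempting the task
    actions: List             # keyboard-and-mouse actions taken
    estimated_reward: float   # reward estimate supplied by the Gemini judge (assumed scale 0-1)

@dataclass
class ExperienceBank:
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def training_batch(self, min_reward: float = 0.5) -> List[Episode]:
        # Keep only trajectories judged successful enough to learn from.
        return [e for e in self.episodes if e.estimated_reward >= min_reward]

def self_improvement_iteration(agent, gemini, env, bank: ExperienceBank, num_tasks: int = 100):
    """One cycle: Gemini proposes tasks and scores the agent's attempts; the
    scored trajectories are banked and used to finetune the agent for the next
    iteration. `agent`, `gemini`, and `env` are stand-ins for the real components."""
    for _ in range(num_tasks):
        task = gemini.propose_task(env)                       # Gemini supplies the initial task
        observations, actions = agent.attempt(env, task)      # SIMA 2 acts in the environment
        reward = gemini.estimate_reward(task, observations)   # Gemini estimates a reward for the attempt
        bank.add(Episode(task, observations, actions, reward))
    agent.finetune(bank.training_batch())                     # train on self-generated experience
    return agent
```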
The researchers tested SIMA 2’s generalization capabilities by evaluating performance in entirely held-out environments where the agent encounters new visuals, menus, and game mechanics.
The researchers also performed qualitative assessments in The Gunk, a story-driven action-adventure platformer centered on planetary cleanup using a handheld suction tool, and in Genie 3 environments. Genie 3 is a generative world model that creates photorealistic scenes conditioned on text descriptions or initial frames. These newly generated environments do not appear in any training dataset, allowing the team to test whether SIMA 2 can apply Gemini’s world knowledge beyond video game worlds to photorealistic settings.
The SIMA 2 architecture uses a Gemini Flash-Lite model trained on a mixture of gameplay and Gemini pretraining data. The researchers state this mixture was “crucial to maintain the original capabilities of the base model, such as vision understanding, dialogue, reasoning, and promptability.” The training process begins from a pretrained Gemini Flash-Lite checkpoint and applies supervised finetuning using the mixed dataset, training the model to produce keyboard-and-mouse action responses when prompted with image frames and instructions.
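As a rough illustration of that finetuning setup, the sketch below formats behavioral-cloning examples and mixes them with general pretraining data; the field names, the 0.7 mixing fraction, and the toy data are assumptions for illustration, not details disclosed by the researchers:

```python
import random

def make_example(frames, instruction, actions):
    """One supervised example: image frames plus a language instruction as the
    prompt, and the keyboard-and-mouse actions to produce as the target."""
    return {
        "prompt": {"frames": frames, "instruction": instruction},
        "target": actions,  # e.g. [{"key": "w", "hold_ms": 400}, {"mouse_dx": 15, "mouse_dy": -2}]
    }

def build_mixture(gameplay, pretraining, gameplay_fraction=0.7):
    """Interleave gameplay demonstrations with general pretraining data so the
    finetuned checkpoint keeps the base model's vision, dialogue, reasoning,
    and promptability. The 0.7 split is a placeholder, not a published figure."""
    n_pretrain = int(len(gameplay) * (1 - gameplay_fraction) / gameplay_fraction)
    mixture = list(gameplay) + random.sample(pretraining, min(n_pretrain, len(pretraining)))
    random.shuffle(mixture)
    return mixture

# Toy usage: real data would be gameplay traces and Gemini pretraining examples.
gameplay = [make_example(["frame_000.png", "frame_001.png"],
                         "pick up the axe and chop the tree",
                         [{"key": "e"}, {"mouse_dx": 20, "mouse_dy": 0}])]
pretraining = [{"prompt": {"text": "Describe this image."}, "target": "A forest clearing."},
               {"prompt": {"text": "Summarize the rules of chess."}, "target": "..."}]
print(len(build_mixture(gameplay, pretraining)))  # mixed dataset ready for supervised finetuning
```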
Google DeepMind researchers position SIMA 2 as a step beyond simple instruction following, creating what they describe as a more capable and collaborative embodied agent that can reason, converse, and perform goal-directed actions across 3D virtual worlds. The system demonstrates generalization that “extends beyond game worlds to novel photorealistic environments generated by Genie 3” and can improve in new environments based on self-generated experience.
Technical community members discussing the research on Reddit noted potential applications beyond gaming. One commenter observed:
This will help train robots in realistic worlds, in a very cheap and safe manner. Should help boost research training for AI.
Another commenter highlighted the technical architecture, noting:
the team is using Genie 3 to create worlds and SIMA 2 to recursively self-improve in that world.
The team acknowledges current limitations, noting SIMA 2 “still faces challenges with very long-horizon, complex tasks that require extensive, multi-step reasoning and goal verification.” The agent operates with a limited context window to maintain low-latency interaction, and the researchers identify precise keyboard-and-mouse control execution and robust visual understanding of complex 3D scenes as ongoing challenges.
DeepMind released SIMA 2 as a limited research preview with early access provided to a small group of academics and game developers. The company worked with its Responsible Development and Innovation Team throughout development, particularly regarding the agent’s self-improvement capabilities. The researchers suggest the skills SIMA 2 acquired, including navigation, tool use, and collaborative task execution, could serve as building blocks for physically embodied AI systems in robotics applications.
