Google DeepMind has recently released the Gemini 2.5 Computer Use model, a specialized variant of Gemini 2.5 Pro designed to let AI agents interact directly with graphical user interfaces. The model allows developers to build agents that can click, type, scroll, and manipulate interactive elements on web pages.
The Computer Use model brings Gemini’s multimodal reasoning and visual understanding to environments like browsers and mobile apps, where AI must perceive the on-screen context and act accordingly. Early evaluations show the model performing strongly on several interface control benchmarks, including Online-Mind2Web, WebVoyager, and AndroidWorld. In tests reported by DeepMind and Browserbase, it reached around 70% accuracy on the Online-Mind2Web benchmark, with response times shorter than those of other publicly evaluated systems.
In practical terms, the model operates in a loop via a new computer_use tool exposed through the Gemini API. Developers provide the model with a screenshot of the environment, a task description, and a record of previous actions. The model then returns structured function calls representing actions such as “click,” “type,” or “scroll.” The client executes these actions, captures a new screenshot, and feeds it back to the model — repeating the cycle until the task is complete.
While currently optimized for browser environments, the Computer Use model also shows strong promise for mobile UI control, signaling potential expansion to desktop operating systems in the future.
The launch has sparked critical discussion among developers. Wissam Benhaddad, a senior data science consultant, noted that while the approach is promising, practical deployment remains challenging:
This type of solution is promising, but I do not think it is production-ready yet. Current implementations are extremely slow and can often be replaced by standard API calls or direct app integrations. In my view, reasoning should not happen at the LLM level but rather within a latent space where information can move in a more compressed and efficient way — which is what Deep Learning excels at. I hope to see this kind of product evolve in that direction.
DeepMind emphasizes that safety guardrails are central to the system’s design. The Gemini 2.5 Computer Use model integrates protections against malicious prompts, unsafe actions, and scams within web environments. Each model action is assessed through a per-step safety service before execution, and developers can require user confirmation for sensitive operations such as purchases or system-level interactions.
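A per-step gate of this kind might look like the sketch below. The action categories, the three verdicts, and the confirmation callback are assumptions for illustration; DeepMind's actual safety service is not public, and this only shows where such a check sits relative to execution.

```python
# Hypothetical per-step safety gate. The categories and policy here are
# illustrative assumptions, not DeepMind's actual safety service.

# Actions that should be surfaced to the user before execution.
SENSITIVE_ACTIONS = {"purchase", "install", "delete_account"}
# Actions that are never executed.
BLOCKED_ACTIONS = {"bypass_captcha"}


def safety_check(action_name: str) -> str:
    """Classify an action before execution: 'allow', 'confirm', or 'block'."""
    if action_name in BLOCKED_ACTIONS:
        return "block"
    if action_name in SENSITIVE_ACTIONS:
        return "confirm"
    return "allow"


def execute_with_guardrails(action_name: str, user_confirms) -> bool:
    """Run the safety check before each step; return True if executed.

    `user_confirms` is a callback (action_name -> bool) standing in for
    a real confirmation prompt shown to the end user.
    """
    verdict = safety_check(action_name)
    if verdict == "block":
        return False
    if verdict == "confirm" and not user_confirms(action_name):
        return False
    # In a real agent, the client would execute the UI action here.
    return True
```

Keeping the gate in the client loop, rather than inside the model, is what lets developers retain the oversight the system card describes: every action passes through it before it touches the environment.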
The model’s system card outlines how these safety features mitigate potential risks while allowing developers to maintain full oversight. DeepMind advises thorough testing before deploying agents to production.
Gemini 2.5 Computer Use is available now in preview via the Gemini API in Google AI Studio and Vertex AI.