Computer Use is now natively integrated into a Gemini Flash model, in this case Gemini 3.5 Flash. This capability was previously available only through a dedicated model. With native integration, developers have a single, unified tool for building sophisticated agents.
With Computer Use in Gemini 3.5 Flash
With this update, Gemini 3.5 Flash can analyze the screen, understand visual context, and generate concrete actions like mouse clicks or keyboard inputs.
The model can thus navigate websites or fill out forms independently, deciding on the best actions to take. More generally, it is about seeing, reasoning, and acting in web, desktop, and mobile browsing environments.
On the OSWorld benchmark, which evaluates such skills, Gemini 3.5 Flash achieves a score of 78.4 and approaches the leaders in the field for interaction tasks.
How does this interaction work?
The process is based on a continuous interaction loop. The AI agent analyzes a screenshot of the GUI, whether it's a browser, desktop app, or mobile app. Based on the given goal, the model determines the next action to perform (click, scroll, etc.) and returns it for execution.
The developer's application must then perform that action, capture the new screen state, and send it back to Gemini. This cycle continues until the task is completed, allowing automation without human intervention and without a dedicated API for the target application.
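The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the Gemini API: the names `Action`, `run_agent_loop`, `propose_action`, `execute`, and `capture_screen` are hypothetical, and the model is replaced by a scripted stand-in so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    # Hypothetical action shape: "click", "type", "scroll", or "done".
    kind: str
    payload: dict = field(default_factory=dict)


def run_agent_loop(
    goal: str,
    propose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    capture_screen: Callable[[], bytes],
    max_steps: int = 20,
) -> List[Action]:
    """Repeat screenshot -> proposed action -> execution until the model reports it is done."""
    history: List[Action] = []
    screenshot = capture_screen()
    for _ in range(max_steps):
        # In a real agent this call would go to the model (assumption).
        action = propose_action(goal, screenshot, history)
        if action.kind == "done":
            break
        execute(action)                # the developer's application performs the action
        history.append(action)
        screenshot = capture_screen()  # the new screen state goes back to the model
    return history


# Scripted stand-in for the model: returns a fixed sequence of actions.
def scripted_model(goal: str, screenshot: bytes, history: List[Action]) -> Action:
    script = [
        Action("click", {"x": 120, "y": 48}),
        Action("type", {"text": "hello"}),
        Action("done"),
    ]
    return script[len(history)]


executed: List[Action] = []
steps = run_agent_loop(
    goal="fill in the search box",
    propose_action=scripted_model,
    execute=executed.append,
    capture_screen=lambda: b"fake-png-bytes",
)
print([a.kind for a in steps])  # prints ['click', 'type']
```

The `max_steps` cap is a design choice of this sketch: it keeps a confused agent from looping forever when the model never signals completion.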
Entrusting so much power to an AI?
Google subjected the model to adversarial training in order to protect it against prompt injections, an attack technique that aims to divert the AI from its original objective.
Two optional safeguards are offered to companies: requiring explicit user confirmation for sensitive or irreversible actions, and automatically terminating the task if an indirect attack is detected.
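As a rough illustration of how these two safeguards might look on the application side, here is a small Python sketch. The article does not describe how the safeguards are actually exposed, so everything here is an assumption: the `sensitive` and `injection_suspected` flags, the `ProposedAction` type, and the `guard` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    sensitive: bool = False            # hypothetical flag: sensitive or irreversible step
    injection_suspected: bool = False  # hypothetical flag: indirect attack detected


class TaskAborted(Exception):
    """Raised when the task must stop because an attack is suspected."""


def guard(action: ProposedAction, confirm: Callable[[str], bool]) -> bool:
    """Return True if the action may run, False if the user declined it."""
    if action.injection_suspected:
        # Safeguard 2: terminate the whole task instead of continuing.
        raise TaskAborted(f"stopped before: {action.description}")
    if action.sensitive:
        # Safeguard 1: ask the user before a sensitive or irreversible action.
        return confirm(action.description)
    return True


# Usage with a confirmation callback that always declines.
always_decline = lambda description: False
print(guard(ProposedAction("scroll down"), always_decline))                     # prints True
print(guard(ProposedAction("submit payment", sensitive=True), always_decline))  # prints False
```

In an interactive application, `confirm` would typically show a dialog to the user; a callback keeps the sketch testable without one.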
The Computer Use capabilities of Gemini 3.5 Flash can be previewed in a demo environment on Browserbase.
