We already have OpenAI's agent. It is called Operator, and it is a system capable of seeing our screen and autonomously performing actions in the browser based on our requests. It is something we had already seen with Anthropic's 'Computer Use' or DeepMind's Mariner, but here the company led by Sam Altman has its own special ingredient.
Computer-Using Agent (CUA). Operator uses a model called Computer-Using Agent (CUA) that is based on GPT-4o. CUA interprets screenshots and interacts with websites through standard browser controls: moving the cursor, clicking, and typing.
How CUA works. As explained in the OpenAI documentation, this system processes the "raw pixels" of the screenshots it takes and uses a virtual mouse and keyboard to complete its actions. Once it has the screenshot, it "reasons" through an inner chain of thought that takes its past actions into account before deciding what to do next.
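That screenshot-reason-act cycle can be sketched in a few lines. This is a minimal illustration, not OpenAI's actual implementation: every function and action name here is a hypothetical stand-in for the real model and browser plumbing.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)  # past actions, fed back into reasoning

def take_screenshot() -> bytes:
    """Stand-in for grabbing the raw pixels of the remote browser."""
    return b"raw-pixels"

def reason(state: AgentState, screenshot: bytes) -> str:
    """Stand-in for the model: picks the next action from the goal,
    the current screenshot, and the actions taken so far.
    Here it just replays a fixed, made-up plan."""
    plan = ["click:search-box", "type:query", "click:search-button", "done"]
    step = len(state.history)
    return plan[min(step, len(plan) - 1)]

def execute(action: str) -> None:
    """Stand-in for driving the virtual mouse and keyboard."""
    pass

def run_agent(goal: str, max_steps: int = 10) -> list:
    state = AgentState(goal)
    for _ in range(max_steps):
        shot = take_screenshot()      # 1. observe: capture raw pixels
        action = reason(state, shot)  # 2. reason: pixels + past actions -> next action
        if action == "done":
            break
        execute(action)               # 3. act: click or type in the browser
        state.history.append(action)
    return state.history
```

The key point the loop captures is that each decision is conditioned on both the latest screenshot and the accumulated action history, which is what the documentation's "chain of thought" refers to.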
Promising performance. Several benchmarks now exist to evaluate the capabilities of these agentic models. According to OpenAI's internal tests, CUA achieves 38.1% on OSWorld (general computer use), compared with 22% for models such as Anthropic's. Humans, however, average 72.4%, which makes it clear that these systems still have plenty of room for improvement. In browser use, Operator also scores high on the WebArena and WebVoyager benchmarks: 58.1% and 87% respectively, compared with 36.2% and 56% for its competitors.
What about those screenshots Operator collects? Operator continuously takes screenshots to "see" the browser interface it interacts with. That browser does not run on our PC, but in a remote browser on OpenAI's servers. User data, including these screenshots, is used according to OpenAI's privacy policy. That is: it can be used to detect fraudulent activity and to improve the service. This implies that our data can be used to train and improve the model, although we can disable that option in Operator's settings. The user can also control how long this data is stored in Operator; by default, it is kept until the user decides to delete it.
An agent that asks for help (and confirmation) when it needs it. As we have seen in other agents such as Anthropic's 'Computer Use', Operator does not act recklessly. If it runs into an obstacle, such as a CAPTCHA or a request to enter a username and password on a website, it asks the user to take control, and it also asks for the user's final confirmation if, for example, it has to validate a reservation or a purchase it has found. The user can also take control at any time.
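The behavior described above amounts to a simple guard in front of every action. A minimal sketch, assuming hypothetical action names and trigger categories (OpenAI has not published its actual rules):

```python
# Situations where the agent hands the browser back to the user entirely
HANDOFF_TRIGGERS = {"captcha", "login"}

# Actions executed only after explicit user approval
CONFIRM_BEFORE = {"purchase", "reservation"}

def next_step(action: str, user_confirmed: bool = False) -> str:
    """Decide whether to execute an action, pause for confirmation,
    or hand control back to the user."""
    if action in HANDOFF_TRIGGERS:
        return "hand control to user"
    if action in CONFIRM_BEFORE and not user_confirmed:
        return "ask user for confirmation"
    return "execute"
```

The design choice is that irreversible or credential-sensitive steps never execute silently: they either block on confirmation or return control to the person.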
Do not let go of the steering wheel. This is reminiscent of assisted driving systems such as Tesla's FSD. It is true that it can take us from one place to another once we enter the destination address, but it is important to keep paying attention and keep our hands on the wheel in case something unforeseen happens. Something similar applies to Operator and the rest of the agents of this type.
There are things it cannot do. For now, Operator cannot complete specialized tasks such as managing complex calendar systems or interacting with highly customized or non-standard websites. It will also refuse to perform certain high-risk tasks that could cause harm: for example, sending emails, making financial transactions, or deleting calendar events. Its capabilities will no doubt grow, but they will do so gradually, always keeping the possibility of error as low as possible.
Image | OpenAI
In WorldOfSoftware | Generative AI seems stagnant. Big tech believes it has an ace up its sleeve: "agents" that do things for us