OpenAI has announced the upcoming release of two new families of frontier reasoning models, which it calls o3 and o3-mini, including smaller models adjusted for more specific tasks. Of course, the company is not going to launch them immediately, since it is still testing them, and it even admits that the results they achieve may change once their training is complete. What OpenAI has done is open applications from the research community to test both systems ahead of their general launch, for which there is still no date.
For now, a preview of o3-mini is opening, and the o3 preview will arrive later, although no date has yet been set for it. According to the company's plans, the launch of o3-mini will take place at the end of next January, followed some time after by the launch of o3.
Barely three months have passed since the company launched its o1 models, and it has skipped the logical next name in the sequence: it will not launch an o2 and will go directly to o3. The company claims it has done so to avoid confusion with the telecommunications company O2, although it has likely also done so to avoid conflicts over trademark issues.
When we refer to reasoning models, we are talking about models capable of breaking the instructions they receive into smaller tasks, with the aim of producing more robust results. Models of this type also usually show the path they follow to reach an answer, instead of presenting a final solution without further explanation. And according to OpenAI, o3 surpasses the performance records previously set by o1.
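To make the idea concrete, here is a toy Python sketch of step-by-step reasoning with a visible trace. It is purely an illustration of decomposing a problem and showing the path taken; OpenAI has not published how o3 works internally.

```python
# Toy illustration of "decompose the task and show your work"; this is
# not how o3 actually works internally, which OpenAI has not published.

def reason(question: str) -> str:
    trace: list[str] = []

    def step(label: str, value):
        trace.append(f"{label}: {value}")  # record each intermediate step
        return value

    # Break "what is 15% of 240?" into smaller subtasks:
    rate = step("parse the percentage", 15 / 100)
    base = step("parse the base amount", 240)
    answer = step("multiply rate by base", rate * base)

    # Return the path followed, not just the final answer
    return "\n".join(trace) + f"\nanswer: {answer}"

print(reason("what is 15% of 240?"))
```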
Unlike most AI models, reasoning models like these check their own results, which helps them avoid some of the pitfalls that lead to mistakes. Of course, this error-checking process introduces some latency, which is why they take a little longer than so-called “non-reasoning” models to find solutions to the questions posed. In return, their answers are more reliable in fields such as physics, science, and mathematics.
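The self-checking behavior can be pictured as a generate-and-verify loop. The sketch below uses toy stand-ins for the generator and the verifier, since OpenAI has not described o3's actual verification mechanism; it only shows why checking costs extra time but filters out wrong answers.

```python
import random

def generate(question: str) -> int:
    # Stand-in for a model proposing an answer (a noisy guess at 6 * 7)
    return random.choice([41, 42, 43])

def verify(question: str, answer: int) -> bool:
    # Stand-in for the model checking its own work
    return answer == 6 * 7

def answer_with_checking(question: str, max_attempts: int = 5) -> int:
    candidate = generate(question)
    for _ in range(max_attempts):        # each retry adds latency...
        if verify(question, candidate):  # ...but catches wrong answers
            return candidate
        candidate = generate(question)
    return candidate  # best effort after exhausting the attempt budget

print(answer_with_checking("What is 6 * 7?"))
```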
o3, in particular, has been trained through reinforcement learning to “think” before answering, using what OpenAI describes as a private chain of thought. The model can thus reason about a task and plan ahead, performing a series of actions over a shorter or longer period of time that help it work toward the solution.
Unlike the o1 models, the o3 models let you adjust how long they reason: they can be configured to use a low, medium, or high amount of compute, or “thinking” time. As one would expect, the longer the reasoning time, the better they perform on the assigned task. That said, it must be kept in mind that these o3 models, reasoning models though they are, are not free of errors either: the reasoning component can reduce errors and hallucinations, but it does not eliminate them.
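As a sketch of how such a setting might look to developers, the snippet below uses the reasoning_effort parameter that the OpenAI Python SDK exposes for o-series models. Note that the o3 API was not yet public at the time of the announcement, so the model name and exact interface here are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumption: o-series models accept a reasoning_effort setting
# ("low" | "medium" | "high") that trades latency for answer quality.
response = client.chat.completions.create(
    model="o3-mini",          # not yet available at announcement time
    reasoning_effort="high",  # more "thinking" time, better results
    messages=[
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
)
print(response.choices[0].message.content)
```

Higher effort means longer waits and more compute, which is exactly the trade-off described above.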
The results of OpenAI o3 improve on those of its predecessor in coding tests (SWE-Bench Verified) by 22.8 percentage points, and it beats the score of OpenAI's chief scientist in competitive programming (Codeforces). In fact, the model obtained practically perfect results in one of the toughest mathematics competitions, AIME 2024: it missed only one question, for a score of 96.7%.
It also achieved 87.7% on GPQA Diamond, a test bed of expert-level science problems. On EpochAI's FrontierMath, the kind of complex math and reasoning problems where AI typically runs into trouble, o3 solved 25.2% of the problems, when no other model had exceeded 2%. OpenAI also states that o3 outperforms its other reasoning models on code benchmarks.
On the other hand, OpenAI has announced that it will delve deeper into research on deliberative alignment, which requires the AI model to work through safety decisions step by step. Instead of just giving the AI model a yes-or-no rule, this paradigm requires the model to actively reason about whether a user's request complies with OpenAI's safety policies. According to the company, when tested on o1 this approach produced better compliance with its safety guidelines than previous models such as GPT-4.
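Deliberative alignment is a training technique, but its runtime effect can be approximated at the prompt level: give the model the written policy and ask it to reason over the rules before responding. The sketch below illustrates the paradigm with a made-up two-rule policy; it is not OpenAI's implementation.

```python
# Prompt-level approximation of deliberative alignment: instead of a flat
# yes/no filter, the model is asked to reason over the written policy.
# The policy text and helper here are invented for illustration.

SAFETY_POLICY = """\
1. Refuse requests for instructions that enable physical harm.
2. Allow general educational questions, even on sensitive topics.
"""

def deliberative_prompt(user_request: str) -> str:
    return (
        "Before answering, reason step by step about whether the request "
        "complies with each rule of the policy below, citing the rule "
        "numbers you rely on. Then answer or refuse accordingly.\n\n"
        f"Policy:\n{SAFETY_POLICY}\n"
        f"Request: {user_request}"
    )

print(deliberative_prompt("How do vaccines work?"))
```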