Artificial intelligence (AI) is not just evolving: it is taking off. In just two and a half years we have gone from GPT-3.5 to GPT-4o, and anyone who has tried both knows that the difference in the conversation experience is enormous. GPT-3.5 marked a before and after by inaugurating the ChatGPT era, but today hardly anyone would go back to it when more advanced models are available.
Now, what does it mean for a model to be more advanced? The answer is complex. We talk about larger context windows (that is, the ability to read and process more information at once), more elaborate results and, in theory, fewer errors. But one point remains thorny: hallucinations. And on that front, progress is not always in the right direction.
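To make the "context window" idea concrete, here is a minimal sketch of how it works in practice: the window is simply a token budget, and you can estimate whether a document fits before sending it to a model. This assumes the tiktoken library; the 128,000-token limit and the function name are illustrative, not tied to any specific model.

```python
# Minimal sketch: a context window is a token budget.
# Assumes the `tiktoken` library is installed; the 128,000-token limit
# below is illustrative, not a claim about any particular model.
import tiktoken

def fits_in_context(text: str, max_tokens: int = 128_000) -> bool:
    """Return True if `text` fits inside a hypothetical context window."""
    enc = tiktoken.get_encoding("cl100k_base")  # a tokenizer used by recent OpenAI models
    n_tokens = len(enc.encode(text))
    print(f"Document uses {n_tokens} tokens out of a budget of {max_tokens}")
    return n_tokens <= max_tokens

if __name__ == "__main__":
    sample = "A long report pasted into the chat... " * 1000
    print(fits_in_context(sample))
```

A larger window simply raises that budget, which is why newer models can ingest entire reports or codebases in a single prompt.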
What are hallucinations? In AI, hallucinating means making things up: answers that sound plausible, even convincing, but that are false. The model does not lie on purpose; it simply generates text based on patterns. If it does not have enough data, it fills in the gaps. And that can go unnoticed. That is the risk.
o3 and o4-mini: more reasoning, more errors. In September last year the so-called reasoning models arrived. They represented an important leap: they introduced a kind of chain of thought that improved their performance on complex tasks. But they were not perfect. o1-pro was more expensive than o3-mini, and not always more effective. Even so, this whole line came with a promise: fewer hallucinations.
The problem is that, according to OpenAI's own data, that is not happening. TechCrunch cites a technical report from the company acknowledging that o3 and o4-mini hallucinate more than their predecessors. Literally. In internal tests with PersonQA, o3 hallucinated in 33% of its answers, roughly double the rate of o1 and o3-mini. o4-mini did even worse: 48%.
Other analyses, such as those from the independent laboratory Transluce, show that o3 even invented actions: it claimed to have executed code on a MacBook Pro outside ChatGPT and then copied the results back. Something it simply cannot do.
A challenge that is still pending. The idea of models that do not hallucinate sounds fantastic: it would be the definitive step toward fully trusting their answers. In the meantime, we have to live with the problem, especially when we use AI for delicate tasks: summarizing documents, looking up data, preparing reports. In those cases, everything should be double-checked.
Because serious mistakes have already happened. The best-known was that of a lawyer who submitted documents generated by ChatGPT to a judge. They were convincing, yes, but also fictional: the model invented several legal cases. AI will keep advancing, but critical judgment, for now, remains our job.
Images | WorldOfSoftware with ChatGPT | OpenAI
In WorldOfSoftware | Some users are using OpenAI o3 and o4-mini to find out where photos were taken: it is a nightmare for privacy
In WorldOfSoftware | If you’ve ever been afraid of a robot chasing you, China has organized a half marathon to put your mind at ease