AI providers such as OpenAI, Anthropic, and Google promise that their models can ingest and process large documents in a very short time. The goal is not only to make employees more productive, but also to partially replace them. However, a recent Microsoft study comes to a sobering conclusion: over long workflows, even the most powerful models make an increasing number of errors on complex documents.
Share of AI-related layoffs is increasing
According to the consulting firm Challenger, Gray & Christmas, a quarter of all layoffs can now be attributed to AI. In an earlier evaluation based on data from November 2025, the share was still less than one percent. More and more companies are openly admitting that they are cutting jobs in favor of AI. Cloudflare also recently announced that it would lay off 20 percent of its workforce. “The move is not a cost-cutting measure or an assessment of individual performance. It is about Cloudflare defining how a world-class, high-growth company operates and creates value in the age of agent-based AI,” the company said in a post about the layoffs.
The Microsoft researchers, however, found that large language models increasingly corrupt documents over long workflows; in the worst case, data is lost and the models hallucinate. To simulate long workflows across 52 specialist areas, the team developed the Delegate-25 tool and used it to test 19 language models, including Gemini 3.1 Pro from Google, Claude Opus 4.6 from Anthropic, and GPT-5.4 from OpenAI. The result: on average, 25 percent of document content was corrupted by the top models mentioned; for other models, the figure was more than half.
How reliable are AI tools really?
“Delegation requires trust,” write the three Microsoft researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville. “Our analysis shows that current language models are unreliable delegates. They cause rare but serious errors that corrupt documents unnoticed and accumulate over long interaction sequences.” The error rate depended on the domain: the models performed better at programming than in other applications. The researchers defined an accuracy of 98 percent after 20 interactions as the minimum standard for deployment in a given area. Most models reached this bar in only a single area, namely Python programming. Gemini 3.1 Pro delivered the best performance, meeting the standard in eleven of the 52 areas.
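The 98-percent bar is stricter than it sounds, because errors compound across interactions. As a rough sketch (this independence model is an assumption for illustration, not the study's own methodology): if each interaction independently preserves the document with probability p, cumulative accuracy after n interactions is about p**n, so clearing the study's threshold requires a very high per-step reliability.

```python
def per_step_reliability(target: float, steps: int) -> float:
    """Per-interaction success probability needed so that cumulative
    accuracy target**1 = p**steps still holds after `steps` interactions,
    assuming errors are independent across interactions."""
    return target ** (1 / steps)

# The study's minimum standard: 98% accuracy after 20 interactions.
p = per_step_reliability(0.98, 20)
print(f"required per-interaction reliability: {p:.4%}")
# roughly 99.9% per step, i.e. about one error per thousand interactions
```

Read the other way around, even a model that is 99 percent reliable per interaction would fall to about 0.99**20 ≈ 82 percent after 20 steps, well below the researchers' bar.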
The paper is still a preprint and has yet to undergo peer review. Nevertheless, the researchers are unequivocal: “Large language models are not yet ready for delegated workflows in the vast majority of areas. In 80 percent of our simulated conditions, the models severely corrupted documents,” the team writes. Strikingly, the errors did not stem from a steady trickle of small inaccuracies, but from sudden, massive data losses. “More powerful models do not avoid small errors better; rather, they delay critical failures and suffer them in fewer interactions,” the study says. Progress is visible nonetheless: comparing GPT-4o and GPT-5.4, accuracy rose from 14.7 to 71.5 percent.
Researchers complain about unreliability
An Asana study likewise shows that users are skeptical of the technology: although 77 percent of employees already use AI agents, almost two thirds consider the systems unreliable. “Crucially, users who delegate work may lack the expertise or time to review changes implemented by the model and need to trust that it will not cause undetected errors such as hallucinations or deletion,” the researchers said. For the time being, then, close human oversight of AI systems remains necessary.