Up To 25 Percent Distorted: Microsoft Researchers Warn Against Letting AI Process Large Documents

AI providers such as OpenAI, Anthropic and Google promise that their models can ingest and process large documents in a very short time. The aim is not only to make employees more productive but also to partially replace them. However, a recent Microsoft study comes to a sobering conclusion: even the most powerful models make more and more errors with complex documents as a workflow progresses.

Share of AI-related layoffs is increasing

According to the consulting firm Challenger, Gray & Christmas, a quarter of all layoffs can now be attributed to AI. In an earlier evaluation based on data from November 2025, the share was still below one percent. More and more companies openly admit that they are cutting jobs because of AI. Cloudflare, for example, recently announced that it would lay off 20 percent of its workforce. “The move is not a cost-cutting measure or an assessment of individual performance. It is about Cloudflare defining how a world-class, high-growth company operates and creates value in the age of agent-based AI,” it said in a post about the layoffs.

However, the Microsoft researchers found that large language models increasingly distort documents over long workflows. In the worst case, data is lost and the models hallucinate. To simulate long workflows in 52 specialist areas, the team developed the Delegate-25 tool. They used it to test 19 language models, including Gemini 3.1 Pro from Google, Claude Opus 4.6 from Anthropic and GPT-5.4 from OpenAI. The result: the top models mentioned corrupted an average of 25 percent of the document content. For other models, the figure was more than half.

How reliable are AI tools really?

“Delegation requires trust,” say the three Microsoft researchers Philippe Laban, Tobias Schnabel and Jennifer Neville. “Our analysis shows that current language models are unreliable delegates. They cause rare but serious errors that corrupt documents unnoticed and accumulate over long interaction times.” The error rate depended on the specialist area: the models performed better at programming tasks than in other applications. The researchers defined an accuracy of 98 percent after 20 interactions as the minimum standard for use in a specific area. Most models reached this value in only a single area, namely Python programming. Gemini 3.1 Pro achieved the best performance, meeting the standard in eleven out of 52 areas.
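To make the minimum standard concrete, the following Python sketch applies the 98 percent threshold to a set of per-area accuracy values measured after 20 interactions. The area names and scores are invented for illustration and do not come from the study, and the function `ready_areas` is a hypothetical helper, not part of the researchers' tool.

```python
# Minimum standard described in the article: 98 percent accuracy
# after 20 interactions in a given specialist area.
READY_THRESHOLD = 0.98
INTERACTIONS = 20


def ready_areas(final_accuracy):
    """Return the areas whose accuracy after the last interaction meets the threshold.

    final_accuracy maps an area name to the accuracy (0.0 to 1.0)
    measured after INTERACTIONS rounds of delegated editing.
    """
    return sorted(
        area for area, score in final_accuracy.items() if score >= READY_THRESHOLD
    )


# Hypothetical scores for one model; these numbers are NOT from the study.
example_scores = {
    "python": 0.985,
    "accounting": 0.71,
    "recipes": 0.979,
}

passed = ready_areas(example_scores)
print(f"Meets the standard in {len(passed)} of {len(example_scores)} areas: {passed}")
```

In this invented example only the Python area clears the bar. An area at 97.9 percent still fails, which shows how strict the threshold is.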

The study is still a preliminary version that has yet to be reviewed. Nevertheless, the researchers do not mince words. “Large language models are not yet ready for delegated workflows in the vast majority of areas. In 80 percent of our simulated conditions, the models severely distorted documents,” said the research team. What is striking is that the errors were not caused by constant small inaccuracies, but by sudden, massive data losses. “More powerful models do not avoid small errors better, but rather delay critical failures and experience them in fewer interactions,” the study says. However, progress can be seen: from GPT-4o to GPT-5.4, accuracy increased from 14.7 to 71.5 percent.

Researchers complain about unreliability

An Asana study also shows that users are skeptical about the technology: although 77 percent of employees already use AI agents, almost two thirds consider the systems to be unreliable. “Crucially, users who delegate work may lack the expertise or time to review changes implemented by the model and need to trust that it will not cause undetected errors such as hallucinations or deletion,” the Microsoft researchers said. For now, it therefore remains necessary to monitor AI systems closely.
