AI research summaries ‘exaggerate findings’, study warns

News Room · Published 5 August 2025 · Last updated 12:14 PM

AI tools overhype research findings far more often than humans, with a study suggesting the newest bots are the worst offenders – particularly when they are specifically instructed not to exaggerate.

Dutch and British researchers have found that AI summaries of scientific papers are much more likely than the original authors or expert reviewers to “overgeneralise” the results.

The analysis, reported in the journal Royal Society Open Science, suggests that AI summaries – purportedly designed to help spread scientific knowledge by rephrasing it in “easily understandable language” – tend to ignore “uncertainties, limitations and nuances” in the research by “omitting qualifiers” and “oversimplifying” the text.

This is particularly “risky” when applied to medical research, the report warns. “If chatbots produce summaries that overlook qualifiers [about] the generalisability of clinical trial results, practitioners who rely on these chatbots may prescribe unsafe or inappropriate treatments.”

The team analysed almost 5,000 AI summaries of 200 journal abstracts and 100 full articles. Topics ranged from caffeine’s influence on irregular heartbeats and the benefits of bariatric surgery in reducing cancer risk, to the impacts of disinformation and government communications on residents’ behaviour and people’s beliefs about climate change.

Summaries produced by “older” AI apps – such as OpenAI’s GPT-4 and Meta’s Llama 2, both released in 2023 – proved about 2.6 times as likely as the original abstracts to contain generalised conclusions.

The likelihood of generalisation rose to nine times that of the original abstracts in summaries by ChatGPT-4o, which was released last May, and 39 times in synopses by Llama 3.3, which emerged in December.

Instructions to “stay faithful to the source material” and “not introduce any inaccuracies” produced the opposite effect, with the summaries proving about twice as likely to contain generalised conclusions as those generated when bots were simply asked to “provide a summary of the main findings”.

This suggested that generative AI may be vulnerable to “ironic rebound” effects, in which instructions not to think about something – “a pink elephant”, for example – automatically elicit images of the banned subject.

AI apps also appeared prone to failings like “catastrophic forgetting”, where new information dislodged previously acquired knowledge or skills, and “unwarranted confidence”, where “fluency” took precedence over “caution and precision”.

Fine-tuning the bots can exacerbate these problems, the authors speculate. When AI apps are “optimised for helpfulness” they become less inclined to “express uncertainty about questions beyond their parametric knowledge”. A tool that “provides a highly precise but complex answer…may receive lower ratings from human evaluators,” the paper explains.

One summary cited in the paper reinterpreted a finding that a diabetes drug was “better than placebo” as an endorsement of the “effective and safe treatment” option. “Such…generic generalisations could mislead practitioners into using unsafe interventions,” the paper says.  

It offers five strategies to “mitigate the risks” of overgeneralisations in AI summaries. They include using AI firm Anthropic’s “Claude” family of bots, which were found to produce the “most faithful” summaries.

Another recommendation is to lower the bot’s “temperature” setting. Temperature is an adjustable parameter that controls the randomness of the generated text.
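As a minimal sketch of that recommendation, the snippet below lowers the temperature on a summarisation request using the OpenAI Python SDK. The model name, the placeholder abstract and the value of 0.2 are illustrative assumptions, not settings taken from the study; the prompt wording echoes the “provide a summary of the main findings” instruction quoted above.

    # Minimal sketch: requesting a more conservative summary by lowering
    # temperature. Assumes the OpenAI Python SDK with an OPENAI_API_KEY
    # set in the environment; model, abstract and the 0.2 value are
    # illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()

    abstract = "..."  # the journal abstract to be summarised

    response = client.chat.completions.create(
        model="gpt-4o",   # hypothetical model choice
        temperature=0.2,  # lower temperature = less random, more literal output
        messages=[
            {
                "role": "user",
                "content": f"Provide a summary of the main findings:\n\n{abstract}",
            },
        ],
    )

    print(response.choices[0].message.content)

Lowering the temperature narrows the model’s sampling toward its highest-probability tokens, which tends to produce more literal, less embellished text; it does not by itself guarantee a faithful summary.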

Uwe Peters, an assistant professor in theoretical philosophy at Utrecht University and a co-author of the report, said the overgeneralisations “occurred frequently and systematically”.

He said the findings meant there was a risk that even subtle changes to the findings by the AI could “mislead users and amplify misinformation, especially when the outputs appear polished and trustworthy”.

Tech companies should evaluate their models for such tendencies, he added, and share the results openly. For universities, he said, the findings showed an “urgent need for stronger AI literacy” among staff and students.
