AI tools overhype research findings far more often than humans, with a study suggesting the newest bots are the worst offenders – particularly when they are specifically instructed not to exaggerate.
Dutch and British researchers have found that AI summaries of scientific papers are much more likely than the original authors or expert reviewers to “overgeneralise” the results.
The analysis, reported in the journal Royal Society Open Science, suggests that AI summaries – purportedly designed to help spread scientific knowledge by rephrasing it in “easily understandable language” – tend to ignore “uncertainties, limitations and nuances” in the research by “omitting qualifiers” and “oversimplifying” the text.
This is particularly “risky” when applied to medical research, the report warns. “If chatbots produce summaries that overlook qualifiers [about] the generalisability of clinical trial results, practitioners who rely on these chatbots may prescribe unsafe or inappropriate treatments.”
The team analysed almost 5,000 AI summaries of 200 journal abstracts and 100 full articles. Topics ranged from caffeine’s influence on irregular heartbeats and the benefits of bariatric surgery in reducing cancer risk, to the impacts of disinformation and government communications on residents’ behaviour and people’s beliefs about climate change.
Summaries produced by “older” AI apps – such as OpenAI’s GPT-4 and Meta’s Llama 2, both released in 2023 – proved about 2.6 times as likely as the original abstracts to contain generalised conclusions.
The likelihood of generalisation increased to nine times in summaries by ChatGPT-4o, which was released last May, and 39 times in synopses by Llama 3.3, which emerged in December.
Instructions to “stay faithful to the source material” and “not introduce any inaccuracies” had the opposite of the intended effect: the resulting summaries proved about twice as likely to contain generalised conclusions as those generated when bots were simply asked to “provide a summary of the main findings”.
This suggests that generative AI may be vulnerable to “ironic rebound” effects, where an instruction not to think about something – “a pink elephant”, for example – automatically brings the banned subject to mind.

AI apps also appeared prone to failings such as “catastrophic forgetting”, where new information dislodges previously acquired knowledge or skills, and “unwarranted confidence”, where “fluency” takes precedence over “caution and precision”.
Fine-tuning the bots can exacerbate these problems, the authors speculate. When AI apps are “optimised for helpfulness” they become less inclined to “express uncertainty about questions beyond their parametric knowledge”. A tool that “provides a highly precise but complex answer…may receive lower ratings from human evaluators,” the paper explains.
One summary cited in the paper reinterpreted a finding that a diabetes drug was “better than placebo” as an endorsement of the “effective and safe treatment” option. “Such…generic generalisations could mislead practitioners into using unsafe interventions,” the paper says.
It offers five strategies to “mitigate the risks” of overgeneralisations in AI summaries. They include using AI firm Anthropic’s “Claude” family of bots, which were found to produce the “most faithful” summaries.
Another recommendation is to lower the bot’s “temperature” setting – an adjustable parameter that controls the randomness of the generated text.
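For readers who run these tools themselves, lowering the temperature is typically a one-line change in the API call. The sketch below is a minimal illustration assuming the OpenAI Python SDK; the model name, prompt wording and temperature value are illustrative choices, not those used in the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

abstract = "..."  # the journal abstract to be summarised

response = client.chat.completions.create(
    model="gpt-4o",    # illustrative model choice
    temperature=0.2,   # lower than the default of 1.0, reducing randomness in the output
    messages=[
        {
            "role": "user",
            "content": f"Provide a summary of the main findings:\n\n{abstract}",
        }
    ],
)

print(response.choices[0].message.content)
```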