Leading Chatbots Often Exaggerate Scientific Findings: New Study

Researchers reveal that prominent chatbots tend to exaggerate scientific conclusions, with accuracy prompts surprisingly leading to more overgeneralizations. The findings stress the need for vigilant use of AI in scientific communication.

Leading chatbots, including ChatGPT and DeepSeek, often misrepresent scientific findings by exaggerating conclusions in up to 73% of cases, according to new research. The study, conducted by Uwe Peters from Utrecht University and Benjamin Chin-Yee from Western University in Canada and the University of Cambridge in the UK, highlights significant accuracy issues in AI-generated science summaries.

The researchers tested 10 of the most prominent large language models (LLMs), including ChatGPT, DeepSeek, Claude and LLaMA, analyzing nearly 5,000 summaries of research articles from prestigious scientific journals, such as Nature, Science and Lancet.

They discovered that six out of 10 models consistently stretched the conclusions of original texts, often transforming cautious, study-specific language into misleading, sweeping statements.

“Students, researchers and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they’ll get a more reliable summary. Our findings prove the opposite,” Peters said in a news release.

Interestingly, efforts to counteract these inaccuracies by prompting the models for accuracy had the opposite effect. When explicitly asked to avoid inaccuracies, the models were almost twice as likely to produce overgeneralized conclusions as when given an unprompted summary task.

Published in Royal Society Open Science, the study underscores a concerning trend: newer AI models, such as ChatGPT-4o and DeepSeek, were less accurate than their older counterparts. This poses additional risks in scientific communication, where precision is critical.

The researchers compared the AI-generated summaries to those written by humans. Notably, chatbots were nearly five times more likely to produce broad generalizations than human writers.

“Worse still, overall, newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones,” added Peters.

The issue stems from the fact that overgeneralizations are prevalent in human scientific writing, which the AI models are trained on, Chin-Yee explained.

Additionally, human users’ preferences for clear and broadly applicable language might lead the models to overgeneralize during their training process.

To mitigate these risks, the researchers recommend using LLMs like Claude, which demonstrated the highest accuracy, and adjusting settings to reduce a chatbot’s “temperature,” a parameter that controls its creativity. They also advocate for prompts that enforce indirect, past-tense reporting in summaries. 
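
To make the temperature and phrasing suggestions concrete, here is a minimal sketch using the OpenAI Python client; the model name, prompt wording, and placeholder article text are illustrative assumptions and are not drawn from the study.

```python
# Illustrative only: lowering temperature and requesting indirect, past-tense
# reporting, in line with the researchers' recommendations. Model name,
# prompt wording, and article text are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

article_text = "..."  # paste the abstract or article to be summarized here

response = client.chat.completions.create(
    model="gpt-4o",      # any chat model; this choice is an assumption
    temperature=0.2,     # low temperature to curb "creative" rewording
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the study using indirect, past-tense reporting "
                "(e.g., 'the authors reported that...'). Keep all claims "
                "limited to the population and conditions actually studied."
            ),
        },
        {"role": "user", "content": article_text},
    ],
)

print(response.choices[0].message.content)
```

Whether such prompting reliably reduces overgeneralization would need to be verified case by case, especially given the study's finding that accuracy-focused instructions can backfire.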

“If we want AI to support science literacy rather than undermine it, we need more vigilance and testing of LLMs in science communication contexts,” Peters added.

Source: Utrecht University