Despite its prowess in various domains, AI still falls short of expert-level history knowledge, with the top-performing model scoring just 46% accuracy on a graduate-level test. The study highlights both the limitations and the future potential of AI in historical research.
Artificial intelligence chatbots have revolutionized fields from customer service to legal research, but new findings suggest that these systems still struggle with complex historical knowledge. A team of complexity scientists and AI experts recently evaluated the performance of advanced language models, including GPT-4, on Ph.D.-level history questions. The results, presented at the NeurIPS conference in Vancouver, reveal significant gaps in their historical understanding.
Led by Peter Turchin, a complexity scientist at the Complexity Science Hub (CSH), and Maria del Rio-Chanona, an assistant professor at University College London, the study tested AI models including GPT-4 Turbo, Llama, and Gemini against a rigorous benchmark developed using the Seshat Global History Databank. The benchmark encompassed nearly 600 societies, more than 36,000 data points, and more than 2,700 scholarly references.
“Large language models (LLMs), such as ChatGPT, have been enormously successful in some fields — for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” Turchin, who heads the CSH research group on social complexity and collapse, said in a news release.
Despite improvements over earlier iterations, the best-performing model, GPT-4 Turbo, achieved only 46% accuracy on a multiple-choice history test designed for graduate students. Although this beats the 25% expected from random guessing among four answer choices, it underscores the limitations of AI in grasping nuanced historical contexts.
“I thought the AI chatbots would do a lot better,” added del Rio-Chanona, the study’s corresponding author and an external faculty member at CSH. “History is often viewed as facts, but sometimes interpretation is necessary to make sense of it.”
One of the study’s most surprising findings was the domain specificity of AI capabilities.
“This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,” Turchin added.
Performance varied markedly across time periods and geographic regions. The models were more accurate on questions about ancient history, particularly the period from 8,000 BCE to 3,000 BCE, but struggled significantly with more recent events, from 1,500 CE to the present.
There were also notable disparities by geographic focus: OpenAI’s models, for example, performed better on questions about Latin America and the Caribbean than on those about Sub-Saharan Africa.
First author Jakob Hauser, a resident scientist at CSH, explained the importance of setting such benchmarks.
“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge. The Seshat Databank allows us to go beyond ‘general knowledge’ questions,” he said in the news release.
The study also found that the models excelled in certain categories, such as legal systems and social complexity, but faltered on topics related to discrimination and social mobility.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” added del Rio-Chanona.
Looking forward, the research team, which includes experts from the University of Oxford and the Alan Turing Institute, aims to expand their dataset and refine their benchmarks to include more diverse and complex historical questions.
“We plan to continue refining the benchmark by integrating additional data points from diverse regions, especially the Global South,” Hauser added. “We also look forward to testing more recent LLM models, such as o3, to see if they can bridge the gaps identified in this study.”
These findings offer critical insights for both historians and AI developers, highlighting areas for improvement and the potential for better integration of AI in historical research.