{"id":15771,"date":"2025-01-22T21:37:49","date_gmt":"2025-01-22T21:37:49","guid":{"rendered":"https:\/\/www.tun.com\/home\/?p=15771"},"modified":"2025-01-23T15:09:47","modified_gmt":"2025-01-23T15:09:47","slug":"is-ai-capable-of-passing-ph-d-level-history-tests","status":"publish","type":"post","link":"https:\/\/www.tun.com\/home\/is-ai-capable-of-passing-ph-d-level-history-tests\/","title":{"rendered":"Is AI Capable of Passing Ph.D.-Level History Tests?"},"content":{"rendered":"\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-uagb-blockquote uagb-block-e7eb3fc3 uagb-blockquote__skin-border uagb-blockquote__stack-img-none\"><blockquote class=\"uagb-blockquote\"><div class=\"uagb-blockquote__content\">Despite its prowess in various domains, AI still falls short in expert-level history knowledge, with top-performing models scoring just 46% on accuracy. The study highlights the limitations and future potential for AI in historical research.<\/div><footer><div class=\"uagb-blockquote__author-wrap uagb-blockquote__author-at-left\"><\/div><\/footer><\/blockquote><\/div>\n\n\n\n<div class=\"wp-block-group is-content-justification-space-between is-nowrap is-layout-flex wp-container-core-group-is-layout-0dfbf163 wp-block-group-is-layout-flex\"><div style=\"font-size:16px;\" class=\"has-text-align-left wp-block-post-author\"><div class=\"wp-block-post-author__content\"><p class=\"wp-block-post-author__name\">The University Network<\/p><\/div><\/div>\n\n\n<div class=\"wp-block-uagb-social-share uagb-social-share__outer-wrap uagb-social-share__layout-horizontal uagb-block-ee584a31\">\n<div class=\"wp-block-uagb-social-share-child uagb-ss-repeater uagb-ss__wrapper uagb-block-ec619ce7\"><span class=\"uagb-ss__link\" data-href=\"https:\/\/www.facebook.com\/sharer.php?u=\" tabindex=\"0\" role=\"button\" aria-label=\"facebook\"><span class=\"uagb-ss__source-wrap\"><span 
class=\"uagb-ss__source-icon\"><svg xmlns=\"https:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\"><path d=\"M504 256C504 119 393 8 256 8S8 119 8 256c0 123.8 90.69 226.4 209.3 245V327.7h-63V256h63v-54.64c0-62.15 37-96.48 93.67-96.48 27.14 0 55.52 4.84 55.52 4.84v61h-31.28c-30.8 0-40.41 19.12-40.41 38.73V256h68.78l-11 71.69h-57.78V501C413.3 482.4 504 379.8 504 256z\"><\/path><\/svg><\/span><\/span><\/span><\/div>\n\n\n\n<div class=\"wp-block-uagb-social-share-child uagb-ss-repeater uagb-ss__wrapper uagb-block-32d99934\"><span class=\"uagb-ss__link\" data-href=\"https:\/\/twitter.com\/share?url=\" tabindex=\"0\" role=\"button\" aria-label=\"twitter\"><span class=\"uagb-ss__source-wrap\"><span class=\"uagb-ss__source-icon\"><svg xmlns=\"https:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\"><path d=\"M389.2 48h70.6L305.6 224.2 487 464H345L233.7 318.6 106.5 464H35.8L200.7 275.5 26.8 48H172.4L272.9 180.9 389.2 48zM364.4 421.8h39.1L151.1 88h-42L364.4 421.8z\"><\/path><\/svg><\/span><\/span><\/span><\/div>\n\n\n\n<div class=\"wp-block-uagb-social-share-child uagb-ss-repeater uagb-ss__wrapper uagb-block-1d136f14\"><span class=\"uagb-ss__link\" data-href=\"https:\/\/www.linkedin.com\/shareArticle?url=\" tabindex=\"0\" role=\"button\" aria-label=\"linkedin\"><span class=\"uagb-ss__source-wrap\"><span class=\"uagb-ss__source-icon\"><svg xmlns=\"https:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 448 512\"><path d=\"M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z\"><\/path><\/svg><\/span><\/span><\/span><\/div>\n<\/div>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>Artificial intelligence chatbots have 
revolutionized fields from customer service to legal research, but new findings suggest that these systems still struggle with complex historical knowledge. A team of complexity scientists and AI experts recently evaluated the performance of advanced language models, including ChatGPT-4, on Ph.D.-level history questions. The results, <a href=\"https:\/\/nips.cc\/virtual\/2024\/poster\/97439\" target=\"_blank\" rel=\"noopener\" title=\"\">presented<\/a> at the NeurIPS conference in Vancouver, reveal significant gaps in their historical understanding.<\/p>\n\n\n\n<p>Led by Peter Turchin, a complexity scientist at the Complexity Science Hub (CSH), and Maria del Rio-Chanona, an assistant professor at University College London, the study tested AI models like GPT-4 Turbo, Llama and Gemini against a rigorous benchmark developed using the Seshat Global History Databank. The benchmark encompassed nearly 600 societies, over 36,000 data points and more than 2,700 scholarly references.<\/p>\n\n\n\n<p>\u201cLarge language models (LLMs), such as ChatGPT, have been enormously successful in some fields \u2014 for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,\u201d Turchin, who heads the CSH research group on\u00a0social complexity and collapse, said in a <a href=\"https:\/\/csh.ac.at\/news\/can-chatgpt-pass-a-phd-level-history-test\/\" target=\"_blank\" rel=\"noopener\" title=\"\">news release<\/a>.<\/p>\n\n\n\n<p>Despite improvements from earlier iterations, the best-performing model, GPT-4 Turbo, achieved only 46% accuracy on a multiple-choice history test designed for graduate students. 
Although this is better than the 25% accuracy expected from random guessing, it underscores the limitations of AI in understanding nuanced historical contexts.<\/p>\n\n\n\n<p>&#8220;I thought the AI chatbots would do a lot better,\u201d added del Rio-Chanona, who&#8217;s also an external faculty member at CSH and the corresponding author. \u201cHistory is often viewed as facts, but sometimes interpretation is necessary to make sense of it.\u201d<\/p>\n\n\n\n<p>One of the study&#8217;s most surprising findings was the domain specificity of AI capabilities.<\/p>\n\n\n\n<p>\u201cThis result shows that artificial \u2018intelligence\u2019 is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,\u201d Turchin added.<\/p>\n\n\n\n<p>Performance varied markedly across time periods and geographic regions. AI models were more accurate on questions about ancient history, particularly from 8,000 BCE to 3,000 BCE, but struggled significantly with more recent history, from 1,500 CE to the present.<\/p>\n\n\n\n<p>Accuracy also differed notably by geographic focus, with OpenAI\u2019s models performing better on Latin America and the Caribbean but worse on Sub-Saharan Africa.<\/p>\n\n\n\n<p>First author Jakob Hauser, a resident scientist at CSH, explained the importance of setting such benchmarks.<\/p>\n\n\n\n<p>\u201cWe wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge. 
The Seshat Databank allows us to go beyond \u2018general knowledge\u2019 questions,\u201d he said in the news release.<\/p>\n\n\n\n<p>The study further found that AI models excelled in certain categories, such as legal systems and social complexity, but faltered on topics related to discrimination and social mobility.<\/p>\n\n\n\n<p>&#8220;The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They&#8217;re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they&#8217;re not yet up to the task,\u201d added del Rio-Chanona.<\/p>\n\n\n\n<p>Looking forward, the research team, which includes experts from the University of Oxford and the Alan Turing Institute, aims to expand its dataset and refine its benchmark to include more diverse and complex historical questions.<\/p>\n\n\n\n<p>&#8220;We plan to continue refining the benchmark by integrating additional data points from diverse regions, especially the Global South,&#8221; Hauser added. &#8220;We also look forward to testing more recent LLM models, such as o3, to see if they can bridge the gaps identified in this study.&#8221;<\/p>\n\n\n\n<p>These findings offer critical insights for both historians and AI developers, highlighting areas for improvement and the potential for better integration of AI in historical research.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Artificial intelligence chatbots have revolutionized fields from customer service to legal research, but new findings suggest that these systems still struggle with complex historical knowledge. A team of complexity scientists and AI experts recently evaluated the performance of advanced language models, including ChatGPT-4, on Ph.D.-level history questions. 
The results, presented at the NeurIPS conference in [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-no-separators","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[8,18],"tags":[],"class_list":["post-15771","post","type-post","status-publish","format-standard","hentry","category-ai","category-education"],"acf":[],"aioseo_notices":[],"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false},"uagb_author_info":{"display_name":"The University Network","author_link":"https:\/\/www.tun.com\/home\/author\/funky_junkie\/"},"uagb_comment_info":0,"uagb_excerpt":"Artificial intelligence chatbots have revolutionized fields from customer service to legal research, but new findings suggest that these systems still struggle with complex historical knowledge. A team of complexity scientists and AI experts recently evaluated the performance of advanced language models, including ChatGPT-4, on Ph.D.-level history questions. 
The results, presented at the NeurIPS conference in&hellip;","_links":{"self":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts\/15771","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/comments?post=15771"}],"version-history":[{"count":10,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts\/15771\/revisions"}],"predecessor-version":[{"id":15787,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/posts\/15771\/revisions\/15787"}],"wp:attachment":[{"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/media?parent=15771"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/categories?post=15771"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tun.com\/home\/wp-json\/wp\/v2\/tags?post=15771"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}