Could AI models pass high-level history exams? (Image by © BPawesome - stock.adobe.com)
In a nutshell
- Even the most advanced AI models, like GPT-4-Turbo, failed to demonstrate expert-level understanding of global history, scoring only 46% on a rigorous benchmark test designed for graduate-level inquiry.
- AI models performed better on ancient history than more recent events, and consistently struggled with regions outside the Western world, especially Sub-Saharan Africa and Oceania, highlighting biases in training data.
- The study underscores a major limitation of current AI: while they excel at surface-level fact recall, they lack the deep contextual reasoning and global coverage needed for sophisticated historical analysis.
VIENNA — Large language models like ChatGPT and Claude are supposed to be our new digital encyclopedias, capable of answering complex questions about everything from science to literature. But when researchers put seven leading AI models through graduate-level history exams, even the best-performing model, GPT-4-Turbo, managed only 46% accuracy on questions that any history PhD student should nail.
This isn’t your typical high school history quiz about memorizing dates and famous battles. Researchers from the Complexity Science Hub in Vienna created what might be the most comprehensive test of AI historical knowledge ever conducted. They drew from a massive database called Seshat that contains expert-verified information about 600 historical societies spanning 10,000 years of human civilization. They presented these findings at the 38th Conference on Neural Information Processing Systems in December 2024.
If these AI models were students in a graduate history seminar, most would be failing. GPT-4-Turbo, the standout performer, would barely earn a passing grade, while the weakest model, Llama-3.1-8B, scored just 33.6%, only modestly better than the 25% a random guesser would average on four-option questions.
“I thought the A.I. chatbots would do a lot better,” says corresponding author Maria del Rio-Chanona from University College London, in a statement.
AI models have been crushing standardized tests left and right. GPT-4 scored above 80% on Advanced Placement U.S. History and Art History exams. But when faced with truly global, expert-level historical knowledge that goes beyond the Western-centric curriculum most AI training data reflects, these digital know-it-alls suddenly look a lot more human.
Testing AI on History’s Hardest Questions
“Large language models (LLMs), such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded in replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” says study author Peter Turchin, who leads the Complexity Science Hub’s (CSH) research group on social complexity and collapse.
Researchers tested seven major AI models: GPT-3.5, GPT-4-Turbo, GPT-4o, Llama-3-70B, Llama-3.1-70B, Llama-3.1-8B, and Gemini-1.5-flash. But this wasn’t a simple true-or-false exam. Each question presented four possible answers: present, absent, inferred present, or inferred absent, forcing the AI to distinguish between what historians know for certain versus what they can reasonably deduce from limited evidence.
“History is often viewed as facts, but sometimes interpretation is necessary to make sense of it,” adds del Rio-Chanona.
For example, a question might ask whether a specific military technology existed in ancient Rome during a particular century, requiring the AI to understand not just historical facts but the difference between documented evidence and scholarly inference. This mirrors the kind of nuanced thinking expected from graduate students and professional historians.
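To make the format concrete, here is a minimal sketch of how a Seshat-style data point might be turned into one of these four-option questions. The wording, the helper name `build_question`, and the example society and variable are all illustrative assumptions, not taken from the released benchmark:

```python
# The four answer categories mirror the study's labels; everything else
# (prompt wording, example inputs) is made up for illustration.
ANSWER_OPTIONS = ["present", "absent", "inferred present", "inferred absent"]

def build_question(society: str, period: str, variable: str) -> str:
    """Format a historical fact as a four-option multiple-choice prompt."""
    options = "\n".join(
        f"{letter}. {label}"
        for letter, label in zip("ABCD", ANSWER_OPTIONS)
    )
    return (
        f"For the society '{society}' during {period}, "
        f"was the following feature present: '{variable}'?\n"
        f"{options}\n"
        "Answer with a single letter."
    )

prompt = build_question("Roman Empire", "100-200 CE", "iron-tipped siege weapons")
print(prompt)
```

The key design point is that "inferred present" and "inferred absent" are distinct answers from "present" and "absent," so a model must commit to whether the evidence is direct or deduced, not just whether the feature existed.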
The test drew from 36,000 data points covering societies from every inhabited continent, spanning from the Neolithic period to the Industrial Revolution. Unlike typical AI benchmarks that might inadvertently favor certain regions or time periods, this dataset was specifically designed to avoid Western bias and represent global historical knowledge equally.
AI Gets Lost in Time
AI models performed better on ancient history than on recent events. GPT-4-Turbo achieved 55.3% accuracy for the period 8000-6000 BCE but dropped to just 38.7% for 1500-2000 CE. Counterintuitively, although far more detailed records survive from recent centuries, the models struggled more with the complexity and nuance of these well-documented periods.
Geographically, the models showed clear biases. Performance was strongest for the Americas and weakest for Oceania and Sub-Saharan Africa. GPT-4-Turbo and GPT-4o performed best in Latin America, while Llama-3.1-70B excelled with North American historical questions. However, nearly all models struggled significantly with Sub-Saharan African history, highlighting persistent gaps in AI training data.
When broken down by historical topics, AI models performed best on questions about social complexity and warfare — areas with more concrete, measurable evidence. They struggled most with economic systems and legal frameworks, which often require understanding subtle cultural and institutional contexts that don’t translate easily into the kind of data AI models consume during training.
Current language models excel at regurgitating information that appears frequently in their training data, but they lack the deep contextual understanding that allows human experts to make sophisticated judgments about incomplete historical records.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” adds del Rio-Chanona.
AI models aren’t the omniscient digital historians they’re often portrayed to be. They’re sophisticated pattern-matching systems that reflect the biases and limitations of their training data, struggling particularly with the kind of nuanced, expert-level analysis that separates real historical understanding from mere fact recall.
Until AI developers create more globally representative training datasets and better methods for handling uncertainty, AI might help you remember when the Civil War ended, but don’t count on it as a study guide for graduate-level history exams.
Paper Summary
Methodology
Researchers created a comprehensive benchmark test called HiST-LLM using data from the Seshat Global History Databank. They converted 36,000 historical data points about 600 societies into multiple-choice questions with four options: present, absent, inferred present, or inferred absent. Seven major AI models were tested using a multi-shot prompting approach with chain-of-thought reasoning. The questions covered 11 different historical themes including social complexity, warfare, religion, and institutions across all major world regions from 10,000 BCE to 1850 CE.
Results
GPT-4-Turbo performed best with 46% balanced accuracy, while Llama-3.1-8B scored lowest at 33.6%. All models outperformed random guessing (25%) but fell far short of expert-level performance. Models performed better on ancient history than recent periods, with accuracy declining as time periods approached the present. Geographically, performance was strongest for the Americas and weakest for Oceania and Sub-Saharan Africa. Models scored highest on social complexity and warfare questions, lowest on economic and legal system topics.
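"Balanced accuracy" here means the average of per-class recall, so a model cannot score well simply by always picking the most common answer category. A minimal sketch of the computation, with made-up labels purely for illustration:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each answer class contributes equally,
    regardless of how often it appears in the test set."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# A degenerate model that always answers "present" gets 100% recall on
# one class and 0% on the other three: balanced accuracy 0.25, which is
# exactly chance level for a four-option question.
y_true = ["present", "absent", "inferred present", "inferred absent"] * 5
y_pred = ["present"] * 20
print(balanced_accuracy(y_true, y_pred))  # 0.25
```

This is why the 25% chance baseline quoted in the results holds even if the four answer categories are unevenly represented in the benchmark.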
Limitations
The dataset was primarily compiled from English-language sources, potentially limiting coverage of non-English speaking regions. While the Seshat database aims for global representation, some regions and time periods have better documentation than others. The study only tested seven models from three major families, and the evaluation was conducted in August 2024, so results may not reflect the latest model improvements. The benchmark focuses on factual knowledge rather than historical reasoning or argumentation skills.
Funding and Disclosures
The research was supported by multiple funding sources including CLARIA-AT, the Austrian Research Promotion Agency, James S. McDonnell Foundation, an AHRC grant, and the Alan Turing Institute. The authors disclosed no competing interests. The dataset is being released under Creative Commons license for public use.
Publication Information
The study “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” authored by Peter Turchin, Jakob Hauser, Daniel Kondor, Jenny Reddish, and colleagues, was presented at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks. The complete dataset and results are publicly available.