AI Models Show Major Gaps in Global Historical Knowledge, Scoring Below 50% on Expert Tests



A recent study has revealed that even the most advanced artificial intelligence models fall short in understanding complex historical knowledge, achieving less than 50% accuracy on expert-level history questions.

The research, presented at the Neural Information Processing Systems conference in Vancouver, tested seven AI models, including GPT-4 Turbo, GPT-3.5, Llama, and Gemini, using questions derived from the Seshat Global History Databank, a comprehensive collection of historical information spanning 10,000 years and 600 societies.

GPT-4 Turbo emerged as the top performer, yet scored only 46% on multiple-choice questions about historical facts. While this exceeds random guessing (25%), it reveals significant limitations in AI's ability to process and understand historical information.

"One surprising finding was just how bad these models were," noted Peter Turchin from the Complexity Science Hub. "This result shows that artificial 'intelligence' is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others."

The study uncovered geographical biases in AI performance. Models showed better results for historical questions about North America and Western Europe while struggling with queries about Sub-Saharan Africa and Oceania. They also performed better on ancient history (before 3000 BCE) compared to more recent periods.

"When it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability is much more limited," explained Turchin.

The research team used the Seshat Global History Databank, which contains 36,000 data points drawn from over 2,700 scholarly sources. Questions tested whether historical variables like writing systems or governance structures were present or absent in different societies.
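To make the evaluation setup concrete, the sketch below shows how a presence/absence benchmark of this kind might be scored. This is a hypothetical illustration with invented example items and four assumed answer choices, not the actual Seshat test harness or its data:

```python
# Hypothetical sketch of scoring a presence/absence history benchmark.
# The societies, variables, and answer choices below are illustrative only.

RANDOM_BASELINE = 0.25  # four answer choices per question

# Each item: (society, variable, correct_choice, model_choice).
# Assumed choice set: "present", "absent", "inferred present", "inferred absent".
results = [
    ("Roman Empire", "writing system", "present", "present"),
    ("Inca Empire", "alphabetic script", "absent", "present"),
    ("Old Kingdom Egypt", "professional bureaucracy", "present", "present"),
    ("Tiwanaku", "coinage", "absent", "absent"),
]

# Accuracy is simply the fraction of questions where the model's
# choice matches the expert-coded ground truth.
correct = sum(1 for _, _, truth, guess in results if truth == guess)
accuracy = correct / len(results)

print(f"accuracy: {accuracy:.0%} (random baseline: {RANDOM_BASELINE:.0%})")
```

On these toy items the model answers 3 of 4 correctly (75%), well above the 25% random baseline; the study's point is that on real expert-coded questions the best model managed only 46%.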

"The main takeaway is that LLMs, while impressive, still lack the depth of understanding required for advanced history," said Maria del Rio-Chanona, the study's corresponding author from University College London. "They're great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they're not yet up to the task."

The findings highlight current limitations in AI tools' ability to process complex historical knowledge and underscore the continued importance of human expertise in historical research and analysis.