AI is failing ‘Humanity’s Last Exam’. So what does that mean for machine intelligence?
How do you translate ancient Palmyrene script from a Roman tombstone? How many paired tendons are supported by a specific sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew based on the latest scholarship on Tiberian pronunciation traditions?
These are some of the questions in “Humanity’s Last Exam”, a new benchmark introduced in a study published this week in Nature. The collection of 2,500 questions is specifically designed to probe the outer limits of what today’s artificial intelligence (AI) systems can do.
The benchmark is the product of a global collaboration of nearly 1,000 experts across a range of academic fields. These academics and researchers contributed questions at the frontier of human knowledge, problems requiring graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Importantly, every question was tested against leading AI models before inclusion: if an AI could answer it correctly at the time the test was designed, the question was rejected.
This process explains why the initial results looked so different from those on other benchmarks. AI chatbots routinely score above 90% on popular tests, but when Humanity’s Last Exam was first released in early 2025, leading models struggled badly. GPT-4o managed just 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI’s most powerful model, o1, achieved only 8%.
The low scores were the point. The benchmark was constructed to measure what remained beyond AI’s grasp. And while…
