
Exclusive: This new benchmark could expose AI’s biggest weakness

ARC-AGI-3 tests whether models can reason through novel problems, not just recall patterns, a task even top systems still struggle with.

[Screenshot: ARC Prize]

The influential AI researcher François Chollet has long argued that the field measures intelligence incorrectly, that popular benchmarks reward a model’s ability to memorize vast amounts of data rather than navigate novel situations and learn new skills. Only recently, with the rise of autonomous AI agents, have companies begun to take that critique seriously. On Tuesday, the ARC Prize Foundation, which Chollet founded with Zapier cofounder Mike Knoop, released a new and more difficult version of its benchmark. The test, called ARC-AGI-3, may offer the clearest measurement yet of how close today’s AI agents are to human-level intelligence.

It consists of more than a thousand simple, video-game-like scenarios designed to measure on-the-fly reasoning rather than memory recall. “You can always achieve skill by memorization by effectively just storing a lookup table of everything you need to do,” Chollet says. “Intelligence is the efficiency with which you’re going to make sense of new things, of new tasks that you’ve never seen before.”
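Chollet’s lookup-table framing is easy to make concrete. The toy Python below is a hypothetical illustration, not code from the ARC Prize: it contrasts a “solver” that has merely memorized answers with one that has abstracted the underlying rule, and only the latter survives an input it has never seen.

```python
# Hypothetical illustration (not ARC Prize code): "skill by memorization"
# versus abstracting the rule behind the examples.

TRAINING_DATA = {(1, 2): 3, (4, 5): 9}       # memorized input -> answer pairs

def lookup_solver(task):
    """A stored answer table: perfect on seen tasks, helpless on new ones."""
    return TRAINING_DATA.get(task)

def rule_solver(task):
    """Has abstracted the rule (here, addition), so novel inputs still work."""
    a, b = task
    return a + b

print(lookup_solver((1, 2)), rule_solver((1, 2)))   # 3 3  (both look "skilled")
print(lookup_solver((7, 8)), rule_solver((7, 8)))   # None 15  (only one generalizes)
```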

Given no instructions, an agent must develop an understanding of the game environment and its rules, then apply that knowledge to form a strategy across multiple steps toward an ultimate goal. Agents that reach those goals using fewer, more efficient steps earn higher scores, with their creators eligible for all or part of a $1 million prize. As in previous ARC benchmarks, humans can navigate the tasks with relative ease, while many AI systems struggle.
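That evaluation loop can be sketched in a few lines. Everything below is hypothetical (ToyEnv, its two actions, and the scoring are stand-ins invented for illustration, not the ARC-AGI-3 interface), but it captures the shape of the task: act with no instructions, infer the hidden rules from observation, then exploit them to reach the goal in as few steps as possible.

```python
import random

class ToyEnv:
    """A one-rule 'game': one of two actions moves you toward the goal."""
    def __init__(self):
        self.pos, self.goal = 0, 5
        effect = random.choice([1, -1])            # hidden rule the agent must infer
        self.effects = {"A": effect, "B": -effect}

    def step(self, action):
        self.pos += self.effects[action]
        return self.pos                            # an observation, not an instruction

    def done(self):
        return self.pos == self.goal

def run_agent(env, max_steps=100):
    model, pos, steps = {}, 0, 0
    while not env.done() and steps < max_steps:
        # explore actions whose effect is still unknown, then exploit the best one
        unknown = [a for a in ("A", "B") if a not in model]
        action = unknown[0] if unknown else max(model, key=model.get)
        new_pos = env.step(action)
        model[action] = new_pos - pos              # update the rule hypothesis
        pos, steps = new_pos, steps + 1
    return steps                                   # fewer steps -> higher score

print("reached goal in", run_agent(ToyEnv()), "steps")
```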

A high score on ARC-AGI-3 could also serve as evidence of artificial general intelligence (AGI). To perform the “most economically valuable work” that humans do, as one common definition of AGI requires, AI agents will need to reason through unfamiliar situations in unfamiliar environments. They will need to form abstractions from past experiences and generalize them to new problems they were not explicitly trained to solve.

“I just love that this benchmark basically goes at the heart of this gap that exists between actually measuring for AGI and the standard set of benchmark suites that the big labs and essentially everybody seems to use in the rat race of getting 0.5% of improvement over every other state-of-the-art model for a week,” says Andy Konwinski, whose Laude Institute donated $25,000 to the ARC Prize as part of its Slingshots initiative.

When the first ARC test was released in 2019, the transformer architecture behind today’s AI chatbots was only two years old, and models were just beginning to generate coherent responses to prompts. Because they could not yet reason in real time, they solved almost none of the ARC-1 puzzles, which limited the benchmark’s adoption.

Chollet saw a fundamental problem with how the industry evaluated progress. Systems that could handle tasks described as “PhD-level” intelligence were failing at simple puzzles. “When the most advanced AI systems are stumped, but a child can do it, that’s a big red flashing light telling you that we’re missing something, that something really important is off,” he says.

© Fast Company