ARC AGI 3 Results: GPT-5.4, Claude Opus 4.6, and Gemini 3.1 All Score 0%
Every major AI model scored 0% on ARC AGI 3 while humans score 100%. Here's what the results reveal about the gap between AI capability and generalization.
When 0% Is the Most Revealing Score in AI
A score of 0% on a test usually means failure. On ARC AGI 3, it means something far more interesting.
GPT-5.4, Claude Opus 4.6, and Gemini 3.1 — the most capable large language models available from OpenAI, Anthropic, and Google — all scored exactly zero on the latest version of the Abstraction and Reasoning Corpus benchmark. Ordinary humans with no special training scored 100%.
This isn’t a minor embarrassment for AI labs. It’s a signal worth sitting with carefully.
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was created by AI researcher François Chollet as a formal test of general reasoning. Every version — ARC-AGI 1, 2, and now 3 — is designed so that no amount of memorized training data can crack it. The only path to a good score is genuine reasoning from first principles.
That’s why the 0% result isn’t just a number. It’s a clear statement about where AI currently sits relative to human-level generalization — and why the gap matters for anyone building or deploying AI systems today.
What ARC-AGI Actually Measures
Most AI benchmarks test whether a model has seen similar problems before. Feed it enough training data, and a language model can appear to solve algebra problems, summarize documents, and pass professional exams — because it’s pattern-matching against things it’s encountered in training.
ARC-AGI was built to prevent that.
Each task presents a small number of input-output pairs — usually two to five — as visual grid puzzles. Each grid contains colored cells arranged in patterns. The model must infer the underlying rule from those few examples and apply it correctly to a new input it’s never seen.
The tasks look deceptively simple. Something like: “the blue cluster mirrors itself horizontally while the red pattern scales proportionally.” But each task is genuinely novel — the rules are constructed as distinct combinations of abstract concepts that couldn’t have appeared in any training dataset.
For humans, these puzzles are usually solvable in under a minute. For AI models, even the first version of the benchmark was surprisingly hard.
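To make the task format concrete, here is a toy sketch of how such a puzzle can be represented in code. This is not a real ARC-AGI task: the grids, color codes, and the horizontal-mirror rule are invented for illustration, echoing the example described above.

```python
# Toy illustration of the ARC task format: grids are small 2D arrays of
# color codes (0 = background, 1-9 = colors). The hidden rule here is
# "mirror the grid horizontally" -- invented for illustration only.

def mirror_horizontal(grid):
    """Apply the hidden rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A task gives a few input-output demonstration pairs...
train_pairs = [
    ([[1, 0, 0],
      [1, 2, 0]],
     [[0, 0, 1],
      [0, 2, 1]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]

# ...and the solver must produce the output for a new, unseen input.
test_input = [[5, 0, 6],
              [0, 7, 0]]

# The rule must be consistent with every demonstration pair,
# and is then applied to the test input.
assert all(mirror_horizontal(x) == y for x, y in train_pairs)
print(mirror_horizontal(test_input))  # [[6, 0, 5], [0, 7, 0]]
```

The point of the format is visible even in this toy: two demonstration pairs are all the evidence the solver gets, and the test input shares no surface features with them.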
Why Visual Grid Puzzles Test Reasoning So Well
The choice of visual grids was deliberate. Large language models are trained on text, which means many reasoning tasks can be “solved” through linguistic pattern matching. Grid puzzles sidestep that entirely.
To solve an ARC task, you have to:
- Identify what’s changing across the input-output examples
- Abstract the underlying rule from very limited evidence
- Apply that rule to a completely new case
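The steps above can be sketched as a brute-force baseline: enumerate a small library of candidate transformations, keep one that is consistent with every demonstration pair, and apply it to the new input. This is a hedged illustration of the search idea, not how any lab's solver works; real ARC rules are novel compositions that fall outside any fixed candidate list, which is exactly why this kind of enumeration fails on the benchmark.

```python
# A naive "infer the rule, then apply it" loop: try each candidate
# transformation and keep one that reproduces every demonstration pair.
# The candidate library is tiny and invented for illustration.

CANDIDATES = {
    "identity":  lambda g: [row[:] for row in g],
    "mirror_h":  lambda g: [list(reversed(row)) for row in g],
    "mirror_v":  lambda g: [row[:] for row in reversed(g)],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def infer_rule(train_pairs):
    """Return the name of a candidate consistent with all pairs, or None."""
    for name, fn in CANDIDATES.items():
        if all(fn(x) == y for x, y in train_pairs):
            return name
    return None  # the rule lies outside the hypothesis space

train_pairs = [
    ([[1, 0], [2, 0]], [[1, 2], [0, 0]]),  # rows become columns
    ([[0, 3], [0, 0]], [[0, 0], [3, 0]]),
]

rule = infer_rule(train_pairs)
print(rule)                                # transpose
print(CANDIDATES[rule]([[4, 5], [0, 0]]))  # [[4, 0], [5, 0]]
```

When no candidate matches, `infer_rule` returns `None`; a genuinely novel task puts the solver in exactly that position, no matter how large the candidate library grows.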
This is exactly what general reasoning looks like — and it’s what current AI models consistently fail at, even as they improve on text-based benchmarks.
The ARC-AGI Progression: An Arms Race Against Memorization
ARC-AGI-1: The First Real Test
When Chollet released the original ARC benchmark in 2019, it was viewed as a long-term research challenge. Early GPT models scored near 0%.
Over the following years, the research community made real progress. The ARC Prize competition — backed by $1 million — pushed scores into the 55–85% range through combinations of test-time compute, fine-tuning on synthetic data, and search-based methods.
But there was a catch. The techniques that worked were increasingly tailored to the specific patterns in ARC-AGI-1. Researchers generated large amounts of synthetic training data designed to look like ARC-AGI-1 problems. That’s effective for scoring well on a particular benchmark, but it’s not building general reasoning — it’s building a specialized ARC-1 solving machine.
ARC-AGI-2: The Reset
ARC-AGI-2, released in early 2025, immediately reset the scoreboard. The new tasks required compositional reasoning across more abstract concepts and were specifically designed to defeat the synthetic data approaches that had gamed the first version.
Early evaluations of the best available models at the time showed scores in the low single digits. Some models scored 0% on the private evaluation set. Humans still scored 100%.
ARC-AGI-3: Even More Demanding
ARC-AGI-3 targets the strategies that made AI appear to succeed on version 2. Where version 2 defeated synthetic data generation, version 3 is designed to defeat extended chain-of-thought reasoning and brute-force search methods.
The result: GPT-5.4, Claude Opus 4.6, and Gemini 3.1 all score 0%. Not low — zero.
What makes this notable is that these models are considerably more capable than their predecessors on nearly every other benchmark. They write better code, perform better on professional exams, and handle complex multi-step tasks more reliably. The 0% on ARC-AGI-3 isn’t about raw capability. It’s about a specific kind of reasoning that current architectures aren’t producing.
Why Even the Best Models Score Zero
To understand why this happens, it helps to understand what large language models fundamentally are.
Language Models Are Extremely Good Pattern Matchers
A model trained on hundreds of billions of tokens learns to predict likely next tokens given prior context. It develops rich internal representations — semantic relationships, syntactic structure, factual knowledge, and even some abstract reasoning patterns.
But this learning is distribution-dependent. The model performs well on problems that resemble its training data. Move far enough outside that distribution, and performance degrades sharply.
ARC-AGI-3 tasks are designed to sit outside any training distribution. Each puzzle is novel by construction. There’s no training signal that teaches a model how to handle it — only the abstract ability to reason from a few examples, which current architectures haven’t reliably developed.
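The distribution dependence can be shown with a toy next-token predictor, here a simple bigram counter. This is a deliberately crude stand-in for a language model, used only to illustrate the failure mode: confident predictions inside the training distribution, and no signal at all outside it.

```python
from collections import Counter, defaultdict

# Toy bigram "language model": count which token follows which in training.
# Invented corpus, purely for illustration of distribution dependence.
corpus = "the cat sat on the mat the cat sat down".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(token):
    """Most frequent next token seen in training, or None if never seen."""
    if token not in follows:
        return None  # no training signal at all for a novel context
    return follows[token].most_common(1)[0][0]

print(predict("cat"))    # sat  -- in-distribution, confident
print(predict("robot"))  # None -- never seen, nothing to retrieve
```

Real models degrade more gracefully than this hard cutoff, but the shape of the problem is the same: prediction quality is a function of how close the input sits to the training distribution.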
Chain-of-Thought Doesn’t Solve the Core Problem
A common response to hard reasoning benchmarks has been chain-of-thought prompting — asking models to reason step by step before answering. This genuinely improves performance on many tasks.
But chain-of-thought works by giving the model more tokens to work with, increasing the chance that a helpful intermediate step appears. On ARC-AGI-3, the core problem isn’t that the model needs more steps. It’s that the model hasn’t developed the right kind of inductive generalization in the first place.
Giving a model more time to reason about a problem it fundamentally doesn’t understand doesn’t produce understanding.
More Parameters Don’t Close the Gap Either
You might expect that scaling — more parameters, more compute — would eventually push scores up. But ARC-AGI scores have remained stubbornly near zero even as models have grown dramatically in capability by other measures.
This is one of the benchmark’s key contributions: it demonstrates that the capabilities AI labs have been measuring and optimizing for aren’t the same as general reasoning. Larger models get better at the tasks larger models were trained to do. They don’t automatically become better at novel generalization.
What Humans Do That AI Doesn’t
Humans solving ARC tasks aren’t doing anything obviously extraordinary. They look at the examples, identify what’s changing, and apply that rule. Most people solve each puzzle reliably in under two minutes.
Causal Reasoning From Minimal Examples
Humans learn causal structure from very few examples. Seeing two or three input-output pairs, a person can usually identify not just a surface correlation but an underlying mechanism — and apply it to a new case that looks different on the surface but follows the same rule.
Current AI models struggle with this kind of few-shot causal generalization. They can recognize patterns across large numbers of examples but don’t reliably extract causal rules from a handful of demonstrations.
Compositional Abstraction
ARC tasks often require combining multiple abstract concepts — “rotate this pattern, then apply the color rule from the other example.” Humans handle this compositionally: identify the components, apply them independently, combine the results.
Cognitive scientists believe compositional reasoning is central to human intelligence. It’s what lets us build complex understanding from simpler parts. AI models show some compositional ability, but it breaks down at the level of complexity ARC-AGI-3 tests.
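The compositional idea, identifying component rules and chaining them, can be sketched as ordinary function composition over grids. The specific components here (a rotation and a color swap) are invented to mirror the example quoted above, not drawn from any actual ARC task.

```python
# Compositional rules as function composition: each component is a
# grid-to-grid function, and a task's rule is a chain of components.
# The specific components are invented for illustration.

def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*reversed(grid))]

def recolor(mapping):
    """Return a rule that swaps colors according to `mapping`."""
    return lambda grid: [[mapping.get(c, c) for c in row] for row in grid]

def compose(*rules):
    """Chain component rules left to right into a single rule."""
    def combined(grid):
        for rule in rules:
            grid = rule(grid)
        return grid
    return combined

# "Rotate this pattern, then apply the color rule" (here: 2 becomes 5).
rule = compose(rotate90, recolor({2: 5}))
print(rule([[1, 2],
            [0, 0]]))  # [[0, 1], [0, 5]]
```

Humans build and apply such compositions on the fly from a couple of examples; the hard part the benchmark probes is discovering which composition applies, not executing it.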
Treating Every Task as Genuinely Novel
Humans approach each ARC task fresh. We don’t retrieve a matching example from memory — we reason from the evidence in front of us.
Language models, by contrast, are implicitly trying to match the task to patterns from training. When no match exists, they struggle — not because they’re confused in any meaningful sense, but because the inference mechanism isn’t built for genuine novelty.
What These Results Mean for the AI Industry
The Field Has Been Optimizing for the Wrong Signals
One takeaway from ARC-AGI-3 is that AI has been optimized against benchmarks that reward pattern recognition, not generalization. MMLU scores, coding leaderboards, professional exam performance — these measure something real, but not what “AGI” implies.
A model that scores 90% on the bar exam and 0% on ARC-AGI-3 is useful for many tasks. It is not demonstrating the flexible, general reasoning that the name “artificial general intelligence” suggests.
The ARC benchmark series is a corrective. It asks: can this system learn new rules from a few examples and apply them in novel contexts? If not, it’s a capable specialist, not a general reasoner.
The Path to Better Generalization Isn’t Clear
The AI community doesn’t have a consensus answer for how to close the ARC-AGI gap. Some researchers believe scaling will eventually produce emergent generalization — that larger models trained on more diverse data will get there. Others, including Chollet himself, argue that the current architectural paradigm has fundamental limitations and that new approaches are needed.
What the ARC-AGI-3 results confirm is that this isn’t a solved problem — or even a near-solved problem. Models representing enormous advances in AI capability still can’t crack it.
This Doesn’t Mean AI Isn’t Useful
It’s important to be clear about what a 0% ARC-AGI-3 score does and doesn’t tell us.
It doesn’t say that today’s AI is bad at most things. These models are genuinely capable at language tasks, code generation, data analysis, image generation, and a wide range of structured reasoning problems. Millions of people and businesses are getting real value from them right now.
What the scores do tell us is that “general intelligence” — specifically, flexible reasoning from minimal examples in novel situations — remains a frontier problem. The AI in your productivity tools is genuinely useful. It just doesn’t resemble what AGI would look like.
Using AI’s Real Strengths
The ARC-AGI-3 results are useful precisely because they calibrate expectations. The same models that can’t pass an abstract reasoning test can draft contracts, analyze reports, generate images from text, and automate multi-step business workflows — reliably and at scale.
The practical question for teams using AI isn’t whether their models can ace ARC-AGI-3. It’s whether those models are deployed for the tasks where they actually excel. That’s a workflow and tooling problem as much as a model selection problem.
MindStudio is built for exactly that. It’s a no-code platform that gives you access to over 200 AI models — including GPT, Claude, and Gemini — without managing API keys or separate accounts. You can build AI agents and automated workflows for real business tasks: processing incoming data, generating personalized content, automating approval flows, connecting your existing tools. The average build takes 15 minutes to an hour.
The key is alignment: don’t ask AI to do what it can’t do, and don’t underestimate what it can. ARC-AGI-3 is a sharp reminder of the first. Tools like MindStudio help you get the most out of the second.
If you’re working out how to compare AI models for specific use cases, or which tasks current models handle well, those are the practical questions to start with. You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is ARC-AGI and who created it?
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark created by AI researcher François Chollet. It consists of visual grid puzzles where the solver must infer an underlying rule from a small number of examples and apply it correctly to a new case. It’s designed to test genuine generalization rather than memorized knowledge. Humans consistently score 100%; most AI models score near or at 0%.
Why do AI models score 0% when humans score 100%?
The gap exists because current AI models are built to pattern-match against training data. ARC-AGI tasks are novel by design — they can’t be solved by retrieving a similar example from memory. Humans reason from first principles using very few examples. Current language models haven’t reliably developed this kind of few-shot causal generalization, regardless of how capable they are on other benchmarks.
Is ARC-AGI 3 harder than previous versions?
Yes. Each version targets the strategies that worked on the previous one. High scores on ARC-AGI-1 were eventually achieved through synthetic data generation and test-time compute. ARC-AGI-2 defeated those methods with harder compositional tasks. ARC-AGI-3 raises the bar again, specifically targeting chain-of-thought and search-based approaches — producing 0% scores even from the most advanced currently available models.
Does a 0% ARC-AGI score mean AI models are useless?
No. ARC-AGI measures one specific kind of reasoning: novel generalization from minimal examples. It doesn’t reflect the many areas where AI excels — language generation, coding, data analysis, image creation, and structured automation. A model that scores 0% on ARC-AGI-3 can still be highly effective for a wide range of real-world applications.
Will AI models ever score well on ARC-AGI 3?
Possibly — but there’s no clear roadmap. Some researchers believe that scaling and architectural improvements will eventually produce better generalization. Others, including Chollet, argue the transformer architecture has structural limits for this kind of reasoning and that something genuinely different is needed. The ARC benchmark series is designed to keep raising the bar as AI improves, so any future score gains would represent genuine progress in generalization rather than benchmark gaming.
What’s the difference between benchmark performance and AGI?
Most benchmarks test performance on tasks similar to training data — exams, code, language understanding. AGI, in Chollet’s definition, requires efficiently learning new skills from minimal examples and generalizing to genuinely novel situations. Current AI models do well on the first kind of test and fail on the second. ARC-AGI is designed specifically to measure the second kind.
Key Takeaways
The ARC AGI 3 results — 0% across every major model, 100% for humans — tell a coherent story about where AI currently is and isn’t.
- ARC-AGI measures genuine generalization, not pattern matching. It’s specifically designed to resist memorization, synthetic data approaches, and extended reasoning tricks.
- The 0% result holds across GPT-5.4, Claude Opus 4.6, and Gemini 3.1, confirming this isn’t one model’s weakness — it’s a structural limitation of current architectures.
- Scaling hasn’t closed the gap. These are significantly more capable models than their predecessors on other benchmarks, and they still score zero on abstract reasoning tasks.
- AI remains highly useful for most real-world tasks. The ARC-AGI limitation is real but specific — it doesn’t undermine practical AI deployment.
- The gap between AI performance and human generalization is measurable. That’s what makes ARC-AGI valuable: it provides an honest benchmark, not a flattering one.
If you’re building with AI today, the question worth asking isn’t whether your models can pass ARC-AGI-3 — it’s whether they’re deployed for the tasks where they perform well. Start building with MindStudio for free and access 200+ models through a single platform, no API setup required.