ARC AGI 3 Results: GPT-5.4, Claude Opus 4.6, and Gemini 3.1 All Score 0%
Every frontier AI model scored 0% on ARC AGI 3's interactive video game benchmark. Here's what that tells us about the gap between AI and human generalization.
The Score That Stopped the Room
When the ARC AGI 3 results came in, the number that circulated fastest wasn’t impressive — it was the same across every frontier model. GPT-5.4, Claude Opus 4.6, Gemini 3.1. All of them: 0%.
Not single digits. Not a few percentage points that hint at emerging capability. Zero.
The ARC AGI 3 benchmark introduced an interactive video game format that no existing model could handle, and the results landed like a reset button on months of confident predictions about how close AI was getting to human-level generalization. Here’s what the benchmark actually tests, why zero was the result, and what it means for the gap between current AI and genuine fluid intelligence.
How the ARC-AGI Series Measures Generalization
The Abstraction and Reasoning Corpus for Artificial General Intelligence — ARC-AGI — was created by François Chollet to measure something most benchmarks don’t touch: fluid intelligence, not accumulated knowledge.
Most AI benchmarks are essentially memory tests dressed up as reasoning tests. A model performs well because it’s encountered similar patterns in training data. ARC-AGI is specifically designed to make that strategy fail.
The Core Design
Each ARC-AGI task presents a small number of input-output pairs. The model must figure out the underlying rule and apply it to a new input. The tasks use simple visual grids — shapes, colors, spatial relationships — and require no domain expertise or specialized knowledge.
What they do require is genuine inference from minimal examples, the way a human would approach something genuinely unfamiliar.
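The task format can be made concrete with a toy sketch. This is not an actual ARC task, and the "hidden rule" here (mirror each row) is far simpler than real benchmark tasks, but it shows the shape of the problem: a few demonstration pairs, a hypothesis that must fit all of them, and a fresh input to apply it to.

```python
# Toy illustration of the ARC-AGI task format (not a real ARC task).
# Grids are small matrices of color indices; a solver sees a few
# input/output pairs, infers the rule, and applies it to a new input.

def mirror_horizontal(grid):
    """Candidate rule: reflect each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Demonstration pairs generated by the hidden rule.
train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]

def rule_fits(rule, pairs):
    """A hypothesis is accepted only if it reproduces every training pair."""
    return all(rule(inp) == out for inp, out in pairs)

assert rule_fits(mirror_horizontal, train_pairs)

# Apply the inferred rule to a never-before-seen input.
print(mirror_horizontal([[5, 0], [0, 6]]))  # [[0, 5], [6, 0]]
```

The hard part, which this sketch skips entirely, is the search for the rule itself: a human proposes and discards hypotheses until one explains every pair.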
Chollet designed the benchmark around four core knowledge priors — the types of reasoning human infants appear to possess before formal learning begins:
- Objectness and object cohesion — objects are persistent, bounded things that move as units
- Goal-directedness — agents act toward purposes
- Numbers and counting — basic quantity and quantity-change reasoning
- Basic geometry and topology — shape, symmetry, spatial relationships, containment
Any generally intelligent system should already have these. Everything else has to be worked out from the specific examples in each task.
ARC-AGI-1: Eventually Cracked, But Not the Way You’d Hope
When ARC-AGI-1 launched in 2019, the leading models scored under 5%. Years passed before that changed significantly.
By late 2024, OpenAI’s o3 reached roughly 75–88% on ARC-AGI-1 — but at extraordinary compute cost. The model wasn’t demonstrating fluid reasoning so much as brute-force search: generating and checking large numbers of candidate solutions until one fit the examples. Chollet pointed this out publicly. High compute had gamed the benchmark, not solved the underlying problem.
ARC-AGI-2: The Bar Rises
ARC-AGI-2 launched in early 2025, explicitly designed to resist the compute-heavy approach that allowed o3 to perform well on the first version. The tasks were harder, the pattern surface smaller.
Frontier models scored between roughly 4% and 16% on ARC-AGI-2. Humans, without any special training, scored above 60% on the same tasks. The gap between human and AI performance actually widened.
That context matters for understanding what ARC AGI 3 attempted — and why the 0% result isn’t surprising once you understand what changed.
What ARC AGI 3 Actually Tests
ARC AGI 3 is a significant departure from its predecessors. Where the first two versions used static grid tasks — observe examples, infer a rule, apply it once — ARC AGI 3 introduced an interactive video game format.
Each model is placed in a simple game environment with no instructions. It must learn the rules of the game through interaction, then demonstrate mastery.
Why Interactive Games Test Something Different
Static tasks contain all the information you need at the start. You observe, you reason, you respond. The task is closed.
An interactive game is open-ended. The rules only reveal themselves through action and consequence. You do something, observe the result, update your understanding, try something else. Information arrives sequentially and depends on what you’ve done.
This is much closer to how humans learn almost everything in the real world — not from a labeled set of examples, but from poking at the environment and watching what happens.
What the Games Require
The ARC AGI 3 games were designed using the same core knowledge priors as earlier versions, but tested dynamically. Scenarios included things like:
- Objects that behave differently depending on which surfaces they contact
- State changes triggered by specific action sequences, not individual actions
- Goal conditions that shift based on choices made earlier in the game
None of this is described. There are no instructions, tooltips, or rule summaries. The model must infer everything from scratch through interaction.
The Three New Difficulties
The interactive format introduces challenges that static benchmarks don’t have:
- Sequential dependence — Every action changes the environment state. You can’t evaluate all possibilities in parallel.
- Exploration versus exploitation — To learn the rules, you have to try things that might fail. Optimizing only for success prevents you from discovering what you need to know.
- Credit assignment over time — When something goes wrong, the cause might be a decision made several steps earlier. Tracing that requires holding a model of the entire game state across time.
Humans handle all three of these almost automatically. Current language models do not.
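A minimal invented environment makes the first and third difficulties concrete. In this toy example (not from the benchmark), the same action produces different results depending on state set by an earlier action, so candidate moves can't be scored in isolation, and a failure traces back to a decision made steps before.

```python
# Toy environment (invented for illustration) showing sequential
# dependence: the effect of an action depends on state produced by
# every action before it, so actions can't be evaluated in parallel.

class ToggleLock:
    """The door opens only if 'press' happens while the switch is on;
    flipping the switch changes what a later 'press' does."""

    def __init__(self):
        self.switch_on = False
        self.door_open = False

    def step(self, action):
        if action == "flip":
            self.switch_on = not self.switch_on
        elif action == "press":
            self.door_open = self.switch_on
        return self.door_open

env = ToggleLock()
env.step("press")            # fails silently: switch is off
opened_early = env.door_open
env.step("flip")             # changes the meaning of the next 'press'
env.step("press")
print(opened_early, env.door_open)  # False True
```

Credit assignment is exactly the question this raises: when the first "press" fails, the agent has to realize the cause was a missing "flip", not the "press" itself.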
Why Every Model Scored Zero
The 0% result across GPT-5.4, Claude Opus 4.6, and Gemini 3.1 wasn’t statistical noise. It reflects specific, structural limitations that the interactive format exposed clearly.
Pattern Matching Doesn’t Transfer to Novel Games
Frontier models are, in large part, very capable pattern matchers. Given a problem that resembles something in their training distribution, they often produce excellent results. That’s not a criticism — it’s a genuine and useful capability.
But ARC AGI 3’s games are procedurally generated with novel rule combinations. There’s no prior pattern to recognize. No similar game in training data. No heuristic that generalizes.
Without a pattern to draw on, models fall back on surface plausibility — taking actions that seem locally reasonable rather than informationally useful. They can’t distinguish between “this looks like a good move” and “this move will teach me something I need to know.”
Language Models Aren’t Built for Sequential Interaction
There’s also a structural mismatch. A language model processes a context window and produces an output. That’s a one-shot operation, not a sequential decision loop.
To play an interactive game, you have to wrap a language model in an agent loop — feed observations as tokens, have it choose an action, execute it, feed back the result, repeat. The model isn’t natively doing this; it’s being asked to reason about a described situation at each step.
This works reasonably well for tasks with low sequential complexity. It breaks down when the task requires maintaining a consistent world model across many steps, especially when new evidence contradicts earlier assumptions and the model needs to update precisely rather than hedge.
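The agent loop described above can be sketched in a few lines. The model call is a stub here (a real implementation would call an LLM API; the cycling policy below is purely a placeholder), but the structure is the point: the entire interaction history is re-serialized into text at every step, because the model holds no persistent game state of its own.

```python
# Sketch of the observe-act loop used to wrap a language model for
# interactive tasks. `query_model` is a hypothetical stand-in for an
# LLM API call; it just cycles through legal actions deterministically.

def query_model(prompt):
    # Placeholder policy, NOT a real model: pick an action based on
    # how many steps appear in the serialized history.
    actions = ["up", "down", "left", "right"]
    return actions[prompt.count("step") % len(actions)]

def run_episode(env_step, initial_obs, max_steps=10):
    history = []
    obs = initial_obs
    for _ in range(max_steps):
        # Re-serialize the whole interaction so far into the prompt.
        prompt = "".join(f"step obs={o} act={a}\n" for o, a in history)
        prompt += f"step obs={obs} act="
        action = query_model(prompt)
        history.append((obs, action))
        obs, done = env_step(action)
        if done:
            break
    return history

# Example: a trivial environment that never terminates, run for 3 steps.
steps = run_episode(lambda a: ("blank", False), "start", max_steps=3)
print(len(steps))  # 3
```

Everything the model "knows" about the game lives in that growing prompt, which is why long episodes strain consistency: one mis-stated fact in the serialized history propagates into every later decision.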
The Exploration Problem
The most diagnostic failure: models couldn’t explore effectively.
To learn a game’s rules, you need to take actions whose value is informational, not immediately rewarding. You need to deliberately poke at the environment — “what happens if I do this even though it might fail?” — because that’s how you discover what you need to know.
Language models are trained on prediction tasks. They learn to produce likely, contextually appropriate outputs. That training doesn’t instill a drive to explore unknown states. When placed in an unfamiliar game, they converge on plausible-looking behaviors without actually discovering the underlying rules.
A human playing the same game for three minutes will methodically test edge cases out of curiosity. That behavior is nearly absent in current language models operating in novel interactive environments.
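The contrast between plausible-looking behavior and information-seeking behavior can be shown with a deliberately tiny setup (invented for illustration, not drawn from the benchmark): several actions, only one of which does anything. A policy that repeats the familiar default never finds the rule; a policy that prioritizes untried actions finds it within a few steps.

```python
# Toy contrast between a "plausible default" policy and an
# information-seeking policy. Only one hidden action has any effect.

ACTIONS = ["a", "b", "c", "d", "e"]
SECRET = "d"  # hidden from the agent: only this action produces a signal

def probe(action):
    return action == SECRET

def greedy_policy(tried):
    # Always takes the most "plausible" default action.
    return ACTIONS[0]

def exploring_policy(tried):
    # Prefers actions it has never taken: value is informational.
    untried = [a for a in ACTIONS if a not in tried]
    return untried[0] if untried else ACTIONS[0]

def discover(policy, budget=10):
    tried = set()
    for _ in range(budget):
        action = policy(tried)
        tried.add(action)
        if probe(action):
            return True  # rule discovered
    return False

print(discover(greedy_policy), discover(exploring_policy))  # False True
```

The exploring policy "wastes" three moves on actions that do nothing, and that waste is precisely what buys it the answer, which is the trade-off current models handle poorly.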
What Zero Actually Means
It’s tempting to interpret 0% as a verdict — either that AI progress is stalling, or that language models are fundamentally unsuited to general reasoning. Neither reading is quite right.
It’s a Precise Measurement, Not a Catastrophe
ARC AGI 3 tests one specific capability: learning novel interactive rules through sequential exploration with no prior description. Zero percent means current frontier models can't do this at all, let alone reliably.
That’s honest information, not a disaster. The ARC Prize benchmark series was designed explicitly to stay ahead of model capabilities — when models crack a version, the next version closes the loopholes. ARC AGI 3 is doing exactly what it was designed to do.
It Confirms a Real and Significant Gap
What the result does confirm is that the gap between AI and human generalization is real, substantial, and not trivially addressed by scaling existing architectures.
Human performance on ARC AGI 3 exceeds 60% without gaming expertise or special preparation. That’s not because humans are uniquely gifted — it’s because human intelligence applies general reasoning strategies (explore, hypothesize, test, update) that transfer across genuinely novel environments.
GPT-5.4, Claude Opus 4.6, and Gemini 3.1 are each impressive systems. But their capabilities are anchored in training distribution patterns in ways that human reasoning isn’t. ARC AGI 3 specifically probes the gap between those two things.
What This Means for the “AGI is Near” Debate
Since o3’s performance on ARC-AGI-1, a significant wave of commentary has argued that AGI is imminent. ARC AGI 3 adds real evidence to the other side.
Fluid intelligence — reasoning from first principles in genuinely novel, interactive contexts — remains a clear gap. Benchmark performance that looks impressive can reflect sophisticated pattern recognition more than genuine generalization. That’s a live debate, but ARC AGI 3 contributes a clear and specific data point to it.
What This Means for Teams Building with AI
For people building AI systems rather than evaluating them, the ARC AGI 3 results have practical implications — though probably not the ones the headlines suggest.
Current Models Are Still Highly Capable for the Right Tasks
Frontier models remain genuinely excellent at:
- Tasks with rich prior training signal: coding, writing, extraction, summarization
- Following complex, structured instructions
- Reasoning within well-defined problem boundaries
- Generating creative output in established formats
None of that changes because of the ARC AGI 3 results. For the vast majority of applied AI use cases — customer support, document processing, workflow automation, code assistance — current models are capable and practical.
Where to Be Careful
Models are less reliable when:
- The task requires figuring out rules that aren’t documented or described anywhere
- Success depends on recognizing when a prior assumption is wrong and updating cleanly
- The environment changes state in ways that need consistent tracking across many steps
- Exploration is required before optimization can begin
These are the conditions ARC AGI 3 specifically tests. They’re also conditions that show up in real-world agent deployments when the scope is too loosely defined.
Agentic AI Still Needs Careful Design
The industry’s move toward agentic systems — AI that takes sequences of actions, uses tools, and operates with some autonomy — runs directly into the limitations ARC AGI 3 exposed.
Agents operating in well-scoped domains with established rules (CRM updates, scheduled data processing, document routing) can work well. Agents asked to operate in dynamic, under-specified environments still break in predictable ways. That’s not insurmountable, but it requires deliberate task scoping and meaningful human oversight.
Where MindStudio Fits When No Model Does Everything
One practical takeaway from ARC AGI 3 is that model selection matters more than people often acknowledge. Different models have different strengths, and no single frontier model handles every task equally well.
MindStudio gives you access to 200+ AI models — including GPT, Claude, and Gemini — without managing separate API accounts. You can build a workflow that routes different tasks to different models based on what each does best, and swap in newer models as the landscape shifts.
This matters practically: if you’re building an AI agent for a real business workflow, the right question isn’t “which frontier model is best?” It’s “which model is best for this specific subtask, and how do I chain them together reliably?”
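The routing idea reduces to a small dispatch pattern. This sketch is generic and illustrative; the model names and the dispatch table are hypothetical, and it is not MindStudio's actual API, just the shape of the decision a multi-model workflow encodes.

```python
# Generic sketch of task-to-model routing (illustrative; model names
# and the routing table are hypothetical, not any platform's real API).

ROUTES = {
    "summarize": "model-a",
    "extract":   "model-b",
    "code":      "model-c",
}

def route(task_type, default="model-a"):
    # Unknown task types fall back to a sensible default model.
    return ROUTES.get(task_type, default)

def run_workflow(tasks):
    # Returns (task_type, model) assignments; a real system would call
    # each model's API in turn and chain the outputs.
    return [(t, route(t)) for t in tasks]

print(run_workflow(["extract", "code", "translate"]))
# [('extract', 'model-b'), ('code', 'model-c'), ('translate', 'model-a')]
```

The table itself should come from your own side-by-side evaluations on real tasks, not from benchmark headlines.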
For teams building automated workflows with AI, MindStudio’s no-code builder makes it practical to experiment across models side by side, see where each performs best on your actual tasks, and configure guardrails for the cases where models tend to drift. You don’t need to wait for a model that scores 100% on ARC AGI 3 — you need a model that’s reliable for the specific, well-defined tasks your workflow requires.
You can start building for free at mindstudio.ai.
Frequently Asked Questions
What is ARC AGI 3?
ARC AGI 3 is the third version of the Abstraction and Reasoning Corpus for Artificial General Intelligence benchmark, designed to measure fluid intelligence in AI systems. Unlike standard benchmarks that reward memorization, ARC-AGI tests the ability to infer novel rules from minimal examples. ARC AGI 3 introduced an interactive video game format, requiring models to discover game rules through active exploration rather than observing static examples.
Why did GPT-5.4, Claude Opus 4.6, and Gemini 3.1 all score 0%?
The interactive format exposed three specific weaknesses: current language models can’t explore unknown environments effectively, they struggle to maintain consistent world models across many sequential steps, and they converge on plausible-looking actions rather than informationally useful ones. The games use procedurally generated rules with no analog in training data, so pattern matching — which these models rely on heavily — provides no advantage.
How do humans perform on ARC AGI 3?
Humans score above 60% on ARC AGI 3’s interactive tasks without gaming expertise or special preparation. The gap between human and AI performance on this benchmark is larger than on ARC-AGI-1 or ARC-AGI-2. Human performance reflects general reasoning strategies — deliberate exploration, hypothesis testing, state tracking — that transfer across novel environments.
Does a 0% score mean AI progress has stalled?
No. The result means current architectures can’t solve this specific type of task — learning novel interactive rules through sequential exploration — not that AI progress in general has reversed. Frontier models today are substantially more capable than they were two years ago across most applied tasks. ARC AGI 3 is designed to measure the frontier of what models can’t yet do; a zero result confirms the benchmark is still ahead of current capabilities.
What’s the difference between ARC-AGI-1, ARC-AGI-2, and ARC AGI 3?
ARC-AGI-1 used static visual grid tasks. Models eventually reached 75–88% on it using high compute, largely through brute-force solution search. ARC-AGI-2 was harder and more resistant to that approach; frontier models scored 4–16%. ARC AGI 3 added interactive video games, requiring sequential exploration and real-time rule learning with no prior description. Each version was designed to close the loopholes that allowed good performance on the previous one without demonstrating genuine generalization.
Should these results change how I use AI in my products?
For most applications, no. ARC AGI 3 measures one specific and narrow capability: fluid intelligence in novel interactive environments. Customer support tools, document automation, coding assistance, and similar applications sit well within the distributions where current models perform reliably. The results matter most for teams building agents that need to operate in genuinely unknown, dynamically changing environments — where deliberate task scoping and human oversight become more important, not less.
Key Takeaways
- Every frontier model — GPT-5.4, Claude Opus 4.6, Gemini 3.1 — scored 0% on ARC AGI 3’s interactive video game benchmark, marking a complete failure on the fluid intelligence task it tests.
- The interactive format exposes three specific weaknesses: poor exploration behavior, difficulty maintaining sequential world models, and pattern matching that doesn’t transfer to procedurally novel environments.
- Humans score above 60% on the same tasks — the human-AI generalization gap is larger on ARC AGI 3 than on any previous version of the benchmark.
- A 0% result doesn’t mean AI is regressing. It means the benchmark is measuring a capability that current architectures genuinely haven’t solved.
- For AI builders, the practical lesson is to scope tasks to domains where models are reliable, use appropriate oversight for agentic workflows, and treat model selection as a real design decision rather than an afterthought.
- Platforms like MindStudio, which provide access to 200+ models and make multi-model workflow design practical, let you build AI applications that play to current model strengths rather than betting on capabilities that don’t yet exist.
Ready to build something with the AI models that are actually available today? Start free at mindstudio.ai.