
What Is ARC AGI 3? The Interactive AI Benchmark Humans Solve at 100%

ARC AGI 3 is the first interactive AGI benchmark where AI scores under 1% while humans hit 100%. Here's how it works and what it reveals about generalization.

MindStudio Team

When AI Scores Under 1% — and Humans Score 100%

That gap is real. And it’s the whole point of ARC AGI 3.

Released by the ARC Prize team in 2025, ARC AGI 3 is a benchmark designed to test whether AI systems can genuinely reason and adapt — not just pattern-match against training data. The result so far: humans solve it essentially every time, while frontier AI models fail almost all of the tasks.

If you’ve followed AI benchmarks before, ARC AGI 3 represents something meaningfully new. It introduces an interactive element that makes the task fundamentally harder for AI — and reveals a gap in current AI capabilities that raw scores on traditional tests tend to hide.

This article explains what ARC AGI 3 is, how it differs from prior ARC benchmarks, what the tasks actually look like, and what the human-AI performance gap tells us about where AI development stands right now.


The ARC Benchmark Series: A Quick Background

ARC stands for Abstraction and Reasoning Corpus. It was created by François Chollet — the engineer behind Keras and one of the more careful thinkers about what intelligence actually means — and published in 2019.

The core idea is straightforward: if you want to test general intelligence, don’t test memorization. Test the ability to figure out entirely new rules from scratch.

ARC tasks use colored grids. You’re shown a few input-output grid pairs that all follow the same hidden rule. Then you’re given a new input and asked to produce the correct output by applying that rule. The patterns are abstract — things like “reflect the shape,” “fill in the missing piece,” or “apply this transformation to colored cells.” Simple enough that any adult human can solve them. Hard enough to stump systems trained on enormous amounts of data.
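To make the format concrete, here is a toy ARC-style task in Python. This is an illustrative stand-in, not an official ARC task: real ARC tasks are distributed as JSON grids of integers 0–9 (representing colors), and the hidden rules are far more varied than this one.

```python
# Toy ARC-style task: the hidden rule is "mirror the grid horizontally".
# Real ARC tasks use JSON grids of integers 0-9; this is an
# illustrative stand-in, not an official task.

def mirror_horizontal(grid):
    """Apply the hidden rule: reverse each row."""
    return [list(reversed(row)) for row in grid]

# Demonstration pairs the solver would be shown:
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 0], [0, 0, 6]], [[0, 5, 4], [6, 0, 0]]),
]

# The rule must explain every demonstration pair...
for inp, out in train_pairs:
    assert mirror_horizontal(inp) == out

# ...and the solver's job is to infer it, then apply it to a new input.
test_input = [[7, 0, 0], [0, 8, 0]]
print(mirror_horizontal(test_input))  # [[0, 0, 7], [0, 8, 0]]
```

The solver, of course, is never shown `mirror_horizontal` itself; it sees only the demonstration pairs and must reconstruct the rule from them.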

ARC-AGI-1: The Original

ARC-AGI-1 launched around 2020 with prize money offered for AI systems that could match human performance. For years, the best systems struggled to crack 30–40%, while humans solved the tasks at rates near 85–98%.

OpenAI’s o3 model eventually scored high on ARC-AGI-1, which was treated as a milestone — though at significant computational cost.

ARC-AGI-2: Raising the Bar

ARC-AGI-2 arrived in early 2025 and addressed ways that AI systems had started to exploit statistical regularities in the original dataset. The human baseline stayed near 99%. Top AI models, including o3, scored in the low single digits.

ARC-AGI-2 made clear that even impressive-looking AI systems weren’t genuinely generalizing. They were doing something more like statistical approximation — close enough to fool some tests, not close enough to actually figure out novel rules.


What Makes ARC AGI 3 Different: The Interactive Component

ARC AGI 3 changes the rules in a fundamental way. Previous ARC tasks were static — you see examples, you produce an answer. ARC AGI 3 is interactive.

In ARC AGI 3, solvers — human or AI — can take actions within the environment and observe results before committing to a final answer. You can probe the system: make a move, see what happens, adjust your theory, try again. It’s closer to running an experiment than taking a test.

This mirrors how humans actually learn and reason. When faced with something unfamiliar, people don’t just stare at a few examples and output an answer. They poke at the problem. They test hypotheses. They use feedback to eliminate wrong theories and confirm right ones.

Why This Changes Everything

The interactive format breaks the strategy that current AI systems rely on most heavily: pattern completion within a fixed context window.

When a language model looks at a static ARC task, it applies its training to recognize patterns that resemble things it’s seen before. That’s limited, but it’s at least a workable strategy for some tasks.

With ARC AGI 3, the task is no longer about recognizing a pattern — it’s about running experiments, updating your model of the system based on results, and using that updated model to solve the puzzle. That’s a fundamentally different cognitive operation.

Current AI architectures aren’t well-suited to active, hypothesis-driven exploration. They generate outputs but don’t naturally answer the question: “What should I try next to learn the most about this system?”


How ARC AGI 3 Tasks Actually Work

The tasks still use the basic visual grid format that made previous ARC benchmarks accessible. But instead of passively observing examples, you interact with the environment directly.

A typical ARC AGI 3 task might work like this:

  1. You’re presented with a grid and an interface for interacting with it.
  2. You take actions — click cells, enter values, make moves — and the environment responds.
  3. Based on what you observe from your actions, you build up a theory of the underlying rule.
  4. Once you understand the rule well enough, you apply it to produce the correct final output.

The number of interactions allowed is limited, so you can’t brute-force it by trying every possibility. You have to reason about which experiments will give you the most useful information, which is itself a reasoning challenge.
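The loop described above can be sketched as hypothesis elimination under an action budget. Everything here is an assumption for illustration — the environment, the candidate rules, and the budget are simplified stand-ins, not the actual ARC AGI 3 interface.

```python
# Sketch of the interactive loop: maintain candidate rules, probe the
# environment, and eliminate rules that disagree with observed feedback.
# The environment and rule set are illustrative assumptions, not the
# real ARC AGI 3 task format.

def env_step(cell):
    """Hidden environment behavior: the true rule doubles the value."""
    return cell * 2

candidate_rules = {
    "add_two": lambda x: x + 2,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

action_budget = 3
probes = [2, 3, 5]  # which inputs to try; choosing informative probes matters

for probe in probes[:action_budget]:
    observed = env_step(probe)  # take an action, observe the result
    # Keep only rules consistent with everything seen so far.
    candidate_rules = {
        name: rule for name, rule in candidate_rules.items()
        if rule(probe) == observed
    }
    if len(candidate_rules) == 1:
        break  # enough information to commit to an answer

print(sorted(candidate_rules))  # the surviving hypothesis
```

Note the experiment-design point hiding in the probe list: probing with 2 eliminates nothing, because all three candidate rules happen to agree there (2+2, 2×2, and 2² are all 4). Probing with 3 is what separates them. That is exactly the "which experiment is most informative?" reasoning the benchmark demands.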

The Human Experience

For humans, this feels intuitive. You try something, see what happens, think “okay, so it’s doing X,” test that theory, confirm it, and solve the puzzle. The whole process takes a minute or two per task.

Most people find it engaging. The feedback loop is satisfying: you’re genuinely figuring something out, not retrieving information you already have stored.

The AI Experience

For AI systems, this is much harder. To solve ARC AGI 3 tasks effectively, an AI would need to:

  • Decide which actions to take to gather useful information
  • Update its belief model about the task based on feedback
  • Know when it has enough information to commit to a solution
  • Avoid both premature answers and inefficient, scattered exploration

Current large language models and reasoning models aren’t built for this loop. They can be prompted to take actions in agentic settings, but the kind of structured, efficient hypothesis testing required by ARC AGI 3 isn’t something they do reliably. Hence the sub-1% score.


Why Humans Score 100% and AI Scores Under 1%

The gap is large enough to deserve a direct explanation.

Humans don’t score 100% on ARC AGI 3 because the tasks are trivial. They score 100% because humans are genuinely good at exactly the kind of reasoning these tasks require: building mental models of new systems through interaction.

This ability is often called fluid intelligence — the capacity to reason about novel situations without relying on prior knowledge. It’s distinct from crystallized intelligence, which is accumulated knowledge and skill. ARC benchmarks specifically target fluid intelligence, which is why memorizing training data doesn’t help.

Why AI Struggles

Current AI systems — including frontier models like GPT-4o, Claude, and o3 — are extraordinarily good at retrieving and combining information from their training data. When they appear to “reason,” they’re often doing something closer to sophisticated pattern completion.

ARC AGI 3 is specifically designed to make this approach fail. The rules in any given task are novel. No amount of training data helps you recognize a rule you’ve never encountered — you have to figure it out through interaction.

That’s not what current AI systems do well. They’re not running experiments. They’re not building causal models of unfamiliar systems. They’re not efficiently allocating an exploration budget to learn as much as possible from a limited number of interactions.

The under-1% score isn’t a sign that the models are broken. It’s a sign that the benchmark is measuring something genuinely different from what those models are optimized for.

What About AI Agents?

You might wonder whether agentic frameworks — systems designed to take sequences of actions and observe results — would do better. In principle, yes: an agent architecture is at least structurally capable of the kind of interactive exploration ARC AGI 3 requires.

In practice, current AI agents still struggle because the underlying models aren’t doing the right kind of reasoning. Having the ability to take actions doesn’t automatically mean you’ll take the right actions efficiently. If you’re interested in how AI agents work more broadly, understanding the architecture behind agentic AI helps clarify why this gap exists.


What the Benchmark Gap Reveals About AI Generalization

ARC AGI 3 is valuable not just as a competition but as a diagnostic. The performance gap tells us something specific and useful about the current state of AI.

Memorization Is Not Generalization

A persistent problem in AI benchmarking is that scores can be inflated by training data overlap — models that have seen test cases or similar examples during training will score better not because they reason but because they’ve seen the answer before.

ARC tasks are designed to resist this. The rules are novel enough that memorization doesn’t work. ARC AGI 3 goes further by making the task interactive, which rules out the possibility of matching a static pattern from memory.

Scaling Alone Doesn’t Solve This

One of the most significant implications of ARC AGI 3 is that scaling — bigger models, more training data, more compute — doesn’t automatically close the gap.

ARC-AGI-1 and ARC-AGI-2 saw incremental improvement from more capable models. ARC AGI 3’s interactive structure appears to hit a wall that scale doesn’t fix. You likely need a different architecture, a different approach to learning, or both.

This matters for the broader AGI discussion. If general reasoning requires the kind of interactive, hypothesis-driven problem solving that ARC AGI 3 tests, the current scaling paradigm may not get us all the way there on its own.

What Would Actually Solve It?

Researchers have proposed a few directions:

  • Program synthesis: AI systems that generate programs (rather than text) to represent hypothesized rules, then test those programs against observations
  • Neuro-symbolic hybrids: Systems combining neural networks (strong at perception and pattern recognition) with symbolic reasoning engines (strong at logical inference and hypothesis testing)
  • Active learning architectures: Systems specifically designed to decide what information to gather next and update beliefs efficiently
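The program-synthesis direction can be shown in miniature: enumerate short programs built from a small set of primitives and keep whichever ones reproduce every observation. The primitive set and the example pair below are hypothetical, and real synthesis systems search vastly larger program spaces with much smarter heuristics; this is a sketch of the idea, not a working ARC solver.

```python
from itertools import product

# Miniature program synthesis: search compositions of grid primitives
# until one explains all observed input-output pairs. The primitives
# and observations are illustrative assumptions.

PRIMITIVES = {
    "flip_rows": lambda g: g[::-1],
    "flip_cols": lambda g: [row[::-1] for row in g],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def run_program(program, grid):
    """Apply a sequence of primitives to a grid, in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

observations = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),  # consistent with flip_cols
]

# Enumerate candidate programs, shortest first.
solutions = []
for length in (1, 2):
    for program in product(PRIMITIVES, repeat=length):
        if all(run_program(program, inp) == out
               for inp, out in observations):
            solutions.append(program)

print(solutions[0])  # shortest program consistent with the data
```

The appeal of this approach is that the hypothesis is an explicit, testable artifact (a program) rather than an opaque pattern match, which is precisely the property ARC-style tasks reward.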

None of these has solved ARC AGI 3 yet. The benchmark remains an open problem that points to real, specific gaps in current AI capabilities.


How Model Selection Matters for Reasoning Tasks

ARC AGI 3 makes one thing very clear: not all AI models handle complex, multi-step reasoning equally well. The difference between a model that scores 0.5% and one that scores 3% on a hard benchmark isn’t arbitrary — it reflects genuine differences in architecture and training approach.

For people building AI-powered tools and workflows, this matters practically. The right model for tasks requiring multi-step reasoning, working through novel problems, or handling ambiguous inputs is often different from the right model for text summarization or content generation.

This is one of the core arguments for testing AI models across real-world use cases rather than picking one based on general reputation. Benchmark scores give you signal, but your specific task is the real benchmark.

MindStudio gives you access to 200+ AI models — including o3, Claude, Gemini, and other frontier reasoning models — without separate API accounts for each. When you’re building AI agents and automated workflows, you can swap models in and out quickly to test which one actually handles your reasoning tasks best, not just which one scores highest on generic public leaderboards.

The visual builder makes model comparison practical — you change the model in your workflow and run the same test case through both. That kind of empirical comparison is more useful than any benchmark for figuring out which model fits your specific problem. You can start building for free at mindstudio.ai.


Frequently Asked Questions About ARC AGI 3

What is ARC AGI 3?

ARC AGI 3 is an interactive AI benchmark designed to test genuine reasoning and generalization. It’s the third major version of the Abstraction and Reasoning Corpus benchmark, created by François Chollet and the ARC Prize team. Unlike previous versions, ARC AGI 3 requires solvers to take actions and observe feedback before producing a final answer. Humans solve it at essentially 100% accuracy. Current AI systems score under 1%.

How is ARC AGI 3 different from ARC-AGI-1 and ARC-AGI-2?

ARC-AGI-1 and ARC-AGI-2 are static benchmarks: you observe input-output grid examples, identify the pattern, and apply it to a new input. ARC AGI 3 is interactive — you can probe the environment by taking actions and observing results, which is how you figure out the underlying rule. This interactive structure is fundamentally harder for AI systems that rely on completing patterns in a fixed context rather than running active experiments.

Why do humans score 100% on ARC AGI 3?

Humans naturally use interactive reasoning — testing hypotheses, observing results, updating their understanding. This is exactly what ARC AGI 3 requires. The tasks aren’t trivially easy; they demand genuine abstract thinking. But humans are well-suited to exploratory, feedback-driven problem solving in a way that current AI systems aren’t. The 100% score reflects a real cognitive strength, not an easy test.

Why do AI systems score under 1% on ARC AGI 3?

Current AI systems — including large language models and reasoning models — are primarily good at pattern completion within a fixed context. ARC AGI 3 requires something different: structured exploration, efficient hypothesis testing, and building a causal model of an unfamiliar system through limited interactions. This is not what current architectures do naturally, which is why even state-of-the-art models fail almost all of the tasks.

Does a high ARC AGI score mean a model is more intelligent?

Not necessarily in a general sense, but ARC benchmarks do measure something important: the ability to reason about novel patterns without relying on prior knowledge. A model that scores higher on ARC benchmarks demonstrates better genuine generalization — applying reasoning to situations it hasn’t specifically trained on. This is directly relevant to AGI research because it targets fluid intelligence rather than accumulated knowledge retrieval.

Who created ARC AGI 3 and what is the ARC Prize?

ARC AGI 3 was developed by François Chollet and the ARC Prize team. The ARC Prize is a competition and research initiative offering substantial prize money to incentivize AI systems that can genuinely solve ARC tasks through reasoning. The competition is designed to push the field toward AI that generalizes, with ARC AGI 3 being the latest and most demanding benchmark in that series.


Key Takeaways

  • ARC AGI 3 is an interactive AI benchmark where solvers probe the environment and receive feedback — a structure that current AI systems find nearly impossible to navigate effectively.
  • Humans solve ARC AGI 3 tasks at ~100% accuracy. Frontier AI models score under 1%.
  • The gap exists because ARC AGI 3 tests fluid intelligence — reasoning about genuinely novel situations — rather than pattern matching against training data.
  • The benchmark challenges the assumption that scaling alone (bigger models, more data, more compute) will produce general reasoning capabilities.
  • Solving ARC AGI 3 likely requires new approaches: program synthesis, neuro-symbolic reasoning, or architectures built for active, hypothesis-driven learning.
  • For practical AI work, the lesson is the same: model selection matters, and testing models against real tasks is more reliable than trusting any single benchmark.

If you want to experiment with frontier reasoning models and build multi-step AI workflows without juggling multiple API subscriptions, MindStudio gives you access to 200+ models in one place. It’s free to start.
