
ARC AGI 2 vs Pencil Puzzle Bench: The Benchmarks That Expose AI Capability Gaps

These two benchmarks test reasoning you can't fake with training data. See how GPT-4o, the o-series reasoning models, Claude, Gemini, and Chinese models actually compare.

MindStudio Team

Why Most AI Benchmarks Are Lying to You

The standard way to measure AI progress is increasingly unreliable. Benchmarks like MMLU, HumanEval, and GSM8K were useful when they launched, but they’ve been compromised by a fundamental problem: models train on data that includes those tests, their formats, and closely related examples. Strong benchmark scores no longer tell you whether a model can actually reason — they tell you how well it pattern-matched to a particular corpus.

Two benchmarks are cutting through that noise right now: ARC-AGI-2 and Pencil Puzzle Bench. Both test reasoning that can’t be faked by memorization. Both are exposing capability gaps between frontier models that feel similar on standard evals but diverge sharply when the shortcuts disappear.

This article breaks down what each benchmark tests, how major models — including GPT-4o and the o-series, Claude 3.5/3.7 Sonnet, Gemini, and Chinese models like DeepSeek and Qwen — actually perform, and what those results reveal about where AI reasoning really stands.


What ARC-AGI-2 Actually Tests

The Abstraction and Reasoning Corpus (ARC) was created by François Chollet in 2019 as a direct challenge to how machine intelligence is measured. The core premise: if a system has truly learned to reason, it should be able to solve novel visual puzzles it has never been trained on, the same way a human can.

The Structure of the Task

ARC tasks present a set of input-output grid pairs, each made up of colored cells on a canvas that can range from 1×1 up to 30×30. Your job is to identify the transformation rule connecting input to output, then apply that rule to a new input grid.

The rules are always learnable from a handful of examples — typically 3 to 5 pairs. But the rules change entirely between tasks. There’s no recurring pattern across the dataset. Every puzzle is novel.

This design forces genuine generalization. You can’t win by recognizing “I’ve seen this type of problem before.” You have to infer the rule from scratch each time.
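To make the task concrete, here is a toy sketch of the loop ARC demands: propose a transformation, verify it against every training pair, and only then apply it to the test input. The grids and the two-rule "library" below are made up for illustration; real ARC rules are far more compositional than a single rotation or flip.

```python
# Toy sketch of the ARC solving loop: propose a rule, check it against
# all training pairs, then apply it to the test input. Illustrative only.

def rotate_90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

CANDIDATE_RULES = {"rotate_90": rotate_90, "flip_horizontal": flip_horizontal}

def infer_rule(train_pairs):
    """Return the first candidate rule consistent with every training pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in train_pairs):
            return name, rule
    return None, None

# Two training pairs generated by the same hidden rule (here: rotation).
train = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 2], [0, 0]], [[0, 0], [0, 2]]),
]
name, rule = infer_rule(train)
test_input = [[3, 0], [0, 0]]
print(name, rule(test_input))  # → rotate_90 [[0, 3], [0, 0]]
```

Note that the first training pair alone is ambiguous (it is consistent with both rotation and a horizontal flip), which is exactly why ARC supplies several example pairs per task.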

Why ARC-AGI-2 Is Harder

ARC-AGI-1 became progressively less useful as a benchmark. Models trained on synthetic data resembling ARC tasks improved dramatically, and OpenAI's o3 model scored roughly 76% on ARC-AGI-1's semi-private evaluation set at standard compute (reportedly higher still at high-compute settings), approaching the human baseline of roughly 85%.

So the ARC Prize team released ARC-AGI-2 in early 2025. It makes three key changes:

  1. More compositional rules. Each transformation involves multiple layered operations — not just “rotate the shape” but “rotate the shape, then apply it only to cells that match a condition, then reflect the result.”

  2. Smaller training signal. The example pairs are designed to be less redundant, meaning there’s less information per example to work with.

  3. Novel object types and relationships. The visual vocabulary is expanded in ways that specifically resist solutions based on prior ARC-style training.

The result: even o3, which handled ARC-AGI-1 respectably, struggles on ARC-AGI-2. Early reported scores put top models in the low single digits — below 10% accuracy on the public evaluation set. Human solvers still manage around 60-80% depending on the task subset.

That gap is the point.


What Pencil Puzzle Bench Tests

Pencil Puzzle Bench takes a different approach. Instead of visual grid transformations, it evaluates models on classic pen-and-paper logic puzzles — the kind you’d find in puzzle magazines or dedicated puzzle apps.

The Puzzle Types

The benchmark includes a wide range of constraint-satisfaction puzzles:

  • Sudoku — fill a 9×9 grid so every row, column, and box contains digits 1–9 exactly once
  • Nonograms (Picross) — determine which cells to fill based on row and column clue sequences
  • Kakuro — fill digits 1–9 into a crossword-style grid so each entry sums to its clue, with no repeated digits per entry
  • Slitherlink — draw a single closed loop on a grid following numerical constraints
  • Nurikabe — determine which cells are “water” and which form numbered islands

What these puzzles share: they have explicit, fully-specified rules. There’s no ambiguity about what you’re supposed to do. Every puzzle has exactly one correct solution derivable through logical deduction.
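Because the rules are fully specified, checking a candidate solution is purely mechanical. As an illustration (a minimal sketch, not part of the benchmark's own harness), here is what the Sudoku rule check looks like:

```python
# Minimal Sudoku validity check: every row, column, and 3x3 box must
# contain no repeated digits (0 marks an empty cell).

def unit_ok(cells):
    """A unit is valid if its filled cells are all distinct."""
    filled = [c for c in cells if c != 0]
    return len(filled) == len(set(filled))

def sudoku_valid(grid):
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[3 * br + r][3 * bc + c] for r in range(3) for c in range(3)]
        for br in range(3) for bc in range(3)
    ]
    return all(unit_ok(u) for u in rows + cols + boxes)

grid = [[0] * 9 for _ in range(9)]
grid[0][0], grid[0][5] = 7, 7   # duplicate 7 in row 0
print(sudoku_valid(grid))       # False
```

The check is trivial for a program; the hard part, as the next section explains, is producing a solution that passes it.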

Why This Is Hard for LLMs

The difficulty isn’t the rules. It’s the process.

Solving a nonogram or a Slitherlink puzzle requires maintaining a consistent internal state across many interdependent steps. You fill in one cell, which constrains another, which opens up a deduction in a third row, and so on. Any inconsistency — holding a wrong assumption for even one step — propagates into a wrong answer.

Language models are famously bad at this kind of multi-step constraint tracking. They tend to make early assumptions that feel locally plausible but violate global constraints they haven’t checked yet. By the time the error is detectable, the model has committed to a path.

Pencil Puzzle Bench measures whether models can maintain logical consistency across the full solve, not just recognize the type of puzzle or generate plausible-looking output.
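To see how quickly that interdependence builds, consider a single nonogram line. A clue lists the run lengths of filled cells, and even one clue sharply narrows the legal fill patterns; solvers intersect those candidate sets across rows and columns, committing any cell that is forced to the same value in every candidate. A small illustrative enumerator (not the benchmark's own tooling):

```python
# For one nonogram line, enumerate every 0/1 fill pattern consistent
# with its clue (the run lengths of filled cells).

def line_patterns(clue, length):
    """Yield all 0/1 patterns of `length` whose runs of 1s match `clue`."""
    def place(runs, start, acc):
        if not runs:
            # No runs left: pad the rest of the line with empty cells.
            yield acc + [0] * (length - len(acc))
            return
        run, rest = runs[0], runs[1:]
        # Last legal start for this run, leaving room for the remaining
        # runs plus one separating gap between each pair of runs.
        max_start = length - (sum(runs) + len(rest))
        for s in range(start, max_start + 1):
            prefix = acc + [0] * (s - len(acc)) + [1] * run
            yield from place(rest, s + run + 1, prefix)
    yield from place(list(clue), 0, [])

print(list(line_patterns([3], 5)))
# → [[1, 1, 1, 0, 0], [0, 1, 1, 1, 0], [0, 0, 1, 1, 1]]
```

For the clue [3] on a five-cell line, only three placements are legal, and the middle cell is filled in all of them, so it can be committed immediately; in a full puzzle, that commitment in turn tightens the intersecting column. Tracking hundreds of such deductions without one slip is precisely what models find hard.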


How the Major Models Actually Compare

Here’s where the benchmark data gets informative — and occasionally surprising.

ARC-AGI-2 Performance

OpenAI o-series (o1, o3): The o-series reasoning models were designed with exactly this kind of task in mind. They use extended chain-of-thought generation and search-like procedures to work through problems. On ARC-AGI-1, o3 was the clear leader. On ARC-AGI-2, performance drops sharply — reports put the best o-series results at 4–8% on the public eval, with higher numbers possible on easier subsets.

GPT-4o: Without extended reasoning, GPT-4o performs poorly on ARC-AGI-2. The model tends to identify plausible-sounding rules but apply them inconsistently. Scores below 5% are typical.

Claude 3.5 / 3.7 Sonnet (Anthropic): Claude’s extended thinking feature (available in Claude 3.7 Sonnet) gives it a modest edge over non-reasoning GPT-4o on ARC-AGI-2. But Claude still struggles with the most compositional puzzles. Scores are broadly similar to the o-series on the public benchmark — in the low single digits for the hardest tasks.

Gemini 1.5/2.0 Pro (Google): Google’s Gemini models perform competitively on reasoning tasks generally, but ARC-AGI-2 exposes a spatial reasoning gap. The visual grid format, even when rendered as text tokens, doesn’t play to Gemini’s strengths. Scores are in a similar range to other frontier models — below 10%.

DeepSeek and Qwen (Chinese labs): DeepSeek’s R-series reasoning models have attracted significant attention for their performance on logic and math tasks. On ARC-AGI-2, DeepSeek R1 and its successors perform roughly in line with o1 — better than standard LLMs, but still well below human-level. Qwen’s reasoning variants show similar profiles.

The key pattern: ARC-AGI-2 compresses the performance differences between frontier models. Models that look meaningfully different on MMLU or GPQA all cluster in a narrow range on ARC-AGI-2, suggesting that what separates them on standard benchmarks is largely recall and formatting rather than abstract reasoning.

Pencil Puzzle Bench Performance

Pencil Puzzle Bench tells a somewhat different story, because different puzzle types play differently to model strengths.

Sudoku: Frontier models — including Claude 3.5, GPT-4o, and Gemini — solve straightforward Sudoku puzzles with moderate reliability. Accuracy drops sharply when puzzles are harder (fewer starting clues, requiring backtracking). Reasoning models perform better here than standard chat models.

Nonograms: Accuracy is generally poor across the board. Nonograms require simultaneously satisfying row and column constraints while filling cells, which is extremely difficult for models to track in text space. Even strong reasoning models fail frequently on medium-difficulty nonograms.

Kakuro: Similar to Sudoku in structure, but requires more simultaneous constraint tracking. Models often generate locally plausible moves that violate global constraints later in the solve.

Slitherlink and Nurikabe: These are particularly difficult for LLMs. The spatial reasoning required — tracking a connected loop or identifying island boundaries — doesn’t map cleanly to token-based generation. Accuracy on non-trivial instances is low across all major models.

Overall takeaway from Pencil Puzzle Bench: Even the best models are unreliable on medium-to-hard logic puzzles. Chinese reasoning models (DeepSeek R-series, Qwen) perform notably well on more formulaic puzzles like Sudoku and Kakuro, likely because their training emphasized systematic problem-solving. But on puzzles requiring spatial or topological reasoning, the advantage disappears.


What the Gaps Reveal About AI Reasoning

The combined picture from ARC-AGI-2 and Pencil Puzzle Bench makes a few things clear.

Reasoning Models Help, But Don’t Solve the Problem

Extended chain-of-thought reasoning — the approach used by o1, o3, Claude’s extended thinking, and DeepSeek R1 — provides a real benefit on structured problems. More compute at inference time, used to explore possible paths and check intermediate steps, does improve accuracy.

But the improvement is insufficient for hard reasoning tasks. On ARC-AGI-2, even extended thinking doesn’t consistently close the gap with human performance. The models aren’t running out of tokens or time — they’re generating plausible-sounding reasoning chains that lead to wrong conclusions.

Spatial and Visual Reasoning Remains a Genuine Weakness

ARC-AGI-2 is nominally a visual task, but it’s typically evaluated via text representations of the grids. Even so, models struggle with the spatial relationships involved. Pencil Puzzle Bench’s Slitherlink and Nurikabe results confirm this: any task requiring reasoning about spatial adjacency, connectivity, or topology is currently hard for all major models.

This isn’t surprising given that LLMs are trained primarily on text, but it’s worth being explicit about: the spatial reasoning gap is real and not obviously closing.

Consistency Across Steps Is the Core Bottleneck

Both benchmarks ultimately test the same underlying capability: maintaining logical consistency across many interdependent reasoning steps. A model that generates a correct hypothesis about an ARC rule in step 2 but applies it incorrectly in step 6 fails the task. A model that solves 30 Sudoku cells correctly but places the 31st in a way that violates a row constraint fails the puzzle.

This is different from what standard benchmarks test. MMLU, GPQA, and similar evals mostly test single-step recall or single-hop reasoning. They’re not designed to catch compounding errors over long inference chains.

Chinese Models Closing the Gap on Structured Problems

DeepSeek and Qwen deserve specific mention. On benchmarks that reward systematic, rule-following reasoning — math olympiad problems, formal logic, structured Sudoku-type puzzles — the leading Chinese models are genuinely competitive with OpenAI and Anthropic’s best. In some cases, they outperform them.

The gap tends to reappear on open-ended, novel, or spatially-complex reasoning. ARC-AGI-2 is a good example of a task where the advantage of heavy reasoning-focused training is less clear, and all models end up in a narrow performance band.


The Benchmark Design Problem

It’s worth stepping back and acknowledging something: benchmarks are always an approximation. ARC-AGI-2 and Pencil Puzzle Bench are better approximations than MMLU — but they’re still proxies.

ARC-AGI-2 could theoretically be “solved” (in the sense of achieving high benchmark scores) by training heavily on synthetic ARC-style tasks with diverse compositional rules, without the model actually developing general reasoning. François Chollet and the ARC Prize team have been explicit that they’re in an adversarial relationship with this possibility — ARC-AGI-2 was designed specifically to be harder to exploit, but it’s not immune.

Pencil Puzzle Bench faces a related issue: puzzle solvers can be implemented as algorithms. A model that learned to call a constraint propagation algorithm internally (or that was fine-tuned heavily on puzzle examples) might score well without demonstrating general reasoning.

These limitations don’t make the benchmarks useless — they make them useful in a specific way. ARC-AGI-2 and Pencil Puzzle Bench are valuable because they’re currently hard to game, which means strong performance is informative even if it’s not definitive proof of general intelligence.


How This Connects to Real AI Applications

If you’re building AI-powered systems, these benchmarks should inform which models you select for which tasks.

The capability gaps exposed by ARC-AGI-2 and Pencil Puzzle Bench map onto real-world failures you’ve probably seen:

  • Multi-step workflow automation that breaks down when an agent encounters an unexpected state it hasn’t been explicitly prepared for
  • Data analysis agents that generate plausible-looking conclusions from inconsistent intermediate steps
  • Planning and scheduling tools where constraint violations accumulate undetected until the output is obviously wrong

The question isn’t just “which model scored higher on this benchmark” — it’s “what does this tell me about failure modes I should architect around?”

For tasks that require consistent constraint satisfaction across many steps, you need either a strong reasoning model with extended thinking enabled, explicit verification steps built into the workflow, or both.
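In practice, the verification half of that architecture can be as simple as wrapping the model call in a deterministic checker and feeding violations back on retry. A hedged sketch, where `call_model` and `check_constraints` are placeholders for your own inference call and domain validator:

```python
# Sketch of "explicit verification built into the workflow": pair a model
# call with a deterministic checker and retry on failure, feeding the
# specific violations back so each retry is targeted.

def solve_with_verification(task, call_model, check_constraints, max_attempts=3):
    """Return the first model output that passes the checker, else None."""
    feedback = ""
    for _ in range(max_attempts):
        answer = call_model(task, feedback)
        violations = check_constraints(answer)
        if not violations:
            return answer
        feedback = f"Previous answer violated: {violations}"
    return None
```

The checker does not need to be clever; it only needs to be correct. Even a basic rule validator catches the compounding-error failure mode that both benchmarks expose.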

Testing Model Reasoning in Your Own Workflows with MindStudio

One concrete way to put this into practice: when you’re building an AI agent or workflow, you should be testing it against hard cases — not just typical inputs.

MindStudio makes this easier because you have access to 200+ AI models in a single environment — including the reasoning variants of GPT, Claude, and Gemini alongside open-source alternatives. You can build a workflow once, then run the same agent against o3, Claude 3.7 Sonnet with extended thinking, DeepSeek R1, and a lightweight Qwen model, and compare outputs side by side.

That kind of cross-model evaluation — the same task, multiple models, consistent conditions — is the closest thing to running your own benchmark on the specific reasoning demands of your application. You don’t need API keys for each provider, and you don’t need to rebuild the workflow for each model. The model selector works at the workflow level.

For reasoning-intensive applications — multi-step planning, constraint-based data validation, complex document analysis — this kind of model comparison upfront is worth the investment. The benchmark results above suggest which models are likely to hold up better, but your specific task has its own demands that aggregate benchmarks won’t capture.

You can try MindStudio free at mindstudio.ai.


Frequently Asked Questions

What is ARC-AGI-2 and how is it different from ARC-AGI-1?

ARC-AGI-2 is the second version of the Abstraction and Reasoning Corpus benchmark, released by François Chollet and the ARC Prize team in 2025. It tests abstract visual reasoning by presenting AI models with grid-based puzzles where they must infer a transformation rule from a few examples and apply it to a new input. ARC-AGI-2 is significantly harder than ARC-AGI-1: it uses more compositional rules, provides less redundant information per example, and introduces novel object types designed to resist solutions based on synthetic ARC-style training data. Where top models like o3 approached 75% on ARC-AGI-1, the same models score below 10% on ARC-AGI-2.

What is Pencil Puzzle Bench?

Pencil Puzzle Bench is a benchmark that evaluates AI models on classic logic puzzles — Sudoku, nonograms, Kakuro, Slitherlink, Nurikabe, and similar formats. These puzzles have fully specified rules and exactly one correct solution derivable through logical deduction. The benchmark tests whether models can maintain constraint consistency across many interdependent reasoning steps, a skill that standard chat and instruction-following benchmarks don’t measure well.

Why do AI models score so low on ARC-AGI-2?

The primary reason is that ARC-AGI-2 is specifically designed to prevent solution-by-memorization. Each task presents a novel rule that changes between puzzles, so models can’t succeed by recognizing a familiar pattern from training. The benchmark also requires composing multiple operations and tracking their interactions — a form of multi-step, spatially-grounded reasoning that current language models handle poorly regardless of their performance on text-based tasks.

Do Chinese AI models like DeepSeek perform better on these benchmarks?

On structured logic puzzles (Sudoku, Kakuro) and formal reasoning tasks, Chinese reasoning models like DeepSeek R1 and Qwen’s reasoning variants are genuinely competitive with OpenAI and Anthropic’s best offerings. On ARC-AGI-2 and spatially complex puzzles, the advantage is less consistent, and most frontier models cluster in a similar performance range. The gap between Chinese and Western frontier models on these specific benchmarks is smaller than marketing comparisons suggest.

Can reasoning models like o3 or Claude 3.7 Sonnet solve these benchmarks?

Reasoning models — those using extended chain-of-thought processing — do perform better than standard chat models on both benchmarks. But the improvement doesn’t close the gap with human performance. On ARC-AGI-2, humans score 60-80% on typical task subsets while the best AI models remain below 10%. Reasoning models also fail frequently on medium-difficulty Pencil Puzzle Bench instances. The extended thinking helps, but it doesn’t address the underlying issue of maintaining logical consistency across many interdependent steps.

Are there other benchmarks that test genuine reasoning like ARC-AGI-2?

Several benchmarks attempt to measure reasoning beyond recall, including BIG-Bench Hard, which collects tasks that standard models struggled on at the time of its creation, and LiveBench, which uses continuously updated questions to prevent data contamination. ARC-AGI-2 and Pencil Puzzle Bench are distinctive because they test visual/spatial reasoning and constraint satisfaction specifically — dimensions where the capability gap between models and humans remains pronounced.


Key Takeaways

  • ARC-AGI-2 and Pencil Puzzle Bench are among the most reliable current benchmarks for measuring genuine AI reasoning, specifically because they resist solution by memorization.
  • On ARC-AGI-2, even the best frontier models — o3, Claude 3.7 Sonnet, Gemini 2.0 Pro, DeepSeek R1 — score below 10%, compared to human baselines of 60-80%.
  • Pencil Puzzle Bench reveals that constraint-satisfaction and spatial reasoning remain weak across all major models, with performance deteriorating sharply as puzzle difficulty increases.
  • Chinese reasoning models (DeepSeek, Qwen) are competitive on structured logic puzzles but show similar limitations to Western models on novel, spatially complex tasks.
  • The core bottleneck for all models is maintaining logical consistency across many interdependent reasoning steps — a failure mode that shows up in real-world AI workflows, not just academic benchmarks.
  • When building reasoning-intensive AI applications, running cross-model comparisons on your specific task — rather than relying on aggregate benchmark rankings — is the most reliable way to identify which model holds up.

If you’re building workflows that depend on AI reasoning, testing across multiple models in a controlled environment is worth the effort. MindStudio lets you do exactly that — swap models, compare outputs, and identify failure modes before they reach production.

