
What Is the Pencil Puzzle Benchmark? The Test That Measures Pure Multi-Step Logical Reasoning

The Pencil Puzzle Benchmark tests constraint satisfaction problems with no training-data contamination. GPT-4-class models score around 56%. Several prominent Chinese models score under 7%.

MindStudio Team

Why Most AI Benchmarks Are Already Broken Before They’re Published

Every few months, a new AI benchmark drops. Within weeks, someone notices the top models have already seen the test data. Scores inflate. Rankings shift. And developers are left asking the same question: what does any of this actually measure?

The Pencil Puzzle Benchmark was built to answer that. It tests multi-step logical reasoning on constraint satisfaction problems — specifically Japanese-style pencil puzzles — that can be generated fresh for each evaluation. No memorization. No pattern matching. Just reasoning.

The results are stark. GPT-4 family models score around 56%. Several prominent Chinese models score under 7%. This isn’t a small gap — it’s a window into which models can actually reason versus which ones have learned to look like they can.


What “Pencil Puzzles” Actually Are

Pencil puzzles are a category of logic puzzle popularized by Japanese publishers like Nikoli. They include well-known formats like Sudoku, Nonograms (also called Picross), Nurikabe, Slitherlink, Hitori, and dozens of others.

What makes them distinctive isn’t just their difficulty. It’s their structure.

Constraint Satisfaction at Its Core

Every pencil puzzle is a constraint satisfaction problem (CSP). You’re given a grid and a set of rules. The solution requires you to apply those rules consistently across every cell, row, column, or region — simultaneously satisfying all constraints without violating any.

There’s no partial credit for “mostly right.” One broken constraint invalidates the entire solution.

This makes pencil puzzles an ideal test for logical reasoning because:

  • They have a unique correct answer — ambiguity can’t mask a wrong output
  • They require backtracking and hypothesis testing — you can’t just scan the grid once
  • They scale in difficulty — puzzle generators can tune complexity without changing the rules
  • Novel instances are cheap to generate — the same rule set produces infinite unique puzzles

That last point matters enormously for benchmark design.
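The all-or-nothing character of these constraints is easy to make concrete. Here is a minimal sketch of a validity check for a completed 9×9 Sudoku grid (represented as a list of lists of ints); the function name and representation are illustrative, not the benchmark's own code:

```python
def is_valid_sudoku(grid):
    """Return True only if every row, column, and 3x3 box of a
    completed 9x9 grid contains the digits 1-9 exactly once."""
    def ok(cells):
        return sorted(cells) == list(range(1, 10))

    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    # A single failed unit invalidates the whole solution: no partial credit.
    return all(ok(unit) for unit in rows + cols + boxes)
```

Note the symmetry with the human standard: a solver who breaks one constraint anywhere has not solved the puzzle, no matter how much of the grid is right.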

Why Novelty Is the Whole Point

A benchmark that uses fixed puzzles can be contaminated. If a model’s training data includes solutions to Sudoku puzzles from web scrapes, it might “solve” them by recall rather than reasoning. The same problem applies to math benchmarks, coding benchmarks, and reading comprehension tests.

The Pencil Puzzle Benchmark sidesteps this by generating new puzzle instances programmatically. The rules stay constant — the grid contents change every time. There’s no specific answer for a model to have memorized, because the specific puzzle being evaluated didn’t exist during training.

It is among the cleanest methodological approaches yet proposed for contamination-proofing a reasoning benchmark.
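One plausible shape for such a generator, sketched here for Sudoku: start from a provably valid base solution, apply validity-preserving shuffles, then blank cells. This is an illustration of the fresh-instance idea, not the benchmark's actual generator, and a real one would also verify that the blanked puzzle has a unique solution, which this sketch omits:

```python
import random

def fresh_sudoku(blanks=40, seed=None):
    """Produce a novel Sudoku instance by transforming a base solution
    with validity-preserving shuffles, then blanking cells.
    Returns (puzzle, solution); blanked cells are 0."""
    rng = random.Random(seed)
    # Base pattern: valid for every row, column, and 3x3 box.
    sol = [[(r * 3 + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

    # Relabel digits: permuting symbols preserves all constraints.
    digits = list(range(1, 10))
    rng.shuffle(digits)
    sol = [[digits[v - 1] for v in row] for row in sol]

    # Shuffle rows within each 3-row band (keeps boxes intact).
    for band in range(0, 9, 3):
        order = [band + i for i in rng.sample(range(3), 3)]
        sol[band:band + 3] = [sol[r] for r in order]

    # Blank cells to produce the puzzle the model must complete.
    puzzle = [row[:] for row in sol]
    for r, c in rng.sample([(r, c) for r in range(9) for c in range(9)], blanks):
        puzzle[r][c] = 0
    return puzzle, sol
```

Because the rules never change while the instances do, an evaluator can mint as many unseen puzzles as it needs, which is exactly the property that makes memorization useless.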


How the Benchmark Works

The benchmark evaluates models on their ability to solve pencil puzzles presented as text (or in some configurations, structured data). Models receive:

  1. A description of the puzzle rules
  2. A grid in a parseable format (typically a structured text grid or coordinate-based representation)
  3. A prompt asking for the completed solution

The model must output a valid solution that satisfies all constraints. Partial or approximate answers don’t count.
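The three-part input above can be rendered in a few lines. This assumes a simple space-separated grid serialization; the benchmark's exact prompt format may differ:

```python
def build_prompt(rules, grid):
    """Render a puzzle as rules, a parseable grid, and a request for the
    full solution. Illustrative format only; 0 marks an empty cell."""
    grid_text = "\n".join(
        " ".join("." if v == 0 else str(v) for v in row) for row in grid
    )
    return (
        f"Rules:\n{rules}\n\n"
        f"Grid ('.' = empty):\n{grid_text}\n\n"
        "Output the completed grid, one row per line, "
        "digits separated by spaces."
    )
```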

Puzzle Types Tested

The benchmark typically includes multiple puzzle formats to avoid specialization bias. Common puzzle types include:

  • Sudoku — Fill a 9×9 grid so each row, column, and 3×3 box contains digits 1–9
  • Nonograms — Color cells to match row and column clues describing consecutive filled-cell runs
  • Nurikabe — Shade cells to form a single connected “sea” around numbered islands of exact sizes
  • Hitori — Shade out numbers so no row or column contains duplicates, with no two shaded cells adjacent and the remaining cells connected
  • Slitherlink — Draw a single closed loop using clues about how many sides of each cell the loop touches

Each puzzle type tests a slightly different constraint structure. Nonograms require managing overlapping row/column constraints. Nurikabe adds a connectivity requirement. Slitherlink requires reasoning about a topological structure.

A model that solves one type well but fails on others reveals something about which constraint patterns its reasoning handles cleanly.
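The overlapping-run constraints of a Nonogram can be made concrete with a single-line checker. A sketch, assuming lines are lists of 0/1 cells and clues are lists of run lengths (names are illustrative):

```python
def runs(line):
    """Run lengths of filled cells in a Nonogram row or column,
    e.g. [1, 1, 0, 1] -> [2, 1]."""
    out, count = [], 0
    for cell in line:
        if cell:
            count += 1
        elif count:
            out.append(count)
            count = 0
    if count:
        out.append(count)
    return out

def line_satisfies(line, clue):
    """A line is valid only if its filled runs match the clue exactly."""
    return runs(line) == clue
```

The hard part of a Nonogram is not this check. It is that every cell sits in both a row and a column, so each placement must satisfy two clues at once, and that is the overlapping-constraint structure the benchmark probes.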

Scoring Methodology

The benchmark uses exact-match scoring at the grid level. A solution is correct only if every cell matches the verified answer. This is deliberately strict — it’s the standard applied to human solvers, and there’s no reason to apply a softer standard to models claiming human-level reasoning.

Some variants of the evaluation include:

  • Step-by-step scoring — Does the model’s chain of thought reflect valid logical deductions?
  • Error analysis — Where in the puzzle does reasoning typically break down?
  • Scaling tests — How does performance change as grid size increases?

The combination gives researchers more than a single number. It shows how models fail, not just that they fail.
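Exact-match scoring and the error-analysis variant are both straightforward to sketch, assuming grids are lists of lists (these helper names are illustrative, not the benchmark's API):

```python
def score_exact(predicted, answer):
    """Grid-level exact match: 1 if every cell agrees, else 0."""
    return int(predicted == answer)

def first_mismatch(predicted, answer):
    """Error analysis: locate the first cell where the model's solution
    departs from the verified answer, or None if they agree."""
    for r, (prow, arow) in enumerate(zip(predicted, answer)):
        for c, (p, a) in enumerate(zip(prow, arow)):
            if p != a:
                return (r, c, p, a)
    return None
```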


What the Scores Reveal

The headline numbers are striking enough on their own. But the pattern of results tells a more nuanced story about the state of AI reasoning.

The 56% Ceiling

GPT-4-class models achieve around 56% on the benchmark. That’s a meaningful number to sit with. It means the most capable publicly available models fail nearly half of all logic puzzles that a patient human solver could complete with pen and paper.

This isn’t a matter of speed or token limits. These are puzzles designed to be solvable in minutes by humans applying consistent rules. A 56% pass rate suggests that even frontier models hit reasoning walls when constraint chains grow long enough.

The ceiling also hasn’t moved dramatically despite significant capability improvements in other areas. Models that dramatically outperform their predecessors on reading comprehension, coding, or factual recall don’t show the same gains on pencil puzzles. That asymmetry is informative — it suggests the underlying constraint-satisfaction skill isn’t improving at the same rate as surface-level task performance.

Chinese Models and the Sub-7% Result

Several prominent Chinese-developed models score under 7% on the benchmark. This result generated significant attention in AI research circles, partly because these same models post competitive scores on many other standard benchmarks.

The discrepancy points to a known problem: benchmark leakage. Many standard evaluations — including translated versions of popular English-language tests — have made their way into Chinese-language pretraining corpora. Models trained on this data can score well on contaminated benchmarks without having developed the underlying skills those benchmarks were meant to test.

The Pencil Puzzle Benchmark, with its generated-fresh instances, cuts through this. You can’t memorize your way to a correct solution for a puzzle that didn’t exist until five seconds ago.

This doesn’t mean all Chinese models are weak at reasoning, or that all high-scoring models on other benchmarks are inflated. But it does suggest that benchmark diversity — and specifically, contamination-resistant benchmarks — should be standard practice for any honest model evaluation.

What Failure Looks Like

When models fail pencil puzzles, they tend to fail in recognizable ways:

  • Constraint violation at a distance — The model fills in a cell correctly based on nearby information, then forgets that constraint when it returns to a different region of the grid.
  • Hallucinated constraints — The model invents rules that don’t exist in the puzzle specification, then applies them consistently to produce a wrong but internally coherent answer.
  • Premature commitment — Rather than maintaining multiple hypotheses about uncertain cells, the model picks one value early and sticks with it even when later steps reveal a contradiction.
  • Lost context — In large grids, earlier deductions get dropped as the reasoning chain extends, leading the model to re-derive (sometimes incorrectly) information it had already established.

These failure modes map to broader limitations in multi-step reasoning tasks. They’re not unique to puzzles — they show up in complex code generation, multi-hop question answering, and agentic tasks that require holding many interdependent constraints in mind simultaneously.
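A validator that names the broken constraint makes the first of these failure modes visible. Sketched for Sudoku: a grid can look locally consistent around the last cell filled yet still fail a unit far away from it.

```python
def violated_units(grid):
    """List every Sudoku unit a completed grid breaks, as
    ('row'|'col'|'box', index). Useful for spotting violation at a
    distance: a locally plausible fill can break a far-away unit."""
    target = list(range(1, 10))
    bad = []
    for i in range(9):
        if sorted(grid[i]) != target:
            bad.append(("row", i))
        if sorted(grid[r][i] for r in range(9)) != target:
            bad.append(("col", i))
        box = [grid[(i // 3) * 3 + r][(i % 3) * 3 + c]
               for r in range(3) for c in range(3)]
        if sorted(box) != target:
            bad.append(("box", i))
    return bad
```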


Why This Benchmark Matters Beyond Puzzles

The Pencil Puzzle Benchmark isn’t really about puzzles. It’s about what puzzles reveal.

A Proxy for Real Reasoning Tasks

Multi-step constraint satisfaction appears constantly in real-world applications:

  • Scheduling — Book meetings across time zones with overlapping constraints on availability, priority, and duration
  • Code debugging — Trace a bug through a call stack while maintaining hypotheses about where state gets corrupted
  • Legal or financial analysis — Apply a set of rules to a complex fact pattern and derive a valid conclusion
  • Supply chain optimization — Satisfy delivery, capacity, and cost constraints simultaneously

Models that can’t reliably solve a Nonogram probably struggle with these tasks too — at least at the level of precision these applications require.
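The scheduling case reduces to the same intersect-all-constraints shape. A toy sketch, assuming each attendee's availability is a list of free hours (a real scheduler would also weigh priority, duration, and time zones, as noted above):

```python
def feasible_slots(availabilities):
    """Intersect every attendee's free hours: each calendar is one more
    constraint the chosen slot must satisfy simultaneously."""
    return sorted(set.intersection(*(set(a) for a in availabilities)))
```

Every added attendee shrinks the feasible set, just as every filled cell in a puzzle narrows the options for its neighbors.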

Separating Recall from Reasoning

One of the persistent challenges in AI evaluation is separating what a model knows from what a model can figure out. Benchmarks that draw from existing knowledge — trivia, reading comprehension, standard math problems — can’t cleanly separate these.

Pencil puzzles eliminate the knowledge variable almost entirely. You don’t need to know anything about the world to solve a Nurikabe. You need to apply the rules. This isolation makes the benchmark a much cleaner signal for the specific capability of reasoning under constraint.

The Benchmark Contamination Problem

The broader research community has documented benchmark contamination extensively. Recent analyses of popular LLM evaluation sets have shown that test data appears in pretraining corpora at rates that meaningfully inflate scores. The Pencil Puzzle Benchmark’s generated-instance approach is one of the few robust solutions to this problem.

Other contamination-resistant approaches include:

  • Time-gated benchmarks — Evaluations based on events after a model’s training cutoff
  • Adversarial generation — Crafting questions specifically designed to probe weaknesses
  • Human-generated novel tasks — Tasks that haven’t yet been scraped from the web

Each has tradeoffs. The pencil puzzle approach is particularly clean because the rules are well-defined, the correct answer is verifiable, and instance generation is cheap enough to produce evaluation sets at scale.


How This Compares to Other Reasoning Benchmarks

The Pencil Puzzle Benchmark isn’t the only tool researchers use for reasoning evaluation. Understanding how it fits into the broader evaluation landscape is useful.

GSM8K and MATH

These grade-school and competition math benchmarks test multi-step numerical reasoning. They’ve been widely used, but both suffer from documented contamination. Models can achieve high scores partly through memorization of problem types and solution templates.

Math benchmarks also have a different structure than pencil puzzles — math problems typically have a single chain of deduction, while puzzles require managing a branching constraint graph.

ARC and BIG-Bench Hard

The AI2 Reasoning Challenge and BIG-Bench Hard both target reasoning tasks that resist surface-level pattern matching. They’re valuable, but many tasks in these sets are still drawn from existing corpora, leaving some contamination risk.

LogiQA and ReClor

These benchmarks test logical reasoning through natural language arguments. They’re useful for evaluating reading comprehension paired with logic, but they can be influenced by language understanding shortcuts rather than pure logical deduction.

Where Pencil Puzzles Fit

The Pencil Puzzle Benchmark complements these evaluations rather than replacing them. Its unique strength is clean separation of reasoning from knowledge, with built-in contamination resistance. Its limitation is that it doesn’t test reasoning in natural language or applied domains.

Used together, these benchmarks give a more complete picture of model capability than any single evaluation can provide.


Choosing and Testing Models for Reasoning Tasks in MindStudio

If you’re building applications where multi-step reasoning actually matters — and the benchmark results make clear that not all models are equal here — model selection becomes a meaningful decision.

MindStudio gives you access to 200+ models out of the box, including the full GPT-4 family, Claude, Gemini, and others, without managing separate API keys or accounts. You can swap models in and out of your workflow with a few clicks and compare their outputs directly.

This matters for the kind of tasks the Pencil Puzzle Benchmark is designed to probe. If you’re building an agent that needs to resolve scheduling conflicts, trace dependencies in a codebase, or apply a complex rule set to incoming data, you want to know — before you ship — whether your model of choice can actually handle the constraint chain.

With MindStudio, you can build a workflow that routes the same reasoning task to multiple models in parallel, captures their outputs, and surfaces differences. It’s a practical way to run your own reasoning evaluation without building evaluation infrastructure from scratch.

You can start building for free at mindstudio.ai. The average workflow takes less than an hour to put together, and you can test model behavior against real reasoning tasks before committing to a production setup.

If you’re specifically interested in comparing frontier AI models for agent workflows, MindStudio makes that comparison straightforward without requiring you to manage multiple accounts or write API integration code.


FAQ

What is the Pencil Puzzle Benchmark?

The Pencil Puzzle Benchmark is an AI evaluation framework that tests large language models on Japanese-style logic puzzles — Sudoku, Nonograms, Nurikabe, and similar formats. These are constraint satisfaction problems with unique, verifiable solutions. The benchmark generates novel puzzle instances for each evaluation run, making it highly resistant to training data contamination.

Why do some models score so low on pencil puzzles?

Low scores — like the sub-7% results seen in several Chinese-developed models — typically reflect a combination of two things: limited multi-step reasoning ability and benchmark inflation on other evaluations due to training data contamination. Pencil puzzles can’t be solved by recall, so models that have inflated scores elsewhere can’t carry that advantage here.

How is the Pencil Puzzle Benchmark different from other AI benchmarks?

Most benchmarks draw from existing text — math problems, reading passages, trivia questions — that may have appeared in training data. The Pencil Puzzle Benchmark generates fresh puzzle instances programmatically, so there’s no specific answer a model could have memorized. It also produces a verifiable correct answer, eliminating ambiguity in scoring.

What does a 56% score on the Pencil Puzzle Benchmark actually mean?

It means frontier models solve roughly half of the logic puzzles presented. Given that these puzzles are solvable by humans with pen, paper, and time, a 56% pass rate reflects a genuine ceiling in constraint-satisfaction reasoning — not a speed or knowledge gap. The score also suggests that reasoning capability isn’t improving at the same rate as other model capabilities.

Can the Pencil Puzzle Benchmark be used to evaluate reasoning for real-world applications?

Not directly — the benchmark is a research tool, not a production readiness test. But the reasoning skills it probes (managing multiple constraints simultaneously, avoiding premature commitment, maintaining consistency over long deduction chains) are directly relevant to scheduling, debugging, legal analysis, and other structured real-world tasks. High scores on the benchmark correlate with stronger performance on these applied reasoning challenges.

What types of puzzles are included in the benchmark?

The benchmark typically covers multiple puzzle formats to avoid specialization effects. Common types include Sudoku, Nonograms, Nurikabe, Hitori, and Slitherlink. Each tests a slightly different constraint structure — some require managing overlapping row/column rules, others require reasoning about connectivity or topology. Testing across multiple formats gives a more complete picture of reasoning capability than any single puzzle type would.


Key Takeaways

  • The Pencil Puzzle Benchmark tests multi-step logical reasoning through constraint satisfaction problems — pencil puzzles like Sudoku, Nonograms, and Nurikabe — with fresh-generated instances that resist training data contamination.
  • Frontier models score around 56% on the benchmark, revealing a genuine ceiling in constraint-satisfaction reasoning even as performance improves elsewhere.
  • Several prominent Chinese-developed models score under 7%, a result that strongly suggests benchmark inflation on other evaluations caused by training data contamination.
  • Common failure modes — constraint violation at a distance, hallucinated rules, premature commitment — map directly to limitations in real-world reasoning tasks like scheduling, debugging, and complex analysis.
  • Contamination-resistant benchmarks like this one are increasingly essential for understanding what models can actually do, not just what they’ve memorized.

If you’re building AI applications where reasoning depth matters, MindStudio makes it practical to test multiple models side by side on your actual use cases — before committing to one. You can start for free and have a working workflow in under an hour.

Presented by MindStudio
