DeepSeek Vision Beats GPT-5.4 by 17 Points on Maze Navigation — The Topological Reasoning Benchmark Explained

On maze navigation, DeepSeek's vision model scores 67% vs. GPT-5.4's 50% — a 17-point gap driven by inline bounding-box spatial reasoning.

MindStudio Team

DeepSeek Scores 67% on Maze Navigation. GPT-5.4 Scores 50%.

That 17-point gap is the most specific thing you can say about DeepSeek’s new vision model, and it’s the right place to start. On the maze navigation benchmark — a topological reasoning task where models must trace paths through spatial layouts — DeepSeek’s vision model scores 67% versus GPT-5.4 at 50%, Gemini Flash 3 at 49%, and Sonnet 4.6 at 49%. All three frontier competitors cluster within a point of each other. DeepSeek is 17 points clear.

If you’re choosing a vision model for any task involving spatial layout, path tracing, or dense scene counting, that number matters. This post explains why the gap exists, what the model is actually doing differently, and where the benchmark claims stop.

What Topological Reasoning Actually Tests

“Topological reasoning” sounds abstract. The maze navigation benchmark makes it concrete: given an image of a maze, trace a valid path from start to finish. This requires the model to track which cells are connected, which walls block movement, and which sequence of moves leads somewhere. Language is genuinely bad at this.

Consider what happens when a model tries to describe a path in words: “go right, then down, then right again, then up.” By the time you’re four moves in, the model has lost track of which cell it’s currently in. There’s no anchor. The description drifts.
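
To make that drift concrete, here is a minimal sketch (illustrative, not from the paper or the benchmark) of why coordinate-anchored paths are checkable in a way prose paths are not: represent the maze as a grid with a set of blocked cells, and every step of a coordinate trace can be verified against it. A "go right, then down" description offers nothing to verify against.

```python
# Minimal illustration (not from the paper): a path expressed as explicit
# cell coordinates can be checked step by step, while a prose description
# ("right, down, right, up") has no anchor once the trace gets long.

# 4x4 maze: walls is a set of blocked (row, col) cells; everything else is open.
walls = {(0, 1), (1, 1), (1, 3), (2, 3), (3, 0)}

def valid_path(path, walls, size=4):
    """Return True if every cell is in bounds and unblocked, and consecutive cells are adjacent."""
    for (r, c) in path:
        if not (0 <= r < size and 0 <= c < size) or (r, c) in walls:
            return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))

# Coordinate-anchored trace from start (0, 0) to goal (3, 3).
trace = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2), (3, 3)]
print(valid_path(trace, walls))  # True
```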

Path tracing is the canonical example of what the paper calls the “reference gap” — the problem where even a model that sees perfectly can’t reliably point to things in its reasoning chain. You can describe the third bear from the left on a rocky ledge, but as your chain-of-thought extends, you lose track of which entity you’re actually talking about.

The maze benchmark is designed to stress exactly this failure mode. It’s not testing whether the model can see the maze clearly. It’s testing whether the model can maintain spatial reference across a multi-step reasoning trace.

How DeepSeek Closes the Gap: Visual Primitives

The paper (titled Thinking with Visual Primitives, published and then quickly pulled, and still hard to find as of this writing) describes a specific mechanism for solving the reference gap.

Instead of describing spatial relationships in prose, the model outputs bounding boxes inline in its chain-of-thought. The format is straightforward: <ref>label</ref><box>x1,y1,x2,y2</box>. These are special vocabulary tokens. Not function calls. Not a separate tool invocation. The coordinates appear mid-thought, the same way a human might tap a finger on a map while reasoning aloud.

For maze navigation, this means the model can literally point to the cell it’s currently reasoning about. Each step in the path trace gets a bounding box. The model doesn’t have to maintain a purely verbal description of its position — it has a coordinate anchor.

For dense counting tasks (the paper uses a team photo example), the model writes out a bounding box for every person it identifies before attempting a count. This is why the approach also helps on counting and spatial QA benchmarks, not just maze navigation.

The technique is called “visual primitives” because the coordinates are first-class primitives in the model’s vocabulary, not post-hoc annotations. The chain-of-thought and the spatial references are the same stream.
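
That single stream is also easy to consume downstream. Here is a sketch of how a client might pull the inline primitives back out of a reasoning trace; the <ref>/<box> format follows the paper's description, but the trace string and the parsing code are illustrative assumptions, not DeepSeek's tooling.

```python
import re

# Hypothetical chain-of-thought fragment using the paper's inline format:
# <ref>label</ref><box>x1,y1,x2,y2</box> emitted mid-thought, not as tool calls.
trace = (
    "The entrance is at <ref>start cell</ref><box>12,14,40,42</box>. "
    "Moving right puts us in <ref>cell (0,1)</ref><box>44,14,72,42</box>, "
    "then down into <ref>cell (1,1)</ref><box>44,46,72,74</box>."
)

PRIMITIVE = re.compile(r"<ref>(.*?)</ref><box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_primitives(text):
    """Return (label, (x1, y1, x2, y2)) for every inline reference in the trace."""
    return [(label, tuple(map(int, coords)))
            for label, *coords in PRIMITIVE.findall(text)]

for label, box in extract_primitives(trace):
    print(label, box)

# For dense counting, len(extract_primitives(trace)) gives one anchor per entity.
```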

The Architecture That Makes This Cheap

The benchmark numbers are interesting. The efficiency numbers are arguably more interesting.

For an 80x80 image, DeepSeek’s vision model uses approximately 90 entries in its KV cache. Claude Sonnet 4.6 uses around 870. Gemini Flash 3 uses around 1,000. That’s roughly 10x more efficient on the same input — meaning the model costs about a tenth as much to run per image at equivalent resolution.

The compression pipeline behind this is worth understanding. A 756x756 image produces approximately 571,000 pixels. The model’s custom vision encoder — a “DeepSeek Vision Transformer” built from scratch to support arbitrary resolution — processes this through 14x14 patches, producing 2,916 patch tokens. A 3x3 spatial compression step folds nine adjacent patches into one, bringing the count to 324 tokens. Then a compressed sparse attention mechanism (carried over from the V4 architecture) compresses the KV cache by another factor of four. Final result: 81 KV cache entries for the full image. That’s approximately a 7,000x total compression ratio from raw pixels to KV cache.
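
The arithmetic in that pipeline is easy to reproduce. The sketch below just replays the numbers described above (patching, the 3x3 spatial merge, the 4x sparse-attention compression); it is bookkeeping, not the actual encoder.

```python
# Reproducing the compression arithmetic described above (numbers only,
# not the actual encoder): 756x756 pixels -> 81 KV cache entries.
side, patch = 756, 14

pixels = side * side                      # 571,536 raw pixels
patch_tokens = (side // patch) ** 2       # 54 x 54 = 2,916 patch tokens
merged_tokens = patch_tokens // 9         # 3x3 spatial merge -> 324 tokens
kv_entries = merged_tokens // 4           # sparse-attention 4x compression -> 81

print(pixels, patch_tokens, merged_tokens, kv_entries)
print(f"total compression: ~{pixels // kv_entries}x")  # ~7,056x
```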

The language backbone is DeepSeek V4 Flash — a 284-billion parameter mixture-of-experts model with 13 billion active parameters at inference. You’re getting frontier-grade reasoning at the compute cost of a 13B dense model.

If you’re building applications that process images at scale, this architecture changes the cost math significantly. Platforms like MindStudio that support 200+ models let you swap between vision models without rewriting orchestration — which means you can benchmark DeepSeek’s vision model against Sonnet 4.6 on your actual workload once it’s in wider rollout.

The Training Pipeline Behind the Benchmark

The 17-point gap on maze navigation didn’t come from architecture alone. The training pipeline is specifically designed to produce reliable grounding and pointing behavior.

There are five stages. First, multimodal pre-training on trillions of tokens. Second, supervised fine-tuning — but notably, two separate models are trained here: one for grounding (bounding boxes) and one for pointing (coordinate points). Third, reinforcement learning with GRPO on each specialist, using three reward heads: format correctness, output quality, and task accuracy. Fourth, a unified RFT step that merges the two specialists. Fifth, on-policy distillation into a single student model.

The separation of grounding and pointing into distinct SFT tracks is deliberate. Boxes and points have different geometric properties and different failure modes. Training them separately before merging avoids the compromise you’d get from training a single model on both simultaneously.

The GRPO RL stage with three reward heads is also worth flagging. Format, quality, and accuracy are rewarded independently, which means the model gets penalized for producing malformed <ref><box> tokens even when the underlying reasoning is correct. This is what keeps the visual primitives output clean enough to be useful.
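
The paper names the three reward heads but not their implementations or weights, so the sketch below is schematic: a hypothetical combined reward where a malformed <ref>/<box> trace is penalized independently of whether the answer is right. The individual checks and the equal weighting are assumptions for illustration.

```python
import re

BOX = re.compile(r"<ref>.*?</ref><box>\d+,\d+,\d+,\d+</box>")

def format_reward(completion: str) -> float:
    """Penalize malformed <ref>/<box> output even when the reasoning is right."""
    opened = completion.count("<box>")
    well_formed = len(BOX.findall(completion))
    return 1.0 if opened > 0 and opened == well_formed else 0.0

def quality_reward(completion: str) -> float:
    """Placeholder for a learned judge of reasoning quality."""
    return 0.5  # stand-in value; in practice a reward model would score the trace

def accuracy_reward(completion: str, answer: str) -> float:
    """Did the final answer match the ground truth for the task?"""
    return 1.0 if completion.strip().endswith(answer) else 0.0

def combined_reward(completion: str, answer: str) -> float:
    # Equal weighting is an illustrative assumption, not the paper's recipe.
    return (format_reward(completion)
            + quality_reward(completion)
            + accuracy_reward(completion, answer)) / 3
```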

What the Paper Actually Claims (And What It Doesn’t)

There’s a footnote in the paper that a lot of coverage skips. It reads: “reported scores cover only a subset of evaluation dimensions directly relevant to the research focus of this paper and are therefore not indicative of the model’s overall capabilities.”

The paper is not claiming DeepSeek beats GPT-5.4 across the board. It’s claiming DeepSeek beats GPT-5.4 on visually grounded reasoning tasks — specifically the topological tasks where inline coordinate reasoning provides a structural advantage. That’s a narrower and more defensible claim.

The tasks where the advantage is clearest are maze navigation and path tracing — exactly the tasks where language-only reasoning breaks down most severely. On raw counting QA, Gemini Flash 3 is still ahead. The visual primitives technique helps most where trajectory description is involved, because that’s where the reference gap bites hardest.

For context on how GPT-5.4 and Sonnet 4.6 compare on other dimensions, the GPT-5.4 vs Claude Opus 4.6 benchmark comparison covers coding, creative writing, and research tasks — none of which are in scope for this paper’s claims.

Three Limitations the Paper Admits

The paper lists three limitations explicitly, which is worth crediting.

First, the model is resolution-bound. Fine-grained scenes — dense text, small objects, high-detail diagrams — can still fail. The 7,000x compression ratio is efficient, but it loses information. At some point, the compression is too aggressive for the task.

Second, visual primitives mode must be triggered explicitly. The model doesn’t auto-decide when to use bounding boxes versus prose reasoning. This is a significant practical limitation: for the benchmark numbers to translate to real-world use, you need to know when to invoke the mode. That’s not always obvious.

Third, point-based topological reasoning doesn’t generalize well across scenarios. The maze navigation benchmark is a controlled environment. The same technique applied to arbitrary spatial reasoning tasks — floor plans, circuit diagrams, geographic maps — doesn’t transfer cleanly. The paper is honest that this is a research result, not a production-ready capability.

These limitations matter if you’re evaluating whether to route vision tasks to this model. The benchmark gap is real, but it’s specific to a class of tasks under specific conditions.

The Lineage That Explains the Direction

This model didn’t appear in isolation. DeepSeek has shipped roughly seven vision-related models in 24 months, and each one advances the same thesis: find the cheapest representation that still works.

DeepSeek VL (March 2024) established the foundation with hybrid SigLIP and SAM encoders. Janus (October 2024) decoupled the visual encoder for understanding versus generation — two encoders, shared transformer. VL2 (December 2024) ported mixture-of-experts and multi-head latent attention into vision, producing a 1B activated parameter model scoring 80.9 on OCR Bench. Janus Pro 7B (January 2025) went viral during the R1 moment, running on a single consumer GPU. Then DeepSeek OCR (October 2025) reframed the compression question entirely: render 1,000 text tokens as an image, encode the image, recover the text at 97% accuracy from 100 vision tokens — 10x compression on long context.

Andrej Karpathy’s reaction to the OCR paper — “the tokenizer must go, pixel may be better inputs to language models than text” — pointed at something real. The visual primitives model is the next step in that direction: not just compressing text into pixels, but making spatial coordinates first-class tokens in the reasoning chain.

The through-line across all seven models is compression. Every release asks: what can we throw away without losing what matters?

The Rollout Status

On April 29, 2026, DeepSeek began rolling out vision mode in the app and on the web alongside fast and expert modes. This is a limited test, not a full release. The paper itself is hard to find. The model behind it appears to be in gradual rollout.

This matters for anyone trying to reproduce the benchmark numbers. You can’t just call an API and run the maze navigation benchmark today. The model isn’t publicly available at scale yet.

When it does become available, the comparison that will matter most isn’t the benchmark score in isolation — it’s the cost-adjusted score. At 10x lower KV cache usage, DeepSeek’s vision model could score 15% lower than Sonnet 4.6 on topological tasks and still be the better choice for high-volume spatial reasoning workloads, purely on economics.
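
Here is a back-of-envelope version of that calculation, using KV cache entries per image as a rough proxy for serving cost. The scores are the maze-navigation numbers quoted above; the cost proxy and the "correct answers per unit of compute" framing are simplifying assumptions, not measured prices.

```python
# Back-of-envelope cost-adjusted comparison. KV cache entries per image are
# used as a rough proxy for per-image serving cost; scores are the
# maze-navigation numbers quoted earlier in this post.
models = {
    "DeepSeek vision": {"score": 0.67, "kv_entries": 90},
    "Sonnet 4.6":      {"score": 0.49, "kv_entries": 870},
    "Gemini Flash 3":  {"score": 0.49, "kv_entries": 1000},
}

for name, m in models.items():
    correct_per_unit_cost = m["score"] / m["kv_entries"]
    print(f"{name:16s} score={m['score']:.2f} "
          f"correct-per-unit-cost={correct_per_unit_cost:.4f}")

# Even a substantially lower score would still win on correct answers per
# unit of compute at roughly 10x lower per-image cost.
```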

That’s the calculation worth running. For teams building applications that process spatial layouts, floor plans, or any image type where path tracing or dense counting is involved, the benchmark gap plus the efficiency gap is a meaningful signal.

If you’re at the stage of speccing out what such an application should do before committing to a model stack, tools like Remy take a different approach to that planning step: you write the application as annotated markdown — a spec where prose carries intent and annotations carry precision — and Remy compiles it into a complete TypeScript backend, database, auth, and deployment. The spec is the source of truth; the model routing decisions live there, not scattered across implementation files.

Where the Benchmark Gap Comes From, in One Sentence

Language is bad at trajectory description. Bounding boxes are not.

The 17-point gap on maze navigation is not a mystery once you understand the reference gap. When a model can anchor each reasoning step to a coordinate, it doesn’t lose track of where it is in a spatial layout. When it can only describe position in words, it does.

The visual primitives technique is a clean solution to a specific problem. The benchmark reflects that cleanness. The limitations reflect the fact that the problem is harder than one benchmark can capture.

For anyone building on vision models right now, the practical question is: does your task involve trajectory description, path tracing, or dense spatial counting? If yes, this benchmark gap is directly relevant to you. If your task is document extraction, image captioning, or visual QA on isolated objects, the GPT-5.4 vs Claude Opus 4.6 comparison and the Claude Mythos multimodal benchmark results are probably more useful anchors for your model selection.

The maze navigation result is real. It’s also narrow. Both things are true.
