
DeepSeek V4 Vision Model: 10x KV-Cache Efficiency and 67% Maze Navigation vs GPT-5.4's 50%

DeepSeek's vision variant uses ~90 KV-cache entries per image vs Claude Sonnet 4.6's ~870 — and beats GPT-5.4 on maze navigation 67% to 50%.

MindStudio Team

DeepSeek’s Vision Model Uses 90 KV-Cache Entries Where Sonnet Uses 870

DeepSeek’s vision variant — built on the V4 Flash backbone — processes an 800×800 image using approximately 90 KV-cache entries. Claude Sonnet 4.6 uses around 870 for the same image. That’s a 10x difference in memory footprint per image, and it’s not an accident. It’s the result of a deliberate multi-stage compression pipeline that DeepSeek has been quietly building toward for two years.

If you’re building vision-heavy pipelines and you’re watching inference costs, this is the number that matters.

The maze navigation benchmark makes the efficiency story even stranger: DeepSeek’s vision model scores 67% on topological reasoning tasks, against GPT-5.4’s 50% and Gemini Flash 3’s 49%. A model with a tenth of the per-image memory footprint is also outperforming frontier models on the class of spatial reasoning tasks where visual grounding matters most.


How DeepSeek Gets to 90 Cache Entries

The compression isn’t a single trick. It’s a pipeline of three stages, each multiplying the reduction.

Start with a 756×756 image. That’s roughly 571,000 pixels. DeepSeek’s vision transformer, which they built from scratch to support arbitrary input resolutions, processes this using 14×14 patches. That initial patch tokenization produces around 2,916 patch tokens.

Then a 3×3 spatial compression step collapses each block of nine adjacent patches into a single token by stacking them along the channel dimension. That brings the token count down to roughly 324.
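For intuition, here is a minimal sketch of that kind of 3×3 space-to-depth step, assuming a square 54×54 patch grid and an embedding width of 1024 (the width is our assumption; the paper’s actual implementation may differ):

```python
import numpy as np

# Hypothetical patch grid: 54x54 patch tokens, each a 1024-dim embedding.
d = 1024
patches = np.random.randn(54, 54, d)

# 3x3 space-to-depth: every 3x3 neighborhood of patches is stacked along the
# channel dimension, turning 54x54 = 2,916 tokens into 18x18 = 324 tokens
# that each carry 9x the channels.
h, w, _ = patches.shape
compressed = (
    patches.reshape(h // 3, 3, w // 3, 3, d)
           .transpose(0, 2, 1, 3, 4)
           .reshape(h // 3, w // 3, 9 * d)
)
print(compressed.shape)  # (18, 18, 9216)
```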


Then the compressed sparse attention mechanism from the V4 paper applies another 4× compression to the KV cache.

End result: approximately 81 entries in the KV cache for a full image. The paper describes this as roughly a 7,000× total compression ratio from raw pixels to KV-cache entries.
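The stage-by-stage arithmetic is easy to reproduce. The figures below are the ones from the pipeline description above; only the rounding is ours:

```python
# Back-of-envelope walk-through of the compression pipeline described above.
pixels = 756 * 756                      # 571,536 raw pixels
patch_tokens = (756 // 14) ** 2         # 14x14 patches -> 54 * 54 = 2,916 tokens
after_spatial = patch_tokens // 9       # 3x3 space-to-depth -> 324 tokens
kv_entries = after_spatial // 4         # 4x sparse-attention compression -> 81
print(kv_entries, round(pixels / kv_entries))  # 81, ~7056x pixels-to-entries
```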

The language backbone underneath all of this is DeepSeek V4 Flash — a 284B parameter mixture-of-experts model with 13B active parameters at inference. You’re getting a model that reasons at frontier quality but only activates 13B parameters per forward pass, combined with a vision encoder that represents each image in a fraction of the memory that competing models require.


The “Thinking With Visual Primitives” Paper

The vision model isn’t just about efficiency. The paper — titled Thinking with Visual Primitives — argues that current multimodal models have two distinct gaps, not one.

The first is the perception gap: models can’t always see fine-grained detail. Most of the 2024 work on high-resolution cropping and dynamic patching was aimed at this.

The second is what the paper calls the reference gap: even when a model sees an image correctly, natural language is too imprecise to point at things reliably. If you ask a model to identify the third bear from the left on a rocky ledge, it can describe what it sees, but it loses track of which entity it’s actually referring to as its reasoning chain gets longer. Humans solve this with a finger. Models, until now, didn’t have an equivalent.

DeepSeek’s solution is to make spatial coordinates first-class tokens in the chain of thought. When the model reasons about an image, it emits bounding boxes inline — a reference tag with a label, followed by a box tag with two corner coordinates. These are special tokens in the model’s vocabulary, not function calls, not a separate tool. The model literally writes <ref>person_3</ref><box>(x1,y1),(x2,y2)</box> mid-thought, then continues reasoning with that anchor in place.
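If you consume these traces downstream, pulling the anchors back out is mechanical. A minimal sketch, assuming the tag syntax matches the example above (the released tokenizer may format labels and coordinates differently):

```python
import re

# Matches the inline <ref>label</ref><box>(x1,y1),(x2,y2)</box> pattern shown above.
PRIMITIVE = re.compile(
    r"<ref>(?P<label>[^<]+)</ref>\s*"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def extract_anchors(reasoning: str) -> list[dict]:
    """Return every grounded reference the model emitted mid-thought."""
    return [
        {"label": m["label"],
         "box": (int(m["x1"]), int(m["y1"]), int(m["x2"]), int(m["y2"]))}
        for m in PRIMITIVE.finditer(reasoning)
    ]

trace = "...the third bear <ref>bear_3</ref><box>(412,96),(488,170)</box> sits on the ledge..."
print(extract_anchors(trace))  # [{'label': 'bear_3', 'box': (412, 96, 488, 170)}]
```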

This is why the maze navigation benchmark matters. Maze path tracing and topological reasoning are exactly the tasks where language is uniquely bad at trajectory description. When the model can point to a cell, mark it, and reason forward from that mark, it doesn’t lose its place. The 67% vs 50% gap over GPT-5.4 isn’t surprising once you understand the mechanism — it’s the expected result of having a reference primitive that GPT-5.4 lacks.


Two Years of the Same Story

The paper didn’t come out of nowhere. DeepSeek has shipped roughly seven vision-related models in 24 months, and the through-line across all of them is the same question: what’s the cheapest representation that still works?

In March 2024, DeepSeek VL used a hybrid SigLIP and SAM encoder. Nothing flashy, but it set the foundation. In October 2024, Janus decoupled the visual encoder for understanding versus generation — most unified multimodal models at the time had a single encoder bottleneck, and Janus said no to that. In December 2024, the VL2 model ported mixture-of-experts and multi-head latent attention from V2 and V3 into vision. A 1B activated-parameter version was scoring 80.9 on OCRBench.

Then in October 2025, the DeepSeek OCR paper landed. The framing was strange — they called it an OCR paper, but the actual idea was: take 1,000 text tokens, render them as an image, encode the image, and get back 100 vision tokens that reconstruct the original text at 97% accuracy. That’s 10× compression on long context. Andrej Karpathy’s reaction was: “the tokenizer must go, pixels may be better inputs to language models than text.” That quote spread fast, and it’s what put DeepSeek’s vision team on a lot of people’s radar.

The Thinking with Visual Primitives paper is the next chapter in that same story. Each release has been asking the same question and finding a more aggressive answer.


What the Benchmarks Actually Claim (and Don’t)

The paper is honest about scope in a way that a lot of coverage will skip. There’s a footnote that says the reported scores cover only a subset of evaluation dimensions directly relevant to the research focus, and are therefore not indicative of the model’s overall capabilities.

DeepSeek is not claiming this beats GPT-5.4 across the board. They’re claiming it beats GPT-5.4 on visually grounded reasoning tasks — maze navigation, path tracing, counting in dense scenes. That’s a narrower and more defensible claim.

On raw count QA, Gemini Flash 3 is still ahead. On general vision benchmarks, the paper doesn’t make sweeping claims. The three limitations they explicitly flag: the model is resolution-bound (fine-grained scenes can still trip it up), the visual primitives mode has to be triggered explicitly rather than being auto-selected, and the point-based topological reasoning doesn’t generalize well across all scenarios.

That kind of honesty is worth crediting. The paper is telling you where the model works and where it doesn’t. That’s more useful than a leaderboard screenshot.

For comparison, if you’re evaluating where this sits relative to other frontier models on general tasks, the GPT-5.4 vs Claude Opus 4.6 comparison covers the broader capability landscape — DeepSeek V4 Flash slots in below both on general benchmarks but significantly undercuts them on cost.


Why the Efficiency Gap Has Practical Consequences

The 10× KV-cache reduction isn’t just a benchmark curiosity. KV-cache size directly affects how many concurrent image requests you can serve on a given GPU, how much memory you need to batch vision requests, and therefore what your actual cost per image looks like at scale.

If you’re running a pipeline that processes thousands of images — document extraction, visual QA over product catalogs, screenshot analysis for agents — the difference between 90 and 870 cache entries per image means roughly 9× more concurrent requests fit in the same memory budget, or equivalently, that you can hit the same throughput with far fewer GPUs.
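To make that concrete, here is the back-of-envelope memory arithmetic. The layer count, KV-head count, head dimension, and fp16 storage are illustrative assumptions, not published figures for either model:

```python
def kv_bytes(entries, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Keys and values, per layer, per KV head (hence the final factor of 2).
    return entries * layers * kv_heads * head_dim * dtype_bytes * 2

per_image_90 = kv_bytes(90)    # ~11 MiB per image under these assumptions
per_image_870 = kv_bytes(870)  # ~109 MiB per image under the same assumptions
print(per_image_90 / 2**20, per_image_870 / 2**20, per_image_870 / per_image_90)
```

The absolute numbers shift with the real architecture parameters, but the ratio (roughly 9.7×) does not, and the ratio is what determines how many images fit in a fixed cache budget.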

DeepSeek V4 Flash is already priced at $1.74/M input tokens and $3.48/M output tokens. Compare that to Claude Opus 4.7 at $5/M input and $25/M output, or GPT-5.5 at $5/M input and $30/M output. The vision model inherits that pricing, and then the KV-cache efficiency multiplies the effective cost advantage further when you’re serving images at volume.
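A quick way to see how the list prices compound. The per-million-token rates are the ones quoted above; the per-request token counts are illustrative assumptions:

```python
PRICES = {  # ($ per million input tokens, $ per million output tokens)
    "deepseek-v4-flash": (1.74, 3.48),
    "claude-opus-4.7": (5.00, 25.00),
    "gpt-5.5": (5.00, 30.00),
}

def request_cost(model, input_tokens, output_tokens):
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Hypothetical vision request: 2,000 input tokens (prompt + image), 500 output tokens.
for model in PRICES:
    print(model, round(request_cost(model, 2_000, 500), 5))
```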

Platforms like MindStudio that support 200+ models and let you route requests across providers make this kind of cost arbitrage practical without writing custom routing logic — you can swap the vision backend and measure the quality difference without rebuilding your pipeline.


The Architecture Decision Worth Watching

The training pipeline for the vision model is worth a closer look because it’s unusual.


DeepSeek trained two separate specialist models first: one for “thinking with grounding” using bounding boxes, one for “thinking with pointing” using point coordinates. Each got its own supervised fine-tuning pass, then its own reinforcement learning with GRPO using three reward heads (format, quality, and accuracy). Then a unified reward-free distillation step merged them into a single student model.
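The paper’s exact training recipe isn’t reproduced here, but the GRPO mechanics it references are standard. A minimal sketch of folding three reward heads into group-relative advantages; the weights and the toy samples are entirely our assumptions:

```python
import numpy as np

def combined_reward(sample, w_format=0.2, w_quality=0.3, w_accuracy=0.5):
    # Three scalar heads per rollout; the weighting is illustrative, not from the paper.
    return (w_format * sample["format"]
            + w_quality * sample["quality"]
            + w_accuracy * sample["accuracy"])

def grpo_advantages(group):
    # GRPO normalizes each rollout's reward against its group's mean and std.
    rewards = np.array([combined_reward(s) for s in group])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

group = [
    {"format": 1.0, "quality": 0.7, "accuracy": 1.0},
    {"format": 1.0, "quality": 0.5, "accuracy": 0.0},
    {"format": 0.0, "quality": 0.4, "accuracy": 0.0},
]
print(grpo_advantages(group))  # highest-reward rollout gets the largest advantage
```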

The result is a model that can do both box-based and point-based visual reasoning, but was trained as two specialists before being consolidated. This is a more expensive training approach than training a single model end-to-end, but it avoids the compromise that comes from training a single model to do two different things simultaneously.

The same logic — decouple the hard parts, train specialists, then merge — showed up in Janus with the dual encoder design. It’s becoming a pattern in how DeepSeek approaches multimodal architecture.

This is also relevant context for anyone building vision-heavy agents. The model isn’t a general-purpose vision model that happens to do spatial reasoning. It’s a spatial reasoning model that was explicitly trained to ground its chain of thought in coordinates. If your use case involves counting, path tracing, or any task where “which specific thing am I talking about” is the hard part, that’s the design target.

For teams evaluating open-weight alternatives for vision tasks, the Gemma 4 vs Qwen 3.5 open-weight comparison covers the other main contenders in the sub-frontier open-weight space — neither has published comparable KV-cache efficiency numbers.


The Rollout Status

As of late April 2026, DeepSeek started rolling out vision mode in the app and on the web, at least in limited testing alongside their fast and expert modes. The paper itself became hard to find shortly after publication — it appeared briefly and then was pulled back, which is unusual.

The model behind the paper appears to be in a gradual rollout. The OpenRouter model ID for DeepSeek V4 Flash is deepseek/deepseek-v4-flash, which is the text backbone. The vision variant isn’t separately listed yet on OpenRouter as of this writing, but the underlying architecture is the same V4 Flash base.
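If you want to try the text backbone today, OpenRouter exposes it through its OpenAI-compatible endpoint. A minimal sketch; the model ID is the one quoted above, and whether the vision variant will reuse it is unknown:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",  # text backbone; no separate vision ID yet
    messages=[{"role": "user", "content": "Explain KV-cache compression in one sentence."}],
)
print(response.choices[0].message.content)
```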

If you’re building something that needs to go to production before the vision variant is widely available, the text model is accessible now. The vision capabilities are the layer on top.


What to Watch For

The immediate thing to test, once the vision model is in wider rollout, is whether the KV-cache efficiency holds at higher resolutions. The paper reports ~90 entries for an 800×800 image. The architecture supports arbitrary resolution, but the compression ratios are resolution-dependent. A 1024×1024 image will produce more patch tokens before compression, and the final cache size will be larger — the question is whether the compression ratio stays roughly constant or degrades.
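Under the same three-stage arithmetic as before (14×14 patches, 3×3 spatial compression, 4× KV compression), you can project what higher resolutions would cost. Padding and rounding are ignored here, so treat this as an order-of-magnitude estimate rather than a published number:

```python
def projected_kv_entries(side, patch=14, spatial=3, sparse=4):
    # Same stage ratios as the 756x756 walk-through earlier in the article.
    patch_tokens = (side // patch) ** 2
    return patch_tokens // (spatial ** 2) // sparse

for side in (756, 800, 1024):
    print(side, projected_kv_entries(side))  # 756 -> 81, 800 -> 90, 1024 -> 148
```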

The second thing to watch is the “visual primitives mode must be triggered explicitly” limitation. Right now, the model doesn’t auto-decide when to use bounding box reasoning versus standard vision processing. That’s a usability gap for agent use cases where you want the model to self-select the right reasoning mode. When that becomes automatic, the maze navigation numbers will likely improve further.

For anyone building vision agents specifically, the GPT-5.4 Mini vs Claude Haiku 4.5 sub-agent comparison is a useful reference for thinking about where a highly efficient vision model like this fits in a multi-tier agent architecture — the cost profile of V4 Flash makes it a plausible sub-agent for visual tasks even when you’re using a more capable model as the orchestrator.

The broader question the DeepSeek OCR paper raised — whether pixels are better inputs than tokens for certain tasks — is now being answered in practice. A model that represents an image in 90 KV-cache entries and outperforms GPT-5.4 on spatial reasoning is a concrete data point in that argument. The tokenizer may not be going anywhere soon, but the case for pixel-native representations is getting harder to dismiss.

If you’re building applications where the spec is “process images, reason about spatial relationships, do it cheaply at scale,” the architecture described in Thinking with Visual Primitives is worth understanding in detail. Tools like Remy take a different approach to the spec-to-production problem — you write annotated markdown and compile a full TypeScript stack from it — but the underlying principle is similar: the right representation at the right level of abstraction produces better results than forcing everything through a single bottleneck.

The DeepSeek vision team has been asking the same question for two years. The answer keeps getting more interesting.

Presented by MindStudio
