DeepSeek Vision vs. Claude Sonnet 4.6 vs. Gemini Flash 3: Which Vision Model Uses 10x Less KV Cache?

DeepSeek's vision model uses ~90 KV cache entries per image vs. ~870 for Sonnet 4.6 and ~1,000 for Gemini Flash 3. Here's what that means for cost.

MindStudio Team

The Vision Model You’re Paying 10x Too Much to Run

DeepSeek’s new vision model uses roughly 90 KV cache entries for an 80×80 image. Claude Sonnet 4.6 uses around 870. Gemini Flash 3 uses approximately 1,000. That’s not a rounding error — that’s a structural difference in how these models represent visual information, and it has direct implications for what you pay every time you process an image.

If you’re choosing a vision model for a production workload right now, that number matters more than almost any benchmark score.

The paper behind this is titled “Thinking with Visual Primitives.” The model itself started rolling out in limited form on April 29, 2026, alongside DeepSeek’s fast and expert modes in the app and on the web. The paper has since become hard to locate; the efficiency numbers are not.


Why KV Cache Is the Number That Actually Matters

Most vision model comparisons lead with accuracy benchmarks. That’s fine for research papers. For production systems, the KV cache size is closer to the real cost driver.

The KV cache is what the model holds in memory while processing a sequence. Larger KV cache means more memory, more compute, more cost per inference. When you’re running thousands of image queries per day, a 10x difference in cache size is a 10x difference in what you’re paying to serve those queries.
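To make that concrete, here is a back-of-envelope sketch of how per-image KV entries turn into memory. The layer count, KV head count, head dimension, and dtype below are illustrative placeholders, not published specs for any of these models; only the entry counts come from the comparison in this article.

```python
# Back-of-envelope KV cache memory per image. The layer count, KV head count,
# head dimension, and dtype bytes are illustrative placeholders, NOT published
# specs for any of these models; only the entry counts come from the article.

def kv_bytes_per_entry(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each cached position stores one key and one value vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

models = {"DeepSeek Vision": 90, "Claude Sonnet 4.6": 870, "Gemini Flash 3": 1000}

for name, entries in models.items():
    mb = entries * kv_bytes_per_entry() / 2**20
    print(f"{name:18s} ~{entries:5d} entries -> ~{mb:6.1f} MB KV cache (illustrative)")
```

The absolute megabyte figures depend entirely on the placeholder constants; the point is that memory grows linearly with entry count, so the entry-count gap is the gap that shows up in serving cost.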


This is why the 80×80 comparison is the right anchor. It’s a controlled, apples-to-apples test: same image resolution, same task, three different models. DeepSeek comes in at ~90 entries. Sonnet 4.6 comes in at ~870. Gemini Flash 3 lands at ~1,000. The gap is not marginal.

The underlying reason for DeepSeek’s efficiency is architectural, not accidental. Their custom vision transformer — they call it the DeepSeek Vision Transformer — applies a three-stage compression pipeline. A 756×756 image first becomes 2,916 patch tokens via 14×14 patches. A 3×3 spatial compression along the channel dimension reduces that to 324 tokens. Then a compressed sparse attention mechanism from the V4 paper compresses the KV cache by another factor of four. End result: 81 entries. Total compression ratio from raw pixels to KV cache: approximately 7,000x.
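The stage-by-stage numbers come straight from the description above; replaying the arithmetic makes the compression ratio easy to verify.

```python
# Replaying the compression arithmetic described above.
raw_pixels = 756 * 756                   # 571,536 pixels per channel
patch_tokens = (756 // 14) ** 2          # 14x14 patches -> 54 * 54 = 2,916 tokens
compressed_tokens = patch_tokens // 9    # 3x3 spatial compression -> 324 tokens
kv_entries = compressed_tokens // 4      # 4x sparse-attention KV compression -> 81

print(patch_tokens, compressed_tokens, kv_entries)              # 2916 324 81
print(f"pixels per KV entry: ~{raw_pixels / kv_entries:,.0f}")  # ~7,056
```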

That’s not a trick. That’s an architecture decision made consistently across multiple model generations.


Three Models, Three Different Philosophies

DeepSeek Vision: Efficiency as a First Principle

DeepSeek’s vision work didn’t start with this model. The lineage runs back to March 2024 with DeepSeek VL, through Janus in October 2024 (which decoupled visual encoders for understanding versus generation), through VL2 in December 2024 (which ported mixture-of-experts and multi-head latent attention into vision), through Janus Pro 7B in January 2025, and through the DeepSeek OCR paper in October 2025.

That OCR paper is worth pausing on. The framing was strange — they called it OCR but the actual idea was: take 1,000 text tokens, render them as an image, encode the image, get back 100 vision tokens that reconstruct the original text at 97% accuracy. Ten-to-one compression on long context. Andrej Karpathy’s reaction: “the tokenizer must go, pixel may be better inputs to language models than text.”

That’s the through-line. Every model in this lineage asks the same question: what’s the cheapest representation that still works? The current vision model is the answer applied to spatial reasoning.

The language backbone is DeepSeek V4 Flash — a 284B parameter mixture-of-experts model with 13B active parameters at inference. Frontier-grade reasoning at a fraction of the activation cost. The vision encoder supports arbitrary resolution. The whole system is designed to be cheap to run.

The technique the paper introduces — “visual primitives” — is also worth understanding in this context. When the model reasons about an image, it can output inline tokens in the format <ref>label</ref><box>x1,y1,x2,y2</box>. These are special vocabulary tokens embedded directly in the chain of thought. Not function calling. Not a separate tool. The model literally points to things as it reasons, the same way a human uses a finger to track objects in a dense scene. Counting people in a team photo, tracing a path through a maze, distinguishing a Chihuahua from a muffin — all of these become more reliable when the model can anchor its reasoning to coordinates rather than language descriptions alone.
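As a minimal sketch of what consuming that format could look like downstream, here is a small parser for the inline tokens. The sample trace and the pixel-coordinate convention are assumptions for illustration, not captured model output.

```python
import re

# Pull <ref>label</ref><box>x1,y1,x2,y2</box> primitives out of a reasoning
# trace. The sample trace below is invented for illustration.
PRIMITIVE = re.compile(r"<ref>(.*?)</ref><box>(\d+),(\d+),(\d+),(\d+)</box>")

trace = (
    "Counting left to right: "
    "<ref>person 1</ref><box>12,40,58,160</box> "
    "<ref>person 2</ref><box>61,38,110,158</box> "
    "so there are 2 people in this region."
)

for label, *coords in PRIMITIVE.findall(trace):
    print(label, "->", tuple(int(c) for c in coords))
```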

The training pipeline that produces this behavior has five stages: multimodal pre-training; separate supervised fine-tuning for grounding (boxes) and pointing (points); GRPO reinforcement learning with three reward heads (format, quality, accuracy) applied to each specialist; a unified RFT stage that merges both specialists; and on-policy distillation into a single student model. Two specialists trained separately, then consolidated. It’s an elegant design.
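As an illustration only, here is a minimal sketch of how three reward heads like the ones named above might be combined into a single scalar per rollout. The weights and the individual scoring functions are assumptions for the sketch; the paper names the heads, not this arithmetic.

```python
# Illustrative only: combining three reward heads (format, quality, accuracy)
# into one scalar per rollout, as GRPO-style training requires. The weights
# and these toy scoring functions are assumptions, not the paper's method.

def format_reward(response: str) -> float:
    # Reward well-formed use of the inline primitive tokens.
    return 1.0 if "<ref>" in response and "<box>" in response else 0.0

def quality_reward(response: str) -> float:
    # Stand-in heuristic; in practice this would be a learned judge.
    return min(len(response.split()) / 100.0, 1.0)

def accuracy_reward(response: str, answer: str) -> float:
    # Exact-match stand-in; grounding tasks would score IoU against gold boxes.
    return 1.0 if answer in response else 0.0

def combined_reward(response: str, answer: str, weights=(0.2, 0.3, 0.5)) -> float:
    w_fmt, w_qual, w_acc = weights
    return (w_fmt * format_reward(response)
            + w_qual * quality_reward(response)
            + w_acc * accuracy_reward(response, answer))
```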


Claude Sonnet 4.6: Capable, But Expensive Per Token

Sonnet 4.6’s ~870 KV cache entries for the same 80×80 image reflect a different set of architectural priorities. Anthropic’s vision models are strong — particularly on document understanding, complex visual QA, and tasks that require integrating visual and textual reasoning over long contexts. If you’re building something where accuracy on nuanced visual tasks is the primary constraint and cost is secondary, Sonnet 4.6 is a reasonable choice.

But ~870 entries versus ~90 is a real cost difference. If you’re running image processing at scale — document pipelines, visual inspection workflows, anything that processes hundreds or thousands of images — you’re paying roughly 10x more per image for the same input. That’s not a small consideration.

For teams evaluating GPT-5.4 vs Claude Opus 4.6 on document processing, the KV cache efficiency question is increasingly part of the calculus, not just benchmark accuracy.

Gemini Flash 3: Fast, But Not Cheap on Vision

Gemini Flash 3 at ~1,000 KV cache entries is the least efficient of the three on this dimension. That’s somewhat surprising given that Flash variants are typically positioned as the cost-efficient option in Google’s lineup. The efficiency story here is about latency and throughput, not about memory footprint per image.

Gemini Flash 3 is still ahead of DeepSeek on raw count QA tasks, according to the paper’s benchmarks. If your use case is straightforward visual question answering — “how many items are in this image,” “what text appears here” — Flash 3 remains competitive. But on topological reasoning tasks, the gap is significant: DeepSeek scores 67% on maze navigation versus 49% for Gemini Flash 3. That’s not a close race.

The 1,000-entry KV cache also means that at scale, Gemini Flash 3 is the most expensive of the three to run on vision tasks, despite being positioned as the budget option for text.


What the Benchmarks Actually Show (and What They Don’t)

The paper is honest in a way that deserves acknowledgment. There’s a footnote that reads: “reported scores cover only a subset of evaluation dimensions directly relevant to the research focus of this paper and are therefore not indicative of the model’s overall capabilities.”

They are not claiming to beat GPT-5.4 across the board. They are claiming to beat it on visually grounded reasoning tasks — specifically topological reasoning, maze navigation, and path tracing. On maze navigation: DeepSeek 67%, GPT-5.4 50%, Gemini Flash 3 49%, Sonnet 4.6 49%. A 17-point gap over GPT on the tasks where pointing-based reasoning has the most leverage.

On counting and spatial reasoning more broadly, the results are mixed. DeepSeek wins some, ties some. Gemini Flash 3 leads on raw count QA. The visual primitives approach has the most impact precisely where language is worst at describing spatial relationships — trajectories, topology, multi-hop spatial chains.


There are also three admitted limitations. First, the model is resolution-bound: fine-grained scenes can still fail. Second, visual primitives mode must be triggered explicitly — the model doesn’t auto-decide when to use it. Third, point-based topological reasoning doesn’t generalize well across all scenarios. The maze benchmark is strong; the generalization isn’t yet universal.

This kind of honesty about limitations is not universal among model providers. DeepSeek consistently publishes what doesn’t work alongside what does.


Which Model to Use, and When

Use DeepSeek Vision if: you’re processing images at scale and cost is a real constraint. The ~10x KV cache efficiency advantage is structural, not situational. For document pipelines, visual inspection, or any workflow where you’re running thousands of image queries, the economics are materially different. The visual primitives technique also makes it the right choice for tasks involving spatial reasoning, counting in dense scenes, or any problem where you need the model to track specific entities through a chain of reasoning.

Use Claude Sonnet 4.6 if: you need strong performance on nuanced visual QA, complex document understanding, or tasks that require integrating visual and textual reasoning over long contexts where accuracy is the primary constraint. The ~870 KV cache entries are a real cost, but Sonnet 4.6’s overall capability profile is broad. For teams already building on Anthropic’s stack, the integration cost of switching may outweigh the efficiency gains on vision specifically. If you’re evaluating Claude Opus 4.7 vs 4.6 and what changed, the vision efficiency question is worth adding to that comparison.

Use Gemini Flash 3 if: your primary use case is straightforward visual QA at speed, and you’re already integrated into Google’s infrastructure. Flash 3 leads on raw count QA and has strong throughput characteristics. But the ~1,000 KV cache entries make it the most expensive of the three on vision at scale, which undercuts its positioning as the budget option.

The honest summary: DeepSeek Vision is the right choice for cost-sensitive, scale-sensitive vision workloads, particularly those involving spatial reasoning. Sonnet 4.6 is the right choice when you need broad visual capability and can absorb the cost. Gemini Flash 3 is competitive on simple visual QA but loses on both efficiency and topological reasoning.
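If you want that guidance expressed as routing logic, a sketch might look like the following. The task categories, model labels, and volume threshold are illustrative assumptions for this post, not real API model identifiers.

```python
# The decision guidance above, as a routing sketch. Task categories, model
# labels, and the volume threshold are illustrative assumptions.

def pick_vision_model(task: str, images_per_day: int, cost_sensitive: bool) -> str:
    spatial = task in {"counting_dense", "maze", "path_tracing", "spatial_chain"}
    if spatial or (cost_sensitive and images_per_day >= 1_000):
        return "deepseek-vision"    # ~90 KV entries per image; strongest on topology
    if task in {"document_understanding", "nuanced_visual_qa"}:
        return "claude-sonnet-4.6"  # ~870 entries; broadest capability profile
    if task == "simple_visual_qa":
        return "gemini-flash-3"     # ~1,000 entries; leads on raw count QA
    return "claude-sonnet-4.6"      # default to the broadest profile
```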


The Broader Pattern

DeepSeek’s vision work is not a series of one-off papers. It’s a consistent research program with a single thesis: find the cheapest representation that still works. MoE for language. Decoupled encoders for vision. Text-as-pixels for long context. Visual primitives as first-class chain-of-thought tokens. Each paper is a chapter in the same argument.

The KV cache numbers are the clearest expression of that thesis. When you’re building a system that processes images — whether it’s a document pipeline, a visual inspection tool, or an agent that needs to reason about spatial relationships — the cost per image is a real constraint. A 10x difference in KV cache size is a 10x difference in what you pay to run it.

For teams building multi-model workflows, this is exactly the kind of tradeoff that matters at the orchestration layer. MindStudio supports 200+ models including the major vision providers, which means you can route image tasks to the most cost-efficient model for the job without rebuilding your pipeline every time the efficiency landscape shifts.


The paper is hard to find. The model is in limited rollout. But the architecture is documented, the benchmarks are published, and the KV cache numbers are real. If you’re making vision model decisions today, those numbers belong in your evaluation.

For teams building applications that incorporate vision pipelines — document processors, inspection tools, spatial reasoning agents — the spec-driven approach is worth considering. Remy compiles annotated markdown specs into complete TypeScript applications with backend, database, and auth, which means the model routing logic and cost constraints can live in the spec rather than being scattered across infrastructure code.

The 7,000x compression ratio from raw pixels to KV cache entries is the headline. The implication is simpler: DeepSeek built a vision model that costs roughly a tenth of what its competitors cost to run on the same image. That’s not a benchmark. That’s an invoice.

For teams comparing GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro on real benchmarks, DeepSeek Vision is now a fourth option that belongs in that conversation — not because it wins every benchmark, but because it wins the one that shows up on your bill.
