
DeepSeek Vision's 7,000x Image Compression Pipeline: From 756px Input to 81 KV Cache Entries

DeepSeek's vision model compresses a 756×756 image through four stages down to 81 KV cache entries — a ~7,000x total compression ratio. Here's each step.

MindStudio Team

A 756-Pixel Image Walks Into a KV Cache

DeepSeek’s vision model compresses a 756×756 image down to just 81 KV cache entries — a total compression ratio of roughly 7,000x. That number sounds made up until you trace the pipeline step by step and see exactly where each compression happens.

This post does that. You’ll understand each stage, why it exists, and what the tradeoffs are. By the end, you’ll have a concrete mental model of how DeepSeek built a vision system that costs about a tenth as much to run as Claude Sonnet 4.6 or Gemini Flash 3 on the same image.


Why Compression Matters Here (Not Just “Efficiency”)

Before walking through the pipeline, it helps to understand what you’re actually paying for when you run a vision model.

Every image token that enters an LLM’s attention mechanism gets stored in the KV cache — the key-value pairs that attention reads during generation. More image tokens means more memory, more compute, and more cost per inference call. For an 80×80 image, Claude Sonnet 4.6 generates around 870 KV cache entries. Gemini Flash 3 generates around 1,000. DeepSeek’s new vision model generates about 90.

That’s not a rounding difference. That’s a structural difference in how the model was designed.

The compression pipeline is the mechanism behind that gap. It’s a four-stage funnel that takes raw pixel data and aggressively reduces it at each step, while trying to preserve the information that actually matters for reasoning. Understanding each stage tells you something real about the architectural choices DeepSeek made — and why those choices compound.


What You Need to Follow This

You don’t need to run any code to follow along. This is an architectural walkthrough, not a tutorial with a terminal.

It helps to have a rough mental model of how vision transformers work: images get divided into patches, patches get embedded as vectors, and those vectors flow into an attention mechanism. If you’ve read anything about ViT (Vision Transformer) or CLIP, you’re in good shape. If not, the key intuition is just that “tokens” in a vision model are small square chunks of the image, not words.

The paper this is based on is titled Thinking with Visual Primitives. It was published and then became difficult to find — as of the video this analysis draws from, it had been removed from easy circulation. The architecture details here come from that paper and the DeepSeek V4 Flash technical documentation it references.

One more thing worth knowing: the language backbone here is DeepSeek V4 Flash, a 284-billion-parameter mixture-of-experts model with 13 billion active parameters at inference. The vision encoder is a custom component called the DeepSeek Vision Transformer. Both matter for understanding the pipeline.


The Four-Stage Compression Pipeline

Stage 1: Raw image → patch tokens (571,000 pixels → 2,916 tokens)

Start with a 756×756 image. That’s 571,536 pixels. Raw pixel data is not what goes into the model — it’s too large and too redundant.

The DeepSeek Vision Transformer divides the image into non-overlapping 14×14 pixel patches. At 756×756, that gives you 54 patches along each dimension, for a total of 54 × 54 = 2,916 patch tokens.

Each patch token is a vector embedding representing one 14×14 square of the image. This is standard ViT behavior. The notable thing here is that the DeepSeek Vision Transformer supports arbitrary resolution — you’re not forced to resize everything to a fixed square before encoding. That matters for documents, screenshots, and anything with fine text.

Now you have: 2,916 patch tokens representing the full image.

The compression so far is about 196x (571,536 pixels → 2,916 tokens), but this is mostly just the standard patch embedding step. The interesting compression happens next.
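To make that arithmetic concrete, here's a minimal numpy sketch of the patch split. The 14×14 patch size and the 756×756 input come from the paper; the embedding width and the projection weights are stand-ins, since the real encoder's dimensions aren't public.

```python
import numpy as np

PATCH = 14           # patch size from the paper
H = W = 756          # input resolution
EMBED_DIM = 1024     # assumed width; the real value isn't public

image = np.random.rand(H, W, 3)   # stand-in for real pixel data

# Split into non-overlapping 14x14 patches -> a 54x54 grid
grid = H // PATCH                                         # 54
patches = image.reshape(grid, PATCH, grid, PATCH, 3)      # split rows and cols
patches = patches.transpose(0, 2, 1, 3, 4)                # (54, 54, 14, 14, 3)

# Flatten each patch and project it to an embedding (random weights here)
flat = patches.reshape(grid * grid, PATCH * PATCH * 3)    # (2916, 588)
W_embed = np.random.rand(PATCH * PATCH * 3, EMBED_DIM)
tokens = flat @ W_embed                                   # (2916, 1024)

print(H * W, "pixels ->", tokens.shape[0], "patch tokens")  # 571536 -> 2916
```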


Stage 2: Spatial compression (2,916 tokens → 324 tokens)

This is where DeepSeek starts doing something non-standard.

After patch embedding, they apply a 3×3 spatial merge: nine adjacent patch tokens are stacked along the channel dimension and projected back down to a single token. A 3×3 grid of patches becomes one token.

Starting from 54×54 patches, a 3×3 merge gives you 18×18 = 324 tokens.

This is roughly analogous to pooling in a convolutional network — you’re trading spatial resolution for a more compact representation. The bet is that for most reasoning tasks, you don’t need per-patch granularity; you need neighborhood-level structure.

Now you have: 324 tokens, down from 2,916. That’s a 9x reduction at this stage.
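Continuing the sketch from Stage 1, the merge itself is just a reshape plus a projection. Concatenate-then-project is a common way to implement this kind of token merging; whether DeepSeek's encoder does exactly this is not confirmed, and the dimensions are still assumed.

```python
import numpy as np

GRID, EMBED_DIM, M = 54, 1024, 3    # 54x54 tokens in, 3x3 merge

tokens = np.random.rand(GRID, GRID, EMBED_DIM)   # stand-in Stage 1 output

# Group each 3x3 neighborhood and stack it along the channel dimension
merged = tokens.reshape(GRID // M, M, GRID // M, M, EMBED_DIM)
merged = merged.transpose(0, 2, 1, 3, 4)                          # (18, 18, 3, 3, C)
merged = merged.reshape(GRID // M, GRID // M, M * M * EMBED_DIM)  # (18, 18, 9C)

# Project the 9x-wide vector back down to one token's width (random weights)
W_proj = np.random.rand(M * M * EMBED_DIM, EMBED_DIM)
out = merged @ W_proj                                             # (18, 18, C)

print(GRID * GRID, "tokens ->", out.shape[0] * out.shape[1])      # 2916 -> 324
```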


This stage is also the source of the resolution-bound limitation the DeepSeek team admits to. In a very fine-grained scene (dense text at small size, or a crowd of small faces), this compression step can lose exactly the detail you need. The model's three admitted limitations include this one: fine-grained scenes can still trip it up.


Stage 3: Compressed sparse attention (324 tokens → 81 KV cache entries)

The third stage is the most architecturally distinctive, and it comes directly from the DeepSeek V4 paper.

DeepSeek V4 introduced a compressed sparse attention mechanism as part of its multi-head latent attention (MLA) design. Applied to the vision tokens, this mechanism compresses the KV cache by another factor of 4.

324 ÷ 4 = 81 KV cache entries.

The mechanism works by learning which token interactions are most important and compressing the key-value representations accordingly — rather than storing a full KV entry for every token, it stores a compressed latent that can be expanded during attention computation. This is different from simply dropping tokens; the information is still there, just in a more compact form.
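The paper doesn't spell out the vision-side mechanism, so here's one way to picture a 4x reduction in entry count: each block of four adjacent tokens gets projected into a single cached key-value pair, in the spirit of block-compressed sparse attention. The block size, dimensions, and projections below are all assumptions, not confirmed details.

```python
import numpy as np

N_TOKENS, D_MODEL, BLOCK = 324, 1024, 4   # 4 tokens per compressed entry

tokens = np.random.rand(N_TOKENS, D_MODEL)   # stand-in Stage 2 output

# Concatenate each block of 4 adjacent tokens and project the result into
# a single key and a single value entry (random weights here). Only these
# 81 compressed entries are stored in the cache.
blocks = tokens.reshape(N_TOKENS // BLOCK, BLOCK * D_MODEL)   # (81, 4096)
W_k = np.random.rand(BLOCK * D_MODEL, D_MODEL)
W_v = np.random.rand(BLOCK * D_MODEL, D_MODEL)
k_cache = blocks @ W_k   # (81, 1024) keys
v_cache = blocks @ W_v   # (81, 1024) values

print(N_TOKENS, "tokens ->", k_cache.shape[0], "KV cache entries")  # 324 -> 81
```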

Now you have: 81 KV cache entries for the full image.

For comparison: Sonnet 4.6 uses ~870 entries for an 80×80 image. DeepSeek uses ~90 for the same size. The gap isn't about feeding the model a smaller image; it's about compression being applied at the KV cache level, not just at the token level.


Stage 4: The full stack (81 entries → LLM reasoning)

The 81 KV cache entries feed into the DeepSeek V4 Flash language backbone alongside the text tokens. From here, the model reasons normally — except it has access to a compact but information-dense visual representation.

The total compression ratio from raw pixels to KV cache entries: approximately 7,000x.

To put that in concrete terms: a 756×756 image has 571,536 pixels. The model reasons over 81 KV entries. That's 7,056x compression. The claim is that 81 entries is enough to support the kinds of visual reasoning tasks the model was trained for.
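The whole funnel reduces to a few lines of arithmetic. This just recomputes the ratios quoted above:

```python
pixels        = 756 * 756           # 571,536 raw pixels
patch_tokens  = (756 // 14) ** 2    # 2,916 after Stage 1
merged_tokens = patch_tokens // 9   # 324 after Stage 2
kv_entries    = merged_tokens // 4  # 81 after Stage 3

for name, n in [("patch embed", patch_tokens),
                ("3x3 merge", merged_tokens),
                ("sparse KV", kv_entries)]:
    print(f"{name:12} {n:>6,} entries  ({pixels / n:,.0f}x vs raw pixels)")
# patch embed   2,916 entries  (196x vs raw pixels)
# 3x3 merge       324 entries  (1,764x vs raw pixels)
# sparse KV        81 entries  (7,056x vs raw pixels)
```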

Now you have: a complete picture of the pipeline — patch embedding, spatial compression, compressed sparse attention, and LLM reasoning.


Where This Fits in DeepSeek’s Vision History

This pipeline didn’t appear from nowhere. DeepSeek has been publishing a consistent thread of vision work since March 2024, and each release pushed the same underlying question: what’s the cheapest representation that still works?

DeepSeek VL (March 2024) was a modest release in 1.3B and 7B sizes, using a hybrid SigLIP and SAM encoder. Nothing flashy, but it established the foundation.

Janus (October 2024) decoupled the visual encoder for understanding versus generation — most unified multimodal models at the time used a single encoder and forced a compromise. Janus ran two encoders sharing one transformer.

VL2 (December 2024) ported mixture-of-experts and multi-head latent attention from V2 and V3 into the vision stack. A 1B activated-parameter version scored 80.9 on OCRBench and 88.9 on DocVQA: small activations, large numbers.

Then in October 2025, DeepSeek OCR reframed the whole project. The paper’s actual finding: take 1,000 text tokens, render them as an image, encode the image, and you get back 100 vision tokens that reconstruct the original text at 97% accuracy. That’s 10x compression on long context. Andrej Karpathy’s reaction was direct: “the tokenizer must go, pixel may be better inputs to language models than text.”

The Visual Primitives model is the current endpoint of that lineage. The compression pipeline described above is the vision side of the same story the OCR paper started telling.


What the Compression Enables (and What It Doesn’t)

The efficiency numbers are real, but they’re not the whole story. The compression pipeline is what makes the model cheap to run. The Thinking with Visual Primitives technique is what makes it useful for spatial reasoning tasks.

The visual primitives approach lets the model output bounding box coordinates as special vocabulary tokens inline in its chain of thought — <ref>label</ref><box>x1,y1,x2,y2</box> — not as function calls, not as a separate tool. The model literally points to things mid-thought. On the maze navigation benchmark, this combination of efficient encoding and grounded reasoning produces a 67% score versus 49% for Gemini Flash 3, 50% for GPT-5.4, and 49% for Sonnet 4.6. That’s a 17-point gap over GPT on topological reasoning tasks. (For a direct comparison of GPT-5.4 and Claude Sonnet 4.6 on other dimensions, this breakdown of GPT-5.4 vs Claude Opus 4.6 covers the broader capability picture.)
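Because the boxes arrive inline as plain text, downstream code can recover them with a simple parse. Here's a sketch: the tag format comes from the article, but the coordinate scale and everything around the tags are assumptions.

```python
import re

# Tag format from the article; coordinate scale and separators are assumed.
PATTERN = re.compile(
    r"<ref>(.*?)</ref><box>([\d.]+),([\d.]+),([\d.]+),([\d.]+)</box>"
)

def extract_groundings(trace: str):
    """Return (label, (x1, y1, x2, y2)) pairs found in a reasoning trace."""
    return [(label, tuple(float(c) for c in coords))
            for label, *coords in PATTERN.findall(trace)]

thought = ("The exit is here: <ref>exit door</ref>"
           "<box>0.71,0.22,0.80,0.41</box>, so turn right.")
print(extract_groundings(thought))
# [('exit door', (0.71, 0.22, 0.8, 0.41))]
```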

But the paper is honest about where this breaks down. Three admitted limitations:

  1. Resolution-bound. Fine-grained scenes still fail. The 3×3 spatial compression at Stage 2 is the culprit: it loses per-patch detail that dense scenes require.
  2. Visual primitives mode must be triggered explicitly. The model doesn’t auto-decide when to use bounding box reasoning. You have to ask for it.
  3. Point-based topological reasoning doesn’t generalize well across scenarios. The maze results are impressive, but the paper’s own footnote says: “reported scores cover only a subset of evaluation dimensions directly relevant to the research focus of this paper and are therefore not indicative of the model’s overall capabilities.” They are not claiming to beat GPT-5.4 across the board.

That footnote matters. A lot of coverage will skip it.


The Training Pipeline Behind the Compression

The architecture is one piece. The training pipeline is the other.

DeepSeek used five stages to get here:

  1. Multimodal pre-training on trillions of tokens — standard foundation work.
  2. Specialized SFT — two separate models trained, one for grounding (bounding boxes), one for pointing (coordinate points).
  3. GRPO reinforcement learning on each specialist, with three reward heads: format, quality, and accuracy.
  4. Unified RFT merging both specialists.
  5. On-policy distillation into a single student model.

The key insight is stages 2–4: train two specialists, then consolidate. This is cleaner than trying to train a single model to do both from the start, because the grounding and pointing tasks have different failure modes during RL. Merging after RL convergence avoids the interference.
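The paper names the three reward heads for stage 3 but not how they're combined, so here's a hypothetical sketch of the general shape. The weights and the scoring functions are illustrative stand-ins, not DeepSeek's actual reward model.

```python
# Hypothetical reward combination for the GRPO stage. The three heads
# (format, quality, accuracy) come from the paper; the weights and the
# scoring functions below are illustrative stand-ins.
WEIGHTS = {"format": 0.2, "quality": 0.3, "accuracy": 0.5}

def format_reward(output: str) -> float:
    # e.g., reward well-formed grounding: every <ref> paired with a <box>
    return 1.0 if output.count("<ref>") == output.count("<box>") else 0.0

def quality_reward(output: str) -> float:
    return 0.8   # stand-in for a learned quality scorer

def accuracy_reward(output: str, target: str) -> float:
    return 1.0 if target in output else 0.0   # stand-in for IoU / exact match

def total_reward(output: str, target: str) -> float:
    heads = {
        "format": format_reward(output),
        "quality": quality_reward(output),
        "accuracy": accuracy_reward(output, target),
    }
    return sum(WEIGHTS[k] * heads[k] for k in WEIGHTS)
```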

If you’re building multi-model workflows where different models handle different reasoning modalities before a final synthesis step, this is a useful pattern to know. Platforms like MindStudio handle this kind of orchestration — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which makes it practical to experiment with specialist-then-merge architectures without writing the orchestration code yourself.


The Rollout and What’s Still Unknown


On April 29, 2026, DeepSeek started rolling out vision mode in the app and on the web alongside their fast and expert modes. It's a limited test rather than a full release: the paper itself is hard to find, and the model appears to be in a gradual rollout.

The V4 Flash backbone — 284B parameters, 13B active — is cited in the paper as reference number three, described as “highly efficient million token context intelligence.” That context window matters: the compression pipeline is what makes it practical to run vision over long documents without the KV cache exploding.

For teams building document processing pipelines or visual QA systems, the cost implication is direct. If you’re currently using Sonnet 4.6 or Gemini Flash 3 for image-heavy workloads, the ~10x KV cache difference translates roughly to ~10x lower inference cost on the vision portion of your calls. That’s not a marginal improvement — it changes what’s economically viable to build.

The broader question this raises is about abstraction levels in AI application development. When a model’s compression pipeline is this efficient, the bottleneck shifts from inference cost to application architecture — how you structure the calls, what you do with the outputs, how you chain reasoning steps. Tools like Remy take a different approach to that layer: you write a spec in annotated markdown, and the full-stack application — TypeScript backend, database, auth, deployment — gets compiled from it. The spec is the source of truth; the generated code is derived output. That’s a different kind of compression, applied to the development process rather than the image.


Where the Compression Story Goes Next

The through line across DeepSeek’s vision work — from VL to Janus to VL2 to OCR to Visual Primitives — is a consistent bet that representation efficiency compounds. Each paper asks: can we get the same reasoning quality from a smaller representation?

The OCR paper showed 10x compression on text-as-pixels. The Visual Primitives pipeline shows 7,000x compression from raw image to KV cache. These aren’t independent results — they’re the same research direction applied at different points in the stack.

The three limitations the team admits are real constraints, not false modesty. Resolution-bound failure, explicit mode triggering, and limited generalization on topological tasks are all places where the compression is losing information that harder tasks need. The next version of this work will probably address at least one of them.

For now, the pipeline is: 756×756 → 2,916 patch tokens → 324 tokens → 81 KV cache entries. Each step has a reason. The total is 7,000x. And the cost difference is real enough to matter for anything you’re building at scale.

If you’re evaluating which vision model to use for a production workload, the comparison of Gemma 4’s edge deployment models is worth reading alongside this — it covers a different efficiency approach (running multimodal models on-device) that solves a related but distinct problem. And if you’re tracking the broader model landscape, the Qwen 3.5 overview covers another open-weight model family taking a different path to efficient inference.

The compression pipeline is the mechanism. The question is what you build with it.
