DeepSeek's 'Thinking with Visual Primitives': 5 Technical Breakthroughs in the Paper That Briefly Disappeared

DeepSeek's vision paper was published then pulled. Here are 5 key technical details — including inline bounding-box tokens and a 7,000x compression ratio.

MindStudio Team

DeepSeek Published a Vision Paper, Then Pulled It. Here Are 5 Technical Details Worth Understanding.

The paper was called “Thinking with Visual Primitives.” It appeared, got cited, and then became hard to find. That alone makes it worth paying attention to.

Here are 5 specific technical ideas buried in it — starting with the one that gives the paper its name.


1. Visual Primitives Are Inline Tokens in the Chain of Thought, Not a Tool Call

This is the central idea, and it’s cleaner than it sounds.

When you ask a current multimodal model to count people in a crowded photo, the model tries to do it in language. It describes what it sees, keeps a running tally, and frequently loses track. The problem isn’t that the model can’t see — it’s that language is a bad medium for spatial reference. Words drift. Pronouns get ambiguous. “The third person from the left” stops meaning anything specific after four more sentences of reasoning.

DeepSeek’s answer: give the model a way to point.

The format is <ref>label</ref><box>x1,y1,x2,y2</box>. These are special tokens in the model’s vocabulary. Not a function call. Not a separate grounding module invoked via API. Not a tool the model decides to use. They appear inline in the chain of thought, the same way a human might circle something on a whiteboard mid-sentence.
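
To make that concrete, here is what an inline emission might look like mid-reasoning. The labels and coordinates below are invented for illustration; only the <ref>/<box> syntax comes from the paper:

```
There appear to be three people near the net.
<ref>person_1</ref><box>112,64,188,305</box> is closest to the camera,
<ref>person_2</ref><box>201,58,260,290</box> is partially occluded, and
<ref>person_3</ref><box>268,71,330,298</box> stands at the far post.
```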

So when the model is working through “count the men in this team photo,” it doesn’t just narrate — it emits a bounding box for each person it identifies, right there in the reasoning trace. Each entity gets a coordinate. The model can refer back to it. The reference doesn’t drift.
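
Because the boxes are literal tokens, anything downstream can recover the referenced entities mechanically, with no pronoun resolution required. Here is a minimal Python sketch of that property (the regex, the trace, and the labels are assumptions for illustration; the paper specifies only the token syntax):

```python
import re

# Matches the inline grounding tokens described above:
# <ref>label</ref><box>x1,y1,x2,y2</box>
GROUNDING = re.compile(
    r"<ref>(?P<label>[^<]+)</ref>"
    r"<box>(?P<coords>\d+,\d+,\d+,\d+)</box>"
)

def extract_entities(trace: str) -> list[tuple[str, tuple[int, ...]]]:
    """Pull every (label, box) pair out of a reasoning trace."""
    return [
        (m["label"], tuple(int(v) for v in m["coords"].split(",")))
        for m in GROUNDING.finditer(trace)
    ]

# Hypothetical trace for "count the men in this team photo"
trace = (
    "Scanning left to right. "
    "<ref>man_1</ref><box>40,60,120,310</box> front row. "
    "<ref>man_2</ref><box>135,55,210,300</box> behind him. "
    "<ref>man_3</ref><box>225,62,295,305</box> at the edge. "
    "Three distinct boxes, so the count is 3."
)

print(extract_entities(trace))
# [('man_1', (40, 60, 120, 310)), ('man_2', (135, 55, 210, 300)),
#  ('man_3', (225, 62, 295, 305))]
```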

The paper calls this the “reference gap” — distinct from the “perception gap” that most 2024-era multimodal work was trying to solve. High-resolution cropping, dynamic patching, thinking with images — all of that is about seeing better. Visual primitives are about pointing better. The distinction matters.


2. The Training Pipeline Has Five Stages and Three Reward Heads

The architecture is a 284B-parameter mixture-of-experts model with 13B parameters active at inference — the DeepSeek V4 Flash backbone. Frontier-grade reasoning at a fraction of the inference cost. That part is consistent with DeepSeek’s pattern.

The training pipeline is where the design gets interesting.

Stage one: multimodal pre-training on trillions of tokens. Standard.

Stage two: supervised fine-tuning, but split. They train two separate models — one for grounding (bounding boxes) and one for pointing (coordinate points). Not one model trying to do both. Two specialists.

Stage three: reinforcement learning with GRPO on each specialist, using three reward heads: format, quality, and accuracy. Format rewards ensure the model actually emits valid <ref><box> tokens. Quality rewards push toward useful intermediate reasoning. Accuracy rewards measure whether the final answer is correct. Three separate signals, not a single blended loss; a sketch of how three such signals might combine appears after the stage list.

Stage four: unified RFT that merges the two specialists.

Stage five: on-policy distillation into a single student model.
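
To make the three-signal idea concrete, here is a sketch of how a GRPO group might be scored with separable reward heads. This is not DeepSeek’s code: the heuristics, weights, and function names are invented, and a real quality head would be a learned judge rather than a regex count.

```python
import re
import statistics

BOX = re.compile(r"<ref>[^<]+</ref><box>\d+,\d+,\d+,\d+</box>")

def format_reward(trace: str) -> float:
    """Head 1: did the model emit at least one well-formed <ref><box> pair?"""
    return 1.0 if BOX.search(trace) else 0.0

def quality_reward(trace: str) -> float:
    """Head 2: stand-in for a learned judge of intermediate reasoning.
    Invented heuristic: reward traces that ground their reasoning in boxes."""
    words = max(len(trace.split()), 1)
    return min(len(BOX.findall(trace)) / max(words / 50, 1.0), 1.0)

def accuracy_reward(answer: str, gold: str) -> float:
    """Head 3: is the final answer correct?"""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def grpo_advantages(traces, answers, gold, w=(0.2, 0.3, 0.5)):
    """Score one GRPO group: combine the three heads only at the end,
    then normalize rewards within the group (placeholder weights w)."""
    rewards = [
        w[0] * format_reward(t)
        + w[1] * quality_reward(t)
        + w[2] * accuracy_reward(a, gold)
        for t, a in zip(traces, answers)
    ]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

The point of keeping the heads separate is diagnostic: you can see whether a trace failed on form, reasoning, or final answer, instead of reading one blended scalar.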

The elegance here is the separation of concerns. Train specialists, then consolidate. It’s the same logic behind mixture-of-experts at the architecture level, applied to the training process itself. You don’t ask one model to learn two different grounding behaviors simultaneously — you let each one get good at its task, then merge.

This is meaningfully different from how most labs approach multimodal fine-tuning, which tends to be a single SFT pass on a combined dataset followed by RLHF. The five-stage pipeline with separate RL reward heads for each grounding modality is a more deliberate design.


3. The Benchmark Numbers Are Specific — and Honestly Scoped

The headline result is maze navigation: DeepSeek scores 67%, against 49% for Gemini Flash 3, 50% for GPT-5.4, and 49% for Claude Sonnet 4.6. That’s roughly a 17-point gap over GPT-5.4 on a task that requires following a path through a spatial structure.

Path tracing shows a similar story. Counting and spatial reasoning are more mixed — Gemini Flash 3 is still ahead on raw count QA.

The reason visual primitives help on maze navigation specifically is that language is uniquely bad at trajectory descriptions. “Go right, then up, then right again at the junction” breaks down fast in a complex maze. A model that can emit coordinate references as it reasons — marking waypoints in its chain of thought — has a structural advantage on exactly these tasks.

The paper is honest about the scope. There’s a footnote: “reported scores cover only a subset of evaluation dimensions directly relevant to the research focus of this paper and are therefore not indicative of the model’s overall capabilities.” They are not claiming to beat GPT-5.4 across the board. They are claiming to beat it on visually grounded spatial reasoning tasks. That’s a narrower, more defensible claim.

This kind of scoping is rare. Most benchmark announcements bury the caveats. DeepSeek put it in a footnote, but they put it in.


4. The Vision Lineage Goes Back Two Years and Tells One Consistent Story

The visual primitives paper didn’t come from nowhere. DeepSeek has shipped roughly seven vision-related models since March 2024, and they all answer the same question: what’s the cheapest representation that still works?

DeepSeek VL (March 2024): 1.3B and 7B models, hybrid SigLIP and SAM encoder. Foundation-setting, nothing flashy.

Janus (October 2024): decoupled visual encoders for understanding versus generation. Most unified multimodal models at the time had a single encoder bottleneck — one encoder trying to serve both comprehension and generation. Janus ran two encoders sharing a transformer. The insight was that the two tasks have different representational needs, and forcing a compromise hurts both.

VL2 (December 2024): ported mixture-of-experts and multi-head latent attention from V2/V3 into vision. A 1B activated-parameter version scored 80.9 on OCRBench and 88.9 on DocVQA. Small activations, large numbers.

Janus Pro 7B (January 2025): went viral during the R1 moment. 80% on GenEval, runnable on a single consumer GPU.

DeepSeek OCR (October 2025): the real conceptual precursor. Take 1,000 text tokens, render them as an image, encode the image, get back 100 vision tokens that reconstruct the original text at 97% accuracy. Ten times compression on long context. Andrej Karpathy’s reaction: “the tokenizer must go, pixel may be better inputs to language models than text.” That quote circulated widely, and it pointed at something real — the assumption that text tokens are the natural input format for language models may not survive contact with better vision architectures.

The visual primitives paper is the next step in that lineage. OCR said: compress text into pixels. Visual primitives says: make spatial coordinates first-class tokens in reasoning. Both are about representation efficiency. Both are about finding the cheapest form that preserves the information you actually need.

If you’re building multimodal AI applications and following this space, understanding the architecture decisions behind models like Gemma 4 — which also supports arbitrary resolution and native vision — gives useful context for how different labs are converging on similar problems from different directions.


5. Three Admitted Limitations That Most Coverage Will Skip

The model has three limitations the paper explicitly acknowledges.

First: it’s resolution-bound. Fine-grained scenes can still fail. The compression pipeline is aggressive — a 756×756 image ends up at 81 KV cache entries — and at some point that compression loses detail that matters. Dense scenes with small objects are still a problem.

Second: visual primitives mode has to be triggered explicitly. The model doesn’t auto-decide when to use it. This is a significant practical limitation. A model that can point but doesn’t know when to point is a model that requires the user to know when pointing would help. That’s a burden that should eventually be internalized.

Third: point-based topological reasoning doesn’t generalize well across scenarios. The maze navigation numbers are strong, but the paper acknowledges the approach doesn’t transfer cleanly to all spatial reasoning tasks. It’s not a universal solution.

These are honest admissions. They also sketch the obvious roadmap: better resolution handling, automatic mode selection, and broader generalization. The limitations tell you where the next paper is coming from.


The Compression Architecture Underneath All of This

The efficiency story deserves its own mention because it’s what makes the visual primitives approach viable at scale.

The custom vision transformer — DeepSeek calls it the DeepSeek Vision Transformer — supports arbitrary resolution using 14×14 patches. A 756×756 image produces about 571,000 pixels, which becomes 2,916 patch tokens. A 3×3 spatial compression along the channel dimension folds nine adjacent patches into one, bringing it to 324 tokens. Then compressed sparse attention from the V4 paper compresses the KV cache by another factor of four. Final result: 81 KV cache entries for the entire image.

That’s approximately 7,000× total compression from raw pixels to KV cache entries.
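
The arithmetic checks out end to end. A back-of-the-envelope verification, assuming the stage factors as reported:

```python
# Back-of-the-envelope check of the reported compression pipeline
side, patch = 756, 14

pixels = side * side                   # 571,536 raw pixels
patch_tokens = (side // patch) ** 2    # 54 x 54 = 2,916 patch tokens
after_merge = patch_tokens // 9        # 3x3 spatial merge -> 324 tokens
kv_entries = after_merge // 4          # sparse attention 4x -> 81 entries

print(pixels, patch_tokens, after_merge, kv_entries)   # 571536 2916 324 81
print(f"{pixels / kv_entries:,.0f}x")                  # 7,056x, i.e. ~7,000x
```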

For an 80×80 image, the comparison is stark: DeepSeek uses roughly 90 KV cache entries, Claude Sonnet 4.6 uses around 870, and Gemini Flash 3 uses around 1,000. About 10× more efficient on the same image. Which means roughly one-tenth the cost to run.

This matters for the visual primitives technique specifically because chain-of-thought reasoning with inline bounding boxes is token-intensive. The model is emitting <ref><box> pairs throughout its reasoning trace. If the image representation itself is expensive, the whole approach becomes prohibitive. The compression pipeline is what makes the reasoning approach affordable.

When you’re building applications that chain vision models with other tools — the kind of multi-step workflows where MindStudio’s visual builder lets you connect 200+ models and 1,000+ integrations without writing orchestration code — the per-image cost difference between 90 and 870 KV cache entries compounds fast across thousands of requests.


The Deployment Reality

On April 29, DeepSeek started rolling out vision mode in the app and on the web alongside the fast and expert modes — a limited test, not a full release. The paper itself is hard to find. The model behind it is in gradual rollout.

The three limitations are real constraints on current usefulness. A model that requires explicit triggering of visual primitives mode, that struggles with fine-grained scenes, and that doesn’t generalize its topological reasoning across all scenarios is not a drop-in replacement for anything.

But the direction is clear. Putting spatial tokens inline in the chain of thought is a better design than language-only spatial description. The five-stage training pipeline with separate specialists and three reward heads is a more principled approach than single-pass SFT. The compression architecture makes it economically viable.

The paper was published and pulled. That’s unusual. But the ideas in it are documented, the architecture is described, and the model is rolling out. The details above are what matter — not the publication status.

For anyone building vision-heavy applications, the question worth sitting with is this: if the reference gap is real — if language genuinely can’t point — then every multimodal model that reasons purely in text is working around a structural limitation. Visual primitives is one answer to that. It probably won’t be the last.

If you’re thinking about how to build production applications on top of models like this, tools like Remy take a different approach to the development layer: you write a spec in annotated markdown, and it compiles into a complete TypeScript stack — backend, database, auth, deployment. The spec is the source of truth; the code is derived output. The abstraction level keeps rising.

The comparison between Claude Sonnet 4.6 and other frontier models is worth tracking alongside DeepSeek’s vision work — the capability gaps between labs are shifting faster than most deployment decisions account for. And if you’re evaluating smaller models for edge deployment alongside cloud-based vision models, Qwen 3.5’s approach to running locally on phones offers a useful contrast in how different labs are thinking about the efficiency-capability tradeoff.

The visual primitives paper may be hard to find. The ideas aren’t.

Presented by MindStudio
