Andrej Karpathy Said the Tokenizer Should Die. DeepSeek Is Building the Funeral Pyre.
Andrej Karpathy doesn’t throw around strong claims casually. So when DeepSeek published a paper in October 2025 showing that 1,000 text tokens rendered as an image could be compressed down to 100 vision tokens at 97% accuracy, and Karpathy responded with “the tokenizer must go, pixel may be better inputs to language models than text” — people noticed.
That’s not a throwaway tweet. That’s one of the most prominent figures in AI research saying, in plain language, that a foundational assumption of how language models work might be wrong. The tokenizer — the thing that converts raw text into the integer sequences that LLMs actually process — has been a given since the beginning. Karpathy was suggesting it might be a crutch.
You might have missed what happened next. DeepSeek kept building.
What Karpathy Was Reacting To
The October 2025 paper was called DeepSeek OCR, though the name was a bit of a misdirect. It wasn’t really about optical character recognition in the traditional sense. The actual idea was stranger and more interesting: take a long block of text, render it as a pixel image, run it through a vision encoder, and see what comes out.
What came out was compression. Roughly 10x compression. A thousand text tokens became a hundred vision tokens, and the model could reconstruct the original content at 97% accuracy. The implication was uncomfortable for anyone who’d spent years thinking about tokenization as the correct abstraction layer: maybe pixels carry more information per unit than tokens do. Maybe the discrete, lossy process of chopping language into subword chunks is leaving signal on the table.
Karpathy’s reaction landed because he’d been thinking about tokenization as a problem for a long time. His work on LLM knowledge bases and structured retrieval reflects a consistent preoccupation with how information gets represented before it reaches a model — and whether the representation is actually optimal. The OCR paper poked directly at that question. It’s also worth noting that Karpathy’s broader interest in efficient knowledge representation connects to the same efficiency-first philosophy that runs through all of DeepSeek’s vision work — the question isn’t just what the model knows, but how cheaply it can be made to know it. For a deeper look at how that philosophy applies to retrieval specifically, see how Karpathy’s LLM Wiki approach achieves 95% less token use than RAG, a related argument about representation efficiency at a different layer of the stack.
The through line across DeepSeek’s vision work, if you trace it back, is a single obsessive question: what’s the cheapest representation that still works? Every paper in their lineage is a different answer to the same question.
Two Years of Telling the Same Story
DeepSeek’s vision lineage is worth tracing because the new model — the one behind the paper titled “Thinking with Visual Primitives” that briefly appeared and then became hard to find — didn’t come from nowhere. It’s the seventh chapter in a story that started in March 2024.
DeepSeek VL came first. Small models, 1.3 and 7 billion parameters, using a hybrid SigLIP and SAM encoder. Nothing flashy, but it set the foundation. Then in October 2024, Janus arrived with an architectural idea that mattered: decouple the visual encoder for understanding versus generation. Most unified multimodal models at the time had a single encoder bottleneck — the model was forced to compromise between seeing and generating. Janus said no, ran two encoders, and shared the transformer.
December 2024 brought VL2, which is where the efficiency story really starts. DeepSeek ported mixture-of-experts and multi-head latent attention from their V2 and V3 language models into the vision architecture. A tiny version had only 1 billion activated parameters yet scored 80.9 on OCRBench and 88.9 on DocVQA. Small activations, big numbers — that’s the DeepSeek pattern.
January 2025 gave us Janus Pro 7B, which went viral during the R1 moment. It hit 80% on GenEval, beating DALL-E 3, and you could run it on a single consumer GPU. Then October 2025 brought the OCR paper and Karpathy’s reaction. And now there’s “Thinking with Visual Primitives,” which takes the pixel-as-input idea and extends it into something more architecturally ambitious.
Each step is the same argument made more forcefully. Fewer activated parameters. Decoupled encoders. Text compressed into pixels. And now: spatial coordinates as first-class tokens in the chain of thought.
The Reference Gap Nobody Was Talking About
The new paper makes a distinction that’s easy to miss but actually clarifies a lot of the confusion around multimodal model limitations.
There are two different problems with how models process images. The first is the perception gap — the model can’t see fine-grained details. Most of the work in 2024 and into 2025 was aimed at this: high-resolution cropping, dynamic patching, thinking with images. The goal was to make models see better.
DeepSeek’s paper argues there’s a second, more fundamental problem they call the reference gap. Even when a model sees perfectly, language is too imprecise to point. If you ask a model to count the men in a dense team photo, the model has to hold spatial references in language — “the person on the left, the one behind him, the one partially obscured” — and as the reasoning chain gets longer, it loses track of which entity it’s actually talking about. Humans solve this with finger gestures. Models, until now, didn’t have an equivalent.
The solution is what they call visual primitives. When the model reasons about an image, it outputs special tokens inline in its chain of thought: <ref>label</ref><box>x1,y1,x2,y2</box>. These aren’t function calls. They’re not a separate tool invoked via API. They’re part of the model’s vocabulary, woven directly into the reasoning trace. The model literally points to things as it thinks.
This is a cleaner idea than it might sound. Counting in dense scenes, multi-hop spatial reasoning, distinguishing visually similar objects — all of these become more tractable when the model can anchor its reasoning to coordinates rather than trying to hold spatial relationships in language alone.
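To make the format concrete, here is a minimal sketch of how a downstream consumer might pull those anchors back out of a reasoning trace. The <ref>/<box> tag pattern follows the description above; the example trace, the labels, and the pixel-corner coordinate convention are illustrative assumptions, not the model’s documented output format.

```python
import re

# Matches the inline visual-primitive tokens described above:
# <ref>label</ref><box>x1,y1,x2,y2</box>
PRIMITIVE = re.compile(r"<ref>(.*?)</ref><box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_anchors(trace: str) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Return (label, bounding_box) pairs in the order they appear in the trace."""
    return [
        (label, (int(x1), int(y1), int(x2), int(y2)))
        for label, x1, y1, x2, y2 in PRIMITIVE.findall(trace)
    ]

# Hypothetical reasoning trace for a counting question.
trace = (
    "Back row first: <ref>man in blue jacket</ref><box>112,40,198,260</box> is "
    "partially occluded by <ref>man holding trophy</ref><box>180,35,266,270</box>, "
    "so they are two distinct people. Running count: 2."
)
print(extract_anchors(trace))
# [('man in blue jacket', (112, 40, 198, 260)), ('man holding trophy', (180, 35, 266, 270))]
```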
The Architecture That Makes It Cheap
Here’s where the Karpathy thread connects to something concrete and measurable.
The compression pipeline in this model is worth understanding in detail. A 756x756 image starts with roughly 571,000 pixels. The vision transformer converts that into 2,916 patch tokens using 14x14 patches. Then a 3x3 spatial compression collapses nine adjacent patches into one, bringing the count down to 324 tokens. Then a compressed sparse attention mechanism — borrowed from the V4 paper — compresses the KV cache by another factor of four. End result: 81 KV cache entries for the entire image. That’s approximately a 7,000x compression ratio from raw pixels to the representation the model actually attends over.
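The arithmetic is easy to check. A quick back-of-envelope script, using only the numbers quoted above (14x14 patches, 3x3 spatial merge, 4x KV compression), reproduces the chain:

```python
# Back-of-envelope reproduction of the compression pipeline described above.
# The patch size, merge factor, and KV compression factor come from the article;
# everything else is plain arithmetic.
side = 756
pixels = side * side               # 571,536 raw pixels
patches = (side // 14) ** 2        # 54 x 54 = 2,916 patch tokens
merged = patches // 9              # 3x3 spatial compression -> 324 tokens
kv_entries = merged // 4           # sparse-attention KV compression -> 81 entries

print(pixels, patches, merged, kv_entries)               # 571536 2916 324 81
print(f"compression ratio ~{pixels / kv_entries:.0f}x")  # ~7056x
```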
The practical consequence shows up in a simple comparison. For an 800x800 image, DeepSeek’s model uses roughly 90 KV cache entries. Claude Sonnet 4.6 uses around 870. Gemini Flash 3 uses around 1,000. That’s not a marginal efficiency gain — it’s an order of magnitude. Running this model on the same image costs roughly a tenth of what you’d pay for comparable frontier models.
The language backbone is DeepSeek V4 Flash: a 284-billion parameter mixture-of-experts model with 13 billion active parameters at inference. You get frontier-grade reasoning while paying the compute cost of only 13 billion active parameters per token. The vision encoder is a custom architecture they call the DeepSeek Vision Transformer, built from scratch to support arbitrary resolution.
The training pipeline has five stages: multimodal pre-training on trillions of tokens; specialized supervised fine-tuning for grounding (boxes) and pointing (points), trained separately; GRPO reinforcement learning with three reward heads covering format, quality, and accuracy; a unified RFT pass that merges the two specialists; and on-policy distillation into a single student model. They train two specialists and consolidate them. It’s an elegant design, and the separation of grounding from pointing in the SFT stage is the kind of detail that suggests they thought carefully about where the failure modes actually live.
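The paper’s reward design isn’t spelled out here, so the following is only a hypothetical illustration of what combining three reward heads might look like; the format check, the weights, and the function names are assumptions, not DeepSeek’s implementation.

```python
# Hypothetical sketch of combining three GRPO reward heads (format, quality,
# accuracy) into one scalar. The three head names come from the article; the
# checks and weights below are invented for illustration.
def format_reward(trace: str) -> float:
    # Illustrative well-formedness check: every <ref> tag should have a matching <box> tag.
    return 1.0 if trace.count("<ref>") == trace.count("<box>") else 0.0

def combined_reward(trace: str, quality_score: float, accuracy_score: float,
                    weights: tuple[float, float, float] = (0.2, 0.3, 0.5)) -> float:
    # Weighted sum of the three heads; the weights are placeholders, not values from the paper.
    w_fmt, w_qual, w_acc = weights
    return w_fmt * format_reward(trace) + w_qual * quality_score + w_acc * accuracy_score
```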
What the Benchmarks Actually Show
The headline benchmark number is 67% on maze navigation — a topological reasoning task — against 49% for Gemini Flash 3, 50% for GPT-5.4, and 49% for Sonnet 4.6. That’s a 17-point gap over GPT-5.4 on a task that specifically rewards the ability to trace paths and reason about spatial relationships.
The paper is honest about what this means and doesn’t mean. There’s a footnote that reads: “reported scores cover only a subset of evaluation dimensions directly relevant to the research focus of this paper and are therefore not indicative of the model’s overall capabilities.” They’re not claiming to beat GPT-5.4 across the board. They’re claiming to beat it on visually grounded reasoning tasks where language-based spatial description breaks down. That’s a narrower and more defensible claim.
For context on how GPT-5.4 and Sonnet 4.6 compare more broadly, there’s a detailed comparison of GPT-5.4 vs Claude Opus 4.6 that covers the general-purpose tradeoffs — which is a different question than what DeepSeek is optimizing for here. The broader model landscape also matters for understanding where DeepSeek’s efficiency gains fit: open-weight models like Qwen 3.5 from Alibaba are pursuing similar compression-first philosophies on the language side, which suggests the efficiency-over-scale argument is gaining traction across multiple labs simultaneously.
The limitations the paper admits are also worth taking seriously. The model is resolution-bound, meaning fine-grained scenes can still trip it up. The visual primitives mode has to be triggered explicitly — the model doesn’t auto-decide when to use it. And the point-based topological reasoning doesn’t generalize well across all scenarios. These aren’t minor caveats. They’re the places where the architecture’s assumptions break down.
What This Actually Adds Up To
The Karpathy quote is doing a lot of work in this story, but it’s worth being precise about what he was and wasn’t saying.
He wasn’t claiming that tokenization is definitively wrong or that pixels will replace text tokens in general-purpose language models. He was reacting to a specific empirical result — 10x compression of text via pixel encoding at 97% accuracy — and drawing a directional conclusion: the tokenizer as a fixed assumption deserves more scrutiny than it’s getting.
DeepSeek’s “Thinking with Visual Primitives” paper is a different kind of evidence for the same intuition. It’s not compressing text into pixels. It’s treating spatial coordinates as native vocabulary tokens, which is a related move: instead of forcing spatial information through the bottleneck of language, give the model a representation that’s native to the domain.
The efficiency numbers are the part that should get more attention than they do. A 7,000x compression ratio isn’t just an interesting architectural choice — it’s a cost structure that changes what’s economically viable to build. If vision inference costs a tenth as much, you can run it in contexts where you currently wouldn’t. You can call it more frequently in an agentic loop. You can afford to let a model reason visually over a longer chain of thought without the token budget becoming prohibitive.
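To see how that compounds, take the per-image KV-cache counts quoted earlier and run them through a loop that calls a vision model repeatedly. The call count and per-entry unit cost below are placeholders; only the relative costs are meaningful.

```python
# Back-of-envelope: how a per-image efficiency gap compounds across an agentic loop.
# KV-entry counts are the ones quoted earlier; calls-per-run and unit cost are
# made-up placeholders, so only the ratios matter.
calls_per_run = 50      # hypothetical vision calls in one agent run over a document
unit_cost = 1e-6        # hypothetical cost per attended KV entry

for model, kv_entries in [("DeepSeek visual primitives", 90),
                          ("Claude Sonnet 4.6", 870),
                          ("Gemini Flash 3", 1000)]:
    print(f"{model:28s} relative cost per run: {calls_per_run * kv_entries * unit_cost:.4f}")
```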
Platforms like MindStudio that support 200+ models and 1,000+ integrations let you chain vision models visually across complex workflows — and that’s exactly where this kind of efficiency differential starts to matter in practice. When you’re building a pipeline that calls a vision model repeatedly across a document corpus, a 10x cost difference compounds quickly. Tools like Remy, MindStudio’s spec-driven full-stack app compiler, take this further: you write a markdown spec with annotations describing what you want to build, and Remy compiles it into a complete TypeScript app with backend, database, auth, and deployment — meaning the cost structure of the underlying vision models directly affects what’s economically viable to ship.
The rollout is still limited. On April 29, 2026, DeepSeek began making vision mode available in its app and web interface alongside the fast and expert modes, a gradual release rather than a full launch. The paper itself is hard to find. The model isn’t widely accessible yet.
But the direction is clear. DeepSeek has been telling the same story for two years: find the cheapest representation that still works, and then make it cheaper. The OCR paper said pixels might beat tokens for text. The visual primitives paper says coordinates should be first-class tokens in reasoning. Karpathy’s reaction to the first paper was essentially: yes, keep going.
The interesting question isn’t whether this specific model beats GPT-5.4 on a benchmark. It’s whether the underlying compression philosophy — treat every representation as provisional, always ask if there’s a cheaper one that preserves the signal — turns out to be the right way to build multimodal systems. The efficiency numbers suggest it might be.
For engineers building on top of these models, the practical implication is that the cost curve for vision inference is not fixed. The assumption that vision is expensive — that you should use it sparingly, cache aggressively, avoid calling it in loops — is being actively undermined by an architecture that achieves 7,000x compression before the model even starts reasoning. That’s not a minor optimization. That’s a different set of constraints.
The tokenizer, as Karpathy put it, may have to go. DeepSeek is at least making a serious argument for what comes next. And if you’re building tools that need to reason about images — document processors, visual agents, anything that has to count objects or trace paths or identify spatial relationships — the architecture described in “Thinking with Visual Primitives” is worth understanding in detail, even if the model itself isn’t fully available yet.
The paper appeared, then disappeared. The model is rolling out slowly. But the ideas in it are already in the world, and they’re not going back.
The broader question Karpathy raised — whether the tokenizer is a necessary abstraction or a historical accident — doesn’t have a clean answer yet. But it has more evidence than it did a year ago. That’s usually how these things go: not a single decisive moment, but a series of results that make the old assumption harder to defend. DeepSeek’s vision lineage is starting to look like exactly that kind of accumulation.