Andrej Karpathy on DeepSeek's OCR Paper: Why Pixels May Beat Tokens as AI Inputs
Karpathy called DeepSeek's Oct 2025 OCR paper — 10x text compression, 97% accuracy — a sign that tokenizers are on the way out.
Andrej Karpathy read DeepSeek’s October 2025 OCR paper and wrote something that stopped a lot of people mid-scroll: “the tokenizer must go, pixels may be better inputs to language models than text.”
That’s not a throwaway observation. Karpathy wrote minbpe, recorded the “Let’s build the GPT Tokenizer” lecture, and has spent years cataloguing the ways tokenization breaks language models. When he says the tokenizer must go, you should probably think carefully about what he saw in that paper.
Here’s what the paper actually showed.
What DeepSeek’s OCR Paper Actually Did
The framing in the paper is a little strange at first. They called it an OCR paper, but it wasn’t really about optical character recognition in the traditional sense.
The actual idea: take 1,000 text tokens, render them as an image, encode that image through a vision encoder, and you get back 100 vision tokens that reconstruct the original text at 97% accuracy. That’s 10x compression on long-context text, with almost no information loss.
Read that again. You’re taking text, converting it to pixels, running it through a vision model, and getting a representation that is ten times smaller than the tokenized version — and still reconstructs the original at 97% fidelity.
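To make the mechanics concrete, here is a minimal sketch of the render-and-count idea (not the paper’s actual encoder): it tokenizes a stand-in document with a standard BPE vocabulary, renders the same text onto a page of pixels, and counts coarse patches. The page size, patch size, and the final compression step are placeholders, and it assumes the tiktoken and Pillow packages are installed.

```python
# Minimal sketch of the render-then-encode idea; not DeepSeek's pipeline.
# Assumes the tiktoken and Pillow packages are installed. The page size,
# patch size, and final compression factor are placeholders.
import textwrap

import tiktoken
from PIL import Image, ImageDraw

text = ("Long-context text goes here. " * 170).strip()  # stand-in for a page or so of text

# Classic path: BPE-tokenize the text.
enc = tiktoken.get_encoding("cl100k_base")
n_text_tokens = len(enc.encode(text))

# Optical path: render the same text onto a page of pixels.
page = Image.new("RGB", (1024, 1024), "white")
wrapped = "\n".join(textwrap.wrap(text, width=120))
ImageDraw.Draw(page).text((16, 16), wrapped, fill="black")

# A ViT-style encoder would cut the page into patches (16x16 here), and a
# learned compressor would then squeeze those patches much further; the /40
# below is only a stand-in for that learned step.
n_patches = (1024 // 16) * (1024 // 16)   # 4,096 raw patches
n_vision_tokens = n_patches // 40         # placeholder for the learned compression

print(f"text tokens:   {n_text_tokens}")
print(f"vision tokens: {n_vision_tokens} (placeholder)")
```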
This is not a trick. It’s a signal about something fundamental.
The paper sits inside a longer story DeepSeek has been telling for about two years. They shipped DeepSeek VL in March 2024, then Janus in October 2024 (which decoupled visual encoding for understanding vs. generation), then DeepSeek-VL2 in December 2024, whose smallest variant scored 809 on OCRBench with only 1 billion activated parameters. The through-line across all of it is one question: what’s the cheapest representation that still works?
The OCR paper is the most direct answer they’ve given yet.
Why Karpathy’s Reaction Matters
Karpathy’s comment wasn’t just enthusiasm. It was a specific technical claim about the architecture of language models.
Tokenizers exist because we needed a way to convert text into discrete symbols that neural networks could process. The BPE tokenizer, which underlies most modern LLMs, chops text into subword units — roughly 100,000 possible tokens for something like GPT-4. This works, but it introduces a layer of arbitrary discretization. The word “tokenization” might be one token or four depending on context. Numbers get split in ways that make arithmetic harder than it needs to be. Code with unusual syntax gets fragmented in ways that hurt reasoning.
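You can watch that arbitrariness directly by round-tripping a few strings through GPT-4’s cl100k_base vocabulary; the snippet below assumes the tiktoken package is installed.

```python
# Round-trip a few strings through GPT-4's cl100k_base vocabulary to see how
# BPE fragments them. Requires the tiktoken package.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = ["tokenization", " tokenization", "3.14159265", "x ??= y?.z ?? 0"]
for s in samples:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:20} -> {len(ids):2d} tokens: {pieces}")
```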
The deeper problem is that tokenization is a lossy, brittle preprocessing step that was designed around the constraints of 2018-era neural networks. We’ve been carrying it forward mostly because it works well enough and changing it is expensive.
What DeepSeek’s paper suggests is that a vision encoder operating on rendered text might be a better compression function than a tokenizer. Not because pixels are magic, but because a learned visual encoder can find structure in text that a fixed vocabulary can’t. The 10x compression ratio at 97% accuracy is evidence that the visual representation is capturing something real about the information content of text — not just encoding it differently, but encoding it more efficiently.
This connects directly to the broader efficiency story in DeepSeek’s vision work. Their visual primitives paper (“Thinking with Visual Primitives”) describes a vision encoder that uses 14×4 patches, applies 3×3 spatial compression, and then a 4× KV-cache compression — resulting in roughly 7,000× total compression ratio from raw pixels to KV-cache entries. For an 80×80 image, the model uses about 90 KV-cache entries. Claude Sonnet 4.6 uses around 870 for the same image. That’s a 10× efficiency gap, and it’s not an accident — it’s the result of two years of systematic work on representation efficiency.
The OCR paper is the same philosophy applied to text. If you can compress 1,000 tokens into 100 vision tokens with 97% fidelity, you’ve effectively built a better tokenizer. One that’s learned rather than hand-designed. One that scales with model capability rather than being fixed at training time.
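As a thought experiment, the contract such a learned tokenizer exposes is the same one BPE exposes today: text in, a shorter sequence out. The names in the sketch below (OpticalTokenizer, render_page, vision_encoder) are hypothetical, not anything defined in the paper.

```python
# Thought experiment only: the contract a learned "optical tokenizer" would share
# with BPE. OpticalTokenizer, render_page, and vision_encoder are hypothetical
# names, not an API from the paper.
from typing import Callable, Protocol, Sequence


class Tokenizer(Protocol):
    def encode(self, text: str) -> Sequence:
        """Text in, a (hopefully short) sequence of model inputs out."""


class BPETokenizer:
    """Fixed vocabulary, hand-designed merges, frozen at training time."""

    def encode(self, text: str) -> Sequence[int]:
        raise NotImplementedError  # e.g. tiktoken-style subword ids


class OpticalTokenizer:
    """Render text to pixels, then let a trained vision encoder compress it.
    The 'vocabulary' is whatever the encoder learned, so it can keep improving
    as the encoder improves instead of being fixed when training starts."""

    def __init__(self, render_page: Callable, vision_encoder: Callable):
        self.render_page = render_page        # text -> page image
        self.vision_encoder = vision_encoder  # page image -> far fewer vision tokens

    def encode(self, text: str) -> Sequence:
        return self.vision_encoder(self.render_page(text))
```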
The Non-Obvious Part
Here’s what most coverage of this paper missed.
The 97% accuracy number sounds impressive, but the more important number is what happens at scale. Long-context models are expensive because attention scales quadratically with sequence length. Compress 1,000 tokens into 100 vision tokens and the quadratic attention term shrinks by roughly 100×, while KV-cache memory and per-token compute drop by 10×; calling a 1-million-token context window “10× cheaper to operate” is the conservative reading. That’s not a marginal improvement; it changes the economics of long-context inference entirely.
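Here is that arithmetic as a back-of-the-envelope sketch, ignoring output length, MLP cost, and every constant factor:

```python
# Back-of-the-envelope only: ignores output length, MLP cost, and constant factors.
seq_text   = 1_000_000          # context fed in as ordinary text tokens
seq_vision = seq_text // 10     # same content as compressed vision tokens

attention_ratio = (seq_text ** 2) / (seq_vision ** 2)  # quadratic attention term
kv_cache_ratio  = seq_text / seq_vision                 # memory and per-token work

print(f"quadratic attention term shrinks ~{attention_ratio:.0f}x")   # ~100x
print(f"KV cache and per-token cost shrink ~{kv_cache_ratio:.0f}x")  # ~10x
```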
DeepSeek V4 already has a 1-million-token context window, priced at $1.74 per million input tokens and $3.48 per million output tokens. Compare that to GPT-5.5 at $5 per million input and $30 per million output, or Claude Opus 4.7 at $5 per million input and $25 per million output. The pricing gap is already significant. Now imagine what happens if the effective context cost drops by another 10× through better input representations.
The companies that figure out how to replace tokenization with learned visual encoders will have a structural cost advantage that compounds. Every token you don’t have to process is money you don’t spend. At the scale these models operate, that math gets large fast.
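Taking the prices quoted above at face value, the per-request arithmetic looks something like this; the model names and rates are simply the figures cited earlier.

```python
# Uses the per-million-token input prices quoted above as given; the model names
# and prices are the figures cited in this article, not independently verified.
PRICE_PER_M_INPUT = {
    "DeepSeek V4":     1.74,
    "GPT-5.5":         5.00,
    "Claude Opus 4.7": 5.00,
}

context_tokens = 1_000_000
compression = 10  # the OCR paper's headline text -> vision-token ratio

for model, price in PRICE_PER_M_INPUT.items():
    full = context_tokens / 1_000_000 * price
    squeezed = full / compression
    print(f"{model:16} 1M-token prompt: ${full:.2f} -> ${squeezed:.2f} at 10x input compression")
```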
There’s also a quality argument, not just a cost argument. Karpathy’s point about pixels being better inputs — not just cheaper inputs — is the more provocative claim. The idea is that a vision encoder operating on rendered text might preserve information that tokenization destroys. Formatting, spatial relationships between words, the visual structure of code or math — these are things tokenizers actively discard. A vision encoder might not.
This is speculative, but it’s grounded. DeepSeek’s maze navigation benchmark results are suggestive: their vision model scores 67% on topological reasoning tasks where GPT-5.4 scores 50% and Gemini Flash 3 scores 49%. The gap is largest on tasks where spatial reasoning matters — exactly the domain where visual representations should have an advantage over token sequences.
What This Means for How You Build
If you’re building systems that process long documents, the practical implication is straightforward: watch this space closely, because the input representation layer is about to get more interesting.
Right now, if you’re building a document processing pipeline, you’re probably chunking text, tokenizing it, and feeding it into a model. That pipeline assumes tokenization is the right abstraction. DeepSeek’s work suggests it might not be — that rendering text as images and encoding those images could give you better compression, lower cost, and potentially better reasoning on structured content like code, tables, and math.
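A toy version of that contrast, with every function stubbed out so the shape is visible without committing to any particular renderer, encoder, or model client:

```python
# Toy comparison of the two pipelines. Every function here is a stub to show the
# shape; swap in your own chunker, renderer, vision encoder, and model client.
def tokenize(text: str) -> list[int]:
    return list(text.encode("utf-8"))               # stub tokenizer: ~1 token per byte

def render_page(text: str) -> bytes:
    return text.encode("utf-8")                     # stub renderer: pretend these are pixels

def vision_encode(image: bytes) -> list[bytes]:
    return [image[i:i + 10] for i in range(0, len(image), 10)]  # pretend 10x squeeze

document = "quarterly revenue report, section 4, table of figures... " * 2_000

# Today's pipeline: chunk, tokenize, pay for every token.
chunks = [document[i:i + 4_000] for i in range(0, len(document), 4_000)]
text_cost = sum(len(tokenize(c)) for c in chunks)

# The direction the paper points at: render pages, encode them, feed far fewer tokens.
vision_cost = sum(len(vision_encode(render_page(c))) for c in chunks)

print(f"text tokens fed to the model:   {text_cost}")
print(f"vision tokens fed to the model: {vision_cost}")
```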
This is relevant to how you think about Karpathy’s LLM wiki approach, which already cuts token use by up to 95% on small knowledge bases through careful document structuring. The OCR paper suggests a complementary direction: rather than reducing what you feed in, change how you encode what you feed in. It’s also worth reading alongside Karpathy’s broader thinking on personal knowledge bases built with Claude Code, which explores similar questions about what the right input representation looks like for retrieval-heavy workflows.
The DeepSeek V4 Flash model that underlies this vision work is a 284-billion-parameter mixture-of-experts model with 13 billion active parameters at inference. You’re getting frontier-grade reasoning at the cost of a 13B dense model. That architecture — large total parameter count, small active parameter count — is what makes the pricing possible. The vision encoder efficiency compounds on top of that.
For teams building production AI applications, the near-term action is to start experimenting with vision-based document processing for text-heavy inputs. Not because it’s definitively better today, but because the trajectory is clear. DeepSeek has been publishing this story in installments for two years, and each installment has been more efficient than the last.
If you’re orchestrating multiple models across a pipeline — say, a vision encoder for input compression feeding into a reasoning model — MindStudio handles that kind of multi-model composition without requiring you to write the orchestration layer from scratch. With 200+ model integrations and a visual builder for wiring together agents and workflows, you can swap in different vision encoders as the landscape evolves, which matters when the field is moving this fast.
The Broader Architecture Question
Karpathy’s comment points at something that the field has been circling for a while: the components of a language model that were designed around 2018-era constraints are starting to look like technical debt.
Tokenizers were designed when we needed fixed vocabularies and couldn’t afford learned compression at scale. Positional encodings were designed when we didn’t know how to handle arbitrary-length sequences. Attention was designed before we understood how to compress KV caches. All of these are being revisited simultaneously, and DeepSeek is doing some of the most systematic work on the compression side.
The OCR paper is one data point. The visual primitives paper is another. The KV-cache compression in V4 Flash is a third. They’re all pointing at the same thing: the information bottleneck in language models is not in the transformer itself, it’s in the representations going in and out.
If you’re building tools that generate code from structured inputs, this has direct implications. Remy is a spec-driven full-stack app compiler — you write an annotated markdown spec and it compiles into a complete TypeScript application with backend, database, auth, and deployment. The interesting question the OCR paper raises is whether that spec, rendered as a structured visual document, might be a better input to a code-generating model than its tokenized text equivalent. The spatial structure of a well-formatted spec carries information that token sequences flatten.
This is not a solved problem. DeepSeek’s paper is honest about the limitations: the model is resolution-bound, fine-grained scenes can trip it up, and the visual primitives mode has to be triggered explicitly rather than being applied automatically. The 97% reconstruction accuracy is impressive but not perfect, and the failure modes matter for production use.
But the direction is clear. The tokenizer has been a fixed point in language model architecture for long enough that people have stopped questioning it. Karpathy questioning it — based on a specific paper with specific numbers — is the kind of signal worth taking seriously.
The question isn’t whether tokenization will eventually be replaced. It’s how fast, and by what. DeepSeek’s answer, at least for now, is: by learned visual encoders that compress text into image representations, with 10× efficiency gains and 97% fidelity. That’s a credible answer.
The follow-on question is what happens when that compression is applied not just to text, but to the full multimodal input stream — text, images, code, structured data, all encoded through the same visual representation pipeline. That’s the experiment DeepSeek is running, and the results so far are worth watching carefully.
For context on how open-weight models are changing the cost calculus for AI builders more broadly, the breakdown of what Qwen 3.5 represents as an open-weight model that runs on consumer hardware is worth reading alongside this. The pricing story and the architecture story are connected — you can only charge $1.74 per million tokens if you’ve solved the efficiency problem at multiple levels of the stack simultaneously. And for a sharper look at how agentic coding models are absorbing these efficiency gains in practice, the Qwen 3.6 Plus review covers what frontier-level coding performance looks like when the underlying representation work is done right.
The tokenizer is one of those levels. DeepSeek is working on it. Karpathy noticed. That’s usually a good sign that something real is happening.