Andrej Karpathy's Verifiability Thesis: Why AI Is Superhuman at Code and Fails at Car Washes

MindStudio Team

Andrej Karpathy Just Explained Why Claude Can Refactor Your Codebase and Still Tell You to Walk to a Car Wash

Andrej Karpathy, at Sequoia’s annual AI event, offered the cleanest explanation yet for why frontier AI models feel simultaneously superhuman and baffling. The thesis is simple enough to fit in a sentence: traditional computers automate what you can specify in code; LLMs automate what you can verify.

That’s the Karpathy verifiability thesis, and once you internalize it, the jagged capability profile of modern AI stops feeling random. It has a structure. And that structure has direct implications for what you build, what you automate, and where you should expect the model to embarrass you in front of a client.

This post is about understanding that structure — not as philosophy, but as a practical mental model for anyone building with or on top of LLMs.


The Thesis, Stated Precisely

Here’s the exact framing Karpathy used at the Sequoia event:

“Traditional computers can easily automate what you can specify in code. LLMs can easily automate what you can verify.”

The shift from specify to verify is doing a lot of work. When you write a sorting algorithm, you’re specifying every step. When you ask an LLM to sort a list, you’re not specifying anything — you’re trusting that you can verify the output afterward. The LLM figures out the steps. You check the result.
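
A toy contrast makes the distinction concrete. In the sketch below (illustrative only; `ask_llm` is a hypothetical stand-in for whatever model call you use), the first function specifies every step of the sort, while the verify side only needs a cheap check of the result.

```python
# Specify: you write every step yourself.
def insertion_sort(xs: list[int]) -> list[int]:
    out: list[int] = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] < x:
            i += 1
        out.insert(i, x)
    return out

# Verify: the model produces the answer, you check it afterward.
def is_correctly_sorted(original: list[int], candidate: list[int]) -> bool:
    return candidate == sorted(original)  # cheap, reliable verifier

# Hypothetical model call -- the steps are the model's problem; your job
# reduces to the verification above.
# candidate = ask_llm(f"Sort this list: {data}")
# assert is_correctly_sorted(data, candidate)
```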

Built like a system. Not vibe-coded.

Remy manages the project — every layer architected, not stitched together at the last second.

This works because of how frontier labs train these models. The training loop is a giant reinforcement learning environment. The model produces outputs, those outputs get evaluated, and the model gets reward signals based on whether the outputs are correct. For that loop to function, you need a verifier — something that can reliably say “yes, this is right” or “no, this is wrong.”

Code has a verifier built in. You run it. It either works or it throws an error. Math has a verifier. The answer is either correct or it isn’t. These domains don’t require a human in the loop to close the RL feedback cycle. You can generate millions of training examples, verify them automatically, and run RL at scale.
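
As a rough sketch of why that scales, here is the shape of such a loop, heavily simplified; `model.generate`, `update_policy`, and the task set are placeholder names, not any lab's actual pipeline. The thing that makes it run at scale is that the verifier is a program, not a person.

```python
import subprocess
import tempfile

def verify_code(solution: str, test_code: str) -> bool:
    """Automatic verifier: run the candidate solution against its tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0  # pass/fail, no human in the loop

# Simplified RL-style outer loop (placeholder names throughout):
# for task in coding_tasks:                      # millions of examples
#     solution = model.generate(task.prompt)
#     reward = 1.0 if verify_code(solution, task.tests) else 0.0
#     update_policy(model, task, solution, reward)
```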

That’s why models like Claude Opus 4.7 can refactor a 100,000-line codebase. The capability was trained into existence by a feedback loop that could operate without human bottlenecks.


Why the Jaggedness Isn’t a Bug

The same mechanism that creates peak capability in verifiable domains creates rough edges everywhere else.

Karpathy’s phrase for this is “jagged entities” — models that spike in capability in verifiable domains like math and code, and stagnate or underperform in domains where verification is hard or absent. The jaggedness isn’t a flaw in the architecture. It’s a direct consequence of where RL reward signals are available.

The car wash example makes this concrete in a way that’s almost uncomfortable. Opus 4.7 — the same model that can find zero-day vulnerabilities in production code — will tell you to walk 50 meters to a car wash rather than drive. You drive a car to a car wash. That’s the entire point.

The model isn’t stupid. It’s undertrained on a domain where verification is hard. “Should I drive or walk to a nearby car wash?” doesn’t have a clean verifier. There’s no unit test for common sense about automotive cleaning logistics. The RL signal never reached that corner of the capability space, so the model’s behavior there is rough.

This is also why the famous strawberry example existed — counting the r’s in “strawberry” was a task where the model’s tokenization-level processing conflicted with what the task required, and there was no RL pressure to fix it until the labs specifically targeted it.


The Verifiability Spectrum

Here’s where the thesis gets more interesting, and more unsettling.

Karpathy’s claim isn’t just that verifiable domains get automated first. It’s that almost everything is eventually verifiable — it’s a spectrum of difficulty, not a binary.

Code: highly verifiable. Run it, check the output. Math: same. Legal document review: harder, but you can build evaluation rubrics and have lawyers grade outputs at scale. Medical diagnosis: harder still, but outcomes are measurable. Art and taste: seemingly unverifiable — but human feedback is a form of verification, and RLHF already uses it.

The implication is that no domain is permanently safe from RL-driven capability improvement. The question is just how expensive it is to build the verifier, and whether the economic incentive exists to do so.
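
One way to picture that spectrum is as a family of verifiers with rising cost and falling reliability. The sketch below is purely illustrative; none of these functions correspond to a real evaluation harness, and in practice the expensive ones are the hard part to build.

```python
from typing import Callable

# Cheap and exact: equality against a known answer (math, structured extraction).
def verify_exact(candidate: str, answer: str) -> float:
    return 1.0 if candidate.strip() == answer.strip() else 0.0

# Mid-cost: a rubric scored criterion by criterion (legal review, diagnosis write-ups).
def verify_with_rubric(candidate: str, rubric: list[str],
                       grade: Callable[[str, str], bool]) -> float:
    hits = sum(grade(candidate, criterion) for criterion in rubric)
    return hits / len(rubric)

# Expensive and noisy: aggregated human preference (what RLHF leans on for taste).
def verify_by_preference(ratings: list[int]) -> float:
    return sum(ratings) / (len(ratings) * 5)  # e.g. 1-5 star ratings
```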

Plans first. Then code.


Remy writes the spec, manages the build, and ships the app.

Karpathy was explicit about the economic incentive piece. Code became the first domain where models reached escape velocity not just because it’s verifiable, but because enterprise companies would pay enormous amounts for tokens that accelerated software development. The labs were incentivized to build the verifier infrastructure, run the RL, and push capability hard. Show me the incentive and you’ll see the outcome.

This is also why Karpathy’s advice to founders at the event was pointed: if you’re building in a domain that’s clearly verifiable, the labs will eventually own it. The interesting opportunities are in domains that are verifiable but where the verification infrastructure hasn’t been built yet — where you could potentially do your own fine-tuning with domain-specific RL environments.


What This Means for Building Agents Today

If you’re building agents or workflows on top of LLMs, the verifiability thesis gives you a practical heuristic for where to trust the model and where to add guardrails.

Tasks with built-in verifiers — code generation, data transformation, structured extraction, math — are where you should lean hardest on the model and give it the most autonomy. The model has been trained extensively in these domains. The RL signal was strong. You can run the output and check it.

Tasks without clean verifiers — judgment calls, aesthetic decisions, common-sense reasoning about physical-world logistics, anything involving nuanced human context — are where you need more oversight, tighter prompts, and human review in the loop.
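
A minimal version of that heuristic, written as orchestration logic: if a task ships with a programmatic verifier, let the model retry against it; if not, route the output through a human before it goes anywhere. The `Task` shape and function names below are assumptions for illustration, not any particular platform's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    prompt: str
    verifier: Optional[Callable[[str], bool]] = None  # None = no clean verifier

def run_task(task: Task, generate: Callable[[str], str],
             human_review: Callable[[str], bool], max_retries: int = 3) -> str:
    output = generate(task.prompt)
    if task.verifier is not None:
        # Verifiable: give the model autonomy and let it retry against the check.
        for _ in range(max_retries):
            if task.verifier(output):
                return output
            output = generate(task.prompt + "\nPrevious attempt failed verification.")
        raise RuntimeError("Could not produce a verified output")
    # Unverifiable: judgment calls go through a human before they ship.
    if not human_review(output):
        raise RuntimeError("Rejected in human review")
    return output
```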

This maps directly to how Karpathy distinguishes vibe coding from agentic engineering. Vibe coding raises the floor — anyone can build software without understanding syntax. Agentic engineering raises the ceiling — professionals like Peter Steinberger running dozens or sometimes a hundred agents in parallel can go faster without sacrificing the quality bar. The difference is that agentic engineering requires understanding where to trust the model and where to stay in the loop.

Platforms like MindStudio handle the orchestration layer here: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows. When you’re composing agents across verifiable and less-verifiable tasks, having that routing and oversight infrastructure matters more than the individual model choices.


The Software 3.0 Angle

Karpathy’s Software 1.0 / 2.0 / 3.0 framing is worth understanding alongside the verifiability thesis, because they’re connected.

Software 1.0 is explicit rules — you specify every step in code. Software 2.0 is learned weights — you program by curating datasets and training neural networks. Software 3.0 is prompting and context — the LLM is the interpreter, the context window is your RAM, the model weights are your CPU.
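
The same toy task expressed in each paradigm, as a schematic sketch (the 2.0 and 3.0 lines assume generic `train` and `ask_llm` helpers rather than real APIs):

```python
# Software 1.0: explicit rules you specify by hand.
def is_spam_v1(email: str) -> bool:
    return any(word in email.lower() for word in ["winner", "free money", "act now"])

# Software 2.0: you curate data and learn the weights; the weights are the program.
# classifier = train(labeled_emails)
# is_spam_v2 = lambda email: classifier.predict(email)

# Software 3.0: the prompt is the program, the LLM is the interpreter.
# is_spam_v3 = lambda email: ask_llm(
#     f"Is the following email spam? Answer yes or no.\n\n{email}") == "yes"
```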

The OpenClaw installation example is a clean illustration. You’d expect installing a tool to involve a bash script — a Software 1.0 artifact specifying every step. Instead, the OpenClaw installation is a block of text you paste into your agent. The agent reads your environment, figures out what needs to happen, debugs in the loop, and makes it work. You described the outcome; the agent handled the specification.

This is only possible because installation is a verifiable domain. Either the tool is installed and working or it isn’t. The agent can check its own work. The RL training that produced the model’s capability here was grounded in exactly this kind of feedback loop.
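
In agent terms, that self-check is just a retry loop wrapped around a verifier. A rough sketch, with the shell command and the `agent_fix` call as placeholders; real agents interleave far richer debugging:

```python
import subprocess

def install_verified(install_step: str, check_cmd: list[str],
                     agent_fix, max_attempts: int = 5) -> bool:
    """Run an install step, then verify it by actually invoking the tool."""
    for _ in range(max_attempts):
        subprocess.run(install_step, shell=True)
        check = subprocess.run(check_cmd, capture_output=True, text=True)
        if check.returncode == 0:
            return True  # the verifier says it's installed and working
        # Feed the failure back to the agent and let it propose the next step.
        install_step = agent_fix(install_step, check.stdout + check.stderr)
    return False

# Hypothetical usage:
# install_verified("pip install sometool", ["sometool", "--version"], agent_fix)
```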

Day one: idea. Day one: app.


Not a sprint plan. Not a quarterly OKR. A finished product by end of day.

The same principle applies to how production apps get built from high-level intent. Tools like Remy take this further: you write a spec — annotated markdown where prose carries intent and annotations carry precision — and it compiles into a complete TypeScript backend, SQLite database, auth, and deployment. The spec is the source of truth; the generated code is derived output. It’s a direct extension of the Software 3.0 logic: describe the outcome precisely, let the system handle the specification.


Where the Thesis Has Edges

The verifiability thesis is a strong mental model, but it has limits worth naming.

First, it explains where capability peaks, but not when. Knowing that a domain is verifiable doesn’t tell you how long it takes for the labs to build the RL infrastructure and run enough training to reach escape velocity. Legal reasoning is verifiable in principle; the models aren’t there yet in practice.

Second, the thesis assumes the verifier is reliable. Code has a nearly perfect verifier — the compiler and test suite don’t lie. But in domains where the verifier is human feedback (RLHF), the quality of the verifier matters. Inconsistent human raters produce inconsistent reward signals, which produces inconsistent capability. This is probably part of why taste and aesthetics remain rough even though they’re technically verifiable via human feedback.

Third, Karpathy’s own observation about code quality is worth taking seriously. He noted that even in the domain where models are strongest, the output can be “bloaty,” with “a lot of copy-paste and awkward abstractions that are brittle.” The model passes the verifier (the code runs) but fails on dimensions the verifier doesn’t measure (elegance, maintainability, security posture). Verification is necessary but not sufficient for quality.

This connects to the Claude Opus 4.6 vs newer model capability comparisons that have been circulating — benchmark scores on verifiable tasks like SWE-bench are climbing fast, but the gap between “passes tests” and “production-ready code” remains real. The Claude Mythos benchmarks showing 93.9% on SWE-bench are impressive, but SWE-bench is itself a verifiable domain by design.


The Quote That Changes How You Use These Tools

Karpathy mentioned a tweet he thinks about every other day:

“You can outsource your thinking but you can’t outsource your understanding.”

This lands differently once you’ve internalized the verifiability thesis. The domains where you can most safely outsource your thinking are exactly the verifiable ones — where the output can be checked, where errors surface automatically, where the model has been trained hard. Code, math, structured data.

The domains where outsourcing your thinking is most dangerous are the unverifiable ones — where you can’t easily check the output, where the model is undertrained, where errors are subtle and don’t announce themselves. Judgment, taste, common-sense reasoning about the physical world.

The car wash failure isn’t a random glitch. It’s a signal about where the model’s training didn’t reach. If you’re building workflows that involve that kind of reasoning, you need to either keep a human in the loop or find a way to make the domain more verifiable — which is itself a design problem worth solving.

For anyone building on top of models like the ones compared in GPT-5.4 vs Claude Opus 4.6, the verifiability lens is more useful than raw benchmark scores for predicting where a model will succeed or fail in your specific use case. Benchmarks measure performance in verifiable domains almost by definition. Your use case might not be one.


The Bitter Lesson, Applied

Time spent building real software: 5% typing the code, 95% knowing what to build, coordinating agents, debugging and integrating, and shipping to production.

Coding agents automate the 5%. Remy runs the 95%.

The bottleneck was never typing the code. It was knowing what to build.

Karpathy referenced the bitter lesson — never bet against end-to-end neural networks in favor of human heuristics. The Tesla Autopilot transition is the canonical example: years of a hybrid of hand-written rules and neural nets, scrapped in favor of a pure end-to-end network, with an immediate improvement.

The verifiability thesis is the mechanism behind the bitter lesson. End-to-end neural networks win in the long run because they can be trained with RL wherever a verifier exists. Human heuristics are static; RL-trained models keep improving as long as the compute and data keep flowing.

The corollary for builders: the domains where you’re currently writing explicit rules — Software 1.0 style — are candidates for replacement by models trained with RL, if those domains are verifiable. The question to ask about any rule-based system you’re maintaining is: could the output of this system be verified automatically? If yes, there’s probably a model that will eventually outperform your rules, or already does.

Karpathy also noted that even Sergey Brin confirmed on record that AI models perform better when threatened with physical violence in prompts — which is funny, but also points to how much we still don’t understand about what’s actually happening inside these systems. The verifiability thesis explains the capability profile. It doesn’t explain why threatening a language model works. Some things remain genuinely mysterious.

The December 2025 inflection point Karpathy described — where agentic coding fundamentally changed and he “can’t remember the last time he corrected the model” — is a direct consequence of models getting better in verifiable domains. Code is the most verifiable domain there is. The models got good enough that the verifier (running the code, checking the output) almost always comes back clean. That’s not magic. That’s RL working as intended, in the domain where it works best.

Understanding that is more useful than any benchmark number.

Presented by MindStudio
