How to Use Karpathy's Verifiability Framework to Decide What to Automate in Your Workflow Today
Karpathy's rule: automate what you can verify, keep what requires judgment. Here's a practical guide to applying his framework to your own work in under an hour.
You Can Outsource Your Thinking, But You Can’t Outsource Your Understanding
There’s a tweet Andrej Karpathy says he thinks about every other day. It goes something like this: you can outsource your thinking but you can’t outsource your understanding. If you’ve been using AI agents seriously for more than a few months, you’ve probably felt the tension it describes — the creeping sense that you’re moving faster but understanding less, that the agent is doing the work while you’re just watching.
This post is about fixing that in under an hour. Specifically, it’s about applying Karpathy’s verifiability framework to your own workflow so you can make principled decisions about what to hand off to AI and what to keep. Not as a philosophy exercise — as a practical audit you can run today.
The framework is simple enough to state in one sentence: traditional computers automate what you can specify in code; LLMs automate what you can verify. But the implications of that sentence are not obvious, and most people building with AI right now are either under-automating (because they don’t trust the model) or over-automating (because they do trust it too much, in the wrong places). Both failure modes cost you.
Why Getting This Wrong Is Expensive
The car wash example is the clearest illustration of the problem. Claude Opus 4.7 — the same model that can refactor a 100,000-line codebase and find zero-day vulnerabilities — will tell you to walk 50 meters to a car wash rather than drive, because driving a car to a car wash is not a domain where reinforcement learning has created strong signal. The model has no verification environment for “sensible car-related logistics.” It has an extremely strong verification environment for code correctness.
This isn’t a bug. It’s the direct output of how frontier labs train these models: inside giant reinforcement learning environments, where the model gets rewarded when it produces outputs that can be verified as correct. Code compiles or it doesn’t. Math checks out or it doesn’t. These domains get sharp, reliable capability. Everything else — common sense, spatial reasoning, taste, judgment — gets what Karpathy calls “rough edges.”
The practical problem for builders: if you don’t understand this jaggedness, you’ll trust the model in domains where it’s unreliable and distrust it in domains where it’s superhuman. You’ll manually review every line of generated code (unnecessary) while blindly accepting its advice on product strategy (dangerous).
The framework gives you a way to sort your work before you hand it off.
What You Need Before You Start
You don’t need any specific tool to run this audit. You need:
- A list of tasks you currently do in your work — ideally 20–30 items, written down
- Honest answers to two questions per task (covered in the steps below)
- About 45 minutes
If you’re already using an agent platform like MindStudio to orchestrate workflows across models and integrations, you’ll find the audit maps directly onto decisions you’re already making — which steps in a workflow to automate, which to keep human-in-the-loop, which to run in parallel. But the audit itself is model-agnostic.
One thing that helps: before you start, separate your tasks into two rough buckets — tasks where you already know what “correct” looks like, and tasks where you’re making a judgment call. This pre-sort will make the framework click faster.
The Audit: Four Steps to Decide What to Automate
Step 1: Ask the verifiability question for each task
For every task on your list, ask: can I verify the output without re-doing the work?
This is more precise than “is this task creative or analytical?” Verification means you can check the output against a standard that doesn’t require your full judgment to apply. Code either passes tests or it doesn’t. A summary either contains the key facts from the source document or it doesn’t. A data transformation either produces the right schema or it doesn’t.
Tasks that pass this test are candidates for automation. Tasks that fail it — where checking the output requires the same expertise as doing the work — are not, at least not fully.
Write “V” (verifiable) or “J” (judgment) next to each item.
By the end of this step, you have a sorted list. Most people are surprised by how many of their tasks are actually verifiable once they think carefully about it.
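If it helps to make the sort concrete, here's a minimal sketch of the audit as data, with one example of a check that verifies output without re-doing the work. The task names, the schema, and the `verify_transform` helper are all invented for illustration, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    label: str  # "V" if you can verify the output without re-doing the work, else "J"

# A concrete "V" check: a data transformation either produces the
# expected schema or it doesn't. No expert judgment required.
EXPECTED_FIELDS = {"id", "email", "signup_date"}  # hypothetical schema

def verify_transform(rows: list[dict]) -> bool:
    return all(set(row) == EXPECTED_FIELDS for row in rows)

audit = [
    Task("generate unit tests for the parser module", "V"),
    Task("summarize customer interviews", "V"),
    Task("set next quarter's pricing", "J"),
]
```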
Step 2: Score verifiability on a spectrum, not a binary
Karpathy’s point — and this is the part that’s easy to miss — is that verifiability is a spectrum, not a binary. He said explicitly that almost everything can be made verifiable to some extent. Some things are easier than others.
So for every “V” task, add a score from 1 to 3:
- V1: Fully automatable now. The verification signal is clear, fast, and doesn’t require a human. (Code tests, data validation, format checks, factual lookups.)
- V2: Automatable with a human checkpoint. The output can be verified, but the verification requires some judgment or context. (Draft emails, meeting summaries, research briefs — you can check these quickly, but you need to read them.)
- V3: Verifiable in principle, but the verification environment doesn’t exist yet. (Taste, aesthetics, long-horizon strategy — you could imagine building a feedback loop, but it’s not there today.)
For “J” tasks, ask whether you could build a verification environment. Could you define a rubric? Could you create test cases? Could you collect human feedback systematically? If yes, the task might be V3. If no, it stays J.
Now you have a tiered list: V1 tasks to automate immediately, V2 tasks to automate with oversight, V3 tasks to watch, J tasks to keep human.
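A minimal sketch of that tiering, assuming you keep the audit in a simple list — the task names and scores below are invented examples:

```python
from dataclasses import dataclass

@dataclass
class ScoredTask:
    name: str
    label: str        # "V" or "J" from Step 1
    v_score: int = 0  # 1-3 for "V" tasks, 0 for "J"

tasks = [
    ScoredTask("validate incoming CSV schema", "V", 1),      # clear, fast, machine-checkable
    ScoredTask("draft the weekly customer email", "V", 2),   # verifiable, but you have to read it
    ScoredTask("judge landing-page aesthetics", "V", 3),     # verifiable in principle, no environment yet
    ScoredTask("set next quarter's pricing", "J"),           # checking requires the same expertise as doing
]

tiers = {"V1": [], "V2": [], "V3": [], "J": []}
for t in tasks:
    key = f"V{t.v_score}" if t.label == "V" else "J"
    tiers[key].append(t.name)
```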
Step 3: Apply the “outsource thinking vs. outsource understanding” test
Here’s where Karpathy’s tweet becomes operational.
For every task you’ve marked V1 or V2, ask: do I need to understand the output, or just use it?
This sounds like a subtle distinction. It isn’t. If you’re automating code review and you never read the output, you’ve outsourced your understanding of your own codebase. If you’re automating competitive research and you never engage with the findings, you’ve outsourced your understanding of your market. The agent can do the thinking. You cannot delegate the understanding.
The practical rule: automate the production of outputs freely. Be careful about automating the consumption of outputs. If a task produces something you need to understand in order to make decisions, build in a step where you actually read it, not just approve it.
This is what distinguishes agentic engineering from vibe coding. Vibe coding raises the floor — anyone can build software without understanding syntax. Agentic engineering raises the ceiling — professionals move faster without sacrificing the quality bar. The ceiling rises because the engineer still understands what’s being built; they’ve just offloaded the mechanical production. Peter Steinberger, who runs dozens of agents in parallel (sometimes a hundred), is an example of the ceiling rising — but he’s not absent from the loop. He’s operating at a higher layer of abstraction, not opting out of understanding.
Mark each V1/V2 task with either “auto-consume” (you don’t need to understand the output to use it) or “human-consume” (you need to read and understand it before acting). This shapes how you build the workflow, not just whether you automate.
Step 4: Build your automation map
Now you have enough signal to make decisions.
V1 + auto-consume: Automate fully. No human checkpoint needed. These are your highest-leverage automations — run them in parallel, run them overnight, run them at scale. Examples: data formatting, test generation, log summarization, schema validation, boilerplate code generation.
V1 + human-consume: Automate production, but build in a mandatory review step. The agent does the work; you read the output before it propagates. Examples: draft communications, generated documentation, research summaries.
V2 + auto-consume: Automate with a lightweight verification step — ideally automated verification, not human review. If you can’t build automated verification, treat this as V2 + human-consume. Examples: content classification, entity extraction, structured data generation from unstructured input.
V2 + human-consume: Automate with a human-in-the-loop checkpoint. This is where most knowledge work lives. The agent drafts; you review and approve. The key is making the review fast — if review takes as long as doing the work, you haven’t gained much.
V3: Don’t automate yet. Watch the domain. As verification environments improve (better evals, better feedback loops, better RL signal), V3 tasks will migrate to V2. The AutoResearch loop pattern is one way to start building that verification infrastructure for domains that don’t have it yet.
J: Keep human. Revisit quarterly. Some J tasks will become V3 as you develop better ways to define what “correct” looks like.
By the end of this step, you have an automation map: a prioritized list of what to automate, how to automate it, and where to keep yourself in the loop.
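If you want that map as something executable, here's a sketch that combines the Step 2 tier with the Step 3 consumption label and returns the policy described above. The tier names and policy strings just restate the categories; nothing here is a required format:

```python
# Maps (tier, consumption mode) to how the workflow should run.
POLICY = {
    ("V1", "auto-consume"):  "automate fully, no checkpoint",
    ("V1", "human-consume"): "automate production, mandatory human read before it propagates",
    ("V2", "auto-consume"):  "automate production, add an automated verification step",
    ("V2", "human-consume"): "agent drafts, human reviews and approves",
    ("V3", None):            "don't automate yet, build the verification environment",
    ("J",  None):            "keep human, revisit quarterly",
}

def decide(tier: str, consumption: str | None = None) -> str:
    return POLICY.get((tier, consumption), "keep human, revisit quarterly")

print(decide("V1", "auto-consume"))   # automate fully, no checkpoint
print(decide("V2", "human-consume"))  # agent drafts, human reviews and approves
```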
Where This Goes Wrong
Treating “I can’t verify it quickly” as “it’s not verifiable”
The most common mistake is conflating slow verification with no verification. A lot of V2 tasks feel like J tasks because checking the output takes effort. But effort is not the same as impossibility. If you can verify it at all — even slowly, even imperfectly — you can automate production and improve your verification process over time.
Automating the consumption step
This is the failure mode Karpathy’s tweet is warning against. You set up an agent to research competitors, summarize findings, and push a report to Notion. You never read the report. Six months later, your product strategy is based on a hallucinated market analysis you never caught because you outsourced your understanding along with your thinking.
The fix is structural: build workflows where human consumption is a required step, not an optional one. If you’re using an AI agent workflow for product management tasks, make sure the human review gate is in the workflow, not just implied.
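A sketch of what "required, not implied" can look like in code, assuming a hypothetical `publish_report` step and a placeholder Notion integration; the point is simply that the workflow refuses to propagate output nobody has read:

```python
def publish_report(report: str, reviewed_by: str | None) -> None:
    # Hard gate: no named reviewer, no propagation downstream.
    if not reviewed_by:
        raise RuntimeError("Human review gate: nobody has read this report; refusing to publish.")
    push_to_notion(report)

def push_to_notion(report: str) -> None:
    ...  # placeholder for whatever integration you actually use
```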
Trusting peak capability in adjacent domains
The car wash problem generalizes. A model that’s superhuman at code is not superhuman at everything adjacent to code — system design, architecture decisions, security tradeoffs, organizational dynamics. These are J tasks or V3 tasks dressed up as V1 tasks because they feel technical. Apply the verifiability test, not the “this seems like something AI is good at” heuristic.
Not updating your map
The jaggedness profile of these models changes fast. Karpathy described December 2025 as a clear inflection point — he said he can’t remember the last time he corrected the model on a coding task. Things that were V3 six months ago are V1 today. Run this audit quarterly, not once.
Where to Take This Further
The audit gives you a static map. The next step is building the infrastructure to act on it.
For V1 tasks, the question is throughput — how many can you run in parallel, how do you handle failures, how do you chain outputs into downstream tasks. The WAT framework (Workflows, Agents, and Tools) is a useful mental model for structuring this: workflows for repeatable sequences, agents for tasks requiring judgment within a bounded scope, tools for deterministic operations.
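As a rough sketch of how those three layers fit together — the schema, the agent stub, and `call_llm` below are placeholders, not a real API:

```python
# Tool: deterministic operation with a machine-checkable output.
def validate_rows(rows: list[dict]) -> bool:
    return all({"id", "amount"} <= set(r) for r in rows)  # hypothetical schema

# Agent: bounded judgment inside the workflow; call_llm is a stand-in.
def summarize_agent(notes: str) -> str:
    return call_llm(f"Summarize the key facts:\n{notes}")

# Workflow: a repeatable sequence that chains the two.
def nightly_workflow(rows: list[dict], notes: str) -> str:
    if not validate_rows(rows):    # V1 step, automated verification
        raise ValueError("schema check failed")
    return summarize_agent(notes)  # V2 step, a human still reads the summary

def call_llm(prompt: str) -> str:
    ...  # placeholder for whichever model or platform you use
```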
For V2 tasks, the question is verification quality — how do you make human review fast and reliable? The answer is usually better output structure (make it easy to scan), better diff views (show what changed), and better escalation paths (make it easy to reject and retry). If you’re building full-stack applications that need to encode these review flows, tools like Remy take a different approach: you write a spec in annotated markdown and the full-stack app — backend, database, auth, deployment — gets compiled from it. The spec becomes the source of truth, which means your verification logic lives in a readable document, not scattered across implementation files.
For V3 tasks, the question is how to build the verification environment. This usually means defining rubrics, collecting labeled examples, and building feedback loops. Karpathy’s point about the bitter lesson applies here: never bet against end-to-end neural networks improving on a domain once you give them a clear verification signal. The Tesla autopilot transition — scrapping the rules-plus-neural-net hybrid for a pure end-to-end neural net — happened because someone built the right training environment. Your V3 tasks are waiting for the same thing.
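A starting point might look like this: a rubric with explicit criteria plus a handful of labeled examples, which is the seed of a verification environment rather than the finished thing. Every criterion and weight here is an assumption you'd replace with your own:

```python
# Explicit criteria and weights -- all assumptions, pick your own.
RUBRIC = {
    "answers the actual question": 0.4,
    "cites a verifiable source":   0.3,
    "under 200 words":             0.3,
}

def rubric_score(checks: dict[str, bool]) -> float:
    """Weighted score from yes/no judgments, human or model-assisted."""
    return sum(weight for criterion, weight in RUBRIC.items() if checks.get(criterion))

# Labeled examples are the seed data for the feedback loop.
labeled_examples = [
    ("Draft answer A ...", {"answers the actual question": True,
                            "cites a verifiable source": False,
                            "under 200 words": True}),
]
```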
The Software 3.0 paradigm — where prompting and context are your programming interface — means the OpenClaw installation model applies broadly: instead of writing a bash script that specifies every step, you write a skill file that describes the outcome and lets the agent figure out the path. That only works in verifiable domains. Your audit tells you which of your tasks are in that category.
The understanding, though, stays with you. That’s not a limitation to work around. It’s the thing that makes the whole system useful.