Karpathy's Sequoia Talk: 5 Predictions About Agentic Engineering That Should Change How You Work
Karpathy named December 2025 as the inflection point for agentic coding and said he can't remember the last time he corrected the model.
Andrej Karpathy — founding member of OpenAI, the person who ran Tesla’s self-driving AI program, the person who coined “vibe coding” — gave an exclusive talk at Sequoia’s annual AI event and said something that should stop you mid-scroll. He said December 2025 was a “clear point” where agentic coding fundamentally changed, and that he can’t remember the last time he corrected the model. That’s not a marketing line. That’s Karpathy, one of the most technically credible people in this industry, describing a personal inflection point.
He had 5 specific things to say about where agentic engineering is going. They’re worth taking seriously — not because Karpathy is always right, but because when he’s describing something he experienced firsthand, the specificity tends to hold up.
December 2025 Was Not a Vibe — It Was a Measurable Shift
Karpathy described the transition clearly. He was on a break, had more time than usual, started using the latest models with agentic tooling, and noticed something different: the chunks of code he asked for just came out fine. He kept asking for more. It kept coming out fine. And then, in his words: “I can’t remember the last time I corrected it.”
This isn’t someone describing a gradual improvement curve. He’s describing a threshold moment. The models crossed some line where the error rate dropped low enough that the correction loop — the thing that made agentic coding feel like babysitting — effectively disappeared.
If you tried agentic coding tools a year before December 2025 and weren’t impressed, Karpathy is specifically telling you that you need to look again. The product you tried no longer exists. What replaced it is qualitatively different.
The practical implication: if your mental model of agentic coding is still “it writes snippets I have to stitch together,” you’re operating on outdated information. The current state is end-to-end application generation, not autocomplete.
The LLM-as-Computer Architecture Is Already Here, Not Coming
Karpathy posted a diagram nearly three years ago — before most people were thinking seriously about agents — that described the LLM as a new kind of computer. Audio and video come in and out. Peripherals like the file system and the browser sit on the edges. The LLM handles everything in the middle. No traditional operating system. The context window is RAM. The model weights are the CPU.
This wasn’t speculative when he posted it. It was descriptive. He was mapping what was already becoming true.
The reason this matters for builders is that it changes what “programming” means. Software 1.0 was explicit rules — you write code that specifies every step. Software 2.0 was learned weights — you arrange datasets and training objectives and the neural network learns the behavior. Software 3.0, which is where we are now, is prompting. Your lever over the system is what you put in the context window.
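To make the contrast concrete, here is a minimal TypeScript sketch of the 1.0 and 3.0 versions of the same task. The `callModel` parameter is a hypothetical stand-in for whatever LLM API you use, not any specific vendor’s SDK.

```typescript
// Software 1.0: the behavior is written out as explicit rules.
function classifySentiment_v1(review: string): "positive" | "negative" {
  const negativeWords = ["refund", "broken", "terrible", "waste"];
  const isNegative = negativeWords.some((w) => review.toLowerCase().includes(w));
  return isNegative ? "negative" : "positive";
}

// Software 3.0: the behavior lives in the prompt you place in the context window.
// `callModel` is a hypothetical wrapper around an LLM completion endpoint.
async function classifySentiment_v3(
  review: string,
  callModel: (prompt: string) => Promise<string>
): Promise<"positive" | "negative"> {
  const prompt =
    `Classify the following review as "positive" or "negative". ` +
    `Reply with exactly one word.\n\nReview: ${review}`;
  const answer = (await callModel(prompt)).trim().toLowerCase();
  return answer === "negative" ? "negative" : "positive";
}
```

Software 2.0, the learned-weights version, does not fit in a snippet at all: the “program” is a dataset and a training run, which is exactly the point about where the lever moved.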
The OpenClaw installation example is the clearest illustration of this shift. When OpenClaw launched, you’d expect the installer to be a bash script, a series of explicit steps the shell runs in order. Instead, the installation is a block of text you copy-paste to your agent. A skill file. Here are your tools, here’s the outcome I want, go figure it out. The agent looks at your environment, debugs in the loop, and makes it work. No bash script. No explicit steps. Just an outcome description.
That’s not a convenience feature. That’s a different model of what software is.
Verifiability Explains the Jaggedness — and Predicts What Gets Automated Next
Here’s the thing about Karpathy’s car wash example that people keep glossing over: Claude Opus 4.7 — the same model that can refactor a 100,000-line codebase or find zero-day vulnerabilities — will tell you to walk the 50 meters to the car wash rather than drive, even though the car is the thing that needs washing. Because car wash logistics is not a verifiable domain. There’s no RL training signal for “is this good advice about car washing logistics.”
Karpathy’s explanation for this jaggedness is the verifiability thesis: traditional computers automate what you can specify in code; LLMs automate what you can verify. Code is easily verifiable — you compile it, run it, get an error or you don’t. Math is easily verifiable. These are domains where reinforcement learning can run at scale because the reward signal is unambiguous. The models peak in those domains because that’s where the training signal is strongest.
The car wash problem isn’t verifiable in the same way. There’s no ground truth dataset of “should I drive or walk to a nearby car wash” with clear right/wrong labels. So the model, despite being superhuman at code, gives you advice that a ten-year-old would know is wrong.
This has a direct prediction embedded in it: the next domains to get automated are the ones where someone figures out how to make verification tractable. If you can build an RL environment for a domain — even a messy, imperfect one — you can throw compute at it and get capability. If you can’t, the model stays rough around the edges there regardless of how good it gets at code.
For builders, this is a framework, not just an explanation. Ask yourself: can I verify the output of this task without a human in the loop? If yes, it’s automatable now or very soon. If no, you still need judgment in the chain. That’s not a permanent safe harbor — Karpathy explicitly said he thinks almost everything is eventually verifiable, it’s just a spectrum of difficulty — but it tells you where you are on the timeline.
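One way to feel the asymmetry is to try writing the verifier. This is a rough TypeScript sketch of the idea, not a real RL setup; the repo path and test command are illustrative.

```typescript
import { execSync } from "node:child_process";

// Verifiable task: the signal is unambiguous. The tests pass or they don't,
// and no human needs to read the diff to know whether the agent's change "worked".
function verifyCodeChange(repoDir: string): { ok: boolean; signal: string } {
  try {
    execSync("npm test", { cwd: repoDir, stdio: "pipe" });
    return { ok: true, signal: "tests passed" };
  } catch {
    return { ok: false, signal: "tests failed" };
  }
}

// Non-verifiable task: there is no program you can run to score the answer.
// "Should I drive or walk to the car wash?" has no ground-truth oracle,
// so there is nothing to feed back into a training loop as a reward.
function verifyCarWashAdvice(_answer: string): null {
  return null; // honest answer: a human has to judge this one
}
```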
Vibe Coding and Agentic Engineering Are Not the Same Thing, and Confusing Them Is Expensive
Karpathy drew a distinction that’s worth repeating precisely because it’s being blurred constantly in the discourse.
Vibe coding raises the floor. Anyone can build software without understanding syntax, without knowing how the code works, without being able to debug it. That’s genuinely useful. It’s also not what professional engineers should be doing when they’re building production software.
Agentic engineering raises the ceiling. It’s how professionals go faster without sacrificing the quality bar. You’re still responsible for the software. You’re still responsible for security. You’re still responsible for the architecture. What changes is the speed at which you can execute, and the layer of abstraction at which you’re operating.
The distinction matters because the failure modes are completely different. Vibe coding failure mode: you ship something that works until it doesn’t, and you can’t debug it because you don’t understand it. Agentic engineering failure mode: you move fast but lose track of what the agents are actually doing, and quality degrades in ways you don’t catch until they’re expensive.
Peter Steinberger is the example Karpathy cited for what frontier agentic engineering actually looks like: running dozens, sometimes a hundred agents in parallel, automating different parts of the development flow — not just code writing, but deployment sequences, bug detection, PR management. That’s not vibe coding. That’s orchestration at scale, and it requires understanding the system deeply enough to direct it.
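Structurally, orchestration at that scale is less exotic than it sounds. Here is a stripped-down TypeScript sketch; `runAgent` is a stub standing in for whatever agent framework you use, and the task list is invented for illustration, not Steinberger’s actual setup.

```typescript
type AgentReport = { task: string; ok: boolean; summary: string };

// Stub: in a real setup this would launch an agent session (coding, deploying,
// triaging) and resolve when the agent reports back.
async function runAgent(task: string): Promise<AgentReport> {
  return { task, ok: true, summary: `stub result for "${task}"` };
}

const tasks = [
  "triage new issues and close duplicates",
  "run the release checklist and draft a changelog",
  "scan last night's error logs for regressions",
  "review open PRs and flag missing tests",
];

async function orchestrate(): Promise<void> {
  // The agents run in parallel. The human's job is the part that doesn't
  // parallelize: reading the reports, catching brittle abstractions,
  // and deciding what actually ships.
  const reports = await Promise.all(tasks.map(runAgent));
  const needsReview = reports.filter((r) => !r.ok);
  console.log(`${reports.length} agents finished, ${needsReview.length} need human review`);
}

orchestrate();
```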
If you’re building production software and you’re treating agentic tools as a way to avoid understanding what you’re building, you’re doing vibe coding and calling it agentic engineering. Those are different bets with different risk profiles.
The Bitter Lesson Has a Corollary That Most People Are Ignoring
The bitter lesson in ML research is simple: never bet on human heuristics over end-to-end neural networks. It has been proven repeatedly. Tesla’s Autopilot ran for years as a hybrid — a neural net plus human-written rules like “if you see a stop sign, stop.” One engineer eventually argued for scrapping the rules entirely and going pure end-to-end neural net. They did it. Autopilot immediately improved and became easier to maintain.
Karpathy’s corollary, which he didn’t state explicitly but which runs through everything he said: the same dynamic applies to software architecture. Every place where you’re using an LLM for one piece and traditional code for everything else, you’re running the hybrid autopilot. The end-to-end neural network version is coming, and it will outperform the hybrid.
His menu generation example makes this concrete. He built an app that takes a photo of a menu, OCRs all the items, generates images for each, and renders them. Traditional code orchestrating multiple steps. Then he saw the Software 3.0 version: take the photo, give it to Gemini, say “use NanoBanana to overlay the items onto the menu,” and get back the rendered image. The entire pipeline collapsed into a single model call. His words: “All of my menu gen is spurious. It’s working in the old paradigm. That app shouldn’t exist.”
That’s a strong statement. He’s saying his own code was already obsolete by the time he finished writing it.
For builders, this means the architecture question isn’t just “where do I add AI?” It’s “what parts of this system can I collapse into a single model call, and what’s stopping me?” The answer to the second question is usually verifiability — if you can’t verify the output, you can’t trust the collapse. But the direction of travel is clear.
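To make the collapse concrete, here is a TypeScript sketch of the two shapes. The helper signatures and the single-call wrapper are invented placeholders, not Karpathy’s actual app or any specific API.

```typescript
type Step<I, O> = (input: I) => Promise<O>;

// Old paradigm: your code owns the pipeline and calls a model (or OCR service) per step.
async function renderMenu_pipeline(
  photo: Uint8Array,
  ocrItems: Step<Uint8Array, string[]>,           // step 1: extract menu items
  genImage: Step<string, Uint8Array>,             // step 2: generate one image per item
  compose: (photo: Uint8Array, images: Uint8Array[]) => Promise<Uint8Array> // step 3: layout
): Promise<Uint8Array> {
  const items = await ocrItems(photo);
  const images = await Promise.all(items.map(genImage));
  return compose(photo, images);
}

// Software 3.0 version: the whole pipeline collapses into one outcome description
// handed to a single multimodal model call.
async function renderMenu_collapsed(
  photo: Uint8Array,
  callModel: (prompt: string, image: Uint8Array) => Promise<Uint8Array>
): Promise<Uint8Array> {
  return callModel(
    "Read every item on this menu and return the same menu image with a generated photo of each dish overlaid next to it.",
    photo
  );
}
```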
This is also where the tooling layer matters. Platforms like MindStudio handle the orchestration problem when you’re not ready to collapse everything into one call — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows. It’s the practical middle ground between “I wrote all this orchestration code” and “I handed it all to one model and hoped.”
You Can Outsource the Thinking. You Cannot Outsource the Understanding.
Karpathy said there was a tweet he thinks about every other day. The tweet: “You can outsource your thinking but you can’t outsource your understanding.”
He said it in the context of feeling like he was becoming the bottleneck — not because he couldn’t use the tools, but because the information still has to make it into his brain. He still has to know what he’s trying to build, why it’s worth doing, and how to direct the agents. The thinking can be delegated. The understanding cannot.
This is the prediction that’s easiest to dismiss and hardest to argue with. As models get better at execution, the scarce resource shifts toward direction. Knowing what to build, why it matters, whether the output is actually good — those require understanding, not just verification. And understanding requires engagement with the material, not just review of the output.
The practical implication for engineers: the people who will do best in an agentic engineering world are not the ones who can prompt most fluently. They’re the ones who understand the systems deeply enough to know when the agent is wrong, when the architecture is brittle, when the abstraction is leaking. Karpathy noted that the code agents produce is often “bloaty” with “awkward abstractions that are brittle” — it works, but it’s not good code. Catching that requires taste, which requires understanding.
This connects directly to how the next layer of tooling is being built. Remy takes a different approach to this problem: you write a spec — annotated markdown where readable prose carries intent and annotations carry precision — and it compiles into a complete TypeScript stack with backend, database, auth, and deployment. The spec is the source of truth; the code is derived output. The understanding lives in the spec, not in the generated code, which means the human stays in the loop at the level that matters.
The Claude Code source code leak surfaced something similar — the system prompts and tool definitions that make Claude Code work are themselves a form of spec, a structured description of intent that the model executes against. The understanding is encoded in the structure, not delegated away.
What Karpathy Is Actually Predicting for 2026
Pull the thread on everything he said and the prediction is coherent: agentic engineering becomes a real discipline with real skill differentiation. Not everyone running agents is doing the same thing. The gap between someone vibe coding and someone running a hundred parallel agents with quality controls is the same gap as between someone who learned to type and someone who can architect distributed systems.
The verifiability framework tells you which domains get automated first. The Software 3.0 architecture tells you what “programming” looks like when it gets there. The bitter lesson tells you not to bet on the hybrid approach surviving. And the outsourcing-understanding distinction tells you what remains human even when everything else gets automated.
December 2025 was the inflection point Karpathy named. The question for 2026 is whether you’re building skills for the world that existed before it, or the one that came after.
For a closer look at how the models driving this shift actually compare in practice, the GPT-5.4 vs Claude Opus 4.6 comparison is worth reading alongside this — the capability differences Karpathy describes as “jagged” show up clearly in head-to-head agentic tasks. And if you want to understand Karpathy’s broader thinking on knowledge infrastructure for agents, his LLM wiki approach is the practical companion to everything he said at Sequoia.
The models are not going to stop improving. The question is whether your mental model of what they can do is keeping pace.