
Software 3.0 Explained: Why Karpathy Says the Context Window Is Your New RAM

In Karpathy's Software 3.0 paradigm, the context window is RAM, model weights are the CPU, and prompting is programming.

MindStudio Team

Andrej Karpathy Just Redrew the Map of What a Computer Is

Andrej Karpathy — co-founder of OpenAI, former director of AI at Tesla where he led Autopilot, and the person who coined “vibe coding” — stood up at Sequoia’s annual AI event and said something that sounds simple but isn’t: context window = RAM, model weights = CPU, prompting = programming. That’s Software 3.0.

If you’ve been building with LLMs for a while, you’ve probably felt this intuitively. But Karpathy gave it a precise architecture, and the precision matters. It changes what you build, how you build it, and what you stop building entirely.

This post is about that architecture — what it actually means, where it came from, and why it explains things that were previously confusing.


The diagram Karpathy drew (nearly three years ago)

Here’s what’s easy to miss: Karpathy first sketched this idea on Twitter almost three years before the Sequoia talk. The diagram shows a new computing platform built around neural networks. It still has the familiar peripherals — audio in, video in, keyboard, mouse, file systems, a browser. Those pieces haven’t gone away.

But everything in the middle is different.

In a classical computer, the CPU does the processing and RAM holds the working state. In Karpathy’s LLM-as-new-computer diagram, the model weights are the CPU — they’re the fixed processing substrate, the thing that does the computation. The context window is RAM — it’s short-term, working memory, the thing that holds what’s currently relevant to the task.

And if the context window is RAM, then what you put in it is your program.

That’s the Software 3.0 claim. Software 1.0 was explicit rules — you write code, you specify every step. Software 2.0 was learned weights — you arrange datasets and training objectives, and the neural network learns the behavior. Software 3.0 is prompting — your lever over the system is the context window, and the “interpreter” is the LLM itself.
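
One way to see the progression in sketch form. This is illustrative TypeScript, not any real API; `trainedClassifier` and `llm` are hypothetical stand-ins:

```typescript
// Software 1.0: you write the rules.
function isSpam1(email: string): boolean {
  return email.includes("WIN A PRIZE"); // explicit, human-written heuristic
}

// Software 2.0: you curate data and training objectives; the rule lives in learned weights.
declare function trainedClassifier(email: string): number; // learned, not written
const isSpam2 = (email: string) => trainedClassifier(email) > 0.5;

// Software 3.0: you write context; the LLM is the interpreter.
declare function llm(prompt: string): Promise<string>;
const isSpam3 = (email: string) =>
  llm(`Is the following email spam? Answer "yes" or "no".\n\n${email}`);
```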

The shift from 1.0 to 2.0 to 3.0 isn’t just a change in tools. It’s a change in what programming is.


Why this is harder to see than it sounds

The problem is that Software 3.0 looks, on the surface, like “just typing instructions.” That framing makes it easy to underestimate.

Here’s a concrete example that reframes it. When OpenClaw shipped, the natural assumption was: installation = bash script. That’s how software has always been installed. You write a shell script, you handle edge cases for different platforms, the script balloons to hundreds of lines, and you maintain it forever.

The Software 3.0 version of OpenClaw installation is a copy-paste of text that you give to your agent. Not a script. A skill file. It describes the outcome and gives the agent the tools it needs. The agent looks at your environment, figures out what’s needed, debugs in the loop, and makes it work.

Karpathy’s framing: “What is the piece of text to copy paste to your agent? That’s the programming paradigm now.”
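
What does that copy-paste actually look like? Here’s a hypothetical skill file, invented for illustration (the real OpenClaw skill file and its format will differ). Notice that it specifies the outcome and the latitude, not the steps:

```
# Skill: Install OpenClaw

## Outcome
OpenClaw is installed on this machine and runs.

## Approach
- Detect my OS and package manager yourself.
- Prefer the official install method for this platform.
- If a dependency is missing, install it, then retry.
- Verify the install by invoking the tool once; debug any errors you hit.
```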

This is disorienting if you’ve been writing software for years. You’re trained to specify. You’re trained to think in steps. Software 3.0 asks you to think in outcomes and let the weights handle the path.

The OpenClaw example is small, but the principle scales. Karpathy described building a menu-rendering app — OCR the menu, generate images for each item, overlay them — as a multi-step traditional software pipeline. Then he saw the Software 3.0 version: give the photo to a multimodal model, describe the outcome, get the result. His reaction: “All of my menu gen is spurious. That app shouldn’t exist.”
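
In code, the contrast looks roughly like this. It’s a sketch, not a real API: `ocr`, `generateImage`, `overlay`, and `multimodalModel` are hypothetical stand-ins for whatever services you’d actually call.

```typescript
// Hypothetical stand-ins, declared only so the sketch type-checks.
declare function ocr(photo: Uint8Array): Promise<string[]>;
declare function generateImage(dish: string): Promise<Uint8Array>;
declare function overlay(
  photo: Uint8Array,
  dishes: string[],
  images: Uint8Array[]
): Promise<Uint8Array>;
declare function multimodalModel(req: {
  input: Uint8Array;
  prompt: string;
}): Promise<Uint8Array>;

// Software 1.0 pipeline: every step specified, built, and maintained by you.
async function renderMenuPipeline(photo: Uint8Array): Promise<Uint8Array> {
  const dishes = await ocr(photo);                             // step 1: extract menu items
  const images = await Promise.all(dishes.map(generateImage)); // step 2: one image per dish
  return overlay(photo, dishes, images);                       // step 3: composite
}

// Software 3.0: describe the outcome, let the weights find the path.
async function renderMenu(photo: Uint8Array): Promise<Uint8Array> {
  return multimodalModel({
    input: photo,
    prompt:
      "Return this menu photo with a generated image of each dish overlaid next to its name.",
  });
}
```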


The architecture in detail

Let’s be precise about each component.

Model weights = CPU. The weights are fixed at inference time. They encode everything the model learned during training — language, reasoning patterns, world knowledge, code syntax, domain expertise. You don’t change the weights when you use the model. They’re the processing substrate. Like a CPU, they’re always there, always doing the computation, but you don’t reprogram them at runtime.

Context window = RAM. RAM is volatile, working memory. It holds what the current process needs right now. The context window is exactly this — it holds the current conversation, the documents you’ve loaded, the tool outputs, the instructions you’ve given. When the context clears, it’s gone. When you start a new session, you start with empty RAM. The context window is also where you do your “programming” — loading the right information into working memory is how you steer the computation.


Prompting = programming. In Software 1.0, you write code. In Software 2.0, you curate data and define training objectives. In Software 3.0, you write prompts and manage context. The prompt is the program. The context window is where it runs.

This maps cleanly to what Karpathy said: “Your programming now turns to prompting and what’s in the context window is your lever over the interpreter that is the LLM.”
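
To make the mapping concrete, here’s a minimal TypeScript sketch of the analogy. Every name is illustrative; `runModel` stands in for whatever inference call you actually use:

```typescript
// The "computer" in Karpathy's analogy, reduced to types.
type Weights = unknown;    // CPU: fixed at inference time; you never write to it
type ContextItem = string; // instructions, documents, tool outputs, conversation turns

// Hypothetical inference call: the weights process whatever is in working memory.
declare function runModel(weights: Weights, context: ContextItem[]): Promise<string>;

class Session {
  private context: ContextItem[] = []; // RAM: per-session, volatile, starts empty

  constructor(private weights: Weights) {}

  // "Programming" = deciding what goes into working memory.
  load(item: ContextItem) {
    this.context.push(item);
  }

  run(): Promise<string> {
    return runModel(this.weights, this.context);
  }

  clear() {
    this.context = []; // when RAM clears, the "program" is gone
  }
}
```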

One thing worth sitting with: the CPU analogy for weights is imperfect in one interesting way. You can change the “CPU” — through fine-tuning, through RL training, through choosing a different model entirely. When frontier labs do reinforcement learning on verifiable domains like math and code, they’re essentially upgrading the CPU for those specific workloads. The weights get better at certain tasks. This is why Claude Opus 4.6 performs so differently from earlier models on agentic coding tasks — the underlying “CPU” has been retrained.


What the bitter lesson has to do with this

Karpathy referenced the bitter lesson (Rich Sutton’s 2019 essay) in the context of end-to-end neural networks. The lesson, roughly: never bet against neural networks continuing to improve over human-written heuristics.

The Tesla Autopilot story is the clearest illustration. For years, Autopilot was a hybrid: neural net plus human-written rules. If you see a stop sign, stop. The rules were explicit, maintained by engineers, and they covered the cases engineers thought to cover.

One day, an engineer proposed scrapping the hybrid and going pure end-to-end neural network. They made the transition. Autopilot immediately got better — and became easier to maintain.

The human-written rules were the Software 1.0 layer. The end-to-end neural net was Software 2.0. The lesson: the more you let the neural network learn from data rather than encoding human heuristics, the better the outcome.

Software 3.0 extends this. If the weights are the CPU and the context is RAM, then the “operating system” — the layer that used to be human-written rules and explicit logic — is now emergent from the model. You don’t write an OS. You prompt an interpreter.

This is also why Karpathy said his menu-rendering app “shouldn’t exist.” It was a Software 1.0 solution to a problem that a Software 3.0 system handles natively. The explicit pipeline — OCR, image generation, overlay — was human heuristics encoding a task that a multimodal model can do end-to-end.


The jaggedness problem (and why the CPU analogy explains it)

Here’s something that confused a lot of people: how can Claude Opus 4.7 simultaneously refactor a 100,000-line codebase and tell you to walk 50 meters to a car wash rather than drive?

The car wash example is real. You tell a state-of-the-art model that a car wash is 50 meters away and ask whether to drive or walk. The model tells you to walk. Because it’s close. It doesn’t register that you need to drive into a car wash for it to wash your car.


The CPU analogy actually explains this. The weights — the CPU — were trained heavily on verifiable domains. Code is verifiable: you run it, it works or it errors. Math is verifiable: 2 + 2 = 4. Reinforcement learning rewards the model for getting these right, and the model gets very good at them.

“Drive to a car wash” is not a verifiable domain in the RL training sense. There’s no reward signal for getting car wash logistics right. The CPU is optimized for certain workloads and rough on others.

Karpathy’s framing: these models are “jagged entities that really peak in capability in verifiable domains like math and code and kind of stagnate and are a little rough around the edges when things are not in that space.”

The jaggedness isn’t a bug in the architecture. It’s a direct consequence of how the CPU was built. You get peak performance where the training signal was strong, and you get gaps everywhere else.
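
You can see the difference in miniature. A verifiable domain is one where the reward is a function you can actually write. This is illustrative only; real RL pipelines are far more involved:

```typescript
// Verifiable domain: run the generated code against tests and score it.
function rewardForCode(passedTests: number, totalTests: number): number {
  return totalTests === 0 ? 0 : passedTests / totalTests; // unambiguous training signal
}

// Non-verifiable domain: there is no rewardForCarWashLogistics() to write.
// No checker, no reward signal, no gradient; the weights stay rough there.
```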

This is also why Qwen 3.6 Plus’s 1M token context window matters for agentic coding specifically — more RAM means more of the codebase fits in working memory at once, which directly affects what the CPU can reason about in a single pass.


What changes when you actually believe this

Karpathy was asked directly: “If this is actually true, what does a team build differently the day they actually believe this?”

The answer is implicit in everything he said, but here’s the concrete version.

You stop writing installation scripts. You write skill files. You describe outcomes and give the agent tools. The agent handles the path.

You stop building multi-step pipelines for tasks models can do end-to-end. If a multimodal model can take an image and return a transformed image with overlaid content, you don’t build an OCR step, an image generation step, and an overlay step. You describe the outcome.

You think about what’s in the context window as carefully as you think about code. The context window is RAM. What you load into it — documents, instructions, examples, tool outputs — determines what the computation can do. Managing context is a first-class engineering concern, not an afterthought.
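
Here’s a toy version of what that engineering can look like: a naive budget manager that evicts the oldest unpinned items when working memory overflows. The token count is a rough heuristic, not a real tokenizer, and everything here is a sketch:

```typescript
// Toy context manager: keep working memory under a token budget,
// evicting the oldest non-pinned items first.
interface ContextItem {
  text: string;
  pinned: boolean; // system instructions you never evict
}

function fitToBudget(items: ContextItem[], budgetTokens: number): ContextItem[] {
  const tokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic, ~4 chars/token
  const result = [...items];
  let total = result.reduce((sum, item) => sum + tokens(item.text), 0);
  for (let i = 0; i < result.length && total > budgetTokens; ) {
    if (!result[i].pinned) {
      total -= tokens(result[i].text);
      result.splice(i, 1); // evict oldest unpinned item; next item shifts into slot i
    } else {
      i++; // pinned items survive
    }
  }
  return result;
}
```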

You distinguish between vibe coding and agentic engineering. Karpathy’s definitions are precise: vibe coding raises the floor (anyone can build software without understanding syntax). Agentic engineering raises the ceiling (professionals go faster without sacrificing quality). The Software 3.0 architecture underlies both, but they’re different disciplines. Peter Steinberger running dozens — sometimes a hundred — agents in parallel is agentic engineering. Someone building their first app with Claude is vibe coding. Both are valid. Neither is the other.

Platforms like MindStudio handle the orchestration layer that Software 3.0 requires: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — so you can work at the context-and-outcome level without writing the plumbing.


The abstraction ladder keeps going

There’s a pattern here that goes back further than LLMs.

Punch cards → assembly → C → higher-level languages → TypeScript. Each step up the abstraction ladder meant writing less low-level code while gaining more expressive power. You gave up some control, but you got speed and clarity.


Software 3.0 is the next rung. You’re not writing code to specify behavior — you’re writing context to steer a pre-trained interpreter. The “source of truth” moves up the stack.

This is also where tools like Remy fit: you write a spec — annotated markdown where prose carries intent and annotations carry precision — and Remy compiles it into a complete full-stack application: TypeScript backend, SQLite database, auth, tests, deployment. The spec is the source of truth; the code is derived output. It’s the same abstraction shift Karpathy is describing, applied to full-stack app development.

The question Karpathy left open is how far this goes. He said “almost everything can be made verifiable to some extent” — including taste, including art. If that’s true, the CPU (model weights) will eventually be trained on those domains too, and the jaggedness will smooth out.

He also said something that cuts the other direction: “You can outsource your thinking but you can’t outsource your understanding.” It’s a tweet he thinks about every other day. The point is that even in Software 3.0, someone has to direct the computation. Someone has to know what outcome to describe, why it’s worth building, and whether the result is correct.

The context window is RAM. The weights are the CPU. But you’re still the one who decides what program to run.


The part that’s still genuinely unsettled

Karpathy was careful not to claim this architecture is complete or final. The weights-as-CPU analogy has limits — you can retrain the CPU, which you can’t do with physical silicon. The context-as-RAM analogy has limits — RAM is deterministic, context is probabilistic.

And the jaggedness problem is real. Comparing models like Gemma 4 and Qwen 3.5 on local deployment reveals exactly this: different “CPUs” with different training histories produce different jagged profiles. There’s no universal chip yet.

What Karpathy is confident about is the direction. The bitter lesson says don’t bet against end-to-end neural networks. The Software 3.0 framing says the programming model has already changed — most people just haven’t updated their mental model to match.

The OpenClaw installation file is a skill file now, not a bash script. That’s not a prediction. That already happened.
