The 9 Components Every Production Agent Harness Needs (and What Breaks Without Each One)
From while-loops to lifecycle hooks: the exact nine components that separate a toy agent from a production harness, with failure modes for each.
@engineerprompt’s breakdown of agent harness architecture — nine components: while-loop, context management, skills/tools, sub-agent management, built-in skills, session persistence, system prompt assembly, lifecycle hooks, and permissions/safety — is the clearest taxonomy I’ve seen for what actually makes an agent work in production. Most people building agents today have assembled some of these pieces without naming them. Naming them matters, because you can’t debug what you haven’t defined.
If you’ve shipped an agent that works in demos but breaks in production, you’re almost certainly missing one of these nine. This post walks through each one, what it does, and what breaks when it’s absent.
The While-Loop: The Engine Everything Else Runs Inside
The while-loop is the foundation. An LLM without a harness is a one-shot text generator — you send a message, it responds, it stops. The while-loop is what turns that into an agent: the model reads its system prompt, decides which tool to call, runs the tool, feeds the result back into context, and loops again until it produces a text-only response or hits a maximum iteration cap.
Every production coding harness — Claude Code, Codex, Cursor — is, at its core, a while-loop with a tool registry and a permission layer. The sophistication is in what surrounds that loop, not the loop itself.
What breaks without it: you get a chatbot, not an agent. The model can describe what it would do but can’t actually do it across multiple steps.
One implementation detail worth getting right: always cap the iteration count. An uncapped loop will run until it hits a context limit or your API budget. Set a hard ceiling and surface it to the user when it’s reached.
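Here is a minimal sketch of that loop. The names are illustrative: client.complete stands in for whatever model API you call, and run_tool for your tool dispatcher; neither is a real library call.

```python
# Minimal agent loop sketch. `client.complete` and `run_tool` are hypothetical
# stand-ins for your model client and tool dispatcher.
MAX_ITERATIONS = 25  # hard ceiling; surface it to the user when reached

def agent_loop(client, system_prompt, user_message, run_tool):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_ITERATIONS):
        response = client.complete(system=system_prompt, messages=messages)
        messages.append({"role": "assistant", "content": response.content})

        tool_calls = [c for c in response.content if c.get("type") == "tool_use"]
        if not tool_calls:
            return response  # text-only response: the agent is done

        # Run each requested tool and feed the results back into context.
        results = [
            {"type": "tool_result", "tool_use_id": call["id"],
             "content": run_tool(call["name"], call["input"])}
            for call in tool_calls
        ]
        messages.append({"role": "user", "content": results})

    raise RuntimeError(f"Iteration cap ({MAX_ITERATIONS}) reached without a final answer")
```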
Context Management: The Hardest Problem Nobody Talks About
On every turn, the conversation tree grows. More user messages, more tool calls, more results — and eventually you hit the context limit of whatever model you’re running. The harness has to decide what to keep verbatim, what to summarize, and what to discard.
Claude Code’s approach: when you approach roughly 80-90% of the context window, it triggers a compaction. Recent messages stay in full. Everything older gets summarized. The current limit for Opus is 1 million tokens, but compaction still matters — not because you’ll always hit the ceiling, but because models attend unevenly across long contexts. A 900k-token context window doesn’t mean the model is equally attentive to token 1 and token 899,000.
What breaks without it: sessions that work fine for 20 minutes and then start producing incoherent or contradictory outputs as the context fills with stale, conflicting information.
The design decision that trips people up: when you compact, do you include full tool call inputs and outputs, or just the results? Including everything is more faithful but more expensive. Including only the results is cheaper but loses intermediate reasoning. Neither is universally correct — it depends on your task type.
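A sketch of that compaction trigger, under the assumptions described above. The threshold and the keep-recent split are tunable, and count_tokens and summarize are hypothetical helpers you would supply (a tokenizer call and a summarization LLM call, respectively).

```python
# Compaction sketch: past the threshold, keep recent messages verbatim and
# replace everything older with a summary. `count_tokens` and `summarize`
# are hypothetical helpers (tokenizer + summarization call).
COMPACT_THRESHOLD = 0.85   # roughly the 80-90% range described above
KEEP_RECENT = 20           # number of recent messages kept in full

def maybe_compact(messages, context_limit, count_tokens, summarize):
    used = sum(count_tokens(m) for m in messages)
    if used < COMPACT_THRESHOLD * context_limit or len(messages) <= KEEP_RECENT:
        return messages

    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(older)  # one LLM call condensing the older turns
    compacted = [{"role": "user", "content": f"[Compacted history]\n{summary}"}]
    return compacted + recent
```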
Skills and Tools: The Difference Between Universal and Specific
Tools are the primitives: read a file, edit a file, run bash, search code. They’re universal — any agent that touches a filesystem needs them. Skills are a layer on top: organizational knowledge encoded in markdown files, specific to your team’s workflow, your codebase conventions, your deployment process.
The registry is what ties them together. It tells the agent what’s available, what permission each tool requires, and how to dispatch calls. Without a registry, you’re hardcoding tool availability into the system prompt, which doesn’t scale.
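A minimal registry might look like the sketch below: each tool declares a handler and the minimum permission it needs, and the harness dispatches every call through the registry rather than through hardcoded prompt text. The structure and level names are illustrative, not any particular harness’s API.

```python
# Tool registry sketch: tools declare their handler and minimum permission,
# and all calls are dispatched through the registry.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    permission: str            # "read-only", "workspace", or "full"
    handler: Callable[..., str]

REGISTRY: dict[str, Tool] = {}
LEVELS = ["read-only", "workspace", "full"]

def register(tool: Tool) -> None:
    REGISTRY[tool.name] = tool

def dispatch(name: str, granted: str, **kwargs) -> str:
    tool = REGISTRY[name]
    if LEVELS.index(granted) < LEVELS.index(tool.permission):
        raise PermissionError(f"{name} requires {tool.permission} access")
    return tool.handler(**kwargs)

register(Tool("read_file", "Read a file from disk", "read-only",
              lambda path: open(path).read()))
```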
Claude Code’s progressive context loading for skills is worth understanding in detail. Level 1 is the initial search: the model only reads the YAML front matter — name and description — roughly 100 tokens per skill. Level 2 loads the full skill.md, typically 1,000–2,000 tokens, only when the model has identified this as the right skill. Level 3 loads reference files only when the specific request requires them. This three-level approach keeps skill discovery cheap while keeping execution rich.
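In code, the three levels might look like this sketch. It assumes a skills/<name>/skill.md layout with YAML front matter and an optional references/ directory, and uses PyYAML to parse the front matter; the layout and helpers are assumptions, not Claude Code’s actual implementation.

```python
# Three-level skill loading sketch (assumed skills/<name>/skill.md layout).
from pathlib import Path
import yaml  # PyYAML

def level_1_index(skills_dir):
    """Level 1: read only the YAML front matter (name + description)."""
    index = []
    for skill_md in Path(skills_dir).glob("*/skill.md"):
        text = skill_md.read_text()
        if text.startswith("---"):
            front = yaml.safe_load(text.split("---")[1])
            index.append({"path": skill_md, "name": front.get("name"),
                          "description": front.get("description")})
    return index  # roughly 100 tokens per skill enters the context

def level_2_load(entry):
    """Level 2: load the full skill.md once the model picks this skill."""
    return entry["path"].read_text()

def level_3_reference(entry, filename):
    """Level 3: load a specific reference file only when the request needs it."""
    return (entry["path"].parent / "references" / filename).read_text()
```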
What breaks without it: agents that can describe a workflow but can’t execute it, or agents that execute it inconsistently because the knowledge lives in the system prompt instead of a structured, loadable skill.
Sub-Agent Management: When One Thread Isn’t Enough
At some point, a task gets too big or too parallel for a single conversation thread. The harness needs to create sub-agents that work in isolation — each with its own session, its own restricted tool set, and a focused system prompt scoped to a specific subtask.
The pattern is: spawn, restrict, collect. You spawn a sub-agent with a narrow mandate, restrict its tool access to only what that mandate requires, and collect its output back into the parent context as a summary rather than a full transcript.
This is why sub-agents are so useful for research tasks. When you’re exploring a codebase or doing web research, you might load tens of thousands of tokens of raw information. Doing that in your main agent’s context window is expensive and degrades quality. A sub-agent does the heavy lifting and returns a 2,000-token summary. The main agent stays focused.
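A sketch of the spawn/restrict/collect pattern for a research sub-agent. run_agent is a hypothetical function that runs a full agent loop (such as the one sketched earlier) and returns its final text output.

```python
# Spawn, restrict, collect: a research sub-agent with a narrow mandate.
# `run_agent` is a hypothetical full agent-loop runner.
RESEARCH_TOOLS = {"read_file", "search_code", "web_search"}  # no write access

def spawn_research_subagent(run_agent, task, all_tools):
    # Restrict: the sub-agent only sees the tools its mandate requires.
    tools = {name: fn for name, fn in all_tools.items() if name in RESEARCH_TOOLS}

    # Spawn: a fresh session with a focused system prompt scoped to the subtask.
    system_prompt = (
        "You are a research sub-agent. Investigate the task and reply with a "
        "summary of at most ~2,000 tokens. Do not modify any files.\n\n"
        f"Task: {task}"
    )

    # Collect: only the summary returns to the parent, not the full transcript.
    return run_agent(system_prompt=system_prompt, tools=tools, user_message=task)
```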
The WAT framework — Workflows, Agents, and Tools — is a useful mental model here: sub-agents are the “A” layer, sitting between deterministic workflows and raw tool calls.
What breaks without it: agents that try to do everything in a single context window, producing lower-quality outputs as the window fills, and failing on tasks that require genuine parallelism.
Built-In Skills: The Non-Negotiables
Every harness ships with a baseline set of skills that work out of the box. For a coding agent: file read, file write, file edit, search, bash execution, code navigation. These are non-negotiable. If your agent can’t read or edit files, it isn’t a coding agent.
Beyond the primitives, modern harnesses ship higher-level built-in skills: how to make a git commit, how to open a pull request, how to run tests and parse results. Some of these are vendor-specific — Cursor’s built-in skills reflect Cursor’s particular model of how coding work flows.
The Cursor SDK is an interesting case here. When Jack Driscoll embedded a Cursor agent in Gmail, he wasn’t just calling an LLM with tools — he was exposing the same coding agent runtime Cursor already uses internally: repo context, edit, search, terminal workflow, streaming status, model choice, and local/hosted execution. The built-in skills came with the harness, not with the model.
What breaks without it: agents that can reason about code but can’t act on it, or that require you to manually wire every primitive before you can do anything useful.
Session Persistence: Durability When Things Go Wrong
A long agent session is stateful. If the process crashes, you lose everything unless the harness writes state to disk. The modern approach is append-only JSON or markdown files: every message, every tool result, every compaction event gets written as one line. If the process crashes after writing line N, line N is already safe on disk.
The replay method reads the file back line by line and reconstructs the full session. Because the file is append-only, two runs of the harness can share the same log without stepping on each other.
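A minimal sketch of the append-and-replay pattern, using a JSONL file; the file name and event shape are illustrative.

```python
# Append-only session log: one JSON object per line, replayed on startup.
import json
from pathlib import Path

def append_event(log_path, event):
    # Each event is a single line; a crash after line N leaves lines 1..N intact.
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay(log_path):
    # Reconstruct the session by reading the log back line by line.
    if not Path(log_path).exists():
        return []
    with open(log_path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage (illustrative event shape):
# append_event("session.jsonl", {"type": "tool_result", "tool": "bash", "content": "..."})
```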
Anthropic’s managed agents product takes an interesting architectural position here: session management is separate from the harness itself, not embedded in it. That separation makes the session layer independently scalable and replaceable.
What breaks without it: any session longer than a few minutes becomes a liability. Users lose work. Agents lose context. You end up with agents that are only reliable for short, bounded tasks.
For more on how Claude Code’s memory architecture handles this at a deeper level, the Claude Code source leak analysis covers the three-layer memory system in detail.
System Prompt Assembly: Not a Static String
This is the component that surprises most people. The system prompt in a production harness isn’t a static string — it’s a pipeline. The harness walks ancestor directories looking for specific instruction files: agents.md, claude.md, CLAUDE.md. It finds them, reads them, and injects them into the system prompt dynamically.
This is how organizational knowledge gets encoded at the project level. Your team’s coding conventions, your deployment process, your testing strategy — all of it lives in markdown files that the harness discovers and assembles at runtime.
One critical implementation detail: order matters. Keep the static part of the system prompt first, then inject dynamic content. If you reverse this, you break prefix caching. Most production harnesses use aggressive prompt caching, and dynamic content injected before static content invalidates the cache on every turn.
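A sketch of that assembly step: walk from the working directory up through its ancestors, collect any instruction files, and append them after the static base prompt so the cached prefix stays stable. The file names match those mentioned above; everything else is illustrative.

```python
# System prompt assembly sketch: static base first, discovered project
# instructions appended after, to preserve prefix caching.
from pathlib import Path

INSTRUCTION_FILES = ("agents.md", "claude.md", "CLAUDE.md")

def assemble_system_prompt(base_prompt, cwd):
    start = Path(cwd).resolve()
    dynamic_parts = []
    for directory in [start, *start.parents]:      # walk ancestor directories
        for name in INSTRUCTION_FILES:
            candidate = directory / name
            if candidate.exists():
                dynamic_parts.append(f"# From {candidate}\n{candidate.read_text()}")
    return "\n\n".join([base_prompt, *dynamic_parts])
```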
What breaks without it: agents that ignore project-specific conventions, or that require you to manually paste context into every new session.
Lifecycle Hooks: Extensibility Without Forking
Hooks let you inject custom logic before or after a tool runs without modifying the harness itself. A pre-tool hook fires before execution — it receives the tool name and input, and can allow, deny, or modify the call. A post-tool hook fires after and can inspect results for logging or auditing.
The protocol is structured: typically the hook receives the call as JSON and signals allow or deny through its exit code or a JSON response. This is how enterprises adopt harnesses without forking them. You want to log every bash command to your security audit system? Pre-tool hook. You want to post tool results to an observability platform? Post-tool hook. You want to block certain file paths from being read? Pre-tool hook with a deny response.
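A generic pre-tool hook runner might look like the sketch below: the harness passes the pending call to an external hook executable as JSON on stdin, then reads its exit code and any JSON it prints. The exact field names and exit-code semantics vary by harness; this is an illustration, not a specific product’s protocol.

```python
# Pre-tool hook sketch: send the pending call to a hook executable as JSON,
# treat exit code 0 as allow and non-zero as deny, and accept an optional
# JSON response (e.g. a modified input) on stdout.
import json
import subprocess

def run_pre_tool_hook(hook_cmd, tool_name, tool_input):
    payload = json.dumps({"tool": tool_name, "input": tool_input})
    proc = subprocess.run(hook_cmd, input=payload, text=True, capture_output=True)

    if proc.returncode != 0:
        return {"decision": "deny", "reason": proc.stderr.strip()}
    out = proc.stdout.strip()
    return json.loads(out) if out else {"decision": "allow"}
```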
This is also how the PIV loop (plan-implement-validate) methodology integrates with external systems like Jira. The Atlassian MCP server can be wired through hooks to automatically create Jira tickets, update issues, and post comments as the agent works — without the agent needing to know anything about Jira’s internal structure.
What breaks without it: every enterprise customization requires forking the harness. Forks diverge. Maintenance becomes a nightmare.
Permissions and Safety: The Layer That Makes It Deployable
The permission layer is what makes the difference between a useful tool and a dangerous one. Production harnesses define a hierarchy of permission modes — read-only, workspace, full access — and each tool declares the minimum permission it requires. The harness enforces this at dispatch time, before the tool runs.
For bash specifically, the harness classifies commands dynamically. ls and grep stay at read-only. rm and sudo jump to full access. Anything else gets workspace level. The harness parses the command string to make this determination.
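A simplified classifier in that spirit is sketched below. The command lists are illustrative and far from exhaustive, and a production harness needs a much more careful parse (subshells, redirects, command substitution) than this.

```python
# Bash permission classification sketch: read-only commands stay at the
# lowest level, destructive or privileged commands require full access,
# everything else gets workspace level. Lists are illustrative only.
import shlex

READ_ONLY = {"ls", "grep", "cat", "head", "find"}
FULL_ACCESS = {"rm", "sudo"}

def classify_bash(command: str) -> str:
    tokens = shlex.split(command)
    executables = set(tokens[:1])
    # Also check commands after pipes and separators so `ls && rm -rf .`
    # is not misclassified as read-only.
    for i, tok in enumerate(tokens):
        if tok in ("|", "&&", ";", "||") and i + 1 < len(tokens):
            executables.add(tokens[i + 1])

    if executables & FULL_ACCESS:
        return "full"
    if executables and executables <= READ_ONLY:
        return "read-only"
    return "workspace"
```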
On top of static permission rules, production harnesses support interactive approvals: the agent can pause and ask “should I run this?” before executing anything destructive. This is the safety valve that makes autonomous operation acceptable in production environments.
Microsoft’s hosted agents in Foundry — Satya Nadella’s framing was “every agent will need its own computer” — takes this further with dedicated enterprise-grade sandboxes, durable state, and built-in identity and governance. The permission layer isn’t just about what tools can do; it’s about who authorized them and under what conditions.
What breaks without it: agents that delete things they shouldn’t, read files they shouldn’t, or execute commands that violate security policy. One bad run in production and the whole program gets shut down.
Why the Harness Architecture Is Now the Product
Sam Altman told Ben Thompson that he can no longer always tell whether an impressive Codex result came from the model or the harness. That’s not a confession of ignorance — it’s a description of how tightly coupled these two things have become.
The Endor Labs benchmark data makes this concrete. GPT-5.5 running in Codex’s native harness scored 61.5% on functionality. The same model, the same week, running in Cursor’s harness scored 87.2%. That’s a 25.7-point swing from switching harnesses, not models. Opus 4.7 jumped from 87.2% to 91.1% by moving from Claude Code’s native harness to Cursor’s. The harness is doing real work.
Akshay’s three-phase framework for agent evolution captures why: we moved from the weights phase (better models) to the context phase (better prompts) to the harness engineering phase (better environments). Each phase layered on top of the previous one. Weights still matter. Context still matters. But the center of gravity has moved to the harness.
Platforms like MindStudio handle this orchestration layer for builders who don’t want to wire it themselves: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows. The nine components described above are all present — they’re just abstracted behind a builder interface rather than implemented in Python.
The skill systems concept from @simonscrapes illustrates what this looks like at the application layer. A YouTube short-form clip system chains five modular skills — transcript extraction, clip selection, face-tracked reframe, illustrated editing, publishing — and runs fully autonomously. Each skill is a focused component. The orchestrator skill wires them together. The harness provides the while-loop, context management, session persistence, and permission enforcement that make the whole thing reliable.
What to Actually Build Next
If you’re building agents and you haven’t audited your harness against these nine components, that’s the first thing to do. Go through the list and ask: do I have this? Is it production-grade or is it a stub?
The most commonly missing components in agent projects I’ve seen: session persistence (people assume the process won’t crash), lifecycle hooks (people hardcode customizations into the harness itself), and dynamic system prompt assembly (people use static strings and wonder why the agent ignores project conventions).
The self-evolving Claude Code memory system with Obsidian and hooks is a good reference implementation for the session persistence and lifecycle hook components working together — hooks capture session logs, session persistence makes them durable, and the memory system builds on both.
For teams thinking about how to structure the planning and validation layer around their harness, the PIV loop methodology — running an AI engineering team with heartbeat scheduling — shows how the harness components map to a real development workflow.
If you’re at the point where you want to compile a full application from a spec rather than wire individual harness components, Remy takes a different approach: you write annotated markdown describing your application, and it compiles a complete TypeScript backend, SQLite database, frontend, auth, and deployment from that spec. The spec is the source of truth; the generated code is derived output.
The nine components aren’t a checklist you complete once. They’re a framework for diagnosing what’s wrong when your agent breaks in production — and for understanding why switching harnesses can matter as much as switching models.