How to Build a Minimal Agent Harness in Python: Step-by-Step with Session Persistence
Build a working agent harness in under an hour using append-only JSON session logs and dynamic system prompt assembly from agents.md files.
You Can Build a Working Agent Harness in Under an Hour
Most developers who’ve spent time with Claude Code or Codex have had the same experience: the agent does something impressive, and you genuinely can’t tell whether to credit the model or the scaffolding around it. Sam Altman said exactly this to Ben Thompson — “I no longer think of the harness and the model as these entirely separable things.” That’s not a throwaway observation. It’s a description of where the leverage actually lives.
This post is about the scaffolding. Specifically, a Python reference implementation built around two ideas: append-only JSON session persistence and dynamic system prompt assembly from agents.md and claude.md files. You can have a working version running in under an hour. The goal isn’t to replicate Claude Code or Cursor — it’s to understand what those tools are actually doing, so you can build something that fits your specific problem.
If you want to understand why this matters before touching any code, the Endor Labs benchmark data is clarifying. GPT-5.5 running in Cursor’s harness scored 87.2% on functionality tests. The same model, the same week, running in the Codex harness scored 61.5%. That’s a swing of nearly 26 points from infrastructure, not from the model. Opus 4.7 jumped from 87.2% to 91.1% just by switching harnesses. The harness is doing real work.
What You’re Actually Building (and Why It’s Worth the Hour)
A harness is not a framework. LangChain, LangGraph, CrewAI — those are frameworks. They give you abstractions you wire together. A harness ships a working agent. The distinction matters because frameworks assume you’re the architect; a harness assumes the agent is the one doing the task.
At its core, a harness is a while-loop with a tool registry and a permission layer. Everything else — context management, session persistence, system prompt assembly, lifecycle hooks, sub-agent management — is scaffolding around that loop. The nine components that show up in every production harness (the while-loop, context management, skills/tools, sub-agent management, built-in skills, session persistence, system prompt assembly, lifecycle hooks, and permissions/safety) aren’t arbitrary. They’re the minimum viable set for an agent that can do real work without losing state or going off the rails.
The specific thing you’re building here is the subset that gives you the most leverage with the least complexity: durable session state and dynamic context injection. These two features together mean your agent can resume after a crash, accumulate knowledge across runs, and adapt its behavior based on project-specific instructions — without you having to re-prompt it every session.
For teams building on top of Claude Code, understanding this architecture also unlocks the three-layer memory system that Claude Code uses internally — the same principles apply to any harness you build yourself.
Prerequisites
You need Python 3.10 or later, an Anthropic API key (set as ANTHROPIC_API_KEY in your environment), and the anthropic Python SDK (pip install anthropic). That’s the full dependency list for the minimal implementation. No vector databases, no orchestration frameworks, no external state stores.
You should be comfortable reading Python and have a basic mental model of how LLM tool-calling works — the model returns a structured tool call, you execute it, you feed the result back. If you’ve used the Anthropic Messages API directly, you have everything you need.
One conceptual prerequisite: understand the difference between a session (a single continuous run) and a conversation (the message history within that session). The persistence layer you’re building tracks sessions; the context management layer handles conversations within sessions.
Building the Harness
Step 1: The While-Loop Engine
The entire harness runs on a single loop. On every iteration, the model reads its assembled system prompt, decides which tool to call, the harness executes the tool, the result gets appended to the message history, and the loop continues until the model produces a text-only response or hits a maximum iteration cap.
```python
def run(self, user_message: str, max_iterations: int = 50) -> str:
    self.session.append({"role": "user", "content": user_message, "ts": time.time()})
    for _ in range(max_iterations):
        response = self.client.messages.create(
            model="claude-opus-4-5",
            max_tokens=8096,
            system=self.assemble_system_prompt(),
            messages=self._get_context(),
            tools=self.registry.descriptors(),
        )
        self.session.append({"role": "assistant", "content": response.content, "ts": time.time()})
        if response.stop_reason == "end_turn":
            return self._extract_text(response)
        # Handle tool calls
        tool_results = self._dispatch_tools(response)
        self.session.append({"role": "user", "content": tool_results, "ts": time.time()})
    return "Max iterations reached."
```
The cap matters. Without it, a confused agent can loop indefinitely. Set it conservatively at first — 50 iterations is plenty for most tasks — and raise it only when you have observability into what the agent is actually doing.
Now you have: a loop that drives the agent forward and stops cleanly.
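The loop above calls `_dispatch_tools`, which isn't shown. A minimal standalone sketch of what it might do (the dict-shaped content blocks and the plain-dict `registry` are simplifying assumptions; the real Anthropic SDK returns typed objects):

```python
def dispatch_tools(content_blocks: list[dict], registry: dict) -> list[dict]:
    """Execute each tool_use block and return matching tool_result blocks."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # skip plain text blocks
        handler = registry.get(block["name"])
        try:
            # Surface errors to the model instead of crashing the loop
            output = handler(**block["input"]) if handler else f"Unknown tool: {block['name']}"
        except Exception as exc:
            output = f"Tool error: {exc}"
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(output),
        })
    return results
```

Note the error handling: a failed tool call becomes a result the model can see and react to, rather than an exception that kills the session.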
Step 2: Append-Only JSON Session Persistence
This is the feature that makes the harness durable. Every event — user messages, tool calls, tool results, compaction events — gets written to disk as a single JSON line immediately after it occurs. If the process crashes after writing line 47, line 47 is safe. You resume from line 47.
```python
import json
import os

class SessionStore:
    def __init__(self, path: str):
        self.path = path

    def append(self, event: dict) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()  # Critical: flush immediately

    def replay(self) -> list[dict]:
        if not os.path.exists(self.path):
            return []
        events = []
        with open(self.path, "r") as f:
            for line in f:
                line = line.strip()
                if line:
                    events.append(json.loads(line))
        return events
```
The flush() call is not optional. Without it, Python buffers writes in userspace and you lose events if the process crashes; if you also need durability across power loss, follow the flush with os.fsync(f.fileno()). The append-only structure means two concurrent runs can share the same log without corrupting each other — each appends, neither overwrites.
On startup, call replay() to reconstruct the session state. The agent picks up exactly where it left off.
Now you have: crash-safe session state that survives process restarts.
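To see the crash-safety property concretely, here is a small standalone demo (a minimal append/replay pair is re-declared so the snippet runs on its own):

```python
import json
import os
import tempfile

def append(path: str, event: dict) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()

def replay(path: str) -> list[dict]:
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

log = os.path.join(tempfile.mkdtemp(), "session.jsonl")
append(log, {"role": "user", "content": "hello"})
append(log, {"role": "assistant", "content": "hi there"})

# Simulate a restart: a "fresh process" sees exactly the events that were flushed.
events = replay(log)
print([e["role"] for e in events])  # → ['user', 'assistant']
```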
Step 3: Dynamic System Prompt Assembly
Most people treat the system prompt as a static string. That’s a mistake. Claude Code’s approach — and the approach you should copy — is to walk the directory tree looking for claude.md and agents.md files and inject their contents into the system prompt at runtime.
```python
def assemble_system_prompt(self) -> str:
    parts = [BASE_SYSTEM_PROMPT]  # Static part first — preserves prefix caching
    # Walk ancestor directories for instruction files
    search_dirs = self._get_ancestor_dirs()
    for directory in search_dirs:
        for filename in ["agents.md", "claude.md", "AGENTS.md", "CLAUDE.md"]:
            filepath = os.path.join(directory, filename)
            if os.path.exists(filepath):
                with open(filepath, "r") as f:
                    content = f.read().strip()
                if content:
                    parts.append(f"# Instructions from {filepath}\n\n{content}")
    return "\n\n".join(parts)

def _get_ancestor_dirs(self) -> list[str]:
    dirs = []
    current = os.getcwd()
    while True:
        dirs.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent
    return list(reversed(dirs))  # Root first, most specific last
```
The order matters: static content first, dynamic content second. Reversing this breaks prefix caching, which means you pay full token costs on every call instead of cache hit costs. Keep the static base prompt stable and append dynamic content after it.
This is how you encode project-specific conventions, team workflows, and domain knowledge into the agent without touching the harness code. Drop an agents.md in your project root and the agent picks it up automatically.
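For a sense of what goes in that file, a hypothetical agents.md at a project root might look like this (the contents are purely illustrative, not a required format):

```markdown
# Project instructions

- This is a Python 3.10+ codebase; use type hints on all public functions.
- Run tests with `pytest -q` before declaring a task complete.
- Never commit directly to `main`; create a feature branch instead.
```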
Now you have: a system prompt that adapts to the project without code changes.
Step 4: The Tool Registry
Tools are the primitives — read a file, write a file, run a bash command. Skills sit on top: they’re tools whose handlers read a markdown file at invocation time, encoding organizational knowledge as executable instructions.
```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    permission: str  # "read", "workspace", "full"
    handler: Callable
    description: str
    schema: dict

class Registry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def get(self, name: str) -> Tool | None:
        return self._tools.get(name)

    def descriptors(self) -> list[dict]:
        # Lightweight version sent to the model
        return [
            {
                "name": t.name,
                "description": t.description,
                "input_schema": t.schema,
            }
            for t in self._tools.values()
        ]
```
Register your built-in primitives at startup: read_file, write_file, run_bash. These are non-negotiable for a coding agent. Without them, the agent can reason but not act.
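Registering one of those primitives then looks like this (a sketch that re-declares a minimal Tool and Registry so it runs standalone; the schema shape follows the Anthropic tools API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    permission: str
    handler: Callable
    description: str
    schema: dict

class Registry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}
    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool
    def get(self, name: str):
        return self._tools.get(name)
    def descriptors(self) -> list[dict]:
        return [{"name": t.name, "description": t.description, "input_schema": t.schema}
                for t in self._tools.values()]

def read_file(path: str) -> str:
    with open(path, "r") as f:
        return f.read()

registry = Registry()
registry.register(Tool(
    name="read_file",
    permission="read",  # read-only: never needs an approval gate
    handler=read_file,
    description="Read a UTF-8 text file and return its contents.",
    schema={"type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]},
))
```

The description and schema are what the model sees; the permission and handler are what the harness enforces and executes.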
Now you have: a tool dispatch layer the model can call against.
Step 5: Context Compaction
Context windows are finite. When the message history grows past a threshold — say, 80% of the model’s context limit — you need to summarize older messages and discard the originals. Keep recent messages verbatim; summarize everything older.
```python
def _get_context(self) -> list[dict]:
    messages = self.session.replay()
    # Rough token estimate: 4 chars ≈ 1 token
    total_chars = sum(len(json.dumps(m)) for m in messages)
    estimated_tokens = total_chars / 4
    if estimated_tokens > COMPACTION_THRESHOLD:
        # Keep last N messages verbatim, summarize the rest
        recent = messages[-KEEP_RECENT:]
        older = messages[:-KEEP_RECENT]
        summary = self._summarize(older)
        compaction_event = {
            # The Messages API accepts only "user" and "assistant" roles in the
            # messages list, so the summary travels as a user-role message
            "role": "user",
            "content": f"[Context compacted. Summary of earlier conversation: {summary}]",
            "ts": time.time(),
            "type": "compaction",
        }
        self.session.append(compaction_event)
        return [compaction_event] + recent
    return messages
```
Bad compaction is one of the most common ways agents go wrong. If you summarize too aggressively, the agent loses critical context. If you never compact, you hit the context limit and the call fails. Log compaction events to your session store — they’re important for debugging.
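The split decision itself can be unit-tested without a model call. A sketch (the 4-chars-per-token estimate matches the code above; the threshold and keep values here are arbitrary test inputs):

```python
import json

def split_for_compaction(messages: list[dict], threshold_tokens: int, keep_recent: int):
    """Return (older, recent): `older` goes to the summarizer, `recent` stays verbatim."""
    total_chars = sum(len(json.dumps(m)) for m in messages)
    if total_chars / 4 <= threshold_tokens:
        return [], messages  # under budget: no compaction needed
    return messages[:-keep_recent], messages[-keep_recent:]

msgs = [{"role": "user", "content": "x" * 100} for _ in range(10)]
older, recent = split_for_compaction(msgs, threshold_tokens=50, keep_recent=3)
```

Separating the split from the summarization makes the deterministic half of compaction trivially testable; only `_summarize` needs a model.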
Now you have: a harness that handles long-running sessions without hitting context limits.
Step 6: Permission Enforcement
Each tool declares the minimum permission it needs. The harness enforces this at dispatch time, before the tool runs. For bash commands, classify dynamically: ls, cat, grep are read-only; rm, sudo, shutdown require full access.
```python
SAFE_COMMANDS = {"ls", "cat", "grep", "find", "echo", "pwd", "head", "tail"}
DANGEROUS_COMMANDS = {"rm", "sudo", "shutdown", "reboot", "mkfs", "dd"}

def classify_bash_permission(command: str) -> str:
    first_token = command.strip().split()[0] if command.strip() else ""
    if first_token in SAFE_COMMANDS:
        return "read"
    if first_token in DANGEROUS_COMMANDS:
        return "full"
    return "workspace"
```
Add an interactive approval gate for destructive operations. The agent pauses, prints the proposed command, and waits for explicit confirmation. This is the difference between a useful tool and one you’re afraid to run.
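Enforcement at dispatch time can be a small, testable function. A sketch with an injectable confirm callback (in a real CLI, confirm would wrap input(); the rank ordering is an assumption):

```python
PERMISSION_RANK = {"read": 0, "workspace": 1, "full": 2}

def check_permission(required: str, granted: str, confirm=None, description: str = "") -> bool:
    """Allow the call if `granted` covers `required`; otherwise ask `confirm`.

    `confirm` is a callable taking a human-readable prompt and returning a bool.
    It is injected so the gate is testable without a terminal.
    """
    if PERMISSION_RANK[required] <= PERMISSION_RANK[granted]:
        return True
    if confirm is not None:
        return confirm(f"Approve escalated operation? {description}")
    return False  # no way to ask: deny by default
```

Deny-by-default is the important property: when no human is available to confirm, the dangerous path is closed.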
Now you have: a permission layer that prevents the agent from doing things it shouldn’t.
The Failure Modes You’ll Actually Hit
Caching breaks when you change the static prompt. If you modify BASE_SYSTEM_PROMPT, Anthropic’s prompt caching invalidates and you pay full token costs until the cache warms up again. Keep the static portion stable. Put everything project-specific in agents.md files.
Compaction loses critical state. If your summarization prompt is too aggressive, the agent forgets tool schemas, project conventions, or in-progress work. Test compaction explicitly by running long sessions and checking whether the agent’s behavior degrades after a compaction event. The self-evolving memory system built with Claude Code hooks uses a similar pattern to preserve important context across sessions.
The while-loop runs forever on ambiguous tasks. If the agent can’t make progress but also can’t produce a text-only response, it loops until it hits the iteration cap. Add a “stuck detection” heuristic: if the last three tool calls were identical, break the loop and surface the state to the user.
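That heuristic is only a few lines. A sketch, assuming tool calls are recorded as (name, args) pairs:

```python
def is_stuck(tool_calls: list, window: int = 3) -> bool:
    """True if the last `window` tool calls are identical (same name and args)."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    return all(call == recent[0] for call in recent)
```

Call it after each dispatch; when it fires, break the loop and show the user the repeated call instead of burning the rest of the iteration budget.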
Session replay is slow on large logs. Append-only JSON is fast to write but O(n) to read. For sessions with thousands of events, replay time becomes noticeable. Add an index file that records the byte offset of each compaction event, so you can seek directly to the most recent checkpoint instead of replaying from the beginning.
Tool schemas drift from implementation. When you update a tool’s behavior but forget to update its JSON schema, the model calls it with arguments that don’t match. Write tests that validate tool schemas against their handlers. This is boring work that saves hours of debugging.
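A minimal version of that check compares a tool's declared schema against its handler's signature (a sketch using inspect; it catches renamed or missing parameters, not type mismatches):

```python
import inspect

def schema_matches_handler(schema: dict, handler) -> bool:
    """Every schema property must map to a handler parameter, and every
    required property must actually exist on the handler."""
    params = set(inspect.signature(handler).parameters)
    props = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    return required <= params and props <= params

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

schema = {"type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]}
```

Run this over every registered tool in your test suite and drift gets caught at commit time instead of mid-session.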
Where to Take This
The reference implementation above is intentionally minimal. It handles the core loop, persistence, and prompt assembly. Production harnesses add several layers on top.
Lifecycle hooks — pre-tool and post-tool — let you inject custom logic without touching the harness itself. A pre-tool hook can log every tool call to an observability system, enforce additional permission rules, or modify tool inputs before execution. A post-tool hook can audit results, trigger alerts, or feed data to downstream systems. This is how enterprises adopt harnesses without forking them.
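A minimal hook runner might look like this (a sketch; the pre-hook contract of "return modified args, or None to leave them unchanged" is an assumption):

```python
from typing import Any, Callable

class Hooks:
    """Pre-hooks may observe or rewrite tool inputs; post-hooks observe results."""
    def __init__(self):
        self.pre: list[Callable] = []
        self.post: list[Callable] = []

    def run_pre(self, tool_name: str, args: dict) -> dict:
        for hook in self.pre:
            updated = hook(tool_name, args)
            if updated is not None:  # a hook returns None to leave args untouched
                args = updated
        return args

    def run_post(self, tool_name: str, result: Any) -> Any:
        for hook in self.post:
            hook(tool_name, result)
        return result

hooks = Hooks()
audit_log = []
# Observe-only hook: logs every call, returns None
hooks.pre.append(lambda name, args: audit_log.append((name, dict(args))))
# Rewriting hook: forces file access into a hypothetical sandbox prefix
hooks.pre.append(lambda name, args: {**args, "path": "/sandbox" + args["path"]})
out = hooks.run_pre("read_file", {"path": "/etc/hosts"})
```

The harness calls `run_pre` just before dispatch and `run_post` just after, which is what lets teams layer auditing and policy on top without forking the loop.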
Sub-agent management is the next major addition. When a task is too large or too parallel for a single conversation thread, the harness spawns sub-agents with their own sessions, restricted tool sets, and focused system prompts. Each sub-agent gets a specific task; the orchestrator collects results. The pattern is: spawn, restrict, collect. This is exactly the architecture behind the five-skill YouTube short-form clip system — transcript extraction feeding into clip selection feeding into face-tracked reframe feeding into illustrated editing feeding into publishing — where each stage runs as a focused sub-agent with its own context window.
The skill system concept extends this further. An orchestrator skill wraps multiple modular sub-skills, each loaded progressively: level 1 is just the YAML front matter (~100 tokens), level 2 is the full skill.md (1-2k tokens), level 3 loads reference files only when the specific task requires them. This is how Claude Code keeps context costs manageable across large skill libraries. For a deeper treatment of how skills compose into larger workflows, the Claude Code skills vs plugins breakdown covers the architectural tradeoffs.
The PIV loop methodology — plan, implement, validate — maps cleanly onto this harness architecture. Planning generates a structured artifact (a plan.md); implementation runs against that artifact in a fresh session; validation runs tests and updates the Jira ticket via the Atlassian MCP server. Each phase is a separate harness invocation with its own session store. The harness doesn’t need to know about Jira; the skill does.
If you’re building production applications on top of this kind of agent infrastructure, MindStudio offers a no-code path to the same orchestration patterns — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which is useful when the harness engineering itself isn’t the core product you’re building.
For teams who want to go further up the abstraction stack, Remy takes a different approach to the spec-to-implementation pipeline: you write an annotated markdown spec and it compiles a complete TypeScript full-stack application from it — backend, database, auth, deployment. The spec is the source of truth; the generated code is derived output. It’s a different layer of the same abstraction trend that’s driving harness engineering.
The WAT framework (Workflows, Agents, and Tools) provides a complementary mental model for structuring the files and conventions that sit above the harness layer. Once your harness is running, you’ll want a consistent way to organize the skills, workflows, and tool definitions that live inside it.
The minimal harness you’ve built here — while-loop, append-only JSON persistence, dynamic system prompt assembly — is the foundation everything else builds on. The Cursor SDK, Claude managed agents, Microsoft’s hosted agents in Foundry: these are all variations on the same architecture, with different tradeoffs around control, cost, and operational complexity. Understanding the foundation means you can evaluate those tradeoffs clearly, instead of treating them as black boxes.
Build the minimal version first. Run it on a real task. Watch where it fails. The failures will tell you exactly which components to add next.