Agentic Coding Levels Explained: From Autocomplete to the Dark Factory
From tab completion to fully autonomous codebases, here are the five levels of AI-assisted coding and where production teams should operate today.
The Five Levels of AI-Assisted Coding
Most conversations about AI coding treat it as one thing. “Do you use AI in your workflow?” as if GitHub Copilot and a fully autonomous pipeline that ships pull requests without a human in the loop are the same category.
They’re not. Agentic coding operates across a spectrum, and where you sit on that spectrum determines what you can build, how fast, and what risks you’re taking on. Understanding the levels helps you make real decisions — about tooling, team structure, and when to hand off control.
This article maps the five levels of AI-assisted development, from tab completion through to the dark factory. It covers what each level looks like in practice, who’s operating at each level today, and what it takes to move up.
Level 1: Autocomplete
This is where most developers first encountered AI coding tools. A model watches what you type and suggests the next few tokens: a variable name, a closing bracket, a function call you were about to write anyway.
GitHub Copilot is the most widely used example. It generates completions inline, inside your existing editor, using your surrounding code as context. You accept or reject each suggestion. The AI has no memory between completions and no ability to reason about the larger system.
What it does well: Reduces keystrokes for boilerplate. Fills in repetitive patterns — serializers, test fixtures, CRUD endpoints. Useful for languages or APIs you know less well.
What it doesn’t do: Understand your architecture. Make decisions. Handle ambiguity. Write anything requiring more than a few seconds of forward planning.
At Level 1, the human is fully in control. The AI is a fast typist who reads over your shoulder.
Level 2: Chat-Assisted Coding
Level 2 introduces a conversation. Instead of autocomplete, you describe what you want and the model writes it. You can ask questions, get explanations, request refactors. The interaction is back-and-forth.
Tools like Cursor and Claude in chat mode operate here by default. You write a prompt, the model proposes a diff or a new file, you review it, you apply it or you don’t.
Comparing Cursor and Claude Code shows how differently two tools at roughly this level can feel — one is built around the code editor, the other around conversational context — but both still rely on you to evaluate, approve, and apply every meaningful change.
What it does well: Handles more complexity than autocomplete. Can generate full components, write tests for existing code, explain unfamiliar patterns, and refactor on request.
What it doesn’t do: Run code. Observe results. Iterate without you. Each conversation is independent unless you manually provide context.
At Level 2, you’re delegating drafting, but you’re still making all the decisions.
Level 3: Agentic Coding
Level 3 is where things get interesting — and where agentic AI starts to mean something real.
An agentic coding tool doesn’t just suggest. It writes code, runs it, reads the output, decides what to do next, and loops. The agent has access to tools: a terminal, file system, test runner, maybe a browser. It can fail, observe the failure, reason about what went wrong, and try again — without you intervening at each step.
Claude Code in agentic mode works like this. You give it a task — “add pagination to this endpoint and make the tests pass” — and it goes. It reads relevant files, writes the changes, runs the test suite, reads the failures, patches the code, and repeats until the tests pass or it gives up and explains why.
This shift matters because the agent is no longer just a text generator. It’s running a feedback loop. Understanding how AI coding agents actually work helps clarify why this loop — observe, plan, act, evaluate — is fundamentally different from chat.
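The loop above can be sketched in a few lines. This is a minimal illustration, not any tool's actual implementation: `propose_patch`, `apply_patch`, and `run_tests` are hypothetical stand-ins for the model call and tool access a real agent would have.

```python
from typing import Callable


def agentic_loop(
    task: str,
    propose_patch: Callable[[str, str], str],
    apply_patch: Callable[[str], None],
    run_tests: Callable[[], tuple[bool, str]],
    max_steps: int = 5,
) -> bool:
    """Plan, act, evaluate, repeat until the tests pass or the budget runs out."""
    feedback = ""
    for _ in range(max_steps):
        patch = propose_patch(task, feedback)  # plan: reason over the task + last failure
        apply_patch(patch)                     # act: edit files on disk
        passed, output = run_tests()           # evaluate: observe real test results
        if passed:
            return True                        # goal reached; surface the diff for review
        feedback = output                      # loop: feed the failure into the next attempt
    return False                               # give up and report why to the human
```

The key design point is the last line of the loop body: the test output becomes input to the next planning step, which is exactly what a chat interface cannot do on its own.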
What it does well: Tasks that require iteration. Debugging, refactoring, feature implementation across multiple files, test-driven development where the tests already exist.
What it doesn’t do: Operate reliably on long tasks. Maintain context over many steps — a problem known as context rot, where the model’s understanding degrades as the conversation grows. Make architectural decisions safely without oversight.
At Level 3, you’re setting goals and reviewing results. You’re not watching every step, but you’re reviewing the output before it ships.
Level 4: Harness-Driven Development
Level 4 is what serious engineering teams are actually building when they talk about “AI-assisted development at scale.” It’s not one agent. It’s a structured system: orchestrators, sub-agents, evaluators, guardrails, and defined handoff points where humans review before anything merges.
This is what Stripe’s Minions system produces — reportedly over 1,300 AI-generated pull requests per week. The agents don’t have free rein. They operate inside a harness: a defined workflow with explicit constraints, test requirements, and review gates.
Harness engineering is the discipline of building these systems. It sits above prompt engineering and context engineering. The question isn’t “how do I write a better prompt?” It’s “how do I design a system where the agent’s outputs are reliably reviewable and safe to merge?”
Common patterns at this level include:
- Planner-generator-evaluator loops — one agent plans, another generates, a third evaluates the output against a rubric before surfacing it to a human
- Sub-agent decomposition — breaking a large task into smaller isolated tasks that each agent handles in a fresh context window, avoiding the degradation that comes from long single-agent chains
- Gated checkpoints — the workflow pauses at defined moments for human review before continuing
Multi-agent workflow patterns give a concrete sense of how these systems are wired together. The right pattern depends on what you’re building and how much autonomy you’re comfortable granting.
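The planner-generator-evaluator pattern can be sketched as follows. This is a simplified, hypothetical skeleton: in a real harness, `plan`, `generate`, and `evaluate` would each be separate agent calls, typically running in fresh context windows.

```python
from typing import Callable, Optional


def harness_run(
    ticket: str,
    plan: Callable[[str], list[str]],
    generate: Callable[[str], str],
    evaluate: Callable[[str], float],
    threshold: float = 0.8,
) -> Optional[list[str]]:
    """Decompose a ticket; only surface work that clears the evaluation rubric."""
    outputs = []
    for subtask in plan(ticket):        # planner: break the ticket into bounded steps
        draft = generate(subtask)       # generator: fresh context per sub-task
        score = evaluate(draft)         # evaluator: score the draft against a rubric
        if score < threshold:
            return None                 # gated checkpoint: escalate to a human instead
        outputs.append(draft)
    return outputs                      # everything cleared the gate; ready for human review
```

Note how the decomposition doubles as context management: each sub-task gets its own generation call, which is how these systems sidestep the degradation of long single-agent chains.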
What it does well: High-volume, repeatable coding tasks. Routine ticket work. Test coverage, migrations, localization, documentation. Anything where the pattern is clear but the volume is too high for humans alone.
What it doesn’t do: Handle genuinely novel problems without human guidance. Self-correct without the right evaluation harness in place. Work out of the box — Level 4 requires significant upfront investment to design and maintain.
At Level 4, humans are still in the loop — but they’re reviewers and architects, not keystroke-by-keystroke coders. How AI is changing what it means to be a developer covers this shift in more depth.
Level 5: The Dark Factory
Level 5 is where humans are removed from the critical path entirely.
The term comes from manufacturing. A dark factory is a fully automated plant — no lights needed because there are no workers. The machines run in the dark.
Applied to software, a dark factory is a codebase that ships code without human review in the loop. An agent receives a trigger — a bug report, a feature spec, a failing test — and produces a merged, deployed change. No one approves the pull request. The system handles the change end-to-end.
This isn’t science fiction. It’s being built, tested, and deployed in controlled contexts today. But it requires specific conditions:
- High test coverage and strong CI — the automated test suite is the human review proxy. If tests pass, the change ships.
- Narrow task scope — dark factories work best for well-defined, bounded tasks. Not greenfield architecture.
- Observable outputs — you need monitoring and rollback. If a dark factory ships something bad, you need to catch it in production and revert automatically.
- Progressive autonomy — the safest path is expanding agent permissions gradually, not granting full autonomy immediately. Progressive autonomy for AI agents covers how to think about this safely.
Building an AI dark factory is genuinely complex. The infrastructure — CI pipelines, deployment gates, rollback systems, monitoring — has to be airtight before you remove humans from the loop.
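The merge gate at the center of that infrastructure can be sketched as a simple decision function. The field names and the allowed-path list here are illustrative assumptions, not taken from any specific CI system:

```python
from dataclasses import dataclass


@dataclass
class Change:
    tests_passed: bool
    coverage: float           # fraction of changed lines covered by tests
    files_touched: list[str]


# Narrow task scope: only low-risk areas are eligible for autonomous merge.
ALLOWED_PATHS = ("deps/", "locales/", "templates/")


def can_auto_merge(change: Change, min_coverage: float = 0.9) -> bool:
    """Tests act as the review proxy; scope and coverage bound the blast radius."""
    in_scope = all(f.startswith(ALLOWED_PATHS) for f in change.files_touched)
    return change.tests_passed and change.coverage >= min_coverage and in_scope
```

Anything that fails this check falls back to the normal human-reviewed path — which is the progressive-autonomy posture in miniature: the allowed scope starts tiny and widens only as the verification proves itself.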
What it does well: High-frequency, routine changes at a scale no human team can match. Bug fixes with clear reproduction steps. Dependency updates. Content-driven template generation.
What it doesn’t do: Handle ambiguity. Make product decisions. Operate safely without strong automated verification.
At Level 5, the developer’s job isn’t writing or reviewing code. It’s designing and maintaining the system that writes, tests, and ships code autonomously.
Where Production Teams Should Be Operating Right Now
Most teams aren’t ready for Level 5. That’s not a criticism — it’s a realistic assessment of where automated verification, trust, and risk tolerance actually stand.
The honest answer for most production teams in 2026: Level 3 to Level 4, with Level 5 applied narrowly.
Here’s what that looks like in practice:
- Solo builders and indie hackers — Level 3 is the sweet spot. Shipping full-stack apps with AI is now genuinely viable for people without large engineering teams. One developer operating at Level 3 can move faster than a small team operating at Level 1.
- Product teams with existing codebases — Level 4 is the target. Invest in the harness. Define what tasks agents can handle reliably. Put review gates in place for anything touching critical paths.
- Enterprises with strong CI/CD — Selective Level 5 for low-risk, high-volume tasks. Dependency updates, test generation, localization strings. Humans stay in the loop for anything with meaningful production risk.
The biggest mistake teams make is jumping to Level 5 without the infrastructure. Why AI-generated apps fail in production usually comes down to this: the autonomy outran the verification.
What Moves You Between Levels
Moving up isn’t just about switching tools. A few factors determine which level is actually accessible:
Model capability. Earlier models couldn’t reliably operate at Level 3 — they’d hallucinate tools, misread error messages, and spiral. Better models changed this. The AI model tipping point explains why capability improvements in reasoning models suddenly made agentic tools actually work.
Test coverage. Automated verification is what makes higher autonomy safe. Without good tests, even Level 3 is risky because you can’t tell if the agent’s output broke something.
Task clarity. Agents perform better on well-defined tasks. Ambiguous goals produce worse outputs and require more human intervention. Level 5 specifically requires extremely well-scoped work.
Harness investment. Moving from Level 3 to Level 4 requires building infrastructure — not just prompting better. Most teams underestimate this.
Where Remy Sits
Remy approaches this from a different angle. Rather than asking “how do we give agents better access to an existing codebase,” it starts one level higher: the spec.
In Remy, you describe your application in a structured markdown document — annotated prose that carries the application’s intent, data model, edge cases, and rules. Remy compiles this into a full-stack app: backend, database, auth, frontend, tests, deployment. The spec is the source of truth. The code is compiled output.
This changes where the agent operates. Instead of an agent trying to reason about an existing TypeScript codebase (with all the context rot and architectural ambiguity that implies), the agent is working from a precise, structured spec. Changes to the spec produce new compiled code. The spec stays in sync as the project evolves.
For indie hackers and domain experts building real applications, this is a meaningful difference. You’re not vibe coding — throwing prompts and hoping the output holds together. You’re working from a structured source of truth that both you and the agent can reason about precisely.
You can try Remy at mindstudio.ai/remy.
Frequently Asked Questions
What is agentic coding?
Agentic coding refers to AI systems that do more than suggest code — they write code, execute it, observe the results, and iterate. An agentic coding system operates in a feedback loop: plan, act, evaluate, repeat. This corresponds to Level 3 and above in the framework described here.
What’s the difference between an AI coding agent and GitHub Copilot?
Copilot operates at Level 1 — it completes code as you type, one suggestion at a time. An AI coding agent takes a goal, accesses tools like a terminal and file system, and runs a multi-step process to complete it. The agent can fail, observe the failure, and try again without you intervening.
What is a dark factory in AI coding?
A dark factory is a fully autonomous coding pipeline where AI agents write, test, and ship code without human review in the loop. The term comes from automated manufacturing. In software, it means an agent receives a trigger and produces a deployed change end-to-end. This is Level 5 in the spectrum and requires strong automated verification to be safe.
Is vibe coding the same as agentic coding?
No. Vibe coding usually refers to informal, prompt-driven development — describing what you want loosely and hoping the AI produces something usable. Agentic coding involves structured systems with defined feedback loops, tool access, and evaluation mechanisms. Vibe coding tends to stay at Level 2. Agentic coding starts at Level 3.
What level of AI coding is safe for production?
It depends on your verification infrastructure. Level 3 is generally safe for production when you review outputs before shipping. Level 4 is safe when you have well-designed review gates and a reliable harness. Level 5 requires excellent automated test coverage and monitoring. Most teams should start at Level 3 and expand autonomy progressively as trust in the system builds.
Do I need to be a developer to use agentic coding tools?
Less so than before, but it depends on the level. Level 1 and 2 still benefit from knowing how to read and evaluate code. Level 3 tools can produce useful output for domain experts who understand what they want but not necessarily how to build it. This is the basis of domain expert building — people with deep subject matter knowledge shipping real software using AI without traditional coding backgrounds.
Key Takeaways
- There are five distinct levels of AI-assisted coding: autocomplete, chat-assisted, agentic, harness-driven, and dark factory.
- Each level differs in how much autonomy the AI has, how much infrastructure is required, and how humans stay in the loop.
- Most production teams should target Level 3–4 now, with selective Level 5 for narrow, well-verified tasks.
- Moving up levels requires better models, stronger test coverage, clearer task definitions, and real harness investment — not just better prompts.
- The dark factory is real, but only works safely with the right automated verification underneath it.
If you want to build real full-stack applications without getting stuck in context rot or agentic chaos, try Remy — where the spec is the source of truth and the code follows from it.