7 Things You Must Set Up Before Deploying an AI Agent to Production
Model control, guardrails, budget limits, MCP auth, tracing, and evals — the production checklist every team needs before shipping AI agents.
The Gap Between “It Works” and “It’s Ready”
Most AI agents fail in production not because the model is bad, but because the infrastructure around it wasn’t built for real conditions. Deploying an AI agent to production means handing it access to live systems, real user data, and the ability to take actions that can’t always be undone. That gap — between a working demo and a production-ready agent — is where teams get burned.
This checklist covers the seven things you must set up before you ship. Not nice-to-haves. Actual requirements that determine whether your AI agent deployment succeeds or turns into an incident.
1. Model Control: Pin Your Model, Define Your Routing
The first thing you need to lock down is which model your agent uses and under what conditions.
This sounds obvious, but most teams skip it. They build against the latest model, ship it, and then a provider updates the default endpoint. Behavior changes. Outputs drift. You’re debugging a regression you didn’t introduce.
Pin your model version explicitly. Don’t use a floating alias like gpt-4 or claude-3. Use the exact versioned identifier. When you want to upgrade, test it deliberately — don’t let the vendor do it for you.
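One lightweight way to enforce this is to keep every pinned identifier in a single lookup, so an upgrade is a code change you review rather than something a provider does to you. A minimal sketch — the version strings below are illustrative, so substitute whatever dated, immutable identifiers your provider publishes:

```python
# Pinned model identifiers, kept in one place so upgrades are deliberate.
# Version strings are illustrative placeholders, not recommendations.
PINNED_MODELS = {
    "primary": "gpt-4-0613",                  # never the floating "gpt-4" alias
    "classifier": "gpt-4o-mini-2024-07-18",   # cheap model for simple subtasks
}

def model_for(role: str) -> str:
    """Resolve an agent role to its pinned model; fail loudly on unknown roles."""
    try:
        return PINNED_MODELS[role]
    except KeyError:
        raise ValueError(f"No pinned model for role {role!r}") from None
```

Failing loudly on an unknown role matters: a silent fallback to a default model is exactly the kind of drift you're trying to prevent.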
Beyond pinning, you need a routing strategy. Not every task in your agent workflow requires your most expensive, capable model. Simpler subtasks — classification, formatting, extraction — can run on smaller, cheaper models without sacrificing quality. Multi-model routing lets you match model capability to task complexity, which matters both for cost and latency.
Define the routing rules upfront. Which tasks go to which models? What’s the fallback if a model endpoint is unavailable? If you’re running a multi-agent architecture, each agent in the pipeline may need its own model assignment.
2. Behavioral Guardrails: What the Agent Cannot Do
Guardrails are the constraints that define the outer bounds of what your agent is allowed to do. They’re not the same as instructions — they’re enforcement mechanisms.
There are two categories to think about:
Input guardrails block problematic requests before the model ever sees them. These include:
- Prompt injection detection (users trying to hijack agent behavior via crafted inputs)
- Topic filtering (keeping the agent on-scope)
- PII detection before data reaches the model
Output guardrails inspect what the agent produces before it acts or responds:
- Blocking outputs that reference competitors, make legal claims, or violate brand policies
- Preventing the agent from generating responses it’s not authorized to give
- Catching hallucinated tool calls or malformed action parameters
Prompt injection attacks are a real production threat, especially if your agent processes user-submitted content or retrieves documents from external sources. A well-crafted injected prompt can redirect an agent’s behavior entirely — including overriding its instructions and exfiltrating data.
Guardrails need to be tested at the layer level, not just end-to-end. Know what happens when a guardrail fires. Does the agent fail gracefully? Does it notify a human? Does it log the incident? Define that behavior before deployment.
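To make "define that behavior before deployment" concrete, here is a minimal input-guardrail sketch. The regex patterns are illustrative placeholders, not a production injection detector, and `log_incident` and `call_model` are stand-ins for your real logging and model-call code:

```python
import re

def log_incident(text: str, fired: list[str]) -> None:
    print(f"guardrail fired: {fired}")       # stand-in for real incident logging

def call_model(text: str) -> str:
    return f"model response to: {text}"      # stand-in for the real model call

# Illustrative patterns only; real injection detection needs far more than regex.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
]
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # e.g. US SSN format

def check_input(text: str) -> list[str]:
    """Return the names of guardrails that fired; empty means the input is clean."""
    fired = []
    if any(p.search(text) for p in INJECTION_PATTERNS):
        fired.append("prompt_injection")
    if PII_PATTERN.search(text):
        fired.append("pii")
    return fired

def handle(text: str) -> str:
    fired = check_input(text)
    if fired:
        log_incident(text, fired)             # audit trail for every firing
        return "Request blocked by policy."   # graceful failure, not a crash
    return call_model(text)
```

The point of the structure is that each layer is independently testable: you can assert that a given input fires a given guardrail without running the whole agent end-to-end.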
3. Token Budget and Cost Controls
Token costs compound fast in production. An agent that handles 100 requests a day in testing might see 10,000 in production. Context windows grow during multi-turn conversations and multi-step tool use. A single runaway agent can burn through thousands of dollars before anyone notices.
You need hard limits, not just guidelines.
Set token budgets at multiple levels:
- Per-request limits — cap the maximum tokens an agent can use for a single task
- Per-user limits — prevent any one user from consuming disproportionate resources
- Daily/monthly spend caps — hard stops at the billing level, not just monitoring alerts
Token budget management also affects agent behavior. When a model approaches its context limit, it starts to compress or lose earlier context — which can introduce subtle reasoning failures. Build awareness of this into your architecture. Some teams pass the remaining token budget directly to the model as a signal so it can prioritize accordingly.
Monitor cost per conversation, cost per task type, and cost trends over time. Anomalies are often the first signal that something is wrong with agent behavior, not just billing.
4. MCP Authentication and Tool Authorization
If your agent uses external tools — APIs, databases, file systems, internal services — you need a proper authorization model for those connections. This has gotten more structured with the rise of the Model Context Protocol (MCP), but the underlying problem has always existed.
MCP servers expose tool capabilities to agents in a standardized way. But exposing a capability isn’t the same as authorizing access to it. Every MCP connection needs:
- Authentication — the agent must prove identity before accessing the tool
- Authorization — the agent should only have access to the tools it needs for its specific role
- Scope limits — even within an authorized tool, the agent’s permissions should be narrowly scoped
The pattern to follow is least-privilege. Your customer support agent doesn’t need write access to your production database. Your content-generation agent doesn’t need to query financial records. Define what each agent is allowed to touch, and enforce that at the connection level — not just in the system prompt.
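Enforced least-privilege looks like a deny-by-default grant table checked at the connection layer, not a line in the system prompt. A sketch with hypothetical agent roles and tool names:

```python
# Hypothetical per-role tool grants. In a real deployment this check lives
# at the MCP connection layer, not only in application code.
AGENT_TOOL_GRANTS = {
    "support_agent": {"read_tickets", "read_customer_profile"},
    "content_agent": {"search_docs", "publish_draft"},
}

class ToolAccessDenied(Exception):
    pass

def authorize_tool_call(agent_role: str, tool: str) -> None:
    """Deny by default: a tool call proceeds only if explicitly granted."""
    granted = AGENT_TOOL_GRANTS.get(agent_role, set())
    if tool not in granted:
        raise ToolAccessDenied(f"{agent_role} is not authorized to call {tool!r}")
```

Note what's absent: there is no wildcard grant and no "admin" role. Each agent's reachable surface is exactly the set you wrote down.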
Watch out for the MCP server trap: wrapping an API in an MCP server doesn’t automatically make it safe for agents to consume. The data structure matters too. Agents can misuse or misinterpret API responses that weren’t designed with agent-readable formats in mind.
Also think about agent identity. When your agent authenticates to an external system, what identity does it present? Using a human employee’s credentials is a liability. Agents should have their own service identities with auditable access logs.
5. Tracing and Observability
You cannot debug what you cannot see. In production, tracing is the difference between “something went wrong” and “here’s exactly what happened, in what order, with what inputs.”
Every production AI agent needs end-to-end tracing that captures:
- The full input sent to the model (including system prompt and retrieved context)
- The model’s reasoning steps (if using chain-of-thought or tool use)
- Every tool call made, with its parameters and response
- The final output
- Timing for each step
- Model and version used
This isn’t just for debugging. It’s for security audits, compliance reviews, and identifying failure patterns before they become incidents.
Agent failure modes are often non-obvious without traces. An agent might produce a correct-looking output while having made a wrong tool call internally — you won’t see this unless you’re logging the full execution path.
Structure your traces so they’re queryable. You want to be able to ask: “Show me all sessions where the agent called the delete_record tool” or “Find every request where latency exceeded 10 seconds.” Raw log dumps are not enough.
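"Queryable" means each step of the execution path is a structured record you can filter on, not a string buried in a log line. A minimal in-memory sketch — the field names are illustrative, and a real system would use a tracing backend rather than Python lists:

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    kind: str                  # "model_call" | "tool_call" | "output"
    name: str                  # model identifier or tool name
    duration_ms: float
    detail: dict = field(default_factory=dict)   # parameters, responses, etc.

@dataclass
class Trace:
    session_id: str
    steps: list[TraceStep] = field(default_factory=list)

def sessions_calling_tool(traces: list[Trace], tool: str) -> list[str]:
    """Answers: 'show me all sessions where the agent called <tool>'."""
    return [t.session_id for t in traces
            if any(s.kind == "tool_call" and s.name == tool for s in t.steps)]
```

Because every step carries `kind`, `name`, and timing, the latency question ("find every request where a step exceeded 10 seconds") is the same one-line filter over `duration_ms`.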
For multi-agent systems, distributed tracing is essential. When Agent A hands a task to Agent B, the trace needs to follow the request across both. Losing the thread between agents makes post-incident analysis nearly impossible.
6. Evaluations: Know What “Good” Looks Like
Shipping without evals is shipping blind. Evals are how you verify that your agent behaves correctly — not just once during development, but continuously as your system evolves.
Before production, define your eval suite. This should cover:
Functional correctness — Does the agent do what it’s supposed to do? For structured tasks, you can use binary assertions: the output either contains the right answer or it doesn’t.
Safety and policy compliance — Does the agent ever produce outputs it shouldn’t? Run adversarial inputs through your eval suite. Try to break your own agent.
Edge cases and failure conditions — What happens when inputs are ambiguous, malformed, or designed to confuse? Document expected behavior and test it.
Regression testing — Every time you change the system prompt, swap a model, or update a tool, run your full eval suite. Model updates from providers can shift behavior in subtle ways.
Writing effective evals doesn’t have to be an engineering-heavy effort. Many evals can be written as structured test cases with expected outputs. The harder question is whether to use binary assertions or more subjective scoring — both have their place depending on the task type. Binary assertions work well for factual and deterministic tasks; LLM-graded evals are better for open-ended outputs where correctness isn’t a single value.
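A binary-assertion eval suite can be as small as a list of cases and a loop. This sketch is illustrative — the cases are toy examples, and `agent` stands in for a call to your real agent:

```python
# Toy eval cases: each asserts the output must contain a known-correct token.
EVAL_CASES = [
    {"input": "What is 2 + 2?", "must_contain": "4"},
    {"input": "Capital of France?", "must_contain": "Paris"},
]

def run_evals(agent, cases) -> tuple[int, int]:
    """Run binary pass/fail assertions; return (passed, failed) counts."""
    passed = failed = 0
    for case in cases:
        output = agent(case["input"])
        if case["must_contain"] in output:
            passed += 1
        else:
            failed += 1
    return passed, failed
```

Wiring this into CI so a nonzero `failed` count blocks the deploy is what turns evals from a dashboard into a gate.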
Treat evals as a living artifact. Add new cases whenever you catch a failure in production. Your eval suite should grow over time to reflect the real-world edge cases your agent encounters.
7. Access Controls and Permission Scoping
The last thing to set up before deployment is a clear permission model for what your agent can do — and explicit controls to enforce it.
This goes beyond MCP auth. It covers the full surface of your agent’s capabilities:
- Which actions are irreversible? Sending emails, deleting records, making payments, submitting forms — any action that can’t be undone should require either human confirmation or should be off-limits for autonomous execution.
- Which data can the agent access? Scope data access by user, role, and context. An agent handling a request on behalf of User A should not have access to User B’s data.
- Which environments is the agent allowed to act in? Production vs. staging vs. sandbox. Agents that can reach production systems should be explicitly authorized, not left as the default.
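The irreversible-action rule above can be enforced with a simple gate: any action on the irreversible list either carries an explicit human confirmation or raises instead of executing. The action names and confirmation mechanism here are illustrative:

```python
# Hypothetical set of actions that cannot be undone once executed.
IRREVERSIBLE = {"send_email", "delete_record", "make_payment", "submit_form"}

class ConfirmationRequired(Exception):
    pass

def execute_action(action: str, confirmed_by_human: bool = False) -> str:
    """Gate irreversible actions behind explicit human sign-off."""
    if action in IRREVERSIBLE and not confirmed_by_human:
        # Surface the action for review instead of acting autonomously.
        raise ConfirmationRequired(f"{action} requires human sign-off")
    return f"executed {action}"
```

The `ConfirmationRequired` exception is the important design choice: the agent's default path physically cannot perform the action, rather than being asked nicely not to.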
The principle of progressive autonomy is worth following: start with tightly constrained permissions and expand them deliberately based on observed behavior and trust earned in production. Don’t grant full permissions at launch and dial them back after something goes wrong.
The most costly AI agent disasters on record — including production database wipes — happened because agents had permissions they shouldn’t have had, and no one had explicitly defined what they were allowed to touch. Don’t let undefined permissions be your problem.
For enterprise deployments, access controls also need to map to your existing identity and compliance infrastructure. SSO, RBAC, audit logs — these aren’t optional when agents are operating on behalf of employees or handling regulated data. Compliance requirements like GDPR and SOC 2 apply to AI agents in the same way they apply to other software systems.
How Remy Handles Production Readiness
Most of the infrastructure on this checklist — model routing, observability, access controls, budget management — has to be built by hand when you’re working with raw APIs or rolling your own agent framework. That’s a significant amount of work before you’ve even started thinking about what your agent actually does.
Remy is built on the same infrastructure MindStudio has been running in production for years, which means the foundational pieces are already in place. Model routing across 200+ models, managed deployment, auth, and logging — these come with the platform, not as afterthoughts. When you describe what your agent should do in a spec and compile it into a full-stack application, you’re not starting from scratch on the infrastructure layer.
That matters especially as agent sprawl becomes a real problem for teams managing multiple agents across different systems. Consistent infrastructure across agents — same tracing, same auth model, same cost controls — is much easier to maintain when it’s not bespoke for every deployment.
If you’re building an agent-powered application and want production infrastructure without the overhead, try Remy at mindstudio.ai/remy.
Frequently Asked Questions
What is the most common reason AI agents fail in production?
Missing observability is the most common root cause — teams don’t know what’s happening inside the agent during execution. Without traces, you can’t diagnose failures, and without evals, you can’t detect regressions. Many failures also trace back to permissions being too broad, which allows agents to take actions they shouldn’t have been able to take in the first place.
Do I need all 7 of these before shipping?
Yes, for any agent with real-world consequences. If your agent can send messages, modify data, make API calls, or interact with external systems, all seven apply. For a read-only demo or internal prototype, you can defer some of them — but as soon as you’re handling real users or live data, the full checklist is the minimum.
How do I set up guardrails without slowing down my agent?
Run lightweight input guardrails (keyword and pattern matching) synchronously before the model call. Reserve heavier evaluation (LLM-graded output checks) for asynchronous post-processing or sample-based monitoring rather than every request. The goal is to catch the highest-risk failures in real time and do deeper inspection on a subset of traffic.
What’s the difference between tracing and logging?
Logging captures individual events — a request came in, a response went out. Tracing captures the full execution path with timing and context across every step, including intermediate tool calls and model interactions. For AI agents, tracing is what you actually need. Logs alone don’t give you enough context to understand why a multi-step agent behaved a certain way.
How often should I run my eval suite?
Before every deployment. That means every system prompt change, model version update, tool modification, or any other change to the agent’s configuration. This is especially important because reliability compounds across steps: a small regression in one component can produce a large failure in a multi-step pipeline.
How should I handle MCP authentication for multiple agents?
Each agent should have its own service identity with a minimal permission scope. Avoid sharing credentials across agents. Use a centralized secrets manager rather than hardcoding credentials in system prompts or environment variables. Audit access logs regularly and rotate credentials on a schedule — not just when a breach occurs.
Key Takeaways
- Pin your model version and define routing rules before deployment — don’t let providers change your behavior for you.
- Guardrails are enforcement mechanisms, not instructions. Build them at the input and output layers, not just in the system prompt.
- Token budgets need hard limits at multiple levels: per-request, per-user, and per billing period.
- MCP authentication is not optional — every tool connection needs proper auth, authorization, and scope constraints.
- Traces are the only way to diagnose production failures in multi-step agent workflows. Log the full execution path, not just inputs and outputs.
- Evals should run on every deployment — treat them the same way you’d treat automated tests in a software release process.
- Start with narrow permissions and expand deliberately based on observed behavior, not the other way around.
Production AI agents are real software with real consequences. Treat the infrastructure around them accordingly — and if you want a head start on building the full stack, get started with Remy.