7 Things You Must Do Before Deploying an AI Agent to Production
Before shipping a multi-user AI agent, lock down model control, guardrails, budget limits, tool auth, monitoring, evals, and human oversight. Here's your production checklist.
The Gap Between “It Works” and “It’s Ready”
Getting an AI agent to work in a demo is one thing. Deploying it to production — where real users, real data, and real consequences are involved — is something else entirely. Most agent failures don’t happen because the model was bad. They happen because the surrounding infrastructure wasn’t ready.
This checklist covers the seven things you must have in place before your AI agent goes live. Skip any of them and you’re accepting risk you probably haven’t thought through. Work through all of them and you’ll ship with confidence instead of hope.
1. Lock Down Which Model You’re Using
The first decision most people treat as obvious — “we’ll use the best model” — is actually where a lot of production problems start.
Models change. Providers deprecate versions, adjust behavior through silent fine-tuning updates, and swap in different safety configurations. What works today might return different results next week even if your code is identical.
Before deploying, you need to:
- Pin to a specific model version, not just a model family. Use `gpt-4-0125-preview`, not `gpt-4`. Use `claude-3-5-sonnet-20241022`, not `claude-3-5-sonnet`.
- Document the behavioral baseline you tested against. If the model updates, you’ll want a record of what you were working from.
- Set a fallback model for cases where your primary model is unavailable or rate-limited. This is especially important in multi-agent workflows where a single failed call can cascade.
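A minimal sketch of pinning plus fallback, assuming a generic `call_model(model_id, prompt)` stand-in for your provider SDK; the model IDs and exception type are illustrative:

```python
PRIMARY_MODEL = "claude-3-5-sonnet-20241022"  # pinned version, not a family alias
FALLBACK_MODEL = "gpt-4-0125-preview"         # used when the primary is unavailable

class ModelUnavailable(Exception):
    """Raised by the (hypothetical) provider client when a model can't be reached."""

def complete(prompt, call_model):
    """Try the pinned primary model, then fall back exactly once.

    `call_model(model_id, prompt)` is a stand-in for your real SDK call.
    """
    last_err = None
    for model in (PRIMARY_MODEL, FALLBACK_MODEL):
        try:
            return call_model(model, prompt)
        except ModelUnavailable as err:
            last_err = err  # remember why the primary failed, try the next model
    raise RuntimeError("All configured models failed") from last_err
```

The key design choice is that the fallback order lives in your code, not in a provider default, so you always know exactly which model handled a given request.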
This is closely related to the broader question of how the workflow controls the agent rather than the other way around. Model control is part of that — you decide what model runs, when, and under what conditions.
2. Set Explicit Guardrails on Inputs and Outputs
Guardrails are the rules that constrain what your agent will accept and what it will return. Without them, you’re trusting that every user will interact with your agent in good faith, and that the model will always behave within the boundaries you’d expect.
Neither assumption holds in production.
Input guardrails
These filter or reject incoming requests before the model sees them:
- Length limits — Cap the maximum input token count. Long inputs are one vector for token flooding attacks, which can drain your budget or destabilize the model’s reasoning.
- Topic filtering — If your agent is scoped to a specific domain (e.g., customer support for a SaaS product), inputs that are wildly off-topic should be caught early.
- Injection detection — Prompt injection is a real attack surface. Strings like “Ignore previous instructions and…” should trigger a flagging or rejection system. See the full breakdown of how prompt injection and token flooding attacks work if you haven’t already built this into your thinking.
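The checks above can be sketched as a single pre-model gate. This is a toy version: the length cap is arbitrary and the regex patterns catch only the most obvious injection strings; production systems need token-based limits and a real classifier.

```python
import re

MAX_INPUT_CHARS = 8000  # illustrative cap; tune to your actual token budget

# Naive patterns only -- real injection detection needs more than regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]

def check_input(text):
    """Return (allowed, reason). Runs before the model ever sees the request."""
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return False, "possible prompt injection"
    return True, "ok"
```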
Output guardrails
These catch problems before the agent’s response reaches the user:
- Format validation — If you expect JSON, validate the JSON. Don’t pass malformed output downstream.
- Content filtering — Block responses that contain PII, confidential system details, or harmful content.
- Confidence thresholds — For high-stakes actions, consider requiring the model to express explicit confidence above a threshold before proceeding.
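A sketch of the first two output checks, assuming the agent is expected to return JSON; the PII pattern here is a toy (US SSN format only) standing in for a real content filter:

```python
import json
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # toy PII check, illustrative only

def validate_output(raw):
    """Validate a model response before it reaches the user or downstream code."""
    try:
        payload = json.loads(raw)  # format validation: reject malformed JSON
    except json.JSONDecodeError as err:
        raise ValueError(f"malformed JSON from model: {err}") from err
    if SSN_PATTERN.search(raw):   # content filtering: block obvious PII
        raise ValueError("response contains possible PII")
    return payload
```

Raising instead of passing bad output downstream means the failure surfaces in your error monitoring rather than in a user-visible bug.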
The OWASP Top 10 for Large Language Model Applications is a useful reference here — it documents the most common LLM-specific attack vectors and the controls that address each one.
3. Set Hard Budget Limits Before You Ship
Runaway API costs are one of the most common and most avoidable production incidents. An agent without spend limits is a blank check waiting to be cashed — by a bug, by a bad user, or by a feedback loop in your own workflow.
Token budget management isn’t optional. It’s infrastructure. Here’s what to put in place:
Per-request token limits
Set a maximum token count (input + output) per call. This prevents any single request from becoming unexpectedly expensive.
Per-user and per-session limits
Especially critical for multi-user agents. Without per-user caps, a single aggressive user — or a compromised account — can consume resources meant for thousands.
Daily and monthly spend caps
Set hard limits at the API provider level and in your own monitoring. Don’t rely on just one layer. When you approach a limit, trigger an alert — don’t wait for the ceiling to cut off service.
Cost anomaly alerts
Configure alerts for unusual spending patterns. A 10x spike in API costs at 3am is a signal you want to see immediately, not in the morning report.
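The per-request, per-user, and global caps above can live in one small application-level tracker. This is an in-memory sketch with made-up limits; a real deployment would persist counters (e.g., in Redis) and reset them daily.

```python
from collections import defaultdict

class TokenBudget:
    """Track per-user and global token spend; refuse requests that exceed caps."""

    def __init__(self, per_request_max=4_000,
                 per_user_daily_max=100_000,
                 global_daily_max=5_000_000):
        self.per_request_max = per_request_max
        self.per_user_daily_max = per_user_daily_max
        self.global_daily_max = global_daily_max
        self.user_spend = defaultdict(int)  # reset on a daily schedule in practice
        self.global_spend = 0

    def allow(self, user_id, tokens):
        """Check all three layers before the call is made."""
        if tokens > self.per_request_max:
            return False
        if self.user_spend[user_id] + tokens > self.per_user_daily_max:
            return False
        if self.global_spend + tokens > self.global_daily_max:
            return False
        return True

    def record(self, user_id, tokens):
        """Record actual usage after the call completes."""
        self.user_spend[user_id] += tokens
        self.global_spend += tokens
```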
4. Scope and Authenticate Every Tool the Agent Can Use
An agent with tool access is fundamentally more dangerous than an agent that just produces text. It can send emails, write to databases, call APIs, modify files, and trigger downstream workflows.
Every tool your agent can call needs to be treated as a potential blast radius.
Use scoped credentials, not admin keys
If your agent reads from a database, give it read-only credentials. If it needs to send emails, scope the OAuth token to send-only. The principle here is least privilege: the agent should only be able to do what it needs to do for its stated purpose.
This connects directly to how you think about progressive autonomy — you don’t hand an agent full admin access on day one. You start with minimal permissions and expand them as you build confidence.
Require explicit authorization for destructive actions
Delete operations, bulk updates, anything irreversible — these should require explicit confirmation, whether from a human or a separate authorization check. The story of a 1.9 million row database wipe is a useful reminder of what happens when destructive actions aren’t gated.
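A minimal gating sketch: the tool names and callbacks are hypothetical, and `approve` could be backed by a human review queue or a separate authorization service.

```python
# Illustrative set of tools considered irreversible for this agent.
DESTRUCTIVE_TOOLS = {"delete_rows", "bulk_update", "drop_table"}

def execute_tool(name, args, run_tool, approve):
    """Run a tool call, but gate destructive tools behind explicit approval.

    `run_tool(name, args)` executes the tool; `approve(name, args)` is a
    stand-in for a human confirmation or separate authorization check.
    """
    if name in DESTRUCTIVE_TOOLS and not approve(name, args):
        return {"status": "blocked", "reason": "approval required"}
    return run_tool(name, args)
```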
Log every tool call
Every action the agent takes through a tool should be logged with enough detail to reconstruct what happened. This matters for debugging, auditing, and incident response.
Rotate credentials regularly
Don’t use static, long-lived API keys. Rotate them on a schedule and immediately if you suspect exposure.
5. Build Observability Before You Need It
You cannot debug what you cannot see. And in production, you will need to debug things — because agents fail in ways that are often subtle, intermittent, and not obvious from the user’s perspective.
Understanding how agents fail is much easier when you’ve built observability in from the start rather than retrofitting it after an incident.
What to log
At minimum, log:
- Every prompt sent to the model (inputs)
- Every response received (outputs)
- All tool calls and their results
- Latency for each step
- Token counts (input, output, total)
- User identifiers (pseudonymized if required by compliance)
- Error codes and failure reasons
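One structured log line per agent step covers most of the list above. A sketch that prints JSON to stdout; the field names are illustrative, and in production you'd ship these records to your log pipeline instead.

```python
import json
import time
import uuid

def log_event(step, **fields):
    """Emit one structured log record per agent step.

    `step` names the stage (e.g. "model_call", "tool_call"); extra fields
    carry tokens, latency, user IDs, and error details as keyword args.
    """
    record = {
        "ts": time.time(),            # wall-clock timestamp
        "event_id": str(uuid.uuid4()),  # unique ID for correlating steps
        "step": step,
        **fields,
    }
    print(json.dumps(record))  # swap for your log shipper in production
    return record
```

Logging one JSON object per step (rather than free-form text) is what makes the real-time monitoring in the next list possible: error rates, tail latency, and spend all aggregate cleanly from structured records.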
What to monitor in real time
- Error rate (failed requests / total requests)
- P95 and P99 latency — AI agent latency issues often show up in the tail, not the average
- Token spend per hour/day
- Tool call success/failure rates
- Fallback triggers (how often you’re falling back to a secondary model)
Set up alerts for the things that matter
Not every anomaly needs a page. But cost spikes, high error rates, and repeated tool failures should wake someone up. Define what “normal” looks like before you deploy so you have a baseline to alert against.
6. Define and Run Evals Before You Ship
Evals are tests that verify your agent behaves correctly across a range of inputs — not just the happy path you demoed in staging. Shipping without evals means you’re making a bet: that your manual testing covered enough cases.
It almost never does.
What good evals look like
There are two main types, and you need both:
Binary assertions — Clear pass/fail checks. “Does the agent return valid JSON?” “Does it correctly extract the customer’s name from this input?” “Does it refuse this off-topic request?” These are easy to automate and should run on every deploy.
Subjective evals — Quality judgments. “Is this response helpful?” “Does it stay on-brand?” These require either human review or a judge model that scores outputs against a rubric. Understanding the difference between binary assertions and subjective evals will help you build a test suite that actually catches regressions.
What to include in your eval suite
- Golden examples — 20–50 input/output pairs that represent ideal behavior. Run new versions against these to catch regressions.
- Edge cases — Inputs that have broken the agent before, or that you expect to be tricky.
- Adversarial inputs — Prompt injection attempts, off-topic requests, inputs designed to extract system prompt content.
- Task completion rate — For multi-step agents, what percentage of tasks complete successfully end-to-end?
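The binary-assertion half of a suite like this can be a plain list of (input, check) pairs run against the agent on every deploy. The cases below are placeholders, not a real suite; `agent` is any callable that takes a prompt and returns a string.

```python
import json

GOLDEN_CASES = [
    # (input prompt, pass/fail check on the output) -- illustrative cases only
    ("Extract the name: 'Hi, I'm Dana'", lambda out: "Dana" in out),
    ("Return JSON with key 'status'",    lambda out: "status" in json.loads(out)),
]

def run_evals(agent):
    """Run binary assertions against an agent callable; return failing indices."""
    failures = []
    for i, (prompt, check) in enumerate(GOLDEN_CASES):
        try:
            if not check(agent(prompt)):
                failures.append(i)
        except Exception:
            # A crash (e.g. malformed JSON) counts as a failure, not an error to hide.
            failures.append(i)
    return failures
```

An empty failure list gates the deploy; a non-empty one tells you exactly which behaviors regressed.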
The practical guide to writing evals for AI agents goes deeper on how to structure these if you’re starting from scratch.
Run evals on a schedule, not just at deploy time
Models change. Your integration data changes. User behavior shifts. Evals that pass today might fail in 60 days for reasons unrelated to your code.
7. Establish Human-in-the-Loop Checkpoints for High-Stakes Actions
Not every agent action is reversible. Not every agent decision is something you want to delegate fully to the model. Before you ship, you need to be explicit about which actions require a human to approve, review, or at minimum acknowledge.
Human-in-the-loop design isn’t a sign that your agent isn’t ready — it’s a sign that you’ve thought carefully about where the risk is.
Where to add human checkpoints
Consider requiring human review for:
- Any action that deletes or modifies data at scale
- Outbound communications to customers (especially escalations or refunds)
- Actions that involve financial transactions
- Decisions with legal or compliance implications
- Any time the agent’s confidence is below a defined threshold
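The routing decision above can be sketched as one function. The action names, confidence floor, and callbacks are all illustrative; `escalate` would typically push to a human review queue.

```python
CONFIDENCE_FLOOR = 0.8  # illustrative threshold; tune per use case

# Actions that always require a human, regardless of model confidence.
IRREVERSIBLE = {"refund", "bulk_delete", "send_legal_notice"}

def route(action, confidence, run, escalate):
    """Run routine actions directly; escalate irreversible or low-confidence ones.

    `run(action)` executes the action; `escalate(action)` hands it to a human.
    """
    if action in IRREVERSIBLE or confidence < CONFIDENCE_FLOOR:
        return escalate(action)
    return run(action)
```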
Build a clear escalation path
When the agent can’t handle something — ambiguous input, unexpected tool failure, out-of-scope request — it needs somewhere to go. That might be a human support queue, a fallback to a simpler deterministic flow, or a graceful “I can’t help with that” response.
What it can’t be is an agent that silently fails, makes a bad guess, or gets stuck in a loop. The reliability compounding problem in multi-step agents means that small failures early in a workflow can cascade into big failures downstream.
Document who is accountable
When something goes wrong — and eventually something will — someone needs to be accountable. AI liability in the agentic economy is still being worked out legally, but inside your organization you need a clear answer to: who owns this agent’s behavior? Who gets the alert when it fails? Who makes the call to roll it back?
A Note on Compliance
If your agent touches user data — which most production agents do — you also need to have addressed compliance before you ship.
GDPR, SOC 2, HIPAA: the applicable frameworks depend on your industry and geography, but the common thread is that you need to know where data flows, how it’s stored, how long it’s retained, and who can access it. AI agent compliance across GDPR, SOC 2, and other frameworks isn’t a nice-to-have for enterprise deployments. It’s a prerequisite.
Enterprise teams in particular should be looking at platforms that have SSO, audit logging, and role-based access controls built in rather than bolted on. Enterprise AI agents with SSO and compliance features covers what to look for in the underlying infrastructure.
How Remy Handles Production Readiness
Most of the items on this checklist require infrastructure that’s time-consuming to build from scratch: logging pipelines, spend controls, auth systems, monitoring dashboards. For teams building agents on top of raw model APIs, getting these pieces in place is often as much work as building the agent itself.
Remy is built on the infrastructure MindStudio has been running in production for years — infrastructure that includes 200+ AI models, managed auth, integrated deployment, and the controls enterprise teams need. When you build with Remy, you’re not starting from zero on observability or security. The spec you write compiles into a full-stack application with real backends, real databases, and the production foundations already in place.
That doesn’t mean you skip the checklist above — you still need to think through guardrails, evals, budget limits, and human checkpoints for your specific use case. But it means you’re starting from a stronger baseline than a blank repo and a model API key.
You can explore what’s possible at mindstudio.ai/remy.
Frequently Asked Questions
What’s the most common mistake teams make before deploying an AI agent to production?
The most common mistake is treating the demo as the test. A demo is a best-case scenario — you control the inputs, you know the expected outputs, and you’re watching for problems. Production is the opposite: you don’t control inputs, users will do things you didn’t anticipate, and failures often happen when no one is watching. Teams that skip structured evals and observability are essentially deploying blind.
How do I set token budget limits for an AI agent?
Start at the model API level — most providers let you set per-request token maximums. Then add a layer of application-level tracking: count tokens per user, per session, and per day, and stop requests that would exceed your defined limits. Alert before you hit the ceiling, not after. Tools like Anthropic’s token counting API and OpenAI’s usage endpoints make this trackable.
What’s the difference between input guardrails and output guardrails?
Input guardrails run before the model sees the request. They filter, validate, or reject what’s coming in. Output guardrails run after the model responds, before the response reaches the user. They catch formatting errors, content policy violations, or responses that contain information the model shouldn’t have surfaced. You need both — input guardrails reduce attack surface and cost; output guardrails are your last line of defense.
How many eval cases do I need before deploying an AI agent?
There’s no universal number, but a practical floor is 50 cases: 20–30 golden examples representing ideal behavior, 10–15 edge cases, and 10 adversarial inputs. The goal is enough coverage to catch regressions when you update the model or change the prompt. Quality matters more than quantity — one well-constructed adversarial test is worth ten variations of the same happy path.
Do I need human-in-the-loop for every AI agent action?
No. Routine, low-stakes, and easily reversible actions don’t need human approval — adding a human checkpoint to every action defeats the point of automation. The threshold for human review should be proportional to the potential impact: irreversible actions, financial decisions, customer-facing communications, and legally sensitive outputs all warrant it. Read-only queries and informational responses generally don’t.
What compliance frameworks apply to AI agents?
It depends on your use case and geography. GDPR applies if you process data from EU residents. SOC 2 is relevant if you’re a B2B SaaS handling customer data. HIPAA applies in US healthcare contexts. Beyond these, sector-specific regulations (financial services, insurance, government) add their own requirements. The starting point is always the same: map where data flows, document how it’s retained and deleted, and ensure your AI provider has appropriate data processing agreements in place.
Key Takeaways
- Pin your model version before deploying. Silent model updates are a real risk.
- Guardrails on input and output aren’t optional in multi-user environments — they’re the barrier between “works in testing” and “safe in production.”
- Set hard budget limits at multiple layers. Per-request, per-user, per-day.
- Scope tool access to least privilege. Every tool is a potential blast radius.
- Build observability first — log inputs, outputs, tool calls, latency, and spend before you need to debug anything.
- Run evals before you ship, and keep running them after. Behavioral regressions happen for reasons you didn’t anticipate.
- Define human checkpoints for irreversible or high-stakes actions. Know who is accountable when something goes wrong.
If you’re building agents that need to hold up under real user load and real consequences, try Remy — it’s built on production infrastructure so you’re not piecing together the foundation yourself.