Two Questions That Will Tell You If Your AI Vendor Is Ready for Agents
An autonomous agent spent $20 — no credentials, no insider help — and walked away with read/write access to tens of millions of McKinsey consultant chat messages, every system prompt governing the platform, and tens of thousands of user accounts. The platform is called Lily. The startup that found the hole is called Codewall. The disclosure date was March 9, 2026.
You can audit your own vendor situation in under an hour using two core procurement questions: (1) Does your AI platform distinguish between human users and AI agents? (2) What happens on your platform when the team is under pressure? Everything else — the six-question technical checklist, the repair playbook for contracts already signed — flows from these two.
This post is about how to actually ask those questions, what a good answer looks like, and what a bad answer sounds like when it’s dressed up in enterprise language.
What you’re trying to find out (and why it matters now)
The Lily incident got framed as a security failure. That framing is wrong, and the wrong framing leads to the wrong fix.
McKinsey has excellent engineers. Lily had been in production for more than two years. The exploit wasn’t exotic — it was SQL injection, first documented in 1998, taught in every intro web security course. The attack vector isn’t the story.
The story is that 22 of 200 API endpoints shipped with no authentication. Including production-writable endpoints. That’s not one engineer forgetting to lock a door on a Friday. That’s a pattern. Patterns come from culture and process, not from individual mistakes.
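To make the pattern concrete, here is what that failure shape looks like in code. This is a minimal sketch assuming an Express-style API; the route, table, and middleware names are hypothetical, not taken from the Lily postmortem.

```typescript
import express from "express";
import { Pool } from "pg";

const app = express();
app.use(express.json());
const db = new Pool();

// Hypothetical auth middleware. In a real system this would verify a
// session or token; here it only illustrates where the check belongs.
function requireAuth(req: express.Request, res: express.Response, next: express.NextFunction) {
  if (!req.headers.authorization) {
    res.status(401).end();
    return;
  }
  next();
}

// The failure shape: a production-writable endpoint with no auth check
// and a string-built query (the 1998-vintage SQL injection vector).
app.post("/api/projects/:id/notes", async (req, res) => {
  await db.query(
    `INSERT INTO notes (project_id, body) VALUES ('${req.params.id}', '${req.body.text}')`
  );
  res.sendStatus(201);
});

// The same endpoint with both missing controls: authentication runs
// before the handler, and the query is parameterized.
app.post("/api/v2/projects/:id/notes", requireAuth, async (req, res) => {
  await db.query("INSERT INTO notes (project_id, body) VALUES ($1, $2)", [
    req.params.id,
    req.body.text,
  ]);
  res.sendStatus(201);
});
```

One handler shaped like the first is a mistake. Twenty-two of them is the process signal this post is pointing at.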
If you read the postmortem lesson as “authenticate your endpoints,” you’ll tighten one bolt and leave the underlying process intact. The same process will produce the same shape of failure somewhere else in your stack — possibly in the AI platform you signed a contract for last quarter.
The two questions below are designed to surface that process before you sign, not after.
What you need before you start
You don’t need a security team to run this audit. You need:
- Access to a vendor call or a detailed product demo — ideally with someone technical on the vendor side, not just sales
- Your current or prospective vendor’s documentation on permissions, audit logs, and agent configuration
- A basic understanding of what your agents will actually do — which systems they’ll touch, what data they’ll read, what actions they might take
If you’re building internally rather than buying, these questions apply to your own architecture. The Lily incident was an internal build. The procurement failure pattern is the same whether you’re buying or building.
One more thing: if you’re already running multi-agent workflows and want to understand the security surface they create, the AutoResearch loop pattern is a useful frame for thinking about how agents compound actions across systems — which is exactly where permission boundaries get complicated.
Step 1: Ask whether the platform distinguishes humans from agents
This is the first question, and it’s more specific than it sounds.
Here’s the scenario. A senior McKinsey consultant using Lily might have legitimate read access to 40 client accounts, built up over five years of work. An agent running on behalf of that consultant, on a specific client engagement, should probably only touch that one client’s data. That’s a reasonable boundary.
If the platform doesn’t enforce that boundary — if it treats the agent as having the same access scope as the human — then one compromised agent run becomes a company-wide exposure event. That’s not an IT problem. That’s a board-level liability conversation.
Ask the vendor directly: “Does your platform have separate authentication and permission scoping for AI agents versus human users?”
Listen for three things in the response:
1. Audit trails that are agent-specific. When something goes wrong, regulators don’t ask what the user did. They ask what the system did on behalf of the user. If the audit log can’t distinguish “human clicked this” from “agent called this endpoint,” the compliance gap is the size of every agent action in your organization. That’s a large gap.
2. Real-time revocation. Can someone revoke an agent’s access from a console in the next five minutes — not delete it, not file a ticket, not wait for a code deploy — while they figure out what happened? If the answer is no, your incident response plan has a hole in it. You’ll discover that hole either in a tabletop exercise or at 3 a.m. during an actual incident.
3. Scope binding. Can you configure an agent to only touch the systems and data relevant to its specific task? Or does the agent inherit the full permission set of the user who launched it?
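Here is roughly what those three properties reduce to in code. A minimal sketch with hypothetical names; no vendor's actual API looks exactly like this.

```typescript
// The agent is a first-class principal, not a reuse of the human's session.
type Principal =
  | { kind: "human"; userId: string }
  | { kind: "agent"; runId: string; onBehalfOf: string; scopes: string[] };

// Flipped from an admin console; takes effect on the next call, not the next deploy.
const revokedRuns = new Set<string>();

declare function userCanAccess(userId: string, resource: string): boolean;
declare const auditLog: { write(entry: object): void };

function authorize(p: Principal, resource: string): boolean {
  if (p.kind === "human") return userCanAccess(p.userId, resource);
  if (revokedRuns.has(p.runId)) return false; // real-time revocation
  return p.scopes.includes(resource);         // scope binding: the task's grants,
                                              // not the launching user's full set
}

function audit(p: Principal, action: string, resource: string): void {
  // Agent-specific trail: "human clicked this" and "agent called this endpoint"
  // are distinguishable because the actor type and run id are recorded.
  auditLog.write({ at: Date.now(), actor: p, action, resource });
}
```

If the vendor can show you where the equivalent of revokedRuns lives in their console, and what their audit entries record for the actor, you have your answer to all three.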
The Codewall agent walked up to Lily’s API and the API didn’t ask who was calling. There was nothing to authenticate because the system wasn’t built with the concept of an agent in it at all. The blast radius was unbounded because there was no “this agent” to bound.
If you’re evaluating vendors that offer spec-driven app compilation, it’s worth asking how agent identity is handled at the infrastructure level. Remy — MindStudio’s spec-driven full-stack app compiler — is one example of a tool that makes the permission model explicit during the build process: you write a markdown spec with annotations, and it compiles into a complete TypeScript app with backend, database, auth, and deployment already wired together. That kind of structure forces the agent-versus-human access question to be answered before anything ships, rather than discovered in production.
Now you have: A clear picture of whether the vendor has actually thought about agent identity as a distinct concept, or whether they’ve bolted agents onto a human-user permission model and hoped for the best.
Step 2: Ask what happens under pressure
This is the organizational question, and it’s harder to answer — which is why it’s more important.
The 22 unauthenticated endpoints at McKinsey didn’t happen because no one knew how to authenticate an endpoint. Authentication is not a hard engineering problem. Every engineer on that team knew how to do it. The question is why the default behavior — the thing that happens when a team is moving fast and no one has time to have the full architectural conversation — produced unauthenticated production-writable endpoints.
That’s a question about organizational design, not technology.
Ask the vendor: “What is the out-of-the-box security posture for agent permissions? What does the platform look like if nobody touches the security settings after initial setup?”
This question is uncomfortable for vendors because the honest answer often reveals that the default is permissive. Permissive defaults are easier to sell — they reduce friction during onboarding — but they’re the thing that bites you when your team is under deadline pressure and doesn’t have time to configure every control.
What a good answer sounds like: The vendor describes a default-deny posture for agent permissions. Agents start with no access and you explicitly grant what they need. The vendor can tell you what happens if a developer skips the permission configuration step entirely — and the answer is “the agent can’t do anything” rather than “the agent inherits the user’s full scope.”
What a bad answer sounds like: “We have a comprehensive authentication framework.” “All our authentication options are documented in the developer guide.” “Enterprise customers have full flexibility to configure policies.”
These sentences are probably true. They don’t answer the question. They describe what’s possible when someone has time to configure things carefully. They say nothing about what happens when someone doesn’t.
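For contrast with the documentation answer, here is what a default-deny posture reduces to in code. A minimal sketch with illustrative names, not any vendor's actual policy engine.

```typescript
type Action = "read" | "write";
type Grant = { resource: string; actions: Action[] };

// The posture lives in the fallthrough: no configuration means no access,
// not inheritance of the launching user's scope.
function can(grants: Grant[] | undefined, resource: string, action: Action): boolean {
  if (!grants) return false; // the developer skipped the permission step entirely
  const grant = grants.find((g) => g.resource === resource);
  return grant ? grant.actions.includes(action) : false; // unlisted => denied
}

can(undefined, "crm:accounts", "read");                                          // false
can([{ resource: "crm:accounts", actions: ["read"] }], "crm:accounts", "read");  // true
can([{ resource: "crm:accounts", actions: ["read"] }], "crm:accounts", "write"); // false
```

The honest version of the out-of-the-box question is simply which branch the platform takes when grants is undefined.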
The deeper version of this question is about who gets to be in the room when technical decisions get made. At McKinsey, Lily shipped two years ago — before autonomous agents that could probe public endpoints and reach production data were a normal threat. The technical landscape changed. The question is whether the organizational process changed with it, or whether business timelines continued to drive architecture decisions without enough technical input.
You can ask a version of this to your vendor: “When your customers are pushing to deploy quickly, what guardrails does the platform enforce automatically versus what requires deliberate configuration?” The answer tells you a lot about whose interests the platform was designed to protect.
Step 3: Map the cross-workflow complexity before you sign
Here’s the thing that makes agentic procurement different from SaaS procurement.
For the last 15 years, enterprise software has been bought in the same sequence: strategic decision at the top, procurement negotiates the contract, security and compliance review, IT plans the integration, developers build against whatever got purchased. That sequence works for SaaS because SaaS is bounded. The vendor gives you an admin console, a set of integration points, a published API, and a permissions model that maps to roles. You’re configuring software.
For agents, that sequence leads to the failure mode we saw with Lily.
Walk through what an agent actually does on a single real run. The user says “prepare the renewal brief for our largest customer.” The agent pulls from the CRM, from support tickets, from contract management, from product usage data, from call transcripts, from an internal wiki. It crosses permission boundaries that for a human are mediated by what’s visible on a screen — but for an agent are mediated by tokens and roles and scopes that have to exist as code written by someone.
A human consultant doesn’t notice that the contract management tool has its own permissions, that the support system has its own audit log, or that the CRM treats their access differently from an analyst’s. Their eyes do the work. The screen is the permissions model.
An agent has no eyes. Every one of those systems has to have a clear, auditable answer to “am I allowed to read this?” And every one of those audits has to compose with every other one when a regulator asks what happened in a given sequence.
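A sketch of what that composition looks like from the agent's side, assuming each system exposes its own authorization check. The system names and event shape are hypothetical.

```typescript
interface SystemClient {
  name: string; // e.g. "crm", "support-tickets", "contracts"
  allows(runId: string, scope: string): boolean; // each system's own answer
  read(scope: string): unknown;
}

// One agent run across N systems: every boundary crossing asks
// "am I allowed to read this?" and records the answer either way.
function gatherForBrief(
  runId: string,
  systems: SystemClient[],
  neededScopes: Map<string, string>
) {
  const trail: object[] = [];
  const inputs: unknown[] = [];
  for (const sys of systems) {
    const scope = neededScopes.get(sys.name);
    if (!scope) continue;
    const allowed = sys.allows(runId, scope);
    trail.push({ at: Date.now(), runId, system: sys.name, scope, allowed });
    if (allowed) inputs.push(sys.read(scope)); // a denied read is still a recorded event
  }
  return { inputs, trail }; // the composed audit: one structured entry per boundary crossed
}
```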
This is why six major vendors announced agent-infrastructure products within roughly one week of each other in early 2026:
- SAP acquired Dreo and Prior Labs to bring a unified data layer and tabular foundation models to where actual business data lives.
- Pinecone launched Nexus — essentially “stop making your agent rebuild business context from scratch every time it runs.”
- Salesforce shipped headless 360, which exposes their platform as APIs, tools, and CLI commands because agents don’t click through screens.
- ServiceNow opened up Action Fabric so outside agents can trigger governed workflows with identity and audit trail attached.
- Anthropic and OpenAI both stood up enterprise services companies with billions behind them to put engineers inside customer build rooms for exactly this complexity.
That’s not six separate product announcements. That’s one signal: the model was never the hard part. The hard part is whether the agent can reach the right data, use the right permissions, trigger the right workflow, and leave an audit trail — all at a cost the company can live with.
If you’re building agents that chain across multiple systems, understanding multi-agent workflow patterns before you finalize your vendor architecture will save you from discovering the permission boundary problems six months into deployment.
Now you have: A map of which systems your agents will actually touch, which permission boundaries they’ll cross, and whether your vendor has thought through what that means — or whether they’re selling you the model and leaving the rest as “implementation details.”
The failure modes to watch for
The vendor conflates agent access with user access. If the vendor’s demo shows an agent doing things “as” the logged-in user with no additional scoping, that’s the Lily pattern. The agent inherits whatever the human has, with no way to bound it to the task at hand.
The vendor’s audit log is human-readable but not agent-queryable. You need to be able to ask “what did this specific agent do between 2 p.m. and 4 p.m. on Tuesday” and get a structured answer. If the audit log is a flat text file or a UI you scroll through manually, it won’t survive a regulatory inquiry.
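Concretely, "agent-queryable" means something like the following: structured events you can filter by agent run and time window. The event shape here is a hypothetical sketch.

```typescript
type AuditEvent = { at: number; runId?: string; action: string; resource: string };

function agentActivity(events: AuditEvent[], runId: string, from: Date, to: Date): AuditEvent[] {
  return events.filter(
    (e) => e.runId === runId && e.at >= from.getTime() && e.at <= to.getTime()
  );
}

declare const events: AuditEvent[];
// "What did this specific agent do between 2 p.m. and 4 p.m. on Tuesday?"
agentActivity(events, "run-7f3a", new Date("2026-03-10T14:00:00"), new Date("2026-03-10T16:00:00"));
```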
The default posture is permissive. Ask what happens on day one, before any configuration. If the answer is “agents can do most things by default and you restrict from there,” that’s a red flag. Permissive defaults plus deadline pressure equals unauthenticated production-writable endpoints.
Technical teams aren’t in the buying conversation. This is the organizational failure mode, and it’s the hardest to fix after the fact. If the people who understand what agents actually do across your data structures aren’t in the room when the contract gets shaped, you’re committing capital to a strategy whose viability hasn’t been tested. You find out it doesn’t work six months in, one permission boundary at a time.
Platforms like MindStudio handle agent orchestration across 200+ models and 1,000+ integrations with a visual builder — which means the permission and integration questions are surfaced during the build process, not discovered in production. That doesn’t eliminate the need to ask these procurement questions, but it changes when you encounter the complexity.
The vendor’s security answer is documentation, not defaults. “It’s in the developer guide” is not a security posture. It’s a liability transfer. If the secure configuration requires deliberate action by your team under deadline pressure, some percentage of your deployments will skip it.
Where to take this further
The two questions in this post are the starting point. They’re designed to be askable in a vendor call without a security team in the room.
The next layer is a six-question technical checklist that covers: how permissions compound when agents delegate to other agents, what actual token cost looks like at scale, whether your audit trail can answer a regulator quickly, and what’s reversible when an agent makes a mistake. That checklist, along with a repair playbook for organizations that have already signed contracts that don’t pass these tests, is available on the Codewall/NateBJones Substack — linked from the original video.
If you’re building rather than buying, the same questions apply internally. The Lily incident was an internal build. The failure was procurement and organizational, not technical. Asking your own team “what is our default posture when someone is moving fast?” is the same question, pointed inward.
The cheapest thing you can do this quarter is move the technical architecture review earlier in the process. Bring developers to the table before the contract is signed, not after. Give them influence over timelines. Have people who understand agentic architecture weigh in on business deadlines and what those deadlines mean for cross-workflow complexity.
The most expensive thing you can do is keep the existing procurement sequence and pretend that multi-agent workflows work like SaaS. They don’t. And the receipt for that assumption tends to arrive at 3 a.m.
For teams thinking about how agents interact with marketing systems specifically, the AI agents for marketing teams breakdown is a useful reference for understanding the permission surface those workflows create — which is exactly the kind of cross-system complexity the two questions above are designed to surface before you’re in production.
One opinion, since this post earns one: the Lily incident isn’t a McKinsey story. The shape of that failure — governance arriving late, technical perspective underweighted against business timelines, agents bolted onto a human-user permission model — shows up in a lot of enterprise AI programs. McKinsey got unlucky because the consequences were vivid and the brand is large. The process that produced it is not unique to them.