22 of 200 API Endpoints Shipped Unauthenticated: The Lily Incident's Real Procurement Failure
McKinsey's Lily shipped 22 unauthenticated API endpoints including writable ones. This wasn't a security bug — it was a procurement architecture failure.
Lily, McKinsey’s internal AI platform, shipped 22 of 200 API endpoints with no authentication. Not one. Twenty-two. And at least one of those 22 allowed production write access — meaning an agent on the open internet could modify the data that governs how 28,000 consultants receive AI-generated advice. Codewall, a startup doing security research, found this on March 9, 2026, spent $20, used no credentials, and gained read/write access to tens of millions of chat messages plus every system prompt on the platform. The whole thing took two hours.
McKinsey patched it within an hour of responsible disclosure. That part is genuinely impressive. But the patch is not the interesting part. The interesting part is how you get to 22.
The Number That Doesn’t Make Sense on Its Own
If you read the postmortem framing — authenticate your endpoints, sanitize your inputs, treat your AI platform like production — you come away thinking this was a hygiene failure. Someone forgot something. A checklist item got skipped.
That framing collapses immediately when you look at the number 22.
McKinsey has excellent engineers. Lily had been running in production for over two years. SQL injection, the attack vector Codewall used, has been documented since 1998. It is literally taught in introductory web security courses. There is no world where a team of McKinsey’s caliber doesn’t know how to authenticate an API endpoint. That is not a hard engineering problem. It’s a few lines of middleware.
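To be concrete about how small the fix is: Lily's actual stack is not public, but in a typical Express-style TypeScript service, the "few lines of middleware" look something like this. The names and the JWT check are illustrative assumptions, not Lily's code.

```typescript
import express, { Request, Response, NextFunction } from "express";
import jwt from "jsonwebtoken";

// Illustrative sketch only; Lily's stack is not public. This is the generic
// middleware the paragraph above refers to: reject any request that does not
// carry a verifiable credential.
function requireAuth(req: Request, res: Response, next: NextFunction) {
  const header = req.header("Authorization") ?? "";
  const token = header.startsWith("Bearer ") ? header.slice(7) : null;
  if (!token) {
    return res.status(401).json({ error: "missing credentials" });
  }
  try {
    // Key management is elided; any JWT or session-store check works here.
    (req as any).caller = jwt.verify(token, process.env.JWT_SECRET!);
    return next();
  } catch {
    return res.status(401).json({ error: "invalid credentials" });
  }
}

const app = express();
app.use(requireAuth); // applied globally, so no endpoint ships unauthenticated by accident
```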
So the question isn’t “why didn’t someone authenticate the endpoint.” The question is: what organizational process produces 22 unauthenticated endpoints in a production AI system, including writable ones, without any of them getting caught?
That’s a different question entirely. And it has a different answer.
What the Procurement Sequence Actually Produces
Enterprise software has been bought the same way for at least 15 years. Strategic decision at the top. Procurement negotiates the contract. Security and compliance review. IT plans the integration. Developers build against whatever platform already got purchased.
This sequence worked well for SaaS. It worked for Salesforce, Workday, ServiceNow — the entire generation of cloud applications most companies run on today. It worked because SaaS is bounded. The vendor gives you an admin console, a set of integration points, a published API, and a permissions model that maps cleanly to human roles. You’re configuring software. The blast radius of a misconfiguration is limited because the software itself is limited.
Agents are not bounded in this way.
Think through what an agent actually does on a single real task inside a company in 2026. A user says: “Prepare the renewal brief for our largest customer.” The agent has to figure out which systems hold the answer. It pulls from the CRM, from support tickets, from contract management, from product usage data, from call transcripts, from an internal wiki. It crosses permission boundaries that for a human are mediated by what’s visible on a screen — but for an agent are mediated by tokens and roles and scopes that have to exist as actual code written by someone.
When a human consultant pulls a renewal brief together, none of this complexity is visible to them. They open Salesforce. They glance at the support history. They check the contract. They scan a Slack thread. They don’t notice that the contract management tool has its own permissions, that the support system has its own audit log, that the CRM treats their access differently from the analyst’s access. They don’t have to notice. Their eyes do the work. The screen is the permissions model.
An agent has no eyes. The agent is asking each of those systems in code: am I allowed to read this? And every one of those systems has to have a clear answer. And every one of those answers has to be auditable. And every one of those audits has to compose with every other one when a regulator asks what happened in this sequence.
None of this exists by default. All of it is engineering work that someone has to do against a deadline before the agent ships.
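What that work looks like in practice is not mysterious, just unglamorous. Here is a hypothetical sketch of its shape (none of these names come from Lily or any real platform): every system the agent touches has to return an explicit authorization answer, and every answer has to land in an audit log before anything is read.

```typescript
// Hypothetical shape of the work described above: each system the agent
// queries must answer "may this agent read this?" explicitly, and every
// answer must be recorded so the audits compose across systems.
interface AuthzDecision {
  system: string;   // "crm", "support", "contracts", ...
  resource: string;
  allowed: boolean;
  reason: string;   // for the regulator's "what happened in this sequence"
  decidedAt: Date;
}

interface SystemConnector {
  name: string;
  authorize(agentId: string, resource: string): Promise<AuthzDecision>;
}

async function gatherForBrief(
  agentId: string,
  connectors: SystemConnector[],
  resource: string,
  audit: (d: AuthzDecision) => Promise<void>,
): Promise<string[]> {
  const readable: string[] = [];
  for (const connector of connectors) {
    const decision = await connector.authorize(agentId, resource);
    await audit(decision); // nothing gets read un-audited
    if (decision.allowed) readable.push(connector.name);
  }
  return readable; // only systems that answered "yes" get queried
}
```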
When you put developers last in the buying sequence — when the strategic decision is made, the contract is signed, and then the technical team is handed the platform and told to build — you are committing capital to a strategy whose viability has not been tested. You don’t find that out in the demo. You find it out six months later when your team is trying to push a workflow into production and discovering one boundary at a time that the platform wasn’t buildable for the work you bought it to do.
The Lily incident is what you find out when you find out early, because someone external found it for you.
Why 22 Is a Culture Signal, Not a Bug Count
Here’s the thing about 22 unauthenticated endpoints. If it were one, you’d blame a single engineer on a Friday afternoon. If it were two or three, you’d blame a bad sprint. But 22 of 200 — over 10% — is a pattern. That’s a platform where the default assumption, somewhere in the process, was that you could push to production without that level of scrutiny on your endpoints.
The deeper failure isn’t that the endpoints weren’t authenticated. The deeper failure is that no one asked whether the API endpoint itself was the correct shape for a world where autonomous agents exist on the internet and can probe your production surfaces.
When Lily first shipped two years ago, that was a reasonable oversight. Autonomous agents capable of walking up to a public API and probing it systematically were not a normal threat model in 2024. They are a very normal threat model in 2026. The platform didn’t update its assumptions. That’s the gap.
And this gap is not unique to McKinsey. The shape of this failure — governance and thoughtful technical perspective arriving late, implementation treated as downstream of strategy rather than constitutive of it — shows up in a lot of enterprise AI programs. McKinsey is the version that made the news because the consequences were vivid, the disclosure was responsible and well-documented, and the brand is large enough that people paid attention.
The organizations that haven’t made the news yet are not necessarily safer. They may just not have been probed yet.
The Specific Thing Agents Break in Your Permissions Model
There’s a precise technical reason why the old procurement sequence fails for agents, and it’s worth being concrete about it.
In a human-facing SaaS system, permissions are scoped to users. A senior consultant at McKinsey might have legitimate read access to 40 client accounts built up over five years. That access is appropriate for a human who navigates those accounts deliberately, one at a time, through a screen that mediates what they see.
An agent running on behalf of that consultant, if it inherits those same permissions, can access all 40 client accounts simultaneously, programmatically, in a single run. The blast radius of a compromised agent is not the blast radius of a compromised user. It’s the blast radius of a compromised user times the speed of code.
If the platform doesn’t enforce a boundary between human-user permissions and agent-scoped permissions, one incident becomes a company-wide exposure event. That’s not an IT problem. That’s a board-level liability conversation.
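What such a boundary looks like is a design decision, not a library import. One hedged sketch, with illustrative names: instead of handing the agent the user's full grant, the platform mints a task-scoped credential that covers only what this run needs and expires when it is done.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative sketch of a human-user vs. agent-scope boundary.
interface UserGrant {
  userId: string;
  accounts: Set<string>; // e.g. 40 client accounts accumulated over five years
}

interface AgentScope {
  agentId: string;
  onBehalfOf: string;    // the human principal the agent acts for
  accounts: Set<string>; // only the accounts this specific task needs
  expiresAt: Date;
}

function scopeForTask(grant: UserGrant, taskAccounts: string[], ttlMs: number): AgentScope {
  const denied = taskAccounts.filter((a) => !grant.accounts.has(a));
  if (denied.length > 0) {
    throw new Error(`task requests accounts the user cannot access: ${denied.join(", ")}`);
  }
  return {
    agentId: randomUUID(),
    onBehalfOf: grant.userId,
    accounts: new Set(taskAccounts), // blast radius: one task, not the whole grant
    expiresAt: new Date(Date.now() + ttlMs),
  };
}
```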
And then there’s the audit question. When something goes wrong, regulators don’t ask what did the user do. They ask what did the system do on behalf of the user, and can you prove it. If your platform can’t answer that question for agent actions specifically — if the audit trail has a gap the size of every agent action in your organization — your compliance team is going to find that out the hard way.
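Concretely, the record that answers that question has a different shape from a user-activity log. A minimal sketch, assuming nothing about any particular platform's schema:

```typescript
// Minimal sketch of an agent-action audit record (field names are illustrative).
// Note what it captures that a user-activity log does not: the acting agent,
// the human principal, the scope that authorized the action, and the run
// that ties individual actions into a provable sequence.
interface AgentActionRecord {
  agentId: string;
  onBehalfOf: string;      // the human principal
  action: "read" | "write";
  system: string;          // "crm", "support", "contracts", ...
  resource: string;
  scopeUsed: string;       // which permission authorized this action
  runId: string;           // groups the actions of one agent run
  at: Date;
}

// "What happened in this sequence?" becomes a query, but only if every
// agent action was written here in the first place.
function reconstructRun(log: AgentActionRecord[], runId: string): AgentActionRecord[] {
  return log
    .filter((r) => r.runId === runId)
    .sort((a, b) => a.at.getTime() - b.at.getTime());
}
```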
The Codewall agent walked up to Lily’s API and the API didn’t ask who was calling. There was nothing to authenticate, because the system was never built with the concept of an agent as a caller in the first place. The blast radius was unbounded because there was no “this agent” to bound.
For teams building multi-agent systems, this compounds further. When an agent delegates to a sub-agent, whose permissions apply? If you’re thinking through how multi-agent systems actually compose, the answer is almost never “it just works” — it requires explicit design decisions about permission inheritance that most platforms don’t make for you.
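One defensible answer, sketched below with the same illustrative AgentScope shape as before, is attenuation: a sub-agent can only ever receive a subset of its parent's scope, never more, and never a longer lifetime. This is one possible design, not a standard.

```typescript
import { randomUUID } from "node:crypto";

// Same illustrative AgentScope shape as the earlier sketch.
interface AgentScope {
  agentId: string;
  onBehalfOf: string;
  accounts: Set<string>;
  expiresAt: Date;
}

// Attenuation: delegation can narrow a scope but never widen it. The point is
// that someone has to make this decision explicitly, because "it just works"
// is not on the menu.
function delegate(parent: AgentScope, requestedAccounts: string[]): AgentScope {
  const escalations = requestedAccounts.filter((a) => !parent.accounts.has(a));
  if (escalations.length > 0) {
    throw new Error(`delegation would escalate beyond parent scope: ${escalations.join(", ")}`);
  }
  return {
    agentId: randomUUID(),
    onBehalfOf: parent.onBehalfOf, // the human principal carries through the chain
    accounts: new Set(requestedAccounts),
    expiresAt: parent.expiresAt,   // a child scope never outlives its parent
  };
}
```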
What the Vendor Announcements Are Actually Telling You
In roughly the same week that the Lily incident was being analyzed, six major vendors announced agent-infrastructure products: SAP acquired Dreo and Prior Labs to bring a unified data layer and tabular foundation models to where actual business data lives. Pinecone launched Nexus — essentially a solution to agents rebuilding business context from scratch on every run. Salesforce shipped headless 360, which exposes their platform as APIs, tools, and CLI commands because agents don’t click through screens. ServiceNow opened Action Fabric so outside agents can trigger governed workflows with identity and audit trail attached. Anthropic and OpenAI both stood up enterprise services companies with billions behind them to put engineers inside customer build rooms.
Six announcements. One story.
Every single one of those vendors is now selling you the thing your AI roadmap was supposed to already have: reachable surfaces, governed action, permission-aware data, cheaper context assembly, forward-deployed humans who can actually wire up your workflows.
The model was never the hard part. The hard part is exactly what the Lily incident surfaced — whether the agent can reach the right data, use the right permissions, trigger the right workflow, leave the right audit trail, and do all of it at a cost the company can live with.
Platforms like MindStudio approach this from the other direction: 200+ models, 1,000+ integrations, and a visual builder for composing agents and workflows, so the orchestration layer doesn’t have to be assembled from scratch against a deadline. But even with good tooling, the permissions architecture still has to be designed deliberately. Tooling doesn’t substitute for the architectural decisions.
The Organizational Fix Is Earlier, Not Different
The cheapest thing you can do this quarter is move the technical developer review earlier in the process. Not different — earlier.
Bring your developers to the table before the contract is signed. Give them actual influence on timeline and deployment. Have people who understand architecture weigh in on business timelines and the impact of those timelines on complex cross-agent workflows. Because the most expensive thing you can do is keep the existing procurement sequence and pretend that multi-agent workflows work like SaaS when they don’t.
The implementation question isn’t downstream of the strategic decision. For agents, it effectively is the strategic decision. If the agent can’t authenticate against the system it needs, the strategy doesn’t work. If the permissions model only thinks about humans clicking through screens, the strategy doesn’t work. If every run reassembles the same business context from scratch and your token bill goes up by 3x, the strategy doesn’t work. If you can’t audit what the agent did, the strategy won’t get past legal — and shouldn’t.
None of these are implementation details to be worked out later. Every one of them is enough to change how the roadmap actually gets shaped.
This is also true if you’re building internally rather than buying. The Lily incident was an internal build. The same cross-workflow complexity that breaks purchased platforms also breaks internally built ones if the technical team isn’t at the table with space to talk. When you’re building something like an AI agent for complex business workflows, the permissions architecture has to be designed in, not bolted on after the first security incident.
For teams doing the actual architectural review, the questions that matter are specific: Does the platform separately authenticate human users and AI agents? What is the default posture when the team is moving fast and doesn’t have time to configure security settings? How do permissions compound when agents delegate to sub-agents? What does the audit trail look like when a regulator asks what happened in a specific sequence? What’s reversible when an agent makes a mistake?
These questions have specific failure modes. You would rather catch them before you sign than six months after, when you’re discovering one boundary at a time that the platform wasn’t buildable for the work you bought it to do.
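The first of those questions, the default posture, is the one most directly implicated by the Lily incident, and it is checkable in code. A hedged sketch of what a safe default looks like (illustrative, not any particular vendor's API): unauthenticated routes exist only as an explicit, reviewable allowlist, so an unconfigured endpoint fails closed rather than open.

```typescript
import express from "express";

// Illustrative deny-by-default posture. The set of public routes is the
// complete, reviewable list of exceptions; everything else requires a
// credential even if a rushed team never touches a security setting.
const PUBLIC_ROUTES = new Set(["/healthz", "/login"]);

const app = express();

app.use((req, res, next) => {
  if (PUBLIC_ROUTES.has(req.path)) return next();
  if (!req.header("Authorization")) {
    // Fails closed: a forgotten route is unreachable, not wide open.
    return res.status(401).json({ error: "authentication required" });
  }
  return next(); // token verification elided; see the middleware sketch earlier
});
```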
Tools like Remy take a related approach to the spec-first problem: you write your application as annotated markdown — intent in prose, precision in annotations — and it compiles into a complete TypeScript stack with backend, database, auth, and deployment. The point isn’t that the code disappears; it’s that the source of truth becomes explicit before anything gets built. That same instinct — make the requirements precise before you commit to an implementation — is exactly what’s missing from most enterprise AI procurement processes.
What 22 Actually Means for Your Roadmap
The Lily incident is not a security story. It’s a procurement and build story that surfaced as a security incident.
The exploit was SQL injection, a technique documented in 1998. But that’s not the attack vector that matters. The attack vector that matters is the organizational process that produces 22 unauthenticated endpoints in a production AI system — including writable ones — without any of them getting caught before an external researcher found them for $20.
That process is the one most enterprise AI programs are running right now. The sequence where technical teams arrive late, where governance is downstream of strategy, where the demo looks good and the production deployment is where you find out what you actually bought.
The six vendor announcements are a signal that the industry has figured out what the hard part is. The hard part is not the model. The hard part is the permissions, the audit trail, the governed action surface, the context that doesn’t get rebuilt from scratch on every run.
If you’re evaluating an AI platform right now — or if you signed one last quarter — the question worth asking is not whether the vendor has a security page. The question is what happens when your team is under pressure and moving fast. What is the default? Where does the system go when no one has time to configure it carefully?
Because that’s the version of the platform you’re actually going to run. And if that version ships 22 unauthenticated endpoints, you’re going to find out about it one way or another. The question is whether you find out from your own security review or from someone else’s $20 experiment.
For teams doing the deeper technical review, the cybersecurity capability gap between different AI models is also worth understanding — because the same agents that power your workflows are increasingly capable of finding the vulnerabilities in them.
The fix is not exotic. It’s earlier. Move the technical review earlier. Give it more weight. Treat the implementation questions as the strategic questions, because for agents, they are.