McKinsey's Lily AI Platform Was Hacked for $20: 6 Enterprise AI Security Failures the Incident Exposed

A $20 SQL injection gave full read/write access to McKinsey's Lily platform. Here are 6 systemic failures the Codewall disclosure exposed for enterprise AI.

MindStudio Team

For $20, an Agent Read Every McKinsey Consultant’s AI Conversations

An autonomous agent spent $20, used zero credentials, and walked out with read/write access to tens of millions of chat messages belonging to McKinsey consultants. No insider help. No exotic tools. The attack vector was SQL injection — a technique first documented in 1998 and taught in every introductory web security course on earth.

The platform is called Lily. It’s McKinsey’s internal AI system, in production for more than two years, used by roughly 70% of the firm’s 40,000 consultants. The startup that found the hole is called Codewall. They disclosed responsibly on March 9, 2026. McKinsey patched it within one hour.

Here are the six systemic failures the Lily incident exposed — not for McKinsey specifically, but for every organization that has signed an AI platform contract in the last 18 months.


The Attack Itself Wasn’t the Problem

Before getting to the list, you need to understand what the Codewall team actually found — because the exploit is almost beside the point.

Of Lily’s 200 API endpoints, 22 shipped with no authentication at all. Not weak authentication. Not misconfigured authentication. None. And critically, at least one of those unauthenticated endpoints allowed production write access. That’s the one the agent used. For $20 in compute, it gained the ability to read tens of millions of consultant chat messages, access tens of thousands of user accounts, and — this is the part that should make you stop — rewrite every system prompt governing how the platform reasons.

Think about what that means in practice. An attacker with $20 and a few hours could have silently changed how Lily advises consultants who advise the largest companies in the world. Not stolen data. Poisoned the reasoning itself.
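To appreciate how mundane the vector is, here is a minimal TypeScript sketch of the bug class, using Express and node-postgres purely as illustration. None of this is Lily's actual code; it is the textbook pattern the disclosure describes: an endpoint with no auth check and user input concatenated into SQL.

```typescript
import express from "express";
import { Pool } from "pg";

const app = express();
const db = new Pool();

// The failure mode: no authentication middleware, and user
// input interpolated directly into the SQL string.
app.get("/api/messages", async (req, res) => {
  const result = await db.query(
    `SELECT * FROM messages WHERE user_id = '${req.query.userId}'`
  ); // userId = "' OR '1'='1" dumps every row in the table
  res.json(result.rows);
});

// Hypothetical auth middleware; any real scheme (JWT, session,
// mTLS) works. The point is only that it exists and runs.
function requireAuth(
  req: express.Request,
  res: express.Response,
  next: express.NextFunction
) {
  if (!req.headers.authorization) {
    res.status(401).end();
    return;
  }
  next();
}

// The fix is two lines of discipline: an auth check and a
// parameterized query. Neither has been exotic since the 1990s.
app.get("/api/messages/fixed", requireAuth, async (req, res) => {
  const result = await db.query(
    "SELECT * FROM messages WHERE user_id = $1",
    [req.query.userId]
  );
  res.json(result.rows);
});
```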

McKinsey has excellent engineers. Lily had been running in production for over two years. The SQL injection that cracked it is so well-understood that OWASP has been publishing defenses against it for decades. So the question isn’t “how did a bad engineer let this slip through?” The question is what kind of environment produces 22 unauthenticated endpoints at scale, including one with production write access.

That’s a different question. And it has six answers.


Failure 1: The Platform Was Designed for Humans, Then Agents Were Bolted On

When Lily launched two years ago, autonomous AI agents capable of probing public endpoints and reaching production data were not a realistic threat model. That’s not an excuse — it’s context for understanding the shape of the failure.

The Codewall agent walked up to Lily’s API and the API didn’t ask who was calling. There was nothing to authenticate because the system wasn’t built with the concept of an agent in it. The platform’s security model assumed a human would be on the other end of every request — someone whose eyes would mediate what they could see, whose screen would function as the permissions layer.

Agents have no eyes. An agent asks each system in code: am I allowed to read this? Every one of those systems has to have a clear answer written by someone. When you design for humans and bolt agents on afterward, you get Lily: a platform where the blast radius of a single unauthenticated endpoint is unbounded, because there’s no concept of “this agent” to bound.

This is the foundational failure. Everything else flows from it.
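What "a clear answer written by someone" looks like in practice is an explicit agent identity that every read resolves against. A hedged sketch, with all type and scope names hypothetical:

```typescript
// Hypothetical shape of "this agent": an identity the system
// can bound, which a human-first platform simply never defines.
interface AgentIdentity {
  agentId: string;
  actingFor: string;   // the human principal it operates for
  scopes: string[];    // e.g. ["read:client:acme"]
}

function canRead(
  agent: AgentIdentity,
  resource: { clientId: string }
): boolean {
  // No screen, no human eyes mediating visibility: either the
  // scope was explicitly granted or the read is denied.
  return agent.scopes.includes(`read:client:${resource.clientId}`);
}

// With no AgentIdentity type anywhere in the system, there is
// nothing to check, and every request falls through as "allowed."
```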


Failure 2: 22 Endpoints Is a Culture Problem, Not an Engineering Mistake

One unauthenticated endpoint is a mistake. Twenty-two is a pattern.

The standard postmortem framing — “authenticate your endpoints, sanitize your inputs, treat your AI platform like production” — isn’t wrong. It’s just not the story. That framing puts the failure on a single engineer who skipped a checklist on a Friday. If that’s what happened, the fix would be training, and it would be easy.

But you don’t get 22 unauthenticated endpoints from one engineer’s bad day. You get 22 from a default state — an environment where the assumption, implicit or explicit, is that you can push to production without that level of scrutiny on your endpoint. Where the technical architect’s opinion about what matters doesn’t carry enough weight to stop a deploy. Where the pressure to ship overrides the question of whether the thing being shipped is the right shape for a world where agents exist.

This is an organizational design problem. The technology to authenticate an endpoint is trivial. There is no organization on earth where this is a hard engineering problem. The hard problem is building a culture where the question gets asked before the endpoint ships, not after the incident report. Teams that have studied agentic workflow patterns in depth understand that the permissions architecture has to be designed into the workflow from the first commit — retrofitting it after the fact is how you end up with 22 gaps instead of one.
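One concrete way a team encodes that culture is to make "unauthenticated" the exception that must be written down. A sketch of a default-deny gate, assuming an Express-style app (route names illustrative):

```typescript
import express from "express";

const app = express();

// Public routes are enumerated in one reviewable place, so an
// unauthenticated endpoint is a deliberate diff someone approves,
// not a default someone forgot to change.
const PUBLIC_ROUTES = new Set(["/healthz", "/login"]);

app.use((req, res, next) => {
  if (PUBLIC_ROUTES.has(req.path)) return next();
  if (!req.headers.authorization) return res.status(401).end();
  next(); // real token verification would happen here
});

// Every route registered after this middleware is authenticated
// by default. Shipping 22 open endpoints now requires 22 explicit
// entries in PUBLIC_ROUTES, each visible in code review.
```

The design choice is that the dangerous state leaves a trace in version control, which is exactly where the pre-ship question gets asked.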


Failure 3: The Procurement Sequence Wasn’t Built for This

Most enterprise software has been bought the same way for fifteen years: strategic decision at the top, procurement negotiates the contract, security and compliance review, IT plans the integration, developers build against whatever platform already got purchased.

That sequence worked for SaaS because SaaS is bounded. The vendor gives you an admin console, a set of integration points, a published API, a permissions model that maps cleanly to roles. You’re configuring software. The complexity is manageable.

For agents, that same sequence leads directly to situations like Lily. Consider what an agent actually does on a single real run: a user asks it to prepare a renewal brief for a major customer. The agent pulls from the CRM, from support tickets, from contract management, from product usage data, from call transcripts, from an internal wiki. It crosses permission boundaries that for a human are mediated by what’s visible on a screen — but for an agent are mediated by tokens and roles and scopes that have to exist as code written by someone.

When developers are last in the buying sequence, you’re committing capital to a strategy whose viability has never been tested. You don’t find that out in a demo. You find it out six months in, when your team is pushing a workflow into production and discovering, one boundary at a time, that the platform you purchased wasn’t buildable for the work you bought it to do. The implementation question isn’t downstream of the strategic decision. For agents, it is the strategic decision.


Failure 4: No Distinction Between Human Users and AI Agents

Here’s a concrete version of the problem. A senior McKinsey consultant using Lily might have legitimate read access to 40 client accounts, built up over five years of work. An agent running on a specific client engagement should probably only touch that client’s data. That boundary seems obvious.

If the platform doesn’t enforce it — if it can’t distinguish between a human user and an AI agent operating on that user’s behalf — then one incident becomes a company-wide exposure event. That’s not an IT problem. That’s a board-level liability conversation.

The Codewall exploit made this concrete: the agent had no identity the system recognized. There was no “this agent” to scope, no permissions to bound, no audit trail to follow. The blast radius was the entire platform.

This is why the first question any organization should be asking their AI vendor right now is: does your platform separately authenticate human users and AI agents? Not “do you support security controls” in the abstract. Specifically: can an agent be scoped to a subset of what its human operator can access? Can you revoke that agent’s access from a console in five minutes without a code deploy? If the answer to either question is no, you have an unbounded blast radius and you probably don’t know it yet.
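Concretely, "scoped to a subset" means the agent's grant is computed as an intersection of what the human can access and what the engagement requires, never inherited wholesale. A hypothetical sketch:

```typescript
// Hypothetical: mint an agent grant as the intersection of the
// operator's access and the engagement's actual needs.
function scopeAgentGrant(
  humanClients: Set<string>,      // e.g. 40 clients over 5 years
  engagementClients: Set<string>  // the one client this run serves
): Set<string> {
  const granted = new Set<string>();
  for (const client of engagementClients) {
    if (humanClients.has(client)) granted.add(client);
  }
  // The agent can never exceed its operator, and never receives
  // the operator's full history by default.
  return granted;
}
```

Five-minute revocation then falls out of the design: the grant lives on a token the platform can kill independently of the human's account, with no code deploy involved.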

The agentic workflow patterns that are emerging in 2026 make this distinction even more critical — agents that delegate to sub-agents compound the permissions problem geometrically. And the cybersecurity capability gap between AI models means that the agent probing your endpoints may be significantly more capable than the one you used when you last assessed your threat model.
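When agents spawn sub-agents, the same rule has to hold recursively: a child's scope can only attenuate, never widen. A sketch of that invariant, with hypothetical names:

```typescript
import { randomUUID } from "node:crypto";

interface AgentToken {
  agentId: string;
  scopes: Set<string>;
  parent?: AgentToken; // the delegation chain, preserved for audit
}

// A sub-agent's token is minted from its parent's and can only
// shrink. If any layer could widen scope, one over-permissioned
// sub-agent would undo every boundary above it.
function delegate(parent: AgentToken, requested: Set<string>): AgentToken {
  const scopes = new Set(
    [...requested].filter((scope) => parent.scopes.has(scope))
  );
  return { agentId: randomUUID(), scopes, parent };
}
```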


Failure 5: The Audit Trail Doesn’t Exist for Agents

When something goes wrong, regulators don’t ask what the user did. They ask what the system did on behalf of the user, and can you prove it.

For human actions in most enterprise platforms, this is solved. There’s a log. There’s an audit trail. There’s a record of who accessed what and when. For agent actions, in most platforms deployed today, that trail either doesn’t exist or has gaps the size of every agent action in the organization.

That gap is not small. Agents are capable of taking hundreds of actions in a single run — reading data, writing data, triggering workflows, calling external APIs. If none of that is logged in a way that answers a regulator’s question, your compliance team is going to discover the problem the hard way.
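The minimum viable answer is a chokepoint no agent action can bypass: every tool call is recorded, with agent identity, human principal, and arguments, before it executes. A hedged sketch (the logging sink is hypothetical):

```typescript
type ToolFn = (args: unknown) => Promise<unknown>;

// Hypothetical durable sink; in practice a write-ahead store the
// agent runtime cannot skip or reorder.
async function appendAuditLog(entry: object): Promise<void> {
  console.log(JSON.stringify(entry)); // stand-in for a real store
}

// The only path by which an agent touches a tool, so the log and
// the action cannot diverge. This is the record that answers
// "what did the system do on behalf of the user?"
async function audited(
  agentId: string,
  actingFor: string,   // the human principal
  toolName: string,
  tool: ToolFn,
  args: unknown
): Promise<unknown> {
  await appendAuditLog({
    ts: new Date().toISOString(),
    agentId,
    actingFor,
    toolName,
    args,
  });
  return tool(args);
}
```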

ServiceNow’s Action Fabric announcement — one of six major vendor announcements that came out within roughly a week of the Lily disclosure — is a direct response to this. It lets outside agents trigger governed workflows with identity and audit trail attached. Salesforce’s headless 360 exposes the platform as APIs and CLI commands specifically because agents don’t click through screens. Pinecone Nexus addresses the related problem of agents rebuilding business context from scratch on every run, which drives up token costs and creates consistency gaps. These aren’t coincidental product launches. They’re the industry acknowledging, in product form, that the audit and permissions infrastructure for agents doesn’t exist yet in most deployments.

For teams building their own agent infrastructure rather than buying it, the same audit requirements apply. MindStudio provides orchestration across 200+ models and 1,000+ integrations with a visual builder for designing agent workflows, but the governance layer still has to be designed intentionally. The existence of the tooling doesn't make the architecture automatic. Having the right platform is a prerequisite; having the right architecture is the actual work.


Failure 6: The Pressure Test Reveals the Real Default

The most revealing question you can ask about any AI platform — or any team building one — isn’t about its security features in the abstract. It’s about what happens when the team is under pressure.

Vendors will tell you they have comprehensive authentication frameworks. That their enterprise customers have full flexibility to configure policies. That all authentication options are documented in the developer guide. All of that may be true. None of it answers the core question.

The core question is: what is the technical default when your team is told to move quickly? When there’s no time for a thorough architectural review, where does the platform land? Does it default to authenticated or unauthenticated? Does it default to agents having bounded or unbounded access? Does it default to audit trails being on or off?

Because that default — not the feature list, not the documentation, not the enterprise configuration options — is the version of the platform you’re actually running. The version that ships when someone is working to a deadline. The version that was running on Lily when Codewall’s agent showed up.

This is also where the build-versus-buy distinction collapses. Whether you’re building your own version of Lily or purchasing a vendor platform, you still have to deal with cross-workflow complexity. You still have to involve technical teams. You still have to ask what the default state looks like when no one has time to configure it carefully. The question of what happens under pressure is organizational, not technological. And it’s the question that most procurement processes never ask.

For teams thinking about how to build more carefully from the start, the abstraction level matters. Tools like Remy take a spec-driven approach — you write annotated markdown describing your application, and a complete TypeScript backend, database, auth layer, and deployment get compiled from it. The spec is the source of truth; the code is derived output. That kind of explicit, reviewable source document is exactly the kind of artifact that makes security review tractable before deployment, not after. When the entire application is generated from a single auditable spec, the question of what shipped and why has a clear answer.


The Vendors Noticed

Within roughly one week of the Lily story gaining traction, six major vendors announced agent infrastructure products: SAP acquired Dreo and Prior Labs to bring a unified data layer and tabular foundation models to where actual business data lives. Pinecone launched Nexus to stop agents from rebuilding business context from scratch on every run. Salesforce shipped headless 360. ServiceNow opened Action Fabric. Anthropic and OpenAI both stood up enterprise services companies with billions behind them to put engineers inside customer build rooms.

One story. Six announcements. Every single one of those vendors is now selling you the thing your AI roadmap was supposed to already have: reachable surfaces, governed action, permission-aware data, cheaper context assembly, humans who can actually wire up your workflows.

The signal is clear: the model was never the hard part. The hard part is exactly what Lily surfaced — whether the agent can reach the right data, use the right permissions, trigger the right workflow, leave the right audit trail, and do all of it at a cost the organization can live with.

Understanding how multi-agent teams are structured in practice makes the permissions problem more concrete: when agents spawn sub-agents and those sub-agents trigger workflows, the identity and audit chain has to extend through every layer. Most enterprise deployments have no answer for that today.


What the $20 Actually Cost

The $20 Codewall spent is almost insultingly cheap. But the real cost of the Lily incident isn’t measured in dollars. It’s measured in what it revealed about the shape of enterprise AI deployment in 2026.

The shape is this: governance and thoughtful technical perspective tend to arrive late. The exploit is just the receipt.

McKinsey got unlucky in the sense that Codewall found the hole before someone malicious did, and in the sense that the brand is large enough that the story made news. But the underlying pattern — agents bolted onto human-centric systems, technical voices arriving after the strategic decision, no distinction between human and agent permissions, no audit trail for agent actions, defaults that favor speed over security — that pattern is not unique to McKinsey.

The cheapest thing any organization can do this quarter is move the technical architectural review earlier in the process. Give developers more influence over timelines. Have people who understand how agents actually work weigh in on business timelines before the contract is signed, not six months after.

The most expensive thing is to keep the existing procurement sequence and pretend that multi-agent workflows work like SaaS. They don’t. Lily proved it for $20.

If you’re building agents that cross system boundaries — and if your AI roadmap is worth anything, you are — the cybersecurity capability gap between AI models is one dimension of the problem. The permissions architecture is another. And the organizational culture that decides which one gets prioritized is the one that will determine whether your incident shows up in a responsible disclosure report or a news cycle.

Presented by MindStudio
