AISI's Last Ones Benchmark: 5 Findings That Explain Why the White House Blocked Claude Mythos
Mythos completed a 32-step corporate network attack 3 out of 10 times. Here are the five AISI findings that triggered White House intervention.
Two Models Just Completed a Simulated Corporate Network Attack End to End
The UK’s AI Security Institute published results this spring that most people skimmed past. AISI’s “Last Ones” benchmark — a 32-step simulated corporate network attack that takes a human expert an estimated 20 hours to complete — has now been finished end to end by two separate AI models. Claude Mythos completed it in 3 out of 10 attempts. GPT-5.5 completed it in 2 out of 10 attempts. Those numbers are the reason the White House intervened to block Anthropic from expanding Mythos access, and they’re worth understanding precisely.
Here are the five findings buried in the AISI results that explain what happened next.
The Benchmark Itself Is the Story
Before getting to the findings, you need to understand what “Last Ones” actually measures — because “simulated corporate network attack” undersells it.
The benchmark models a sustained, multi-step intrusion against an enterprise-level target. Thirty-two discrete steps. AISI estimates a skilled human attacker would need roughly 20 hours to complete the full chain. That’s not a single exploit. That’s reconnaissance, lateral movement, privilege escalation, persistence — the whole operational sequence a threat actor would run against a real organization.
The fact that any AI model completes this at all is the news. The fact that two models now have, within months of each other, is the trend.
If you’ve been following the Mythos story, you’ve probably already seen the cybersecurity capability gap between Mythos and earlier Claude models documented elsewhere. What the Last Ones results add is external, independent confirmation from a government-backed evaluation body — not Anthropic’s own benchmarks, not a press release.
Finding 1: The Completion Rate Is Low, and That’s Not Reassuring
Mythos completed Last Ones 3 times out of 10. GPT-5.5 completed it 2 times out of 10. A 20–30% success rate sounds modest until you think about what it means operationally.
A human attacker doesn’t get one attempt. They get as many as they want, for as long as they want, at whatever cost they’re willing to absorb. If an AI model succeeds 30% of the time on a 32-step attack chain, a motivated actor running it repeatedly will eventually get a completion. The question isn’t whether the model can do it — it’s how many tries it takes.
The cost curve makes this worse. AISI highlighted a separate result from the GPT-5.5 evaluation: a reverse engineering challenge that would take a human expert roughly 12 hours was solved in 10 minutes and 22 seconds for $1.73 in API costs. When the per-attempt cost is under two dollars, a 30% success rate stops being a barrier and starts being a budget line.
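To make the repeated-attempts logic concrete, here is a back-of-envelope sketch in Python. The 30% figure is Mythos’s measured completion rate on Last Ones; the $1.73 per-attempt cost is borrowed from the separate reverse engineering result purely as an order-of-magnitude stand-in, since AISI did not publish a per-attempt cost for Last Ones itself.

```python
# Back-of-envelope: what a 30% per-attempt completion rate means for a
# patient attacker. Assumes independent attempts; the $1.73 cost figure is
# illustrative only, borrowed from AISI's reverse engineering result.

p_success = 0.30          # Mythos completion rate on Last Ones (3 of 10)
cost_per_attempt = 1.73   # illustrative per-attempt cost, not measured for Last Ones

expected_attempts = 1 / p_success                      # geometric mean: ~3.3 tries
expected_cost = cost_per_attempt * expected_attempts   # ~$5.77 to expected first completion

def p_at_least_one(p: float, n: int) -> float:
    """Probability of at least one full completion in n independent attempts."""
    return 1 - (1 - p) ** n

print(f"Expected attempts to first completion: {expected_attempts:.1f}")
print(f"Expected cost to first completion: ${expected_cost:.2f}")
print(f"P(at least one completion in 10 tries): {p_at_least_one(p_success, 10):.0%}")  # ~97%
```

Under those assumptions, a patient attacker expects a full completion in three to four tries for under six dollars. The exact figures matter less than the shape of the curve.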
Finding 2: The Two Models Are Closer Than the Headlines Suggest
GPT-5.5 scored 71.4% on AISI’s expert-level cyber tasks. Claude Mythos scored 68.6%. That’s a 2.8 percentage point gap — meaningful in a benchmark context, but not the kind of gap that puts one model in a different category from the other.
This matters for the policy debate. The White House moved to restrict Mythos specifically, but GPT-5.5 is now documented as having essentially equivalent offensive cyber capabilities. OpenAI, meanwhile, is rolling out GPT-5.5 Cyber to its own list of “critical defenders.” The two labs are running parallel access programs with similar-capability models, and the regulatory response has been asymmetric.
That asymmetry is partly about trust and partly about timing. Anthropic’s compute situation has been a separate source of friction — the White House cited concerns about whether Anthropic had enough capacity to serve both expanded commercial access and government use simultaneously. Anthropic disputes that compute is the limiting factor, pointing to new deals with Amazon, Google, and Broadcom. But those buildouts take time, and the government doesn’t want to be standing in line.
Finding 3: The 27-Year-Old Bug Is the Most Concrete Data Point
Benchmark scores are abstract. The OpenBSD vulnerability is not.
Claude Mythos identified a security flaw in OpenBSD that had gone undetected for 27 years. OpenBSD is not obscure software maintained by a small team — it’s a security-focused operating system with a long history of rigorous auditing, used in firewalls, routers, and critical infrastructure. The fact that a vulnerability survived 27 years of human review and was surfaced by an AI model in a controlled evaluation is the kind of result that gets the attention of the Federal Reserve.
According to the source reporting, banks that received early Mythos access had an emergency meeting after seeing what the model could do. These aren’t organizations that spook easily, and they’re not on Anthropic’s PR team. They saw the outputs and got concerned.
This is the defender argument in its strongest form: if Mythos can find a 27-year-old OpenBSD bug, it can find bugs in systems that are actively protecting financial infrastructure right now. Getting that capability into the hands of defenders — the people who can patch those systems — is the case Anthropic has been making for expanding access. The full picture of what Mythos can do makes that case harder to dismiss.
Finding 4: AISI Explicitly Doesn’t Know How This Translates to Real Attacks
This is the finding that gets buried, and it’s the most important one for calibrating the actual risk level.
AISI stated directly that they don’t know how these models would perform against real-world hardened systems. The Last Ones benchmark runs in a simulated environment. There are no active defenses. No triggered alerts. No defensive tooling responding in real time. It’s closer to a single-player game than an adversarial engagement — the model isn’t playing against a security team that’s watching and adapting.
Real enterprise networks have EDR, SIEM, honeypots, anomaly detection, and human analysts. A 32-step attack chain that succeeds 30% of the time in a static simulation might succeed 0% of the time against a live SOC. Or it might succeed more often, because the model can move faster than human defenders can respond. AISI doesn’t know, and they said so.
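One way to see why that uncertainty cuts both ways is a toy detection model: treat each of the 32 steps as an independent chance for a live defense to catch the intrusion, and watch how quickly per-step detection compounds over the full chain. The per-step detection rates below are invented for illustration; AISI publishes nothing like them.

```python
# Toy model of why "no active defenses" matters: if every one of the 32 steps
# is an independent detection opportunity for a live SOC, even modest per-step
# detection rates compound sharply. All rates here are invented for illustration.

STEPS = 32

def p_chain_undetected(per_step_detection: float, steps: int = STEPS) -> float:
    """Probability the full chain completes without a single detection,
    assuming each step is an independent detection opportunity."""
    return (1 - per_step_detection) ** steps

for d in (0.01, 0.05, 0.10, 0.20):
    print(f"per-step detection {d:.0%} -> chain survives undetected {p_chain_undetected(d):.1%}")

# per-step detection 1%  -> 72.5%
# per-step detection 5%  -> 19.4%
# per-step detection 10% -> 3.4%
# per-step detection 20% -> 0.1%
```

At a 10% per-step detection rate the chain survives undetected about 3% of the time, which is the intuition behind the possibility that real-world success drops toward zero. The independence assumption is doing a lot of work here, and that assumption is precisely the transfer question AISI says it can’t answer.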
This caveat doesn’t make the results less significant — it makes them harder to interpret. The time and cost curves are collapsing regardless of what the real-world transfer rate turns out to be. A capability that costs $1.73 per attempt and takes 10 minutes will get cheaper and faster. The question of how well it transfers to hardened targets is the one that actually determines the risk level, and right now nobody has a good answer.
Finding 5: An Informal Licensing Regime Is Already Operating
No law was passed. No formal regulatory framework exists. But the White House effectively blocked a private company from expanding access to its own product by citing national security concerns — and that block held.
Anthropic had 50 organizations with Mythos access and wanted to add 70 more, bringing the total to 120. The White House said no. There was also a separate incident: an unauthorized Discord group had somehow obtained Mythos access, and that situation is still under investigation. The combination of an informal access list, a government veto over expansions, and an active investigation into unauthorized access looks a lot like a licensing regime, even if nobody is calling it that.
Dean Ball, an AI policy analyst with prior government experience, described the White House move as “building a dam against a tsunami.” His argument is that these capabilities will diffuse across the broader AI ecosystem within 6 to 18 months — from Western frontier labs, from Chinese open-source models, from wherever. Restricting Mythos access buys time, but it doesn’t change the trajectory.
David Sacks, a venture capitalist and Trump administration adviser, offered a different frame: stop mystifying Mythos, arm defenders first, move fast. His point is that the model doesn’t create vulnerabilities — it finds ones that already exist. The OpenBSD bug was always there. Mythos just found it faster than any human auditor had.
Both arguments have merit, and they’re not entirely in conflict. The short-term restriction might be the right call while the compute situation gets sorted and formal access protocols get established. The long-term question — who gets access, under what rules, verified how — doesn’t have an answer yet.
What the Compute Constraint Actually Means
Mythos sits above Opus in Anthropic’s model hierarchy. Haiku, Sonnet, Opus — those are the tiers the public has had access to. Mythos is a new class above that, with substantially larger compute requirements per inference. Running it at scale, across 120 organizations doing active security audits, is a different resource commitment than running Sonnet for document summarization.
If you wanted to use Mythos to run security checkups across every major company in the United States, you might simply not have enough compute to do it in any reasonable timeframe. That’s not a hypothetical — it’s the practical constraint that makes the government’s priority access concern legible, even if Anthropic disputes the framing.
The capability jump from Opus to Mythos isn’t just a benchmark improvement. It’s a different operational profile. The models that are cheap enough to run continuously at scale — the ones that power agentic workflows and automated security tooling — are the Sonnet-class models. Platforms like MindStudio that support 200+ models and 1,000+ integrations can chain those models into security-adjacent workflows today, but Mythos-class capability at Sonnet-class cost is not here yet.
The Defender Problem Isn’t Solved by Access Restrictions
Here’s the one opinion this post will commit to: restricting Mythos access to 50 organizations while GPT-5.5 Cyber rolls out to a separate “critical defenders” list is not a coherent security strategy. It’s two parallel programs with similar-capability models, different access policies, and no unified framework for deciding who qualifies as a defender.
The actual hard problem is that offensive and defensive use cases are inseparable. The same capability that lets Mythos find a 27-year-old OpenBSD vulnerability is the capability that lets it find vulnerabilities in systems that haven’t been patched yet. You can’t give defenders access to the finding capability without also giving them access to the exploitation capability. The model doesn’t know which side you’re on.
What a real framework would need: verified organizational identity, auditable use logs, clear rules about what outputs can be shared and with whom, and some mechanism for revoking access when those rules get violated. The unauthorized Discord group with Mythos access is evidence that the current informal system isn’t working. The broader questions about what Mythos is and how it fits into Anthropic’s model lineup are worth understanding before the formal policy debate catches up to the technical reality.
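For illustration, here is a minimal sketch of what one record in such a framework could look like. The schema, field names, and sharing categories are hypothetical, not drawn from Anthropic’s or AISI’s actual programs; the point is that verified identity, auditable use logs, sharing rules, and revocation can all be made explicit and machine-checkable instead of living in someone’s head.

```python
# Hypothetical access-grant record for a "verified defender" program.
# Every field name and category here is an assumption for illustration,
# not part of any real Anthropic, OpenAI, or AISI access scheme.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class SharingRule(Enum):
    INTERNAL_ONLY = "internal_only"           # findings stay inside the org
    COORDINATED_DISCLOSURE = "coordinated"    # findings go to affected vendors
    GOVERNMENT_REPORTABLE = "gov_reportable"  # findings copied to a named agency


@dataclass
class DefenderAccessGrant:
    org_id: str                       # verified organizational identity
    verified_by: str                  # who performed the verification
    granted_at: datetime
    expires_at: datetime              # access is time-boxed, not permanent
    sharing_rule: SharingRule         # what outputs may leave the org, and to whom
    revoked: bool = False
    audit_log: list[str] = field(default_factory=list)  # append-only use log

    def record_use(self, description: str) -> None:
        """Every model invocation gets an auditable log entry."""
        self.audit_log.append(f"{datetime.now(timezone.utc).isoformat()} {description}")

    def revoke(self, reason: str) -> None:
        """Revocation is the mechanism the current informal system lacks."""
        self.revoked = True
        self.record_use(f"REVOKED: {reason}")
```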
The AISI results aren’t a reason to panic. They’re a reason to build the framework before the capability is everywhere — which, if Abal’s 6-to-18-month estimate is right, is not a lot of time.
For AI builders thinking about where this lands practically: the models doing the work in production security tooling today are not Mythos. They’re the models you can actually run at scale, chain into workflows, and audit. The Last Ones benchmark is a preview of what’s coming. The infrastructure question — who builds the tooling, who verifies the defenders, who holds the logs — is the one that actually needs answering now.
Tools like Remy approach a related problem in software development: instead of writing application code directly, you write a spec in annotated markdown and compile the full-stack app from it — backend, database, auth, deployment. The spec becomes the auditable source of truth. That kind of explicit, reviewable intent layer is exactly what’s missing from the current informal Mythos access regime, where the rules exist in someone’s head rather than in a document anyone can inspect.
The benchmark numbers are 3 out of 10 and 2 out of 10. The policy numbers are 50 organizations with access, 70 more requested, 0 approved. The gap between those two sets of numbers is where the actual work needs to happen.