Claude Mythos and GPT-5.5 Pass the 'Last Ones' Cyberattack Benchmark: 6 Things You Need to Know
AISI's 32-step corporate network attack sim took human experts 20 hours. Claude Mythos completed it 3 times out of 10. Here's what that means.
Two AI Models Just Completed a Simulated Corporate Cyberattack End-to-End
The AISI “Last Ones” benchmark is a 32-step simulated corporate network attack. Human experts take an estimated 20 hours to complete it. Claude Mythos completed it in 3 out of 10 attempts. GPT-5.5 completed it in 2 out of 10 attempts. Those are the numbers you need to hold in your head for everything that follows.
This isn’t a marketing claim from Anthropic. The UK’s AI Security Institute — a government-backed evaluation body — ran these tests. When the White House starts blocking model rollouts and the Federal Reserve holds emergency meetings, you can be reasonably confident something real is being measured.
Here are six things buried in those results that matter if you build with AI.
What AISI Actually Measured
The “Last Ones” benchmark isn’t a CTF challenge or a toy problem. It’s a 32-step simulation of a sustained corporate network intrusion — the kind of multi-stage attack that would require reconnaissance, lateral movement, privilege escalation, and persistence. AISI estimates a human expert needs roughly 20 hours to complete it from start to finish.
Claude Mythos completed it end-to-end 3 times out of 10 attempts. GPT-5.5, running the same evaluation, completed it 2 times out of 10.
Those completion rates sound low until you think about what they mean operationally. A 30% end-to-end success rate on a 32-step attack chain means the model can sometimes execute a full corporate intrusion without human guidance. At scale, with cheap API access, you run it ten times and you expect three successful attacks. The cost of a failed attempt is an API bill measured in dollars. The cost of a successful one is whatever the attacker wanted.
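The "run it ten times" point is just probability arithmetic, and it's worth making explicit. A minimal sketch, using the 30% per-attempt rate from the AISI results and assuming attempts are independent (an assumption, not something the report states):

```python
# Back-of-the-envelope math for a 30% per-attempt end-to-end success rate.
# p = 0.3 comes from the reported AISI completion rate; independence of
# attempts is an assumption made for illustration.

def expected_successes(p: float, attempts: int) -> float:
    """Expected number of full-chain completions over `attempts` runs."""
    return p * attempts

def prob_at_least_one(p: float, attempts: int) -> float:
    """Probability at least one attempt completes the full 32-step chain,
    assuming independent attempts."""
    return 1 - (1 - p) ** attempts

p = 0.3
for n in (1, 10, 50):
    print(f"{n:3d} attempts: expect {expected_successes(p, n):.1f} successes, "
          f"P(at least one) = {prob_at_least_one(p, n):.4f}")
```

At ten attempts the chance of at least one full compromise is already above 97%; the per-attempt rate matters far less to an attacker than the fact that retries are cheap.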
On the broader expert-level cyber task scoring, GPT-5.5 hit 71.4% versus Claude Mythos at 68.6%. Close enough that neither lab should be claiming a decisive lead — and alarming enough that both results warrant serious attention.
If you want more context on where Mythos sits in Anthropic’s model hierarchy, Claude Mythos vs Claude Opus 4.6 capability comparison covers the capability jump in detail. The short version: Mythos is a new model class above Opus, not an incremental update.
The Numbers That Should Actually Concern You
The end-to-end completion rates get the headlines. The number that should concern you more is $1.73.
AISI highlighted a specific reverse-engineering challenge during GPT-5.5’s evaluation. A human expert would need approximately 12 hours to solve it. GPT-5.5 solved it in 10 minutes and 22 seconds, at a total API cost of $1.73.
Read that again. Twelve hours of expert labor, compressed to ten minutes, for less than the cost of a coffee. The time compression is striking. The cost compression is what changes the threat model entirely.
Cybersecurity has always had an asymmetry problem — defenders need to be right every time, attackers only need to be right once. What these benchmarks suggest is that the cost of being an attacker is collapsing. You no longer need a team of skilled engineers with deep domain expertise. You need API access and a few dollars.
Claude Mythos also found a 27-year-old OpenBSD vulnerability during testing — a bug that had survived decades of security audits and expert review. That’s not a benchmark artifact. That’s a model doing something that the entire security community missed for nearly three decades.
The cybersecurity capability gap between Claude Mythos and Opus 4.6 is worth understanding here: Mythos scores 83.1% on cybersecurity benchmarks versus Opus 4.6’s 66.6%. That’s not a marginal improvement. That’s a different category of capability.
Why the Controlled Environment Caveat Matters Less Than You Think
AISI is careful to note that these are controlled evaluations. The simulated environments lack active defenses, don’t trigger real alerts, and have no defensive tooling responding in real time. It’s closer to a PvE game than a live network intrusion. AISI explicitly states they don’t know how these models would perform against real-world hardened systems.
That caveat is real and worth keeping. But it’s also doing a lot of work to make people feel better than they should.
The relevant question isn’t whether Claude Mythos could take down a hardened enterprise network today. The relevant question is what happens when these capabilities are six months more mature, running on more compute, with better scaffolding around them. AISI’s own data suggests that inference compute improves performance — throw more GPUs at it and the models get better at these tasks. The time and cost curves are collapsing simultaneously.
David Sax, who advises the Trump cabinet on tech policy, pushed back on the doomsday framing: Mythos “is not magic, not a doomsday device.” He expects all leading Chinese models to reach the same capability level within six months. His point is that treating Mythos as uniquely dangerous misses the broader trend — this is where frontier models are going, full stop.
He’s probably right about the trajectory. That doesn’t make the trajectory less concerning. It makes it more so.
The Non-Obvious Detail: This Isn’t a One-Off
The most important thing about GPT-5.5 passing the Last Ones benchmark isn’t GPT-5.5. It’s what GPT-5.5 proves about Claude Mythos.
Before GPT-5.5’s results, there was a plausible narrative that Anthropic was overstating Mythos’s capabilities — that the benchmark results were cherry-picked or that the model was being positioned as scary for marketing reasons. That narrative is now harder to sustain. An independent model from a competing lab, evaluated by the same government-backed body, produced comparable results.
Mythos wasn’t an anomaly. It was a leading indicator.
This is the pattern with frontier AI capabilities: one lab demonstrates something that looks exceptional, and within months it becomes the baseline. The 3/10 and 2/10 completion rates on the Last Ones benchmark are early numbers from early models. The question isn’t whether future models will do better — they will. The question is how much better, how fast, and who has access when they do.
For builders thinking about how to construct AI systems that interact with security-sensitive infrastructure, this is the context that matters. Platforms like MindStudio give you 200+ models and a visual builder for chaining agents and workflows — which is useful precisely because the right model for a given task keeps changing as the capability landscape shifts. Model-agnostic infrastructure is increasingly the sensible default.
The Compute Problem Nobody Is Talking About Enough
Here’s the part of this story that gets less attention than the benchmark numbers: Anthropic may not have enough compute to serve the demand that Mythos is generating.
The White House blocked Anthropic’s plan to expand Mythos preview access from roughly 50 organizations to 120. One of the stated reasons was national security. The other was that officials weren’t confident Anthropic could serve 120 organizations without degrading the government’s own access to the model.
Anthropic disputes that compute is the limiting factor. They’ve signed deals with Amazon, Google, and Broadcom to expand capacity. But those buildouts take time, and the compute isn’t online yet.
Mythos is the first model in a new class above Opus in Anthropic’s hierarchy — Haiku, Sonnet, Opus, and now Mythos. It’s substantially larger than anything Anthropic has offered publicly before. Running it at scale, especially for the kind of sustained multi-step tasks the Last Ones benchmark requires, consumes significantly more compute than running Sonnet or Opus. If you wanted to use Mythos to run security audits across every major company in America, you’d need compute that doesn’t currently exist in sufficient quantity.
This is why the access question isn’t just political. It’s physical. There’s a real scarcity here, and scarcity means prioritization, and prioritization means someone decides who gets the model and who doesn’t. That’s a different kind of infrastructure problem than most AI discussions acknowledge. The Anthropic compute shortage post covers the quota tightening in more detail if you’re hitting limits in your own workflows.
What You Should Actually Watch For
The benchmark results themselves are already public. What’s worth tracking going forward is more specific.
The completion rate trajectory. 3/10 and 2/10 are the current numbers. Watch what happens when AISI re-runs these evaluations with the next generation of models, with more inference compute, with better scaffolding. The number that matters isn’t the current completion rate — it’s the slope.
Whether the controlled environment caveat holds. AISI is honest that they don’t know how these models perform against real hardened systems. At some point, that question gets answered, either through deliberate red-teaming or through something less controlled. The gap between benchmark performance and real-world performance is the most important unknown in this space right now.
The defender access question. Both Anthropic and OpenAI are framing their rollouts around getting these models into the hands of “defenders” — security teams who can use the same capabilities to find and patch vulnerabilities faster. That framing is correct as far as it goes. The 27-year-old OpenBSD vulnerability that Mythos found is exactly the kind of thing defenders need these models for. The question is whether the access controls around offensive use are keeping pace with the access expansion for defensive use.
The cost floor. $1.73 to solve a 12-hour reverse-engineering challenge is already low. API costs for frontier models have been declining consistently. What does this threat model look like when the same task costs $0.17? The capability isn’t static, and neither is the economics.
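The cost-floor question can be made concrete with one line of arithmetic: expected spend per *successful* end-to-end run is the per-attempt cost divided by the success rate. A hedged sketch — note that $1.73 was the measured cost of the reverse-engineering sub-task, so using it (and $0.17) as a full-run cost here is an illustrative assumption, not a figure from the AISI report:

```python
# Hypothetical cost-per-success arithmetic. The dollar figures below are
# illustrative assumptions; only the 30% success rate is from the AISI results.

def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful completion: the number of attempts
    until success is geometric, so expected cost = cost / rate."""
    return cost_per_attempt / success_rate

for cost in (1.73, 0.17):
    print(f"${cost:.2f} per attempt at a 30% rate: "
          f"${cost_per_success(cost, 0.3):.2f} expected per success")
```

Even at today's assumed prices, the expected cost of one full compromise is in single-digit dollars; a 10x price drop puts it under a dollar. That is the collapsing economics the section describes.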
For teams building security-adjacent tooling, the practical implication is that the threat model your system was designed against six months ago may already be outdated. If you’re building applications that need to reason about security posture — vulnerability scanning, code review, compliance checking — the spec-driven approach that tools like Remy use is worth understanding: you define the application’s behavior and constraints in annotated markdown, and the full-stack implementation gets compiled from that spec. When the underlying capability landscape shifts, you update the spec rather than rewiring the implementation.
The broader point is that these benchmark results aren’t a reason to panic and they’re not a reason to dismiss. They’re a reason to update your model of what AI systems can do, and to build accordingly.
The Last Ones benchmark was designed to measure something specific: whether an AI model can sustain a multi-step corporate network attack from start to finish, without human guidance, against a simulated target. Two models can now do that, sometimes. The word “sometimes” is doing a lot of work in that sentence, and it’s doing less work every quarter.
That’s the honest read of what AISI measured. Everything else — the policy fights, the compute constraints, the access restrictions — is downstream of that underlying capability fact. The models can do this now. More models will be able to do it soon. The question is what the infrastructure around that capability looks like, and who’s building it thoughtfully versus who’s just watching the benchmark numbers go up.