Claude Mythos: 5 Alarming Capabilities Buried in the Government Security Reports
From a 27-year-old OpenBSD zero-day to completing a 32-step network attack sim — here are the most alarming Claude Mythos findings from government testing.
A 27-Year-Old Bug, a Blocked Expansion, and the Pentagon’s Snub
Claude Mythos found a vulnerability in OpenBSD that had gone undetected for 27 years. The UK’s AI Security Institute confirmed it. The White House responded by blocking Anthropic from expanding Mythos access. And the Pentagon signed AI agreements with eight companies — SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, and Oracle — and left Anthropic off the list entirely.
If you’re building on top of frontier AI models, or evaluating which ones to trust with sensitive workloads, the government security reports on Mythos tell you something that the marketing materials don’t. Here are the most alarming findings buried in them.
The OpenBSD Finding Is the One That Should Stop You Cold
A 27-year-old vulnerability sitting in OpenBSD is not a minor footnote. OpenBSD has a reputation as one of the most security-hardened operating systems in existence. It’s used in firewalls, routers, and infrastructure where “secure by default” is the whole point. The fact that a bug survived there for nearly three decades — through countless audits, security reviews, and expert eyes — and Claude Mythos surfaced it, tells you something specific about what this model class can do.
This wasn’t a known CVE that Mythos re-flagged. It was something that had genuinely escaped detection. That’s the distinction that matters.
For context on where Mythos sits in Anthropic’s model hierarchy: it’s above Opus. The current public ladder runs Haiku → Sonnet → Opus, and Mythos is the first model in a new class above that. It’s not a bigger Opus. It’s a different tier entirely, which is part of why it creates a compute crunch — running it at scale requires significantly more infrastructure than anything Anthropic has previously deployed publicly. If you want a fuller picture of what Mythos is and how it was discovered, the benchmarks and leak history are worth reviewing separately.
What the AISI ‘Last Ones’ Benchmark Actually Measures
The UK’s AI Security Institute runs a cyber evaluation called “Last Ones.” It’s a 32-step simulated corporate network attack — the kind of sustained, multi-stage intrusion that would be required to actually bring down an enterprise network. AISI estimates a human expert would need roughly 20 hours to complete it end-to-end.
Claude Mythos completed it in 3 out of 10 attempts. GPT-5.5 completed it in 2 out of 10 attempts.
Those numbers sound low until you consider what they mean at scale. A 30% completion rate on a 32-step attack simulation, running autonomously for nothing more than the cost of API calls, is not a theoretical risk. It's a repeatable capability. And the cost curve is collapsing fast.
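The "repeatable capability" point follows from simple probability. Treating each autonomous run as independent with a fixed 30% success rate — an idealization, since real runs aren't independent — the chance of at least one full completion climbs quickly with retries:

```python
# Probability of at least one full completion of the 32-step attack
# across k autonomous runs, assuming a fixed 30% per-run success rate
# (an idealized assumption; real attempts are not independent).

P_SUCCESS = 0.3

def p_at_least_one(k: int) -> float:
    """Chance of at least one success in k independent attempts."""
    return 1 - (1 - P_SUCCESS) ** k

for k in (1, 5, 10, 20):
    print(f"{k:2d} runs -> {p_at_least_one(k):.1%}")
```

At ten runs the probability is already above 97% — which is why a "3 out of 10" headline number understates what an attacker willing to retry actually gets.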
On the GPT-5.5 evaluation, AISI highlighted a specific reverse-engineering challenge the model solved in 10 minutes and 22 seconds, for $1.73 in API costs. A human expert would need approximately 12 hours for the same task. That’s not a benchmark number — that’s a price-per-exploit calculation. And it’s going to get cheaper.
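The cost asymmetry is easy to make concrete. A back-of-the-envelope sketch — the $1.73 API cost and 12-hour human estimate are from the AISI report above, while the expert hourly rate is purely an illustrative assumption:

```python
# Back-of-the-envelope price-per-exploit comparison for the AISI
# reverse-engineering challenge. API cost and human time come from the
# report; the expert hourly rate is a hypothetical assumption.

API_COST_USD = 1.73      # model's API cost to solve the challenge
HUMAN_HOURS = 12         # AISI's estimate for a human expert
EXPERT_RATE_USD = 150    # assumed hourly rate (illustrative only)

human_cost = HUMAN_HOURS * EXPERT_RATE_USD  # cost of the human baseline
ratio = human_cost / API_COST_USD           # how much cheaper the model is

print(f"Human cost: ${human_cost:,.2f}")
print(f"Model cost: ${API_COST_USD:.2f}")
print(f"Cost ratio: {ratio:,.0f}x")
```

Under that assumed rate, the model solve is roughly three orders of magnitude cheaper — and as the article notes, the API side of that ratio is the one still falling.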
GPT-5.5 scored 71.4% on expert-level cyber tasks. Claude Mythos scored 68.6%. They’re close enough that the gap is essentially noise. What’s significant is that two frontier models now sit above whatever threshold the government considers alarming, and both got there within months of each other. This is not a one-lab anomaly.
The Compute Problem Is Inseparable from the Security Problem
The White House blocked Anthropic from expanding Mythos preview access from roughly 50 organizations to 120. It gave two stated reasons: national security concerns about wider access, and doubts about whether Anthropic has enough compute to serve both new organizations and the federal government without degradation.
Anthropic disputes the compute framing. They’ve signed deals with Amazon, Google, and Broadcom to expand infrastructure — but those buildouts take time. The capacity isn’t online yet.
Here’s the tension: Anthropic’s stated goal for the Mythos preview was to get it into the hands of more “defenders” — security teams, researchers, and organizations that can use its vulnerability-finding capabilities to patch systems before attackers exploit them. The White House position is essentially that wider access creates more attack surface, not less, and that the government’s own priority access shouldn’t be diluted.
The compute scarcity isn’t just a business problem. When a model is this capable and this resource-intensive, who gets priority access becomes a policy question. The federal government, apparently, does not want to be standing in line. That’s a reasonable position, and also a preview of how AI infrastructure will increasingly be treated — less like SaaS, more like controlled national infrastructure.
The compute shortage affecting Claude limits has been building for a while. The Mythos situation is the most visible expression of it yet.
The Pentagon Snub and What It Actually Signals
The Department of Defense signed AI agreements with eight companies: SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, and Oracle. Anthropic is not on that list.
This is the downstream consequence of the Pentagon dispute earlier this year, where Anthropic refused to remove red lines around autonomous warfare and mass surveillance of American citizens. Those were the two specific limits Anthropic wouldn’t cross. The Pentagon walked. OpenAI stepped in.
The politics here are genuinely complicated. Anthropic has historically been the AI safety lab — the one most associated with careful deployment, alignment research, and regulatory engagement. That positioning created friction with the Trump administration, which has its own skepticism of the Biden-era AI safety framework that Anthropic helped shape.
The result is that the lab with arguably the most dangerous model — the one the White House is actively restricting — is also the one excluded from formal Pentagon AI agreements. That’s a strange position to be in. It means Mythos is simultaneously too dangerous to expand and not trusted enough to be in the official defense supply chain.
Dean Ball, an AI policy analyst with close government ties, put it plainly: “The government restricting the release of AI models is a type of licensing regime. It’s an informal, highly improvised licensing regime, but a licensing regime nonetheless.” He also noted this appears to be “the very first case of the US government restricting rollout of a new AI model based on policy considerations.” There are no formal laws, no legislative body issuing licenses — just the White House telling a private company it can’t expand its customer list.
The Non-Obvious Detail: These Are Sandboxed Evaluations
AISI is explicit about something that tends to get lost in the coverage: these are controlled evaluations. The simulated environments lack active defenses, don’t trigger real alerts, and don’t include defensive tooling responding in real time. It’s closer to a PvE scenario than a live network intrusion.
AISI explicitly said they don’t know how these models would perform against real-world hardened systems. The 3/10 completion rate on “Last Ones” might be higher or lower against an actual enterprise with active security operations. Nobody knows yet.
That caveat matters — but it doesn’t neutralize the concern. The time and cost curves are collapsing regardless of the sandboxing. Whatever these models can do in a controlled evaluation today, they’ll do faster and cheaper in six months. David Sacks, who advises the Trump cabinet and generally pushes back on AI doomsday framing, made this point directly: Mythos “is not magic, not a doomsday device,” and he expects all leading Chinese models to reach the same capability level within six months.
If Sacks is right, the question isn’t whether to restrict Mythos specifically. It’s whether restriction accomplishes anything when the capability is about to be widely distributed anyway.
What the Dual-Use Problem Actually Looks Like
The standard counterargument to AI cybersecurity concerns goes something like: skilled engineers can already find these vulnerabilities, so AI doesn’t change the threat landscape. That argument misses the point.
The relevant comparison isn’t AI versus a world-class security researcher. It’s AI versus someone who previously couldn’t do this at all. The $1.73 reverse-engineering solve isn’t impressive because it beats a human expert on cost — it’s significant because it puts that capability within reach of anyone with an API key and a credit card.
Language barriers disappear. Technical prerequisites disappear. The population of people who can attempt a sophisticated network intrusion expands dramatically. That’s the actual risk surface, and it’s not captured by benchmarking AI against the top 1% of security professionals.
This is also why the “defenders first” framing from Anthropic has real logic behind it. If these capabilities are going to diffuse regardless — through open-source Chinese models, through other frontier labs, through whatever comes next — then the question becomes whether defenders can use the same tools to patch vulnerabilities faster than attackers can exploit them. The cybersecurity capability gap between Mythos and earlier Claude models is large enough that this isn’t a marginal improvement for defenders. It’s a different class of tool.
What Builders Should Be Watching
If you’re building AI-powered security tooling, or integrating frontier models into workflows that touch sensitive infrastructure, a few things are worth tracking closely.
First, the compute priority question is going to get more explicit. Anthropic’s deals with Amazon, Google, and Broadcom will eventually bring more capacity online, but the period between now and then is one where access to Mythos-class models will be rationed. Knowing where your organization sits in that queue matters.
Second, the informal licensing regime Dean Ball describes is likely to become more formal. The White House blocking a customer expansion without any legal authority to do so is not a stable equilibrium. Either laws get written, or the practice stops, or it escalates into something messier. Any of those outcomes changes the compliance landscape for organizations building on frontier models.
Third, inference compute improves performance on these evaluations. AISI noted this directly — more hardware means better results on cyber tasks. As GPU availability increases and inference costs drop, the 3/10 and 2/10 completion rates on “Last Ones” are floors, not ceilings.
For teams building security-adjacent agents, the orchestration layer matters as much as the model. Platforms like MindStudio support 200+ models and 1,000+ integrations, which means you can swap in different model tiers for different risk levels — using a lighter model for routine scanning and reserving frontier-class inference for the tasks that actually need it. That kind of model-routing discipline is going to matter more as compute costs get explicit.
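That routing discipline can be as simple as a lookup from a task's risk tier to a model class. A minimal sketch — the tier names and model identifiers below are illustrative placeholders, not any platform's actual API:

```python
# Minimal model-routing sketch: send routine work to a cheap model tier
# and reserve frontier-class inference for high-risk tasks. All tier
# names and model identifiers are hypothetical placeholders.

from enum import Enum

class Risk(Enum):
    ROUTINE = "routine"     # e.g. scheduled scans, log triage
    ELEVATED = "elevated"   # e.g. anomaly investigation
    CRITICAL = "critical"   # e.g. novel vulnerability analysis

ROUTING_TABLE = {
    Risk.ROUTINE: "light-model-v1",      # cheap, fast
    Risk.ELEVATED: "mid-model-v1",       # balanced cost/capability
    Risk.CRITICAL: "frontier-model-v1",  # expensive, most capable
}

def route(task_risk: Risk) -> str:
    """Return the model identifier for a task's risk tier."""
    return ROUTING_TABLE[task_risk]

print(route(Risk.ROUTINE))   # light model handles routine scanning
print(route(Risk.CRITICAL))  # frontier inference only where needed
```

The design point is that the routing table, not the calling code, encodes the cost policy — so when compute pricing or model availability shifts, the policy changes in one place.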
The Uncomfortable Conclusion
The most alarming thing in the government security reports on Claude Mythos isn’t any single finding. It’s the combination: a model that found a 27-year-old zero-day in one of the most hardened operating systems in existence, that can complete a 32-step network attack simulation nearly a third of the time, that is simultaneously too dangerous to expand and too politically complicated to deploy in the official defense supply chain.
The White House restriction is described as the first of its kind. It probably won’t be the last. And the six-month window before Chinese models reach the same capability level — if Sacks is right — means the window for any restriction to matter is narrow.
The broader picture of what Mythos can do is still emerging. But the government security reports have already told us enough to know this isn’t a marketing story. The banks that got access had emergency meetings. The White House intervened without legal authority to do so. The Pentagon built its AI supply chain around eight companies and left the most safety-focused lab off the list.
That’s a lot of institutions acting like something real is happening. They’re probably right.