GPT-5.5 vs Claude Mythos on Cybersecurity: Which AI Is More Dangerous?
GPT-5.5 scored 71.4% on expert cyber tasks and cracked a reverse-engineering challenge in 10 minutes for $1.73. Claude Mythos scored 68.6%.
Two Models Walk Into a Cyberattack Simulation
GPT-5.5 and Claude Mythos are now the only two AI models known to complete a full end-to-end cyberattack simulation. That’s the choice you’re actually navigating: which of these systems is more capable, more dangerous, and more relevant to how you think about AI in security contexts. The numbers are close enough to matter. GPT-5.5 scored 71.4% on expert-level cyber tasks. Claude Mythos scored 68.6%. GPT-5.5 solved a reverse-engineering challenge in 10 minutes and 22 seconds for $1.73 in API costs — a task estimated to take a human expert 12 hours.
That cost figure is the one worth sitting with. Not the benchmark percentage, not the completion rate. $1.73.
When you compress 12 hours of expert labor into 10 minutes and under two dollars, you haven’t just made a task faster. You’ve changed who can do it. That’s the actual story here, and it’s worth understanding both models clearly before drawing conclusions.
What the Benchmark Actually Measures
The AISI — the UK’s AI Security Institute — runs an evaluation called “The Last Ones.” It’s a 32-step simulated corporate network attack. Not a quiz. Not a capture-the-flag puzzle. A sustained, multi-step intrusion sequence against a simulated enterprise target, the kind of thing that would take a human expert roughly 20 hours to complete end-to-end.
Before this month, no AI model had completed it end-to-end. Then Claude Mythos did — 3 out of 10 attempts. Then GPT-5.5 did — 2 out of 10 attempts.
Three things are worth noting about those completion rates. First, they're low in absolute terms. Success rates of 30% and 20% sound unimpressive until you realize the baseline was zero. Second, these are sandboxed environments without active defenses, triggered alerts, or real-time response — AISI explicitly noted they don't know how performance would translate to hardened real-world systems. Third, the trend line matters more than the current number. Both models are improving. Both are getting cheaper to run. The 32-step simulation that stumped every model six months ago now has two solutions.
The expert-level cyber task scores — 71.4% for GPT-5.5, 68.6% for Claude Mythos — are a broader measure across many task types, not just the end-to-end simulation. GPT-5.5 leads by about 2.8 percentage points. That’s a real gap but not a decisive one. These models are operating in the same capability tier.
GPT-5.5: The Faster, Cheaper Attacker
The reverse-engineering result is the most concrete data point we have on GPT-5.5’s cyber capabilities. AISI highlighted a specific challenge: GPT-5.5 solved it in 10 minutes and 22 seconds, spending $1.73 in API costs. The human expert baseline was approximately 12 hours.
That’s not a 10x improvement. It’s closer to 70x on time, and the cost comparison isn’t even meaningful — no human expert works for $1.73.
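If you want to check that multiplier yourself, it falls straight out of the figures reported above:

```python
# Sanity check on the speedup claim, using only the figures reported above.
human_minutes = 12 * 60           # ~12-hour expert baseline
model_minutes = 10 + 22 / 60      # 10 minutes 22 seconds
print(f"{human_minutes / model_minutes:.0f}x")  # -> 69x, i.e. roughly 70x
```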
What this tells you about GPT-5.5 is that it’s particularly strong at the kind of structured, well-defined problem that reverse engineering represents. You have a binary or a compiled artifact, you have a goal, and you need to reason systematically through layers of abstraction to find the vulnerability or understand the behavior. GPT-5.5 appears to be very good at this.
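To make that concrete, here is a minimal sketch of what such a structured reverse-engineering loop looks like in practice. This is not the AISI harness (its internals aren't public), just the general shape: disassemble the artifact, hand the model the output and a goal, read back the analysis. The model identifier, prompt, and binary path are all placeholders, and it assumes the OpenAI Python SDK plus objdump on the path:

```python
# A minimal structured reverse-engineering loop. NOT the AISI harness;
# the model name, prompt, and binary path are hypothetical placeholders.
import subprocess
from openai import OpenAI

client = OpenAI()

def analyze_binary(path: str) -> str:
    # Disassemble the artifact; objdump is one of many possible front-ends.
    disasm = subprocess.run(
        ["objdump", "-d", path], capture_output=True, text=True, check=True
    ).stdout[:100_000]  # truncate so the prompt stays bounded
    response = client.chat.completions.create(
        model="gpt-5.5",  # hypothetical identifier
        messages=[
            {"role": "system", "content": (
                "You are a reverse engineer. Explain what this binary does "
                "and flag any likely vulnerabilities, step by step.")},
            {"role": "user", "content": disasm},
        ],
    )
    return response.choices[0].message.content

print(analyze_binary("./challenge.bin"))
```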
The 71.4% expert cyber task score puts it slightly ahead of Mythos on the aggregate measure. OpenAI is also rolling GPT-5.5 cyber capabilities out to what they’re calling “critical defenders” — vetted organizations using it to find and patch vulnerabilities before attackers do.
One practical note for anyone building security tooling: GPT-5.5’s API costs are substantially higher than most models ($5 per million input tokens, $30 per million output tokens). The $1.73 figure for that reverse-engineering challenge reflects a relatively short, focused task. Sustained agentic security workflows at scale will cost more. If you’re building multi-step vulnerability scanning pipelines, the economics matter. Platforms like MindStudio handle this orchestration across 200+ models, which matters when you’re trying to route different task types to the most cost-effective model rather than running everything through the most expensive one.
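A back-of-the-envelope cost model makes the point. The per-token prices below are the ones quoted above; the per-step token counts are hypothetical, chosen only to illustrate how quickly a sustained agentic workflow outgrows a $1.73 one-shot task:

```python
# Back-of-the-envelope cost model at the stated GPT-5.5 pricing. Prices come
# from the paragraph above; the per-step token counts are hypothetical.
INPUT_PER_MTOK = 5.00    # $ per 1M input tokens
OUTPUT_PER_MTOK = 30.00  # $ per 1M output tokens

def step_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1e6

# A 32-step agentic workflow, each step reading ~40k tokens of context and
# writing ~4k tokens of analysis:
total = sum(step_cost(40_000, 4_000) for _ in range(32))
print(f"${total:.2f}")  # -> $10.24 per run, before retries
```

At roughly $10 per run, a nightly scan across a few hundred services is real money. That's the economics argument in one number.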
Claude Mythos: The Deeper Digger
Mythos scored 68.6% on expert cyber tasks and completed the Last Ones simulation in 3 out of 10 attempts — one more than GPT-5.5. On the headline benchmark, it trails slightly. On the end-to-end simulation, it leads slightly. Neither gap is large enough to declare a winner.
What distinguishes Mythos is the kind of finding it surfaces. The most striking example: Mythos found a 27-year-old OpenBSD vulnerability. A security flaw that had existed, undetected, since roughly 1998. Not a new vulnerability it created — AI doesn’t create vulnerabilities — but one that had been sitting in production code for nearly three decades without anyone finding it.
That’s a different capability profile than fast reverse engineering. It suggests Mythos is particularly strong at deep, patient analysis of large codebases — the kind of work where the vulnerability isn’t obvious and requires understanding subtle interactions across many components. The OpenBSD finding is the kind of result that makes security researchers take notice, because it implies the model can find things that experienced humans missed for years.
Mythos sits above Opus in Anthropic’s model hierarchy — Haiku → Sonnet → Opus → Mythos — which means it’s also substantially more compute-intensive. This is part of why access is restricted to roughly 50 organizations, and why Anthropic’s attempt to expand to 120 organizations ran into White House intervention. Anthropic has signed compute deals with Amazon, Google, and Broadcom, but those buildouts take time. The model that found a 27-year-old OpenBSD bug is also the model you might have to wait in line to use.
For a deeper look at how Mythos compares to Anthropic’s previous flagship, the Claude Mythos vs Opus 4.6 cybersecurity capability gap analysis is worth reading — the jump from Opus to Mythos on security benchmarks is larger than most people expect.
The Dimensions That Actually Separate Them
Speed and cost on defined tasks. GPT-5.5 wins clearly here. The 10-minute, $1.73 reverse-engineering result is a concrete data point. For tasks with clear structure — analyze this binary, find the vulnerability in this function — GPT-5.5 appears faster and its API pricing, while high, is predictable.
Deep codebase analysis. Mythos appears stronger here, based on the OpenBSD finding. Finding a 27-year-old bug requires the kind of patient, wide-context reasoning that isn’t captured well by timed benchmarks. If you’re auditing a large legacy codebase, Mythos’s profile looks more relevant.
End-to-end attack simulation. Mythos: 3/10. GPT-5.5: 2/10. Mythos leads, but both are in the same range (a quick interval check after this list shows how wide that range really is). Neither is reliable enough to run autonomously without human oversight — which is probably the right constraint for now.
Availability. GPT-5.5 is more accessible. Mythos access is restricted to ~50 organizations, and the White House has blocked Anthropic’s attempt to expand to 120. GPT-5.5 is being actively rolled out to defenders. If you need to actually use one of these models today, GPT-5.5 is the realistic option for most organizations.
Aggregate expert task score. GPT-5.5: 71.4%. Mythos: 68.6%. A real but narrow gap. Not decisive on its own.
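About those completion rates: with only 10 runs apiece, the uncertainty is large. A Wilson score interval, a standard way to put error bars on small-sample proportions, shows just how much the two overlap:

```python
# Wilson 95% intervals on the completion rates (n = 10 runs each).
from math import sqrt

def wilson(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson(3, 10))  # Mythos:  ~(0.11, 0.60)
print(wilson(2, 10))  # GPT-5.5: ~(0.06, 0.51)
```

Both intervals span most of the same range, which is why "in the same capability tier" is the honest read rather than a hedge.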
The Dual-Use Problem Neither Score Captures
Both models are described by the labs as tools for “defenders” — organizations that find vulnerabilities to patch them. That framing is accurate but incomplete.
The same capability that lets a defender find a 27-year-old OpenBSD bug lets an attacker find it first. The same 10-minute reverse-engineering capability that helps a security researcher analyze malware helps someone else build it. This isn’t a new problem — every security tool is dual-use — but the cost and accessibility curve is new.
When a task that required 12 hours of expert labor costs $1.73 and 10 minutes, you’ve changed the population of people who can do it. Not just made it faster for experts. Made it accessible to people who aren’t experts, who don’t speak English natively, who don’t have the background to find these vulnerabilities through traditional means. The AISI benchmarks measure capability. They don’t measure the distribution of who will use that capability.
David Sax’s counter-framing is worth taking seriously here: Mythos “is not magic, not a doomsday device,” and he expects all leading Chinese models to reach the same capability within six months. If that’s right, then restricting Mythos access doesn’t prevent the capability from existing in the world — it just determines who has it first. The argument for getting these models into defenders’ hands quickly is that the offensive capability is coming regardless, and defenders need the same tools.
That’s a reasonable argument. It’s also the argument that has been made for every dual-use technology throughout history, and it’s sometimes right and sometimes wrong.
The Pentagon Signal
One data point that doesn’t show up in the benchmark tables: the Pentagon signed AI agreements with eight companies — SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, and Oracle. Anthropic is notably absent.
This matters for the GPT-5.5 vs. Mythos comparison because it tells you something about institutional trust and deployment trajectory. GPT-5.5 is being integrated into government and defense contexts. Mythos is being restricted from expanding beyond its current ~50 organizations. Whatever the benchmark scores say, the deployment reality is that GPT-5.5 is more likely to end up in production security workflows in the near term.
For teams building security tooling today, that’s a practical consideration. The GPT-5.5 vs Claude Opus 4.7 coding comparison covers some of the broader capability differences that inform how these models behave in agentic pipelines, which is increasingly how security tools get deployed.
Which One to Use, and When
Use GPT-5.5 if you need fast, cost-predictable analysis of well-defined security tasks. Reverse engineering, binary analysis, CVE triage, structured vulnerability scanning — tasks where the problem is clear and speed matters. It’s more accessible, more deployable, and the $1.73 reverse-engineering result suggests it punches hard on focused problems.
Use Mythos if you can get access to it and you’re doing deep, open-ended codebase auditing where the vulnerability isn’t obvious. The OpenBSD finding suggests it has a different kind of patience for complex, long-horizon analysis. If you’re auditing legacy infrastructure where the bugs have had decades to hide, Mythos’s profile is more relevant.
For most organizations, the choice is made for you by availability. Mythos is restricted. GPT-5.5 is not. The benchmark gap is narrow enough that this practical constraint dominates the decision.
The more interesting question is what happens in six months, when — if Sax is right — multiple models reach this capability tier. At that point, the comparison shifts from “which model found the 27-year-old bug” to “which infrastructure lets you deploy these capabilities reliably, route tasks to the right model, and maintain oversight.” That’s a different problem than picking a benchmark winner. Tools like Remy point at one version of this future: you write a spec for what you want the system to do, and the full-stack implementation — backend, database, deployment — gets compiled from it. The source of truth shifts up the abstraction stack, which matters when the underlying models are changing every few months anyway.
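At its simplest, that routing layer is just a dispatch table keyed on task type. The model identifiers below are hypothetical, and the policy merely mirrors the capability profiles sketched in this piece:

```python
# A dispatch table is the simplest version of "route tasks to the right
# model". Model identifiers are hypothetical; the policy mirrors the
# capability profiles described above.
ROUTES = {
    "reverse_engineering": "gpt-5.5",          # fast, cost-predictable
    "cve_triage": "gpt-5.5",
    "legacy_codebase_audit": "claude-mythos",  # deep, patient, access-gated
}

def route(task_type: str) -> str:
    # Default to the broadly available model for anything unclassified.
    return ROUTES.get(task_type, "gpt-5.5")

assert route("legacy_codebase_audit") == "claude-mythos"
assert route("unknown_task") == "gpt-5.5"
```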
The benchmark scores will keep moving. GPT-5.5’s 71.4% and Mythos’s 68.6% are snapshots. The $1.73 reverse-engineering result is the number that will age least gracefully — not because it’s wrong, but because it will look quaint when the next model does it in two minutes for thirty cents.
What won’t change is the underlying dynamic: these models find real vulnerabilities, they’re getting faster and cheaper, and the population of people who can use them is expanding. That’s the comparison that matters most, and neither benchmark score fully captures it.
For context on how the broader model landscape has shifted around these two, the GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmark comparison shows where the previous generation stood — which makes the jump to Mythos and GPT-5.5’s cyber capabilities easier to calibrate.