GPT-5.5 Solved a 12-Hour Reverse Engineering Challenge in 10 Minutes for $1.73

A task that takes a human security expert 12 hours cost GPT-5.5 $1.73 and 10 minutes. Here's what that means for offensive and defensive security.

MindStudio Team

GPT-5.5 Just Made a 12-Hour Security Task Cost $1.73

GPT-5.5 solved a reverse engineering challenge in 10 minutes and 22 seconds for $1.73 in API costs. The same task would take a human security expert roughly 12 hours. That’s not a rounding error — the time alone compresses by a factor of about 70x, and the cost compression is steeper still.
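The arithmetic behind that factor is worth making explicit. The time figures come straight from the AISI result; the expert hourly rate below is an illustrative assumption, not something the report states:

```python
# Rough arithmetic behind the ~70x time compression.
# Figures from the AISI result: 12 hours of expert time vs. 10 min 22 s.
human_seconds = 12 * 3600        # 43,200 s of expert labor
model_seconds = 10 * 60 + 22     # 622 s of model wall-clock time

speedup = human_seconds / model_seconds
print(f"time compression: ~{speedup:.0f}x")   # ~69x

# Cost side: at an illustrative $150/hour expert rate (an assumption,
# not from the article), labor cost dwarfs the $1.73 API bill.
expert_cost = 12 * 150           # $1,800
api_cost = 1.73
print(f"cost compression: ~{expert_cost / api_cost:.0f}x")
```

Even with a conservative labor rate, the cost ratio lands in the hundreds, which is why the $1.73 figure carries more weight than the raw speedup.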

This comes from AISI, the UK’s AI Security Institute, which is a government-backed evaluation body that tests frontier models for dangerous capabilities. They ran GPT-5.5 through a series of expert-level cybersecurity tasks and published the results. The reverse engineering figure is the one that should stop you mid-scroll.

If you work anywhere near security tooling, offensive research, or infrastructure defense, the implications here are worth sitting with for a few minutes.

The Benchmark Numbers Behind the Headline

AISI runs a test called the Last Ones — a 32-step simulated corporate network attack that they estimate would take a human expert around 20 hours to complete end-to-end. It’s not a CTF puzzle. It’s a multi-phase simulation of the kind of sustained attack you’d run against an enterprise target: reconnaissance, lateral movement, privilege escalation, the whole chain.

GPT-5.5 completed the Last Ones in 2 out of 10 attempts. Claude Mythos completed it in 3 out of 10 attempts. Both numbers are low in absolute terms, but the point isn’t the completion rate — it’s that these models can complete it at all. Before Mythos, no model had done it. Now there are two.


On the broader expert cyber task scoring, GPT-5.5 came in at 71.4%. Claude Mythos scored 68.6%. The gap is narrow enough that you’d call it a tie in most contexts. What matters is that both models are now operating in the same capability tier on offensive security tasks — and that tier didn’t exist at the frontier model level until very recently.

The $1.73 reverse engineering result is the sharpest illustration of what this tier means in practice. AISI highlighted it specifically because it collapses two variables simultaneously: time and cost. Twelve hours of expert labor versus ten minutes of API calls. The dollar figure is almost beside the point, but it makes the comparison concrete in a way that “faster” doesn’t.

For context on how GPT-5.5 stacks up against Claude on other dimensions, the GPT-5.5 vs Claude Opus 4.7 coding comparison is worth reading — GPT-5.5 uses significantly fewer output tokens on equivalent tasks, which compounds the cost advantage in agentic workloads.

Why This Matters for Anyone Building Security Tooling

The obvious read is “AI can now help attackers.” That’s true, but it’s also the less interesting half of the story.

The more immediate implication is for defenders. If you’re running a security team and you’re not experimenting with these models for vulnerability discovery, you’re operating at a structural disadvantage. The same capability that lets GPT-5.5 solve a reverse engineering challenge in ten minutes can be pointed at your own codebase to find what’s already there.

Claude Mythos demonstrated this concretely when it found a 27-year-old OpenBSD vulnerability — something that had gone undetected for nearly three decades across countless audits and reviews. The model didn’t create a new attack surface. It found one that had always existed. That’s the distinction David Sax draws in his counter-framing: these models are microscopes, not weapons factories. They expose what’s already there.

The dual-use nature is real, but the asymmetry matters. Defenders have legitimate access to their own systems. Attackers have to find a way in first. If defenders can run a $1.73 reverse engineering pass over their own binaries before an attacker does, that’s a meaningful advantage — assuming they actually do it.

OpenAI is already rolling out GPT-5.5 Cyber to what they’re calling a “critical defenders” list. The framing from the labs is consistent: get vetted defenders access first, let them find and patch vulnerabilities, then worry about broader availability.

The Non-Obvious Detail: These Are Sandboxed Environments

Here’s the thing AISI explicitly stated that most coverage glosses over: they don’t know how these models would perform against real-world hardened systems.

The Last Ones benchmark runs in a simulated environment. There are no active defenses. No triggered alerts. No blue team responding in real time to anomalous behavior. It’s closer to a PvE scenario than actual adversarial conditions. A model that completes 2 or 3 out of 10 attempts in a static simulation might perform very differently against a system with active monitoring, behavioral detection, and a security operations center watching the logs.

That caveat doesn’t make the results less significant — the time and cost compression is real regardless of the environment. But it does mean the “AI can now autonomously hack enterprise networks” framing is ahead of what the evidence actually shows. AISI is careful about this. The benchmarks measure capability in controlled conditions, not operational effectiveness in the wild.


What the benchmarks do show clearly is the trajectory. Inference compute improves performance — more GPUs, better results. The models that score 71.4% today will score higher in six months on cheaper hardware. The $1.73 figure will become $0.50. The 10-minute solve will become 3 minutes. Whatever ceiling exists on these evaluations, the models are approaching it faster than most security teams are adapting.

This is also why the compute question matters more than it might seem. Anthropic’s compute constraints are directly relevant here — Mythos sits above Opus in Anthropic’s model hierarchy, in a new compute tier entirely, which is part of why access has been so restricted. Running Mythos-class models at scale against every major company’s infrastructure isn’t just a policy question; it’s a physics question about available GPU time.

The Access Control Problem Nobody Has Solved

The White House blocked Anthropic from expanding Mythos access from 50 to 120 organizations. Anthropic wanted to add 70 more vetted defenders. The White House said no, citing two reasons: national security concerns, and uncertainty about whether Anthropic had enough compute to serve both the expanded list and the federal government’s own usage.

Anthropic disputes the compute framing — they’ve signed new deals with Amazon, Google, and Broadcom. But those buildouts take time to come online, and in the interim, someone has to decide who gets priority access when demand exceeds supply. The federal government, unsurprisingly, doesn’t want to be in the queue behind 70 additional organizations.

What’s actually happening here is a soft licensing regime operating without formal legal authority. No laws were passed. No legislative body created a framework. The White House is simply exercising informal control over which organizations can access a specific model class, based on a combination of national security judgment and compute prioritization. It looks like licensing. It functions like licensing. It just doesn’t have the procedural legitimacy of licensing.

There’s also the Discord situation: an unauthorized group had Mythos access at some point, and the investigation is apparently still ongoing. The official count was 50 organizations. Unofficially, the perimeter was already leaky.

Dean Abal, an AI policy analyst with government experience, called this “building a dam against a tsunami.” His argument is that the underlying capabilities will diffuse in 6 to 18 months regardless — either from Western labs, from Chinese open-source releases, or from the general trajectory of model improvement. Restricting access to Mythos specifically buys time, but not much of it. GPT-5.5 is already at the same capability tier. The Chinese frontier labs are likely within six months of the same benchmarks. The dam metaphor is apt: it’s not that the dam is useless, it’s that you need to be building something else at the same time.

The more durable solution, Abal argues, is technical safeguards rather than access restrictions alone. If defenders can safely use stronger models to patch vulnerabilities faster than attackers can exploit them, that creates a structural advantage that doesn’t depend on keeping the models secret — because the models won’t stay secret.

What the Cost Curve Actually Means

The $1.73 figure deserves more attention than it’s getting.


Security research has always been expensive in human time. A 12-hour reverse engineering task isn’t something you run on every binary in your dependency tree. You prioritize, you triage, you accept that some things won’t get looked at. The economics of manual security review create gaps, and those gaps persist because filling them costs more than most organizations can justify.

At $1.73 per task, the economics change. You can run this kind of analysis at a scale that was previously impractical. You can check every third-party library in your build. You can run it on every new commit. You can build it into your CI pipeline as a standard step rather than a quarterly engagement with an external firm.

This is where the defender advantage becomes concrete. An attacker running GPT-5.5 against your systems is working from the outside, with limited visibility, against active defenses. A defender running the same model against their own systems has full access, complete context, and can act on the findings immediately. The same $1.73 that makes offense cheaper makes defense cheaper too — and defenders have structural advantages that attackers don’t.

Building that kind of automated security workflow is increasingly tractable. Platforms like MindStudio handle the orchestration layer — connecting models to your existing tools, chaining analysis steps, routing findings to the right systems — without requiring you to write the integration code from scratch. The bottleneck shifts from “can we build this” to “what do we do with the output.”

The tooling side is also moving fast. Peter Steinberger, the developer behind OpenClaw, shipped a small open-source utility called Codex Bar that tracks quota usage for both OpenAI Codex and Claude Code in real time. It’s a minor thing, but it’s indicative of the pace at which the ecosystem around these models is developing. The infrastructure for working with frontier models at scale is being built in public, incrementally, by people who are actually using the tools.

For teams thinking about how to operationalize AI-assisted security review, the spec-driven approach is worth considering. Remy takes a different angle on this problem: you write your application as an annotated markdown spec, and it compiles a complete TypeScript backend, database, auth layer, and deployment from that spec. The point isn’t to avoid code — the code still exists — it’s that the source of truth is now something more readable and maintainable than the implementation itself. For security tooling that needs to be auditable, that distinction matters.

What to Watch For

The benchmark that matters most right now isn’t the Last Ones completion rate. It’s the cost curve.

GPT-5.5 scored 71.4% on expert cyber tasks. Claude Mythos scored 68.6%. Both completed the 32-step Last Ones simulation. Both can solve reverse engineering challenges that would take human experts most of a workday. These are the current numbers, and they’ll be higher in six months.

The question for anyone building security tooling is: what does your workflow look like when this analysis costs $0.50 instead of $1.73? When it takes two minutes instead of ten? When the model that scores 71% today scores 85%?
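One way to make that question concrete is to put illustrative numbers on it. The commit volume below is an assumption chosen for round figures, not data from the article:

```python
# Back-of-envelope: annual cost of per-commit analysis at today's price
# versus the projected one. Commit volume is an illustrative assumption.
commits_per_year = 50 * 250  # ~50 commits/day across a mid-size org, 250 working days

for per_task in (1.73, 0.50):
    annual = commits_per_year * per_task
    print(f"${per_task:.2f}/task -> ${annual:,.0f}/year")
```

At either price point, the annual spend is a rounding error next to a single security engineer's salary, which is the real shape of the cost curve argument.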


The Claude Mythos benchmark results give you a sense of what the current ceiling looks like on the Anthropic side — 93.9% on SWE-bench, with the cybersecurity capabilities that prompted the White House intervention. GPT-5.5 is now in the same tier. The next models from both labs will push further.

AISI’s honest caveat — that they don’t know how these models perform against real-world hardened systems — is the right frame for now. But “we don’t know” is not the same as “it doesn’t matter.” The controlled evaluation results are real. The cost compression is real. The trajectory is clear enough that waiting for certainty before adapting your security posture is itself a risk decision.

The defenders who are experimenting with these models now, building the workflows, understanding the failure modes, will be better positioned when the capabilities improve. The ones waiting for the technology to mature before engaging will be catching up to a moving target.

Presented by MindStudio
