Skill Compression: How Claude Mythos Turns Mediocre Hackers into Elite Threat Actors at Scale
Mythos doesn't make one hacker better — it gives thousands of non-experts elite skills. Here's the skill compression concept and why scale makes it dangerous.
271 Vulnerabilities in One Release Cycle Is the Wrong Number to Focus On
Claude Mythos found 271 vulnerabilities in Firefox v150 in a single release cycle. That number is striking, but it’s not the number that should concern you most.
The number that matters is the one implied by skill compression: the idea that tasks previously requiring a team of highly paid expert security researchers are now accessible to non-experts running 20–100 parallel agent instances. That’s the structural change. The Firefox result is just one data point that makes the structure visible.
You don’t need to work in security to care about this. If you build software, deploy infrastructure, or work at a company that does either of those things, the threat model just changed in a way that affects you directly.
What Skill Compression Actually Means
The term gets used loosely, so it’s worth being precise. Skill compression doesn’t mean AI makes existing experts faster. It means AI collapses the barrier between “person with no relevant training” and “person capable of doing expert-level work.”
We’ve seen this before in adjacent domains. When ChatGPT launched, Amazon ebook submissions tripled — not because existing authors wrote more books, but because people who had never written a book before suddenly could. The same pattern hit iOS app submissions when agentic coding matured. A flat line for years, then a near-vertical climb. The existing developers didn’t change their behavior much. A new population of people entered the market.
Cybersecurity is the same dynamic, with higher stakes.
Before models like Mythos, finding and exploiting vulnerabilities in production-grade codebases required a specific combination of skills: deep knowledge of memory management, familiarity with browser internals, the ability to reason about parser disagreements, experience with fuzzing infrastructure, and years of pattern recognition built from reading CVEs. That combination is rare. The people who have it are expensive. And they can only work on one thing at a time.
Mythos removes most of those requirements. The attacker doesn’t need to understand why a particular parser disagreement creates an exploitable condition — the model does. The attacker doesn’t need to speak English fluently, or at all. The prompt can be in any language; the attack executes the same way.
That’s skill compression. The skill floor dropped dramatically. The number of people who can now attempt serious vulnerability research went from thousands to potentially millions.
The Scale Problem Is Separate and Worse
Skill compression is the first problem. Scale is the second, and they compound.
When a human security team hunts for vulnerabilities, they work serially. Even a large red team is fundamentally constrained by the number of people who can hold a complex codebase in their heads simultaneously. You can’t split a human brain across 20 different attack vectors.
You can split an agent.
Anyone who has used Claude Code or Codex seriously knows the workflow: while one instance is working through a problem, you open another tab and start a second task. Boris Cherny, who built Claude Code, has described routinely running five tabs with agents and sub-agents working across different projects in parallel. That’s a productivity pattern for legitimate development work.
The same pattern applies to offensive security research. One attacker, running 20–100 parallel Mythos instances, can simultaneously probe 20–100 different codebases, or attack the same codebase from 100 different angles at once. The constraint shifts from “how many expert humans can I hire” to “how much compute can I afford.”
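The fan-out itself is mundane engineering. A minimal sketch of the orchestration pattern, where `analyze()` is a hypothetical stand-in for dispatching one agent instance at a target (not any real model API):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one agent instance probing one target.
# In practice this would be an API call to a hosted model; here it just
# tags the target so the fan-out structure is visible.
def analyze(target: str) -> str:
    return f"report for {target}"

targets = [f"codebase-{i}" for i in range(20)]

# Fan out: one operator, 20 concurrent probes. The binding constraint
# is concurrency budget (compute), not headcount.
with ThreadPoolExecutor(max_workers=20) as pool:
    reports = list(pool.map(analyze, targets))

print(len(reports))  # 20 reports from a single operator
```

The point of the sketch is that nothing about the pattern is exotic: it is the same fork-join structure developers already use for legitimate parallel agent work.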
According to Anthropic’s own reporting, the per-exploit cost metric for Mythos is “not massive” — despite Mythos being an expensive model to run. When you’re thinking about cost-per-vulnerability rather than cost-per-hour, the economics of large-scale attack campaigns change entirely.
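The unit-economics shift is easy to make concrete. All figures below are illustrative assumptions, not Anthropic's numbers or real market rates; the structure of the comparison is what matters:

```python
# All figures are hypothetical illustrations, not reported data.
expert_rate = 300.0          # $/hour for a senior security researcher
expert_hours_per_find = 80   # hours to surface one serious bug
agent_cost_per_hour = 15.0   # compute cost for one agent instance
agent_hours_per_find = 40    # agent wall-clock hours per finding
parallel_instances = 50      # instances one operator runs at once

# Cost per vulnerability found, under the assumptions above
cost_per_find_human = expert_rate * expert_hours_per_find      # 24000.0
cost_per_find_agent = agent_cost_per_hour * agent_hours_per_find  # 600.0

# Throughput: the human works a 40-hour week; agents run 168 hours/week,
# and there are 50 of them.
human_finds_per_week = 40 / expert_hours_per_find
agent_finds_per_week = parallel_instances * (168 / agent_hours_per_find)

print(cost_per_find_human, cost_per_find_agent)
print(human_finds_per_week, agent_finds_per_week)
```

Even with deliberately conservative toy numbers, the campaign is priced per finding rather than per hour, and throughput scales with instances rather than with hiring.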
This is why the IMF’s warning isn’t about one super-hacker. It’s about thousands of mediocre hackers suddenly operating at elite level, simultaneously, across every major codebase on the internet.
Why the IMF and Every Major Bank CEO Are Paying Attention
The IMF’s article — “Financial stability risks mount as artificial intelligence fuels cyber attacks” — is notable for something beyond its conclusions. It specifically names Claude Mythos preview and OpenAI’s GPT-5.5 cyber attack version by name. This appears to be the first time specific AI model names have appeared in a systemic financial risk document from a body that covers 191 countries.
That’s not a casual editorial choice. When the IMF names a model in a financial stability warning, it’s because the risk assessment team concluded the capability is concrete enough to warrant that specificity.
The officials flagging Mythos read like a who’s-who of global financial oversight: François-Philippe Champagne (Canadian Finance Minister), Andrew Bailey (Bank of England Governor), Christine Lagarde (ECB President), Scott Bessent (US Treasury Secretary), and Jerome Powell (Federal Reserve Chair). These aren’t people who share a stage often.
The bank CEO briefings are equally telling. Jamie Dimon, along with the CEOs of Goldman Sachs, Bank of America, Citigroup, Morgan Stanley, and Wells Fargo, all attended what was described as a red-alert briefing on Mythos capabilities. Dimon’s shareholder letter put it plainly: “Cyber security remains one of the biggest risks and AI almost surely will make this risk worse.”
Getting all six of those CEOs in one room for a single topic is not routine. The fact that they came suggests the demonstration was compelling enough that declining felt unwise.
The systemic concern is straightforward: banks aren’t just websites with money behind them. They’re the plumbing for payroll, mortgages, credit markets, ATM networks, and settlement systems. A successful attack on payments infrastructure doesn’t need to cause actual collapse to trigger a financial shock. It just needs to create enough ambiguity that confidence erodes. Markets don’t need certainty to panic — they need uncertainty.
What Mythos Is Actually Doing When It Finds Vulnerabilities
Understanding the threat requires understanding the process. Mythos isn’t running a static analysis tool or a known-pattern scanner. The research loop is more sophisticated: read the code, form a hypothesis about where a vulnerability might exist, use tools to probe that hypothesis, generate test cases, reproduce the issue, refine the finding, and then explain the problem in detail.
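The loop described above can be sketched as control flow. Everything here is a toy schematic so the structure runs end to end; the heuristics are placeholders and none of this reflects any real model's internals:

```python
from dataclasses import dataclass

# Schematic of the loop: read -> hypothesize -> probe -> reproduce -> write up.
# All components are toy stand-ins, not real tools.

@dataclass
class Finding:
    hypothesis: str
    test_case: str

def form_hypothesis(region: str):
    # Toy heuristic: flag code regions that parse untrusted input.
    if "parse" in region:
        return f"parser in {region} may accept a second format"
    return None

def probe(test_case: str) -> bool:
    # Stand-in for sandboxed execution of the generated test case.
    return "second format" in test_case

def research_loop(regions):
    findings = []
    for region in regions:
        hyp = form_hypothesis(region)            # form a hypothesis
        if hyp is None:
            continue
        case = f"input exercising: {hyp}"        # generate a concrete test case
        if probe(case):                          # try to reproduce in a sandbox
            findings.append(Finding(hyp, case))  # refine and write up
    return findings

print(len(research_loop(["parse_header()", "render_frame()"])))
```

The structural point is the feedback loop: a hypothesis is cheap to form, cheap to test, and only survives if it reproduces, which is what separates this from a static pattern scanner.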
This is adversarial interpretation. The model asks “what does this code actually allow?” rather than “what did the author intend?” Security failures almost always live in the gap between those two questions. A parser that the author believed accepted one format might, under specific conditions, accept two. When another component in the pipeline interprets the same input the other way, the attack lives in the disagreement between them.
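A toy version of a parser disagreement: two hypothetical components each parse the same input with a duplicated HTTP-style header, and each is individually "reasonable," but they resolve the duplicate differently. This mirrors the real bug class behind request smuggling:

```python
# Two naive header parsers that disagree on duplicate keys.
# Each is defensible in isolation; the exploit lives in the gap.

raw = "Content-Length: 5\r\nContent-Length: 50\r\n"

def parse_first_wins(raw: str) -> dict:
    headers = {}
    for line in raw.strip().split("\r\n"):
        key, value = line.split(": ", 1)
        headers.setdefault(key, value)   # keep the FIRST occurrence
    return headers

def parse_last_wins(raw: str) -> dict:
    headers = {}
    for line in raw.strip().split("\r\n"):
        key, value = line.split(": ", 1)
        headers[key] = value             # keep the LAST occurrence
    return headers

front = parse_first_wins(raw)["Content-Length"]  # "5"
back = parse_last_wins(raw)["Content-Length"]    # "50"
print(front != back)  # one input, two components, two interpretations
```

Neither author made an obvious mistake. Exhaustively enumerating disagreements like this across a codebase is exactly the task humans do poorly at scale.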
Humans are bad at this kind of exhaustive adversarial reading at scale. We understand intent well. We’re good at the meaning layer. We’re much worse at systematically enumerating every possible behavioral interpretation of a complex codebase. Mythos appears to be genuinely better at that specific task.
The previous Anthropic collaboration with Opus 4.6 found 22 security-sensitive bugs in Firefox v148, with 14 rated high severity. That was already impressive. Mythos found 271 in v150. That’s not a linear improvement — it’s a different capability class. Mozilla’s blog post on the results was titled “Zero Days Are Numbered,” which is either confident or alarmed depending on which side of the vulnerability you’re on.
Google’s Project Naptime and Big Sleep, OpenAI’s Codex Security, and DARPA’s AI Cyber Challenge are all pursuing similar autonomous vulnerability research loops. The shape of the approach is consistent across organizations: understand the codebase, build a threat model, validate issues in a sandbox, propose patches. The details differ, but the direction is the same.
What This Means for Your Threat Model
If you’re an engineer or builder, the practical implication is that your codebase is now subject to a different kind of scrutiny than it was two years ago.
The previous threat model assumed that finding vulnerabilities in well-maintained code was expensive and slow. Firefox has dedicated fuzzing infrastructure, sandboxing, memory safety work, internal security teams, and a mature bug bounty program. It’s one of the most hardened open-source codebases in the world. If Mythos can surface 271 vulnerabilities there in one release cycle, the implicit security guarantee of “this code has been reviewed by good engineers” is weaker than it used to be.
That doesn’t mean you should panic. It means you should update your assumptions about who might be looking at your code and what they’re capable of finding.
The capability jump from Opus 4.6 to Mythos in cybersecurity is significant — Mythos scores 83.1% on cybersecurity benchmarks versus Opus 4.6’s 66.6%. That gap isn’t just a benchmark number. It represents a qualitative change in what the model can do autonomously in a vulnerability research loop.
For teams building applications, the implication is that code hygiene has become a security property in a more direct way than before. Messy, hard-to-read code isn’t just a maintenance problem — it’s structurally resistant to the AI-assisted auditing tools that could make it safer. Narrow modules, explicit API boundaries, small interfaces, and clear specifications all make code easier for adversarial machine-scale scrutiny to reason over. Technical debt is now security debt in a more literal sense.
This is also where the abstraction question becomes practical. If you’re building systems where the spec is the source of truth and the implementation is derived output, you’re in a better position to apply systematic security review. Tools like Remy take this approach: you write an annotated markdown spec, and the full-stack application — TypeScript backend, SQLite database, auth, tests, deployment — gets compiled from it. When the spec is the canonical artifact, it’s easier to reason about what the system is supposed to allow and to verify that the implementation hasn’t drifted from that intent.
The Asymmetry That Makes This Hard to Defend Against
There’s a structural asymmetry in how skill compression affects offense versus defense.
On the offensive side, skill compression is a multiplier. A non-expert attacker running 20 parallel Mythos instances can probe 20 codebases simultaneously. The cost is compute. The skill barrier is low. The scale is limited only by compute budget.
On the defensive side, skill compression helps too — but the defender has to protect everything, while the attacker only has to find one thing. That asymmetry has always existed in security, but Mythos makes it more acute. The attacker’s cost per successful exploit drops. The defender’s surface area doesn’t shrink.
This is why the IMF’s framing of systemic risk is accurate. It’s not that any single attack will necessarily succeed. It’s that the probability of at least one significant attack succeeding across the entire financial system increases as the attacker population grows and their per-attempt cost falls. You don’t need a super-hacker. You need enough mediocre hackers with access to Mythos-level capability, running enough parallel instances, probing enough codebases, that the expected number of successful exploits per month becomes non-trivial.
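The systemic-risk arithmetic is just the complement rule: if each attempt succeeds with small probability p, the chance that at least one of n independent attempts succeeds is 1 - (1 - p)^n. The numbers below are illustrative, not estimates of real attack rates:

```python
# P(at least one success) = 1 - (1 - p)^n for n independent attempts.
# p and n are illustrative assumptions only.

def p_at_least_one(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 0.001  # per-attempt success probability: one in a thousand
for n in (100, 1_000, 10_000):
    print(n, round(p_at_least_one(p, n), 3))
```

Holding p fixed at a level where any individual attacker almost always fails, growing the attempt count from hundreds to tens of thousands pushes the probability of at least one systemic breach toward certainty. Skill compression raises p slightly; scale raises n enormously.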
The benchmark results for Mythos — 93.9% on SWE-bench, 59% on multimodal — give some sense of the underlying capability. But the cybersecurity numbers are the ones that matter for this specific threat model.
The Defensive Flip
There’s a version of this story that ends with “therefore everything is terrible.” That’s not quite right.
The same capability that makes Mythos dangerous as an offensive tool makes it valuable as a defensive one. Mozilla’s experiment demonstrates this directly: they got early access to Mythos, pointed it at Firefox v150, and shipped fixes for 271 vulnerabilities before those vulnerabilities were ever public. The zero-days were numbered because they were found and patched before attackers could find them.
The question is who gets access first and who moves faster. Right now, Anthropic is being selective about Mythos access — the organizations that have it tend to control some of the most critical infrastructure on the internet, which is presumably intentional. The goal appears to be hardening high-value targets before the capability becomes broadly available.
That window won’t stay open indefinitely. The capability trajectory from Opus 4.6 to Mythos suggests that Mythos-level security research capability will become more widely available as compute costs fall and competing models catch up. When that happens, the defensive advantage of early access disappears.
The practical implication for builders is to treat the current period as a window to get ahead of the threat model rather than react to it. That means writing cleaner code, building agentic security review into your pipeline, and taking seriously the idea that “a good engineer reviewed this” is a weaker security claim than it was in 2024.
Skill compression doesn’t make security impossible. It makes the old assumptions about who can attack you, and how many of them there are, no longer reliable.