Zero Days Are Numbered: 5 Signs AI Is About to Surpass Humans at Finding Security Vulnerabilities

Mozilla's blog says zero days are numbered. Mythos found 271 Firefox bugs in one cycle. Here are five signs AI is taking over adversarial code analysis.

MindStudio Team

271 Vulnerabilities in One Release Cycle: The Evidence That AI Is Winning at Security Research

Mozilla published a blog post called “Zero Days Are Numbered” earlier this year, and the title is not a metaphor. In Firefox version 150, Anthropic’s Mythos model found 271 vulnerabilities in a single release cycle. If you work on software — any software — that number should stop you cold.

This is not a story about AI helping with code review. It is a story about a new industrial process for vulnerability discovery, and the evidence is now specific enough that you can’t wave it away as hype.

The Numbers That Changed the Conversation

Start with the comparison. In Firefox version 148, Anthropic’s Opus 4.6 found 22 security-sensitive bugs — 14 of them high severity. That was already notable. Then Mythos came along for version 150 and found 271.

That is a 12x increase in discovered vulnerabilities between two consecutive model generations, applied to the same codebase. Firefox is not a toy project. It is one of the most security-hardened open-source codebases in existence — a browser that processes untrusted content from the internet all day, every day, and has been hardened by dedicated fuzzing programs, sandboxing, memory safety work, internal security teams, bug bounty programs, and years of what the source material accurately calls “hard-won paranoia.”

And Mythos found 271 bugs in one cycle.

Google’s Project Naptime and its successor Big Sleep have been moving in the same direction — autonomous systems designed to find vulnerabilities in production code, not just flag known-bad patterns. OpenAI’s Codex Security is built around a similar loop: understand the codebase, build a threat model, validate issues in a sandbox, propose patches for human review. DARPA’s AI Cyber Challenge tested autonomous systems finding and patching vulnerabilities across large codebases at scale.

The details differ. The shape is consistent. Something has shifted.

What Mythos Actually Does (And Why It’s Different)

The reason these numbers matter is not just scale — it’s the nature of the process. Mythos is not running a static analysis tool with better regex. It participates in what looks like an actual research loop: read the code, form a hypothesis, use tools, generate test cases, reproduce the issue, refine the finding, explain the problem.

That last step — explain — is significant. A tool that can articulate why something is a vulnerability is doing something qualitatively different from a linter that flags a known-bad pattern. It is performing adversarial interpretation.
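To make the shape of that loop concrete, here is a minimal sketch in Python. The model and sandbox interfaces are hypothetical placeholders, not Mythos’s actual API; only the structure of the loop comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    location: str       # where in the codebase the hypothesis points
    hypothesis: str     # why this code might be exploitable
    reproduction: str   # concrete input that triggered the behavior
    explanation: str    # plain-language account of why it is a vulnerability

def research_loop(model, sandbox, units) -> list[Finding]:
    """model, sandbox, and units are stand-ins; the loop structure is the point."""
    findings = []
    for unit in units:                                    # read the code
        hypothesis = model.propose_vulnerability(unit)    # form a hypothesis
        if hypothesis is None:
            continue
        test_case = model.generate_test_case(hypothesis)  # generate a test case
        result = sandbox.run(unit, test_case)             # try to reproduce it
        if result.confirms(hypothesis):
            refined = model.refine(hypothesis, result)    # refine against the evidence
            findings.append(Finding(
                location=unit.path,
                hypothesis=refined.summary,
                reproduction=test_case,
                explanation=model.explain(refined),       # articulate the problem
            ))
    return findings
```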

Security failures live in the gap between what code means to its author and what code actually permits. The author writes a parser that accepts one format. The implementation allows edge cases the author never considered. An attacker finds the space where two parsers disagree and lives there. Human reviewers see intended meaning. Attackers search for actual behavior. Mythos appears to be doing the latter at machine scale.
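A small, ordinary example of that gap, unrelated to Firefox and with made-up hostnames: an allowlist check written in terms of what the author meant, next to what the URL parser actually sees.

```python
from urllib.parse import urlparse

def looks_internal(url: str) -> bool:
    # What the author meant: "only allow our internal host"
    return url.startswith("https://internal.example.com")

url = "https://internal.example.com.attacker.net/payload"

print(looks_internal(url))     # True -> the intended meaning says this is safe
print(urlparse(url).hostname)  # internal.example.com.attacker.net -> actual behavior
```

The check and the parser give different answers for the same string, and that disagreement is exactly the space an attacker lives in.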

This is the distinction that makes the Firefox numbers so striking. Firefox’s security team was already doing everything right by human standards. They had the fuzzing. They had the sandboxing. They had the bug bounties. Mythos found 271 more things anyway — not because the Firefox team was negligent, but because exhaustive adversarial search at this scale was previously impossible for humans to perform.

For more on what Mythos is and how it differs from previous Claude models, the capability comparison between Claude Mythos and Opus 4.6 is worth reading alongside this.

Why This Matters for Anyone Shipping Code

Here is the uncomfortable version of this story: if Firefox — with all of its security infrastructure — had 271 undiscovered vulnerabilities, what does your codebase have?

Most production codebases are not Firefox. They do not have dedicated security teams. They do not have years of paranoid engineering culture baked in. They have developers doing their best, code reviews when time allows, and maybe a SAST tool running in CI. The gap between “a good engineer wrote this” and “this code has been exhaustively adversarially searched” is enormous, and until recently, closing that gap was economically impossible.

That is changing fast. The prediction from the source material is that open-source models will reach Mythos-like security capability by end of 2026. GPT-5.5 is already showing some of the same security-sniffing attributes, though without the side-by-side case studies that make the Firefox comparison so concrete. Future Claude models will get there as compute scales. By December, the argument goes, Mythos-like capability will be broadly available.

That timeline matters for how you think about your pipeline right now. If you are building agentic systems today — and if you’re reading this, you probably are — the security review step in your pipeline is about to get a very different set of options. Understanding what Claude Mythos actually is before it becomes a standard part of security tooling is not premature.

The Non-Obvious Detail Buried in the Mozilla Story

The 271 number gets the headlines. The more interesting detail is what it implies about the future trust model for code.

For the entire history of software, human authorship has been the default trust anchor. Humans write the code. Machines help check it. “A good engineer wrote this” is a meaningful claim. That claim is eroding — not because engineers are getting worse, but because the standard of proof is changing.

If machines become better than humans at exhaustively searching the consequences of code, then human authorship stops being the trust anchor. It becomes one more source of unverified risk. Code won’t be trusted because a good engineer wrote it. It will be trusted because it survived adversarial machine-scale scrutiny.

This is a bigger shift than it sounds. We have been through versions of this before in software. We stopped trusting developers to casually write cryptography — that’s not acceptable practice anymore. We stopped trusting manual memory management in large classes of software once safer alternatives became practical. We stopped trusting hand-run production deploys without automation, rollback, and observability. In each case, human skill didn’t disappear. Human execution just lost the presumption of safety in that specific domain.

Code itself may be the next domain to lose that presumption. Not all code, not tomorrow, not in the theatrical sense where engineers disappear. But the direction is visible.

The practical implication is that your agentic pipeline needs to be modular enough to swap in a Mythos-equivalent when it becomes available. If you have a principal engineer reviewing code today, think about how modular that role is. You may want to replace it with a model in four or five months. The architecture decision you make now determines whether that swap is easy or painful.

Platforms like MindStudio handle the orchestration layer for this kind of agentic setup — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which means the infrastructure for plugging in a security-review agent as one node in a pipeline already exists. The missing piece has been a model capable enough to trust in that role. That piece is arriving.
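“Modular enough to swap” can be mundane in practice: the pipeline depends on an interface, not on a person or a vendor. A minimal sketch, with hypothetical class and method names rather than any real product API:

```python
from typing import Protocol

class SecurityReviewer(Protocol):
    def review(self, diff: str) -> list[str]:
        """Return a list of findings for a proposed change."""
        ...

class ChecklistReviewer:
    """Stand-in for today's human-driven step: flags a few obvious patterns."""
    def review(self, diff: str) -> list[str]:
        return [line for line in diff.splitlines() if "eval(" in line]

class ModelReviewer:
    """Tomorrow's step: same interface, backed by whichever model you trust."""
    def __init__(self, client):
        self.client = client  # hypothetical client object

    def review(self, diff: str) -> list[str]:
        return self.client.find_vulnerabilities(diff)  # hypothetical method name

def review_step(reviewer: SecurityReviewer, diff: str) -> list[str]:
    # The rest of the pipeline only sees the Protocol, so replacing the
    # checklist (or the human queue behind it) with a model is a config change.
    return reviewer.review(diff)
```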

What the Research Loop Tells You About Evals

There is a practical implication buried in how Mythos works that most people building agentic pipelines are getting wrong.

The source material makes a specific recommendation: at least 50% of your agentic pipeline evals should cover code hygiene and architecture, not just functional correctness. Most teams are running 80% functional evals and maybe 20% non-functional requirements. That ratio needs to flip.

The reason is that finding insecure code is partly an act of creativity. A good eval can verify that code does X, Y, and Z. It cannot verify that code doesn’t permit something the author never imagined. Mythos is good at adversarial interpretation — reading code as if looking for the worst possible interpretation, then trying to break it. To support that kind of review, your code needs to be legible enough for the model to reason over.

This means your evals should enforce things like maximum lines per function, dependency handling standards, and which expressions in your language of choice are tolerated and which are not. Every language has expressions that are notoriously unreliable for security researchers. You can ask Claude or GPT to enumerate them for your stack and write them into your evals. That process — moving from 20% code hygiene in your evals to 50% — is not about perfectionism. It is about giving a future security-review model a clean surface to work with.
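As a sketch of what one of those hygiene checks can look like, assuming Python as the target language and thresholds chosen purely for illustration:

```python
import ast

MAX_FUNCTION_LINES = 40          # illustrative threshold, not a standard
BANNED_CALLS = {"eval", "exec"}  # ask a model to enumerate these for your stack

def hygiene_findings(source: str) -> list[str]:
    """Flag over-long functions and banned calls in a Python module."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = (node.end_lineno or node.lineno) - node.lineno + 1
            if length > MAX_FUNCTION_LINES:
                findings.append(f"{node.name}: {length} lines (max {MAX_FUNCTION_LINES})")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in BANNED_CALLS:
            findings.append(f"banned call '{node.func.id}' at line {node.lineno}")
    return findings

# In an eval suite, fail the pipeline on any finding:
# assert hygiene_findings(open("module.py").read()) == []
```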

The spec layer matters here too. The gap between what code means and what code permits is where vulnerabilities live. Narrowing that gap starts with writing better specs. Specificity is the enemy of technical and security debt. A good module has a verb — it does a thing, one thing, clearly. If you cannot write that down precisely, the implementation will have room to drift. Tools like Remy take this seriously at the architecture level: you write a spec — annotated markdown where readable prose carries intent and annotations carry precision — and the full-stack application is compiled from it. The spec is the source of truth; the code is derived output. That approach makes the meaning layer explicit in a way that adversarial review can actually work with.

The Benchmarks Behind the Claim

For the skeptical reader: the Firefox numbers are striking, but are they representative? A few data points worth holding onto.

Claude Mythos posted 93.9% on SWE-bench, the standard benchmark for autonomous software engineering tasks. That score represents a significant jump from previous models and is the kind of number that makes the Firefox results less surprising in retrospect — a model that can solve 93.9% of real-world software engineering tasks from a benchmark is also a model that can reason about code at a level of depth that enables genuine vulnerability research.

The DARPA AI Cyber Challenge is a separate data point. That program tested autonomous systems finding and patching vulnerabilities across large codebases — not a controlled lab environment, but a competitive challenge designed to stress-test exactly this capability. The results were consistent with what Mozilla found: autonomous systems can find and patch real vulnerabilities at a scale and speed that human researchers cannot match.

The convergence across Mozilla, Google Big Sleep, OpenAI Codex Security, and DARPA is the thing to notice. These are independent programs, different organizations, different model architectures, all arriving at the same basic finding in the same window of time. That is not a coincidence. Something has tipped.

What to Watch For in the Next Six Months

The practical watchpoints, in order of immediacy:

Model availability. Mythos is currently available only to select organizations — Mozilla got early access specifically because hardening high-value targets before broad release is the responsible path. As compute scales and the model becomes more broadly available, the security-review use case becomes accessible to teams that are not Mozilla. Watch for this.

Open-source parity. The prediction is that open-source models reach Mythos-like security capability by end of 2026. If that happens, the economics of security review change completely. A model you can run on your own infrastructure, pointed at your own codebase, running continuously in your CI pipeline — that is a different world from paying for API access to a frontier model. Anthropic’s compute constraints are part of why Mythos isn’t broadly available yet; as that situation resolves, availability will expand.

The disclosure problem. This is the part that doesn’t get enough attention. A model finding 271 vulnerabilities in Firefox is useful because Mozilla has the infrastructure to process those reports, triage them, patch them, and ship fixes. Most open-source maintainers do not have that infrastructure. When Mythos-like capability becomes broadly available, the volume of real vulnerability reports hitting small teams could be overwhelming. The disclosure norms and funding models for open-source security have not caught up to this reality.

Your own codebase. The most actionable watchpoint is the one closest to you. If you are shipping production software, the question is not whether AI will eventually be better than humans at finding vulnerabilities in your code. The evidence suggests it already is, for at least one model, applied to at least one major codebase. The question is whether your code is legible enough for that kind of review to work — and whether your pipeline is modular enough to plug it in when the right model becomes accessible to you.

Mozilla’s blog post is called “Zero Days Are Numbered.” That title is doing a lot of work. Zero-day vulnerabilities stop being zero-days only when the vendor knows about them, understands them, fixes them, ships the fix, and users deploy it. The world is full of systems that remain vulnerable long after fixes exist. The promise of Mythos-like capability is not that bugs disappear the moment a model finds them. It is that the window between “bug exists” and “bug is known” collapses — and that the best bugs, the ones that would have been zero-days, never make it to production in the first place.

That is a different kind of security guarantee than anything the industry has had before. Whether it arrives on the timeline the evidence suggests, and whether the infrastructure around it — disclosure norms, patching capacity, pipeline integration — can keep up, is the open question. But the direction is no longer ambiguous.

Presented by MindStudio
