How to Use AI for Security Auditing Before Your Competitors Do: A Practical Starting Guide

Google, OpenAI, and DARPA are all building autonomous vulnerability research. Here's how to start using AI for security auditing in your own codebase today.

MindStudio Team

Autonomous Vulnerability Research Is Becoming Standard Practice — Here’s How to Start

You can set up a working AI security audit loop for your codebase in an afternoon. Not a perfect one, not a replacement for a real security team, but a genuine first pass that catches real bugs — the kind that would have taken a senior engineer days to find manually.

That’s worth doing now, because the tools doing this at scale are no longer experimental. Google’s Project Naptime and Big Sleep, OpenAI’s Codex Security, and DARPA’s AI Cyber Challenge are all converging on the same autonomous loop: read the codebase, build a threat model, generate test cases, reproduce issues, refine findings, explain the problem. Mozilla recently published a post called “Zero Days Are Numbered” describing what happened when they pointed Anthropic’s Claude Mythos preview at Firefox v150: 271 vulnerabilities surfaced in a single release cycle. The previous collaboration, with Opus 4.6, found 22 security-sensitive bugs in Firefox v148, 14 of them high severity. The jump from 22 to 271 in a single model generation is the signal that this approach has crossed a threshold.

The question for most engineering teams isn’t whether this matters. It’s whether you’re going to build the habit now or scramble to catch up later.


What You Actually Get Out of This

Before walking through the setup, it’s worth being concrete about what AI security auditing produces — and what it doesn’t.


What it produces: a list of candidate vulnerabilities, ranked by severity, with explanations of why each one is a problem and often a suggested fix. The best systems also generate test cases that reproduce the issue, so you’re not just reading a description — you’re looking at evidence.

What it doesn’t produce: certainty. AI systems hallucinate. They miss context. They can flag a function as dangerous when the surrounding architecture makes it safe, or miss a subtle logic flaw that requires understanding the product’s intent. A good AI audit is a starting point for human review, not a replacement for it.

The reason this is still worth doing: the gap between “AI audit” and “no audit” is enormous. Most codebases have never had a systematic adversarial review. The AI will find real bugs. The IMF’s financial stability warning — which specifically named Claude Mythos preview and OpenAI’s GPT-5.5 cyber attack version — wasn’t about theoretical risk. It was about the demonstrated ability of these models to find and exploit vulnerabilities in major operating systems and browsers, even when used by non-experts. If the model can find those bugs in Firefox, one of the most security-hardened open-source codebases in the world, it can find bugs in your codebase.

The concrete outcome you’re aiming for: a repeatable process that runs on every significant release, produces a prioritized list of security findings, and feeds into your normal code review workflow.


What You Need Before You Start

You don’t need a security background to run an AI audit. You do need a few things in place.

Access to a capable model. Not all models are equivalent here. The Mozilla experiment used Claude Mythos, which is currently in limited preview. For teams that don’t have access, Claude Sonnet or GPT-4o are reasonable starting points for the kind of pattern-based vulnerability detection that catches the majority of common issues. The autonomous research loop — where the model forms hypotheses, generates test cases, and iterates — requires a more capable model, but you can get meaningful results from the simpler approach today.

A codebase you can share with the model. This sounds obvious, but it has real implications. If your code contains secrets, credentials, or proprietary data that can’t leave your environment, you’ll need to either use a self-hosted model or carefully scope what you share. For most teams, the practical answer is to share the code without secrets (which you should already be doing in version control) and treat the audit as you would any external code review.

A way to run the model on your code. You can do this manually by pasting code into a chat interface, but that doesn’t scale. The more useful setup involves either a CLI tool like Claude Code or a scripted workflow that feeds files to the model systematically. If you’re already using Claude Code for parallel development work, you can extend that setup to run security passes on branches before they merge.
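If you want to script it, the core loop is small. Here’s a minimal sketch using the Anthropic Python SDK; the model ID, target file, and prompt wording are placeholders to adapt to your repo:

  # audit_file.py — a minimal sketch using the Anthropic Python SDK
  # (pip install anthropic; expects ANTHROPIC_API_KEY in the environment).
  import pathlib
  import anthropic

  AUDIT_PROMPT = """You are performing a security audit of the following code.
  Find vulnerabilities, not style issues. For each finding, give the location,
  the vulnerability type, a severity rating, and a suggested fix.

  Code:
  {code}"""

  client = anthropic.Anthropic()

  def audit(path: str) -> str:
      code = pathlib.Path(path).read_text()
      response = client.messages.create(
          model="claude-sonnet-4-20250514",  # any capable model works here
          max_tokens=4096,
          messages=[{"role": "user", "content": AUDIT_PROMPT.format(code=code)}],
      )
      return response.content[0].text

  if __name__ == "__main__":
      print(audit("src/auth/session.py"))  # hypothetical target file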

A place to track findings. A simple spreadsheet works. A GitHub issue with a security label works. What doesn’t work is letting the output disappear into a chat window.


The Audit Loop, Step by Step

Step 1: Scope the audit


Pick a specific part of your codebase to start with. Don’t try to audit everything at once.

Good candidates for a first audit: authentication code, input parsing, file upload handling, anything that touches external APIs, anything that handles user-supplied data before it reaches a database.

Bad candidates for a first audit: utility functions, UI components, configuration files. These can have security implications, but they’re harder to audit without context and less likely to produce high-severity findings.

Write down what you’re auditing and why. This becomes your audit scope document, and it’s what you’ll hand to the model as context.
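A scope document can be a few lines. A hypothetical example:

  Audit scope: session-token handling in src/auth/
  Files: session.py, token_store.py, middleware.py
  What this code does: issues and validates session tokens for the web app.
  Why it’s first: it handles user-supplied cookies before any other validation
  runs, and a bypass here affects every user.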

Now you have: a defined scope, a list of files to review, and a one-paragraph description of what this code is supposed to do.

Step 2: Write a system prompt that sets up the adversarial frame

The difference between a useful security audit and a generic code review is the framing. You want the model to read your code the way an attacker would — asking “what does this code allow?” rather than “what does this code intend?”

Here’s a prompt structure that works:

You are performing a security audit of the following code. Your job is to find vulnerabilities, not to evaluate code quality or style.

For each finding:
1. Describe the vulnerability
2. Explain how an attacker could exploit it
3. Rate severity (critical / high / medium / low)
4. Suggest a fix

Focus on: injection vulnerabilities, authentication bypasses, authorization flaws, insecure deserialization, path traversal, race conditions, and logic errors that could be exploited.

Context: [describe what this code does and who uses it]

Code:
[paste the code]

The key phrase is “what does this code allow.” Security failures live in the gap between what the author intended and what the implementation actually permits. The model needs to be in adversarial interpretation mode, not helpful assistant mode.

Now you have: a prompt template you can reuse across audit sessions.

Step 3: Run the audit and capture the output

Paste your code and run the prompt. For larger codebases, you’ll need to chunk this — most models have context limits, and you’ll get better results from focused passes on specific modules than from dumping thousands of lines at once.
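One way to chunk, sketched below: group files by module directory and cap each pass at a rough character budget. The budget and file extensions are stand-ins; tune them to your model’s context window and your stack.

  # chunk_modules.py — sketch of per-module chunking for audit passes.
  import pathlib
  from collections import defaultdict

  MAX_CHARS = 60_000  # crude proxy for the context limit; adjust as needed

  def module_chunks(root: str, exts={".py", ".ts", ".go"}):
      by_module = defaultdict(list)
      for path in sorted(pathlib.Path(root).rglob("*")):
          if path.suffix in exts:
              by_module[path.parent].append(path)
      for module, files in by_module.items():
          chunk, size = [], 0
          for f in files:
              text = f.read_text(errors="ignore")
              if size + len(text) > MAX_CHARS and chunk:
                  yield module, chunk  # audit this batch, then continue
                  chunk, size = [], 0
              chunk.append((f, text))
              size += len(text)
          if chunk:
              yield module, chunk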

For each finding the model returns, capture:

  • The file and line number (or function name)
  • The vulnerability type
  • The severity rating
  • The model’s explanation
  • Whether you’ve verified it manually

Don’t skip the manual verification step. AI models produce false positives. A finding that looks alarming might be safe in context. A finding that looks minor might be more serious than the model realized. The model’s job is to surface candidates; your job is to triage them.
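If you’re capturing findings in a script rather than a spreadsheet, a small record type keeps those fields consistent across sessions. A sketch, with field names that are just one reasonable choice:

  # finding.py — one way to structure a captured finding.
  from dataclasses import dataclass, asdict
  import csv

  @dataclass
  class Finding:
      location: str     # file and line number, or function name
      vuln_type: str    # e.g. "SQL injection", "path traversal"
      severity: str     # critical / high / medium / low
      explanation: str  # the model's reasoning, kept verbatim
      verified: bool    # has a human reproduced it yet?

  def append_finding(finding: Finding, path: str = "findings.csv") -> None:
      row = asdict(finding)
      with open(path, "a", newline="") as f:
          csv.DictWriter(f, fieldnames=row.keys()).writerow(row)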

Now you have: a raw list of candidate vulnerabilities with severity ratings.

Step 4: Build the test case


For any finding rated high or critical, ask the model to generate a test case that demonstrates the vulnerability. This is where the autonomous research loop that Google’s Project Naptime and OpenAI’s Codex Security are building toward becomes visible — the model doesn’t just describe the problem, it shows you how to reproduce it.

A prompt for this step:

You identified this vulnerability: [paste the finding]

Write a test case (or a curl command, or a script) that demonstrates this vulnerability. The test should show the attack succeeding, not just describe it.

If the model can produce a working exploit, the vulnerability is real. If it can’t, that’s useful information too — it might mean the finding is a false positive, or it might mean the model doesn’t have enough context about your runtime environment.
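For a sense of what a reproduction looks like, here’s a hypothetical test for an invented path traversal finding. The endpoint, port, and parameter are made up for illustration:

  # repro_path_traversal.py — hypothetical reproduction, not a real finding.
  import requests

  def test_path_traversal():
      # If the handler joins user input onto a base path without
      # normalizing it, this request escapes the intended directory.
      r = requests.get(
          "http://localhost:8000/files",        # hypothetical dev endpoint
          params={"path": "../../etc/passwd"},  # traversal payload
      )
      assert r.status_code == 200
      assert "root:" in r.text  # we read a file outside the web root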

Now you have: verified findings with reproduction steps, which is what you need to prioritize fixes.

Step 5: Integrate into your release process

A one-time audit is useful. A repeatable audit is what actually improves your security posture over time.

The goal is to run a security pass on every significant release, before it ships. This doesn’t have to be the full autonomous loop — a focused pass on changed files is often enough. If you’re using Claude Code agents running continuously, you can hook this into your build pipeline so it runs automatically.
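A minimal version of that hook: audit only the files that changed against the base branch, reusing the audit() helper from the earlier sketch. The branch name and file filter are placeholders.

  # audit_diff.py — sketch of a pre-merge pass over changed files only.
  import subprocess
  from audit_file import audit  # the helper from the earlier sketch

  def changed_files(base: str = "origin/main") -> list[str]:
      out = subprocess.run(
          ["git", "diff", "--name-only", base, "--", "*.py"],
          capture_output=True, text=True, check=True,
      )
      return out.stdout.splitlines()

  if __name__ == "__main__":
      for path in changed_files():
          print(f"=== {path} ===")
          print(audit(path))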

The output of each audit should feed back into your scope document. If the model keeps finding the same class of vulnerability, that’s a signal about your codebase’s structural weaknesses — not just individual bugs to fix, but patterns to address in your architecture.

Now you have: a repeatable security audit process that runs on every release.


Where This Actually Goes Wrong

The model hallucinates a vulnerability. This happens. The model will sometimes describe an attack that doesn’t work because it misunderstood how a library function behaves, or because it missed a validation step earlier in the call chain. The fix is the verification step — always try to reproduce high-severity findings before acting on them.

The context window runs out. For large codebases, you’ll hit limits. The solution is chunking: audit one module at a time, and keep a running document of findings across sessions. The model can’t hold your entire codebase in context, but it doesn’t need to — most vulnerabilities are local to a specific function or module.

The findings are all low-severity. This can mean your code is genuinely secure (good!), or it can mean you scoped the audit too narrowly, or the model isn’t in the right frame. Try explicitly asking: “What’s the most dangerous thing an attacker could do with this code?” Sometimes you need to push the model toward the adversarial frame before it finds the interesting stuff.

You don’t have time to fix everything. This is the real problem. An AI audit can surface dozens of findings quickly, and triaging them takes time. The severity ratings help, but you’ll still need to make judgment calls. Focus on anything that allows unauthenticated access, anything that touches user data, and anything that could affect other users (not just the attacker). Everything else can wait.


Your codebase is too messy to audit effectively. This is a real constraint. The same properties that make code readable to humans — narrow modules, explicit boundaries, small interfaces, clear specifications — make it easier for AI systems to reason about security. Technical debt becomes security debt in a direct way when you’re trying to run automated audits. If the model keeps producing vague findings because it can’t follow the control flow, that’s a signal to refactor before auditing.


Where to Take This Further

The loop described above — scope, prompt, audit, verify, integrate — is a starting point. The direction the field is moving is toward more automation at each step.

The autonomous research loop that Mythos uses (read code → form hypothesis → use tools → generate test cases → reproduce issue → refine finding → explain problem) is the same shape as what Google’s Project Naptime and DARPA’s AI Cyber Challenge are building. The difference is that those systems can run the entire loop without human intervention at each step. That capability is becoming more accessible, not less.

For teams building agentic pipelines, the practical next step is to make the security audit a node in your existing workflow rather than a separate manual process. If you’re already thinking about how to structure workflows, agents, and tools in your build pipeline, a security audit agent fits naturally into that architecture — it’s a tool that takes code as input and produces findings as output, which can then trigger a human review step or feed into automated patching.

One thing worth building now, even before you have a fully automated pipeline: a spec document for each module that describes what the code is supposed to do, what it’s explicitly not supposed to do, and what the security-relevant invariants are. This is the “meaning layer” that makes audits more effective — the model can compare what the code allows against what the spec says it should allow. Platforms like MindStudio make it easier to chain this kind of structured review into a repeatable workflow, connecting the model output to your existing tools without writing orchestration code from scratch.
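A spec like this (a hypothetical example) gives the model something concrete to audit against:

  Module: src/payments/refunds.py
  Supposed to: let a support agent refund a charge on an order they can view.
  Explicitly not supposed to: refund more than the original charge, or touch
  orders outside the agent’s assigned organization.
  Security invariants: every refund is tied to an authenticated agent ID;
  refund amount never exceeds the original charge; all refunds are logged.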

The spec document also has a second use: it’s what you hand to a human reviewer when the model flags something ambiguous. The reviewer needs to know what the code was supposed to do in order to judge whether the finding is real.

If you’re thinking about how specs relate to the code itself, Remy takes this further — you write the application as an annotated markdown spec, and the full-stack TypeScript app gets compiled from it. The spec is the source of truth; the code is derived output. That’s a different model than auditing existing code, but it points toward the same principle: the more precisely you can express intent, the more effectively machines can verify that the implementation matches it.

The Mozilla experiment found 271 vulnerabilities in Firefox v150 — a codebase with dedicated fuzzing, sandboxing, bug bounties, and internal security teams. The lesson isn’t that Firefox is insecure. The lesson is that exhaustive adversarial search at machine scale finds things that human review misses, even when the human review is very good. Your codebase has the same property. The question is whether you find those bugs first.


Start with one module. Run the prompt. Verify the findings. Build the habit. The teams that have this process in place by the end of the year are going to be in a meaningfully different position than the ones that don’t.

Presented by MindStudio
