AI Model Distillation Attacks: What They Are and Why They Matter
Anthropic, Google, and OpenAI all reported distillation attacks from Chinese AI labs. Learn what model distillation is and why it's a security concern.
When Imitation Becomes Theft: The Rise of AI Distillation Attacks
In early 2025, OpenAI made a striking claim: a Chinese AI lab had stolen its technology — not by hacking servers or bribing employees, but by repeatedly querying its API and using the responses to train a competing model. The technique is called model distillation, and it’s quickly become one of the more consequential security concerns in the AI industry.
Anthropic, Google, and OpenAI have all reported variations of this problem. And as AI model capabilities become central to economic and national security competition, understanding what AI model distillation attacks are, how they work, and why they matter has moved well beyond a niche technical concern.
This article explains the full picture — from the underlying mechanics of knowledge distillation to the specific incidents, the defenses AI labs are developing, and the policy debates these attacks have triggered.
What Is Model Distillation (The Legitimate Version)?
To understand distillation attacks, you first need to understand what model distillation is supposed to be.
The Original Idea
Knowledge distillation was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in the 2015 paper “Distilling the Knowledge in a Neural Network.” The core insight: a large, capable model (called the “teacher”) can help a smaller model (the “student”) learn more efficiently than training from raw data alone.
Here’s how it works in practice:
- You have a large, expensive teacher model that produces very accurate outputs.
- You run training data through the teacher and collect its outputs — not just the final answers, but the probability distributions it assigns to different possible answers.
- You train a smaller student model to mimic those outputs.
- The student learns richer information than it would from simple right/wrong labels.
The “soft labels” produced by a teacher model (the probability distribution across possible outputs) carry information about which alternatives the teacher finds plausible, and how strongly. If a teacher assigns 70% probability to one answer and 20% to a plausible alternative, that nuanced signal helps the student learn much faster than if it just saw the correct answer labeled as “1” and everything else as “0.”
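To make the soft-label idea concrete, here is a minimal sketch in plain Python. The 70/20/10 teacher distribution extends the example above; the two student models and their scores are invented for illustration. Both students pick the right answer with the same confidence, so a hard label cannot tell them apart, while the teacher's soft label can.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_probs):
    """Loss is lowest when the student's distribution matches the target."""
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0)

# Illustrative teacher distribution over three candidate answers.
teacher_soft = [0.70, 0.20, 0.10]
hard_label = [1.0, 0.0, 0.0]

# Two hypothetical students that both rank answer 0 first.
student_a = softmax([2.0, 0.8, 0.1])  # orders the alternatives like the teacher
student_b = softmax([2.0, 0.1, 0.8])  # orders the alternatives the other way round

hard_loss_a = cross_entropy(hard_label, student_a)
hard_loss_b = cross_entropy(hard_label, student_b)
soft_loss_a = cross_entropy(teacher_soft, student_a)
soft_loss_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss: A={hard_loss_a:.3f}  B={hard_loss_b:.3f}")
print(f"soft-label loss: A={soft_loss_a:.3f}  B={soft_loss_b:.3f}")
```

The hard-label losses come out the same for both students, while the soft-label loss is lower for student A. That gap is the extra training signal a teacher's distribution provides.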
Why Distillation Is Genuinely Useful
Used properly, distillation solves a real problem. Training large frontier models costs tens or hundreds of millions of dollars and requires massive compute infrastructure. Distillation lets organizations:
- Compress models for deployment on devices with limited compute (phones, edge hardware)
- Reduce inference costs without significantly sacrificing quality
- Specialize models for specific tasks using a general teacher
Companies do this with their own models all the time. Meta, for instance, has published research on distilling large LLaMA models into smaller variants. OpenAI has acknowledged that some of its own smaller models benefit from distillation from larger ones.
None of this is controversial. The controversy starts when someone uses another organization’s model — without permission — as the teacher.
How Distillation Becomes an Attack
When applied to someone else’s proprietary model, the same technique becomes a form of IP theft. The attacker gains the benefits of a frontier model’s capabilities without paying for the research, compute, and data that produced them.
Black-Box Distillation
The most common attack vector is black-box distillation. The attacker doesn’t need access to the model’s weights or internal architecture — they only need API access.
The process looks like this:
- Generate or acquire a large dataset of prompts — questions, instructions, coding problems, reasoning tasks, etc.
- Query the target model’s API at scale, collecting the responses.
- Use those responses as training data for a new model.
- The new model learns to replicate the target’s behavior, tone, capabilities, and reasoning patterns.
The attacker is essentially outsourcing their data labeling to the target model. Instead of paying human annotators or funding expensive training runs to develop capabilities, they’re using the frontier model as a free (or cheap, at API rates) annotation engine.
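Mechanically, the data-collection step is very simple, which is part of why it is hard to police. The sketch below shows the generic loop with a local stub standing in for the teacher; `stub_teacher`, the record field names, and the prompts are all hypothetical. The same loop is used in legitimate distillation of a model you own. What makes it an attack is whose model sits behind `teacher` and whether its terms permit this use.

```python
import json

def build_distillation_dataset(prompts, teacher):
    """Pair each prompt with the teacher's response, one training record each."""
    records = []
    for prompt in prompts:
        response = teacher(prompt)  # in practice, one model call per prompt
        records.append({"prompt": prompt, "completion": response})
    return records

def stub_teacher(prompt):
    """Hypothetical stand-in for a teacher model; returns a canned answer."""
    return f"Answer to: {prompt}"

prompts = ["What is 2 + 2?", "Summarize the water cycle."]
dataset = build_distillation_dataset(prompts, stub_teacher)

# Serialize as JSON Lines, a common format for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```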
Chain-of-Thought Distillation
A more sophisticated variant targets chain-of-thought (CoT) reasoning. Models like OpenAI’s o1 and Anthropic’s Claude are trained to reason step by step before answering. Depending on the model and interface, a query on a hard math or logic problem can return that reasoning (in full or as a summary) alongside the final answer.
That reasoning trace is enormously valuable training data. It shows the student model not just what to answer but how to think through a problem. Training a model on thousands of such reasoning traces produces something much more capable than training on answers alone.
This is particularly significant because reasoning-capable AI models, often called “reasoning” or “thinking” models (OpenAI brands its line the “o-series”), are among the most expensive and difficult to develop. Chain-of-thought distillation offers a shortcut to that capability.
The Scale of the Problem
What distinguishes a distillation attack from someone just using an AI tool is the intent and scale. An individual using ChatGPT for work is a user. An organization systematically generating millions of queries to harvest training data is an attacker.
Indicators of a distillation attack include:
- Unusually high API usage from specific accounts or IP ranges
- Query patterns that look like structured training datasets rather than natural user behavior
- Prompts designed to elicit specific response formats or reasoning demonstrations
- Output that closely mirrors the target model’s style, formatting, or even idiosyncratic errors
This last point is telling: if a “new” model makes the same unusual mistakes as a specific teacher model, it’s a strong signal that distillation occurred.
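One rough way to quantify the "same unusual mistakes" signal is to look only at questions both models get wrong and ask how often they give the same wrong answer. This is a minimal sketch with made-up answer data. The function name and the multiple-choice setup are assumptions for illustration, not a method any lab has published.

```python
def shared_error_rate(reference_answers, model_a, model_b):
    """Among questions both models get wrong, the fraction with identical wrong answers."""
    both_wrong = 0
    same_wrong = 0
    for truth, a, b in zip(reference_answers, model_a, model_b):
        if a != truth and b != truth:
            both_wrong += 1
            if a == b:
                same_wrong += 1
    return same_wrong / both_wrong if both_wrong else 0.0

# Invented multiple-choice answers for six questions.
truth     = ["A", "B", "C", "D", "A", "B"]
target    = ["A", "C", "D", "A", "B", "C"]  # the suspected teacher
suspect   = ["A", "C", "D", "D", "B", "A"]  # model under suspicion
unrelated = ["B", "A", "C", "D", "C", "B"]  # independently trained baseline

print(shared_error_rate(truth, target, suspect))    # 0.75
print(shared_error_rate(truth, target, unrelated))  # 0.0
```

A high score is circumstantial rather than conclusive: two models trained on overlapping public data can also share mistakes, so the comparison against an independent baseline matters.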
The Major Incidents: What Actually Happened
The distillation attack concern moved from theoretical to concrete with the emergence of DeepSeek’s R1 model in January 2025 — and the investigations it triggered.
DeepSeek and the OpenAI Investigation
DeepSeek is a Chinese AI lab affiliated with the hedge fund High-Flyer. In late January 2025, it released DeepSeek-R1, a reasoning model that performed comparably to OpenAI’s o1 on several benchmarks — at a fraction of the reported training cost.
The model’s rapid capability gains prompted immediate scrutiny. OpenAI announced it had evidence that DeepSeek had used “distillation” from OpenAI models to train R1. Microsoft’s security team, which monitors unusual patterns on OpenAI’s infrastructure, reportedly detected large-scale data extraction activity linked to DeepSeek-affiliated accounts.
DeepSeek’s own technical report describes distillation openly, but of a different kind: its smaller “distilled” models, fine-tuned from open-source base models like Qwen and LLaMA, were trained on outputs generated by R1. Whether R1 itself was built using outputs from OpenAI’s models remained contested.
The circumstantial evidence drew wide attention:
- DeepSeek-R1’s reasoning style closely resembled patterns associated with OpenAI’s o1.
- DeepSeek’s technical report showed performance curves that some researchers said were inconsistent with training purely from scratch on public data.
- The speed of DeepSeek’s capability improvements outpaced what would typically be expected from independent development.
OpenAI, for its part, said it was investigating and taking steps to prevent further extraction.
Anthropic’s Experience
Anthropic hasn’t detailed a specific incident with the same public specificity, but the company has been vocal about distillation attacks as a known threat to its business and, by extension, its safety mission.
Anthropic’s position is that funding frontier safety research depends on having a viable commercial business. If competitors can simply extract the capabilities of Claude by querying the API and training on the results, it undermines the economic rationale for investing in safe, well-aligned frontier models.
Anthropic’s usage policies explicitly prohibit using Claude outputs to train models that compete with Claude. But enforcement relies primarily on detection — identifying patterns consistent with distillation — rather than any technical barrier.
Google’s Concerns
Google has faced similar concerns around Gemini. The pattern is consistent across all three labs: as their models become more capable, the value of distilling from them increases, which increases the incentive for unauthorized extraction.
Google has also had to grapple with the fact that some of its research, particularly work on knowledge distillation and model compression, helped establish the techniques now being turned against frontier labs. The 2015 Hinton paper came out of Google. The techniques that enable attacks were originally published to advance the field.
The Broader Pattern
It’s worth noting that these aren’t just external attacks. There have also been concerns about insiders or affiliated researchers using privileged access to extract model capabilities. And smaller, open-source efforts have routinely built on outputs from proprietary models, sometimes transparently (as with Stanford’s Alpaca, which was fine-tuned on outputs from OpenAI’s text-davinci-003), sometimes not.
The DeepSeek situation stands out primarily because of scale, the geopolitical context, and the performance level achieved — suggesting that distillation at sufficient scale can produce results approaching the teacher model’s capabilities.
Why Distillation Attacks Matter Beyond IP Theft
The most obvious harm is economic: a competitor extracts value from R&D they didn’t fund. But the implications run deeper than that.
National Security and Export Controls
The US government has implemented significant export controls on advanced AI chips, specifically targeting Chinese entities. The logic: if China can’t access the compute needed to train frontier models, it can’t develop frontier AI capabilities.
Distillation attacks represent a partial end-run around that logic. If a Chinese lab can extract frontier capabilities from an American lab’s API — without needing the compute to train from scratch — chip export controls become less effective as a policy lever.
This is why US government officials and researchers have framed distillation attacks not just as business disputes but as national security concerns. The Senate and House have held hearings touching on this issue, and the Commerce Department has been asked to consider whether API access itself should be regulated for certain foreign entities.
Undermining Safety Research
The major Western AI labs — Anthropic, OpenAI, Google DeepMind — invest heavily in alignment and safety research. They argue that their models are safer than they would be if developed by organizations that prioritize capability over safety.
If distillation allows less safety-focused organizations to achieve comparable capabilities with much less investment, it shifts the competitive landscape away from labs that bear the costs of safety research. A lab that skips safety infrastructure can undercut the price of API access and still offer comparable capabilities, if those capabilities were extracted from a safety-conscious competitor.
This is particularly sensitive for Anthropic, whose entire business model is premised on the idea that safe AI development is worth funding because it produces better, more trustworthy products.
The Arms Race Dynamic
Detection and defense against distillation attacks are creating an escalating dynamic:
- Labs build better detection → attackers use more sophisticated obfuscation
- Labs restrict API access → attackers route through intermediaries and proxies
- Labs watermark outputs → researchers develop techniques to remove watermarks
- Labs limit output verbosity → attackers find ways to elicit more detailed responses
This arms race consumes engineering resources on the defensive side and creates friction for legitimate users, who may encounter rate limits, access restrictions, and monitoring that are direct consequences of attacker activity.
How AI Labs Are Defending Against Distillation Attacks
The defenses against distillation attacks operate at several levels, from technical detection to legal enforcement.
Anomaly Detection and Rate Limiting
The first line of defense is identifying attack patterns before significant data extraction occurs. Labs monitor for:
- Unusual query volumes from specific accounts or IP addresses
- Structured query patterns that suggest automated data collection rather than organic use
- High diversity in prompt topics — attackers often sample broadly to maximize coverage of the training distribution
- Specific prompt formats designed to elicit detailed reasoning traces or particular output formats
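As a rough illustration of how two of these signals might be combined, the sketch below flags an account only when its volume is high and its prompts share a common template. The function names and both thresholds are made-up illustrative values, not figures from any lab, and a production system would use far richer features.

```python
from collections import Counter

def template_share(prompts, prefix_words=3):
    """Fraction of prompts that start with the single most common word prefix."""
    if not prompts:
        return 0.0
    prefixes = Counter(" ".join(p.lower().split()[:prefix_words]) for p in prompts)
    return prefixes.most_common(1)[0][1] / len(prompts)

def looks_like_bulk_collection(prompts, max_organic_volume=500, max_template_share=0.5):
    """Flag an account only when volume AND structure both look automated.

    Both thresholds are hypothetical values chosen for this example.
    """
    return len(prompts) > max_organic_volume and template_share(prompts) > max_template_share

# A scripted account: many prompts, all built from one template.
scripted = [f"Solve step by step: problem {i}" for i in range(1000)]

# An organic account: few prompts, varied phrasing.
organic = [
    "How do I bake bread?",
    "Fix this regex please",
    "Translate hello to French",
    "How do I bake a cake?",
]

print(looks_like_bulk_collection(scripted))  # True
print(looks_like_bulk_collection(organic))   # False
```

Requiring both conditions is a deliberate choice in this sketch: high volume alone would flag legitimate heavy users, and templated prompts alone would flag small automated workflows.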
Rate limiting is a blunt but effective tool. If you can only make 100 queries per hour, bulk extraction becomes much slower and more expensive. Enterprise accounts with higher limits receive more scrutiny.
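A per-account limit like the 100-queries-per-hour example can be sketched as a sliding-window counter. The class and parameter names here are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow; a real deployment would typically enforce this in shared infrastructure rather than in process memory.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each account."""

    def __init__(self, limit=100, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self.history = {}  # account id -> deque of request timestamps

    def allow(self, account, now):
        """Return True and record the request if the account is under its limit."""
        stamps = self.history.setdefault(account, deque())
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True

limiter = SlidingWindowLimiter(limit=100, window_seconds=3600)

# One request per second from the same account: the 101st is refused.
results = [limiter.allow("acct-1", now=second) for second in range(101)]
print(sum(results), results[-1])  # 100 False
```

The arithmetic shows why this slows extraction: at 100 queries per hour, collecting one million responses from a single account would take 10,000 hours, or roughly 417 days. It also shows the limit of the defense, since the same job spread across 1,000 accounts takes about 10 hours.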
Output Watermarking
Watermarking AI model outputs is an active research area. The idea: embed detectable signals in generated text that allow the model’s owner to later prove that specific text was generated by their model.
There are two main approaches:
- Statistical watermarking: Subtly alter word choice probabilities to embed a signal that is invisible to readers but recoverable with a statistical test.
- Semantic watermarking: Embed signals in how ideas are expressed rather than the specific words chosen.
Watermarking is valuable for legal purposes — it could help a lab prove in court that a competitor’s model was trained on their outputs. But it has limitations: watermarks can sometimes be removed or diluted, especially if the attacker mixes watermarked data with large amounts of other training data.
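To show how a statistical watermark can be detected, here is a toy version of the “green list” idea from the research literature: at each step a hash of the previous token marks part of the vocabulary as green, the generator favors green tokens, and the detector counts how many green tokens appear compared with chance. Everything here is a simplified assumption (the hash construction, the 50% green fraction, the always-pick-green generator) and does not describe any lab's actual scheme.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary that is "green" at each step

def is_green(prev_token, token):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def watermark_z_score(tokens):
    """How many standard deviations the green-token count sits above chance."""
    n = len(tokens) - 1  # number of (previous, current) pairs scored
    if n < 1:
        return 0.0
    green = sum(is_green(prev, cur) for prev, cur in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

def generate_marked(vocab, length, start="<s>"):
    """Toy generator that always picks a green token, standing in for a biased sampler."""
    tokens = [start]
    while len(tokens) < length + 1:
        tokens.append(next(t for t in vocab if is_green(tokens[-1], t)))
    return tokens

vocab = [f"w{i}" for i in range(64)]
marked = generate_marked(vocab, length=100)
z = watermark_z_score(marked)
print(f"z-score for fully watermarked text: {z:.1f}")
```

With 100 scored tokens that are all green, the z-score is 10, far above what unwatermarked text would produce by chance. The dilution problem is visible in the formula: if only a small share of a training corpus or of a model's output carries the signal, the green count drifts back toward `GAMMA * n` and the score shrinks.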
Usage Policy Enforcement and Legal Action
Every major AI lab’s terms of service prohibit using its model outputs to train competing models. OpenAI’s terms, for example, bar using output from its services to develop models that compete with OpenAI.
Enforcement has so far been limited primarily to cease-and-desist letters and account termination. Legal action is complicated because:
- Proving that a model was trained on specific outputs is technically difficult
- Jurisdiction issues arise when the attacker is in another country
- The legal framework for AI-generated outputs as protectable IP remains unsettled in many jurisdictions
Stanford’s Alpaca project, which fine-tuned a model on outputs from OpenAI’s text-davinci-003, had its public demo taken offline voluntarily by its authors shortly after release. That suggests domestic actors can be responsive to social, reputational, and terms-of-service pressure. It’s less clear that such pressure works across international lines.
Capability Limitations and Output Restrictions
Some labs have experimented with restricting what their models will output in response to certain prompt patterns. For instance:
- Refusing to produce structured datasets on request
- Limiting the length or detail of chain-of-thought reasoning visible to users
- Declining to answer certain types of questions that are primarily useful for training data generation
These restrictions create real costs for legitimate use cases — developers who want to extract structured data for benign purposes, for instance — and are generally viewed as temporary measures rather than robust long-term defenses.
Hardware-Level and Cryptographic Approaches
Longer-term research directions include:
- Trusted execution environments that could run model inference in ways that prevent output logging at scale
- Homomorphic encryption approaches that could theoretically allow querying a model without being able to observe inputs or outputs — though this remains computationally impractical at current model scales
- Federated inference schemes that distribute computation to make bulk extraction harder
None of these are deployed at scale today, and some face significant engineering obstacles. But they represent the direction that more robust defenses might eventually take.
The Legal and Policy Landscape
The legal framework for distillation attacks is still catching up to the technical reality.
Are Model Outputs Copyrightable?
One central question: does a company own the copyright to outputs produced by its model?
This is genuinely unsettled. Copyright protects original human expression, and the extent to which AI-generated outputs qualify has not been fully adjudicated. US Copyright Office guidance has indicated that AI-generated outputs without meaningful human authorship may not be copyrightable.
If a company’s API outputs aren’t copyrightable, then using those outputs as training data — even at scale — may not constitute copyright infringement under current law. The legal theory would need to rely on other grounds: contract law (terms of service violations), trade secret law, or unfair competition.
Trade Secret Claims
Trade secret law is potentially more applicable. If a lab can demonstrate that:
- Its model outputs contain information derived from proprietary training data and methods
- The lab takes reasonable steps to keep that information secret
- The attacker improperly acquired or used that information
…then a trade secret claim could succeed. But “reasonable steps” to keep API outputs secret is an odd concept when the whole point is to make those outputs available to users.
The Computer Fraud and Abuse Act
In the US, the Computer Fraud and Abuse Act (CFAA) makes unauthorized access to computer systems illegal. Violating terms of service to extract data has sometimes been prosecuted under the CFAA, though courts have had mixed views on whether ToS violations constitute “unauthorized access” in the statutory sense.
International cases — particularly those involving Chinese entities — face additional complications around enforcement and jurisdiction.
Regulatory Proposals
Several regulatory proposals have emerged:
- API access restrictions for foreign entities: Some policymakers have proposed requiring that AI API access by certain foreign nationals or companies be licensed or restricted, similar to export control frameworks.
- Mandatory output watermarking: Some proposals would require frontier AI labs to watermark outputs to enable attribution.
- Distillation disclosure requirements: Requirements that AI labs disclose when their models were trained using outputs from other AI systems.
The EU’s AI Act, which entered into force in August 2024 with obligations phasing in from 2025 onward, includes provisions around AI transparency that may eventually touch on distillation practices, though specific rules around distillation attacks remain underdeveloped.
What Distillation Attacks Reveal About AI’s Competitive Dynamics
The distillation attack phenomenon reflects something important about how AI competition actually works.
Capability Is the Bottleneck
For most of AI’s history, data was the key scarce resource. The lab with the most and best training data would build the best models. Frontier model training has shifted that calculus — compute and research talent are now the primary bottlenecks for developing new capabilities from scratch.
But if capabilities can be extracted from existing models through distillation, then neither data nor compute is the binding constraint — access to a frontier model is. And frontier model access is relatively cheap compared to training one.
This dynamic changes competitive incentives in ways that are still playing out. It suggests that the “moat” for AI labs isn’t primarily their trained models but rather:
- The continuous improvement of those models (so yesterday’s extracted capabilities are already outdated)
- The data flywheel from user interactions
- Trust and safety reputation
- Enterprise relationships and integration ecosystems
Open Source Complicates Everything
The open-source AI movement — particularly Meta’s release of the LLaMA series — adds another layer. When capable open-source base models are available, distillation attacks become more efficient: an attacker can start with a strong open-source foundation and fine-tune it using extracted outputs from a proprietary model, rather than training from scratch.
DeepSeek’s published distilled models explicitly use LLaMA and Qwen as base models, fine-tuned on R1 outputs. The question is whether R1 itself was built using OpenAI outputs — a more serious allegation.
The availability of capable open-source models means that the barrier to mounting a distillation attack is lower than it would be if attackers had to train from scratch. And open-source models, once released, can’t be recalled or restricted.
The Benchmark Problem
AI models are evaluated on standardized benchmarks. These benchmarks are often public. This creates an incentive for attackers to target specifically the capabilities measured by prominent benchmarks — essentially teaching their model to perform well on the tests that matter for marketing and reputation, even if those capabilities aren’t deeply generalized.
Distillation from frontier models on benchmark-relevant task types can produce inflated benchmark performance that doesn’t reflect genuine general capability. This makes it harder to assess true capability gaps between models, and can lead to overclaiming about the progress of distilled models.
How MindStudio Approaches AI Model Access and Security
For organizations building AI-powered applications on top of frontier models, distillation attack concerns have practical implications. If you’re building a product that relies on a particular model’s capabilities, you need to think about how those models are accessed, logged, and potentially exposed.
MindStudio’s platform offers an interesting vantage point here. Because it provides access to 200+ AI models — including Claude, GPT-4, and Gemini — through a unified interface, it abstracts the direct API relationship. Teams building AI agents on MindStudio don’t need to manage API credentials directly or worry about the infrastructure-level logging that can expose data.
More practically: organizations using MindStudio to build internal tools and workflows can switch between models without rebuilding their applications. If a specific model becomes unavailable due to access restrictions — an increasingly plausible scenario as labs tighten API access in response to distillation threats — the application can be updated to use a different model without significant rework.
For teams that want to build on top of AI capabilities without getting entangled in the escalating access restrictions that distillation defenses are creating, MindStudio’s model-agnostic approach is worth considering. You can start for free at mindstudio.ai.
This isn’t about circumventing restrictions — MindStudio uses models through legitimate API agreements. But as AI labs respond to distillation attacks by increasing scrutiny of API usage, having an infrastructure layer that’s designed for compliant, well-governed access matters.
Frequently Asked Questions
What exactly is an AI model distillation attack?
A distillation attack happens when someone systematically queries a proprietary AI model’s API at scale, collects the responses, and uses those responses as training data to build a new model. The new model “learns” from the frontier model’s outputs, effectively extracting its capabilities without the attacker needing to invest in the original training. The attack is analogous to reverse-engineering a product through extensive testing — you can often reconstruct a lot of what went into making it just by observing how it behaves.
Is model distillation always illegal or unethical?
No. Distilling your own models — training smaller versions of systems you own — is a standard and widely accepted practice in the AI industry. The problem arises specifically when distillation is performed on another organization’s proprietary model without permission, especially when the terms of service explicitly prohibit it. There’s also a spectrum of severity: a researcher casually collecting a few hundred outputs for a non-commercial study is very different from an organization running millions of queries to build a commercial competitor.
How did OpenAI detect that DeepSeek was distilling from its models?
OpenAI and Microsoft didn’t publish a detailed methodology, but the detection was reported to involve monitoring for unusual API usage patterns — specifically, query volumes and patterns from certain accounts that were inconsistent with normal user behavior and more consistent with structured data collection for machine learning. Additionally, characteristics of DeepSeek’s model outputs — including specific stylistic patterns and behaviors that closely matched OpenAI’s models — served as circumstantial evidence. This kind of behavioral fingerprinting is one of the primary tools for post-hoc detection.
Can AI companies actually prevent distillation attacks?
Not fully, with current technology. Rate limiting slows attacks but doesn’t stop a well-resourced adversary willing to use many accounts or spend more time. Output watermarking can help with attribution after the fact but doesn’t prevent extraction. Usage policy enforcement depends on detection, and sophisticated attackers can route through proxies or use many identities to avoid detection. The honest answer is that determined, well-resourced attackers — especially state-affiliated ones — can likely conduct meaningful distillation despite current defenses. The defenses raise the cost and complexity but don’t create an impenetrable barrier.
What is chain-of-thought distillation and why is it especially concerning?
Chain-of-thought distillation specifically targets the reasoning traces that some AI models produce — the step-by-step thinking process, not just the final answer. These reasoning traces are particularly valuable training data because they show how to think through problems, not just what the correct answer is. Developing this kind of reasoning capability from scratch is among the most expensive and difficult aspects of training frontier models. If an attacker can harvest thousands of high-quality reasoning traces from an o1-class model and use them to fine-tune a cheaper base model, they can achieve a significant capability transfer at relatively low cost.
What are the national security implications of AI distillation attacks?
The core concern is that export controls designed to limit China’s access to frontier AI technology can be partially circumvented through API access. If a Chinese lab can achieve near-frontier capabilities through distillation from American model APIs, the chip export controls that were intended to limit China’s AI development become less effective. This has led some policymakers to call for additional restrictions on API access by Chinese entities or entities with Chinese affiliations — essentially extending the export control logic to software and model access, not just hardware.
Key Takeaways
Understanding AI model distillation attacks matters for anyone following the AI industry — whether you’re building AI applications, investing in AI companies, or thinking about AI policy.
- Model distillation is legitimate technology repurposed as an attack vector. Training small models using outputs from large ones is standard practice when done with your own models; it becomes a problem when applied to competitors’ proprietary systems without permission.
- The DeepSeek incident raised the profile of this threat significantly. OpenAI’s claim that DeepSeek distilled from its models — combined with the performance of DeepSeek-R1 — made the theoretical concern concrete.
- Current defenses are imperfect. Rate limiting, watermarking, and anomaly detection raise the cost of attacks but don’t prevent them entirely. The legal framework for pursuing bad actors — especially internationally — is underdeveloped.
- The national security implications go beyond IP theft. Distillation attacks represent a potential workaround to chip export controls, which has drawn attention from lawmakers and security officials.
- The incentives are only growing. As frontier models become more capable, the value of distilling from them increases. Expect this to remain an active area of technical and policy development for years.
If you’re building AI-powered products and want a model-agnostic platform that handles the infrastructure complexity — including the operational risks that come with depending on a single model’s API — MindStudio is worth exploring. You can get started for free, and the platform supports Claude, GPT-4, Gemini, and 200+ other models without requiring separate accounts or API key management.