
Why US Export Controls on GPUs Accidentally Made DeepSeek V4 Cheaper Than Any American Model

Cut off from top Nvidia GPUs, DeepSeek had to find compute-efficient training methods. The result: a model 3x cheaper than GPT-5.5 to serve.

MindStudio Team

DeepSeek V4 Costs $1.74/M Tokens. Here’s the Actual Reason Why.

DeepSeek V4 launched with a 1 million token context window and input pricing of $1.74 per million tokens. GPT-5.5 charges $5 per million input tokens and $30 per million output tokens. That's not a small discount — it's a structural difference, and it traces back to something most coverage skips: US export restrictions on GPUs forced DeepSeek into compute-efficient training methods, and those methods made the resulting model cheaper to serve at inference time too.

This post is about understanding why that happened, not just that it happened. If you’re deciding which models to build on, or trying to explain to a stakeholder why a Chinese open-weight model can undercut US frontier labs by 3x, this is the explanation.


What you need to follow this

You don’t need to run any code. But it helps to have a rough mental model of how LLMs work at inference time — specifically, that serving a model costs money proportional to how much compute each generated token requires.

If you’ve compared pricing pages across providers, you’re ready. If you’ve read about GPT-5.4 vs Claude Opus 4.6 on benchmarks and cost, you already have the right frame.


The constraint that created the efficiency

In 2022, the US government placed export controls on high-end Nvidia GPUs — specifically the A100 and later the H100 — restricting their sale to China. The intent was to slow Chinese AI development by denying access to the best training hardware.

It didn’t work the way anyone expected.

DeepSeek, a Chinese AI lab backed by the hedge fund High-Flyer, couldn't buy the chips US labs were training on. They couldn't just throw more compute at a problem; they had to be smarter about how they used the compute they had.

The result was a series of architectural and training choices that reduced the raw compute required to train a capable model. And here’s the part that matters for pricing: the same efficiency gains that reduce training cost also reduce inference cost. A model that was designed to do more with less hardware is also cheaper to serve at scale.

This is the core of why DeepSeek V4 can charge $1.74/M input tokens when GPT-5.5 charges $5/M and Claude Opus 4.7 charges $5/M. It’s not that DeepSeek is subsidizing losses. It’s that the model genuinely costs less to run.


How training efficiency translates to inference cost

To understand this, you need to know a bit about what makes a model expensive to serve.

When you send a prompt to a model, the model has to process every token through its layers to generate a response. The more parameters that activate per token, the more compute is required, and the more it costs to serve.

This is where architecture matters. DeepSeek V4 uses a mixture-of-experts (MoE) architecture. In a dense model, every parameter activates for every token. In an MoE model, only a subset of “experts” — specialized sub-networks — activate for any given token. The model routes each token to the relevant experts and skips the rest.

The practical effect: a model can have a very large total parameter count while only using a fraction of those parameters per inference call. You get the capability of a large model at the cost of a smaller one.
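To make that concrete, here is a minimal Python sketch of top-k expert routing. It illustrates the general MoE mechanism, not DeepSeek's actual implementation; the experts, router weights, and dimensions are all toy stand-ins. Note that compute scales with top_k, not with the total number of experts.

```python
import numpy as np

def moe_forward(token_vec, experts, router_weights, top_k=2):
    """Route one token through its top-k experts; the rest stay idle."""
    scores = token_vec @ router_weights   # one score per expert
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # softmax over experts
    top = np.argsort(probs)[-top_k:]      # indices of the top-k experts
    gate = probs[top] / probs[top].sum()  # renormalized gate weights
    # Only top_k expert networks execute; the rest cost nothing this token.
    return sum(g * experts[i](token_vec) for g, i in zip(gate, top))

# Toy usage: 8 experts, only 2 run per token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
out = moe_forward(rng.normal(size=d), experts, rng.normal(size=(d, n_experts)))
```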

Meta’s Llama 4 Scout and Maverick use the same MoE approach. OpenAI’s GPT-OSS-20B and GPT-OSS-120B, released as open-weight reasoning models under Apache 2.0, are also part of this broader shift toward architectures that are more efficient at inference time.

But DeepSeek’s version of this was sharpened by necessity. They couldn’t afford to train inefficiently, so they developed techniques that US labs — with access to abundant H100 clusters — had less pressure to prioritize.


The benchmark picture

Before you conclude that cheaper means worse, look at what DeepSeek V4 actually does on benchmarks.

On math and Q&A tasks, DeepSeek V4 sits close to GPT-5.4 — not GPT-5.5, but the previous generation. That’s the honest comparison. GPT-5.5 at $30/M output tokens is measurably better on frontier tasks. But for the vast majority of production workloads — document summarization, customer support agents, structured data extraction, code assistance — the gap between DeepSeek V4 and GPT-5.5 is smaller than the price gap.

Compare the output pricing directly:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| DeepSeek V4 | $1.74 | $3.48 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| Gemini 3.1 | $2.00 | $12.00 |

Gemini 3.1 is the closest competitor on price, but its output pricing climbs with token volume. DeepSeek V4’s output pricing at $3.48/M is the lowest in this group by a significant margin.

For an enterprise running millions of tokens per day through an agentic workflow, that output pricing difference is the number that matters most. Agents generate a lot of output tokens.
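A quick back-of-envelope shows how fast the spread compounds. The daily token volumes below are hypothetical; the prices are the list prices from the table.

```python
# Hypothetical daily volume for an agentic workload; the token counts are
# illustrative assumptions, the prices are the list prices from the table.
PRICES = {  # model: (input $ per 1M tokens, output $ per 1M tokens)
    "DeepSeek V4":     (1.74, 3.48),
    "GPT-5.5":         (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Gemini 3.1":      (2.00, 12.00),
}

INPUT_TOKENS_PER_DAY = 20_000_000   # agents read a lot of context
OUTPUT_TOKENS_PER_DAY = 5_000_000   # and still rack up output cost

for model, (p_in, p_out) in PRICES.items():
    daily = (INPUT_TOKENS_PER_DAY * p_in + OUTPUT_TOKENS_PER_DAY * p_out) / 1e6
    print(f"{model:<16} ${daily:8.2f}/day  ${30 * daily:10,.2f}/month")
```

Under these assumed volumes, DeepSeek V4 comes out near $52/day against roughly $250/day for GPT-5.5, and output tokens account for most of the spread.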


Why open-weight changes the calculus further

DeepSeek V4 is open-weight. That means you can download the weights and run it yourself — on your own infrastructure, in your own data center, behind your own firewall.

When you self-host, the marginal cost per token drops to electricity and hardware amortization. For large enterprises with existing GPU infrastructure, this is a meaningful option. The security and privacy arguments are real: your prompts never leave your network, you’re not subject to a provider’s rate limits or terms of service changes, and you can fine-tune the model on your own data.
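For a rough sense of what electricity plus amortization works out to, here is a back-of-envelope sketch. Every input is an assumption chosen for illustration, not a measured DeepSeek V4 figure; real throughput varies enormously with batch size and context length.

```python
# Back-of-envelope marginal cost of self-hosting, per million tokens.
# Every number is an assumption for illustration, not a measured
# DeepSeek V4 figure; real throughput depends heavily on batching.
GPU_COUNT = 8                 # assumed multi-GPU inference server
GPU_PRICE = 30_000.0          # assumed $ per GPU
AMORTIZATION_YEARS = 3        # straight-line hardware write-off
POWER_KW = 10.0               # assumed draw for the whole server
ELECTRICITY_PER_KWH = 0.12    # assumed $/kWh
TOKENS_PER_SECOND = 2_500     # assumed aggregate serving throughput

hours = 24 * 365 * AMORTIZATION_YEARS
hardware_per_hour = GPU_COUNT * GPU_PRICE / hours
power_per_hour = POWER_KW * ELECTRICITY_PER_KWH
tokens_per_hour = TOKENS_PER_SECOND * 3600

cost = (hardware_per_hour + power_per_hour) / tokens_per_hour * 1e6
print(f"~${cost:.2f} per million tokens")  # ≈ $1.15 under these assumptions
```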

The catch is that DeepSeek V4 is still large enough that you can’t run it on a consumer GPU. You need serious hardware — the kind of setup described in the DGX Spark tier of local AI stacks, or a multi-GPU server. For most companies, that means using DeepSeek’s cloud API or a third-party host rather than true self-hosting. But the option exists, and it’s a meaningful difference from GPT-5.5 or Claude Opus 4.7, which are closed-weight and can only be accessed through their respective APIs.

Nvidia’s Nemotron 3 Nano Omni is another example of this trend — an open-weight multimodal model (text, images, audio, video, documents, charts, and GUIs) designed to run on hardware like the DGX Spark. The direction of travel is toward capable open models that can be deployed locally, and DeepSeek V4 is the most capable example of that right now.


The US business model problem

Here’s where this gets structurally interesting, and where the export control story has an ironic second chapter.

US open-source AI labs face a broken business model. If you train a model and open-source the weights, anyone can serve it — and they’ll have better margins than you because they didn’t pay for the training run. There’s no obvious way to recoup the investment.

China doesn’t have this problem in the same way. The CCP subsidizes companies it wants to win in strategic industries. DeepSeek doesn’t need to recoup its training costs through API revenue the way a US startup does. It can give the weights away and undercut US closed-source pricing, and the economics still work.

The one US company that might have a viable open-source AI business model is Nvidia. They’re investing $26 billion in open-source AI, and their incentive structure is different from everyone else’s: their competitors — the cloud providers serving open models — are also their customers. Every inference call served on a Nemotron model, a Llama model, or a DeepSeek model runs on Nvidia chips. Nvidia is upstream of all of it. They make money whether the open model wins or the closed model wins, as long as the inference runs on their hardware.

That’s a genuinely different position from Anthropic, which has no open-source strategy, or Meta, which was bullish on open-source and then pulled back. Poolside AI’s Laguna XS2 — a 33B open-weight model currently free to use — and Mistral Medium 3.5, a 128B dense model released as open weights for remote agents, are examples of US-adjacent labs trying to compete in this space. But neither has the structural advantage Nvidia has.


What this means if you’re building on top of models

If you’re an AI builder choosing a model for a production workflow, the DeepSeek V4 pricing story has a few practical implications.

For high-volume, non-frontier tasks: DeepSeek V4 at $1.74/M input and $3.48/M output is worth evaluating seriously. Document processing, classification, summarization, structured extraction — these don’t require frontier intelligence. They require reliable, fast, cheap inference. DeepSeek V4 benchmarks well enough for most of these.

For agentic workflows: Output token cost matters more than input token cost in agentic loops, because agents generate a lot of text. At $3.48/M output vs $30/M for GPT-5.5, the economics of running a long agentic loop look very different. Platforms like MindStudio handle this orchestration across 200+ models, which means you can route different tasks to different models — frontier models for hard reasoning, cheaper models for high-volume sub-tasks — without rewriting your agent logic.
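In practice, the routing layer can be as simple as a lookup table. The sketch below is a hypothetical illustration of the pattern, not MindStudio's actual API; the model names and task labels are placeholders.

```python
# Hypothetical routing table: a sketch of the pattern, not a real API.
ROUTES = {
    "planning":       "gpt-5.5",      # rare, hard reasoning: worth $30/M output
    "summarization":  "deepseek-v4",  # high volume: $3.48/M output
    "extraction":     "deepseek-v4",
    "classification": "deepseek-v4",
}

def model_for(task: str) -> str:
    """Default to the cheap model; escalate only for listed frontier tasks."""
    return ROUTES.get(task, "deepseek-v4")
```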

For RAG and embeddings: If you’re building a retrieval-augmented system, the embedding model matters as much as the generation model. Qwen embedding models are worth considering for local or self-hosted RAG pipelines — they’re efficient, well-documented, and work well in agent contexts. The generation model can be DeepSeek V4 for cost, with Qwen embeddings handling retrieval.
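In code, the split looks roughly like this. The `embed` and `generate` callables are hypothetical stand-ins for whichever embedding and generation clients you use (for example, a Qwen embedding endpoint and the DeepSeek V4 API); only the shape of the pipeline is the point.

```python
import numpy as np

# Sketch of the retrieval/generation split: the embedding model ranks
# documents, the cheap generation model writes the answer.
def retrieve(query: str, docs: list[str], embed, top_k: int = 3) -> list[str]:
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v)
    q = unit(embed(query))  # cosine similarity via normalized dot product
    return sorted(docs, key=lambda d: float(unit(embed(d)) @ q), reverse=True)[:top_k]

def answer(query: str, docs: list[str], embed, generate) -> str:
    context = "\n\n".join(retrieve(query, docs, embed))
    return generate(f"Answer using this context:\n{context}\n\nQuestion: {query}")
```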

For privacy-sensitive workloads: The open-weight nature of DeepSeek V4 means self-hosting is possible. If your data can’t leave your infrastructure, this is a real option in a way that GPT-5.5 and Claude Opus 4.7 simply aren’t.

When you’re building the application layer on top of these models — not just calling an API but actually compiling a full-stack product — tools like Remy take a different approach: you write a spec in annotated markdown, and the full-stack application (TypeScript backend, SQLite database, auth, deployment) gets compiled from it. The spec is the source of truth; the model choice becomes a configuration detail rather than an architectural commitment.


The real failure modes to watch for

Censorship and content filtering. DeepSeek models are trained in China and have documented restrictions around politically sensitive topics. For most enterprise use cases this doesn’t matter, but it’s worth testing your specific use case before committing.

Latency from Chinese infrastructure. If you’re using DeepSeek’s own API, you may see higher latency than US-based providers depending on your location. Third-party hosts (Together AI, Fireworks, others) serve DeepSeek models from US infrastructure and can reduce this.

Model updates and versioning. Open-weight models don’t get silent updates the way closed APIs do. That’s mostly a feature — your production behavior is stable — but it also means you’re responsible for tracking when a newer version is worth migrating to.

The geopolitical risk is real but often overstated. The concern that Chinese open-weight models could contain subtle biases or backdoors is legitimate and worth taking seriously for sensitive applications. For most business applications, the practical risk is lower than the theoretical one, but it’s not zero. The Anthropic compute shortage that’s been tightening Claude quotas is a reminder that closed-source providers have their own reliability risks too.


Where this goes next

The export control story isn’t over. The US has continued tightening restrictions, and China has continued finding workarounds — both in hardware (domestic GPU development) and in software (algorithmic efficiency). The pattern of constraint producing efficiency is likely to repeat.

For the models available right now, the comparison worth watching is how Google Gemma 4’s open-weight Apache 2.0 model develops relative to DeepSeek V4. Gemma 4 is designed for local deployment and has a more permissive license, but it’s targeting a different capability tier. DeepSeek V4 is the only open-weight model currently competing with GPT-5.4-class performance at sub-$2/M input pricing.

The deeper question for anyone building AI products is whether the efficiency gap between US and Chinese models is temporary or structural. If it’s temporary — if US labs adopt MoE and similar techniques at scale — then the pricing gap closes. If it’s structural — if the CCP subsidy model means Chinese labs can always afford to train more efficiently and give weights away — then the US open-source problem Matthew Berman describes doesn’t have an easy fix.

Either way, the export controls that were meant to slow DeepSeek down ended up producing a model that’s cheaper to run than anything the US has released. That’s the actual story behind the $1.74 price tag.

Presented by MindStudio
