How to Run Claude Code with Cheaper Models: OpenRouter, NVIDIA NIM, and Ollama
Use Claude Code's interface with DeepSeek, Gemma, and other affordable models via proxy. Get 80–90% of Opus quality at 2–5% of the cost.
Why Paying Full Anthropic Prices for Every Claude Code Task Doesn’t Make Sense
Claude Code is one of the most capable AI coding assistants available. But if you’ve been running it exclusively on Claude Opus 4 or Sonnet, you’ve probably noticed the bills adding up fast — especially for routine tasks like code review, documentation, test writing, or basic refactoring.
The good news: Claude Code supports a configuration that lets you point it at any OpenAI-compatible API. That means you can run the same interface and workflows you’re used to, but route requests to significantly cheaper models — DeepSeek, Gemma 3, Llama, Mistral, and others — through providers like OpenRouter, NVIDIA NIM, or a local Ollama instance.
For many workloads, you’re looking at 80–90% of the output quality at 2–5% of the cost. This guide walks through the exact setup for each provider, which models work well, and where the tradeoffs actually matter.
How Claude Code’s Model Routing Works
Claude Code is Anthropic’s agentic coding tool — it runs in your terminal, reads and writes files, executes commands, and handles multi-step coding tasks with minimal hand-holding. By default, it calls Anthropic’s API directly.
But it exposes two environment variables that let you override that behavior:
- ANTHROPIC_BASE_URL — redirects API calls to any compatible endpoint
- ANTHROPIC_API_KEY — sends whatever key the proxy provider requires
When you set these, Claude Code behaves identically from a UX standpoint. The same commands, the same slash commands, the same context window behavior. What changes is where the inference actually happens — and what it costs.
This works because most proxy providers implement the Anthropic Messages API spec (or translate between OpenAI and Anthropic formats on the fly). Your requests go to the proxy, the proxy routes to the underlying model, and the response comes back in the format Claude Code expects.
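In practice, the override is just those two exports ahead of launching the CLI. A minimal sketch of the pattern, with placeholder values for the endpoint and key:
# Placeholder values; substitute your provider's endpoint and key
export ANTHROPIC_BASE_URL="https://your-proxy.example.com/v1"
export ANTHROPIC_API_KEY="your-provider-key"
claude    # same CLI, but inference now runs through the proxy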
What “Compatible” Actually Means
Not every model supports every Claude Code feature out of the box. Claude Code relies heavily on:
- Long context windows (often 32k–200k tokens for large codebases)
- Tool use / function calling
- Instruction-following with structured outputs
- Multi-turn conversation state
Models that lack solid tool-use support will fail on agentic tasks. Models with short context windows will lose track of large files. Keep both in mind when selecting a model for a specific use case.
Option 1: OpenRouter
OpenRouter is a unified API layer that gives you access to 200+ models from different providers — all through one endpoint, one API key, and one billing account. It’s the most flexible option and the easiest to get started with.
Setting Up OpenRouter with Claude Code
Step 1: Create an account and get your API key
Sign up at openrouter.ai, add credits, and grab your API key from the dashboard. You can start with $5–10 to test different models before committing.
Step 2: Set environment variables
In your terminal (or add to your shell profile):
export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
export ANTHROPIC_API_KEY="your-openrouter-api-key"
Step 3: Configure the model
OpenRouter uses model identifiers like deepseek/deepseek-r1 or google/gemma-3-27b-it. You can set this in your Claude Code configuration or pass it via the --model flag when supported.
Alternatively, create or edit ~/.claude.json and add:
{
"model": "deepseek/deepseek-r1"
}
Step 4: Run Claude Code as normal
claude
That’s it. Claude Code will now route through OpenRouter using whichever model you’ve specified.
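Before a long session, it's worth a quick sanity check that your key and chosen model respond outside of Claude Code. A minimal sketch against OpenRouter's OpenAI-style chat completions endpoint (the model name is just an example from the table below):
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek/deepseek-r1", "messages": [{"role": "user", "content": "Reply with OK"}]}'
A JSON response with a choices array means the key and model are live; an auth or model-not-found error here will show up inside Claude Code too.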
Best Models on OpenRouter for Claude Code Tasks
| Model | Best For | Approx. Cost vs Opus |
|---|---|---|
| DeepSeek R1 | Complex reasoning, debugging, architecture decisions | ~3% |
| DeepSeek V3 | General coding, faster responses | ~1% |
| Gemma 3 27B | Lightweight tasks, documentation, comments | ~0.5% |
| Llama 3.1 405B | Balanced quality/cost, general coding | ~5% |
| Mistral Large 2 | Code generation, structured output | ~8% |
| Qwen2.5-Coder 32B | Code-specific tasks, strong instruction-following | ~2% |
DeepSeek R1 is worth highlighting specifically. On coding benchmarks, it performs near or at Claude 3 Opus levels on many tasks — particularly algorithmic reasoning and debugging — at a fraction of the cost. For non-trivial code tasks, it’s the default recommendation.
OpenRouter-Specific Tips
- Use OpenRouter’s free tier models for low-stakes tasks like writing docstrings or formatting code. Several capable models are available at zero cost with rate limits.
- Enable fallback routing in your OpenRouter settings to automatically switch to a backup model if your primary is unavailable or overloaded.
- Check the model context window before using it on large codebases. Some cheaper models cap out at 8k or 16k tokens — fine for small files, problematic for whole-repo tasks.
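If you want to check context length without leaving the terminal, OpenRouter's public models listing includes a context length for every model. A sketch (field names follow OpenRouter's current API, so verify against their docs if this drifts):
curl -s https://openrouter.ai/api/v1/models \
  | jq '.data[] | select(.id == "deepseek/deepseek-r1") | {id, context_length}'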
Option 2: NVIDIA NIM
NVIDIA NIM (NVIDIA Inference Microservices) is NVIDIA’s hosted inference platform. It runs optimized versions of open-source models on NVIDIA’s own GPU infrastructure, with a focus on low latency and high throughput.
NIM is worth considering if:
- You’re already in the NVIDIA ecosystem (NGC account, enterprise contracts)
- You need consistent, low-latency responses for production workflows
- You want models that have been specifically optimized for NVIDIA hardware
Setting Up NVIDIA NIM
Step 1: Get an NGC API key
Sign up at build.nvidia.com (NVIDIA’s developer portal for NIM). Go to your profile, select “Generate Personal Key,” and copy it.
Step 2: Set environment variables
export ANTHROPIC_BASE_URL="https://integrate.api.nvidia.com/v1"
export ANTHROPIC_API_KEY="nvapi-your-key-here"
Step 3: Set your model
NVIDIA NIM uses model identifiers like:
- meta/llama-3.1-70b-instruct
- deepseek-ai/deepseek-r1
- google/gemma-3-27b-it
- mistralai/mistral-large-2-instruct
Update your ~/.claude.json:
{
"model": "meta/llama-3.1-70b-instruct"
}
Step 4: Verify the connection
Run a simple task in Claude Code to confirm requests are routing correctly. NVIDIA NIM returns standard API responses, so Claude Code should behave normally.
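The quickest way to verify outside Claude Code is a direct request to the same endpoint. A sketch against NIM's OpenAI-compatible chat completions route, using a model from the list above:
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Reply with OK"}]}'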
NIM-Specific Considerations
NVIDIA NIM optimizes inference using TensorRT-LLM, which means you typically get faster token generation than general-purpose cloud providers running the same model weights. For long coding sessions where latency matters, this is a real advantage.
The model selection on NIM is narrower than OpenRouter — you’re working with a curated set of open-source models rather than 200+ options. But the models available (Llama 3.1, Mistral, Gemma, DeepSeek) cover most practical use cases well.
NIM also offers a local deployment option for enterprise teams. You can run NIM microservices on your own NVIDIA GPU infrastructure, which is useful if you have data residency requirements or want to avoid sending code to third-party APIs entirely.
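For reference, a self-hosted NIM typically runs as a GPU-enabled container pulled from NVIDIA's registry. The sketch below is illustrative only; the exact image name, flags, and cache mounts vary by model and driver setup, so follow NVIDIA's NIM documentation for your hardware:
# Illustrative sketch; image path and flags depend on the model you deploy
docker run --rm --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
# The container serves an OpenAI-compatible API on port 8000, so
# ANTHROPIC_BASE_URL would point at http://localhost:8000/v1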
Option 3: Ollama (Fully Local)
Ollama runs open-source models entirely on your local machine. No API keys, no usage costs, no data leaving your computer. If you have capable hardware, this is the most private and cost-effective option.
Hardware Requirements
Running useful models locally requires meaningful GPU or unified memory:
- Minimum (small models, 7B–8B): 8GB VRAM or Apple Silicon M2 with 16GB RAM
- Recommended (mid-range, 13B–27B): 24GB VRAM or M3 Pro/Max with 36GB+ RAM
- For larger models (70B+): 48GB+ VRAM or multiple GPUs
On Apple Silicon, performance is particularly strong — Metal acceleration means M2/M3 Macs run 13B–27B models at practical speeds.
Setting Up Ollama with Claude Code
Step 1: Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Step 2: Pull a model
ollama pull deepseek-r1:14b
# or
ollama pull gemma3:27b
# or
ollama pull qwen2.5-coder:32b
Step 3: Start the Ollama server
ollama serve
By default, this runs on http://localhost:11434.
Step 4: Configure Claude Code
Ollama exposes an OpenAI-compatible API endpoint. Set:
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export ANTHROPIC_API_KEY="ollama"
The API key can be any non-empty string — Ollama doesn’t validate it.
Step 5: Set the model
{
"model": "deepseek-r1:14b"
}
Use the exact model name as it appears in ollama list.
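To confirm the server and model respond before launching Claude Code, hit Ollama's OpenAI-compatible endpoint directly (a quick sketch; substitute whichever model you pulled):
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1:14b", "messages": [{"role": "user", "content": "Reply with OK"}]}'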
Best Local Models for Coding Tasks
DeepSeek R1 (7B/14B/32B): Exceptional reasoning-to-size ratio. The 14B version runs well on most modern Macs and handles real coding tasks competently. The 32B version requires more RAM but approaches frontier model quality on many benchmarks.
Qwen2.5-Coder (7B/32B): Alibaba’s code-specific model family. Strong on code generation and instruction-following. The 7B version is fast enough for interactive use even on modest hardware.
Gemma 3 (4B/12B/27B): Google’s latest open-source family. Well-balanced for general coding tasks, solid instruction-following, and available in sizes that work on consumer hardware.
Llama 3.2/3.3: Meta’s general-purpose models. Good breadth of capability, wide community support, many fine-tunes available.
Ollama Tradeoffs
The obvious downside: local inference is slower than cloud inference unless you have high-end hardware. On a MacBook Pro M3 Max, a 14B model generates tokens at 30–60 tokens/second — usable but not as fast as API-based options.
The upside: your code never leaves your machine. For work involving proprietary codebases, client data, or anything under NDA, local inference removes a significant risk category entirely.
Choosing the Right Setup for Your Workflow
The right option depends on what you’re optimizing for:
Use OpenRouter if:
- You want the widest model selection
- You’re experimenting with different models to find the best fit
- You need access to the latest releases quickly
- You want a single API key and billing account for everything
Use NVIDIA NIM if:
- Low latency is critical for your workflow
- You’re in an enterprise environment with existing NVIDIA relationships
- You want production-grade reliability with SLAs
- You’re considering on-premises deployment
Use Ollama if:
- Privacy and data security are non-negotiable
- You have capable local hardware
- You want zero ongoing API costs
- You’re working offline or in air-gapped environments
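If you expect to hop between providers, a few shell helpers keep the switch to a single command. A sketch for your shell profile, built from the same exports shown above (function names are arbitrary, and the keys are placeholders):
# Add to ~/.zshrc or ~/.bashrc
use-openrouter() {
  export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
  export ANTHROPIC_API_KEY="your-openrouter-api-key"
}
use-nim() {
  export ANTHROPIC_BASE_URL="https://integrate.api.nvidia.com/v1"
  export ANTHROPIC_API_KEY="nvapi-your-key-here"
}
use-ollama() {
  export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
  export ANTHROPIC_API_KEY="ollama"
}
use-anthropic() {
  unset ANTHROPIC_BASE_URL            # fall back to Anthropic's own API
  export ANTHROPIC_API_KEY="your-anthropic-api-key"
}
Run one of these, update the model in ~/.claude.json if needed, and launch claude as usual.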
Cost Reality Check
To make the cost difference concrete: running Claude Opus 4 via Anthropic’s API costs roughly $15 per million input tokens and $75 per million output tokens. A heavy day of coding with Claude Code can easily consume 500k–2M tokens.
Compare that to:
- DeepSeek R1 via OpenRouter: ~$0.55 per million input tokens
- Llama 3.1 70B via NVIDIA NIM: ~$0.35 per million input tokens
- Local Ollama: $0 per token (hardware cost aside)
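To put numbers on it: a hypothetical heavy day of 1.5M input and 300k output tokens comes to roughly 1.5 × $15 + 0.3 × $75 ≈ $45 on Opus 4. The same day on DeepSeek R1 via OpenRouter, assuming ~$0.55 per million input tokens and a couple of dollars per million output tokens, lands around $1.50, in line with the ~3% figure in the table above.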
For teams running Claude Code at scale, the savings are significant. Even routing just routine tasks (test generation, docstrings, code review) to cheaper models while reserving Opus for complex architecture decisions can cut costs by 60–80%.
Common Issues and How to Fix Them
Tool Use Failures
Some open-source models don’t implement tool use correctly, causing Claude Code to fail on agentic tasks. If you see errors related to function calling or tool invocation, switch to a model with stronger tool-use support — DeepSeek R1, Llama 3.1 70B, and Qwen2.5-Coder all handle this reliably.
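One quick way to evaluate a candidate model outside Claude Code is to send a single OpenAI-style request with a tools array and see whether the reply contains a tool_calls entry instead of plain prose. A sketch against OpenRouter (substitute the model you're evaluating; the read_file function is made up purely for this test):
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-r1",
    "messages": [{"role": "user", "content": "Read the file src/main.py"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file from disk",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}
      }
    }]
  }'
If the response's message includes tool_calls with read_file and a path argument, the model can drive Claude Code's agentic features; if it answers in prose, expect failures on multi-step tasks.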
Context Window Cutoffs
If Claude Code seems to “forget” earlier parts of a conversation or loses track of large files, your model’s context window may be too short. Check the model’s documented context length before using it on large codebases. Prefer models with 32k+ context for anything beyond small projects.
Slow Local Inference
If local inference is too slow for interactive use, try:
- Dropping to a smaller model size (14B instead of 32B)
- Using quantized versions (Q4_K_M quantization is a good quality/speed balance; see the example below)
- Checking GPU utilization — on Apple Silicon, make sure Metal is active
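A quick way to see what you're already running and to step down (model names below are from earlier in this guide; quantized tag names vary, so check each model's page on ollama.com):
ollama show deepseek-r1:14b     # prints parameter count, context length, and quantization
ollama pull qwen2.5-coder:7b    # smaller model; fast enough for interactive use on modest hardware
# Explicitly quantized tags (e.g. ones ending in -q4_K_M) are listed on each model's page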
Response Format Errors
Some models don’t perfectly match the Anthropic API response format. If you see JSON parsing errors, check whether your proxy provider fully implements the Anthropic Messages API. OpenRouter and NVIDIA NIM both handle this well; some smaller providers don’t.
Rate Limiting
On free-tier or low-credit OpenRouter accounts, you may hit rate limits. Either add credits, switch to a model with a higher free rate limit, or implement brief pauses between requests using Claude Code’s settings.
Where MindStudio Fits Into This
If you’re routing Claude Code through cheaper models to reduce costs, you’re already thinking about running AI workflows efficiently. MindStudio takes that a step further for teams who want to deploy those workflows — not just run them locally.
MindStudio gives you access to 200+ AI models without needing to manage API keys, proxy configurations, or endpoint routing. You can build agents that call DeepSeek, Gemma, Llama, or any other model in a no-code visual builder, deploy them as real applications, and let MindStudio handle the infrastructure layer.
For developers who are already working with Claude Code and open-source models, the MindStudio Agent Skills Plugin is particularly relevant. It’s an npm SDK that lets Claude Code (and any other agent) call MindStudio’s 120+ typed capabilities — things like agent.searchGoogle(), agent.sendEmail(), or agent.runWorkflow() — as simple method calls. Your agent handles the reasoning; MindStudio handles the plumbing.
The result: you keep Claude Code’s powerful coding interface, you run it on whatever model makes sense for the task, and you connect it to real business tools without rebuilding that integration layer from scratch.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
Does Claude Code work the same way with non-Anthropic models?
Mostly yes, with some caveats. The core interface — terminal commands, file reading/writing, multi-turn conversations — works the same. Agentic features that rely on tool use may behave differently depending on how well the underlying model supports function calling. Models like DeepSeek R1, Llama 3.1 70B, and Qwen2.5-Coder handle this reliably. Smaller or older models may struggle with complex multi-step tasks.
Is it safe to send my code to OpenRouter or NVIDIA NIM?
Both providers have data policies you should review before using them with sensitive code. OpenRouter routes requests to third-party model providers, so your data may pass through multiple systems. NVIDIA NIM offers enterprise agreements with stronger data handling guarantees. For maximum privacy, use Ollama — your code stays on your machine entirely.
What’s the best cheap model for Claude Code right now?
For most coding tasks, DeepSeek R1 (via OpenRouter or locally via Ollama) delivers the best quality-to-cost ratio. It performs near frontier model levels on reasoning and debugging tasks at a tiny fraction of the cost. Qwen2.5-Coder is a strong alternative specifically for code generation and instruction-following. Gemma 3 27B is a good choice for lighter tasks where speed matters more than depth.
Can I use different models for different tasks in the same workflow?
Not natively within Claude Code itself — it uses whichever model you’ve configured. But you can manually switch models between sessions by updating your environment variables or ~/.claude.json. For more sophisticated model routing (e.g., cheap model for simple tasks, expensive model for complex ones), you’d need a proxy layer or a platform like MindStudio that supports conditional model selection within a workflow.
Does this work on Windows?
Yes. Set the environment variables in PowerShell ($env:ANTHROPIC_BASE_URL = "...") or add them to your system environment settings. Ollama has a native Windows installer. The Claude Code setup process is otherwise identical.
What happens if the proxy goes down?
Claude Code will return an API error and stop the current task. OpenRouter and NVIDIA NIM both have high uptime, but neither offers the same SLA as calling Anthropic directly. For critical production workflows, either use NVIDIA NIM’s enterprise tier or implement a fallback in your proxy configuration. Local Ollama has no external dependency, so it’s the most resilient option for offline or reliability-sensitive use.
Key Takeaways
- Claude Code supports full model routing via ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY — no code changes required
- OpenRouter gives you the widest model selection and easiest setup; NVIDIA NIM offers lower latency and enterprise options; Ollama provides full local privacy at zero per-token cost
- DeepSeek R1 and Qwen2.5-Coder are the strongest open-source alternatives for coding tasks, delivering near-frontier quality at 1–5% of Opus pricing
- Tool use support and context window length are the two most important factors when evaluating a model for Claude Code
- Routing routine tasks to cheaper models while reserving premium models for complex work typically cuts overall costs by 60–80% without meaningful quality loss
- If you want to go beyond local workflows and deploy these agents into real applications, MindStudio handles the model access and infrastructure layer so you can focus on building