How to Run Claude Code with Cheaper Models: OpenRouter, NVIDIA NIM, and Ollama
Use Claude Code's interface with DeepSeek, Gemma, and other affordable models via proxy. Get 80–90% of Opus quality at 2–5% of the cost.
Why Paying Full Anthropic Prices for Every Claude Code Task Doesn’t Make Sense
Claude Code is one of the most capable AI coding assistants available. But if you’ve been running it exclusively on Claude Opus 4 or Sonnet, you’ve probably noticed the bills adding up fast — especially for routine tasks like code review, documentation, test writing, or basic refactoring.
The good news: Claude Code supports a configuration that lets you point it at any OpenAI-compatible API. That means you can run the same interface and workflows you’re used to, but route requests to significantly cheaper models — DeepSeek, Gemma 3, Llama, Mistral, and others — through providers like OpenRouter, NVIDIA NIM, or a local Ollama instance.
For many workloads, you’re looking at 80–90% of the output quality at 2–5% of the cost. This guide walks through the exact setup for each provider, which models work well, and where the tradeoffs actually matter.
How Claude Code’s Model Routing Works
Claude Code is Anthropic’s agentic coding tool — it runs in your terminal, reads and writes files, executes commands, and handles multi-step coding tasks with minimal hand-holding. By default, it calls Anthropic’s API directly.
But it exposes two environment variables that let you override that behavior:
- ANTHROPIC_BASE_URL — redirects API calls to any compatible endpoint
- ANTHROPIC_API_KEY — sends whatever key the proxy provider requires
When you set these, Claude Code behaves identically from a UX standpoint. The same commands, the same slash commands, the same context window behavior. What changes is where the inference actually happens — and what it costs.
This works because most proxy providers implement the Anthropic Messages API spec (or translate between OpenAI and Anthropic formats on the fly). Your requests go to the proxy, the proxy routes to the underlying model, and the response comes back in the format Claude Code expects.
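In practice, the override is just those two exports ahead of launching the CLI. A minimal sketch of the pattern, with placeholder values for the endpoint and key:
# Placeholder values; substitute your provider's endpoint and key
export ANTHROPIC_BASE_URL="https://your-proxy.example.com/v1"
export ANTHROPIC_API_KEY="your-provider-key"
claude    # same CLI, but inference now runs through the proxy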
What “Compatible” Actually Means
Not every model supports every Claude Code feature out of the box. Claude Code relies heavily on:
- Long context windows (often 32k–200k tokens for large codebases)
- Tool use / function calling
- Instruction-following with structured outputs
- Multi-turn conversation state
Models that lack solid tool-use support will fail on agentic tasks. Models with short context windows will lose track of large files. Keep both in mind when selecting a model for a specific use case.
Option 1: OpenRouter
OpenRouter is a unified API layer that gives you access to 200+ models from different providers — all through one endpoint, one API key, and one billing account. It’s the most flexible option and the easiest to get started with.
Setting Up OpenRouter with Claude Code
Step 1: Create an account and get your API key
Sign up at openrouter.ai, add credits, and grab your API key from the dashboard. You can start with $5–10 to test different models before committing.
Step 2: Set environment variables
In your terminal (or add to your shell profile):
export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
export ANTHROPIC_API_KEY="your-openrouter-api-key"
Step 3: Configure the model
OpenRouter uses model identifiers like deepseek/deepseek-r1 or google/gemma-3-27b-it. You can set this in your Claude Code configuration or pass it via the --model flag when supported.
Alternatively, create or edit ~/.claude.json and add:
{
"model": "deepseek/deepseek-r1"
}
Step 4: Run Claude Code as normal
claude
That’s it. Claude Code will now route through OpenRouter using whichever model you’ve specified.
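Before a long session, it's worth a quick sanity check that your key and chosen model respond outside of Claude Code. A minimal sketch against OpenRouter's OpenAI-style chat completions endpoint (the model name is just an example from the table below):
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek/deepseek-r1", "messages": [{"role": "user", "content": "Reply with OK"}]}'
A JSON response with a choices array means the key and model are live; an auth or model-not-found error here will show up inside Claude Code too.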
Best Models on OpenRouter for Claude Code Tasks
| Model | Best For | Approx. Cost vs Opus |
|---|---|---|
| DeepSeek R1 | Complex reasoning, debugging, architecture decisions | ~3% |
| DeepSeek V3 | General coding, faster responses | ~1% |
| Gemma 3 27B | Lightweight tasks, documentation, comments | ~0.5% |
| Llama 3.1 405B | Balanced quality/cost, general coding | ~5% |
| Mistral Large 2 | Code generation, structured output | ~8% |
| Qwen2.5-Coder 32B | Code-specific tasks, strong instruction-following | ~2% |
DeepSeek R1 is worth highlighting specifically. On coding benchmarks, it performs near or at Claude 3 Opus levels on many tasks — particularly algorithmic reasoning and debugging — at a fraction of the cost. For non-trivial code tasks, it’s the default recommendation.
OpenRouter-Specific Tips
- Use OpenRouter’s free tier models for low-stakes tasks like writing docstrings or formatting code. Several capable models are available at zero cost with rate limits.
- Enable fallback routing in your OpenRouter settings to automatically switch to a backup model if your primary is unavailable or overloaded.
- Check the model context window before using it on large codebases. Some cheaper models cap out at 8k or 16k tokens — fine for small files, problematic for whole-repo tasks.
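If you want to check context length without leaving the terminal, OpenRouter's public models listing includes a context length for every model. A sketch (field names follow OpenRouter's current API, so verify against their docs if this drifts):
curl -s https://openrouter.ai/api/v1/models \
  | jq '.data[] | select(.id == "deepseek/deepseek-r1") | {id, context_length}'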
Option 2: NVIDIA NIM
NVIDIA NIM (NVIDIA Inference Microservices) is NVIDIA’s hosted inference platform. It runs optimized versions of open-source models on NVIDIA’s own GPU infrastructure, with a focus on low latency and high throughput.
NIM is worth considering if:
- You’re already in the NVIDIA ecosystem (NGC account, enterprise contracts)
- You need consistent, low-latency responses for production workflows
- You want models that have been specifically optimized for NVIDIA hardware
Setting Up NVIDIA NIM
Step 1: Get an NGC API key
Sign up at build.nvidia.com (NVIDIA’s developer portal for NIM). Go to your profile, select “Generate Personal Key,” and copy it.
Step 2: Set environment variables
export ANTHROPIC_BASE_URL="https://integrate.api.nvidia.com/v1"
export ANTHROPIC_API_KEY="nvapi-your-key-here"
Step 3: Set your model
NVIDIA NIM uses model identifiers like:
- meta/llama-3.1-70b-instruct
- deepseek-ai/deepseek-r1
- google/gemma-3-27b-it
- mistralai/mistral-large-2-instruct
Update your ~/.claude.json:
{
"model": "meta/llama-3.1-70b-instruct"
}
Step 4: Verify the connection
Run a simple task in Claude Code to confirm requests are routing correctly. NVIDIA NIM returns standard API responses, so Claude Code should behave normally.
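The quickest way to verify outside Claude Code is a direct request to the same endpoint. A sketch against NIM's OpenAI-compatible chat completions route, using a model from the list above:
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Reply with OK"}]}'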
NIM-Specific Considerations
NVIDIA NIM optimizes inference using TensorRT-LLM, which means you typically get faster token generation than general-purpose cloud providers running the same model weights. For long coding sessions where latency matters, this is a real advantage.
The model selection on NIM is narrower than OpenRouter — you’re working with a curated set of open-source models rather than 200+ options. But the models available (Llama 3.1, Mistral, Gemma, DeepSeek) cover most practical use cases well.
NIM also offers a local deployment option for enterprise teams. You can run NIM microservices on your own NVIDIA GPU infrastructure, which is useful if you have data residency requirements or want to avoid sending code to third-party APIs entirely.
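For reference, a self-hosted NIM typically runs as a GPU-enabled container pulled from NVIDIA's registry. The sketch below is illustrative only; the exact image name, flags, and cache mounts vary by model and driver setup, so follow NVIDIA's NIM documentation for your hardware:
# Illustrative sketch; image path and flags depend on the model you deploy
docker run --rm --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
# The container serves an OpenAI-compatible API on port 8000, so
# ANTHROPIC_BASE_URL would point at http://localhost:8000/v1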
Option 3: Ollama (Fully Local)
Ollama runs open-source models entirely on your local machine. No API keys, no usage costs, no data leaving your computer. If you have capable hardware, this is the most private and cost-effective option.
Hardware Requirements
Running useful models locally requires meaningful GPU or unified memory:
- Minimum (small models, 7B–8B): 8GB VRAM or Apple Silicon M2 with 16GB RAM
- Recommended (mid-range, 13B–27B): 24GB VRAM or M3 Pro/Max with 36GB+ RAM
- For larger models (70B+): 48GB+ VRAM or multiple GPUs
On Apple Silicon, performance is particularly strong — Metal acceleration means M2/M3 Macs run 13B–27B models at practical speeds.
Setting Up Ollama with Claude Code
Step 1: Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Step 2: Pull a model
ollama pull deepseek-r1:14b
# or
ollama pull gemma3:27b
# or
ollama pull qwen2.5-coder:32b
Step 3: Start the Ollama server
ollama serve
By default, this runs on http://localhost:11434.
Step 4: Configure Claude Code
Ollama exposes an OpenAI-compatible API endpoint. Set:
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export ANTHROPIC_API_KEY="ollama"
The API key can be any non-empty string — Ollama doesn’t validate it.
Step 5: Set the model
{
"model": "deepseek-r1:14b"
}
Use the exact model name as it appears in ollama list.
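To confirm the server and model respond before launching Claude Code, hit Ollama's OpenAI-compatible endpoint directly (a quick sketch; substitute whichever model you pulled):
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1:14b", "messages": [{"role": "user", "content": "Reply with OK"}]}'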
Best Local Models for Coding Tasks
DeepSeek R1 (7B/14B/32B): Exceptional reasoning-to-size ratio. The 14B version runs well on most modern Macs and handles real coding tasks competently. The 32B version requires more RAM but approaches frontier model quality on many benchmarks.
Qwen2.5-Coder (7B/32B): Alibaba’s code-specific model family. Strong on code generation and instruction-following. The 7B version is fast enough for interactive use even on modest hardware.
Gemma 3 (4B/12B/27B): Google’s latest open-source family. Well-balanced for general coding tasks, solid instruction-following, and available in sizes that work on consumer hardware.
Llama 3.2/3.3: Meta’s general-purpose models. Good breadth of capability, wide community support, many fine-tunes available.
Ollama Tradeoffs
The obvious downside: local inference is slower than cloud inference unless you have high-end hardware. On a MacBook Pro M3 Max, a 14B model generates tokens at 30–60 tokens/second — usable but not as fast as API-based options.
The upside: your code never leaves your machine. For work involving proprietary codebases, client data, or anything under NDA, local inference removes a significant risk category entirely.
Choosing the Right Setup for Your Workflow
The right option depends on what you’re optimizing for:
Use OpenRouter if:
- You want the widest model selection
- You’re experimenting with different models to find the best fit
- You need access to the latest releases quickly
- You want a single API key and billing account for everything
Use NVIDIA NIM if:
- Low latency is critical for your workflow
- You’re in an enterprise environment with existing NVIDIA relationships
- You want production-grade reliability with SLAs
- You’re considering on-premises deployment
Use Ollama if:
- Privacy and data security are non-negotiable
- You have capable local hardware
- You want zero ongoing API costs
- You’re working offline or in air-gapped environments
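If you expect to hop between providers, a few shell helpers keep the switch to a single command. A sketch for your shell profile, built from the same exports shown above (function names are arbitrary, and the keys are placeholders):
# Add to ~/.zshrc or ~/.bashrc
use-openrouter() {
  export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
  export ANTHROPIC_API_KEY="your-openrouter-api-key"
}
use-nim() {
  export ANTHROPIC_BASE_URL="https://integrate.api.nvidia.com/v1"
  export ANTHROPIC_API_KEY="nvapi-your-key-here"
}
use-ollama() {
  export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
  export ANTHROPIC_API_KEY="ollama"
}
use-anthropic() {
  unset ANTHROPIC_BASE_URL            # fall back to Anthropic's own API
  export ANTHROPIC_API_KEY="your-anthropic-api-key"
}
Run one of these, update the model in ~/.claude.json if needed, and launch claude as usual.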
Cost Reality Check
To make the cost difference concrete: running Claude Opus 4 via Anthropic’s API costs roughly $15 per million input tokens and $75 per million output tokens. A heavy day of coding with Claude Code can easily consume 500k–2M tokens.
Compare that to:
- DeepSeek R1 via OpenRouter: ~$0.55 per million input tokens
- Llama 3.1 70B via NVIDIA NIM: ~$0.35 per million input tokens
- Local Ollama: $0 per token (hardware cost aside)
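To put numbers on it: a hypothetical heavy day of 1.5M input and 300k output tokens comes to roughly 1.5 × $15 + 0.3 × $75 ≈ $45 on Opus 4. The same day on DeepSeek R1 via OpenRouter, assuming ~$0.55 per million input tokens and a couple of dollars per million output tokens, lands around $1.50, in line with the ~3% figure in the table above.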
For teams running Claude Code at scale, the savings are significant. Even routing just routine tasks (test generation, docstrings, code review) to cheaper models while reserving Opus for complex architecture decisions can cut costs by 60–80%.
Common Issues and How to Fix Them
Tool Use Failures
Some open-source models don’t implement tool use correctly, causing Claude Code to fail on agentic tasks. If you see errors related to function calling or tool invocation, switch to a model with stronger tool-use support — DeepSeek R1, Llama 3.1 70B, and Qwen2.5-Coder all handle this reliably.
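One quick way to evaluate a candidate model outside Claude Code is to send a single OpenAI-style request with a tools array and see whether the reply contains a tool_calls entry instead of plain prose. A sketch against OpenRouter (substitute the model you're evaluating; the read_file function is made up purely for this test):
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-r1",
    "messages": [{"role": "user", "content": "Read the file src/main.py"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file from disk",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}
      }
    }]
  }'
If the response's message includes tool_calls with read_file and a path argument, the model can drive Claude Code's agentic features; if it answers in prose, expect failures on multi-step tasks.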
Context Window Cutoffs
If Claude Code seems to “forget” earlier parts of a conversation or loses track of large files, your model’s context window may be too short. Check the model’s documented context length before using it on large codebases. Prefer models with 32k+ context for anything beyond small projects.
Slow Local Inference
If local inference is too slow for interactive use, try:
- Dropping to a smaller model size (14B instead of 32B)
- Using quantized versions (Q4_K_M quantization is a good quality/speed balance; see the example below)
- Checking GPU utilization — on Apple Silicon, make sure Metal is active
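A quick way to see what you're already running and to step down (model names below are from earlier in this guide; quantized tag names vary, so check each model's page on ollama.com):
ollama show deepseek-r1:14b     # prints parameter count, context length, and quantization
ollama pull qwen2.5-coder:7b    # smaller model; fast enough for interactive use on modest hardware
# Explicitly quantized tags (e.g. ones ending in -q4_K_M) are listed on each model's page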
Response Format Errors
Some models don’t perfectly match the Anthropic API response format. If you see JSON parsing errors, check whether your proxy provider fully implements the Anthropic Messages API. OpenRouter and NVIDIA NIM both handle this well; some smaller providers don’t.
Rate Limiting
On free-tier or low-credit OpenRouter accounts, you may hit rate limits. Either add credits, switch to a model with a higher free rate limit, or implement brief pauses between requests using Claude Code’s settings.
Where MindStudio Fits Into This
If you’re routing Claude Code through cheaper models to reduce costs, you’re already thinking about running AI workflows efficiently. MindStudio takes that a step further for teams who want to deploy those workflows — not just run them locally.
MindStudio gives you access to 200+ AI models without needing to manage API keys, proxy configurations, or endpoint routing. You can build agents that call DeepSeek, Gemma, Llama, or any other model in a no-code visual builder, deploy them as real applications, and let MindStudio handle the infrastructure layer.
For developers who are already working with Claude Code and open-source models, the MindStudio Agent Skills Plugin is particularly relevant. It’s an npm SDK that lets Claude Code (and any other agent) call MindStudio’s 120+ typed capabilities — things like agent.searchGoogle(), agent.sendEmail(), or agent.runWorkflow() — as simple method calls. Your agent handles the reasoning; MindStudio handles the plumbing.
The result: you keep Claude Code’s powerful coding interface, you run it on whatever model makes sense for the task, and you connect it to real business tools without rebuilding that integration layer from scratch.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
Does Claude Code work the same way with non-Anthropic models?
Mostly yes, with some caveats. The core interface — terminal commands, file reading/writing, multi-turn conversations — works the same. Agentic features that rely on tool use may behave differently depending on how well the underlying model supports function calling. Models like DeepSeek R1, Llama 3.1 70B, and Qwen2.5-Coder handle this reliably. Smaller or older models may struggle with complex multi-step tasks.
Is it safe to send my code to OpenRouter or NVIDIA NIM?
Both providers have data policies you should review before using them with sensitive code. OpenRouter routes requests to third-party model providers, so your data may pass through multiple systems. NVIDIA NIM offers enterprise agreements with stronger data handling guarantees. For maximum privacy, use Ollama — your code stays on your machine entirely.
What’s the best cheap model for Claude Code right now?
For most coding tasks, DeepSeek R1 (via OpenRouter or locally via Ollama) delivers the best quality-to-cost ratio. It performs near frontier model levels on reasoning and debugging tasks at a tiny fraction of the cost. Qwen2.5-Coder is a strong alternative specifically for code generation and instruction-following. Gemma 3 27B is a good choice for lighter tasks where speed matters more than depth.
Can I use different models for different tasks in the same workflow?
Not natively within Claude Code itself — it uses whichever model you’ve configured. But you can manually switch models between sessions by updating your environment variables or ~/.claude.json. For more sophisticated model routing (e.g., cheap model for simple tasks, expensive model for complex ones), you’d need a proxy layer or a platform like MindStudio that supports conditional model selection within a workflow.
Does this work on Windows?
Yes. Set the environment variables in PowerShell ($env:ANTHROPIC_BASE_URL = "...") or add them to your system environment settings. Ollama has a native Windows installer. The Claude Code setup process is otherwise identical.
What happens if the proxy goes down?
Claude Code will return an API error and stop the current task. OpenRouter and NVIDIA NIM both have high uptime, but neither offers the same SLA as calling Anthropic directly. For critical production workflows, either use NVIDIA NIM’s enterprise tier or implement a fallback in your proxy configuration. Local Ollama has no external dependency, so it’s the most resilient option for offline or reliability-sensitive use.
Key Takeaways
- Claude Code supports full model routing via ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY — no code changes required
- OpenRouter gives you the widest model selection and easiest setup; NVIDIA NIM offers lower latency and enterprise options; Ollama provides full local privacy at zero per-token cost
- DeepSeek R1 and Qwen2.5-Coder are the strongest open-source alternatives for coding tasks, delivering near-frontier quality at 1–5% of Opus pricing
- Tool use support and context window length are the two most important factors when evaluating a model for Claude Code
- Routing routine tasks to cheaper models while reserving premium models for complex work typically cuts overall costs by 60–80% without meaningful quality loss
- If you want to go beyond local workflows and deploy these agents into real applications, MindStudio handles the model access and infrastructure layer so you can focus on building