How to Use Free Alternatives to Claude Code: OpenRouter, NVIDIA NIM, and Ollama
Run Claude Code's interface with DeepSeek, GLM-4.7, or local models via a free proxy. Get 80–90% of Opus quality at 2–5% of the cost.
The Real Cost of Claude Code — And What to Do About It
Claude Code is genuinely impressive. Drop it into a codebase, describe what you want, and it reasons through files, writes patches, and runs terminal commands with minimal hand-holding. But if you’ve checked your Anthropic bill after a heavy week of use, you already know the problem: Claude Opus-class models are expensive, and Claude Code burns through tokens fast.
The good news is that Claude Code’s architecture is more flexible than most users realize. It’s designed to route API calls through a configurable base URL — which means you can point it at OpenRouter, NVIDIA NIM, or a local Ollama instance running DeepSeek, GLM-4.7, or any other capable model. For many coding tasks, you can get 80–90% of Opus-level results at a fraction of the cost.
This guide covers exactly how to set that up, which models are worth using, and where the tradeoffs actually show up.
How Claude Code’s API Routing Works
Before setting up any alternative backend, it helps to understand what you’re actually changing.
Claude Code communicates with models through Anthropic’s API. By default, it uses https://api.anthropic.com as its base URL and requires a valid Anthropic API key. But two environment variables let you override both:
- ANTHROPIC_BASE_URL — the API endpoint Claude Code sends requests to
- ANTHROPIC_API_KEY — the key sent with each request (can be any provider's key when routing elsewhere)
When you set ANTHROPIC_BASE_URL to a compatible endpoint, Claude Code thinks it’s talking to Anthropic. The provider on the other end just needs to implement an Anthropic-compatible API — meaning it accepts requests in Anthropic’s message format and returns responses in the same shape.
OpenRouter and NVIDIA NIM both offer this compatibility. Ollama doesn’t natively, but a lightweight local proxy handles the translation.
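For a concrete sense of what "Anthropic-compatible" means, here is the request shape such an endpoint has to accept. A minimal sketch with curl; the base URL and model name are placeholders, and the exact path can vary by provider:

```bash
# Anthropic-style messages request. Any compatible backend must accept this shape.
# Base URL, key, and model are placeholders; check your provider's docs for the path.
curl -s "https://your-endpoint.example/v1/messages" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "deepseek/deepseek-chat",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Write a hello world in Python."}]
  }'
```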
Method 1: Using OpenRouter as a Claude Code Backend
OpenRouter aggregates dozens of model providers under a single API. It supports an Anthropic-compatible endpoint, which makes it the easiest drop-in replacement for Claude Code’s default routing.
Setting Up Your OpenRouter Account
Go to OpenRouter and create an account. You’ll get an API key from the dashboard. OpenRouter offers free credits for new accounts, and many of its hosted models are significantly cheaper than their Anthropic equivalents — or outright free with rate limits.
Configuring the Environment Variables
Set the following in your terminal session, .zshrc, .bashrc, or project-level .env:
export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
export ANTHROPIC_API_KEY="sk-or-your-openrouter-key"
Then launch Claude Code as normal. It will now send all requests to OpenRouter.
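If you'd rather not change your shell profile, you can scope the override to a single run instead:

```bash
# One-off invocation: the variables apply only to this command
ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1" \
ANTHROPIC_API_KEY="sk-or-your-openrouter-key" \
claude
```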
Choosing a Model
By default, Claude Code will attempt to use whatever model is configured. You can override the model with the --model flag:
claude --model deepseek/deepseek-chat
Or set it in your Claude Code config (an example follows the list below). OpenRouter uses the provider/model-name format. Some strong options for coding tasks:
- deepseek/deepseek-chat — DeepSeek V3, widely regarded as one of the best open-weight models for code
- deepseek/deepseek-r1 — reasoning-focused, good for complex debugging
- qwen/qwen-2.5-coder-32b-instruct — Alibaba's dedicated coding model, competitive with mid-tier Claude
- thudm/glm-4-32b — GLM-4 from Zhipu AI, strong multilingual and code performance
- google/gemini-2.5-flash — fast and cheap, surprisingly capable for routine code tasks
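To pin the model in configuration rather than pass --model on every launch, you can write it into Claude Code's settings file. A sketch, assuming the documented ~/.claude/settings.json layout; verify the key names against the current Claude Code docs:

```bash
# Write a user-level settings file that pins the model and base URL.
# Key names follow Claude Code's documented settings schema; verify before relying on them.
cat > ~/.claude/settings.json <<'EOF'
{
  "model": "deepseek/deepseek-chat",
  "env": {
    "ANTHROPIC_BASE_URL": "https://openrouter.ai/api/v1"
  }
}
EOF
```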
Cost Reality Check
DeepSeek V3 on OpenRouter runs at roughly $0.14 per million input tokens and $0.28 per million output tokens. Claude Opus 4 on Anthropic's API costs $15 per million input tokens (and $75 per million output). For tasks like refactoring, documentation, or writing tests — where the model doesn't need frontier-level reasoning — the cheaper option often performs nearly as well.
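As a worked example using those rates: a session that burns 1 million input tokens and 200,000 output tokens costs about $0.14 + $0.06 ≈ $0.20 on DeepSeek V3, while the input tokens alone would cost $15 on Opus, roughly 75 times more.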
Method 2: NVIDIA NIM for GPU-Accelerated Inference
NVIDIA NIM (NVIDIA Inference Microservices) provides cloud-hosted inference for a growing catalog of open models, optimized for NVIDIA hardware. It’s particularly relevant if you want consistent, low-latency responses for larger models.
Getting Access
NVIDIA offers NIM through its API catalog at build.nvidia.com. New accounts get free API credits to test available models. The platform supports models like Llama 3.1 405B, DeepSeek Coder V2, and several Mistral variants.
Configuring Claude Code for NIM
NVIDIA NIM uses an OpenAI-compatible API, not Anthropic’s format — so you can’t point Claude Code at it directly. You need a translation layer. The cleanest approach is litellm, an open-source proxy that converts between API formats.
Install it:
pip install litellm
Start a local proxy that converts Anthropic-format requests to OpenAI-format and forwards them to NVIDIA NIM:
litellm --model nvidia_nim/nvidia/llama-3.1-nemotron-70b-instruct \
--api_base https://integrate.api.nvidia.com/v1 \
--api_key "nvapi-your-key"
By default, litellm runs on port 4000. Then configure Claude Code:
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="anything"
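Before launching Claude Code, it's worth smoke-testing the proxy directly. A sketch, assuming a litellm version recent enough to expose the Anthropic-style /v1/messages route (older versions may only serve the OpenAI-style path):

```bash
# Send a minimal Anthropic-format request through the local proxy.
# If the translation layer works, you get a JSON completion back.
curl -s http://localhost:4000/v1/messages \
  -H "x-api-key: anything" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "nvidia_nim/nvidia/llama-3.1-nemotron-70b-instruct",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```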
When NIM Makes Sense
NIM is worth the extra setup step when:
- You want to run large models (70B+) without managing your own GPU infrastructure
- Latency consistency matters more than absolute cost
- You’re already working in an NVIDIA ecosystem
For purely price-sensitive use cases, OpenRouter is simpler.
Method 3: Running Local Models with Ollama
Ollama lets you run models entirely on your own hardware — no API costs, no data leaving your machine. This matters for teams working with proprietary codebases or anyone who wants to avoid per-token billing entirely.
Installing Ollama
Download Ollama from ollama.com and install it for your OS. Then pull a model:
ollama pull deepseek-coder-v2
Or for a lighter model on less powerful hardware:
ollama pull qwen2.5-coder:7b
Ollama runs a local server at http://localhost:11434 with an OpenAI-compatible API. Like NVIDIA NIM, it doesn’t speak Anthropic’s format natively.
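You can confirm the server is up before wiring anything else together. A quick check; the model name must match the tag you pulled:

```bash
# List the models available locally
ollama list

# Send a test completion through Ollama's OpenAI-compatible endpoint
curl -s http://localhost:11434/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "model": "deepseek-coder-v2",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```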
Bridging Ollama to Claude Code
Again, litellm handles the translation:
litellm --model ollama/deepseek-coder-v2 \
--api_base http://localhost:11434
With litellm running on port 4000:
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="local"
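Putting the pieces together, a small launcher script can start the proxy and Claude Code in one step. A sketch under the same assumptions as above (litellm's default port 4000, the model pulled earlier):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Start the litellm translation proxy in the background (defaults to port 4000)
litellm --model ollama/deepseek-coder-v2 \
  --api_base http://localhost:11434 &
PROXY_PID=$!
trap 'kill "$PROXY_PID" 2>/dev/null' EXIT

# Give the proxy a moment to start accepting connections
sleep 3

# Point Claude Code at the local proxy and launch it
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="local"
claude
```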
Alternatively, a purpose-built tool called claude-code-proxy (available on npm) provides a simpler wrapper specifically designed for this use case:
npx claude-code-proxy --backend ollama --model deepseek-coder-v2
Check the project’s documentation for current flags and options, since the interface evolves quickly.
Hardware Requirements
What you can run locally depends entirely on your machine:
| Model Size | Minimum VRAM | Suitable For |
|---|---|---|
| 7B | 6–8 GB VRAM | Autocomplete, simple tasks |
| 14B–20B | 12–16 GB VRAM | Most coding tasks |
| 32B | 20–24 GB VRAM | Complex reasoning |
| 70B+ | 40–80 GB VRAM | Near-frontier performance |
On Apple Silicon, Ollama uses unified memory, so an M2/M3 Pro or Max with 36GB+ RAM can run 32B models at usable speeds.
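If a model is slightly too large for your VRAM, a more aggressively quantized tag can make the difference. A hedged example; the exact tag name is illustrative, so browse ollama.com/library for what's actually published:

```bash
# Pull a 4-bit quantized variant to fit a 32B model into less VRAM.
# The tag shown here is an assumption; check the registry for current tags.
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
```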
Which Models Actually Work Well for Coding
Not all models perform equally inside Claude Code’s agentic loop. The model needs to handle multi-turn context, follow tool-use conventions, and produce structured edits. Here’s a practical breakdown:
DeepSeek V3 and DeepSeek Coder V2
The strongest open-weight choice for most users. DeepSeek V3 benchmarks competitively with Claude Sonnet on code generation, and DeepSeek Coder V2 was specifically trained on code. Both handle Claude Code’s tool-calling patterns reliably.
GLM-4 (Zhipu AI)
GLM-4 series models — available on OpenRouter as thudm/glm-4-32b and similar — are particularly strong on Chinese-language codebases and documentation. For international teams or multilingual projects, GLM-4 often outperforms equivalently-sized alternatives. The 32B variant handles complex, multi-file refactoring reasonably well.
Qwen 2.5 Coder
Alibaba’s coding-focused model series. The 32B instruct variant is competitive with Sonnet-class performance on many benchmarks. The smaller 7B and 14B versions are solid for constrained hardware.
Llama 3.1 and 3.3
Meta’s Llama models are broadly capable but not coding-specialized. They work, but for purely coding use cases, DeepSeek or Qwen variants tend to perform better at comparable sizes.
What to Avoid
Avoid models below 7B for anything beyond simple autocomplete inside Claude Code’s agentic workflows. The multi-step reasoning and file editing tasks Claude Code issues require a minimum level of context handling that smaller models struggle with.
Performance vs. Cost: What the Tradeoff Looks Like in Practice
Here’s an honest assessment of where alternative models fall short, and where they’re effectively equivalent:
Where Alternatives Match Claude Opus
- Writing new functions from clear specifications
- Adding tests for existing code
- Refactoring for readability
- Generating documentation and comments
- Fixing linter errors
- Implementing standard patterns (CRUD operations, API clients, etc.)
Where Claude Opus Still Wins
- Navigating very large, poorly documented codebases
- Debugging subtle concurrency or memory issues
- Architectural reasoning across many interdependent components
- Tasks requiring nuanced judgment about tradeoffs
For a typical development workflow — where you’re building on top of existing patterns rather than diagnosing deep system-level bugs — alternatives handle 80–90% of tasks well. The remaining 10–20% is where having Anthropic’s frontier model as a fallback still makes sense.
A practical approach: use a cheap model (DeepSeek V3 via OpenRouter) for routine tasks, and switch to Claude Opus only when you’re genuinely stuck.
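One way to make that hybrid approach frictionless is a pair of shell functions that pick the backend per invocation. A sketch; the OPENROUTER_API_KEY and ANTHROPIC_REAL_KEY variable names are placeholders for wherever you store your keys:

```bash
# Run Claude Code against the cheap OpenRouter backend (routine tasks)
claude-cheap() {
  ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1" \
  ANTHROPIC_API_KEY="$OPENROUTER_API_KEY" \
  claude --model deepseek/deepseek-chat "$@"
}

# Run Claude Code against Anthropic directly (hard problems)
claude-opus() {
  ANTHROPIC_BASE_URL="https://api.anthropic.com" \
  ANTHROPIC_API_KEY="$ANTHROPIC_REAL_KEY" \
  claude "$@"
}
```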
Troubleshooting Common Setup Issues
"Model not found" errors
Claude Code may pass the model name it’s configured with directly to the provider. If OpenRouter or your proxy doesn’t recognize the model string, you’ll get a 404 or “model not found” error. Double-check that the model ID matches exactly what the provider expects — including the provider/model-name format for OpenRouter.
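OpenRouter exposes its model catalog over a plain GET endpoint, which is a quick way to verify the exact ID. A sketch using jq (install it separately if you don't have it):

```bash
# List model IDs from OpenRouter and filter for the one you want
curl -s https://openrouter.ai/api/v1/models | jq -r '.data[].id' | grep -i deepseek
```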
Tool call failures
Some models don’t handle tool-use formatting correctly. If Claude Code hangs or returns malformed tool calls, try a different model. DeepSeek V3 and Qwen 2.5 Coder handle this more reliably than many alternatives.
Slow local inference
If Ollama is running on CPU instead of GPU, inference will be unusably slow for anything above 7B. Verify GPU usage with ollama ps and check that your CUDA or Metal drivers are current.
Context window limitations
Claude Code can accumulate very long contexts during complex multi-step tasks. Some models or providers cap context at 32K tokens. If you're hitting cutoffs mid-task, check your provider's context limits and consider a model with a larger window.
Authentication errors with proxy
When using litellm as a local proxy, ANTHROPIC_API_KEY can be any string — it’s not validated against Anthropic’s servers. But it can’t be empty. Set it to something like "local" or "not-used" if you’re proxying to Ollama.
Where MindStudio Fits Into This Picture
Claude Code with alternative backends solves the cost problem for coding tasks. But coding is only one part of most real workflows. You still need to connect your code outputs to other systems — send results to Slack, update records in Airtable, trigger downstream processes, or expose your agent logic to other tools.
That’s where MindStudio comes in. MindStudio gives you access to 200+ AI models in a single platform — including Claude, DeepSeek, Gemini, and Qwen variants — with no separate API keys or accounts required. You can build agents that combine model calls with real integrations: Google Workspace, HubSpot, Notion, Salesforce, and 1,000+ other tools.
For developers already running Claude Code with custom backends, MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) is particularly relevant. It lets your agents call capabilities like agent.sendEmail(), agent.searchGoogle(), or agent.runWorkflow() as simple method calls, without managing the infrastructure for each integration separately.
If you’re evaluating alternative AI backends specifically to reduce costs, MindStudio’s model access can extend that same logic to your broader automation stack — one platform, many models, no per-API-key overhead. You can try it free at mindstudio.ai.
Frequently Asked Questions
Can Claude Code officially use non-Anthropic models?
Claude Code was built by Anthropic for Anthropic's models, and the official documentation doesn't cover third-party model support. However, the ANTHROPIC_BASE_URL environment variable is functional and widely used in the developer community to route requests to alternative providers. The variable itself is a supported configuration mechanism, even if routing to non-Anthropic models isn't officially endorsed by Anthropic.
Is it safe to use a local Ollama model with Claude Code?
Yes — and for proprietary codebases, it’s often preferable. When running Ollama locally, no code or prompts leave your machine. The litellm proxy runs locally as well, so the entire chain stays on-device. Just ensure Ollama’s local server isn’t exposed to external network interfaces.
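To double-check that exposure, you can confirm Ollama is only listening on the loopback interface. A sketch; the OLLAMA_HOST environment variable controls the bind address:

```bash
# Show what address the Ollama server is listening on (default is 127.0.0.1:11434)
lsof -iTCP:11434 -sTCP:LISTEN

# If you've previously exposed it, rebind to loopback only and restart
export OLLAMA_HOST=127.0.0.1:11434
ollama serve
```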
Does OpenRouter have free models that work with Claude Code?
OpenRouter lists several free-tier models, including some Llama variants and smaller Mistral models. These have rate limits but no per-token cost. They’re adequate for light use and testing the setup before committing to paid models.
Will Claude Code’s agentic features work with alternative models?
Most features work, but reliability varies by model. File editing, bash command execution, and multi-step task planning all depend on the model following structured output conventions correctly. DeepSeek V3 and Qwen 2.5 Coder handle this reliably. Smaller or less instruction-tuned models may fail at tool-use steps.
What is GLM-4.7 and how does it compare for coding?
GLM-4 is a model series from Zhipu AI, a Beijing-based AI lab. The 4.7 designation refers to a specific variant in that series. Zhipu’s models are competitive on code benchmarks, particularly for tasks involving Chinese-language documentation or codebases. On OpenRouter, GLM-4 variants are available under the thudm provider prefix. For pure English coding tasks, DeepSeek typically edges it out; for multilingual work, GLM-4 is often the better choice.
How much can I actually save using alternative backends?
The cost difference is substantial. Running Claude Code heavily (say, 2–3 million tokens per week) on Opus costs $30–$45 per week in API fees at standard pricing. The same volume on DeepSeek V3 via OpenRouter costs under $1. For Ollama, the marginal cost is effectively zero after hardware. Even accounting for Ollama’s setup time, teams doing frequent coding sessions recover the effort in the first week.
Key Takeaways
- Claude Code supports custom API endpoints via ANTHROPIC_BASE_URL, enabling third-party model backends
- OpenRouter is the easiest setup — one environment variable change routes all requests, with 50+ capable models available
- NVIDIA NIM and Ollama both require a litellm or similar proxy to handle API format translation
- DeepSeek V3, Qwen 2.5 Coder, and GLM-4 are the strongest alternatives for coding tasks
- For 80–90% of routine coding work, alternative models perform comparably to Claude Opus at 2–5% of the cost
- A hybrid approach — cheap model for routine tasks, Opus for hard problems — gives the best practical balance
- MindStudio extends this cost-efficiency logic to your full automation stack, with 200+ models and integrations in one place without managing separate API accounts