How to Run Gemma 4 Locally with Ollama: Step-by-Step Setup Guide
Learn how to download and run Google's Gemma 4 locally using Ollama, check VRAM requirements, and connect it to Claude Code for free.
Why Running Gemma 4 Locally Actually Makes Sense
Google’s Gemma 4 is one of the most capable open-weight model families available right now. And with Ollama, you can run Gemma 4 entirely on your own machine — no API keys, no usage costs, no data leaving your system.
This guide walks through the full setup: installing Ollama, choosing the right Gemma 4 variant for your hardware, running it locally, and optionally connecting it to Claude Code so you get a powerful coding assistant without paying per token.
Whether you’re a developer who wants a private inference endpoint or someone experimenting with local AI, this is one of the cleaner setups you can run today.
What Is Gemma 4 and What Can It Do
Gemma 4 is Google’s fourth generation of open-weight models, designed to run efficiently on consumer hardware while delivering performance that rivals much larger closed models.
The Gemma 4 family includes several size variants — from compact 1B and 4B parameter models to the more capable 12B and 27B versions. The instruction-tuned variants (labeled -it) are what you’ll want for interactive use: they follow instructions and handle multi-turn conversations well.
Key capabilities across the family:
- Multimodal input — the larger variants support image understanding alongside text
- Long context windows — up to 128K tokens depending on the model
- Strong coding and reasoning — particularly notable in the 12B and 27B sizes
- Apache 2.0 license — meaning commercial use is permitted
For most local setups, the 4B or 12B variants hit the best balance between performance and hardware requirements. The 27B model is excellent but demands more GPU memory.
VRAM and Hardware Requirements
Before pulling any model, you need to know whether your hardware can handle it. Running a model entirely in VRAM gives you the fastest inference. You can fall back to CPU or split across CPU/GPU, but it’s significantly slower.
Here’s a practical breakdown:
| Model Variant | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| gemma4:1b | 2 GB | 4 GB | Runs fine on integrated GPU or CPU |
| gemma4:4b | 4 GB | 6 GB | Good for most laptops with discrete GPU |
| gemma4:12b | 8 GB | 12 GB | Best balance for mid-range GPUs |
| gemma4:27b | 16 GB | 24 GB | Needs a high-end GPU or quantized weights |
These figures assume you’re using the default quantization that Ollama ships with (usually Q4_K_M). If you want higher quality output, pulling a Q8 variant will roughly double the memory requirements, and a full-precision (FP16) variant will roughly triple to quadruple them.
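You can sanity-check these numbers yourself from the parameter count and quantization level. The sketch below is a rule of thumb, not Ollama internals: it assumes roughly 4.5 bits per weight for Q4_K_M and a flat 20% overhead for the KV cache and runtime buffers, both of which are approximations.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate: quantized weight size plus ~20% overhead
    for KV cache and runtime buffers (both figures are assumptions)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Q4_K_M averages roughly 4.5 bits per weight; Q8 is closer to 8.5
for size in (1, 4, 12, 27):
    print(f"{size:>2}B @ Q4_K_M: ~{estimate_memory_gb(size, 4.5):.1f} GB")
```

For the 12B model this lands around 8 GB, which matches the minimum in the table above; the 27B estimate of roughly 18 GB explains why 16 GB cards need tighter quantization.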
Running Without a GPU
Ollama can run models on CPU-only setups. It’s slower — expect several seconds per token on a 12B model — but it works. If you’re on a machine with no discrete GPU or less than 4 GB VRAM, stick with the 1B or 4B variant to get usable response times.
Apple Silicon Macs are an exception here. The unified memory architecture means M1/M2/M3 machines handle larger models surprisingly well. A MacBook Pro with 32 GB unified memory can run the 27B model at reasonable speeds.
Install Ollama
Ollama handles model management, inference, and a local API endpoint. Installation is straightforward across all platforms.
macOS and Linux
Open a terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
On macOS, you can also install via Homebrew:
brew install ollama
After installation, start the Ollama service:
ollama serve
This launches the local API server at http://localhost:11434. On macOS, the app also runs as a menu bar item and starts automatically on login.
Windows
Download the installer from ollama.com. Run the .exe, and Ollama will install and start as a background service. No additional configuration is needed.
Verify the Installation
Check that Ollama is running:
ollama --version
You should see the version number printed. If the service isn’t running yet, start it with ollama serve in a terminal.
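You can also verify the server side rather than the CLI: Ollama’s root endpoint answers with a plain "Ollama is running" message when the service is up. A minimal Python health check, assuming the default port 11434 (adjust the URL if you changed it):

```python
import urllib.request
from urllib.error import URLError

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print("Ollama running:", ollama_is_up())
```

If this prints False, start the service with ollama serve and run it again.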
Download and Run Gemma 4
With Ollama installed, pulling a model is a single command. Ollama handles downloading the weights and setting up everything needed to run inference.
Pull the Model
Choose the variant that fits your hardware:
# Compact — good for most hardware
ollama pull gemma4:4b
# Mid-range — best for 8-12 GB VRAM
ollama pull gemma4:12b
# High-end — 16+ GB VRAM recommended
ollama pull gemma4:27b
The download size ranges from around 2.5 GB for the 4B model to roughly 17 GB for the 27B. Depending on your connection speed, expect the first pull to take several minutes.
Run an Interactive Session
Once downloaded, start a chat session directly in your terminal:
ollama run gemma4:12b
You’ll see a prompt where you can type messages. This is useful for quick testing to confirm everything is working before connecting any other tools.
To exit the session, type /bye or press Ctrl+D.
Test via the API
Ollama exposes an OpenAI-compatible REST API. You can test it with curl:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:12b",
"messages": [
{"role": "user", "content": "Explain how transformers work in two sentences."}
]
}'
If you get a JSON response back with a completion, your local Gemma 4 endpoint is working.
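The same endpoint works from code. Here is a minimal Python client using only the standard library; the URL and model name mirror the curl example above, while the helper names themselves are just illustrative choices:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for Ollama."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def chat(model: str, prompt: str) -> str:
    """Send the request and extract the completion text."""
    req = build_request(model, prompt)
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # OpenAI-compatible responses put the text under choices[0].message.content
    return data["choices"][0]["message"]["content"]
```

With the server running, chat("gemma4:12b", "Explain how transformers work in two sentences.") returns the completion as a plain string.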
List Installed Models
To see which models you’ve downloaded:
ollama list
To remove a model and free up disk space:
ollama rm gemma4:4b
Connect Gemma 4 to Claude Code
Claude Code is Anthropic’s CLI-based coding assistant. By default it uses Anthropic’s API, but you can point it at any OpenAI-compatible endpoint — including your local Ollama server running Gemma 4.
This means you get a coding assistant experience without API usage costs.
Set the Environment Variables
Claude Code checks a few environment variables to determine which model and endpoint to use:
export ANTHROPIC_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_API_KEY=ollama
The ANTHROPIC_API_KEY value doesn’t matter here since Ollama doesn’t require authentication — you just need it to be set to something non-empty.
Launch Claude Code with Gemma 4
claude --model gemma4:12b
If the environment variables are set correctly, Claude Code will route requests to your local Ollama endpoint instead of Anthropic’s servers. You’ll see responses generated by Gemma 4.
Using a Proxy for More Control
For more complex setups — like routing to different models or adding logging — LiteLLM works well as a proxy layer between Claude Code and Ollama:
pip install litellm
litellm --model ollama/gemma4:12b --port 8000
Then set:
export ANTHROPIC_BASE_URL=http://localhost:8000
This approach is useful if you want to swap models without changing Claude Code’s configuration each time.
Troubleshooting Common Issues
Model Loads Slowly or Times Out
This usually means the model is being loaded into RAM or is running on CPU. Check whether Ollama is detecting your GPU:
ollama run gemma4:4b --verbose
Look for lines indicating GPU layers. If all layers are on CPU, Ollama didn’t detect your GPU. On Linux, make sure your CUDA drivers are up to date. On Windows, check that you have the latest NVIDIA or AMD drivers.
Out of Memory Errors
If you see an OOM error when loading the model, you’re either over the VRAM limit or another process is using too much GPU memory. Options:
- Drop to a smaller variant (e.g., switch from 27b to 12b)
- Close other GPU-heavy applications
- Use a more aggressive quantization:
ollama pull gemma4:27b-q4_0
API Connection Refused
If your app can’t connect to http://localhost:11434, the Ollama service isn’t running. Start it manually:
ollama serve
On Linux, you can also run it as a systemd service so it starts automatically:
sudo systemctl enable ollama
sudo systemctl start ollama
Slow Response Times
Slow inference typically means the model is partially or fully on CPU. Beyond hardware limitations, a few things help:
- Set OLLAMA_NUM_GPU=1 to ensure GPU is used
- Reduce context length if you’re sending very long prompts
- Use a smaller model variant
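Context length can also be capped per request on Ollama’s native API, which is often easier than changing prompts. The sketch below builds a payload for the /api/chat endpoint with a reduced num_ctx; the value 2048 is just an example cap, and how much it helps depends on how long your prompts actually are:

```python
import json

def chat_payload(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """JSON payload for Ollama's native /api/chat endpoint.
    options.num_ctx caps the context window for this request only."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # smaller context = less memory use
        "stream": False,
    })

print(chat_payload("gemma4:12b", "Summarize this file."))
```

POST this to http://localhost:11434/api/chat with Content-Type: application/json, the same way as the curl example earlier in this guide.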
Where MindStudio Fits Into This
Running Gemma 4 locally gives you a capable, private model endpoint. But using that endpoint as part of a larger workflow — one that touches external APIs, databases, or business tools — requires additional infrastructure.
That’s where MindStudio is worth knowing about. It’s a no-code platform for building AI agents and automated workflows, and it natively supports local models through Ollama and LMStudio alongside hosted models.
This means you can build an agent that:
- Accepts input from a web form, email, or webhook
- Routes certain tasks to your local Gemma 4 instance (for privacy-sensitive work)
- Calls other tools — Google Workspace, Slack, Salesforce, a custom API — as part of the same workflow
- Returns a structured output or triggers a downstream action
One specific use case: teams that want to keep sensitive document analysis off hosted APIs can run Gemma 4 locally via Ollama and plug it into a MindStudio agent that handles the surrounding orchestration — input collection, formatting, logging, output delivery — without writing infrastructure code.
MindStudio also gives you access to 200+ other models (Claude, GPT-4o, Gemini, and more) in the same builder, so you can mix models based on cost, capability, and privacy requirements within a single workflow. You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What hardware do I need to run Gemma 4 locally?
At minimum, a machine with 8 GB of RAM and a modern CPU will run the smaller Gemma 4 variants in CPU-only mode. For practical speeds, you’ll want a GPU with at least 6 GB VRAM for the 4B model, or 12 GB for the 12B variant. Apple Silicon Macs handle larger models well due to unified memory — an M2 Pro with 16 GB can run the 12B model comfortably.
Is Gemma 4 free to use commercially?
Yes. Gemma 4 is released under the Apache 2.0 license, which permits commercial use. There are no royalty requirements or usage restrictions beyond what the license terms outline. This makes it one of the more permissive open-weight model families available.
How does Gemma 4 compare to other local models like Llama 3 or Mistral?
Gemma 4 is competitive with Llama 3 and Mistral at equivalent parameter counts. In coding and reasoning benchmarks, the Gemma 4 12B and 27B variants generally perform well, and the multimodal support in larger variants is a differentiator. Mistral models tend to be faster at inference; Llama 3 has a larger ecosystem of fine-tunes. The best choice depends on your specific use case — benchmarks like Open LLM Leaderboard are a good reference.
Can I run Gemma 4 on a laptop without a dedicated GPU?
Yes, but with caveats. The 1B and 4B variants will run at usable speeds on a modern CPU, especially one with AVX2 support. Expect 5–15 tokens per second on a capable CPU — slow enough to be noticeable but functional. Apple Silicon laptops are the exception: their integrated GPUs and unified memory make them genuinely good at running local models without a discrete GPU.
How do I update Gemma 4 when a new version comes out on Ollama?
Re-running the pull command will fetch the latest version:
ollama pull gemma4:12b
Ollama compares the manifest and only downloads changed layers, so updates are usually faster than the initial download. You can check your current model version with ollama show gemma4:12b.
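If you keep several variants installed, you can script the refresh instead of pulling each one by hand. This sketch parses `ollama list` output and re-pulls every model it finds; the assumption that the model name is the first whitespace-separated column is based on the current CLI output format and may need adjusting if that format changes:

```python
import subprocess

def parse_model_names(list_output: str) -> list[str]:
    """Extract model names from `ollama list` output, skipping the header row."""
    lines = list_output.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]

def update_all_models() -> None:
    """Re-pull every installed model so each is on its latest version."""
    out = subprocess.run(["ollama", "list"], capture_output=True,
                         text=True, check=True).stdout
    for name in parse_model_names(out):
        subprocess.run(["ollama", "pull", name], check=True)
```

Since Ollama only downloads changed layers, running update_all_models() periodically is cheap when nothing has changed.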
Does running Gemma 4 locally mean my data is private?
Yes. When running via Ollama on your local machine, inference happens entirely on your hardware. No prompts, responses, or context are sent to external servers. This is the primary reason many teams choose local models for sensitive use cases — legal document analysis, internal code review, customer data processing — where sending data to a hosted API creates compliance or confidentiality concerns.
Key Takeaways
- Ollama makes running Gemma 4 locally a straightforward process — install, pull, run, done.
- Match your model variant to your hardware: 4B for most laptops, 12B for mid-range GPUs, 27B for high-end setups.
- Ollama’s OpenAI-compatible API lets you connect Gemma 4 to tools like Claude Code without API costs.
- Apple Silicon is surprisingly capable for local inference — 16 GB+ unified memory handles larger models well.
- For building full workflows around a local model, MindStudio supports Ollama natively and handles the orchestration layer without additional code.
If you want to go further with local AI — whether that’s building agents that call tools, automating workflows, or connecting Gemma 4 to business apps — MindStudio is worth exploring. The free tier covers most of what you need to get started.