How to Run Gemma 4 Locally with Ollama: Step-by-Step Setup Guide
Learn how to download and run Google's Gemma 4 locally using Ollama, check VRAM requirements, and connect it to Claude Code for free.
Why Running Gemma 4 Locally Actually Makes Sense
Google’s Gemma 4 is one of the most capable open-weight model families available right now. And with Ollama, you can run Gemma 4 entirely on your own machine — no API keys, no usage costs, no data leaving your system.
This guide walks through the full setup: installing Ollama, choosing the right Gemma 4 variant for your hardware, running it locally, and optionally connecting it to Claude Code so you get a powerful coding assistant without paying per token.
Whether you’re a developer who wants a private inference endpoint or someone experimenting with local AI, this is one of the cleaner setups you can run today.
What Is Gemma 4 and What Can It Do
Gemma 4 is Google’s fourth generation of open-weight models, designed to run efficiently on consumer hardware while delivering performance that rivals much larger closed models.
The Gemma 4 family includes several size variants — from compact 1B and 4B parameter models to the more capable 12B and 27B versions. The instruction-tuned variants (labeled -it) are what you’ll want for interactive use: they follow instructions and handle multi-turn conversations well.
Key capabilities across the family:
- Multimodal input — the larger variants support image understanding alongside text
- Long context windows — up to 128K tokens depending on the model
- Strong coding and reasoning — particularly notable in the 12B and 27B sizes
- Apache 2.0 license — meaning commercial use is permitted
For most local setups, the 4B or 12B variants hit the best balance between performance and hardware requirements. The 27B model is excellent but demands more GPU memory.
VRAM and Hardware Requirements
Before pulling any model, you need to know whether your hardware can handle it. Running a model entirely in VRAM gives you the fastest inference. You can fall back to CPU or split across CPU/GPU, but it’s significantly slower.
Here’s a practical breakdown:
| Model Variant | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| gemma4:1b | 2 GB | 4 GB | Runs fine on integrated GPU or CPU |
| gemma4:4b | 4 GB | 6 GB | Good for most laptops with discrete GPU |
| gemma4:12b | 8 GB | 12 GB | Best balance for mid-range GPUs |
| gemma4:27b | 16 GB | 24 GB | Needs a high-end GPU or quantized weights |
These figures assume you’re using the default quantization that Ollama ships with (usually Q4_K_M). If you want higher quality output, pulling a Q8 variant will roughly double the memory requirements, and a full-precision (FP16) variant will roughly triple to quadruple them.
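You can sanity-check these numbers yourself from the parameter count and quantization level. The sketch below is a rule of thumb, not Ollama internals: it assumes roughly 4.5 bits per weight for Q4_K_M and a flat 20% overhead for the KV cache and runtime buffers, both of which are approximations.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate: quantized weight size plus ~20% overhead
    for KV cache and runtime buffers (both figures are assumptions)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Q4_K_M averages roughly 4.5 bits per weight; Q8 is closer to 8.5
for size in (1, 4, 12, 27):
    print(f"{size:>2}B @ Q4_K_M: ~{estimate_memory_gb(size, 4.5):.1f} GB")
```

For the 12B model this lands around 8 GB, which matches the minimum in the table above; the 27B estimate of roughly 18 GB explains why 16 GB cards need tighter quantization.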
Running Without a GPU
Ollama can run models on CPU-only setups. It’s slower — expect several seconds per token on a 12B model — but it works. If you’re on a machine with no discrete GPU or less than 4 GB VRAM, stick with the 1B or 4B variant to get usable response times.
Apple Silicon Macs are an exception here. The unified memory architecture means M1/M2/M3 machines handle larger models surprisingly well. A MacBook Pro with 32 GB unified memory can run the 27B model at reasonable speeds.
Install Ollama
Ollama handles model management, inference, and a local API endpoint. Installation is straightforward across all platforms.
macOS and Linux
Open a terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
On macOS, you can also install via Homebrew:
brew install ollama
After installation, start the Ollama service:
ollama serve
This launches the local API server at http://localhost:11434. On macOS, the app also runs as a menu bar item and starts automatically on login.
Windows
Download the installer from ollama.com. Run the .exe, and Ollama will install and start as a background service. No additional configuration is needed.
Verify the Installation
Check that Ollama is running:
ollama --version
You should see the version number printed. If the service isn’t running yet, start it with ollama serve in a terminal.
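You can also verify the server side rather than the CLI: Ollama’s root endpoint answers with a plain "Ollama is running" message when the service is up. A minimal Python health check, assuming the default port 11434 (adjust the URL if you changed it):

```python
import urllib.request
from urllib.error import URLError

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print("Ollama running:", ollama_is_up())
```

If this prints False, start the service with ollama serve and run it again.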
Download and Run Gemma 4
With Ollama installed, pulling a model is a single command. Ollama handles downloading the weights and setting up everything needed to run inference.
Pull the Model
Choose the variant that fits your hardware:
# Compact — good for most hardware
ollama pull gemma4:4b
# Mid-range — best for 8-12 GB VRAM
ollama pull gemma4:12b
# High-end — 16+ GB VRAM recommended
ollama pull gemma4:27b
The download size ranges from around 2.5 GB for the 4B model to roughly 17 GB for the 27B. Depending on your connection speed, expect the first pull to take several minutes.
Run an Interactive Session
Once downloaded, start a chat session directly in your terminal:
ollama run gemma4:12b
You’ll see a prompt where you can type messages. This is useful for quick testing to confirm everything is working before connecting any other tools.
To exit the session, type /bye or press Ctrl+D.
Test via the API
Ollama exposes an OpenAI-compatible REST API. You can test it with curl:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:12b",
"messages": [
{"role": "user", "content": "Explain how transformers work in two sentences."}
]
}'
If you get a JSON response back with a completion, your local Gemma 4 endpoint is working.
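The same endpoint works from code. Here is a minimal Python client using only the standard library; the URL and model name mirror the curl example above, while the helper names themselves are just illustrative choices:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for Ollama."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def chat(model: str, prompt: str) -> str:
    """Send the request and extract the completion text."""
    req = build_request(model, prompt)
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # OpenAI-compatible responses put the text under choices[0].message.content
    return data["choices"][0]["message"]["content"]
```

With the server running, chat("gemma4:12b", "Explain how transformers work in two sentences.") returns the completion as a plain string.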
List Installed Models
To see which models you’ve downloaded:
ollama list
To remove a model and free up disk space:
ollama rm gemma4:4b
Connect Gemma 4 to Claude Code
Claude Code is Anthropic’s CLI-based coding assistant. By default it uses Anthropic’s API, but you can point it at any OpenAI-compatible endpoint — including your local Ollama server running Gemma 4.
This means you get a coding assistant experience without API usage costs.
Set the Environment Variables
Claude Code checks a few environment variables to determine which model and endpoint to use:
export ANTHROPIC_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_API_KEY=ollama
The ANTHROPIC_API_KEY value doesn’t matter here since Ollama doesn’t require authentication — you just need it to be set to something non-empty.
Launch Claude Code with Gemma 4
claude --model gemma4:12b
If the environment variables are set correctly, Claude Code will route requests to your local Ollama endpoint instead of Anthropic’s servers. You’ll see responses generated by Gemma 4.
Using a Proxy for More Control
For more complex setups — like routing to different models or adding logging — LiteLLM works well as a proxy layer between Claude Code and Ollama:
pip install litellm
litellm --model ollama/gemma4:12b --port 8000
Then set:
export ANTHROPIC_BASE_URL=http://localhost:8000
This approach is useful if you want to swap models without changing Claude Code’s configuration each time.
Troubleshooting Common Issues
Model Loads Slowly or Times Out
This usually means the model is being loaded into RAM or is running on CPU. Check whether Ollama is detecting your GPU:
ollama run gemma4:4b --verbose
Look for lines indicating GPU layers. If all layers are on CPU, Ollama didn’t detect your GPU. On Linux, make sure your CUDA drivers are up to date. On Windows, check that you have the latest NVIDIA or AMD drivers.
Out of Memory Errors
If you see an OOM error when loading the model, you’re either over the VRAM limit or another process is using too much GPU memory. Options:
- Drop to a smaller variant (e.g., switch from 27b to 12b)
- Close other GPU-heavy applications
- Use a more aggressive quantization:
ollama pull gemma4:27b-q4_0
API Connection Refused
If your app can’t connect to http://localhost:11434, the Ollama service isn’t running. Start it manually:
ollama serve
On Linux, you can also run it as a systemd service so it starts automatically:
sudo systemctl enable ollama
sudo systemctl start ollama
Slow Response Times
Slow inference typically means the model is partially or fully on CPU. Beyond hardware limitations, a few things help:
- Set OLLAMA_NUM_GPU=1 to ensure GPU is used
- Reduce context length if you’re sending very long prompts
- Use a smaller model variant
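Context length can also be capped per request on Ollama’s native API, which is often easier than changing prompts. The sketch below builds a payload for the /api/chat endpoint with a reduced num_ctx; the value 2048 is just an example cap, and how much it helps depends on how long your prompts actually are:

```python
import json

def chat_payload(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """JSON payload for Ollama's native /api/chat endpoint.
    options.num_ctx caps the context window for this request only."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # smaller context = less memory use
        "stream": False,
    })

print(chat_payload("gemma4:12b", "Summarize this file."))
```

POST this to http://localhost:11434/api/chat with Content-Type: application/json, the same way as the curl example earlier in this guide.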
Where MindStudio Fits Into This
Running Gemma 4 locally gives you a capable, private model endpoint. But using that endpoint as part of a larger workflow — one that touches external APIs, databases, or business tools — requires additional infrastructure.
That’s where MindStudio is worth knowing about. It’s a no-code platform for building AI agents and automated workflows, and it natively supports local models through Ollama and LMStudio alongside hosted models.
This means you can build an agent that:
- Accepts input from a web form, email, or webhook
- Routes certain tasks to your local Gemma 4 instance (for privacy-sensitive work)
- Calls other tools — Google Workspace, Slack, Salesforce, a custom API — as part of the same workflow
- Returns a structured output or triggers a downstream action
One specific use case: teams that want to keep sensitive document analysis off hosted APIs can run Gemma 4 locally via Ollama and plug it into a MindStudio agent that handles the surrounding orchestration — input collection, formatting, logging, output delivery — without writing infrastructure code.
MindStudio also gives you access to 200+ other models (Claude, GPT-4o, Gemini, and more) in the same builder, so you can mix models based on cost, capability, and privacy requirements within a single workflow. You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What hardware do I need to run Gemma 4 locally?
At minimum, a machine with 8 GB of RAM and a modern CPU will run the smaller Gemma 4 variants in CPU-only mode. For practical speeds, you’ll want a GPU with at least 6 GB VRAM for the 4B model, or 12 GB for the 12B variant. Apple Silicon Macs handle larger models well due to unified memory — an M2 Pro with 16 GB can run the 12B model comfortably.
Is Gemma 4 free to use commercially?
Yes. Gemma 4 is released under the Apache 2.0 license, which permits commercial use. There are no royalty requirements or usage restrictions beyond what the license terms outline. This makes it one of the more permissive open-weight model families available.
How does Gemma 4 compare to other local models like Llama 3 or Mistral?
Gemma 4 is competitive with Llama 3 and Mistral at equivalent parameter counts. In coding and reasoning benchmarks, the Gemma 4 12B and 27B variants generally perform well, and the multimodal support in larger variants is a differentiator. Mistral models tend to be faster at inference; Llama 3 has a larger ecosystem of fine-tunes. The best choice depends on your specific use case — benchmarks like Open LLM Leaderboard are a good reference.
Can I run Gemma 4 on a laptop without a dedicated GPU?
Yes, but with caveats. The 1B and 4B variants will run at usable speeds on a modern CPU, especially one with AVX2 support. Expect 5–15 tokens per second on a capable CPU — slow enough to be noticeable but functional. Apple Silicon laptops are the exception: their integrated GPUs and unified memory make them genuinely good at running local models without a discrete GPU.
How do I update Gemma 4 when a new version comes out on Ollama?
Re-running the pull command will fetch the latest version:
ollama pull gemma4:12b
Ollama compares the manifest and only downloads changed layers, so updates are usually faster than the initial download. You can check your current model version with ollama show gemma4:12b.
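If you keep several variants installed, you can script the refresh instead of pulling each one by hand. This sketch parses `ollama list` output and re-pulls every model it finds; the assumption that the model name is the first whitespace-separated column is based on the current CLI output format and may need adjusting if that format changes:

```python
import subprocess

def parse_model_names(list_output: str) -> list[str]:
    """Extract model names from `ollama list` output, skipping the header row."""
    lines = list_output.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]

def update_all_models() -> None:
    """Re-pull every installed model so each is on its latest version."""
    out = subprocess.run(["ollama", "list"], capture_output=True,
                         text=True, check=True).stdout
    for name in parse_model_names(out):
        subprocess.run(["ollama", "pull", name], check=True)
```

Since Ollama only downloads changed layers, running update_all_models() periodically is cheap when nothing has changed.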
Does running Gemma 4 locally mean my data is private?
Yes. When running via Ollama on your local machine, inference happens entirely on your hardware. No prompts, responses, or context are sent to external servers. This is the primary reason many teams choose local models for sensitive use cases — legal document analysis, internal code review, customer data processing — where sending data to a hosted API creates compliance or confidentiality concerns.
Key Takeaways
- Ollama makes running Gemma 4 locally a straightforward process — install, pull, run, done.
- Match your model variant to your hardware: 4B for most laptops, 12B for mid-range GPUs, 27B for high-end setups.
- Ollama’s OpenAI-compatible API lets you connect Gemma 4 to tools like Claude Code without API costs.
- Apple Silicon is surprisingly capable for local inference — 16 GB+ unified memory handles larger models well.
- For building full workflows around a local model, MindStudio supports Ollama natively and handles the orchestration layer without additional code.
If you want to go further with local AI — whether that’s building agents that call tools, automating workflows, or connecting Gemma 4 to business apps — MindStudio is worth exploring. The free tier covers most of what you need to get started.