How to Run Gemma 4 Locally on Your Phone or Laptop With Ollama
Gemma 4's 4B model runs on an iPhone 15 Pro. Here's how to download and run Gemma 4 locally using Ollama for free, private, offline AI workflows.
Why Running AI Locally Is Worth Your Attention
Running AI locally used to mean owning a server rack. Now it means opening a terminal. Google’s Gemma 4 — specifically the 4B variant — is small enough to run on an iPhone 15 Pro and fast enough on a modern laptop to feel like a real assistant, not a novelty.
The pitch for local AI comes down to three things: privacy, cost, and availability. Your data never leaves your device. You don’t pay per token. And the model works whether you’re on a plane, in a coffee shop with spotty Wi-Fi, or in a region with strict data compliance requirements.
This guide walks through exactly how to run Gemma 4 locally using Ollama — on a laptop running Mac, Windows, or Linux, and on a phone. No cloud accounts required.
What Gemma 4 Is (and Why the 4B Model Is the Interesting One)
Gemma 4 is Google’s fourth generation of open-weight language models, built on the same research underpinning Gemini. The family includes multiple sizes — 4B, 12B, and 27B parameters — but the 4B model is the one worth paying attention to for local use.
At roughly 2.5–3GB when quantized, the 4B model fits comfortably in the RAM of recent smartphones and most laptops made in the last three to four years. Despite its size, it punches above its weight on reasoning, instruction following, and multilingual tasks.
Key things to know about Gemma 4:
- Multimodal support — The model can handle both text and images, so you can ask questions about photos, screenshots, and documents
- 128K context window — Large enough for long documents, code files, or extended conversations
- Open weights — You can download and run it without any usage agreement beyond Google’s Gemma license, which allows commercial use with some restrictions
- Quantized versions available — Ollama automatically serves a 4-bit quantized version, reducing size and RAM requirements without destroying quality
The 12B and 27B models are worth running if you have a dedicated GPU or high-end workstation. For most people on a laptop or phone, 4B is the right starting point.
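The size figures above follow from simple arithmetic: parameter count times bits per weight, plus runtime overhead. A minimal sketch of that estimate (the 20% overhead factor is an assumption for embeddings and runtime buffers, not an exact figure from Ollama):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: int, overhead: float = 0.2) -> float:
    """Rough in-RAM/on-disk size of a quantized model.

    params * bits / 8 gives bytes; the ~20% overhead factor is an
    assumed allowance for embeddings and runtime buffers.
    """
    base_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits -> GB
    return round(base_gb * (1 + overhead), 2)

# 4B weights at 4-bit quantization land in the ~2.5-3GB range quoted above
print(quantized_size_gb(4, 4))   # -> 2.4
print(quantized_size_gb(12, 4))  # -> 7.2
```

The same arithmetic explains why the 27B model needs a dedicated GPU: even at 4 bits it wants well over 13GB before any context is loaded.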
What You Need Before You Start
Before installing anything, check that your hardware meets the minimum requirements.
For Laptops
Minimum:
- 8GB RAM (the 4B model needs around 4–5GB free)
- macOS 12+, Windows 10/11, or a modern Linux distro
- ~5GB free disk space
Recommended:
- 16GB RAM or more
- Apple Silicon Mac (M1 or later), or a PC with a dedicated GPU for faster inference
- SSD storage for faster model loading
For the 12B or 27B models:
- 16–32GB RAM minimum
- A dedicated GPU with 8–24GB VRAM is strongly preferred
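If you are not sure how much RAM your machine has, a quick stdlib-only check (POSIX systems; the 8GB threshold is the practical floor mentioned above):

```python
import os
import sys

def total_ram_gb() -> float:
    """Total physical RAM in GB via POSIX sysconf (macOS/Linux only;
    a quick sanity check, not a substitute for your system specs)."""
    pages = os.sysconf("SC_PHYS_PAGES")
    page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size / 1024**3

if sys.platform != "win32":
    ram = total_ram_gb()
    # The 4B model wants ~4-5GB free, so 8GB total is a practical floor
    verdict = "OK for gemma4:4b" if ram >= 8 else "tight; expect swapping"
    print(f"{ram:.1f} GB RAM -> {verdict}")
```

On Windows, Task Manager's Performance tab shows the same number.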
For Phones
Ollama itself doesn’t run natively on iOS or Android. The realistic options for running Gemma 4 on your phone are:
- Connect your phone to Ollama running on your home network — This is the easiest approach and what most people mean when they say “running AI on your phone.” Your laptop does the inference; your phone is the interface.
- Use a native on-device app — Apps like MLC Chat (iOS and Android) can run quantized Gemma models directly on the device without any server.
This guide covers both. The laptop setup comes first because it’s the foundation for option one.
How to Install Ollama on Your Laptop
Ollama is an open-source tool that makes it simple to download and run large language models locally. It handles model management, a local API server, and the runtime — all through one lightweight application.
macOS
Download the Ollama app from the official site and drag it into your Applications folder. Once it’s running, you’ll see the Ollama icon in your menu bar. Then open Terminal and run:
ollama --version
If you see a version number, the install worked.
Windows
Download the Windows installer from the Ollama website. Run it, follow the prompts, and restart your terminal. Ollama runs as a background service automatically.
Linux
Run the install script:
curl -fsSL https://ollama.com/install.sh | sh
This works on Ubuntu, Debian, Fedora, and most common distros. On systems without systemd, you may need to start Ollama manually with ollama serve.
Running Gemma 4 on Your Laptop
Once Ollama is installed, running Gemma 4 is a single command.
Pull and Run the 4B Model
ollama run gemma4:4b
The first time you run this, Ollama downloads the model. Expect around 2.5–3GB depending on quantization. On a fast connection, this takes a few minutes. Once it’s cached locally, it starts in seconds on future runs.
After the model loads, you’ll see a prompt:
>>> Send a message (/? for help)
Type anything and press Enter. The model responds locally. No internet required after the initial download.
Run the Larger Models
If your machine has more headroom, try the 12B or 27B variants:
ollama run gemma4:12b
ollama run gemma4:27b
The 27B model gives noticeably better quality on complex tasks but needs at least 16–20GB of available RAM. If you have an NVIDIA GPU with enough VRAM, Ollama will offload layers to it automatically.
Using the API
Ollama runs a local REST API at http://localhost:11434. This means you can integrate Gemma 4 into scripts, tools, and apps using standard HTTP calls.
A quick test with curl:
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:4b",
"prompt": "Summarize the key ideas behind transformer architecture in three bullet points.",
"stream": false
}'
The /api/generate endpoint shown above is Ollama's native format. Ollama also exposes an OpenAI-compatible endpoint at /v1, so any tool that speaks the OpenAI API (many chat UIs, VS Code extensions, and workflow tools) can point at your local instance as well.
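The same call works from a script using only the standard library. A sketch, assuming Ollama is running on its default port (the helper that builds the request body is separated out so it can be reused or inspected without a live server):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "gemma4:4b") -> bytes:
    """Serialize a non-streaming /api/generate request body."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "gemma4:4b") -> str:
    """POST to the local Ollama server and return the generated text.
    Requires Ollama to be running; raises URLError otherwise."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (with Ollama running):
#   print(generate("Summarize the transformer architecture in three bullets."))
```

With "stream": True instead, the server returns one JSON object per generated chunk, which is what chat UIs use to show tokens as they arrive.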
Using a Chat UI
If you prefer a browser-based interface to the terminal, install Open WebUI. It connects to your local Ollama instance and provides a clean chat experience similar to ChatGPT, with model switching, conversation history, and document uploads.
Running Gemma 4 on Your Phone
Option 1: Connect Your Phone to Ollama on Your Laptop
This approach lets your phone’s browser or an app send requests to Ollama running on your laptop, with the laptop doing the actual inference. It works well at home or on a local network.
Step 1: Expose Ollama on your local network
By default, Ollama only listens on localhost. To allow your phone to connect, set OLLAMA_HOST so the server listens on all interfaces. Do this only on a network you trust, since it makes the API reachable by every device on it. On Mac or Linux:
OLLAMA_HOST=0.0.0.0 ollama serve
On Windows, set the environment variable OLLAMA_HOST to 0.0.0.0 in System Properties, then restart the Ollama service.
Step 2: Find your laptop’s local IP address
On Mac: Go to System Settings → Network and look for your IP (usually something like 192.168.1.x).
On Windows: Run ipconfig in Command Prompt.
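On any platform with Python installed, this snippet prints the address your phone should use. It relies on a common trick: opening a UDP socket toward a public address sends no packets but tells you which interface the OS would route through.

```python
import socket

def local_ip() -> str:
    """Best-effort local LAN IP. Connecting a UDP socket toward a
    public address sends nothing, but reveals the outbound interface.
    Falls back to loopback if there is no network route."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()

print(f"Point your phone at: http://{local_ip()}:11434")
```

The result should match what System Settings or ipconfig reports, typically a 192.168.x.x address.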
Step 3: Install Enchanted (iOS) or a compatible app (Android)
Enchanted is a free, open-source iOS app that connects to Ollama-compatible endpoints. In the app settings, enter your laptop’s local IP and port:
http://192.168.1.x:11434
Select gemma4:4b from the model list, and you're chatting with Gemma 4, with all inference running on your laptop.
For Android, PocketPal AI and Nextcloud AI both support Ollama-compatible endpoints and work the same way.
What this setup gives you: Full Gemma 4 capability on your phone screen, without the phone doing any heavy lifting.
Option 2: Run Gemma 4 Fully On-Device
If you want true on-device inference — no laptop required — you need an app that runs the model directly on your phone’s hardware.
MLC Chat (iOS and Android) uses the MLC-LLM engine to run quantized models on-device. The iPhone 15 Pro’s A17 Pro chip with 8GB RAM can handle Gemma 4’s 4B model in a 4-bit quantized format at a reasonable speed.
Steps:
- Download MLC Chat from the App Store or Google Play
- Browse the in-app model library
- Download Gemma 4 (4B, 4-bit quantized — approximately 2.4GB)
- Start chatting
Inference is slower than on a laptop — expect 5–15 tokens per second depending on the device — but it works fully offline.
LLM Farm is another iOS option that supports Gemma models and has a more streamlined UI for document-focused tasks.
Tips for Getting Better Results Locally
Running a model locally means you control everything — including how you interact with it.
Use System Prompts
Ollama lets you create custom “Modelfiles” that package a model with a system prompt, giving it a specific personality or set of instructions. For example, to create a focused coding assistant:
FROM gemma4:4b
SYSTEM """
You are a precise coding assistant. When answering questions, always provide working code examples. Explain any non-obvious choices briefly. Do not pad responses with unnecessary text.
"""
Save this as Modelfile, then run:
ollama create my-coder -f Modelfile
ollama run my-coder
Keep Context Short for Speed
The 4B model has a 128K context window, but longer context means slower inference on CPU. For most conversational tasks, keeping the active context under 4K tokens is enough and significantly faster.
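A rough way to check whether a prompt fits under that 4K budget is the common heuristic of about four characters per token for English prose. This is an approximation only; the real count comes from the model's tokenizer.

```python
def rough_token_count(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English prose.
    A heuristic, not the model's actual tokenizer output."""
    return max(1, len(text) // 4)

prompt = "Summarize this meeting transcript and list action items. " * 120
tokens = rough_token_count(prompt)
verdict = "comfortably fast" if tokens < 4000 else "expect slower CPU inference"
print(f"~{tokens} tokens -> {verdict}")
```

If a document blows past the budget, summarizing it in chunks and then summarizing the summaries is usually faster on CPU than one giant prompt.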
Use GPU Acceleration When Available
On Apple Silicon Macs, Ollama uses the Metal GPU automatically. On PCs with NVIDIA GPUs, make sure your CUDA drivers are current. Inference on a GPU can be 5–10x faster than on CPU alone.
Try the Vision Capability
Gemma 4 is multimodal. You can pass images via the API or through supporting UIs:
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:4b",
"prompt": "Describe what is in this image.",
"images": ["<base64-encoded-image>"]
}'
Open WebUI handles this automatically through its file upload button.
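From a script, the only extra step over a text request is base64-encoding the image bytes into the "images" array. A sketch of building that request body (send it to the same /api/generate endpoint as before):

```python
import base64
import json

def build_vision_payload(image_bytes: bytes, prompt: str, model: str = "gemma4:4b") -> str:
    """Build the JSON body for an image+text request. The API expects
    raw image bytes base64-encoded inside the "images" array."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    })

# With a real file:
#   body = build_vision_payload(open("photo.png", "rb").read(), "Describe this image.")
```

Note the encoding: the field takes base64 strings, not file paths, which is why piping a filename into curl directly won't work.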
Connect Local Models to Automated Workflows with MindStudio
Running Gemma 4 locally is useful for chatting and quick queries. But if you want to connect it to real work — summarizing emails, processing documents, triggering actions in other tools — you need something beyond a terminal.
MindStudio is a no-code platform for building AI agents and automated workflows. It supports Ollama and other local model servers, so you can point it at your local Gemma 4 instance and build workflows that actually do things.
For example, you could build an agent that:
- Watches a folder for new PDF documents
- Sends each one to your local Gemma 4 instance for summarization
- Writes the summary to a Notion database or Slack channel
Or a scheduled agent that pulls data from a Google Sheet, runs it through Gemma 4 for analysis, and emails you a daily report.
MindStudio’s visual builder handles the workflow logic — branching, loops, data transformation — without code. And because it has 200+ models built in, you can swap Gemma 4 out for Claude or GPT-4o whenever a task needs more capability, without rebuilding anything.
It’s also worth noting: MindStudio offers direct access to models like Gemini 2.5 Pro and Flash through its platform, so if you’re working across different AI providers — some local, some cloud — you can manage them all from one place without juggling API keys.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
Does Gemma 4 actually run fast enough to be useful locally?
On a modern MacBook Pro with Apple Silicon (M2 or M3), the 4B model generates at 40–80 tokens per second — fast enough for fluid conversation. On a mid-range Windows laptop with a decent CPU and no GPU, expect 10–25 tokens per second, which is slower but workable. The 12B and 27B models are significantly slower without GPU acceleration.
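Those throughput figures translate directly into wall-clock time. For a typical few-paragraph answer of around 300 tokens:

```python
def response_time_s(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds to generate a response of a given length."""
    return round(tokens / tokens_per_sec, 1)

# A ~300-token answer at the speeds quoted above:
print(response_time_s(300, 60))  # Apple Silicon laptop -> 5.0
print(response_time_s(300, 15))  # CPU-only laptop      -> 20.0
print(response_time_s(300, 10))  # on-device phone      -> 30.0
```

Five seconds feels instant, twenty is a noticeable pause, and thirty is acceptable only when offline operation is the whole point.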
Is Gemma 4 free to use commercially?
Google’s Gemma models are released under the Gemma Terms of Use, which allows commercial use for most organizations. There are restrictions for very large companies (over 1 billion monthly active users). For most businesses and independent developers, commercial use is permitted. Always check the current Gemma license directly before deploying in production.
What’s the difference between Gemma 4 and Gemini?
Gemma 4 is an open-weight model you can download and run yourself. Gemini (Flash, Pro, Ultra) is Google's proprietary, closed model family, accessible only through Google's APIs and apps. Gemma 4 is smaller and less capable overall, but it's free to run locally with no data sent to Google's servers. Gemini models are generally more capable, especially on complex tasks, but require an API key and log usage.
Can I run Gemma 4 without a GPU?
Yes. Ollama runs on CPU-only systems. Performance is slower, but the 4B model is designed for efficiency and runs acceptably on modern multi-core CPUs. An Intel Core i7 or AMD Ryzen 7 from the last three to four years can run the 4B model well enough for practical use.
What’s the best Gemma 4 model size for a laptop?
For most laptops with 8–16GB RAM, the 4B model is the right choice. It’s fast, fits comfortably in memory, and handles most everyday tasks — writing, coding, summarization, Q&A — competently. If your machine has 32GB RAM or an 8GB+ GPU, the 12B model is worth trying for noticeably better reasoning quality.
How do I update Gemma 4 when a new version is released?
Run ollama pull gemma4:4b at any time to check for and download model updates. Ollama caches models locally and only downloads what’s changed.
Key Takeaways
- The 4B model is the practical choice — It fits in 4–5GB RAM and runs on laptops, phones, and modest hardware without a GPU.
- Ollama handles the complexity — One command installs the model, starts the server, and gives you both a chat interface and a local API.
- Phones need a bridge — Ollama doesn’t run natively on iOS or Android. Use Enchanted or PocketPal to connect to Ollama on your network, or MLC Chat for true on-device inference.
- System prompts make it more useful — Wrapping Gemma 4 in a Modelfile with a focused system prompt turns it into a more capable, task-specific tool.
- Local models plug into bigger workflows — Tools like MindStudio let you connect your local Gemma 4 instance to real automation pipelines without writing infrastructure code.
If you want to go further — building agents, connecting Gemma 4 to other tools, or experimenting with other models in the same environment — MindStudio is worth exploring. It’s free to start and takes about 15 minutes to build a working workflow.