How to Run Local AI Models with Ollama: A Beginner's Setup Guide for 2026
Learn how to install Ollama, download local models like Gemma and Qwen, and connect them to AI workspaces and agent tools in minutes.
Why Running AI Models Locally Is Worth Your Time
Privacy, cost, and control — those are the three reasons people keep coming back to local AI models. With Ollama, getting a capable language model running on your own machine takes less than ten minutes.
This guide covers everything you need to know to run local AI models with Ollama in 2026: installation on any operating system, pulling models like Gemma, Qwen, and LLaMA, basic commands, connecting Ollama to other tools, and troubleshooting the common issues that trip people up.
No cloud dependency. No per-token bill. Your data stays on your machine.
What Ollama Actually Is
Ollama is an open-source tool that makes it straightforward to download, run, and manage large language models (LLMs) locally. It handles the messy parts — model quantization, hardware acceleration, server setup — so you don’t have to.
Think of it as a package manager for AI models, similar in concept to Homebrew for software or pip for Python packages. You run one command, and the model is downloaded, configured, and ready to use.
Under the hood, Ollama runs a local server on port 11434 and exposes a REST API. That means any application that can make an HTTP request can talk to your local model — which is what makes it so useful for integrating with other tools.
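To make the "any HTTP client works" point concrete, here is a minimal sketch that uses only Python's standard library. It assumes the default address http://localhost:11434 and the /api/generate endpoint covered later in this guide. The helper names build_generate_request and generate are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed OLLAMA_HOST


def build_generate_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send the request and return the generated text. Needs a running Ollama server."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read())["response"]
```

With the server running and a model pulled, calling generate("qwen2.5:7b", "Say hello") should return the model's reply as a string.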
What Makes Ollama Different from Other Local AI Setups
There are other ways to run local models — LM Studio, llama.cpp directly, Jan, GPT4All. Ollama stands out for a few reasons:
- CLI-first design — Pull and run models with single commands
- Clean REST API — OpenAI-compatible endpoints make integration simple
- Active model library — Hundreds of models available, updated regularly
- Cross-platform — Works on macOS, Windows, and Linux
- GPU acceleration — Automatically uses Apple Silicon, NVIDIA, and AMD GPUs when available
Prerequisites Before You Install
Before installing Ollama, check a few things:
Hardware minimums:
- At least 8 GB of RAM for smaller models (7B parameters)
- 16 GB RAM recommended for comfortable performance with 13B models
- GPU optional but strongly recommended — even an older NVIDIA card helps significantly
Storage:
- Models range from about 2 GB (small quantized models) to 40+ GB (70B parameter models)
- Have at least 10–20 GB free for experimenting with a few models
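As a rough rule of thumb, the weights of a model take about (parameter count × bits per weight ÷ 8) bytes. The sketch below applies that arithmetic; the function names are hypothetical and the bit widths are assumptions. Real downloads are usually somewhat larger than this estimate, because common 4-bit formats store more than 4 bits per weight on average, and running a model needs extra memory for its context.

```python
def estimate_weights_gb(parameters: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


def fits_on_disk(parameters: float, bits_per_weight: float, free_gb: float) -> bool:
    """True if the estimated weights fit in the given free space."""
    return estimate_weights_gb(parameters, bits_per_weight) <= free_gb


# A 7B model at 4 bits per weight versus a 70B model at the same precision
print(estimate_weights_gb(7e9, 4))   # prints 3.5
print(estimate_weights_gb(70e9, 4))  # prints 35.0
```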
Operating system:
- macOS 11 Big Sur or later (M1/M2/M3 Macs get the best performance)
- Windows 10 or 11 (64-bit)
- Linux: most major distributions supported
You don’t need Python, Docker, or any other runtime installed. Ollama is self-contained.
Installing Ollama
macOS Installation
The fastest path on macOS is the official installer:
- Go to ollama.com and click Download
- Open the downloaded .dmg file and drag Ollama to your Applications folder
- Launch Ollama; you’ll see a llama icon appear in your menu bar
- Open Terminal and verify it’s running:
ollama --version
Alternatively, if you use Homebrew:
brew install ollama
Then start the Ollama server manually:
ollama serve
Windows Installation
- Download the Windows installer from ollama.com
- Run the .exe file; it installs and starts automatically
- Ollama runs as a background service and appears in the system tray
- Open PowerShell or Command Prompt and verify:
ollama --version
Note on GPU support for Windows: Ollama supports NVIDIA GPUs with CUDA and AMD GPUs with ROCm on Windows. If you have a compatible GPU, Ollama detects and uses it automatically. No manual configuration needed in most cases.
Linux Installation
The one-liner install script handles everything:
curl -fsSL https://ollama.com/install.sh | sh
This downloads the binary, sets up a systemd service, and starts Ollama automatically. To verify:
ollama --version
systemctl status ollama
If you’re not using systemd, start the server manually:
ollama serve
GPU support on Linux: NVIDIA users need CUDA drivers installed separately. AMD GPU support via ROCm is available but requires a compatible GPU (RX 5000 series and newer generally work).
Downloading and Running Your First Model
With Ollama installed, you’re ready to pull a model. The command structure is simple:
ollama pull <model-name>
Recommended Starter Models for 2026
Here are solid choices depending on your use case and hardware:
For general chat and reasoning:
- ollama pull qwen2.5:7b - Alibaba’s Qwen 2.5 at 7B parameters. Excellent English and Chinese performance, strong reasoning. About 4.7 GB.
- ollama pull llama3.2:3b - Meta’s compact 3B model. Fast on almost any hardware. About 2 GB.
- ollama pull gemma3:4b - Google’s Gemma 3 at 4B. Punches above its weight for instruction following. About 3.3 GB.
For coding:
- ollama pull qwen2.5-coder:7b - Specifically trained on code. Handles Python, JavaScript, Go, and more. About 4.7 GB.
- ollama pull deepseek-coder-v2:16b - DeepSeek’s coding model at 16B. Requires 16+ GB RAM. About 9.1 GB.
For longer context and analysis:
- ollama pull llama3.1:8b - Meta’s 8B model with 128K context window. About 4.9 GB.
- ollama pull mistral:7b - Mistral AI’s base 7B model. Fast and efficient.
If you have a powerful machine (32+ GB RAM):
- ollama pull qwen2.5:32b - One of the strongest local models available in this size class.
- ollama pull llama3.3:70b - Meta’s flagship 70B. Outstanding quality, but demands serious hardware.
Running a Model
Once pulled, start a chat session:
ollama run qwen2.5:7b
You’ll get a prompt where you can type messages directly. Press Ctrl+D or type /bye to exit.
To run a model with a single prompt from the command line:
ollama run gemma3:4b "Explain how attention mechanisms work in transformers"
Checking What You Have Installed
ollama list
This shows all downloaded models, their sizes, and when they were last modified.
To remove a model you no longer need:
ollama rm mistral:7b
Using the Ollama API
Ollama’s local server exposes a REST API that’s partially compatible with the OpenAI API format. This is what makes it so easy to plug into other tools.
Basic API Calls
The server runs at http://localhost:11434 by default.
Generate a completion:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "What is retrieval-augmented generation?",
  "stream": false
}'
Chat with conversation history:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b",
  "messages": [
    {
      "role": "user",
      "content": "Write a Python function to parse JSON"
    }
  ]
}'
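The chat request above does not set "stream": false, so the reply arrives as a stream of newline-delimited JSON objects rather than a single body. Here is a minimal sketch of how a client might reassemble such a stream. It assumes each line carries a message.content fragment and that the last line has done set to true. The helper name and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def join_chat_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the message fragments of a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)


# Illustrative sample, not real server output
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(join_chat_stream(sample))  # prints "Hello"
```

The response object returned by urllib.request.urlopen can be iterated line by line, so it can be passed directly as the lines argument.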
List available models via API:
curl http://localhost:11434/api/tags
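The /api/tags reply is JSON, so it is easy to post-process. The sketch below assumes the response has a top-level models list whose entries include name and size (in bytes); the function name and sample data are hypothetical.

```python
def summarize_models(tags_response: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest first."""
    models = tags_response.get("models", [])
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Illustrative sample shaped like an /api/tags response
sample_tags = {
    "models": [
        {"name": "llama3.2:3b", "size": 2_000_000_000},
        {"name": "qwen2.5:7b", "size": 4_700_000_000},
    ]
}
for name, size_gb in summarize_models(sample_tags):
    print(f"{name}: {size_gb} GB")
```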
Using Python with Ollama
Install the official Python library:
pip install ollama
Basic usage:
import ollama
response = ollama.chat(
    model='qwen2.5:7b',
    messages=[
        {'role': 'user', 'content': 'Summarize this in three bullet points: [your text here]'}
    ]
)
print(response['message']['content'])
For streaming responses (better for longer outputs):
import ollama
stream = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Write a short story'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
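Both examples above send a single user message. The model has no memory between calls, so a multi-turn chat works by resending the full message list each time. Here is a minimal sketch of that bookkeeping. The Conversation class is hypothetical, and the chat function is passed in (for example ollama.chat) so the logic can be tried without a running server.

```python
from typing import Any, Callable


class Conversation:
    """Keeps the running message list that chat-style calls expect."""

    def __init__(self, model: str, chat_fn: Callable[..., Any]):
        self.model = model
        self.chat_fn = chat_fn  # e.g. ollama.chat
        self.messages: list[dict] = []

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        response = self.chat_fn(model=self.model, messages=self.messages)
        reply = response["message"]["content"]
        # Store the assistant turn so the next request includes it as context
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Usage would look like chat = Conversation('qwen2.5:7b', ollama.chat), followed by repeated chat.ask(...) calls. Long conversations eventually exceed the context window, so a real application would also trim or summarize old turns.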
OpenAI-Compatible Endpoint
Ollama supports the OpenAI API format at /v1/, which means you can use the OpenAI Python SDK pointed at your local server:
from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but can be any string
)
response = client.chat.completions.create(
    model='qwen2.5:7b',
    messages=[{'role': 'user', 'content': 'Hello'}]
)
print(response.choices[0].message.content)
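This compatibility is particularly useful when swapping out cloud models for local ones in existing applications: in many cases you change only the base URL and the model name. Because the compatibility is partial, test any feature beyond basic chat completions before relying on it.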
Connecting Ollama to AI Tools and Workspaces
Ollama’s API means it integrates with a wide range of tools out of the box.
Open WebUI (Browser Interface)
If you want a ChatGPT-style interface for your local models, Open WebUI is the most popular option. It’s a web app that connects directly to Ollama.
Install with Docker:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser. Open WebUI auto-detects your Ollama models and gives you a full chat interface with history, file uploads, and model switching.
Continue (VS Code Extension for Coding)
Continue is a VS Code extension that acts as an AI coding assistant. It supports Ollama natively. Add this to your Continue config:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ]
}
You get tab completion, inline edits, and a chat panel — all running locally.
LangChain and LlamaIndex
Both popular AI frameworks support Ollama as a provider. This is useful if you’re building more complex applications that need retrieval, agents, or tool use.
LangChain example:
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.1:8b")
result = llm.invoke("Explain vector embeddings simply")
print(result)
Accessing Ollama from Other Machines on Your Network
By default, Ollama only listens on localhost. To expose it to your local network (useful for connecting other devices or VMs):
Set the environment variable before starting Ollama:
OLLAMA_HOST=0.0.0.0 ollama serve
On Windows, set this as a system environment variable and restart the Ollama service.
Then other machines on your network can access it at http://YOUR_LOCAL_IP:11434. Ollama’s API has no built-in authentication, so only expose it on a network you trust.
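To confirm from another machine that the server is reachable, a short check against the /api/tags endpoint is usually enough. This is a sketch using only the standard library; the function names and the example IP address are placeholders.

```python
import urllib.error
import urllib.request


def ollama_url(host: str, port: int = 11434, path: str = "/api/tags") -> str:
    """Build the URL for an Ollama endpoint on a given host."""
    return f"http://{host}:{port}{path}"


def is_reachable(host: str, port: int = 11434, timeout: float = 3.0) -> bool:
    """Return True if the server answers the model-list request with HTTP 200."""
    try:
        with urllib.request.urlopen(ollama_url(host, port), timeout=timeout) as reply:
            return reply.status == 200
    except (urllib.error.URLError, OSError):
        return False  # refused, timed out, or host not found


print(ollama_url("192.168.1.50"))  # placeholder address for illustration
```

If is_reachable returns False from another device but True on the host itself, the usual suspects are the OLLAMA_HOST setting or a firewall rule.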
Running Multimodal and Specialized Models
Ollama isn’t limited to text-only models. Several multimodal models let you analyze images alongside text.
Vision Models
Pull a vision-capable model:
ollama pull llava:7b
Or the more capable:
ollama pull llama3.2-vision:11b
Use it via the API with an image:
import ollama
with open('image.jpg', 'rb') as f:
    image_data = f.read()
response = ollama.chat(
    model='llama3.2-vision:11b',
    messages=[
        {
            'role': 'user',
            'content': 'What is in this image?',
            'images': [image_data]
        }
    ]
)
print(response['message']['content'])
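The Python library accepts raw bytes, but when calling the REST API directly, images are sent as base64-encoded strings in the images array. Here is a minimal sketch of building such a request body for /api/chat; the helper name is hypothetical.

```python
import base64
import json


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Return a JSON body for /api/chat with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "stream": False,
    }
    return json.dumps(body)
```

The returned string can be posted to http://localhost:11434/api/chat with any HTTP client.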
Embedding Models
For RAG (retrieval-augmented generation) applications, you’ll want an embedding model:
ollama pull nomic-embed-text
Generate embeddings via the API:
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox"
}'
These embeddings integrate with vector databases like ChromaDB, Qdrant, or pgvector for building search and retrieval applications.
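Before reaching for a vector database, it helps to see what retrieval does with these vectors. The sketch below ranks documents by cosine similarity to a query embedding in pure Python. The function names are illustrative, and the short vectors stand in for real embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query: list[float], documents: dict[str, list[float]]) -> list[str]:
    """Return document names ordered from most to least similar to the query."""
    return sorted(documents, key=lambda name: cosine_similarity(query, documents[name]), reverse=True)


# Toy vectors for illustration only
docs = {"cats": [0.0, 1.0], "foxes": [0.9, 0.1]}
print(rank_documents([1.0, 0.0], docs))  # prints ['foxes', 'cats']
```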
Creating Custom Model Variants with Modelfiles
Ollama supports Modelfiles — simple configuration files that let you customize model behavior, set system prompts, and adjust parameters.
Create a file called Modelfile:
FROM qwen2.5:7b
SYSTEM You are a concise technical writing assistant. Always respond in plain English without jargon. Keep answers under 200 words unless specifically asked for more.
PARAMETER temperature 0.3
PARAMETER top_p 0.9
Build and run it:
ollama create my-tech-writer -f Modelfile
ollama run my-tech-writer
This is useful for creating specialized versions of base models without any fine-tuning.
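If you maintain several variants, generating the Modelfile text from code keeps them consistent. This is a sketch that only uses the FROM, SYSTEM, and PARAMETER instructions shown above; the function name is hypothetical, and it assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render Modelfile text with a base model, a system prompt, and parameters."""
    # Triple quotes let the system prompt span multiple lines
    lines = [f"FROM {base_model}", f'SYSTEM """{system_prompt}"""']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "qwen2.5:7b",
    "You are a concise technical writing assistant.",
    {"temperature": 0.3, "top_p": 0.9},
))
```

Write the result to a file and pass it to ollama create with the -f flag, as in the commands above.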
Where MindStudio Fits with Local Models
Running models locally with Ollama is excellent for development, privacy-sensitive workflows, and experimentation. But there’s a common gap: once you’ve got a local model running, building a full application around it — with a proper UI, workflow logic, integrations, and automated triggers — still requires significant engineering work.
MindStudio addresses that. Its AI Media Workbench and agent builder both support local models, including Ollama and LM Studio. If you’ve got Ollama running on your machine or a local server, you can point MindStudio workflows at it and use that model within a broader automated workflow.
For teams that want to mix local and cloud models — using a local Ollama model for cost-sensitive tasks and a cloud model like Claude for complex reasoning — MindStudio lets you do that within a single workflow. You’re not locked into one provider.
More broadly, MindStudio gives you the orchestration layer that Ollama alone doesn’t provide. You can build agents that use your local model to process text, then pass results to a Google Workspace integration, send a Slack notification, or trigger a downstream workflow — all without writing infrastructure code.
You can start building with MindStudio for free at mindstudio.ai. If local AI models are already part of your stack, the integration is straightforward to configure.
Troubleshooting Common Ollama Issues
Model Downloads Stall or Fail
Large model files download in chunks. If a download stalls:
- Press Ctrl+C and re-run ollama pull <model>; it resumes from where it stopped
- Check available disk space (df -h on macOS/Linux)
- Verify your internet connection is stable
Slow Performance (CPU-Only Mode)
If Ollama falls back to CPU:
- macOS: Metal GPU acceleration is automatic on Apple Silicon. If it feels slow, try a smaller model, such as a :3b or :4b variant.
- NVIDIA on Linux: Confirm the NVIDIA drivers are installed (nvidia-smi should return output)
- Windows NVIDIA: Check that you have the latest NVIDIA drivers (a separate CUDA toolkit install is generally not needed, since Ollama ships the CUDA libraries it uses)
While a model is loaded, run ollama ps and check the PROCESSOR column. A value of 100% GPU means the model is fully offloaded to the GPU; 100% CPU means Ollama is running CPU-only.
Port 11434 Already in Use
Another process is using Ollama’s default port. Either stop that process or change Ollama’s port:
OLLAMA_HOST=127.0.0.1:11435 ollama serve
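To find out whether something is already listening on a port before starting the server, you can probe it with a plain TCP connection. This is a minimal sketch; the function name is hypothetical, and it only tells you that some process accepts connections, not which one.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a process accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return probe.connect_ex((host, port)) == 0


print(port_in_use(11434))
```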
Out of Memory Errors
The model is too large for your available RAM. Options:
- Use a smaller parameter count (3B or 7B instead of 13B)
- Use a more aggressively quantized version (Q4 instead of Q8): look on the model’s library page for a tag that includes q4 (for example, one ending in q4_0), if that variant is available
- Close other applications to free RAM before running Ollama
Model Runs but Gives Poor Outputs
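Start by running the model with the --verbose flag, which prints timing statistics after each response and helps separate quality problems from performance problems:
ollama run qwen2.5:7b --verbose
To change inference parameters, use /set parameter inside the interactive session (for example, /set parameter temperature 0.3), or pass them in the options field of an API request. A lower temperature gives more predictable output, and num_ctx sets the context window size. Many quality issues come from a context window that is too short for the task.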
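When calling the API, these parameters go in an options object inside the request body. Here is a minimal sketch of such a body for /api/chat; the helper name is hypothetical and the default values are illustrative, not recommendations.

```python
import json


def chat_payload(model: str, prompt: str, temperature: float = 0.3, num_ctx: int = 8192) -> str:
    """Return a JSON body for /api/chat with inference options set."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": temperature,  # lower values give more predictable output
            "num_ctx": num_ctx,          # context window size in tokens
        },
    }
    return json.dumps(body)
```

A larger num_ctx uses more memory, so raise it gradually on machines that are already close to their RAM limit.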
Frequently Asked Questions
Is Ollama free to use?
Yes, Ollama is completely free and open source under the MIT license. You download it, run it, and there are no usage fees. The cost is just your hardware (electricity and compute). The models themselves are also free to download as open-weight releases, though each one comes with its own license terms, so check them before commercial use.
What’s the difference between a 7B and a 70B model?
The number refers to the number of parameters (weights) in the model. More parameters generally mean better reasoning, more nuanced outputs, and better handling of complex tasks — but also more RAM required and slower generation speed. A 7B model needs about 8 GB of RAM and runs fine on most laptops. A 70B model needs 48–64 GB of RAM and is really only practical on high-end workstations or servers.
For most everyday tasks, a well-tuned 7B model (like Qwen 2.5 7B or Gemma 3) gets you surprisingly far.
Can I run Ollama on a machine without a GPU?
Yes. Ollama runs on CPU-only machines, but it’s slower. On a modern CPU with 16 GB RAM, a 7B model might generate 5–15 tokens per second, which is usable but not fast. With a GPU, you typically see 30–100+ tokens per second depending on the GPU and model size. If you’re on an Apple Silicon Mac (M1, M2, M3, M4), you get excellent performance because the unified memory architecture handles these workloads very efficiently.
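You can measure the generation speed on your own hardware instead of relying on typical ranges. The sketch below assumes the non-streaming /api/generate response includes eval_count (generated tokens) and eval_duration (in nanoseconds); the function name and sample values are illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an /api/generate response dictionary."""
    duration_ns = response.get("eval_duration", 0)
    if not duration_ns:
        return 0.0  # avoid dividing by zero when timing data is missing
    return response.get("eval_count", 0) / (duration_ns / 1e9)


# Illustrative values: 100 tokens generated in 5 seconds
print(tokens_per_second({"eval_count": 100, "eval_duration": 5_000_000_000}))  # prints 20.0
```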
How does Ollama compare to LM Studio?
Both tools run local models, but they take different approaches. LM Studio is GUI-first — you browse models, download them, and chat through a visual interface. Ollama is CLI and API-first — better suited for developers who want to integrate local models into other applications. LM Studio is easier for non-technical users to get started with. Ollama is more flexible for building things. Many people use both.
Is my data private when using Ollama?
Yes. Everything runs locally — your prompts never leave your machine. There’s no telemetry sent to Ollama’s servers about what you’re running or what you’re saying to the model. This is one of the core reasons people choose local models over cloud APIs, especially for sensitive business data, personal information, or proprietary code.
What models work best on Apple Silicon Macs?
Apple Silicon (M1, M2, M3, M4) handles local models particularly well because of the unified memory architecture — the CPU and GPU share the same high-bandwidth memory pool. Recommended models for different Mac configs:
- 8 GB RAM: llama3.2:3b, gemma3:4b
- 16 GB RAM: qwen2.5:7b, llama3.1:8b, mistral:7b
- 32 GB RAM: qwen2.5:14b, deepseek-r1:14b
- 64+ GB RAM: qwen2.5:32b, llama3.3:70b
Key Takeaways
- Ollama makes local LLMs accessible: one command to install, one command to pull a model, one command to run it.
- Start with 7B models: they balance performance and hardware requirements well. Qwen 2.5, Gemma 3, and LLaMA 3 are all solid choices.
- The local API is the real power: Ollama’s REST endpoint and OpenAI-compatible /v1/ interface let you plug local models into almost any application or framework.
- GPU helps but isn’t required: Apple Silicon Macs are the best hardware for local models without a dedicated GPU. NVIDIA GPUs on Linux and Windows work well with proper CUDA drivers.
- Local models pair well with orchestration tools: Ollama handles the model runtime; tools like MindStudio handle the workflow, integrations, and application layer on top.
If you’re building workflows that use local models alongside cloud APIs, databases, and business tools, MindStudio’s agent builder is worth exploring. You can connect Ollama to broader automated workflows without writing the infrastructure yourself — and start free.
