How to Run Local AI Models with Ollama: A Beginner's Setup Guide for 2026

Why Running AI Models Locally Is Worth Your Time

Privacy, cost, and control — those are the three reasons people keep coming back to local AI models. With Ollama, getting a capable language model running on your own machine takes less than ten minutes.

This guide covers everything you need to know to run local AI models with Ollama in 2026: installation on any operating system, pulling models like Gemma, Qwen, and LLaMA, basic commands, connecting Ollama to other tools, and troubleshooting the common issues that trip people up.

No cloud dependency. No per-token bill. Your data stays on your machine.

What Ollama Actually Is

Ollama is an open-source tool that makes it straightforward to download, run, and manage large language models (LLMs) locally. It handles the messy parts — model quantization, hardware acceleration, server setup — so you don’t have to.

Think of it as a package manager for AI models, similar in concept to Homebrew for software or pip for Python packages. You run one command, and the model is downloaded, configured, and ready to use.

Under the hood, Ollama runs a local server on port 11434 and exposes a REST API. That means any application that can make an HTTP request can talk to your local model — which is what makes it so useful for integrating with other tools.
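Because the server is plain HTTP, a quick reachability check needs nothing beyond the Python standard library. The sketch below is one way to do it, assuming the default address and using the /api/tags endpoint covered later in this guide; the function names are illustrative, not part of Ollama.

```python
import http.client
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; change if yours differs


def api_url(path: str, base_url: str = OLLAMA_URL) -> str:
    """Join the server address and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if a server answers /api/tags at base_url with HTTP 200."""
    try:
        with urllib.request.urlopen(api_url("/api/tags", base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or a malformed reply.
        return False
```

Calling ollama_is_up() before sending real requests gives a clearer failure than a stack trace when the server simply isn't running.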

What Makes Ollama Different from Other Local AI Setups

There are other ways to run local models — LM Studio, llama.cpp directly, Jan, GPT4All. Ollama stands out for a few reasons:

CLI-first design — Pull and run models with single commands
Clean REST API — OpenAI-compatible endpoints make integration simple
Active model library — Hundreds of models available, updated regularly
Cross-platform — Works on macOS, Windows, and Linux
GPU acceleration — Automatically uses Apple Silicon, NVIDIA, and AMD GPUs when available

Prerequisites Before You Install

Before installing Ollama, check a few things:

Hardware minimums:

At least 8 GB of RAM for smaller models (7B parameters)
16 GB RAM recommended for comfortable performance with 13B models
GPU optional but strongly recommended — even an older NVIDIA card helps significantly

Storage:

Models range from about 2 GB (small quantized models) to 40+ GB (70B parameter models)
Have at least 10–20 GB free for experimenting with a few models

Operating system:

macOS 11 Big Sur or later (Apple Silicon Macs get the best performance)
Windows 10 or 11 (64-bit)
Linux: most major distributions supported

You don’t need Python, Docker, or any other runtime installed. Ollama is self-contained.

Installing Ollama

macOS Installation

The fastest path on macOS is the official installer:

Go to ollama.com and click Download
Open the downloaded .dmg file and drag Ollama to your Applications folder
Launch Ollama — you’ll see a llama icon appear in your menu bar
Open Terminal and verify it’s running:

ollama --version

Alternatively, if you use Homebrew:

brew install ollama

Then start the Ollama server manually:

ollama serve

Windows Installation

Download the Windows installer from ollama.com
Run the .exe file — it installs and starts automatically
Ollama runs as a background service and appears in the system tray
Open PowerShell or Command Prompt and verify:

ollama --version

Note on GPU support for Windows: Ollama supports NVIDIA GPUs with CUDA and AMD GPUs with ROCm on Windows. If you have a compatible GPU, Ollama detects and uses it automatically. No manual configuration needed in most cases.

Linux Installation

The one-liner install script handles everything:

curl -fsSL https://ollama.com/install.sh | sh

This downloads the binary, sets up a systemd service, and starts Ollama automatically. To verify:

ollama --version
systemctl status ollama

If you’re not using systemd, start the server manually:

ollama serve

GPU support on Linux: NVIDIA users need CUDA drivers installed separately. AMD GPU support via ROCm is available but requires a compatible GPU (RX 5000 series and newer generally work).
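If you'd rather script the verification step than type it on each machine, something like the following could work. It is a minimal sketch that only assumes the ollama binary is on your PATH after installation; the helper names are made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def find_ollama(binary: str = "ollama") -> Optional[str]:
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


def ollama_version(binary: str = "ollama") -> Optional[str]:
    """Run '<binary> --version' and return its output, or None if not installed."""
    path = find_ollama(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    # Some tools print version text to stderr, so fall back to it.
    return result.stdout.strip() or result.stderr.strip()
```

A None result means the installer didn't put the binary on your PATH, which usually points to a shell that needs restarting rather than a failed install.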

Downloading and Running Your First Model

With Ollama installed, you’re ready to pull a model. The command structure is simple:

ollama pull <model-name>

Recommended Starter Models for 2026

Here are solid choices depending on your use case and hardware:

For general chat and reasoning:

ollama pull qwen2.5:7b — Alibaba’s Qwen 2.5 at 7B parameters. Excellent English and Chinese performance, strong reasoning. About 4.7 GB.
ollama pull llama3.2:3b — Meta’s compact 3B model. Fast on almost any hardware. About 2 GB.
ollama pull gemma3:4b — Google’s Gemma 3 at 4B. Punches above its weight for instruction following. About 3.3 GB.

For coding:

ollama pull qwen2.5-coder:7b — Specifically trained on code. Handles Python, JavaScript, Go, and more. About 4.7 GB.
ollama pull deepseek-coder-v2:16b — DeepSeek’s coding model at 16B. Requires 16+ GB RAM. About 9.1 GB.

For longer context and analysis:

ollama pull llama3.1:8b — Meta’s 8B model with 128K context window. About 4.9 GB.
ollama pull mistral:7b — Mistral AI’s base 7B model. Fast and efficient.

If you have a powerful machine (32+ GB RAM):

ollama pull qwen2.5:32b — One of the strongest local models available in this size class.
ollama pull llama3.3:70b — Meta’s flagship 70B. Outstanding quality, but demands serious hardware.

Running a Model

Once pulled, start a chat session:

ollama run qwen2.5:7b

You’ll get a prompt where you can type messages directly. Press Ctrl+D or type /bye to exit.

To run a model with a single prompt from the command line:

ollama run gemma3:4b "Explain how attention mechanisms work in transformers"

Checking What You Have Installed

ollama list

This shows all downloaded models, their sizes, and when they were last modified.

To remove a model you no longer need:

ollama rm mistral:7b

Using the Ollama API

Ollama’s local server exposes a REST API that’s partially compatible with the OpenAI API format. This is what makes it so easy to plug into other tools.

Basic API Calls

The server runs at http://localhost:11434 by default.

Generate a completion:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "What is retrieval-augmented generation?",
  "stream": false
}'
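The same request can be made from Python without installing anything. This sketch mirrors the curl call above using only the standard library; it assumes the non-streaming reply carries the generated text in a "response" field, and the function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "qwen2.5:7b",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen2.5:7b") -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=120) as resp:
        body = json.load(resp)
    return body["response"]
```

Splitting request construction from sending keeps the payload easy to inspect and test when no server is available.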

Chat with conversation history:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b",
  "messages": [
    {
      "role": "user",
      "content": "Write a Python function to parse JSON"
    }
  ]
}'
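The request above contains a single message, but the chat endpoint keeps no memory between calls: to continue a conversation, the client resends the whole history each time. Here is a minimal sketch of that bookkeeping; the helper names and the placeholder assistant reply are illustrative.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


def chat_payload(history: List[Message], model: str = "gemma3:4b") -> dict:
    """Build the JSON body for /api/chat from the full conversation so far."""
    return {"model": model, "messages": history, "stream": False}


history: List[Message] = []
history = add_turn(history, "user", "Write a Python function to parse JSON")
# In a real session this text comes from the server's reply; it is a placeholder here.
history = add_turn(history, "assistant", "(model reply goes here)")
history = add_turn(history, "user", "Now add error handling")

payload = chat_payload(history)
print(len(payload["messages"]), "messages will be sent")
```

The follow-up question only makes sense to the model because the first question and its answer travel with it.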

List available models via API:

curl http://localhost:11434/api/tags
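The reply is JSON, so it is easy to turn into a compact summary. The sketch below assumes the response has a "models" list whose entries include "name" and "size" (in bytes); the sample data is made up to show the shape, not copied from a real server.

```python
from typing import List, Tuple


def summarize_models(tags_response: dict) -> List[Tuple[str, float]]:
    """Return (name, size in GB) pairs, largest model first."""
    models = tags_response.get("models", [])
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up sample shaped like an /api/tags reply; sizes are illustrative only.
sample = {
    "models": [
        {"name": "llama3.2:3b", "size": 2_000_000_000},
        {"name": "qwen2.5:7b", "size": 4_700_000_000},
    ]
}

for name, gigabytes in summarize_models(sample):
    print(f"{name:<16} {gigabytes} GB")
```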

Using Python with Ollama

Install the official Python library:

pip install ollama

Basic usage:

import ollama

response = ollama.chat(
    model='qwen2.5:7b',
    messages=[
        {'role': 'user', 'content': 'Summarize this in three bullet points: [your text here]'}
    ]
)

print(response['message']['content'])

For streaming responses (better for longer outputs):

import ollama

stream = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Write a short story'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
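If you call the REST API directly instead of using the library, a streamed reply arrives as one JSON object per line, with the final object marked as done. The sketch below shows one way to stitch those lines back together; the sample lines are hand-written to illustrate the shape and are not captured output.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the content of newline-delimited JSON chat chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk signals the end of the reply
    return "".join(parts)


# Hand-written sample chunks, for illustration only.
sample_lines = [
    '{"message": {"role": "assistant", "content": "Once"}, "done": false}',
    '{"message": {"role": "assistant", "content": " upon a time"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(join_stream(sample_lines))
```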

OpenAI-Compatible Endpoint

Ollama supports the OpenAI API format at /v1/, which means you can use the OpenAI Python SDK pointed at your local server:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but can be any string
)

response = client.chat.completions.create(
    model='qwen2.5:7b',
    messages=[{'role': 'user', 'content': 'Hello'}]
)

print(response.choices[0].message.content)

This compatibility is particularly useful when swapping out cloud models for local ones in existing applications — you change the base URL and model name, nothing else.
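One way to make that swap a configuration change rather than a code change is to read the connection settings from the environment. This is a sketch under assumptions: the USE_LOCAL_LLM, CLOUD_BASE_URL, CLOUD_API_KEY, and CLOUD_MODEL variable names are invented for this example, as are the placeholder cloud URL and model name.

```python
import os
from typing import Dict, Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Pick local Ollama or a cloud provider based on environment variables."""
    if env.get("USE_LOCAL_LLM") == "1":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # required by the SDK, but any string works
            "model": "qwen2.5:7b",
        }
    return {
        # Placeholder defaults; substitute your provider's real values.
        "base_url": env.get("CLOUD_BASE_URL", "https://api.example.com/v1"),
        "api_key": env.get("CLOUD_API_KEY", ""),
        "model": env.get("CLOUD_MODEL", "example-cloud-model"),
    }


settings = client_settings({"USE_LOCAL_LLM": "1"})
print(settings["base_url"], settings["model"])
```

The base_url and api_key values can be passed straight to the OpenAI client constructor, and model to the completion call, so the rest of the application stays untouched.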

Connecting Ollama to AI Tools and Workspaces

Ollama’s API means it integrates with a wide range of tools out of the box.

Open WebUI (Browser Interface)

If you want a ChatGPT-style interface for your local models, Open WebUI is the most popular option. It’s a web app that connects directly to Ollama.

Install with Docker:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. Open WebUI auto-detects your Ollama models and gives you a full chat interface with history, file uploads, and model switching.

Continue (VS Code Extension for Coding)

Continue is a VS Code extension that acts as an AI coding assistant. It supports Ollama natively. Add this to your Continue config:

{
  "models": [
    {
      "title": "Qwen 2.5 Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ]
}

You get tab completion, inline edits, and a chat panel — all running locally.

LangChain and LlamaIndex

Both popular AI frameworks support Ollama as a provider. This is useful if you’re building more complex applications that need retrieval, agents, or tool use.

LangChain example:

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b")
result = llm.invoke("Explain vector embeddings simply")
print(result)

Accessing Ollama from Other Machines on Your Network

By default, Ollama only listens on localhost. To expose it to your local network (useful for connecting other devices or VMs):

Set the environment variable before starting Ollama:

OLLAMA_HOST=0.0.0.0 ollama serve

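On Windows, set this as a system environment variable and restart the Ollama service. On Linux installs that run Ollama through systemd, set it on the service instead: run systemctl edit ollama, add Environment="OLLAMA_HOST=0.0.0.0" under [Service], then restart the service.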

Then other machines on your network can access it at http://YOUR_LOCAL_IP:11434.
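Client scripts on those other machines need the server's address in URL form. The helper below is a small sketch that accepts an IP, an IP with a port, or a full URL and fills in the default port when it is missing. The address 192.168.1.20 is a made-up example, and the helper does not handle IPv6 literals.

```python
def ollama_base_url(host: str, default_port: int = 11434) -> str:
    """Turn '192.168.1.20', '192.168.1.20:11435', or a full URL into a base URL."""
    host = host.strip()
    if "://" in host:
        scheme, rest = host.split("://", 1)
    else:
        scheme, rest = "http", host
    rest = rest.rstrip("/")
    if ":" not in rest:
        rest = f"{rest}:{default_port}"  # no port given, use Ollama's default
    return f"{scheme}://{rest}"


print(ollama_base_url("192.168.1.20"))
```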

Running Multimodal and Specialized Models

Ollama isn’t limited to text-only models. Several multimodal models let you analyze images alongside text.

Vision Models

Pull a vision-capable model:

ollama pull llava:7b

Or the more capable:

ollama pull llama3.2-vision:11b

Use it via the API with an image:

import ollama

with open('image.jpg', 'rb') as f:
    image_data = f.read()

response = ollama.chat(
    model='llama3.2-vision:11b',
    messages=[
        {
            'role': 'user',
            'content': 'What is in this image?',
            'images': [image_data]
        }
    ]
)

print(response['message']['content'])
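The Python library accepts raw bytes, but if you call the REST API directly, images go into the JSON body as base64-encoded strings. A minimal sketch of building that body follows; the function name is illustrative, and sending the request works the same way as any other /api/chat call.

```python
import base64
import json


def image_chat_payload(
    prompt: str, image_bytes: bytes, model: str = "llama3.2-vision:11b"
) -> dict:
    """Build an /api/chat body with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "stream": False,
    }


# Tiny stand-in bytes; in practice, read them from an image file.
payload = image_chat_payload("What is in this image?", b"abc")
print(json.dumps(payload)[:80])
```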

Embedding Models

For RAG (retrieval-augmented generation) applications, you’ll want an embedding model:

ollama pull nomic-embed-text

Generate embeddings via the API:

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox"
}'

These embeddings integrate with vector databases like ChromaDB, Qdrant, or pgvector for building search and retrieval applications.
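Under the hood, retrieval comes down to comparing vectors, and cosine similarity is a common choice for that. The sketch below is a plain-Python version that is fine for experimenting with a handful of vectors (a vector database does the same job at scale). It assumes you have already extracted the embedding lists from the API replies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy 2-D vectors; real embeddings have hundreds of dimensions.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Scores near 1 mean two texts point in nearly the same direction in embedding space; scores near 0 mean they are unrelated.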

Creating Custom Model Variants with Modelfiles

Ollama supports Modelfiles — simple configuration files that let you customize model behavior, set system prompts, and adjust parameters.

Create a file called Modelfile:

FROM qwen2.5:7b

SYSTEM You are a concise technical writing assistant. Always respond in plain English without jargon. Keep answers under 200 words unless specifically asked for more.

PARAMETER temperature 0.3
PARAMETER top_p 0.9

Build and run it:

ollama create my-tech-writer -f Modelfile
ollama run my-tech-writer

This is useful for creating specialized versions of base models without any fine-tuning.
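If you only need those settings for a single call, a similar effect is available per request: send the system prompt as a system message and the parameters in an options object. This sketch builds such a body for /api/chat, reusing the values from the Modelfile above; the function name is illustrative.

```python
def tech_writer_payload(user_text: str, model: str = "qwen2.5:7b") -> dict:
    """Build an /api/chat body that mirrors the Modelfile settings per request."""
    system = (
        "You are a concise technical writing assistant. "
        "Always respond in plain English without jargon. "
        "Keep answers under 200 words unless specifically asked for more."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
        ],
        "options": {"temperature": 0.3, "top_p": 0.9},
        "stream": False,
    }


payload = tech_writer_payload("Explain what a REST API is")
print(payload["options"])
```

A Modelfile is the better fit when several tools should share the same persona; per-request settings suit one-off experiments.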

Where MindStudio Fits with Local Models

Running models locally with Ollama is excellent for development, privacy-sensitive workflows, and experimentation. But there’s a common gap: once you’ve got a local model running, building a full application around it — with a proper UI, workflow logic, integrations, and automated triggers — still requires significant engineering work.

MindStudio addresses that. Its AI Media Workbench and agent builder both support local models, including Ollama and LM Studio. If you’ve got Ollama running on your machine or a local server, you can point MindStudio workflows at it and use that model within a broader automated workflow.

For teams that want to mix local and cloud models — using a local Ollama model for cost-sensitive tasks and a cloud model like Claude for complex reasoning — MindStudio lets you do that within a single workflow. You’re not locked into one provider.

More broadly, MindStudio gives you the orchestration layer that Ollama alone doesn’t provide. You can build agents that use your local model to process text, then pass results to a Google Workspace integration, send a Slack notification, or trigger a downstream workflow — all without writing infrastructure code.

You can start building with MindStudio for free at mindstudio.ai. If local AI models are already part of your stack, the integration is straightforward to configure.

Troubleshooting Common Ollama Issues

Model Downloads Stall or Fail

Large model files download in chunks. If a download stalls:

Press Ctrl+C and re-run ollama pull <model> — it resumes from where it stopped
Check available disk space (df -h on macOS/Linux)
Verify your internet connection is stable

Slow Performance (CPU-Only Mode)

If Ollama falls back to CPU:

macOS: Metal GPU acceleration is automatic on Apple Silicon. If it feels slow, try a smaller model, such as a :3b or :4b variant.
NVIDIA on Linux: Confirm CUDA drivers are installed (nvidia-smi should return output)
Windows NVIDIA: Check that you have the latest NVIDIA drivers installed

While a model is loaded, run ollama ps in another terminal. The PROCESSOR column shows whether the model is running on the GPU, the CPU, or split between the two. If it reports 100% CPU, Ollama is running CPU-only.

Port 11434 Already in Use

Another process is using Ollama’s default port. Either stop that process or change Ollama’s port:

OLLAMA_HOST=127.0.0.1:11435 ollama serve

Out of Memory Errors

The model is too large for your available RAM. Options:

Use a smaller parameter count (3B or 7B instead of 13B)
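Use a more aggressively quantized version (Q4 instead of Q8). The quantization level is part of the model tag (for example, a tag ending in q4_0), so check the model's tag list in the Ollama library to see which variants are available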
Close other applications to free RAM before running Ollama

Model Runs but Gives Poor Outputs

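First, check how the model is actually performing. The --verbose flag prints timing statistics, such as tokens per second, after each response:

ollama run qwen2.5:7b --verbose

To adjust inference parameters at runtime, use /set parameter inside the interactive session (for example, /set parameter temperature 0.3). Via the API, tune temperature (lower = more predictable) and num_ctx (context window size) in the request's options. Many quality issues come from context length being too short for the task.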
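To judge whether context length is the culprit, a rough size estimate helps. The sketch below assumes about four characters per token, which is only a rule of thumb for English text and varies by model and language, and it rounds up to the next power of two starting from 2048. Both choices are assumptions of this example, not Ollama behavior.

```python
import math


def suggest_num_ctx(
    prompt_chars: int, reply_tokens: int = 512, chars_per_token: float = 4.0
) -> int:
    """Estimate a context size that fits the prompt plus the expected reply."""
    needed = math.ceil(prompt_chars / chars_per_token) + reply_tokens
    size = 2048  # starting floor chosen for this sketch
    while size < needed:
        size *= 2
    return size


def chat_options(prompt: str, temperature: float = 0.3) -> dict:
    """Build an options object to include in an /api/chat or /api/generate body."""
    return {"temperature": temperature, "num_ctx": suggest_num_ctx(len(prompt))}


print(suggest_num_ctx(20000, reply_tokens=1000))
```

Larger context windows use more memory, and each model has its own maximum, so treat the result as a starting point rather than a guarantee.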

Frequently Asked Questions

Is Ollama free to use?

Yes, Ollama is completely free and open source under the MIT license. You download it, run it, and there are no usage fees. The cost is just your hardware (electricity and compute). The models themselves are also free to download: they're open-weight models released by their creators, though each one carries its own license terms, so check them before commercial use.

What’s the difference between a 7B and a 70B model?

The number refers to the number of parameters (weights) in the model. More parameters generally mean better reasoning, more nuanced outputs, and better handling of complex tasks — but also more RAM required and slower generation speed. A 7B model needs about 8 GB of RAM and runs fine on most laptops. A 70B model needs 48–64 GB of RAM and is really only practical on high-end workstations or servers.

For most everyday tasks, a well-tuned 7B model (like Qwen 2.5 7B or Gemma 3) gets you surprisingly far.
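The RAM figures follow from simple arithmetic: the weights take roughly (number of parameters) × (bits per weight) ÷ 8 bytes. The sketch below works that out for a few cases. It covers the weights only; actual downloads and memory use run higher because of runtime overhead, context memory, and quantization formats that average somewhat more than their nominal bit width.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for params in (7, 70):
    for bits in (4, 8, 16):
        print(f"{params}B at {bits}-bit: about {weights_gb(params, bits)} GB")
```

This is why a 4-bit 7B model fits comfortably on an 8 GB machine while a 70B model needs tens of gigabytes even when heavily quantized.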

Can I run Ollama on a machine without a GPU?

Yes. Ollama runs on CPU-only machines, but it’s slower. On a modern CPU with 16 GB RAM, a 7B model might generate 5–15 tokens per second, which is usable but not fast. With a GPU, you typically see 30–100+ tokens per second depending on the GPU and model size. If you’re on an Apple Silicon Mac (M1, M2, M3, M4), you get excellent performance because the unified memory architecture handles these workloads very efficiently.
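You can measure your own machine's speed rather than rely on ranges. Non-streaming API replies include timing fields; this sketch assumes eval_count (tokens generated) and eval_duration (in nanoseconds) and converts them into tokens per second. The second helper turns a rate into a wait time.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert a reply's token count and generation time (ns) into tokens/s."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def seconds_for(tokens: int, rate: float) -> float:
    """Estimate how long generating `tokens` takes at `rate` tokens per second."""
    return tokens / rate


# A 500-token answer at the low and high ends of the CPU-only range above.
print(seconds_for(500, 5), seconds_for(500, 15))
```

At 5 tokens per second a 500-token answer takes 100 seconds, which is a useful gut check for whether CPU-only is workable for your use case.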

How does Ollama compare to LM Studio?

Both tools run local models, but they take different approaches. LM Studio is GUI-first — you browse models, download them, and chat through a visual interface. Ollama is CLI and API-first — better suited for developers who want to integrate local models into other applications. LM Studio is easier for non-technical users to get started with. Ollama is more flexible for building things. Many people use both.

Is my data private when using Ollama?

Yes. Everything runs locally — your prompts never leave your machine. There’s no telemetry sent to Ollama’s servers about what you’re running or what you’re saying to the model. This is one of the core reasons people choose local models over cloud APIs, especially for sensitive business data, personal information, or proprietary code.

What models work best on Apple Silicon Macs?

Apple Silicon (M1, M2, M3, M4) handles local models particularly well because of the unified memory architecture — the CPU and GPU share the same high-bandwidth memory pool. Recommended models for different Mac configs:

8 GB RAM: llama3.2:3b, gemma3:4b
16 GB RAM: qwen2.5:7b, llama3.1:8b, mistral:7b
32 GB RAM: qwen2.5:14b, deepseek-r1:14b
64+ GB RAM: qwen2.5:32b, llama3.3:70b

Key Takeaways

Ollama makes local LLMs accessible — one command to install, one command to pull a model, one command to run it.
Start with 7B models — they balance performance and hardware requirements well. Qwen 2.5, Gemma 3, and LLaMA 3 are all solid choices.
The local API is the real power — Ollama’s REST endpoint and OpenAI-compatible /v1/ interface let you plug local models into almost any application or framework.
GPU helps but isn’t required — Apple Silicon Macs are the best hardware for local models without a dedicated GPU. NVIDIA GPUs on Linux and Windows work well with proper CUDA drivers.
Local models pair well with orchestration tools — Ollama handles the model runtime; tools like MindStudio handle the workflow, integrations, and application layer on top.

If you’re building workflows that use local models alongside cloud APIs, databases, and business tools, MindStudio’s agent builder is worth exploring. You can connect Ollama to broader automated workflows without writing the infrastructure yourself — and start free.