How to Set Up a Local AI Stack with Ollama, Open Web UI, and Continue in Under 2 Hours

Run your own AI stack locally with Ollama, Open Web UI, and Continue for VS Code. Full setup guide for privacy-first knowledge workers.

MindStudio Team

You Can Have a Private AI on Your Desk by This Afternoon

Most people who want to run AI locally spend their first weekend fighting drivers, misreading documentation, and eventually giving up and going back to ChatGPT. That’s a shame, because the actual setup — Ollama for inference, Open Web UI for a browser-based chat interface, and Continue for VS Code editor integration — takes under two hours on a modern Mac. This guide walks you through all three.

The stack we’re building is the one that Nate Jones calls the “daily-use runtime”: Ollama as your local OpenAI-compatible server, Open Web UI as the chat surface, and Continue as the VS Code extension that bridges your editor to the same local models. It’s not the most exotic configuration possible. It’s the one you’ll actually keep using.

A quick note on what this is not: this isn’t a guide to fine-tuning, serving models to a team, or squeezing every token out of your GPU. Those are real problems, but they come later. This is about getting a working local AI stack on your machine so you can start making real decisions about what belongs locally and what belongs in the cloud.


What You Actually Get When This Works

Before the prerequisites, it’s worth being concrete about the outcome.

When this stack is running, you’ll have a local server on port 11434 that speaks the OpenAI API format. Any tool that can point at an OpenAI-compatible endpoint — Open Web UI, Continue, or anything else — can talk to your models. You’re not locked into a single interface.
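
To make "OpenAI-compatible" concrete: once a model is pulled (Step 2 below), any HTTP client can hit the chat completions route. A minimal sketch using curl and the llama3.2 model from this guide:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'

Any client library that accepts a custom base URL can point at http://localhost:11434/v1 the same way.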

Open Web UI gives you a browser tab at localhost:3000 that looks and feels like ChatGPT, but the inference happens on your machine. No data leaves. No per-token bill. You can run it on every meeting transcript, every draft, every sensitive document, and nothing goes anywhere.

Continue gives you the same models inside VS Code. You get inline completions, a chat panel, and the ability to highlight code and ask questions about it — all routed through the same Ollama server you already set up. One runtime, multiple surfaces.

The privacy argument is real, but there’s also a practical one: once you own the inference, you stop rationing your prompts. You stop thinking “is this question worth the API cost?” That psychological shift changes how you use AI.


What You Need Before You Start

Hardware: A Mac with Apple Silicon and at least 16GB of unified memory will run smaller models fine. 32GB is more comfortable. The recommended entry point for serious daily use is a Mac mini M4 Pro with 64GB — that gives you enough headroom to run a 32B parameter model without constant swapping. If you’re on an Intel Mac or a Windows machine with an Nvidia GPU, Ollama still works; the setup is the same, though you’ll want to verify CUDA drivers are current.

Software you need installed:

  • Homebrew (the Mac package manager — if you don’t have it, the install is one terminal command)
  • Docker Desktop (for Open Web UI — free for personal use)
  • VS Code (for the Continue extension)

What you don’t need: Any cloud API keys. Any Nvidia account. Any Python environment. Ollama ships as a native Mac app and a single binary on Linux.

Time: Budget 90 minutes if you’ve never done this before. Probably 45 if you have.


Setting Up the Stack, Step by Step

Step 1: Install Ollama

Go to ollama.com and download the Mac app. It’s a standard .dmg install — drag to Applications, open it, and it puts a small icon in your menu bar.

Once it’s running, open Terminal and verify:

ollama --version

You should see a version number. Now Ollama is running as a local server on port 11434.

Now you have: A local inference server that speaks the OpenAI API format. Nothing is listening to it yet, but it’s there.
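
If you want direct proof that the server is listening, hit the port with curl; Ollama answers plain HTTP requests with a short status string:

curl http://localhost:11434
# → Ollama is running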

Step 2: Pull Your First Model

Ollama has a model registry similar to Docker Hub. To pull Llama 3.2 (a solid 3B model that runs fast on almost any Apple Silicon Mac):

ollama pull llama3.2

For a stronger generalist model if you have 32GB+ of RAM, pull Qwen 2.5 14B:

ollama pull qwen2.5:14b

Qwen has become one of the default model families for local agent work and tool use — it handles structured output and longer context better than most models at its size. If you’re doing any RAG work later, also pull a dedicated embedding model:

ollama pull nomic-embed-text

Embedding models are small and cheap to run. They’re central to any retrieval system, and running them locally means your documents never leave the machine to become vectors — which is one of the easiest privacy wins in this whole stack.
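
To see one of those vectors, Ollama's embeddings endpoint takes a model name and a piece of text. A quick sketch using the nomic-embed-text model pulled above (the prompt is just an example):

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Q3 planning: we agreed to ship the beta by November."
}'

The response is a JSON object with an embedding array, which is the vector you'd store in whatever retrieval layer you add later.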

To verify a model works:

ollama run llama3.2 "What is the capital of France?"

You should get a response in the terminal within a few seconds.

Now you have: A working local model you can query from the command line. The server is running, the model is loaded, and you can already use it via the Ollama CLI or by hitting http://localhost:11434 directly.

Step 3: Install Open Web UI

Open Web UI is a self-hosted chat interface that connects to Ollama. The easiest install is via Docker:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

This pulls the Open Web UI container, maps it to port 3000 on your machine, and mounts a volume so your chat history persists between restarts. The --add-host flag is what lets the container reach Ollama running on your Mac host — without it, the container can’t see the Ollama server.

Wait about 30 seconds for the container to start, then open http://localhost:3000 in your browser.
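
If the page doesn't come up, two standard Docker commands will tell you why before you start debugging anything deeper:

docker ps --filter name=open-webui   # is the container actually running?
docker logs open-webui               # read the startup output for errors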

The first time you open it, you’ll create a local admin account (username and password, stored locally — no external auth). After that, you’ll see a model selector dropdown. Click it and you should see the models you pulled in Step 2.

If the dropdown is empty, the container can’t reach Ollama. Jump to the troubleshooting section below.

Now you have: A browser-based chat interface connected to your local models. You can have full conversations, upload documents, and switch between models — all running on your machine.

Step 4: Install the Continue Extension in VS Code

Open VS Code, go to the Extensions panel (⇧⌘X), and search for “Continue”. Install the one by Continue Dev. It adds a sidebar panel and a keyboard shortcut (⌥⌘J by default) to open the chat.

After installing, Continue will ask you to configure a model. You want to point it at your local Ollama server instead of a cloud provider. Open the Continue config file (it’ll prompt you, or you can find it at ~/.continue/config.json) and add this:

{
  "models": [
    {
      "title": "Llama 3.2 (Local)",
      "provider": "ollama",
      "model": "llama3.2",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "Qwen 2.5 14B (Local)",
      "provider": "ollama",
      "model": "qwen2.5:14b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Autocomplete",
    "provider": "ollama",
    "model": "llama3.2",
    "apiBase": "http://localhost:11434"
  }
}

Save the file. Back in VS Code, open the Continue panel and you should see your local models in the dropdown. Try highlighting a function and pressing ⌥⌘L — it’ll open a chat with that code already in context.
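
One design note on this config: every entry shares the same apiBase, so the chat models and the autocomplete model all hit the single Ollama server from Step 1. Swapping the autocomplete model for something larger later is a one-line change here plus one ollama pull.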

Now you have: The full stack. Ollama serving models locally, Open Web UI for browser-based chat, and Continue routing your editor’s AI features through the same local server. One runtime, three surfaces.

Step 5: Add a Speech Model (Optional but Useful)

If you want local transcription for meeting notes, voice memos, or anything audio, Whisper is the reference implementation. Whisper is a speech model rather than a language model, so it doesn't run through Ollama the way the chat models do; install it directly on the machine instead.

The quickest install is via Homebrew:

brew install openai-whisper

Local transcription is fast on Apple Silicon and completely private. If you record every meeting for a year and transcribe locally, you build a searchable archive of your decisions and commitments — and no audio ever touches an external server. That’s a meaningful capability shift.
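
Once installed, transcription is a single command. A sketch, assuming a hypothetical recording called meeting.m4a in the current directory:

whisper meeting.m4a --model small --output_format txt

The first run downloads the model weights; everything after that happens locally. Larger Whisper models are more accurate but slower; small is a reasonable starting point on Apple Silicon.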

Now you have: A complete local AI stack covering chat, code assistance, and speech-to-text.


When Things Don’t Work

Open Web UI shows no models in the dropdown. The container can’t reach Ollama. First, confirm Ollama is actually running: check your menu bar for the Ollama icon, or run ollama list in Terminal. If Ollama is running but the UI still can’t see it, the --add-host flag may not have been included in your Docker command. Stop the container, remove it, and re-run the full Docker command from Step 3.
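
In Docker terms, that's:

docker stop open-webui
docker rm open-webui
# then re-run the full docker run command from Step 3

The named volume from Step 3 survives the removal, so your chat history is preserved.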

Ollama is slow or the model keeps getting unloaded. This usually means you’re running a model that’s too large for your available memory. A 14B model at Q4 quantization needs roughly 9GB of RAM for the weights alone, plus a few more gigabytes for the context cache. If you have 16GB total and other apps are running, you’ll see constant swapping. Try a smaller model (llama3.2 at 3B is fast and fits easily) or close other applications.

Continue can’t connect to Ollama. Check that apiBase in your config.json is exactly http://localhost:11434 — no trailing slash, no HTTPS. Also confirm the model name in the config matches what ollama list shows exactly (including the tag, like qwen2.5:14b not just qwen2.5).
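
One quick check that covers both: ask the server which models it’s serving through the OpenAI-compatible layer. The names it returns should be the exact tags Continue must use:

curl http://localhost:11434/v1/models

If that returns a model list but Continue still fails, the problem is in config.json rather than the server.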

Docker Desktop isn’t running. Open Web UI won’t start if Docker isn’t running. Check your menu bar for the Docker whale icon. If it’s not there, open Docker Desktop from Applications first.

The first model pull is very slow. That’s normal — you’re downloading multi-gigabyte files. A 14B model in Q4 quantization is around 8GB. Let it finish. Subsequent starts are fast because the weights are cached locally.

One thing that catches people: llama.cpp (the foundation that Ollama is built on) uses GGUF as its model format. If you ever try to load a model file manually, make sure it’s a .gguf file, not a raw HuggingFace checkpoint. Ollama handles this automatically when you pull from its registry, but it matters if you’re loading custom models.
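
If you do load a custom model, the mechanism is a Modelfile, a small text file that tells Ollama where the weights live. A minimal sketch, assuming a hypothetical my-model.Q4_K_M.gguf in the current directory:

# Point a one-line Modelfile at the local GGUF weights
cat > Modelfile <<'EOF'
FROM ./my-model.Q4_K_M.gguf
EOF

# Register the model under a name of your choosing, then run it
ollama create my-model -f Modelfile
ollama run my-model "hello"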


Where to Take This Next

Once the stack is running, the next decision is memory. The model is stateless — it doesn’t remember your last conversation unless you give it that context explicitly. For daily use, that’s fine. For anything that needs to accumulate knowledge over time (your meeting notes, your project decisions, your research), you need a retrieval layer.

The lightweight path is SQLite with sqlite-vec: a single file, easy to back up, easy to understand. The more serious path is Postgres with pgvector, which gives you relational data, metadata, and vector search in one place. If you want a pre-built system that handles chunking strategy and retrieval for you, OpenBrain is an open-source memory layer that connects to any AI via MCP and stores everything in a database you control.

For model selection beyond the defaults, the local landscape has real choices now. Gemma 4 is worth running locally — Google designed it specifically for open deployment and it punches above its weight at smaller sizes. If you’re deciding between Gemma and Qwen for your main generalist slot, this comparison of Gemma 4 vs Qwen 3.5 covers the tradeoffs in detail.

For coding specifically, the pattern that works is layered: a small fast model for autocomplete (which Continue handles), a repo-aware model for refactoring and test generation, and a frontier cloud model for the hardest architectural questions. If you’re already using Claude Code for the hard problems, you can run it against local models via Ollama to reduce costs on the repetitive inner loops.

If you’re building applications on top of this kind of local inference — not just using it yourself but shipping something — the abstraction question comes up quickly. Tools like Remy take a different approach to that problem: you write a spec in annotated markdown, and it compiles into a complete TypeScript backend, SQLite database, auth, and deployment. The spec is the source of truth; the generated code is derived output. It’s a different layer of abstraction than what we’ve been talking about, but it’s relevant when “I want to build a tool on top of my local models” becomes the question.

The broader point from Nate Jones’s framing is worth sitting with: the model list ages fast, but the stack doesn’t. Llama 4 Scout and Maverick are already showing where the open ecosystem is headed — mixture-of-experts, multimodal, longer context. GPT-OSS-20B and GPT-OSS-120B are Apache 2.0 open-weight reasoning models you can run on infrastructure you control. New models will keep arriving. If your runtime layer is healthy — Ollama serving a clean OpenAI-compatible endpoint, Open Web UI and Continue both pointing at it — swapping in a new model is one ollama pull command.

If you want to go further than a single-user setup and start building agents or workflows that chain models together, MindStudio offers a no-code path with 200+ models and 1,000+ pre-built integrations — useful when the question shifts from “how do I run a model” to “how do I connect models to the rest of my tools.”

The stack you built today is the foundation. The interesting work starts when you stop asking “can I run this locally?” and start asking “what do I actually want to own?”

Presented by MindStudio
