How to Build a Local AI Stack from Scratch: Ollama to vLLM, Step by Step
From Ollama for daily use to vLLM for serving to TensorRT-LLM for production — here's the complete local AI runtime stack and when to use each layer.
Most local AI setups fail within two weeks. Not because the hardware is wrong or the models are bad — because the runtime layer was chosen for the demo, not for daily use. You install something, get a model responding, feel good about it, and then three days later you’re back to ChatGPT because invoking the local thing takes too much friction. The fix takes maybe an afternoon to get right, and it starts with understanding that there isn’t one runtime — there are five, each occupying a different position in the stack. The full progression is: Ollama (daily use) → LM Studio (evaluation) → MLX (Apple native) → vLLM (serving) → TensorRT-LLM (production). Pick the right one for the right job, and the stack becomes durable. Conflate them, and you’ll keep rebuilding from scratch.
Here’s how to think through each layer.
The Foundation Nobody Talks About: llama.cpp
Before any of the runtimes above, there’s llama.cpp. Most people never call it directly, but it’s underneath almost everything in the local inference world. It created GGUF — the common quantized model format you’ll encounter constantly. It runs across CPU, Apple Metal, CUDA, and Vulkan. It’s the reason a 4-bit quantized 70B model fits in 40GB of RAM instead of 140GB.
You don’t need to operate llama.cpp directly. But you should know it exists, because when something breaks in your stack, understanding that Ollama is essentially a clean wrapper around llama.cpp tells you where to look.
The GGUF format matters for a specific reason: quantization. A Q4_K_M quantized model trades a small amount of quality for a large reduction in memory footprint and inference cost. For most personal workloads — writing, coding assistance, document search — the quality delta is negligible. For hard reasoning tasks, you want Q8 or full precision. Knowing this lets you make intentional tradeoffs instead of just grabbing whatever file is at the top of a Hugging Face repo.
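To make that tradeoff concrete, here's a back-of-the-envelope sketch in Python. The bits-per-weight figures are approximations (real GGUF files vary by tensor mix), and the overhead factor is a rough allowance for KV cache and runtime buffers, not a measured number:

```python
# Rough memory math for quantized models. Rule-of-thumb bits per
# weight, plus ~15% overhead for KV cache and runtime buffers.
BITS_PER_WEIGHT = {
    "F16": 16.0,      # full half-precision
    "Q8_0": 8.5,      # 8-bit quantization
    "Q6_K": 6.6,      # 6-bit k-quant
    "Q4_K_M": 4.8,    # the common 4-bit k-quant mix
}

def approx_gb(params_billion: float, quant: str, overhead: float = 1.15) -> float:
    """Approximate resident memory in GB for a quantized model."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total * overhead / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: ~{approx_gb(70, quant):.0f} GB")
# 70B @ F16: ~161 GB ... 70B @ Q4_K_M: ~48 GB
```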
Layer 1 — Ollama: The One You’ll Actually Use Every Day
Ollama is the right default for daily use. Not because it’s the fastest or most configurable — it isn’t — but because it makes local inference feel like a normal part of your computer rather than a weekend project you never finished.
What Ollama gives you: a clean CLI, a local server running at localhost:11434, a simple model registry (ollama pull llama3.2, done), and an OpenAI-compatible API surface. That last part is the important one. Any tool that knows how to talk to OpenAI’s API can be pointed at your local Ollama instance instead. Continue (the VS Code extension for local coding assistance), Open WebUI, Aider for terminal-based code editing — they all work against Ollama without modification.
The setup is genuinely simple. Install Ollama, pull a model, and you have a local inference server. For a Mac mini M4 Pro with 64GB, ollama pull qwen2.5-coder:32b gets you a serious coding model. ollama pull nomic-embed-text gets you an embedding model for local retrieval. Both running simultaneously, no GPU required.
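Here's what "OpenAI-compatible" looks like in practice: a minimal sketch using the official openai Python client pointed at Ollama. It assumes pip install openai, Ollama running locally, and the two models above already pulled; the api_key value is a placeholder that Ollama ignores:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

# Chat against the local coding model
reply = client.chat.completions.create(
    model="qwen2.5-coder:32b",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(reply.choices[0].message.content)

# Embeddings for local retrieval
emb = client.embeddings.create(model="nomic-embed-text", input="local AI stack")
print(len(emb.data[0].embedding))
```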
The thing Ollama doesn’t do well is batching. If you’re serving multiple users or running high-throughput agentic loops, Ollama’s one-request-at-a-time default becomes a bottleneck. That’s fine — that’s not what it’s for. Ollama is the daily driver. You invoke it from your editor, your launcher (Raycast or Alfred both support LLM integrations), your terminal. The goal is zero friction to get a model response from wherever you’re working.
If you’re trying to run Claude Code against a local model to cut costs, Ollama is the local endpoint you’re pointing it at.
Layer 2 — LM Studio: The Evaluation Workbench
LM Studio is where you go when you want to understand a model before committing to it in your stack. It’s a polished GUI for loading GGUF models, testing different quantization levels side by side, and seeing how a model actually behaves on your hardware before you wire it into anything.
The workflow is: new model drops (say, Gemma 4 or a new Qwen release), you pull it into LM Studio, run your standard set of test prompts, check latency and quality at Q4 vs Q8, decide whether it belongs in your stack. Then you move the winner to Ollama for daily use.
LM Studio also has its own OpenAI-compatible server mode, so you can use it as a drop-in replacement for Ollama during evaluation. But I’d keep the two roles separate. LM Studio for evaluation, Ollama for production daily use. Mixing them means you’re never sure which runtime is actually serving your tools.
One underrated feature: LM Studio shows you token generation speed, memory usage, and context window utilization in real time. When you’re trying to figure out whether a 70B model at Q4_K_M is actually fast enough for interactive use on your hardware, this matters. The two numbers you care about are time to first token and sustained generation speed in tokens per second. For interactive use, you want first token under 2 seconds and sustained generation above 15 tokens/second. Anything slower and the model starts feeling like a tax.
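You can also measure both numbers yourself against any OpenAI-compatible endpoint (Ollama on port 11434, or LM Studio's server mode on its own port). A rough sketch, treating each streamed chunk as roughly one token, which is close enough for a go/no-go decision:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen2.5-coder:32b",
    messages=[{"role": "user", "content": "Explain continuous batching."}],
    stream=True,
)
for chunk in stream:
    # Skip role-only chunks; count content-bearing ones as ~1 token each
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

elapsed = time.perf_counter() - first_token_at
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"sustained speed: ~{chunks / max(elapsed, 1e-9):.1f} tokens/s")
```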
For a deeper look at how Gemma 4 performs locally through Ollama, the evaluation workflow in LM Studio is exactly where you’d start before committing to that model in your daily stack.
Layer 3 — MLX: The Apple Silicon Native Path
If you’re on Apple Silicon, MLX is worth understanding even if you don’t use it directly. It’s Apple’s machine learning framework, optimized for the unified memory architecture of M-series chips. The key difference from llama.cpp/Ollama is that MLX is designed from the ground up for Metal and unified memory, rather than being a portable engine that treats Metal as one backend among several.
In practice, this means faster inference for certain model sizes on Apple hardware. The MLX community has been porting popular models quickly — you can find MLX-format versions of Llama 4, Qwen, Gemma 4, and Mistral models. The mlx-lm Python package is the main entry point.
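Here's what the mlx-lm entry point looks like in practice. A minimal sketch, assuming pip install mlx-lm on an Apple Silicon machine; the model repo name is illustrative, so check the mlx-community organization on Hugging Face for current conversions:

```python
from mlx_lm import load, generate

# Model name is an example, not a recommendation; any mlx-community
# conversion with a 4-bit suffix follows the same pattern.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Summarize the tradeoffs of 4-bit quantization.",
    max_tokens=256,
)
print(text)
```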
The tradeoff is ecosystem maturity. Ollama has broader tool compatibility. MLX requires more manual setup and doesn’t have the same plug-and-play integration story. My recommendation: use MLX when you’ve identified a specific model and workload where the performance delta matters — typically long-context work or large models where you’re pushing the memory limits of your machine. For everything else, Ollama’s simplicity wins.
The Mac Studio M4 Max with 128GB or 256GB unified memory is where MLX starts to shine in ways that are hard to replicate on other hardware. A 70B model at full FP16 precision needs about 140GB, which only the 256GB configuration can hold. On a 128GB Mac Studio, you’re running Q6 or Q8 quantization on a 70B model and getting quality that’s close to full precision. That’s a serious local inference machine, and MLX is the native performance path for it.
Layer 4 — vLLM: When Local Inference Becomes Infrastructure
vLLM is a different category of tool. It’s not for personal use — it’s for serving. If you’re building an internal product, running a team-shared inference endpoint, or operating agentic loops that need real throughput, vLLM is where the conversation starts.
The key feature is continuous batching. Instead of processing one request at a time (Ollama’s model), vLLM dynamically batches incoming requests, dramatically increasing GPU utilization and throughput. On an RTX 5090 with 32GB GDDR7, the difference between Ollama and vLLM for concurrent requests is substantial — vLLM can handle 10-20x more requests per second for the same model.
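You can see the batching model directly in vLLM's offline API: hand it a list of prompts and the engine schedules them concurrently instead of serially. A sketch assuming pip install vllm on a CUDA machine; the model name is just an example:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    # tensor_parallel_size=2,  # uncomment to shard across two GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=128)

# All 32 prompts are batched through the engine in one pass,
# not processed one request at a time.
prompts = [f"Question {i}: explain KV caching briefly." for i in range(32)]
outputs = llm.generate(prompts, params)
for out in outputs[:2]:
    print(out.outputs[0].text)
```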
vLLM also has an OpenAI-compatible API, so the same tools that point at Ollama can point at vLLM. The migration path is: prototype with Ollama, validate the model and workflow, deploy with vLLM when you need to serve more than one person or run high-volume batch jobs.
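In code, that migration is a one-line change on the client side. This assumes a vLLM server started with something like vllm serve Qwen/Qwen2.5-7B-Instruct, which listens on port 8000 by default:

```python
from openai import OpenAI

# client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")  # Ollama
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")     # vLLM

# Everything downstream -- tools, prompts, agents -- stays identical.
```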
The setup is more involved. You need CUDA, a proper Python environment, and some understanding of how to configure tensor parallelism if you’re running across multiple GPUs. But the payoff is real: if you’re running long agentic loops locally, vLLM’s throughput means you can run more loops, longer, without the inference becoming the bottleneck. And since you’re paying electricity instead of per-token API costs, the economics of long-running agents change completely. This is the same insight behind the open-claw phenomenon — people set up local inference and then actually run the long agent loops they were psychologically avoiding when paying per token.
Platforms like MindStudio handle the orchestration layer above this: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — useful when you want the agent logic without writing the serving infrastructure yourself.
Layer 5 — TensorRT-LLM and Beyond: The Production Tier
TensorRT-LLM, SGLang, and Nvidia NeMo are the serious deployment tier. These are not personal computing tools — they’re for when you’ve committed to CUDA infrastructure and need to squeeze every bit of performance out of it.
TensorRT-LLM compiles models into optimized inference engines for specific Nvidia GPU architectures. The compilation step takes time (sometimes hours for large models), but the resulting engine is significantly faster than vLLM for the same hardware. Latency, structured generation, and serving economics at scale are where TensorRT-LLM earns its complexity.
The Nvidia DGX Spark is the hardware context where this tier makes sense for individuals. A Grace Blackwell chip with 128GB of coherent unified memory on your desk — the same architecture class as data center GPUs, packaged as a personal appliance. If you’re running TensorRT-LLM on a DGX Spark, you have a serious local inference machine that can handle workloads that would otherwise require cloud GPU instances.
For most people reading this, TensorRT-LLM is aspirational. Know it exists, understand where it sits in the stack, and revisit it when your workload justifies the operational complexity.
The Model Layer: What You Actually Run
The runtime stack is the durable thing. The models are the swappable thing. But you still need a mental model for which models belong in which positions.
Think in terms of roles, not model names. You want: a fast small model for cheap interactive calls (Gemma 4 2B or Qwen 2.5 1.5B), a strong generalist model for hard local work (Llama 4 Scout or Qwen 2.5 72B), a coding-specialized model (Qwen 2.5 Coder 32B or a DeepSeek Coder variant), an embedding model for retrieval (nomic-embed-text or a Qwen embedding model), and Whisper for local transcription.
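In code, the role layer can be as simple as a lookup table that your tools call instead of hardcoding model names. This is a hypothetical sketch; the tags are illustrative Ollama-style names for the models mentioned above, not verified registry entries:

```python
# Tools ask for a role, not a model name, so swapping a model
# later is a one-line change in one place.
MODEL_ROLES = {
    "fast": "gemma:2b",             # cheap interactive calls
    "generalist": "qwen2.5:72b",    # hard local reasoning
    "coder": "qwen2.5-coder:32b",   # coding assistance
    "embed": "nomic-embed-text",    # retrieval embeddings
}

def model_for(role: str) -> str:
    """Resolve a role to whatever model currently holds that slot."""
    return MODEL_ROLES[role]
```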
The Llama 4 Scout and Maverick models are worth understanding because they’re mixture-of-experts architectures — the question isn’t how big the model is, but how much of it fires per token. This changes the memory and compute math significantly. GPT-OSS-20B and GPT-OSS-120B are OpenAI’s permissively licensed (Apache 2.0) reasoning models — weights you run on infrastructure you control, not models you call through the API.
For comparing Gemma 4 vs Qwen 3.5 on local workloads, the short version is: Gemma 4 is optimized for smaller sizes with strong capability, Qwen is the default for agents, coding, and multilingual work. Both belong in a well-configured local stack.
Memory: The Layer That Makes It Compound
A runtime without memory is a stateless tool. Useful, but not compounding.
The memory layer sits outside the model. Your notes, documents, meeting transcripts, code decisions, and project state need to live somewhere durable that the model can query. The right default for serious work is Postgres with pgvector — relational data, metadata, permissions, and vector search in one place. For personal use, SQLite with sqlite-vec is the lightweight version: a single file, easy to back up, easy to understand.
The architectural principle that matters most: keep raw data and embeddings separate in your database. When a better embedding model arrives (and it will), you can rebuild the vector index without losing the source documents. Most retrieval failures aren’t model failures — they’re pipeline failures, usually from chunking strategies that didn’t account for document structure. PDFs need different handling than markdown. Meeting transcripts need speaker labels and timestamps. Code needs symbol-aware indexing.
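Here's a minimal sketch of that separation with sqlite-vec: raw documents in a normal table, embeddings in a separate virtual table keyed by the same rowid, so the vector index can be dropped and rebuilt without touching source data. It assumes pip install sqlite-vec and 768-dimension embeddings (nomic-embed-text's output size):

```python
import sqlite3
import sqlite_vec

db = sqlite3.connect("memory.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# Raw data and embeddings live in separate tables on purpose.
db.execute("CREATE TABLE IF NOT EXISTS documents(id INTEGER PRIMARY KEY, body TEXT)")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS doc_vecs USING vec0(embedding float[768])")

def index_document(doc_id: int, body: str, embedding: list[float]) -> None:
    db.execute("INSERT INTO documents(id, body) VALUES (?, ?)", (doc_id, body))
    db.execute(
        "INSERT INTO doc_vecs(rowid, embedding) VALUES (?, ?)",
        (doc_id, sqlite_vec.serialize_float32(embedding)),
    )

def search(query_embedding: list[float], k: int = 5):
    """k-nearest-neighbor search; returns (doc_id, body, distance)."""
    hits = db.execute(
        "SELECT rowid, distance FROM doc_vecs "
        "WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
        (sqlite_vec.serialize_float32(query_embedding), k),
    ).fetchall()
    return [
        (doc_id, db.execute("SELECT body FROM documents WHERE id = ?",
                            (doc_id,)).fetchone()[0], dist)
        for doc_id, dist in hits
    ]
```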
Open Brain is an open-source memory system built specifically for this — SQL-driven database plus embedding management plus an MCP server, so Claude or any other model can query your memory through a standard tool interface. The MCP server is the right direction for connecting local memory to any model client, but treat it like any other tool surface: it needs permissions, logging, and boundaries. A writing agent doesn’t need shell access. A meeting summarizer doesn’t need permission to delete files.
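To make the boundary idea concrete, here's a hypothetical read-only memory tool exposed over MCP using the official Python SDK's FastMCP helper (pip install mcp). The point is the permission surface: this server can search memory and do nothing else. The embed() and search() stubs stand in for the embedding call and the vector query from the previous sketch:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-memory")

def embed(text: str) -> list[float]:
    raise NotImplementedError  # call your local embedding model here

def search(vec: list[float], k: int) -> list[tuple[int, str, float]]:
    raise NotImplementedError  # query your sqlite-vec index here

@mcp.tool()
def search_memory(query: str, k: int = 5) -> list[str]:
    """Return the k most relevant memory snippets for a query."""
    # Read-only by construction: no write, delete, or shell tools exist.
    return [body for _, body, _ in search(embed(query), k)]

if __name__ == "__main__":
    mcp.run()
```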
For building a personal knowledge base that compounds over time, the Andrej Karpathy LLM wiki approach with Claude Code is worth studying — the principle of turning raw documents into a structured, queryable knowledge base is the same whether you’re using cloud or local models.
The Interface Principle: Many Surfaces, One Stack
The last failure mode is interface. A great runtime with no comfortable surface is a setup you’ll stop using.
The principle is: many surfaces, one stack underneath. Your editor (Continue pointing at Ollama), your notes app, your browser, your launcher (Raycast or Alfred with LLM integrations), your terminal (Aider for code editing), and your voice recorder (Whisper for transcription) — none of these should have separate memory layers. They should all call into the same local runtime and the same memory layer.
This is the part that most AI products won’t give you, because their business model depends on owning the memory underneath the input channel. You accumulate memory inside a particular cloud service, and then you can’t get it out. The local stack inverts this: you own the memory, and the models — local or cloud — come to you.
When you’re building applications on top of this kind of local inference infrastructure, the abstraction level above the runtime matters. Tools like Remy take a different approach to the application layer: you write a spec — annotated markdown — and a complete full-stack application gets compiled from it, TypeScript backend, SQLite database with auto-migrations, auth, deployment. The spec is the source of truth; the generated code is derived output. It’s a different layer of the stack, but the same principle: own the source, let the derived artifacts be regenerated.
The Routing Decision
The personal AI stack is ultimately a routing system. Some work stays local because it’s private, repetitive, or context-heavy. Some work goes to the cloud because it’s rare, hard, or needs frontier capability.
The power comes from you making that routing decision intentionally, rather than defaulting to whatever the cloud provider wants. A frontier model is a specialist you hire for hard problems — not your file system, not your memory layer, not your workflow engine.
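Written as code, the routing decision is a small, explicit policy rather than a default. A hypothetical sketch; the criteria mirror the paragraph above and the endpoints are illustrative, not a prescription:

```python
from dataclasses import dataclass

LOCAL = "http://localhost:11434/v1"
FRONTIER = "https://api.openai.com/v1"  # any frontier provider

@dataclass
class Task:
    private: bool = False
    repetitive: bool = False
    context_heavy: bool = False
    hard: bool = False

def route(task: Task) -> str:
    """Local by default; frontier only for rare, hard problems."""
    if task.private or task.repetitive or task.context_heavy:
        return LOCAL
    return FRONTIER if task.hard else LOCAL
```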
Build the runtime stack right, and you’re not buying a model appliance. You’re building a substrate that new models can drop into, new runtimes can replace, and new agents can call — without taking your knowledge base with them when they go.
The machine on your desk has a job. Give it one before it arrives.