How to Run Open-Weight AI Models Locally with Ollama and LM Studio
Run Qwen 3.6, Gemma, and DeepSeek locally with Ollama and LM Studio. This guide covers setup, quantization, and performance on consumer hardware.
Why Running LLMs Locally Is Worth Your Time
Running open-weight AI models locally has gone from a niche hobby to a practical option for developers, researchers, and privacy-conscious users. With tools like Ollama and LM Studio, you can run models like Qwen 3, Gemma 3, and DeepSeek-R1 on consumer hardware — no API key, no monthly bill, no data leaving your machine.
This guide covers everything you need to get started: which tools to use, how to pick and configure models, what quantization means for performance, and realistic expectations for what your hardware can handle.
What “Open-Weight” Actually Means
Before getting into setup, it’s worth being precise about terminology. “Open-weight” means the model weights are publicly available — you can download and run them yourself. It does not necessarily mean fully open-source (some models restrict commercial use or fine-tuning).
Popular open-weight models right now include:
- Meta LLaMA 3.1 and 3.3 — Strong general-purpose models, widely supported
- Qwen 3 (Alibaba) — Excellent multilingual performance, comes in sizes from 0.6B to 235B
- Gemma 3 (Google) — Efficient, well-documented, strong at reasoning tasks
- DeepSeek-R1 — Reasoning-focused model with strong benchmark scores
- Mistral and Mixtral — Fast inference, solid for instruction following
- Phi-4 (Microsoft) — Surprisingly capable at small sizes (14B and under)
These models vary widely in size, capability, and licensing. Choosing the right one depends on your hardware and use case.
Understanding Quantization (The Short Version)
Full-precision LLMs require enormous amounts of VRAM. A 70B parameter model at 16-bit precision takes around 140GB — well beyond what most consumer GPUs can handle.
Quantization compresses the model by reducing the precision of its weights. Instead of storing each weight as a 16-bit float, quantized models use 4-bit or 8-bit integers. The tradeoff is a small reduction in quality for a dramatic reduction in memory use.
Common Quantization Formats
| Format | Bits per weight | Memory reduction | Quality impact |
|---|---|---|---|
| F16 | 16 | None (baseline) | None |
| Q8_0 | 8 | ~50% | Minimal |
| Q5_K_M | 5 | ~69% | Very low |
| Q4_K_M | 4 | ~75% | Low |
| Q3_K_M | 3 | ~81% | Moderate |
| Q2_K | 2 | ~87% | Significant |
For most use cases, Q4_K_M is the sweet spot — it gives you good quality with roughly 4–5GB of memory needed for a 7B model. Ollama and LM Studio both handle quantized models in the GGUF format, which is the standard for local inference.
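The percentages in the table follow directly from the bit widths. Here is a minimal sketch of that arithmetic. The function names are illustrative, and it counts weights only: it ignores the KV cache, activations, and the per-block scaling metadata that makes real GGUF files somewhat larger than the nominal bit width suggests.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight  # billions of params x bytes each = GB


def reduction_vs_f16(bits_per_weight: float) -> float:
    """Fractional memory saving relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


for name, bits in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name}: 70B ≈ {weight_memory_gb(70, bits):.0f} GB, "
          f"saves {reduction_vs_f16(bits):.0%}")
```

The gap between the 3.5GB this gives for a 7B model at 4 bits and the 4–5GB quoted above is the runtime overhead the sketch leaves out.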
How Much VRAM Do You Actually Need?
A rough rule of thumb: multiply the model’s parameter count (in billions) by 0.6 for a Q4 quantized model to get an approximate VRAM requirement in GB.
- 7B model at Q4 ≈ 4–5GB VRAM
- 13B model at Q4 ≈ 8–9GB VRAM
- 34B model at Q4 ≈ 20–22GB VRAM
- 70B model at Q4 ≈ 40–45GB VRAM (or needs RAM offloading)
If your GPU doesn’t have enough VRAM, both Ollama and LM Studio can offload layers to system RAM — but this comes at a significant speed penalty.
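The rule of thumb above is easy to turn into a quick check before downloading anything. This is a rough sketch, not a precise sizing tool: the 0.6 factor comes from the text, the helper names are made up for illustration, and context length will push real usage higher.

```python
Q4_GB_PER_BILLION = 0.6  # rule-of-thumb factor from the text above


def estimate_q4_vram_gb(params_billion: float) -> float:
    """Approximate VRAM needed for a Q4 model of the given size."""
    return params_billion * Q4_GB_PER_BILLION


def needs_offload(params_billion: float, vram_gb: float) -> bool:
    """True if the estimate exceeds the available VRAM."""
    return estimate_q4_vram_gb(params_billion) > vram_gb


for size in (7, 13, 34, 70):
    need = estimate_q4_vram_gb(size)
    verdict = "offload to RAM" if needs_offload(size, 24) else "fits"
    print(f"{size}B at Q4 ≈ {need:.1f} GB -> {verdict} on a 24 GB GPU")
```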
Setting Up Ollama
Ollama is a command-line tool that makes running local LLMs as straightforward as pulling a Docker image. It handles model downloads, quantization selection, and serving a local API automatically.
Installation
Ollama supports macOS, Linux, and Windows. Installation is a single download:
- macOS: Download the .dmg from ollama.com and run it. Ollama runs as a menu bar app.
- Linux: Run curl -fsSL https://ollama.com/install.sh | sh in your terminal.
- Windows: Download the Windows installer from the same site.
After installation, Ollama runs a local server on port 11434.
Pulling and Running Your First Model
Open a terminal and pull a model:
ollama pull qwen3:8b
Ollama will download the default quantized version. To run it interactively:
ollama run qwen3:8b
You’ll get a prompt where you can type messages directly. To exit, type /bye.
Choosing Specific Quantizations
By default, Ollama picks a sensible quantization for the model. But you can specify:
ollama pull llama3.3:70b-instruct-q4_K_M
Use ollama list to see what’s downloaded, and ollama show <model> for details about a specific model.
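The same listing is available over HTTP if you want it inside a script. The sketch below assumes Ollama's native GET /api/tags endpoint on the default port, and that the response is a JSON object with a "models" list whose entries carry a "name" and a "size" in bytes. The helper names are illustrative.

```python
import json
import urllib.request


def summarize_models(tags: dict) -> list[str]:
    """Turn an /api/tags response into 'name  size' lines, largest first."""
    models = sorted(tags.get("models", []), key=lambda m: m.get("size", 0), reverse=True)
    return [f"{m['name']}  {m.get('size', 0) / 1e9:.1f} GB" for m in models]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of downloaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("\n".join(summarize_models(fetch_tags())))
    except OSError as exc:  # covers connection refused and timeouts
        print(f"Could not reach Ollama: {exc}")
```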
Using Ollama’s API
One of Ollama’s best features is its OpenAI-compatible REST API. Any app built for the OpenAI API can point to Ollama instead:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Explain quantization in two sentences."}]
}'
This means you can swap local models into existing tools, scripts, or pipelines with minimal changes.
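For scripts, the same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above. It assumes the server is on the default port and that the response follows the OpenAI chat completions shape (choices, then message, then content). The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    try:
        print(chat("http://localhost:11434", "qwen3:8b",
                   "Explain quantization in two sentences."))
    except OSError as exc:  # server not running, model missing, or timeout
        print(f"Request failed: {exc}")
```

Keeping the request builder separate from the network call makes it easy to inspect or log the exact payload before sending it.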
Useful Ollama Commands
ollama list # Show downloaded models
ollama ps # Show running models
ollama rm <model> # Delete a model
ollama serve # Start the server manually (if not running)
Setting Up LM Studio
LM Studio is a desktop application with a graphical interface. It’s a better choice if you prefer not to use the command line, or if you want a built-in chat UI, model browser, and performance monitoring in one place.
Installation
Download LM Studio from lmstudio.ai. It’s available for macOS (Apple Silicon and Intel), Windows, and Linux. The app is self-contained — no additional installs required.
Browsing and Downloading Models
LM Studio’s home screen includes a model discovery interface connected to Hugging Face. You can search by name, filter by size, and see community recommendations.
To download a model:
- Open the Discover tab.
- Search for a model (e.g., “Gemma 3” or “DeepSeek-R1”).
- Select the quantization you want. LM Studio labels each option with estimated VRAM usage.
- Click Download.
Models are stored locally in ~/.lmstudio/models (macOS/Linux) or the equivalent Windows path.
Running Models in LM Studio
Once downloaded:
- Go to the Chat tab.
- Select your model from the dropdown.
- Adjust context length, temperature, and system prompt in the settings panel.
- Start chatting.
The GPU Offload slider lets you control how many layers run on GPU vs. CPU. More GPU layers = faster inference but more VRAM used. LM Studio shows you estimated VRAM usage in real time as you adjust.
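To get a feel for where the slider should land, you can estimate how many layers fit in a given VRAM budget. The sketch below is a simplification, not how LM Studio computes its estimate: it assumes every layer is the same size and ignores the KV cache. The numbers in the example are hypothetical.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Number of layers that fit in the VRAM budget, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor(vram_budget_gb / per_layer_gb))


# Hypothetical example: a 20 GB model with 64 layers and 12 GB of VRAM to spare.
print(gpu_layers_that_fit(20, 64, 12))  # 38
```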
LM Studio’s Local Server
Like Ollama, LM Studio can run a local OpenAI-compatible API server. Go to the Local Server tab, load a model, and click Start Server. It runs on port 1234 by default.
This is useful for connecting LM Studio to local development environments, coding assistants like Continue.dev, or any tool that supports a custom OpenAI endpoint.
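Because both tools expose the same OpenAI-style route, switching between them in a script can come down to changing the base URL. A small sketch, assuming the default ports mentioned in this guide (the dictionary and function names are illustrative):

```python
# Default local endpoints as described in this guide; adjust if you changed the ports.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def chat_completions_url(backend: str) -> str:
    """Return the chat completions URL for a named local backend."""
    try:
        return f"{BACKENDS[backend]}/chat/completions"
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
        ) from None


print(chat_completions_url("lmstudio"))
```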
Model Recommendations by Hardware Tier
Not every machine can run every model. Here’s a practical breakdown of what works well at each hardware level.
8GB VRAM (e.g., RTX 3070, RTX 4060, M2 MacBook Air)
- Qwen3 4B or 8B (Q4) — Fast inference, good at reasoning and coding
- Gemma 3 4B or 12B (Q4) — Strong for its size, excellent instruction following
- Phi-4 14B (Q4) — Pushes the limit but works with some layer offloading
- Mistral 7B — Reliable, fast, good for general tasks
16–24GB VRAM (e.g., RTX 4090, RTX 3090, M3 Pro/Max)
- LLaMA 3.3 70B (Q3 or Q4) — Near-frontier quality, but even at Q3 it exceeds 24GB of VRAM, so it runs only with heavy offloading to system RAM and far fewer tokens per second
- Qwen3 14B or 32B (Q4) — Excellent multilingual and coding performance
- DeepSeek-R1 14B or 32B (Q4) — Strong reasoning chains, good for step-by-step tasks
- Gemma 3 27B (Q4) — Google’s best open model at a manageable size
Apple Silicon (Unified Memory)
Apple Silicon is uniquely well-suited for local LLMs because RAM and VRAM share the same pool. An M3 Max with 64GB unified memory can run 70B models at reasonable speeds.
- M1/M2 (8–16GB): 7B–13B models comfortably
- M2/M3 Pro (18–36GB): Up to 34B models
- M3 Max/Ultra (64–192GB): 70B models and beyond
Ollama has native Metal support. LM Studio also uses Metal acceleration on Apple Silicon. Both will automatically use the GPU.
CPU-Only (No Dedicated GPU)
It’s possible, but slow. Expect 1–5 tokens per second on a modern CPU for a 7B model. Models like Phi-4 mini or Qwen3 0.6B are designed for efficiency and handle CPU inference better than larger models.
Performance Tips and Common Troubleshooting
Getting Better Inference Speed
- Use Q4_K_M instead of Q8 when VRAM is tight. The quality difference is small, and the speed gain is real.
- Reduce context length. A 128K context window uses more memory than a 4K one. Set it to what you actually need.
- Close background applications that use GPU resources (games, video editing software, etc.).
- On Ollama, raise the num_gpu option (the number of layers sent to the GPU), for example with /set parameter num_gpu 99 inside an ollama run session, to request full GPU offloading.
- On LM Studio, use the GPU offload slider to maximize layers on GPU.
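The context-length tip is worth quantifying, because the KV cache grows linearly with the number of tokens. The sketch below uses the standard size formula for a transformer KV cache. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption typical of an 8B Llama-style model, and it assumes a 16-bit cache for a single sequence.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys and values for every layer, KV head, and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Assumed shape: 32 layers, 8 KV heads, head dimension 128.
for ctx in (4_096, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

Under these assumptions a 4K context costs about 0.5 GiB while a full 128K context costs about 16 GiB, which is several times the size of the Q4 weights themselves.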
Common Issues and Fixes
Model loads but inference is very slow
This usually means layers are being offloaded to RAM. Either reduce context size, pick a smaller quantization, or try a smaller model variant.
“Out of memory” error
Reduce the number of GPU layers in LM Studio, or pick a more aggressive quantization (e.g., Q3 instead of Q4).
Model gives garbled or repetitive output
This can happen with incorrect chat templates. In LM Studio, make sure the selected model is loaded with its correct template. In Ollama, stick to models from the official library, which include correct templates by default.
Ollama not using GPU on Windows
Ensure you have the latest NVIDIA drivers installed. Run ollama ps to confirm GPU usage. If it shows CPU, try reinstalling Ollama after a driver update.
LM Studio model download fails
Check available disk space. Large models (30–70B at Q4) can require 20–40GB. Also check that your Hugging Face connection isn’t rate-limited.
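A quick way to rule out the disk-space cause before starting a large download is to check free space on the target drive. This is a small sketch using Python's standard library. The 2GB headroom default and the function name are arbitrary choices for illustration.

```python
import shutil


def has_room_for(download_gb: float, path: str = ".", headroom_gb: float = 2.0) -> bool:
    """True if the filesystem holding `path` has space for the download plus headroom."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= download_gb + headroom_gb


# Check the current directory's drive for a download of about 40 GB.
print(has_room_for(40))
```

Point path at the drive that actually holds your models directory, since it may differ from the one you run the script on.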
Where MindStudio Fits Into This
Running models locally is great for privacy, experimentation, and cost control. But building actual workflows on top of local models — automated pipelines, multi-step agents, connected tools — takes significantly more work if you’re coding it from scratch.
MindStudio’s AI Media Workbench already supports local model backends including Ollama and LM Studio. You can connect your local Ollama instance to MindStudio’s visual workflow builder and use it as the inference backend for agents you build — while still connecting to external tools like Google Workspace, Slack, Notion, or HubSpot without writing integration code.
This means you get the privacy and cost benefits of local inference combined with the orchestration layer MindStudio provides. You’re not choosing between local models and capable workflows — you can have both.
For teams that want to build AI agents without code, MindStudio also gives you access to 200+ hosted models alongside your local ones, so you can route specific tasks to the model best suited for them — a local Qwen3 for private document analysis, a hosted Claude for customer-facing responses, all in one workflow.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
Is Ollama or LM Studio better for beginners?
LM Studio is generally easier to start with because it has a graphical interface, a built-in model browser, and visual settings for GPU offloading. Ollama is faster to set up if you’re comfortable with a terminal and better suited for integration with other tools and scripts via its API.
Can I run DeepSeek-R1 locally?
Yes. The distilled DeepSeek-R1 variants (based on Qwen and LLaMA architectures) come in 1.5B, 7B, 8B, 14B, 32B, and 70B sizes and are well-supported in both Ollama and LM Studio. The 7B and 14B versions run comfortably on mid-range GPUs and offer good reasoning performance at accessible sizes. The full 671B model is not practical on consumer hardware.
What is GGUF format?
GGUF (GPT-Generated Unified Format) is the standard file format for locally running quantized LLMs. It replaced the older GGML format and is supported by llama.cpp, which powers both Ollama and LM Studio under the hood. GGUF files contain model weights, tokenizer data, and metadata in a single portable file.
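To make the "single portable file" point concrete, here is a minimal sketch that reads the fixed-size start of a GGUF file: a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. It assumes the version 2 or later layout and does not parse the metadata itself. The function name and the file path in the usage comment are placeholders.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (version 2 or later layout)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", data[:24])
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}


# Usage with a real file (the path is a placeholder):
# with open("model.gguf", "rb") as f:
#     print(read_gguf_header(f.read(24)))
```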
How fast are local models compared to cloud APIs?
It depends heavily on your hardware. A well-configured 7B model on an RTX 4090 typically generates 80–120 tokens per second, comparable to many cloud APIs. Larger models on consumer hardware run slower, often 10–30 tokens per second. A 70B model at Q4 on an Apple Silicon M3 Max is limited by memory bandwidth to roughly 10 tokens per second or less. For most interactive use cases, this is fast enough.
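A useful sanity check on any speed claim: single-stream generation with a dense model is usually limited by memory bandwidth, because producing each token means reading roughly all of the weights once. The sketch below turns that into an upper bound. The bandwidth figures are assumed round numbers rather than measurements, and real throughput lands below the ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed when each token requires reading all weights once."""
    return bandwidth_gb_s / model_gb


# Assumed figures: about 1000 GB/s for a high-end desktop GPU,
# about 400 GB/s for a top-spec laptop chip with unified memory.
print(f"{max_tokens_per_second(1000, 4.2):.0f} tok/s ceiling for a 4.2 GB model")
print(f"{max_tokens_per_second(400, 42):.1f} tok/s ceiling for a 42 GB model")
```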
Do local models have internet access?
No. Local models are purely inference engines — they generate text based on their training data and your prompt. They don’t browse the web by default. You can add tool use and web access by building a RAG pipeline or using an agent framework that feeds retrieved content into the model’s context.
Can I fine-tune models locally?
Fine-tuning is different from inference and requires more VRAM and specialized tools like Unsloth, LLaMA-Factory, or Axolotl. Ollama and LM Studio are inference tools — they run models but don’t train or fine-tune them. That said, you can use models locally after fine-tuning them with other tools by converting them to GGUF format.
Key Takeaways
- Ollama is best for developers who want CLI control and API integration. LM Studio is better for those who prefer a GUI and built-in chat interface.
- Quantization makes large models practical on consumer hardware. Q4_K_M is the right starting point for most use cases.
- Hardware matters, but is flexible. Even an 8GB GPU can run capable 7B–8B models at usable speeds. Apple Silicon’s unified memory architecture is particularly well-suited.
- Start with smaller models. A well-prompted 8B model often outperforms a poorly-prompted 70B one, and it runs several times faster on the same hardware.
- Local inference pairs well with workflow automation. Tools like MindStudio let you connect local models to real business tools without building the integration layer yourself.
If you want to go beyond running models in a terminal and actually build something useful on top of them — automated reports, document processing agents, internal chatbots — exploring MindStudio’s workflow builder is a practical next step.