How to Run Open-Weight AI Models Locally with Ollama and LM Studio

Why Running LLMs Locally Is Worth Your Time

Running open-weight AI models locally has gone from a niche hobby to a practical option for developers, researchers, and privacy-conscious users. With tools like Ollama and LM Studio, you can run models like Qwen 3, Gemma 3, and DeepSeek-R1 on consumer hardware — no API key, no monthly bill, no data leaving your machine.

This guide covers everything you need to get started: which tools to use, how to pick and configure models, what quantization means for performance, and realistic expectations for what your hardware can handle.

What “Open-Weight” Actually Means

Before getting into setup, it’s worth being precise about terminology. “Open-weight” means the model weights are publicly available — you can download and run them yourself. It does not necessarily mean fully open-source (some models restrict commercial use or fine-tuning).

Popular open-weight models right now include:

Meta LLaMA 3.1 and 3.3 — Strong general-purpose models, widely supported
Qwen 3 (Alibaba) — Excellent multilingual performance, comes in sizes from 0.6B to 235B
Gemma 3 (Google) — Efficient, well-documented, strong at reasoning tasks
DeepSeek-R1 — Reasoning-focused model with strong benchmark scores
Mistral and Mixtral — Fast inference, solid for instruction following
Phi-4 (Microsoft) — Surprisingly capable at small sizes (14B and under)

These models vary widely in size, capability, and licensing. Choosing the right one depends on your hardware and use case.

Understanding Quantization (The Short Version)

Full-precision LLMs require enormous amounts of VRAM. A 70B parameter model at 16-bit precision stores 2 bytes per weight, so the weights alone take around 140GB, well beyond what most consumer GPUs can handle.

Quantization compresses the model by reducing the precision of its weights. Instead of storing each weight as a 16-bit float, quantized models use 4-bit or 8-bit integers. The tradeoff is a small reduction in quality for a dramatic reduction in memory use.
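To make the idea concrete, here is a minimal sketch of symmetric 4-bit quantization on a handful of weights. It is an illustration only: the function names are hypothetical, and the GGUF formats discussed below use more elaborate block-wise schemes than this single-scale version.

```python
def quantize_4bit(weights):
    """Map floats to signed integers in [-7, 7] using one shared scale (toy scheme)."""
    # Fall back to 1.0 when every weight is zero so we never divide by zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.05, 0.31, -0.88]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# The worst-case rounding error is half a quantization step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, round(max_error, 3))
```

Each weight now needs only a small integer plus a share of the scale, which is where the memory saving comes from; the rounding error is the quality cost.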

Common Quantization Formats

Format	Bits per weight	Memory reduction	Quality impact
F16	16	None (baseline)	None
Q8_0	8	~50%	Minimal
Q5_K_M	5	~69%	Very low
Q4_K_M	4	~75%	Low
Q3_K_M	3	~81%	Moderate
Q2_K	2	~87%	Significant

For most use cases, Q4_K_M is the sweet spot — it gives you good quality with roughly 4–5GB of memory needed for a 7B model. Ollama and LM Studio both handle quantized models in the GGUF format, which is the standard for local inference.
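The percentages in the table follow directly from the nominal bit widths. A short sketch of that arithmetic, using hypothetical helper names and ignoring the per-block metadata that real GGUF files add:

```python
# Nominal bits per weight, taken from the table above.
FORMAT_BITS = {"F16": 16, "Q8_0": 8, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}


def weight_size_gb(params_billion, bits):
    """Size of the weights alone: parameters x bits, converted to gigabytes."""
    return params_billion * bits / 8


def reduction_vs_f16(bits):
    """Fraction of memory saved relative to 16-bit weights."""
    return 1 - bits / 16


for name, bits in FORMAT_BITS.items():
    print(f"{name:7s} 7B = {weight_size_gb(7, bits):5.2f} GB  saved = {reduction_vs_f16(bits):.0%}")
```

The nominal figure for a 7B model at 4 bits is 3.5GB; real files and runtime overhead push that toward the 4–5GB quoted above.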

How Much VRAM Do You Actually Need?

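A rough rule of thumb: multiply the model’s parameter count (in billions) by 0.6 for a Q4 quantized model to get an approximate VRAM requirement in GB. The factor sits above the nominal 0.5 bytes per weight because Q4_K_M averages closer to 5 bits per weight than 4. Long context windows need additional memory on top of this estimate.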

7B model at Q4 ≈ 4–5GB VRAM
13B model at Q4 ≈ 8–9GB VRAM
34B model at Q4 ≈ 20–22GB VRAM
70B model at Q4 ≈ 40–45GB VRAM (or needs RAM offloading)

If your GPU doesn’t have enough VRAM, both Ollama and LM Studio can offload layers to system RAM — but this comes at a significant speed penalty.
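The rule of thumb and the offloading caveat can be combined into a quick check. This is a rough sketch with hypothetical function names; it ignores context-window memory, so treat the result as a lower bound.

```python
def estimate_q4_vram_gb(params_billion, factor=0.6):
    """Approximate VRAM for a Q4 model using the 0.6 GB-per-billion rule of thumb."""
    return params_billion * factor


def placement(params_billion, gpu_vram_gb):
    """Say whether a Q4 model fits in VRAM or how much must spill to system RAM."""
    needed = estimate_q4_vram_gb(params_billion)
    if needed <= gpu_vram_gb:
        return "fits on GPU"
    return f"needs about {needed - gpu_vram_gb:.0f}GB offloaded to system RAM"


print(placement(8, 8))    # an 8B model on an 8GB card
print(placement(70, 24))  # a 70B model on a 24GB card
```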

Setting Up Ollama

Ollama is a command-line tool that makes running local LLMs as straightforward as pulling a Docker image. It handles model downloads, quantization selection, and serving a local API automatically.

Installation

Ollama supports macOS, Linux, and Windows. Installation is a single download:

macOS: Download the .dmg from ollama.com and run it. Ollama runs as a menu bar app.
Linux: Run curl -fsSL https://ollama.com/install.sh | sh in your terminal.
Windows: Download the Windows installer from the same site.

After installation, Ollama runs a local server on port 11434.

Pulling and Running Your First Model

Open a terminal and pull a model:

ollama pull qwen3:8b

Ollama will download the default quantized version. To run it interactively:

ollama run qwen3:8b

You’ll get a prompt where you can type messages directly. To exit, type /bye.

Choosing Specific Quantizations

By default, Ollama picks a sensible quantization for the model. But you can specify:

ollama pull llama3.3:70b-instruct-q4_K_M

Use ollama list to see what’s downloaded, and ollama show <model> for details about a specific model.

Using Ollama’s API

One of Ollama’s best features is its OpenAI-compatible REST API. Any app built for the OpenAI API can point to Ollama instead:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Explain quantization in two sentences."}]
  }'

This means you can swap local models into existing tools, scripts, or pipelines with minimal changes.
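The same request can be made from a script. Below is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes Ollama is running on the default port with qwen3:8b already pulled.

```python
import json
import urllib.request

# Assumption: default Ollama address and its OpenAI-compatible path prefix.
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt, model="qwen3:8b", base_url=OLLAMA_BASE_URL):
    """Build (but do not send) a chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Explain quantization in two sentences.") returns the reply as a string. Passing a different base_url points the same code at any other OpenAI-compatible server.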

Useful Ollama Commands

ollama list          # Show downloaded models
ollama ps            # Show running models
ollama rm <model>    # Delete a model
ollama serve         # Start the server manually (if not running)

Setting Up LM Studio

LM Studio is a desktop application with a graphical interface. It’s a better choice if you prefer not to use the command line, or if you want a built-in chat UI, model browser, and performance monitoring in one place.

Installation

Download LM Studio from lmstudio.ai. It’s available for macOS (Apple Silicon only; Intel Macs are not supported), Windows, and Linux. The app is self-contained, so no additional installs are required.

Browsing and Downloading Models

LM Studio’s home screen includes a model discovery interface connected to Hugging Face. You can search by name, filter by size, and see community recommendations.

To download a model:

Open the Discover tab.
Search for a model (e.g., “Gemma 3” or “DeepSeek-R1”).
Select the quantization you want. LM Studio labels each option with estimated VRAM usage.
Click Download.

Models are stored locally in ~/.lmstudio/models (macOS/Linux) or the equivalent Windows path.

Running Models in LM Studio

Once downloaded:

Go to the Chat tab.
Select your model from the dropdown.
Adjust context length, temperature, and system prompt in the settings panel.
Start chatting.

The GPU Offload slider lets you control how many layers run on GPU vs. CPU. More GPU layers = faster inference but more VRAM used. LM Studio shows you estimated VRAM usage in real time as you adjust.

LM Studio’s Local Server

Like Ollama, LM Studio can run a local OpenAI-compatible API server. Go to the Local Server tab (labeled Developer in recent versions), load a model, and click Start Server. It runs on port 1234 by default.

This is useful for connecting LM Studio to local development environments, coding assistants like Continue.dev, or any tool that supports a custom OpenAI endpoint.
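Before wiring a tool to the server, it helps to confirm which model identifiers it exposes. A small sketch, assuming the server is on the default port 1234 and returns the OpenAI-style model list shape; the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: LM Studio's server is running on its default port.
LM_STUDIO_BASE_URL = "http://localhost:1234/v1"


def parse_model_ids(body):
    """Extract model identifiers from an OpenAI-style model list response."""
    return [entry["id"] for entry in body.get("data", [])]


def list_models(base_url=LM_STUDIO_BASE_URL):
    """Query the models endpoint and return the identifiers it reports."""
    with urllib.request.urlopen(f"{base_url}/models") as response:
        return parse_model_ids(json.load(response))
```

The identifiers returned by list_models() are the values to put in the "model" field of chat requests sent to this server.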

Model Recommendations by Hardware Tier

Not every machine can run every model. Here’s a practical breakdown of what works well at each hardware level.

8GB VRAM (e.g., RTX 3070, RTX 4060, M2 MacBook Air)

Qwen3 4B or 8B (Q4) — Fast inference, good at reasoning and coding
Gemma 3 4B or 12B (Q4) — Strong for its size, excellent instruction following
Phi-4 14B (Q4) — Pushes the limit but works with some layer offloading
Mistral 7B — Reliable, fast, good for general tasks

16–24GB VRAM (e.g., RTX 4090, RTX 3090, M3 Pro/Max)

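LLaMA 3.3 70B (Q3 or Q4) — Near-frontier quality, but at roughly 40GB for Q4 it exceeds this tier’s VRAM, so many layers are offloaded to system RAM and tokens per second drop sharply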
Qwen3 14B or 32B (Q4) — Excellent multilingual and coding performance
DeepSeek-R1 14B or 32B (Q4) — Strong reasoning chains, good for step-by-step tasks
Gemma 3 27B (Q4) — Google’s best open model at a manageable size

Apple Silicon (Unified Memory)

Apple Silicon is uniquely well-suited for local LLMs because RAM and VRAM share the same pool. An M3 Max with 64GB unified memory can run 70B models at reasonable speeds.

M1/M2 (8–16GB): 7B–13B models comfortably
M2/M3 Pro (18–36GB): Up to 34B models
M3 Max/Ultra (64–192GB): 70B models and beyond

Ollama has native Metal support. LM Studio also uses Metal acceleration on Apple Silicon. Both will automatically use the GPU.

CPU-Only (No Dedicated GPU)

It’s possible, but slow. Expect 1–5 tokens per second on a modern CPU for a 7B model. Models like Phi-4 mini or Qwen3 0.6B are designed for efficiency and handle CPU inference better than larger models.

Performance Tips and Common Troubleshooting

Getting Better Inference Speed

Use Q4_K_M instead of Q8 when VRAM is tight. The quality difference is small, and the speed gain is real.
Reduce context length. A 128K context window uses more memory than a 4K one. Set it to what you actually need.
Close background applications that use GPU resources (games, video editing software, etc.).
On Ollama, raise the num_gpu parameter (the number of layers sent to the GPU) to force full GPU offloading, for example with /set parameter num_gpu 99 inside an ollama run session.
On LM Studio, use the GPU offload slider to maximize layers on GPU.
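Context length and GPU offloading can also be set per request through the options object of Ollama's native API. The sketch below only builds the request; the helper name and default values are illustrative assumptions, while num_ctx and num_gpu are Ollama option names.

```python
import json
import urllib.request


def build_native_chat_request(prompt, model="qwen3:8b", num_ctx=4096, num_gpu=99,
                              host="http://localhost:11434"):
    """Build a request for Ollama's native chat endpoint with tuning options."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON response instead of a token stream
        "options": {
            "num_ctx": num_ctx,  # context window in tokens; smaller uses less memory
            "num_gpu": num_gpu,  # layers to place on the GPU; a large value requests all
        },
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with urllib.request.urlopen returns a JSON body whose message.content field holds the reply.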

Common Issues and Fixes

Model loads but inference is very slow: This usually means layers are being offloaded to RAM. Reduce the context size, pick a more aggressive quantization, or try a smaller model variant.

“Out of memory” error: Reduce the number of GPU layers in LM Studio, or pick a more aggressive quantization (e.g., Q3 instead of Q4).

Model gives garbled or repetitive output: This can happen with incorrect chat templates. In LM Studio, make sure the selected model is loaded with its correct template. In Ollama, stick to models from the official library, which include correct templates by default.

Ollama not using GPU on Windows: Ensure you have the latest NVIDIA drivers installed. Run ollama ps to confirm GPU usage. If it shows CPU, try reinstalling Ollama after a driver update.

LM Studio model download fails: Check available disk space. Large models (30–70B at Q4) can require 20–40GB. Also check that your Hugging Face connection isn’t rate-limited.

Where MindStudio Fits Into This

Running models locally is great for privacy, experimentation, and cost control. But building actual workflows on top of local models — automated pipelines, multi-step agents, connected tools — takes significantly more work if you’re coding it from scratch.

MindStudio’s AI Media Workbench already supports local model backends including Ollama and LM Studio. You can connect your local Ollama instance to MindStudio’s visual workflow builder and use it as the inference backend for agents you build — while still connecting to external tools like Google Workspace, Slack, Notion, or HubSpot without writing integration code.

This means you get the privacy and cost benefits of local inference combined with the orchestration layer MindStudio provides. You’re not choosing between local models and capable workflows — you can have both.

For teams that want to build AI agents without code, MindStudio also gives you access to 200+ hosted models alongside your local ones, so you can route specific tasks to the model best suited for them — a local Qwen3 for private document analysis, a hosted Claude for customer-facing responses, all in one workflow.

You can try MindStudio free at mindstudio.ai.

Frequently Asked Questions

Is Ollama or LM Studio better for beginners?

LM Studio is generally easier to start with because it has a graphical interface, a built-in model browser, and visual settings for GPU offloading. Ollama is faster to set up if you’re comfortable with a terminal and better suited for integration with other tools and scripts via its API.

Can I run DeepSeek-R1 locally?

Yes. DeepSeek-R1 is available in multiple sizes (1.5B, 7B, 8B, 14B, 32B, 70B) and is well-supported in both Ollama and LM Studio. The 7B and 14B versions run comfortably on mid-range GPUs. The full 671B version is not practical on consumer hardware. The distilled variants (based on Qwen and LLaMA architectures) offer good reasoning performance at accessible sizes.

What is GGUF format?

GGUF (GPT-Generated Unified Format) is the standard file format for locally running quantized LLMs. It replaced the older GGML format and is supported by llama.cpp, which powers both Ollama and LM Studio under the hood. GGUF files contain model weights, tokenizer data, and metadata in a single portable file.
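A GGUF file is easy to identify from its first bytes. The sketch below reads the fixed header, assuming the layout used by GGUF version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian); the function names are hypothetical.

```python
import struct

# Assumed header layout (GGUF v2+): magic, version, tensor count, metadata count.
GGUF_HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(data):
    """Decode the fixed-size header from the first bytes of a GGUF file."""
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(data)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


def read_gguf_header(path):
    """Read only the header bytes from disk, without loading the weights."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(GGUF_HEADER.size))
```

Calling read_gguf_header on a downloaded model file is a quick way to check that a download completed as a valid GGUF rather than an error page or a partial file.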

How fast are local models compared to cloud APIs?

It depends heavily on your hardware. A well-configured 7B model on an RTX 4090 typically generates 80–120 tokens per second, comparable to many cloud APIs. Larger models on consumer hardware run slower, often 10–30 tokens per second. A 70B model at Q4 on an Apple Silicon M3 Max sits at the bottom of that range, around 10 tokens per second or a little less, because every generated token has to read roughly 40GB of weights from memory. For most interactive use cases, this is fast enough.
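Generation speed is largely bounded by memory bandwidth, since each new token requires reading all of the weights once. A rough ceiling can be sketched as bandwidth divided by model size. The bandwidth figures below are approximate published specifications, the model sizes come from the 0.6 rule of thumb, and the function name is hypothetical; real throughput lands well below this bound.

```python
def max_tokens_per_second(memory_bandwidth_gb_s, model_size_gb):
    """Upper bound on decode speed if every token reads the full weights once."""
    return memory_bandwidth_gb_s / model_size_gb


# Approximate bandwidth: RTX 4090 about 1008 GB/s, top M3 Max about 400 GB/s.
print(round(max_tokens_per_second(1008, 4.2)))     # 7B at Q4 on an RTX 4090
print(round(max_tokens_per_second(400, 42.0), 1))  # 70B at Q4 on an M3 Max
```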

Do local models have internet access?

No. Local models are purely inference engines — they generate text based on their training data and your prompt. They don’t browse the web by default. You can add tool use and web access by building a RAG pipeline or using an agent framework that feeds retrieved content into the model’s context.
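Feeding retrieved content into the context can be as simple as placing it in the system message. A minimal sketch, assuming the retrieval step has already produced a list of text snippets; the function name and prompt wording are illustrative.

```python
def build_grounded_messages(question, snippets):
    """Pack retrieved snippets into a chat message list for a local model."""
    # Number each snippet so the model can refer back to its sources.
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(snippets, start=1))
    return [
        {"role": "system", "content": "Answer using only the numbered context below.\n\n" + context},
        {"role": "user", "content": question},
    ]


messages = build_grounded_messages(
    "What port does Ollama use?",
    ["Ollama runs a local server on port 11434.", "LM Studio runs on port 1234 by default."],
)
print(messages[0]["content"])
```

The resulting list can be sent as the "messages" field of a chat completion request to either tool's local server.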

Can I fine-tune models locally?

Fine-tuning is different from inference and requires more VRAM and specialized tools like Unsloth, LLaMA-Factory, or Axolotl. Ollama and LM Studio are inference tools — they run models but don’t train or fine-tune them. That said, you can use models locally after fine-tuning them with other tools by converting them to GGUF format.

Key Takeaways

Ollama is best for developers who want CLI control and API integration. LM Studio is better for those who prefer a GUI and built-in chat interface.
Quantization makes large models practical on consumer hardware. Q4_K_M is the right starting point for most use cases.
Hardware matters, but is flexible. Even an 8GB GPU can run capable 7B–8B models at usable speeds. Apple Silicon’s unified memory architecture is particularly well-suited.
Start with smaller models. A well-prompted 8B model often outperforms a poorly-prompted 70B one, and it runs several times faster on the same hardware.
Local inference pairs well with workflow automation. Tools like MindStudio let you connect local models to real business tools without building the integration layer yourself.

If you want to go beyond running models in a terminal and actually build something useful on top of them — automated reports, document processing agents, internal chatbots — exploring MindStudio’s workflow builder is a practical next step.
