What Is Selective Quantization? How Dwarf Star Runs 284B Models on 128GB RAM
Dwarf Star crushes only routed expert weights to 2-bit while keeping load-bearing layers at 4-bit, preserving quality while slashing memory requirements.
Running a 284-Billion-Parameter Model on a Single Mac
Not long ago, running a model with hundreds of billions of parameters required a data center. You needed racks of A100s, specialized cooling, and a budget that ruled out almost everyone except hyperscalers.
That’s changing fast. Selective quantization, the technique at the heart of Dwarf Star, is one of the main reasons why. By treating different parts of a model differently and compressing only what can be safely compressed, Dwarf Star makes it possible to run a 284B-parameter model on a single machine with 128GB of unified memory. That’s a Mac Studio with an M4 Max chip, sitting on your desk.
This article explains how selective quantization works, why Dwarf Star’s approach is smarter than blanket compression, and what it means for anyone who wants to run powerful open-source models locally.
The Memory Problem With Large Language Models
Every parameter in a neural network takes up memory. In full 32-bit floating-point precision, a single parameter uses 4 bytes. Scale that up to 70 billion parameters and you need 280GB just to load the weights — before you account for the KV cache, activations, or anything else needed to actually run inference.
Even at 16-bit (half-precision), a 70B model needs roughly 140GB. A 284B model at 16-bit needs around 570GB. That puts it firmly in multi-node GPU cluster territory.
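To make that arithmetic concrete, here is a minimal Python sketch. The helper name `weight_memory_gb` is made up for this article, it counts weights only (no KV cache or activations), and it uses decimal gigabytes.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Print the weight footprint of two model sizes at several precisions.
for name, params in [("70B", 70e9), ("284B", 284e9)]:
    for bits in (32, 16, 8, 4, 2):
        print(f"{name} @ {bits:>2}-bit: {weight_memory_gb(params, bits):7.1f} GB")
```

The 284B row at 16-bit works out to 568GB, which is where the "around 570GB" figure comes from.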
The obvious solution is compression. Make each number take fewer bits to store. That’s quantization. But naive compression breaks things fast.
What Quantization Actually Does
Quantization reduces the numerical precision used to represent model weights. Instead of storing each weight as a 16-bit or 32-bit float, you store it as an 8-bit, 4-bit, or even 2-bit integer.
The math is straightforward: halve the bits, roughly halve the memory. Relative to an FP16 baseline:
- FP16 (16-bit): Full quality, high memory
- INT8 (8-bit): ~50% memory reduction, minimal quality loss
- INT4 (4-bit): ~75% memory reduction, acceptable quality loss on most tasks
- INT2 (2-bit): ~87.5% memory reduction, significant quality loss — unless you’re careful about where you apply it
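The cost of those savings is rounding error. The sketch below is a toy illustration, not how any production quantizer works: it rounds synthetic Gaussian "weights" to evenly spaced levels across the whole tensor and measures the RMS error at each bit width. The function names and the weight distribution are assumptions chosen for the example.

```python
import math
import random


def quantize_uniform(weights, bits):
    """Round each weight to one of 2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(weights), max(weights)
    steps = 2 ** bits - 1
    scale = (hi - lo) / steps if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]


def rms_error(original, approx):
    """Root-mean-square difference between two equally long lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, approx)) / len(original))


random.seed(0)
# Synthetic stand-in for one weight tensor (assumption: small zero-mean Gaussian values).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {bits: rms_error(weights, quantize_uniform(weights, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.6f}")
```

Each step down in bit width leaves fewer levels to round to, so the error grows as the memory shrinks.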
The problem is that a language model isn’t a uniform blob of equally important numbers. Some weights matter enormously for output quality. Others barely matter at all. Treating them the same is where naive quantization goes wrong.
What Selective Quantization Is
Selective quantization applies different precision levels to different parts of a model, based on how sensitive those parts are to numerical error.
The core insight: not every layer contributes equally to the quality of the model’s output. Some layers — attention mechanisms, embeddings, shared computation blocks — are load-bearing. Degrade them and the model falls apart. Others — particularly in Mixture of Experts (MoE) architectures — can absorb aggressive compression without noticeably hurting output quality.
Selective quantization lets you be aggressive where it’s safe and conservative where it’s not. The result is a model that fits in dramatically less memory while still behaving like a high-quality model.
This is the approach Dwarf Star takes. And to understand why it works so well, you need to understand how MoE models are built.
How Mixture of Experts Models Work
Modern frontier-scale open-source models like DeepSeek-V3 aren’t dense transformers where every parameter activates for every token. They’re Mixture of Experts (MoE) models.
Here’s how MoE works in plain terms:
- The model has a large pool of “expert” sub-networks, each specializing in different patterns
- A small routing network looks at each incoming token and decides which experts to activate
- Only a fraction of the total experts fire for any given token — often just 2 out of 64 or more
This is why MoE models can have enormous total parameter counts while still being efficient at inference. DeepSeek-V3, for example, has 671B total parameters (listed as 685B on Hugging Face once the multi-token-prediction module is included) but activates only about 37B of them per token. The rest sit idle, waiting for the router to call them.
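The routing step can be sketched in a few lines of Python. This is a simplified illustration with random router scores; the function name `top_k_route` and the 64-expert, top-2 sizes are assumptions borrowed from the "2 out of 64" example, and real routers differ in how they score and normalize experts.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over only the selected experts so their mixing weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
NUM_EXPERTS, TOP_K = 64, 2  # illustrative sizes, not taken from any specific model
usage = [0] * NUM_EXPERTS
for _ in range(1000):  # simulate 1000 tokens with random router scores
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert_index, _weight in top_k_route(logits, TOP_K):
        usage[expert_index] += 1

print(f"Experts active per token: {TOP_K} of {NUM_EXPERTS} ({TOP_K / NUM_EXPERTS:.1%})")
print(f"Busiest expert handled {max(usage)} of 1000 tokens")
```

Even the busiest expert sees only a small share of the tokens, which is the property the rest of this article leans on.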
That structure creates an obvious opportunity for selective quantization.
The Two Categories of Weights in an MoE Model
Load-bearing layers include:
- Attention weights (Q, K, V, output projections)
- Shared expert weights (always active, not routed)
- Layer normalization parameters
- Embedding tables
These fire on every single forward pass. Any degradation here directly degrades every output the model produces.
Routed expert weights include:
- The FFN (feed-forward network) weights inside each expert, which are only activated when the router selects that expert for a given token
Because routed experts are called selectively and infrequently, they can tolerate heavier compression. If an individual expert’s weights are slightly imprecise, that imprecision affects only a fraction of tokens, and the overall output stays coherent.
How Dwarf Star Uses This Structure
Dwarf Star is a selective quantization scheme built specifically for large MoE models. The approach:
- Routed expert FFN weights → 2-bit (Q2_K or equivalent)
- Attention, shared experts, embeddings, and other critical layers → 4-bit (Q4_K_M or equivalent)
This isn’t uniform 2-bit compression. That would destroy the model. It’s surgical compression applied only to the parts of the model that can absorb it.
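As a rough sketch of what such a policy looks like in code, the snippet below assigns a bit width to each tensor by name. The tensor names follow the style llama.cpp uses for MoE checkpoints but are illustrative assumptions here, and `pick_bits` is a hypothetical helper, not part of any tool.

```python
from collections import Counter

# Hypothetical name fragments that mark routed expert FFN tensors.
ROUTED_EXPERT_MARKERS = ("ffn_gate_exps", "ffn_up_exps", "ffn_down_exps")


def pick_bits(tensor_name: str) -> int:
    """Return the target bit width for one tensor under a 2/4-bit selective policy."""
    if any(marker in tensor_name for marker in ROUTED_EXPERT_MARKERS):
        return 2  # routed expert FFN weights: compress hard
    return 4      # attention, shared experts, embeddings, everything else


example_tensors = [
    "token_embd.weight",
    "blk.3.attn_q.weight",
    "blk.3.attn_output.weight",
    "blk.3.ffn_gate_shexp.weight",  # shared expert, always active
    "blk.3.ffn_gate_exps.weight",   # routed experts
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_down_exps.weight",
]

plan = {name: pick_bits(name) for name in example_tensors}
for name, bits in plan.items():
    print(f"{name:30s} -> {bits}-bit")
print(Counter(plan.values()))
```

The design choice worth noticing is the default: anything not explicitly recognized as a routed expert falls back to the safer 4-bit setting.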
What the Memory Math Looks Like
Take a 284B-parameter MoE model where the majority of parameters live in routed expert FFN blocks (this is typical — MoE models concentrate most of their parameter count in expert layers).
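If roughly 70–75% of total parameters are routed expert weights compressed to 2-bit, and the remaining 25–30% are kept at 4-bit, the nominal blended average is 2.5–2.6 bits per parameter. Block-wise formats also store per-block scale metadata, which adds a fraction of a bit on top, so the effective figure lands around 2.5–3 bits per parameter.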
At 2.75 bits average:
- 284B parameters × 2.75 bits ÷ 8 bits/byte = ~97.6GB
Add KV cache, runtime overhead, and system memory usage, and you land right at the edge of what 128GB of unified memory can handle.
That’s not a coincidence. Dwarf Star is tuned specifically to fit target hardware.
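The same arithmetic as a short Python sketch, for anyone who wants to plug in their own numbers. The function names are invented for this example, and the 70–75% expert share is the rough estimate used above, not a measured value.

```python
def blended_bits(expert_fraction, expert_bits=2.0, other_bits=4.0):
    """Nominal average bits per parameter for a two-tier quantization scheme."""
    return expert_fraction * expert_bits + (1.0 - expert_fraction) * other_bits


def footprint_gb(num_params, avg_bits):
    """Weight footprint in decimal gigabytes."""
    return num_params * avg_bits / 8 / 1e9


TOTAL_PARAMS = 284e9
MEMORY_BUDGET_GB = 128

# Nominal averages, before any per-block scale overhead is counted.
for fraction in (0.70, 0.75):
    bits = blended_bits(fraction)
    print(f"{fraction:.0%} routed experts -> {bits:.2f} bits/param, "
          f"{footprint_gb(TOTAL_PARAMS, bits):.1f} GB")

# The article's working figure of 2.75 bits per parameter.
weights_gb = footprint_gb(TOTAL_PARAMS, 2.75)
print(f"At 2.75 bits/param: {weights_gb:.1f} GB of weights, "
      f"{MEMORY_BUDGET_GB - weights_gb:.1f} GB left for everything else")
```

At 2.75 bits per parameter, that leaves roughly 30GB of the 128GB budget for the KV cache, the runtime, and the operating system.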
Why 2-Bit for Experts Doesn’t Wreck Quality
The concern with 2-bit quantization is obvious: you’re representing continuous-valued weights as one of only four possible values. How can that not destroy the model?
A few reasons it works for routed experts specifically:
- Redundancy across experts. In a large MoE model with dozens or hundreds of expert blocks, no single expert is solely responsible for any capability. Quantization noise in one expert is distributed and smoothed by others that handle similar token patterns.
- Calibrated quantization. Modern quantization tools (like those used in GGUF format via llama.cpp) don’t just round weights blindly. They group weights into blocks, compute optimal quantization scales per block, and minimize total error. Q2_K specifically uses k-quant grouping to reduce the precision loss substantially compared to naive 2-bit rounding.
- Careful layer selection. Dwarf Star doesn’t apply 2-bit compression to anything that would propagate errors globally. The 4-bit floor on attention and shared components ensures the model’s core reasoning machinery stays intact.
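The block-wise idea in the second point is easy to demonstrate. The sketch below compares 2-bit rounding with a single scale for the whole tensor against 2-bit rounding with a separate scale for each block of 32 weights. It is a simplified illustration and not the actual Q2_K layout; the block size, function names, and synthetic weights are all assumptions made for the example.

```python
import math
import random


def quantize_2bit(values):
    """Map values onto 4 evenly spaced levels between their own min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_blockwise(weights, block_size):
    """Apply 2-bit quantization independently to each block of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(quantize_2bit(weights[start:start + block_size]))
    return out


def rms_error(original, approx):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, approx)) / len(original))


random.seed(0)
# Synthetic tensor whose magnitude varies from block to block (assumption).
weights = []
for _ in range(128):
    sigma = random.choice([0.005, 0.02, 0.08])
    weights.extend(random.gauss(0.0, sigma) for _ in range(32))

err_global = rms_error(weights, quantize_2bit(weights))
err_block = rms_error(weights, quantize_blockwise(weights, 32))
print(f"One scale for the whole tensor: RMS error {err_global:.5f}")
print(f"One scale per 32-weight block:  RMS error {err_block:.5f}")
```

With a single scale, a handful of large weights stretches the four levels so far apart that small weights are rounded coarsely. Per-block scales let each block use its four levels where its own values actually are, at the cost of storing a little extra metadata per block.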
Benchmarks and Real-World Performance
The practical question is: how much quality does Dwarf Star actually sacrifice compared to running a higher-precision version of the same model?
Based on benchmarks run by the community testing Dwarf Star-style quantizations of DeepSeek-class models:
- On standard reasoning and coding benchmarks (MMLU, HumanEval, GSM8K), the selective 2/4-bit approach preserves approximately 95–98% of the quality of a full Q4 quantization
- Compared to the FP16 baseline, perplexity degradation is measurable but small, typically an increase of no more than 3–5%
- For most practical tasks — coding assistance, writing, analysis, Q&A — the output quality is indistinguishable from a higher-precision model in casual use
The tradeoff isn’t zero. If you’re running highly sensitive scientific reasoning or math competition problems, you might prefer a Q4 or Q5 quantization if memory allows. But for the vast majority of real-world tasks, Dwarf Star’s selective approach delivers quality that would have required a $100K+ server setup just a couple of years ago.
Inference Speed on Apple Silicon
Running a 284B model locally on 128GB RAM comes with another tradeoff: speed. The unified memory architecture of Apple Silicon is remarkably fast for this kind of workload (memory bandwidth reaches roughly 546GB/s on the M4 Max configuration that supports 128GB), but you’re still moving enormous amounts of data per forward pass.
Typical generation speed for a Dwarf Star-quantized 284B model on that hardware lands in the range of 4–8 tokens per second. That’s slower than running a 7B or 13B model, obviously, but it’s fast enough for most interactive use cases. For batch or background processing, per-token latency matters less than total throughput, so jobs can simply run unattended.
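A common back-of-the-envelope way to reason about this is to treat generation as memory-bound: each new token requires reading every active weight once, so bandwidth divided by bytes read per token gives a ceiling on speed. The sketch below applies that rule. Every input is a placeholder: the bandwidth is illustrative, the active-parameter count is borrowed from the DeepSeek-V3 example earlier because this article does not give one for the 284B model, and the average bit width of the active weights is assumed.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params, avg_bits):
    """Upper bound on decode speed if reading active weights is the only cost."""
    bytes_per_token = active_params * avg_bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Placeholder inputs; substitute figures for your own hardware and model.
BANDWIDTH_GB_S = 400    # illustrative memory bandwidth
ACTIVE_PARAMS = 37e9    # borrowed from the DeepSeek-V3 example, not the 284B model
AVG_ACTIVE_BITS = 2.75  # assumed average precision of the weights read per token

ceiling = max_tokens_per_second(BANDWIDTH_GB_S, ACTIVE_PARAMS, AVG_ACTIVE_BITS)
print(f"Bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

This is a ceiling rather than a prediction. It ignores compute, KV cache reads, and runtime overhead, so measured speeds such as the 4–8 tokens per second quoted above sit well below it.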
Selective Quantization Beyond Dwarf Star
Dwarf Star is one specific application of selective quantization, but the broader technique is spreading across the model compression ecosystem.
Unsloth has developed similar mixed-precision schemes for quantizing DeepSeek-R1 and other large MoE models, achieving comparable memory reductions with its own calibration approach.
GGUF mixed quantization in llama.cpp allows users to specify different quantization levels for different tensor types — for example, keeping attention weights at Q6_K while compressing FFN weights to Q2_K.
The bitsandbytes library supports layer-by-layer quantization configuration in Python, allowing researchers to experiment with mixed-precision setups programmatically.
GPTQ with perplexity-guided bit allocation takes a data-driven approach, using calibration data to identify which weights are most sensitive and automatically assigning higher bit counts to those layers.
The common thread: the industry is moving away from uniform compression toward intelligent, structure-aware quantization. Blanket 4-bit isn’t the ceiling — it’s just the floor for the parts that matter.
What This Means for Local AI Development
The ability to run 284B-parameter models on consumer-grade hardware changes what’s possible for individual developers and small teams.
Before selective quantization techniques like Dwarf Star matured, your realistic local options topped out around 70B parameters — and even that required a machine with 64GB+ RAM. The jump to 284B isn’t just quantitative; models at this scale exhibit qualitatively different behavior: stronger reasoning chains, better instruction following, more coherent long-form output.
Concretely, running a Dwarf Star-quantized 284B model locally means:
- No API costs for high-volume workflows
- Full data privacy — nothing leaves your machine
- No rate limits
- Offline capability
- The ability to run a model that competes with proprietary frontier models on most benchmarks
The hardware requirements are still significant — a Mac Studio with 128GB costs around $3,000–4,000. But that’s a one-time capital cost, not a recurring inference bill.
How MindStudio Fits Into the Model Access Picture
Most teams don’t need to manage quantization schemes themselves. The underlying compression tech matters a lot, but it’s infrastructure — not something most builders want to think about day-to-day.
MindStudio takes the model access layer off your plate entirely. The platform gives you access to 200+ AI models — including large open-source models, Claude, GPT-4o, Gemini, and more — without managing API keys, infrastructure, or quantization tradeoffs yourself. You pick the model that fits your task and budget; the platform handles everything below that.
This is especially useful when you’re building AI agents or automated workflows and want to mix models — using a fast, cheap model for simple routing decisions and a capable large model for complex reasoning steps. MindStudio’s visual no-code builder makes it straightforward to wire those together, and the average agent takes 15 minutes to an hour to build.
If you’re a developer who does want direct control over local models, MindStudio also supports connections to local model servers (via Ollama and LMStudio), so you can run your own Dwarf Star-quantized 284B model locally and call it from a MindStudio workflow.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is selective quantization?
Selective quantization is a model compression technique that applies different levels of numerical precision to different parts of a neural network, based on how sensitive each part is to precision loss. Rather than compressing all weights equally, selective quantization identifies load-bearing layers (like attention mechanisms) and keeps them at higher precision, while aggressively compressing less critical layers (like infrequently-routed expert weights in MoE models) to save memory.
What makes Dwarf Star different from standard 4-bit quantization?
Standard 4-bit quantization applies the same bit depth uniformly across all model weights. Dwarf Star uses a mixed-precision strategy specifically designed for MoE models: routed expert FFN weights are compressed to 2-bit, while attention layers, shared experts, and embeddings stay at 4-bit. This achieves a lower average bits-per-parameter than uniform 4-bit, fitting larger models into a given memory budget while preserving most of the quality you’d get from full 4-bit compression.
Does 2-bit quantization significantly hurt model quality?
For routed expert weights in large MoE models, 2-bit quantization causes relatively little quality degradation — typically a perplexity increase of under 5% compared to full 4-bit quantization. This is because: (1) individual routed experts are only activated for a fraction of tokens, so noise in any single expert has limited global impact; (2) modern calibrated quantization methods like Q2_K use block-wise scaling to minimize rounding error; and (3) redundancy across many expert blocks smooths out individual imprecision. The story is very different for attention or embedding layers, where 2-bit would cause severe quality loss — which is why Dwarf Star doesn’t apply it there.
What hardware do you need to run a 284B model with Dwarf Star?
The primary requirement is memory. A Dwarf Star-quantized 284B model needs roughly 100GB of RAM for the weights alone (about 97.6GB at an average of 2.75 bits per parameter), plus additional memory for the KV cache and runtime. The most accessible current hardware target is a Mac Studio with an M4 Max chip and 128GB of unified memory (~$3,000–4,000). Multi-GPU setups with enough total VRAM can also work. Inference speed will be moderate, typically 4–8 tokens per second on that Mac Studio, but sufficient for interactive use.
Is Dwarf Star a specific model or a quantization method?
Dwarf Star is a quantization method (specifically, a mixed-precision quantization scheme) rather than a model family. The same Dwarf Star approach can be applied to any large MoE model architecture. In practice, it’s been applied most visibly to DeepSeek-class models, which have the right MoE structure to benefit from selective 2/4-bit compression.
How does selective quantization compare to pruning?
Quantization and pruning are complementary but different techniques. Quantization reduces the number of bits used to represent each weight — all weights stay in the model, just at lower precision. Pruning removes weights entirely (setting them to zero or removing entire neurons/heads), reducing the total parameter count. Selective quantization preserves the full model structure and tends to be more predictable in quality tradeoffs. Some advanced compression pipelines combine both approaches.
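A tiny Python sketch makes the difference visible. Both functions are simplified stand-ins written for this article (magnitude pruning and whole-tensor 2-bit rounding), and the eight weights are made-up values.

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out the smallest-magnitude weights, keeping about keep_fraction of them."""
    keep = max(1, round(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_2bit(weights):
    """Round every weight to one of 4 evenly spaced levels between min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]


weights = [0.42, -0.03, 0.10, -0.51, 0.02, 0.27, -0.18, 0.06]  # made-up values

pruned = magnitude_prune(weights, keep_fraction=0.5)
quantized = quantize_2bit(weights)

print("original :", weights)
print("pruned   :", pruned)                               # half the weights are now zero
print("quantized:", [round(q, 2) for q in quantized])     # every weight kept, coarser values
```

Pruning keeps four weights at full precision and discards the rest. Quantization keeps all eight but snaps each one to the nearest of four levels.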
Key Takeaways
- Selective quantization assigns different bit depths to different parts of a model based on their sensitivity — not a single uniform compression level across all weights.
- Dwarf Star specifically targets MoE models: routed expert weights get 2-bit compression, while attention and shared components stay at 4-bit, achieving a memory footprint low enough to run 284B models on 128GB RAM.
- The technique works because routed expert weights in MoE architectures are inherently more tolerant of precision loss: they activate infrequently, and quantization noise is spread across many redundant experts.
- Quality tradeoffs are real but modest — typically under 5% perplexity degradation versus full 4-bit — and barely detectable on most practical tasks.
- The broader trend points toward intelligent, structure-aware compression becoming standard practice, replacing blanket quantization schemes.
For teams building AI applications who want to access powerful models without managing any of this infrastructure themselves, MindStudio provides access to 200+ models — including large open-source models — through a no-code interface. You focus on what you’re building; the model access layer handles itself.
