How to Self-Host an Open-Weight AI Stack for Enterprise in Under a Day: DeepSeek V4 + Qwen Embeddings
Cut your AI inference bill 3x by self-hosting DeepSeek V4 with Qwen embeddings. Here's the full stack setup guide for enterprise teams.
You Can Cut Your AI Inference Bill by 3x Before Lunch
Your team is spending real money on API tokens. Not “noticeable” money — money that shows up in budget reviews and makes finance ask uncomfortable questions. If you’re running document summarization, customer support agents, or internal RAG pipelines against GPT-5.5 at $5/M input tokens and $30/M output tokens, the math gets ugly fast at any meaningful scale.
Here’s what changes in under a day: you self-host DeepSeek V4 with Qwen embedding models and Llama 4 Scout or Maverick handling your local RAG and agent workloads. The result is near-equivalent benchmark performance for most enterprise use cases at a fraction of the cost — DeepSeek V4 runs at $1.74/M input tokens and $3.48/M output tokens when served from your own infrastructure, and the Qwen embedding models that power your retrieval layer cost essentially nothing beyond electricity once they’re running locally.
This isn’t a theoretical exercise. The models exist, the tooling is mature enough to deploy in a working day, and the business case is straightforward for any team spending more than a few thousand dollars a month on inference.
Why Your Token Bill Is Probably 3x Higher Than It Needs to Be
DeepSeek V4 is open-weight, carries a 1 million token context window, and benchmarks at near-parity with GPT-5.4 on math and Q&A tasks. Compare that to Claude Opus 4.7 at $5/M input and $25/M output, or Gemini 3.1 at $2/M input and $12/M output. For the overwhelming majority of enterprise workloads — document processing, structured data extraction, internal Q&A, support ticket routing — the frontier models are overkill.
The irony is that China’s GPU export restrictions accidentally made DeepSeek cheaper to serve than any American model. Because DeepSeek couldn’t access the most powerful Nvidia chips during training, they were forced to find compute-efficient training methods. Those efficiency gains translate directly into lower serving costs. The constraint became a structural advantage.
For enterprise teams, this creates a specific opportunity: you can run DeepSeek V4 on your own servers, keep all data on-premises, and pay only for the hardware and electricity. The security and compliance arguments alone justify the setup time for many regulated industries. The cost savings are a bonus.
The stack we’re building here pairs DeepSeek V4 as the primary reasoning model with Qwen embedding models for retrieval and Llama 4 Scout (or Maverick, if you need the heavier mixture-of-experts model) for agent tasks that benefit from a more efficient local model. If you want a deeper comparison of how Qwen models stack up against other open-weight options, the Gemma 4 vs Qwen 3.5 open-weight comparison covers the tradeoffs in detail.
What You Need Before You Start
Hardware. DeepSeek V4 is too large for consumer GPUs — you need either a cloud VM with serious VRAM (an A100 or H100 instance), an on-premises server with multiple high-memory GPUs, or an Nvidia DGX Spark if you want an appliance-style local deployment. For the embedding layer, Qwen embedding models are small enough to run on almost anything with a GPU, including a Mac Studio with 128GB unified memory.
For Llama 4 Scout and Maverick, the mixture-of-experts architecture means the active parameter count per token is lower than the total model size suggests. Scout is the more practical choice for most teams — it runs on hardware that’s actually purchasable without a procurement process.
Software prerequisites. You need:
- Docker (for containerized serving)
- Ollama or vLLM (your inference runtime — Ollama for getting started quickly, vLLM when you’re serving a team)
- Python 3.10+ for the RAG pipeline
- A vector database: Postgres with pgvector is the production default; SQLite with sqlite-vec works for smaller deployments
Accounts and access. Pull the DeepSeek V4 weights from Hugging Face. You’ll need a Hugging Face account and enough disk space — plan for several hundred GB depending on quantization level. Qwen embedding models are available from the same source and are substantially smaller.
Knowledge baseline. You should be comfortable with the command line, understand what an embedding is at a conceptual level, and have some familiarity with REST APIs. You don’t need to understand the model architecture.
The Setup, Step by Step
Step 1: Stand up your inference runtime
Install Ollama on your target machine. For Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Start the Ollama service and verify it’s running on port 11434. Now you have a local OpenAI-compatible API endpoint that other tools can talk to.
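One quick way to confirm the endpoint is alive is the /api/tags route, which lists the models the local Ollama server currently has pulled. A minimal check in Python (any HTTP client works just as well):

import requests

# Lists the models the local Ollama server currently has available.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print([m["name"] for m in resp.json().get("models", [])])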
If you’re deploying for a team rather than a single workstation, use vLLM instead — it handles batching and concurrent requests properly. The Ollama path is faster to get running; vLLM is the right answer when serving becomes infrastructure.
Step 2: Pull DeepSeek V4 and your embedding model
ollama pull deepseek-v4
ollama pull qwen:embedding
The DeepSeek V4 pull will take a while depending on your connection. While it’s downloading, pull Llama 4 Scout:
ollama pull llama4:scout
Once complete, run a quick sanity check:
ollama run deepseek-v4 "Summarize the key risks in a standard enterprise software contract in three sentences."
If you get a coherent response, the model is serving correctly. Now you have a working local inference layer.
Step 3: Set up your vector store
For a production RAG setup, Postgres with pgvector is the right choice. Install pgvector and create your embeddings table:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536),
source TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops);
The Qwen embedding models output 1536-dimensional vectors by default — match your vector column dimension to whatever the model actually produces, which you can verify by running a test embedding and checking the output shape.
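That check takes a few lines: request one embedding from the model you pulled in Step 2 and look at its length.

import requests

# Generate a single test embedding and check its dimensionality.
resp = requests.post("http://localhost:11434/api/embeddings", json={
    "model": "qwen:embedding",
    "prompt": "dimension check"
})
dim = len(resp.json()["embedding"])
print(dim)  # use this value for the vector(N) column above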
For smaller deployments or local experimentation, SQLite with sqlite-vec works and requires no separate database server. The how to run local AI models with Claude Code to cut costs post covers the SQLite path in more detail if that’s your starting point.
Now you have a vector store ready to receive embeddings.
Step 4: Build the ingestion pipeline
Your ingestion pipeline needs to: load documents, chunk them sensibly, generate embeddings via your local Qwen model, and store them in your vector database.
import requests
import psycopg2  # used for the storage step below

OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBEDDING_MODEL = "qwen:embedding"

def embed_text(text: str) -> list[float]:
    # One chunk of text in, one embedding vector out, via the local Ollama endpoint.
    response = requests.post(OLLAMA_URL, json={
        "model": EMBEDDING_MODEL,
        "prompt": text
    })
    response.raise_for_status()
    return response.json()["embedding"]

def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Naive word-count chunking with a small overlap between consecutive chunks.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
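The write-to-Postgres step isn't shown above. Here is a minimal sketch, assuming the document_chunks table from Step 3; the store_chunks helper name and the connection string are placeholders, not part of any library:

def store_chunks(chunks: list[str], source: str) -> None:
    # Placeholder DSN; point this at the Postgres instance holding document_chunks.
    conn = psycopg2.connect("dbname=rag_demo user=rag")
    with conn, conn.cursor() as cur:
        for chunk in chunks:
            embedding = embed_text(chunk)
            # pgvector accepts a bracketed literal like "[0.1,0.2,...]" cast to vector.
            vec_literal = "[" + ",".join(str(x) for x in embedding) + "]"
            cur.execute(
                "INSERT INTO document_chunks (content, embedding, source) "
                "VALUES (%s, %s::vector, %s)",
                (chunk, vec_literal, source),
            )
    conn.close()

Call it once per document after chunking, passing the filename or URL as the source.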
The chunking strategy matters more than most people expect. PDFs need different handling than plain text. Meeting transcripts need speaker attribution preserved. Code needs symbol-aware chunking. A naive word-count splitter will work well enough to validate the pipeline, but plan to revisit this before you call the system production-ready.
Run the ingestion pipeline against a representative sample of your documents. Now you have a populated vector store with locally-generated embeddings that never left your network.
Step 5: Wire up the retrieval and generation layer
The retrieval step: embed the user’s query using the same Qwen embedding model, find the top-k most similar chunks by cosine distance, and pass them as context to DeepSeek V4.
def retrieve_context(query: str, top_k: int = 5) -> list[str]:
    # Embed the query with the same Qwen model used at ingestion time.
    query_embedding = embed_text(query)
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    # Nearest-neighbor search against the pgvector table from Step 3.
    # The connection string is a placeholder; reuse whatever your ingestion code uses.
    conn = psycopg2.connect("dbname=rag_demo user=rag")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM document_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, top_k),
        )
        rows = cur.fetchall()
    conn.close()
    return [row[0] for row in rows]

def generate_response(query: str, context: list[str]) -> str:
    # Join the retrieved chunks into one context block and ask DeepSeek V4 to answer from it.
    context_text = "\n\n".join(context)
    prompt = f"""Use the following context to answer the question.
If the context doesn't contain the answer, say so.

Context:
{context_text}

Question: {query}"""
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "deepseek-v4",
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]
Test this end-to-end with a question you know the answer to from your documents. If retrieval is working, you’ll see the relevant chunks surfaced and DeepSeek V4 will synthesize a coherent answer from them.
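A minimal end-to-end driver, with the question standing in for something your own documents actually answer:

question = "What is the notice period in our standard vendor contract?"
chunks = retrieve_context(question, top_k=5)
print(generate_response(question, chunks))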
Now you have a working local RAG pipeline. Everything — ingestion, embedding, retrieval, generation — runs on your hardware.
Step 6: Add Llama 4 Scout for agent tasks
For agent workloads that involve tool use, multi-step reasoning, or high-frequency calls where you want to preserve DeepSeek V4 capacity for harder tasks, route those calls to Llama 4 Scout instead. The mixture-of-experts architecture means Scout fires only a subset of its parameters per token, making it faster and cheaper to serve for tasks that don’t need the full model.
Configure your agent harness to use llama4:scout as the default model and escalate to deepseek-v4 for tasks that require deeper reasoning. This routing logic is where most of the operational cost optimization happens in practice.
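A minimal sketch of that routing, with a hypothetical needs_deep_reasoning flag standing in for whatever signal your agent harness actually produces (task type, context length, a planner's own judgment):

import requests

def pick_model(needs_deep_reasoning: bool) -> str:
    # Default to the cheaper MoE model; escalate only when the task warrants it.
    return "deepseek-v4" if needs_deep_reasoning else "llama4:scout"

def run_task(prompt: str, needs_deep_reasoning: bool = False) -> str:
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": pick_model(needs_deep_reasoning),
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]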
If you’re building agents that need to connect to business tools — CRMs, ticketing systems, internal APIs — platforms like MindStudio handle this orchestration layer: 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows, which can be useful when you want the local inference stack but don’t want to hand-wire every integration.
The Failure Modes Nobody Warns You About
Chunking quality kills retrieval quality. The most common reason a RAG pipeline underperforms is bad chunking, not a bad model. If your chunks split sentences mid-thought, or if they’re too short to carry meaningful context, retrieval will surface irrelevant passages and generation will hallucinate. Invest time here before blaming the model.
Embedding model mismatch. If you generate embeddings with one model and then switch to a different embedding model later, your existing vectors are incompatible. You’ll need to re-embed everything. This is why keeping your raw document chunks and your embeddings separate in the database matters — you can rebuild the embeddings without losing the source data.
Memory pressure under concurrent load. DeepSeek V4 with a 1 million token context window can consume enormous amounts of VRAM if you’re not careful about context length in practice. Set explicit max_tokens limits in your generation calls and monitor GPU memory under realistic concurrent load before calling the system production-ready.
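With Ollama, those limits live in the options field of the generate request: num_predict caps generated tokens and num_ctx bounds the context window allocated per request. Applied to the generate_response call from Step 5 (the values are illustrative):

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "deepseek-v4",
    "prompt": prompt,
    "stream": False,
    "options": {
        "num_predict": 512,    # hard cap on output tokens
        "num_ctx": 16384       # context window allocated for this request
    }
})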
Quantization tradeoffs. Running DeepSeek V4 at 4-bit quantization significantly reduces memory requirements but introduces some quality degradation. For most enterprise document tasks, 4-bit is fine. For tasks requiring precise numerical reasoning, test carefully before committing. The Qwen 3.6 Plus review for agentic coding has useful notes on quantization behavior in practice.
The “it worked in testing” problem. Local inference is fast and cheap in testing when you’re the only user. Under concurrent load from a real team, you may discover that your single-GPU setup queues requests in ways that make the system feel slow. vLLM’s batching handles this significantly better than Ollama for multi-user scenarios.
Where to Take This Next
The stack you’ve built — DeepSeek V4 for generation, Qwen embeddings for retrieval, Llama 4 Scout for agent tasks — is a reasonable production baseline for most enterprise RAG use cases. But there are several directions worth pursuing once the baseline is working.
Fine-tuning on your domain. The open-weight nature of these models means you can fine-tune on your own data. For specialized domains — legal documents, medical records, financial filings — a fine-tuned smaller model will often outperform a general-purpose large model. Llama 4 Scout’s mixture-of-experts architecture makes it a practical fine-tuning target.
Hybrid routing. Not every query needs local inference. Hard synthesis tasks, novel reasoning problems, and anything where you need the absolute frontier of capability should still route to a cloud model. The how to use OpenRouter free models with Claude Code to cut AI costs post covers one approach to this kind of hybrid routing.
Memory persistence. The RAG pipeline handles document retrieval, but agent memory — the ability to remember decisions, preferences, and project state across sessions — is a separate problem. This is worth solving intentionally rather than bolting on later.
Spec-driven application development. Once your inference stack is stable, the next question is how to build production applications on top of it. Tools like Remy take a different approach to this layer: you write an annotated markdown spec describing your application’s behavior, data types, and edge cases, and Remy compiles it into a complete TypeScript backend, SQLite database, frontend, and auth — the spec is the source of truth, the code is derived output. For teams building internal tools on top of a local inference stack, this can dramatically reduce the time from working pipeline to deployed application.
Monitoring and evals. Local inference removes the observability you get for free from cloud providers. Build logging into your pipeline from the start — log queries, retrieved chunks, and responses. Run periodic evals against a golden dataset to catch quality regressions when you update models or change chunking strategies.
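Logging doesn't need to be elaborate to be useful. A minimal sketch that appends one JSON record per query to a local file (the path is a placeholder; redact and rotate as your compliance posture requires):

import json
import time

def log_interaction(query: str, chunks: list[str], answer: str,
                    path: str = "rag_log.jsonl") -> None:
    # One JSON object per line keeps the log easy to load into an eval script later.
    record = {
        "timestamp": time.time(),
        "query": query,
        "retrieved_chunks": chunks,
        "response": answer,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")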
The open-weight ecosystem is moving fast enough that any specific model recommendation ages quickly. What doesn’t age is the stack architecture: a runtime that makes models swappable, an embedding layer you control, a vector store you own, and routing logic that puts the right model on the right task. Build that, and swapping DeepSeek V4 for whatever ships next quarter is a one-line change.
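In practice that can be as small as keeping model names in one place, so the swap really is a one-line change (the names are the ones used throughout this guide):

MODELS = {
    "generation": "deepseek-v4",   # swap this line when the next open-weight model ships
    "embedding": "qwen:embedding",
    "agent": "llama4:scout",
}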
For teams evaluating which open-weight models to anchor their embedding layer on, the Gemma 4 vs Qwen 3.5 comparison for agentic workflows covers the context window and function-calling tradeoffs that matter most for agent use cases.
The cost math is real. The setup time is real. The question is whether your team’s current token spend justifies a working day of setup. For most teams running production AI workloads, the answer is yes before you finish the calculation.