LLM Wiki vs RAG: When to Use Markdown Knowledge Bases Instead of Vector Databases
Karpathy's LLM wiki approach can cut token usage by up to 95% for small knowledge bases compared to naive document loading. Here's how it compares to traditional RAG and when to use each.
The Case Against Always Reaching for RAG
Retrieval-Augmented Generation has become the default answer to every “how do I give my AI access to knowledge?” question. Need your chatbot to know about your product? RAG it. Need your agent to reference internal docs? RAG it. But RAG comes with real costs — infrastructure complexity, retrieval failures, chunking headaches, and ongoing maintenance — and for many use cases, there’s a simpler approach that works just as well or better.
That approach is the LLM wiki: a structured markdown knowledge base designed to live directly in an LLM’s context window. Andrej Karpathy popularized the concept, and for small, focused knowledge bases, it can cut token usage by up to 95% compared to naive document loading while dramatically reducing system complexity.
This article breaks down both approaches, compares them honestly, and helps you figure out which one actually fits your situation.
What Is an LLM Wiki?
The term “LLM wiki” refers to a knowledge base — typically a set of structured markdown files — that’s written specifically to be read by a language model, not by humans browsing a website.
The core idea is simple: instead of building a retrieval pipeline to pull relevant information at query time, you write a compact, well-organized document that covers everything an LLM needs to know about a given domain. That document gets loaded directly into the context window as part of the system prompt or the initial context.
How Karpathy Thinks About It
Andrej Karpathy framed the LLM wiki concept as a way to maintain a “living document” of knowledge that grows incrementally. You write it like you’d write a Wikipedia article — structured, factual, dense — but optimized for the way LLMs process text rather than the way humans browse information.
The key properties of a good LLM wiki:
- Concise but complete — No fluff, no repetition. Every sentence earns its place.
- Well-structured — Clear headings, consistent formatting, logical hierarchy.
- Factual and specific — Definitions, examples, relationships between concepts.
- Curated — Deliberately includes what matters and excludes what doesn’t.
- Updatable — Maintained like a living document, not a static dump.
A 2,000-word markdown file covering your product’s pricing model, edge cases, and common customer questions might be all you need. No embeddings, no vector database, no retrieval logic.
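In practice, "loading the wiki" just means placing the document in the system prompt. A minimal sketch, assuming a chat-completions-style messages format (the wiki content and question are illustrative):

```python
# In practice the wiki would be read from a markdown file, e.g.:
#   wiki = Path("llm_wiki.md").read_text()
WIKI = """\
# Pricing
- Starter plan: $10/month, up to 3 seats.
- Refunds: full refund within 14 days of purchase.
"""

def build_messages(user_question: str) -> list[dict]:
    """Put the entire wiki into the system prompt -- no retrieval step."""
    return [
        {"role": "system",
         "content": f"Answer questions using this knowledge base:\n\n{WIKI}"},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("What is your refund policy?")
# `messages` can be passed to any chat-completions-style API.
```

The whole "pipeline" is a string concatenation, which is exactly the point.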
Why This Works
Modern LLMs have large context windows. Claude 3.5 Sonnet handles 200,000 tokens. GPT-4o handles 128,000. Gemini 1.5 Pro supports up to a million tokens or more. For most small-to-medium knowledge bases, everything fits comfortably in context.
When the knowledge fits in context, you don’t need to retrieve anything. The LLM reads everything and reasons over it directly. This is more reliable than retrieval — you’re not depending on embedding similarity to surface the right chunks.
How Traditional RAG Works
RAG (Retrieval-Augmented Generation) adds an external knowledge source to an LLM by retrieving relevant information at query time rather than keeping it all in context.
The typical pipeline looks like this:
- Ingestion — Documents are split into chunks (usually 256–1,024 tokens each).
- Embedding — Each chunk is converted into a vector using an embedding model.
- Storage — Vectors are stored in a vector database (Pinecone, Weaviate, Chroma, pgvector, etc.).
- Query — At runtime, the user’s query is embedded, and the database returns the most similar chunks.
- Injection — Retrieved chunks are injected into the LLM’s context alongside the query.
- Generation — The LLM generates a response based on the retrieved context.
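The pipeline above can be sketched end to end in a few lines. This toy version uses a bag-of-words vector as a stand-in for a real embedding model (a production system would call an embedding API; the corpus and queries are illustrative):

```python
import math
from collections import Counter

# Step 1 (Ingestion): a toy corpus already split into chunks.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The Starter plan costs $10 per month.",
    "Support is available via email on weekdays.",
]

def embed(text: str) -> Counter:
    """Step 2 (Embedding): stand-in for a real embedding model --
    a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Step 3 (Storage): an in-memory "index" of (chunk, vector) pairs.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Steps 4-5 (Query + Injection): embed the query and return the
    top-k most similar chunks to place in the LLM's context."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("How much does the Starter plan cost?"))
```

Step 6 (Generation) would pass the retrieved chunks to the LLM alongside the query, as in any other prompt.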
What RAG Gets Right
RAG exists for good reasons. It genuinely solves problems that context-stuffing can’t:
- Scale — When your knowledge base has millions of documents or hundreds of thousands of tokens, you can’t fit it in context. RAG lets you query only what’s relevant.
- Real-time updates — You can update a vector database without changing your prompts or system architecture.
- Source attribution — RAG naturally supports citations, since you know exactly which chunks were retrieved.
- Long-tail coverage — For broad knowledge bases with diverse topics, retrieval can surface niche content that you couldn’t predict would be needed.
What RAG Gets Wrong
RAG also has failure modes that are easy to underestimate until you’re debugging them in production:
- Chunking problems — Splitting documents at the wrong points breaks context. A chunk that says “the fee is $25” without explaining what fee or under what conditions is nearly useless.
- Retrieval misses — Embedding similarity is not the same as semantic relevance. A query about “cancellation policy” might not retrieve the chunk that talks about “ending your subscription.”
- Retrieval overhead — Every query requires an embedding call and a vector search. For high-volume applications, this adds latency and cost.
- Infrastructure complexity — You now maintain an embedding pipeline, a vector database, chunking logic, and retrieval tuning. That’s a lot of moving parts.
- Stale indexes — If your source documents update but your index doesn’t, your AI gives wrong answers with confidence.
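The chunking failure mode is easy to reproduce. A naive fixed-size splitter (sizes and the policy text are illustrative) happily separates a fact from the condition that qualifies it:

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking by character count -- the kind of
    splitter that cuts a fact away from its qualifying condition."""
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = (
    "If a subscription is cancelled after the trial period ends, "
    "a processing charge applies. The fee is $25."
)

for c in chunk_fixed(policy, 60):
    print(repr(c))
# The second chunk states the fee but never mentions cancellation or
# the trial period -- retrieved alone, it can't answer "when does the
# $25 fee apply?"
```

Semantic-aware chunking (splitting on headings or sentence boundaries, with overlap) mitigates this, but it is tuning work that an in-context wiki never needs.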
Head-to-Head Comparison
Here’s how the two approaches stack up across the dimensions that matter most for practical deployments:
| Dimension | LLM Wiki | RAG |
|---|---|---|
| Setup complexity | Low — write markdown, load it | High — chunk, embed, index, retrieve |
| Infrastructure required | None | Vector DB + embedding model + pipeline |
| Best knowledge base size | Small to medium (<50K tokens) | Medium to very large |
| Retrieval reliability | High — everything is in context | Variable — depends on chunk quality |
| Update workflow | Edit the markdown file | Re-chunk, re-embed, re-index |
| Token cost per query | Fixed (full wiki) | Variable (retrieved chunks) |
| Source attribution | Harder | Native |
| Latency | Lower | Higher (embedding + search) |
| Maintenance burden | Low | Medium to high |
The tradeoff is fundamentally about knowledge base size vs. system complexity. LLM wikis win on simplicity and reliability for small knowledge bases. RAG wins on scale when the knowledge genuinely can’t fit in context.
When to Use an LLM Wiki
The LLM wiki approach is the right choice when your knowledge base is small, focused, and relatively stable. Specifically, consider it when:
Your Knowledge Base Fits in Context
If your entire knowledge base is under 50,000–100,000 tokens, there’s no technical reason to use RAG. Modern models handle this comfortably. A well-written wiki covering product documentation, support FAQs, internal policies, or domain-specific knowledge often comes in well under this limit.
A 50,000-token knowledge base is roughly 75–100 pages of dense text (at about 500 words per page). Most real-world domain knowledge fits within that range when it’s properly curated rather than dumped.
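A quick way to sanity-check whether your knowledge base fits: the common rule of thumb of roughly four characters of English prose per token. This is only an estimate; for exact counts, use the tokenizer for your target model (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters of English prose per token.
    For exact counts, use your target model's tokenizer."""
    return len(text) // 4

# 7,500 repetitions of a 5-character word: 37,500 chars ~= 9,375 tokens,
# comfortably inside a 100K-token context budget.
sample = "word " * 7_500
print(estimate_tokens(sample), estimate_tokens(sample) < 100_000)
```

If the estimate comes in well under your model's context window, with room left for the conversation itself, the wiki approach is viable.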
You Need High Reliability
If your use case can’t tolerate retrieval misses — customer-facing support bots, legal or compliance tools, medical information assistants — the LLM wiki approach is more reliable. Everything is in context, so the model can reason over the complete picture rather than a retrieved sample.
Retrieval failures in RAG are often silent. The system doesn’t tell you it didn’t find the relevant chunk — it just generates a response without it.
Your Knowledge Is Stable
If your knowledge base doesn’t change frequently, a markdown file is easy to maintain. You update the document, and the next query automatically uses the updated version. No re-indexing required.
You Want to Ship Quickly
RAG requires infrastructure. Even using a managed vector database service, you still need to write ingestion pipelines, tune chunking, handle embedding API calls, and build retrieval logic. An LLM wiki is a markdown file. You can build and test it in an afternoon.
When to Use RAG
RAG is genuinely the right choice in specific situations. Don’t write it off just because it’s more complex — the complexity is justified when:
Your Knowledge Base Is Large
When you’re dealing with thousands of documents, extensive legal codebases, large product catalogs, or any knowledge base that runs into the millions of tokens, you need retrieval. Context windows have limits, and even when the content technically fits, loading hundreds of thousands of tokens on every query is expensive.
Your Content Changes Frequently
If documents update daily or in real time — stock information, news feeds, changing inventory, live pricing — a retrieval system that can be updated incrementally is the right architecture. Maintaining a curated wiki for fast-changing content becomes a full-time job.
You Need Source Attribution
RAG makes citations straightforward. When you retrieve a chunk, you know its source document, page number, and URL. If your application needs to show users where information came from — research tools, legal assistants, academic applications — RAG handles this naturally.
You’re Building Multi-Domain Systems
A single RAG index can cover many different domains and surface relevant content across all of them. An LLM wiki works best when the knowledge is coherent and focused. If you’re building a general-purpose enterprise assistant that needs to know about HR, finance, IT, and legal, a retrieval system scales better.
The Token Math: Why the Difference Is Larger Than You Think
The claim that an LLM wiki can cut token usage by 95% deserves some unpacking, because it depends heavily on what you’re comparing against.
The Naive Document-Loading Baseline
Many teams start by loading full documents — entire PDFs, complete policy manuals, unstructured text — directly into context. A single policy document might be 20,000 tokens. Load three of them and you’re at 60,000 tokens per query, whether or not most of that content is relevant.
A curated LLM wiki covering the same factual ground, written concisely, might be 2,000–3,000 tokens. That’s where the 95% reduction comes from: not from comparing against optimized RAG, but from comparing against the unstructured approach that many people start with.
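The arithmetic behind the headline number, using the figures above (60,000 tokens of naively loaded documents vs. a 3,000-token curated wiki):

```python
naive_tokens = 60_000  # three full policy documents loaded verbatim
wiki_tokens = 3_000    # curated wiki covering the same factual ground

reduction = 1 - wiki_tokens / naive_tokens
print(f"Token reduction vs. naive loading: {reduction:.0%}")  # -> 95%
```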
Comparing Against Optimized RAG
Against a well-tuned RAG system, the token comparison is more nuanced:
- RAG might inject 2,000–5,000 tokens of retrieved context per query.
- A small LLM wiki of 3,000 tokens costs roughly the same per query.
- A large LLM wiki of 30,000 tokens costs more than optimized RAG.
The breakeven point depends on your wiki size and your RAG retrieval settings. For small knowledge bases, an LLM wiki is cost-competitive or cheaper. For large ones, RAG wins on token economics.
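The breakeven logic can be made concrete. A sketch under assumed figures (all token counts here are illustrative, not benchmarks):

```python
def per_query_tokens(wiki_tokens: int = 0, retrieved_tokens: int = 0,
                     overhead_tokens: int = 0) -> int:
    """Context tokens consumed per query under either architecture."""
    return wiki_tokens + retrieved_tokens + overhead_tokens

small_wiki = per_query_tokens(wiki_tokens=3_000)
large_wiki = per_query_tokens(wiki_tokens=30_000)
rag = per_query_tokens(retrieved_tokens=4_000, overhead_tokens=500)

# A small wiki undercuts RAG per query; a large wiki does not.
print(small_wiki < rag, large_wiki < rag)
```

The crossover sits wherever your wiki size exceeds your typical retrieved-context size plus retrieval overhead.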
The Hidden Token Costs of RAG
RAG also has token costs that don’t show up in the obvious comparison:
- Embedding calls (usually small, but they add up at scale)
- System prompt overhead for retrieval instructions
- Token waste from poorly chunked content that retrieves irrelevant material
Factor those in and the true cost difference narrows further.
A Hybrid Approach Worth Considering
The LLM wiki vs. RAG framing is useful for understanding the tradeoffs, but in practice, many production systems use both.
A common hybrid pattern:
- LLM wiki in the system prompt — Stable, curated knowledge that’s always relevant (product overview, core policies, persona instructions).
- RAG for dynamic or large content — Specific documents retrieved on demand (transaction history, user-specific records, large reference databases).
The wiki handles the foundational knowledge that should always be available. RAG handles the long-tail, high-volume, or frequently changing content that can’t live in context permanently.
This approach gives you the reliability of always-available core knowledge combined with the scale of retrieval for edge cases.
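Assembled as a prompt, the hybrid pattern looks something like this. The message shapes and content are illustrative; adapt them to your chat API of choice:

```python
def build_prompt(wiki: str, retrieved_chunks: list[str],
                 question: str) -> list[dict]:
    """Hybrid pattern: the stable wiki always sits in the system prompt,
    while retrieved chunks are injected per query."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": f"Core knowledge base:\n{wiki}"},
        {"role": "user",
         "content": f"Relevant records:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_prompt(
    wiki="Refunds: 14 days. Plans: Starter $10/mo.",
    retrieved_chunks=["Order #1234 placed 2024-05-01."],  # hypothetical record
    question="Can order #1234 still be refunded?",
)
```

The wiki token cost is fixed per query; only the retrieved portion varies with the question.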
How MindStudio Handles Knowledge in AI Agents
When you’re building AI agents or workflows in MindStudio, you have access to both patterns — and you don’t have to choose your architecture upfront.
MindStudio’s no-code workflow builder lets you configure what knowledge an AI agent has access to. For focused, stable knowledge, you can paste a curated markdown document directly into the agent’s system prompt. Your agent has immediate access to everything in that document, with no setup, no indexing, and no retrieval step.
For larger knowledge bases, MindStudio supports retrieval from connected data sources through its 1,000+ integrations — pulling from Notion, Google Drive, Airtable, or other connected tools at runtime. This gives you the retrieval pattern without having to build and maintain embedding pipelines yourself.
The practical upshot: you can start with an LLM wiki approach in minutes, validate that it works for your use case, and migrate to a retrieval-based pattern later if your knowledge base grows beyond what context can handle. The average MindStudio build takes 15 minutes to an hour, so you’re not committing to a heavy architecture before you’ve tested whether the simple approach works.
If you’re building a support bot, internal knowledge assistant, or any agent that needs domain-specific knowledge, MindStudio’s free tier is worth trying before you spin up a vector database.
You can also explore how other teams are using AI agents for knowledge management workflows to get a sense of what’s practical in production.
Frequently Asked Questions
What is an LLM wiki and how is it different from a knowledge base?
An LLM wiki is a knowledge base formatted specifically for consumption by a language model — typically a concise, structured markdown document. Unlike traditional knowledge bases designed for human navigation, an LLM wiki is optimized for density and clarity so that an AI can reason over the full document within a single context window. The term was popularized by Andrej Karpathy as a way to give LLMs reliable, structured access to information without retrieval complexity.
When does RAG outperform an LLM wiki?
RAG outperforms an LLM wiki when the knowledge base is too large to fit in the model’s context window, when content changes frequently enough that maintaining a curated document isn’t practical, or when source attribution is a hard requirement. For knowledge bases that run into the hundreds of thousands of tokens or more, retrieval is the only scalable approach.
Can I use both RAG and an LLM wiki in the same system?
Yes, and this is often the most practical approach for production systems. A common pattern is to keep stable, foundational knowledge in the system prompt as a curated wiki, while using retrieval for dynamic, large, or user-specific content that changes frequently or can’t fit in context. The two approaches are complementary rather than mutually exclusive.
How do I know if my knowledge base is small enough for an LLM wiki?
A rough guideline: if your knowledge base fits under 50,000–100,000 tokens (depending on which model you’re using), an LLM wiki is worth trying first. At the 50,000-token mark, that’s roughly 75–100 pages of dense text. If you can fit your domain knowledge into a well-structured document of that size, load it directly into context and skip the retrieval infrastructure.
What are the most common failure modes in RAG systems?
The most common RAG failures are: chunking boundaries that cut semantic context at the wrong point, embedding similarity that misses conceptually relevant chunks with different vocabulary, stale indexes that don’t reflect recent document updates, and retrieved chunks that lack enough surrounding context to be useful. These failures are often silent — the system generates a confident response without the information it needed.
Is vector search always required for RAG?
No. While vector databases are the most common RAG architecture, there are alternatives. BM25 keyword search can work well for certain types of content, especially technical documentation where exact terminology matters. Hybrid approaches that combine vector search with keyword search often outperform either alone. The choice depends on your content type, query patterns, and latency requirements.
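One simple way to combine the two signals is a weighted sum of normalized scores. A sketch with made-up scores for three chunks (real systems often use reciprocal rank fusion or a tuned BM25 implementation instead):

```python
def hybrid_score(vector_scores: dict[str, float],
                 keyword_scores: dict[str, float],
                 alpha: float = 0.5) -> dict[str, float]:
    """Blend vector-similarity and keyword (e.g. BM25-style) scores with
    a weighted sum, after normalizing each to a 0-1 range."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        top = max(scores.values()) or 1.0
        return {doc: s / top for doc, s in scores.items()}
    v, k = normalize(vector_scores), normalize(keyword_scores)
    docs = set(v) | set(k)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0)
            for d in docs}

# Illustrative scores: chunk "b" ranks mid-pack on vector similarity
# but wins on exact-terminology keyword match.
blended = hybrid_score(
    vector_scores={"a": 0.9, "b": 0.6, "c": 0.2},
    keyword_scores={"a": 1.0, "b": 12.0, "c": 3.0},  # raw keyword scores
)
best = max(blended, key=blended.get)
```

Here the keyword signal promotes a chunk the vector ranking alone would have placed second, which is the typical payoff for technical content.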
Key Takeaways
- LLM wikis are structured markdown knowledge bases loaded directly into an LLM’s context window — no retrieval step required.
- RAG is the right choice for large, dynamic, or multi-domain knowledge bases where content can’t fit in context.
- For small, focused, stable knowledge bases, an LLM wiki is simpler, more reliable, and easier to maintain than a RAG pipeline.
- Token savings depend on your baseline — compared to naive document loading, a curated wiki can cut token usage by 90% or more; compared to optimized RAG, the difference narrows.
- Hybrid architectures — wiki in context for stable knowledge, RAG for long-tail retrieval — often give the best of both approaches.
- Start simple: validate the LLM wiki approach before investing in retrieval infrastructure. Most knowledge bases are smaller than teams assume.
If you’re building an AI agent that needs domain knowledge, MindStudio makes it straightforward to try both patterns without committing to a complex infrastructure before you know what your use case actually needs.