How to Build a Hybrid AI Memory System for Claude Code: Storage, Injection, and Recall
Learn how to combine MemSearch and Hermes to build a memory system that stores everything, injects smartly, and recalls by meaning with source citations.
Why Claude Code Forgets Everything (And How to Fix It)
Every developer who works with Claude Code runs into the same wall eventually. You’re mid-session on a complex project. Claude has context about your architecture, your naming conventions, your past decisions. Then the session ends — and the next time you open it, none of that exists. You’re starting from scratch.
This isn’t a flaw in Claude Code specifically. It’s a fundamental constraint of how large language models work: they have a fixed context window, and nothing outside that window is accessible. For short tasks, this is fine. For ongoing development work — the kind where history, preferences, and accumulated knowledge actually matter — it’s a serious limitation.
A hybrid AI memory system solves this. By combining a semantic recall layer (MemSearch) with a structured storage and injection layer (Hermes), you can give Claude Code something close to persistent, meaningful memory: storing everything that happens, injecting relevant context at the right moments, and retrieving past information by meaning rather than by exact keyword match.
This guide walks through how to build that system from scratch.
What a Hybrid Memory System Actually Does
Before getting into implementation, it’s worth being precise about what “memory” means in this context — because there are several distinct problems to solve.
The Three Memory Problems
Storage: Out of the box, a new Claude Code session does not carry over what earlier sessions worked out, apart from instructions you maintain by hand in files such as CLAUDE.md. You need a system that captures important information (decisions made, patterns identified, code explained) and saves it somewhere durable.
Injection: Having stored memory doesn’t help if Claude never sees it. You need a mechanism that automatically retrieves relevant memory and injects it into Claude’s context at the start of a session or at key moments during a task. Injecting everything would overflow the context window, so the injection layer needs to be selective.
Recall: When you want to ask “what did we decide about authentication?” or “show me how we handled rate limiting before,” you need semantic search — retrieval that works on meaning, not just text matching. This is the hardest part to get right.
A hybrid system addresses all three separately and lets them work together.
Why “Hybrid”?
Pure vector search (embedding-based retrieval) is great at finding semantically similar content but loses structure and chronology. Pure key-value or relational storage is great at structure but can’t retrieve by meaning. A hybrid system uses both:
- Hermes handles structured storage and smart injection — maintaining metadata, session context, source tracking, and deciding what gets injected when
- MemSearch handles the semantic layer — embedding content into vectors, enabling meaning-based retrieval, and surfacing results with source citations
Together, they cover the full memory lifecycle.
Setting Up the Storage Layer with Hermes
Hermes acts as the memory orchestration layer. Its job is to receive information from Claude Code sessions, store it with proper structure, and manage what gets injected into future sessions.
What Hermes Stores
Each memory entry should capture more than just the raw content. Useful metadata includes:
- Session ID — which conversation or work session this came from
- Timestamp — when the memory was created
- Memory type — decision, code snippet, explanation, user preference, error resolution
- Tags — project name, file names, technologies involved
- Source — what file, conversation turn, or command produced this memory
- Confidence — how important or reliable this memory is
Without this structure, you end up with a flat pile of text blobs that’s hard to manage or filter.
Structuring Memory Entries
A practical schema for a memory entry looks something like this:
{
  "id": "mem_20240712_001",
  "session_id": "sess_abc123",
  "timestamp": "2024-07-12T14:23:00Z",
  "type": "decision",
  "content": "We're using JWT tokens with 15-minute expiry for access and 7-day refresh tokens stored in httpOnly cookies.",
  "tags": ["auth", "security", "tokens"],
  "source": "src/auth/middleware.ts",
  "project": "ecommerce-api",
  "importance": 0.9
}
The importance score is useful for filtering during injection — high-importance memories (architectural decisions, security choices, key patterns) get injected more aggressively than low-importance ones (minor style notes, one-off fixes).
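One way to keep entries consistent is to build them through a single helper that validates the fields before anything is stored. The sketch below is illustrative only: `createMemoryEntry`, the list of allowed types, and the sequence-number argument are assumptions made for this example, not part of Hermes.

```javascript
// Hypothetical helper that builds an entry in the schema shown above.
// The allowed types and the 0-1 importance range are assumptions.
const MEMORY_TYPES = ["decision", "code_snippet", "explanation", "preference", "error_resolution"];

function createMemoryEntry(fields, seq = 1, now = new Date()) {
  const { sessionId, type, content, tags = [], source, project, importance = 0.5 } = fields;
  if (!MEMORY_TYPES.includes(type)) {
    throw new Error(`Unknown memory type: ${type}`);
  }
  if (typeof content !== "string" || content.trim() === "") {
    throw new Error("Memory content must be a non-empty string");
  }
  if (!(importance >= 0 && importance <= 1)) {
    throw new Error("Importance must be between 0 and 1");
  }
  const iso = now.toISOString();                   // e.g. 2024-07-12T14:23:00.000Z
  const day = iso.slice(0, 10).replace(/-/g, "");  // e.g. 20240712
  return {
    id: `mem_${day}_${String(seq).padStart(3, "0")}`,
    session_id: sessionId,
    timestamp: iso,
    type,
    content: content.trim(),
    tags,
    source,
    project,
    importance,
  };
}
```

Rejecting malformed entries at write time is cheaper than cleaning them up later, because both the structured store and the vector store inherit whatever the entry contains.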
Capturing Memories During Sessions
There are two approaches to capture: automatic and manual.
Automatic capture hooks into Claude Code’s output stream and uses a secondary model to identify what’s worth saving. After each significant response, a lightweight classification step decides whether to store the content and how to categorize it.
Manual capture gives you explicit control. A simple command — something like /remember [content] — triggers immediate storage. This is more reliable but requires discipline.
In practice, both work best together. Automatic capture catches things you’d forget to save; manual capture lets you flag the decisions that really matter.
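As a rough illustration of the manual path, the sketch below parses a /remember command into the content to store. The `#tag` convention and the default importance of 0.9 for manually flagged memories are assumptions made for this example.

```javascript
// Parses a manual capture command of the form "/remember <content>".
// The "#tag" convention and the default importance are assumptions.
function parseRememberCommand(input) {
  const match = /^\/remember\s+([\s\S]+)$/.exec(input.trim());
  if (!match) return null; // not a /remember command, or no content given

  const body = match[1];
  // Collect "#tag" tokens that start the text or follow whitespace.
  const tags = [...body.matchAll(/(?:^|\s)#([\w-]+)/g)].map((m) => m[1].toLowerCase());
  // Remove the tag tokens and collapse leftover whitespace.
  const content = body.replace(/(?:^|\s)#[\w-]+/g, "").replace(/\s+/g, " ").trim();
  if (content === "") return null;

  // Manually flagged memories default to high importance in this sketch.
  return { content, tags, importance: 0.9 };
}
```

For example, `parseRememberCommand("/remember Use JWT with 15-minute expiry #auth #security")` yields the content without the tags, plus `["auth", "security"]` as the tag list.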
The Injection Strategy
When a new session starts, Hermes queries stored memories and selects a subset to inject into Claude’s system prompt or first user message. The selection logic should consider:
- Project match — only inject memories tagged to the current project
- Recency — newer memories are generally more relevant
- Importance — high-importance entries always make the cut
- Token budget — never inject more than a set percentage of the available context window (a good default is 20-30%)
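The selection criteria above can be sketched as a scoring function plus a greedy fill of the token budget. Everything here is an assumption chosen for illustration: the 0.7/0.3 weighting of importance against recency, the 30-day recency half-life, and the rough four-characters-per-token estimate. A real tokenizer would give more accurate counts.

```javascript
// Rough token estimate: about 4 characters per token is a common rule of
// thumb for English text, not an exact count.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Blends importance with recency. The weights and the 30-day half-life
// are illustrative assumptions.
function scoreMemory(memory, now = Date.now()) {
  const ageDays = Math.max((now - Date.parse(memory.timestamp)) / 86_400_000, 0);
  const recency = Math.pow(0.5, ageDays / 30);
  return 0.7 * memory.importance + 0.3 * recency;
}

// Ranks memories by score, then adds them greedily until the budget is used.
function selectWithinBudget(memories, tokenBudget, now = Date.now()) {
  const ranked = [...memories].sort((a, b) => scoreMemory(b, now) - scoreMemory(a, now));
  const selected = [];
  let used = 0;
  for (const memory of ranked) {
    const cost = estimateTokens(memory.content);
    if (used + cost > tokenBudget) continue; // too large; a smaller entry may still fit
    selected.push(memory);
    used += cost;
  }
  return selected;
}
```

A caller would first filter by project, then pass a budget derived from the context window, for example `selectWithinBudget(memories.filter((m) => m.project === "ecommerce-api"), Math.floor(contextWindow * 0.2))`.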
The injected memories appear as structured context, clearly labeled so Claude knows they’re retrieved history rather than live information:
--- MEMORY CONTEXT ---
[2024-07-10] DECISION: Authentication uses JWT with 15-min access tokens and 7-day refresh tokens in httpOnly cookies. (Source: auth/middleware.ts)
[2024-07-11] PATTERN: All API errors follow the format {error: string, code: string, details?: object}. (Source: types/errors.ts)
--- END MEMORY CONTEXT ---
This framing helps Claude treat these as reliable background knowledge rather than conversation content.
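A formatter for that block can be very small. This sketch assumes each memory carries the `timestamp`, `type`, `content`, and `source` fields from the schema earlier in this guide; the function name is hypothetical.

```javascript
// Renders selected memories as the labeled block shown above.
function formatMemoryContext(memories) {
  if (memories.length === 0) return ""; // nothing to inject
  const lines = memories.map((m) => {
    const date = m.timestamp.slice(0, 10);          // YYYY-MM-DD from an ISO timestamp
    const label = m.type.toUpperCase();             // e.g. DECISION
    const source = m.source ? ` (Source: ${m.source})` : "";
    return `[${date}] ${label}: ${m.content}${source}`;
  });
  return ["--- MEMORY CONTEXT ---", ...lines, "--- END MEMORY CONTEXT ---"].join("\n");
}
```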
Building Semantic Recall with MemSearch
Hermes handles structured storage and injection. MemSearch handles the other half: finding memories by meaning when you explicitly ask for them.
How Semantic Search Works Here
Every time a memory is stored by Hermes, MemSearch generates a vector embedding of the content using an embedding model. That embedding is stored alongside the memory entry in a vector database.
When you ask a recall question — “how did we handle pagination?” — MemSearch:
- Embeds the query using the same embedding model
- Performs a similarity search in the vector database
- Returns the top-N most semantically similar memories
- Includes source citations for each result
The result isn’t a keyword match. It finds memories that are about pagination even if they use different words — “cursor-based navigation,” “offset/limit patterns,” “scroll handling.” This is what makes semantic recall genuinely useful.
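The similarity search itself is commonly done with cosine similarity between the query vector and each stored vector. The sketch below runs it in memory over toy three-dimensional vectors to show the ranking step. A real vector database performs the same comparison with indexes built for much larger collections, and real embeddings have hundreds of dimensions.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every entry against the query and keeps the n best matches.
function topN(queryEmbedding, entries, n = 5) {
  return entries
    .map((entry) => ({ ...entry, score: cosineSimilarity(queryEmbedding, entry.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n);
}

// Toy data: the vectors are made up for illustration.
const stored = [
  { id: "cursor-pagination", embedding: [1, 0, 0] },
  { id: "jwt-auth", embedding: [0, 1, 0] },
  { id: "offset-limit", embedding: [0.9, 0.1, 0] },
];
const matches = topN([1, 0, 0], stored, 2);
```

With these toy vectors, the two pagination-related entries rank ahead of the authentication entry because their vectors point in nearly the same direction as the query.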
Choosing an Embedding Model
For a local or low-latency setup, models like text-embedding-3-small (OpenAI) or nomic-embed-text (open source, runs locally) work well. The key requirements are:
- Consistent model use across storage and retrieval — if you embed with one model, you must query with the same one
- Reasonable embedding dimensions (768–1536 works well for most use cases)
- Fast inference — memory injection shouldn’t add more than 200-300ms to session startup
Setting Up the Vector Database
Popular options for the vector store include:
- Chroma — open source, runs locally, easy to set up
- Qdrant — open source, production-ready, good filtering support
- Pinecone — managed service, minimal ops overhead
- pgvector — if you’re already using PostgreSQL, this avoids adding a new system
For a Claude Code memory system, Chroma or Qdrant running locally is usually the right call. You get low latency, full control, and, as long as the embedding model also runs locally, no data sent to external services.
Source Citations in Recall Results
One of the practical requirements for a useful memory system is knowing where a memory came from. When Claude tells you “we decided to use Redis for session storage,” you want to be able to verify that and trace it back to the original context.
MemSearch handles this by returning the source metadata alongside each result. A recall query returns something like:
Query: "session storage approach"
Result 1 (score: 0.94):
"Redis is used for session storage with 24-hour TTL. Sessions keyed by userId."
Source: src/session/store.ts | Session: 2024-07-08 | Type: decision
Result 2 (score: 0.81):
"Session invalidation happens on logout and password change via Redis DEL."
Source: src/auth/logout.ts | Session: 2024-07-09 | Type: code pattern
This turns recall from a black box into a traceable, auditable process.
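A small formatter can produce that output from raw search results. This sketch assumes each result carries `content`, `score`, `source`, `timestamp`, and `type`. Those field names follow the schema used in this guide, but the function itself is hypothetical.

```javascript
// Renders recall results with their citations, one block per result.
function formatRecallResults(query, results) {
  const blocks = results.map((r, i) =>
    [
      `Result ${i + 1} (score: ${r.score.toFixed(2)}):`,
      `"${r.content}"`,
      // Fall back to "unknown" so a missing source is visible, not silently dropped.
      `Source: ${r.source ?? "unknown"} | Session: ${r.timestamp.slice(0, 10)} | Type: ${r.type}`,
    ].join("\n")
  );
  return [`Query: "${query}"`, ...blocks].join("\n");
}
```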
Connecting MemSearch and Hermes: The Full Flow
The two systems work together through a simple coordination layer. Here’s the full lifecycle:
Write Path (Storing a Memory)
- Claude Code produces output during a session
- The capture layer (automatic or manual) identifies content worth saving
- Hermes stores the memory entry with full metadata
- MemSearch generates an embedding and stores it in the vector database
- Both systems now have a reference to the same memory — Hermes for structured retrieval, MemSearch for semantic search
Read Path (Injecting Context at Session Start)
- New session begins with a project identifier
- Hermes queries structured storage by project, filters by importance and recency, respects token budget
- Selected memories are formatted and injected into Claude’s opening context
- Session proceeds with relevant history already available
Read Path (Explicit Recall During a Session)
- User asks a recall question (“how did we handle X before?”)
- MemSearch receives the query, generates an embedding, searches the vector store
- Top results returned with source citations
- Results injected into the next Claude turn as retrieved context
The two read paths can run simultaneously — automatic injection for session startup, semantic recall for on-demand queries.
Implementation: Putting It Together with Claude Code
Here’s a practical implementation approach using the MindStudio Agent Skills Plugin, which handles the infrastructure layer so you can focus on the memory logic itself.
Prerequisites
- Claude Code installed and configured
- Node.js 18+ for the coordination layer
- A vector database (Chroma recommended for local setup)
- The @mindstudio-ai/agent npm package
Step 1: Install the Agent Skills Plugin
npm install @mindstudio-ai/agent
This gives your coordination layer access to MindStudio’s typed capabilities, including search, storage, and workflow execution — without managing API keys or rate limiting yourself.
Step 2: Build the Memory Coordinator
The coordinator is the bridge between Claude Code sessions and your MemSearch/Hermes systems. A minimal version:
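import { agent } from '@mindstudio-ai/agent';

// hermesStore, vectorStore, generateEmbedding, and selectWithinBudget are
// placeholders, assumed to be defined elsewhere in your project: your
// structured store, vector database client, embedding call, and budget selector.

// Write path: save the entry in both stores under the same ID.
async function storeMemory(content, metadata) {
  // Store in Hermes (structured)
  await hermesStore.insert({ content, ...metadata });
  // Store in MemSearch (semantic). The payload includes the content so that
  // recall results can show the memory text alongside its source.
  const embedding = await generateEmbedding(content);
  await vectorStore.upsert({ id: metadata.id, embedding, payload: { content, ...metadata } });
}

// Explicit recall: embed the query and search within one project.
async function recallByMeaning(query, projectId) {
  const embedding = await generateEmbedding(query);
  const results = await vectorStore.search({ embedding, filter: { project: projectId }, limit: 5 });
  return results.map(r => ({ ...r.payload, score: r.score }));
}

// Session start: fetch high-importance memories, then trim to the token budget.
async function buildSessionContext(projectId, tokenBudget) {
  const memories = await hermesStore.query({ project: projectId, minImportance: 0.7 });
  return selectWithinBudget(memories, tokenBudget);
}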
Step 3: Hook Into Claude Code Sessions
Claude Code supports custom system prompts and pre-session hooks. Use these to call buildSessionContext before each session and inject the formatted memory block.
For explicit recall, you can either:
- Add a /recall [query] command that calls recallByMeaning and returns formatted results
- Configure a background watcher that monitors conversation turns and triggers recall automatically when certain patterns appear
Step 4: Set Retention and Pruning Rules
Memory systems get noisy over time. Define pruning rules upfront:
- TTL-based: memories older than 90 days drop to low importance unless explicitly pinned
- Deduplication: when very similar memories are stored, keep the newer one and update the embedding
- Importance decay: memories that are never retrieved lose importance score over time
- Project archival: when a project is marked inactive, its memories move to cold storage
This keeps the system useful as it scales up.
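The TTL and decay rules can be expressed as one pass over the stored entries. In this sketch, the `pinned` and `lastRetrievedAt` fields, the 0.2 "low importance" floor, and the 10% monthly decay are all assumptions chosen for illustration. Deduplication and archival are left out because they depend on your stores.

```javascript
const DAY_MS = 86_400_000;

// Returns new entries with adjusted importance; the input is not mutated.
function applyRetention(memories, now = Date.now(), options = {}) {
  const { ttlDays = 90, lowImportance = 0.2, monthlyDecay = 0.9 } = options;
  return memories.map((m) => {
    if (m.pinned) return m; // pinned memories are exempt from every rule
    const ageDays = Math.max((now - Date.parse(m.timestamp)) / DAY_MS, 0);
    let importance = m.importance;
    // Importance decay: never-retrieved memories lose 10% per full 30 days.
    if (!m.lastRetrievedAt) {
      importance *= Math.pow(monthlyDecay, Math.floor(ageDays / 30));
    }
    // TTL: past the cutoff, cap importance at the low-importance level.
    if (ageDays > ttlDays) {
      importance = Math.min(importance, lowImportance);
    }
    return { ...m, importance };
  });
}
```

Running this on a schedule, for example once a day, keeps the injection layer from being dominated by stale entries without deleting anything outright.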
How MindStudio Fits Into This Architecture
If you’re building this kind of memory system for Claude Code, the biggest operational headache isn’t the logic — it’s the infrastructure: managing rate limits across multiple APIs, handling retries when vector store operations fail, wiring together embeddings, storage, and retrieval without everything breaking when one piece changes.
The MindStudio Agent Skills Plugin addresses exactly this. The @mindstudio-ai/agent SDK gives Claude Code and any other agent runtime access to 120+ typed capabilities as simple method calls, with the infrastructure layer already handled.
For a memory system specifically, this means:
- Calling agent.searchGoogle() to pull in external context worth storing
- Using agent.runWorkflow() to trigger memory consolidation or summarization pipelines
- Handling retries and rate limiting automatically, so your memory coordinator doesn’t need defensive code for every API call
MindStudio’s no-code builder also lets you build the memory management UI — a dashboard for browsing stored memories, adjusting importance scores, or manually pinning critical context — without writing frontend code. Teams that work with Claude Code often want visibility into what’s in memory; MindStudio makes that dashboard a 30-minute build, not a side project.
You can try it free at mindstudio.ai.
Common Mistakes and How to Avoid Them
Injecting Too Much Context
The most common failure mode is greed: storing lots of memory and injecting most of it every session. This crowds out the actual task content, slows session startup, and often degrades Claude’s performance on the immediate work.
Keep injection selective. Treat 20% of the context window as the default ceiling for memory context, with 30% as the upper limit. Within that budget, prioritize importance score over recency.
Mismatched Embedding Models
If you embed at write time with one model and embed queries with another, similarity scores become meaningless. Lock your embedding model and treat any change as a migration that requires re-embedding your entire memory store.
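One lightweight safeguard is to record the model name and vector size when the store is created, then check both on every query. The guard below is a hypothetical helper, not a feature of any particular vector database.

```javascript
// Records which model produced the stored vectors and rejects anything else.
function createEmbeddingGuard(model, dimensions) {
  return {
    model,
    dimensions,
    // Returns the embedding unchanged if it is compatible with the store.
    check(embedding, usedModel) {
      if (usedModel !== model) {
        throw new Error(`Embedding model mismatch: store uses ${model}, got ${usedModel}`);
      }
      if (embedding.length !== dimensions) {
        throw new Error(`Expected ${dimensions} dimensions, got ${embedding.length}`);
      }
      return embedding;
    },
  };
}
```

Calling `guard.check(queryEmbedding, modelName)` before each search turns a silent quality problem into an immediate, readable error.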
No Source Tracking
Memory without provenance is hard to trust. If Claude says “we decided X,” and you can’t verify where that came from, you’re flying blind. Build source citations in from day one — retrofitting them is painful.
Forgetting to Test Recall Quality
It’s easy to build a system that stores things correctly but retrieves the wrong ones. After initial setup, run a set of recall test queries against your real stored memories. If the top results aren’t what you’d expect, tune your embedding model, similarity threshold, or metadata filters before relying on the system for real work.
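A recall check can be as simple as a list of queries paired with the memory you expect to see near the top. The sketch below is an assumption-laden example: `evaluateRecall` is a hypothetical helper, and `recallFn` stands in for whatever function performs your search (it must return results that carry an `id`).

```javascript
// Runs each test query and reports how often the expected memory
// appears in the top k results.
async function evaluateRecall(testCases, recallFn, k = 3) {
  const failures = [];
  for (const { query, expectedId } of testCases) {
    const results = await recallFn(query);
    const topIds = results.slice(0, k).map((r) => r.id);
    if (!topIds.includes(expectedId)) {
      failures.push({ query, expectedId, topIds }); // keep details for debugging
    }
  }
  const hitRate =
    testCases.length === 0 ? 0 : (testCases.length - failures.length) / testCases.length;
  return { hitRate, failures };
}
```

The `failures` list is the useful part: it shows which queries miss and what came back instead, which tells you whether to adjust the embedding model, the similarity threshold, or the metadata filters.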
Frequently Asked Questions
What is a hybrid AI memory system?
A hybrid AI memory system combines two complementary approaches: structured storage with metadata filtering (for precise, rule-based retrieval) and semantic vector search (for meaning-based recall). Neither approach alone handles all retrieval needs well. Structured storage can’t find conceptually similar content; vector search loses chronology and structure. Combining them covers the full range of memory access patterns an AI coding agent needs.
How does semantic recall differ from keyword search?
Keyword search finds exact or near-exact text matches. Semantic recall finds content that is about the same thing, even if different words are used. If you stored a memory about “token-based authentication” and query for “how does login work,” semantic search returns the relevant result; keyword search likely misses it. This matters a lot in coding contexts, where the same concept gets described multiple ways across different files and sessions.
Does this work with Claude Code specifically, or any AI coding tool?
The architecture works with any AI coding assistant that accepts a configurable system prompt or pre-session context injection. Claude Code is a good fit because it’s designed for longer-horizon agentic tasks where persistent memory provides the most value. The same pattern applies to multi-agent workflows where multiple agents share a memory pool.
How many memories can the system handle before performance degrades?
Vector search scales well — modern vector databases handle millions of entries with sub-100ms query times. The bottleneck is usually the injection layer: how many tokens of memory you can include in context without hurting Claude’s performance on the actual task. A well-tuned system with 100,000+ memory entries can still inject only the most relevant 20-30 entries per session, keeping context tight.
How do I handle sensitive information in stored memories?
Don’t store credentials, API keys, or personally identifiable information in the memory system. Use environment variables for secrets as you normally would, and configure your capture layer to redact or skip content that matches sensitive patterns before storage. For team environments, also consider access controls on the vector database — who can read or write memories should match your existing permissions model.
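A redaction step in the capture layer might look like the sketch below. The patterns are illustrative only and will not catch every secret format, so treat them as a starting point to extend for your own project.

```javascript
// Illustrative patterns only; extend them for the secrets your project uses.
const SENSITIVE_PATTERNS = [
  /\b(?:api[_-]?key|secret|password|token)\b\s*[:=]\s*\S+/gi, // key=value style secrets
  /\bAKIA[0-9A-Z]{16}\b/g,                                    // AWS access key ID format
  /\b[\w.+-]+@[\w-]+\.[\w.-]+\b/g,                            // email addresses
];

// Replaces matches with a placeholder and reports whether anything was found.
function redactSensitive(content) {
  let redacted = content;
  let hits = 0;
  for (const pattern of SENSITIVE_PATTERNS) {
    redacted = redacted.replace(pattern, () => {
      hits++;
      return "[REDACTED]";
    });
  }
  return { content: redacted, redacted: hits > 0 };
}
```

Returning the `redacted` flag lets the capture layer decide whether to store the cleaned text or skip the memory entirely.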
Can multiple developers share the same memory system?
Yes, with some caveats. Shared memory works well for project-level knowledge: architectural decisions, coding patterns, known bugs, established conventions. Personal preferences and individual workflow patterns should stay in user-scoped memory. Tag memories with both a project identifier and a user identifier, then query both namespaces at session start — project memories injected for everyone, user memories injected only for the relevant user.
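The two-namespace query can be sketched as a single filter. The `scope` and `user` fields are hypothetical names, and the choice to let user-scoped memories follow a developer across projects is an assumption. Restrict them by project as well if that fits your team better.

```javascript
// Selects shared project memories plus the current user's personal ones.
function memoriesForSession(memories, { project, user }) {
  return memories.filter((m) => {
    if (m.scope === "project") return m.project === project; // shared with the whole team
    if (m.scope === "user") return m.user === user;          // personal, owner only
    return false; // unscoped entries are never injected
  });
}
```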
Key Takeaways
- Claude Code’s context window limitation is a real constraint for ongoing development work — a persistent memory system directly addresses it
- Hermes handles structured storage, metadata management, and smart injection based on importance, recency, and token budget
- MemSearch handles semantic recall using vector embeddings, returning results with source citations for traceability
- The hybrid approach covers both structured filtering and meaning-based retrieval — neither alone is sufficient
- Source citations and importance scoring are non-negotiable from day one; retrofitting them is significantly harder
- MindStudio’s Agent Skills Plugin handles the infrastructure layer, letting you focus on memory logic rather than API plumbing
If you’re building with Claude Code and want persistent, intelligent memory without rebuilding the infrastructure from scratch, MindStudio is worth exploring — especially for teams that also need a management interface or want to connect memory workflows to the rest of their tooling.

