How to Build a Hybrid AI Memory System for Claude Code: Storage, Injection, and Recall

Why Claude Code Forgets Everything (And How to Fix It)

Every developer who works with Claude Code runs into the same wall eventually. You’re mid-session on a complex project. Claude has context about your architecture, your naming conventions, your past decisions. Then the session ends — and the next time you open it, none of that exists. You’re starting from scratch.

This isn’t a flaw in Claude Code specifically. It’s a fundamental constraint of how large language models work: they have a fixed context window, and nothing outside that window is accessible. For short tasks, this is fine. For ongoing development work — the kind where history, preferences, and accumulated knowledge actually matter — it’s a serious limitation.

A hybrid AI memory system solves this. By combining a semantic recall layer (MemSearch) with a structured storage and injection layer (Hermes), you can give Claude Code something close to persistent, meaningful memory: storing everything that happens, injecting relevant context at the right moments, and retrieving past information by meaning rather than by exact keyword match.

This guide walks through how to build that system from scratch.

What a Hybrid Memory System Actually Does

Before getting into implementation, it’s worth being precise about what “memory” means in this context — because there are several distinct problems to solve.

The Three Memory Problems

Storage: Out of the box, Claude Code doesn’t capture what happens in a session as structured, searchable memory. Once a session closes, that working context is gone. You need a system that captures important information (decisions made, patterns identified, code explained) and saves it somewhere durable.

Injection: Having stored memory doesn’t help if Claude never sees it. You need a mechanism that automatically retrieves relevant memory and injects it into Claude’s context at the start of a session or at key moments during a task. Injecting everything would overflow the context window, so the injection layer needs to be selective.

Recall: When you want to ask “what did we decide about authentication?” or “show me how we handled rate limiting before,” you need semantic search — retrieval that works on meaning, not just text matching. This is the hardest part to get right.

A hybrid system addresses all three separately and lets them work together.

Why “Hybrid”?

Pure vector search (embedding-based retrieval) is great at finding semantically similar content but loses structure and chronology. Pure key-value or relational storage is great at structure but can’t retrieve by meaning. A hybrid system uses both:

Hermes handles structured storage and smart injection — maintaining metadata, session context, source tracking, and deciding what gets injected when
MemSearch handles the semantic layer — embedding content into vectors, enabling meaning-based retrieval, and surfacing results with source citations

Together, they cover the full memory lifecycle.

Setting Up the Storage Layer with Hermes

Hermes acts as the memory orchestration layer. Its job is to receive information from Claude Code sessions, store it with proper structure, and manage what gets injected into future sessions.

What Hermes Stores

Each memory entry should capture more than just the raw content. Useful metadata includes:

Session ID: which conversation or work session this came from
Timestamp: when the memory was created
Memory type: decision, code snippet, explanation, user preference, error resolution
Tags: project name, file names, technologies involved
Source: what file, conversation turn, or command produced this memory
Importance: how much weight this memory should carry, as a score from 0 to 1

Without this structure, you end up with a flat pile of text blobs that’s hard to manage or filter.

Structuring Memory Entries

A practical schema for a memory entry looks something like this:

{
  "id": "mem_20240712_001",
  "session_id": "sess_abc123",
  "timestamp": "2024-07-12T14:23:00Z",
  "type": "decision",
  "content": "We're using JWT tokens with 15-minute expiry for access and 7-day refresh tokens stored in httpOnly cookies.",
  "tags": ["auth", "security", "tokens"],
  "source": "src/auth/middleware.ts",
  "project": "ecommerce-api",
  "importance": 0.9
}

The importance score is useful for filtering during injection — high-importance memories (architectural decisions, security choices, key patterns) get injected more aggressively than low-importance ones (minor style notes, one-off fixes).
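As a minimal sketch, an entry in this shape could be built and validated by a small helper. The function name createMemoryEntry, the spelling of the type identifiers, and the ID format are assumptions made for illustration; none of them come from Hermes itself.

```javascript
// Hypothetical helper: builds a memory entry in the shape shown above.
// The type identifiers are assumed spellings of the memory types in this guide.
const MEMORY_TYPES = [
  "decision", "code_snippet", "explanation",
  "user_preference", "error_resolution", "pattern",
];

function createMemoryEntry(fields, now = new Date(), seq = 1) {
  const { sessionId, type, content, tags = [], source, project, importance = 0.5 } = fields;
  if (!MEMORY_TYPES.includes(type)) {
    throw new Error(`Unknown memory type: ${type}`);
  }
  if (typeof content !== "string" || content.trim() === "") {
    throw new Error("Memory content must be a non-empty string");
  }
  // Clamp importance into [0, 1] so injection filters can rely on the range.
  const clamped = Math.min(1, Math.max(0, importance));
  // Assumed ID format, modeled on the example: mem_YYYYMMDD_NNN (UTC date).
  const day = now.toISOString().slice(0, 10).replace(/-/g, "");
  return {
    id: `mem_${day}_${String(seq).padStart(3, "0")}`,
    session_id: sessionId,
    timestamp: now.toISOString(),
    type,
    content: content.trim(),
    tags,
    source,
    project,
    importance: clamped,
  };
}
```

Validating at write time keeps malformed entries out of both stores, which is cheaper than cleaning them up after they have been embedded.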

Capturing Memories During Sessions

There are two approaches to capture: automatic and manual.

Automatic capture hooks into Claude Code’s output stream and uses a secondary model to identify what’s worth saving. After each significant response, a lightweight classification step decides whether to store the content and how to categorize it.

Manual capture gives you explicit control. A simple command — something like /remember [content] — triggers immediate storage. This is more reliable but requires discipline.

In practice, both work best together. Automatic capture catches things you’d forget to save; manual capture lets you flag the decisions that really matter.
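For the manual path, the command text has to be turned into something the storage layer can use. The sketch below parses a /remember line with optional leading flags; the --type= and --importance= flag syntax and the function name parseRememberCommand are assumptions for this example, not an existing Claude Code feature.

```javascript
// Hypothetical parser for a manual capture command such as:
//   /remember --type=decision --importance=0.9 Use JWT with 15-minute expiry
function parseRememberCommand(line) {
  const match = /^\/remember\s+(.+)$/s.exec(line.trim());
  if (!match) return null; // not a /remember command, or no content given

  const options = { type: "explanation", importance: 0.5 }; // assumed defaults
  let rest = match[1];
  // Peel off leading --key=value flags one at a time; unknown keys are ignored.
  const flagPattern = /^--(\w+)=(\S+)\s*/;
  let flag;
  while ((flag = flagPattern.exec(rest)) !== null) {
    const [whole, key, value] = flag;
    if (key === "type") {
      options.type = value;
    } else if (key === "importance") {
      const parsed = Number(value);
      if (!Number.isNaN(parsed)) options.importance = parsed;
    }
    rest = rest.slice(whole.length);
  }
  if (rest.trim() === "") return null; // flags only, nothing to store
  return { ...options, content: rest.trim() };
}
```

The returned object can be passed straight to whatever builds the full memory entry, with the session ID and project filled in by the coordinator.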

The Injection Strategy

When a new session starts, Hermes queries stored memories and selects a subset to inject into Claude’s system prompt or first user message. The selection logic should consider:

Project match — only inject memories tagged to the current project
Recency — newer memories are generally more relevant
Importance — high-importance entries always make the cut
Token budget — never inject more than a set percentage of the available context window (a good default is 20-30%)
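One way to sketch that selection logic is a ranking pass followed by a greedy fill. The name selectWithinBudget matches the helper the coordinator code calls later in this guide, but this body is only one possible implementation, and the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer. It assumes the memories passed in have already been filtered to the current project.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// This is a heuristic assumption; swap in a real tokenizer for accuracy.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Ranks memories by importance (newest first on ties), then adds them
// in that order, skipping any entry that would exceed the token budget.
function selectWithinBudget(memories, tokenBudget) {
  const ranked = [...memories].sort((a, b) =>
    b.importance - a.importance ||
    Date.parse(b.timestamp) - Date.parse(a.timestamp));

  const selected = [];
  let used = 0;
  for (const memory of ranked) {
    const cost = estimateTokens(memory.content);
    if (used + cost > tokenBudget) continue; // too large; try smaller entries
    selected.push(memory);
    used += cost;
  }
  return selected;
}
```

The budget itself is just the context window size multiplied by your chosen fraction. For a hypothetical 100,000-token window and a 20% share, that is Math.floor(100000 * 0.2), or 20,000 tokens.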

The injected memories appear as structured context, clearly labeled so Claude knows they’re retrieved history rather than live information:

--- MEMORY CONTEXT ---
[2024-07-10] DECISION: Authentication uses JWT with 15-min access tokens and 7-day refresh tokens in httpOnly cookies. (Source: auth/middleware.ts)
[2024-07-11] PATTERN: All API errors follow the format {error: string, code: string, details?: object}. (Source: types/errors.ts)
--- END MEMORY CONTEXT ---

This framing helps Claude treat these as reliable background knowledge rather than conversation content.
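Producing that block from stored entries is a small formatting step. The sketch below assumes entries use the field names from the schema earlier in this guide; formatMemoryContext is a name chosen for this example.

```javascript
// Renders selected memories as the labeled block shown above.
// Returns an empty string when there is nothing to inject, so an
// empty MEMORY CONTEXT block never reaches the prompt.
function formatMemoryContext(memories) {
  if (memories.length === 0) return "";
  const lines = memories.map((m) => {
    const date = m.timestamp.slice(0, 10); // YYYY-MM-DD from an ISO 8601 string
    const label = m.type.toUpperCase();
    return `[${date}] ${label}: ${m.content} (Source: ${m.source})`;
  });
  return ["--- MEMORY CONTEXT ---", ...lines, "--- END MEMORY CONTEXT ---"].join("\n");
}
```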

Building Semantic Recall with MemSearch

Hermes handles structured storage and injection. MemSearch handles the other half: finding memories by meaning when you explicitly ask for them.

How Semantic Search Works Here

Every time a memory is stored by Hermes, MemSearch generates a vector embedding of the content using an embedding model. That embedding is stored alongside the memory entry in a vector database.

When you ask a recall question — “how did we handle pagination?” — MemSearch:

Embeds the query using the same embedding model
Performs a similarity search in the vector database
Returns the top-N most semantically similar memories
Includes source citations for each result

This isn’t keyword matching. The search finds memories that are about pagination even when they use different words, such as “cursor-based navigation,” “offset/limit patterns,” or “scroll handling.” That is what makes semantic recall useful.
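To make the similarity step concrete, here is a brute-force sketch using cosine similarity over in-memory records. A real vector database does the same ranking with an index built for speed; the function names and the { id, embedding, payload } record shape are assumptions for this example.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every stored record against the query and keeps the best n.
function topN(queryEmbedding, records, n = 5) {
  return records
    .map((r) => ({
      id: r.id,
      payload: r.payload,
      score: cosineSimilarity(queryEmbedding, r.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n);
}
```

Scoring every record is fine for a few thousand memories. Past that, the approximate indexes in a dedicated vector store are what keep queries fast.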

Choosing an Embedding Model

For a local or low-latency setup, models like text-embedding-3-small (OpenAI) or nomic-embed-text (open source, runs locally) work well. The key requirements are:

Consistent model use across storage and retrieval: if you embed with one model, you must query with the same one
Reasonable embedding dimensions (768–1536 works well for most use cases)
Fast inference: embedding a recall query shouldn’t add more than 200-300ms to the request

Setting Up the Vector Database

Popular options for the vector store include:

Chroma — open source, runs locally, easy to set up
Qdrant — open source, production-ready, good filtering support
Pinecone — managed service, minimal ops overhead
pgvector — if you’re already using PostgreSQL, this avoids adding a new system

For a Claude Code memory system, Chroma or Qdrant running locally is usually the right call. You get low latency and full control, and as long as your embedding model also runs locally, no data is sent to external services.

Source Citations in Recall Results

One of the practical requirements for a useful memory system is knowing where a memory came from. When Claude tells you “we decided to use Redis for session storage,” you want to be able to verify that and trace it back to the original context.

MemSearch handles this by returning the source metadata alongside each result. A recall query returns something like:

Query: "session storage approach"

Result 1 (score: 0.94):
"Redis is used for session storage with 24-hour TTL. Sessions keyed by userId."
Source: src/session/store.ts | Session: 2024-07-08 | Type: decision

Result 2 (score: 0.81):
"Session invalidation happens on logout and password change via Redis DEL."
Source: src/auth/logout.ts | Session: 2024-07-09 | Type: code pattern

This turns recall from a black box into a traceable, auditable process.

Connecting MemSearch and Hermes: The Full Flow

The two systems work together through a simple coordination layer. Here’s the full lifecycle:

Write Path (Storing a Memory)

Claude Code produces output during a session
The capture layer (automatic or manual) identifies content worth saving
Hermes stores the memory entry with full metadata
MemSearch generates an embedding and stores it in the vector database
Both systems now have a reference to the same memory — Hermes for structured retrieval, MemSearch for semantic search

Read Path (Injecting Context at Session Start)

New session begins with a project identifier
Hermes queries structured storage by project, filters by importance and recency, respects token budget
Selected memories are formatted and injected into Claude’s opening context
Session proceeds with relevant history already available

Read Path (Explicit Recall During a Session)

User asks a recall question (“how did we handle X before?”)
MemSearch receives the query, generates an embedding, searches the vector store
Top results returned with source citations
Results injected into the next Claude turn as retrieved context

The two read paths can run simultaneously — automatic injection for session startup, semantic recall for on-demand queries.

Implementation: Putting It Together with Claude Code

Here’s a practical implementation approach using the MindStudio Agent Skills Plugin, which handles the infrastructure layer so you can focus on the memory logic itself.

Prerequisites

Claude Code installed and configured
Node.js 18+ for the coordination layer
A vector database (Chroma recommended for local setup)
The @mindstudio-ai/agent npm package

Step 1: Install the Agent Skills Plugin

npm install @mindstudio-ai/agent

This gives your coordination layer access to MindStudio’s typed capabilities, including search, storage, and workflow execution — without managing API keys or rate limiting yourself.

Step 2: Build the Memory Coordinator

The coordinator is the bridge between Claude Code sessions and your MemSearch/Hermes systems. A minimal version:

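import { agent } from '@mindstudio-ai/agent';

// hermesStore, vectorStore, generateEmbedding, and selectWithinBudget are
// placeholders for your own storage client, vector database client,
// embedding call, and budget selector. Define them before using this code.

async function storeMemory(content, metadata) {
  // Store in Hermes (structured)
  await hermesStore.insert({ content, ...metadata });

  // Store in MemSearch (semantic). The payload carries the content too,
  // so recall results include the memory text and not just its metadata.
  const embedding = await generateEmbedding(content);
  await vectorStore.upsert({ id: metadata.id, embedding, payload: { content, ...metadata } });
}

async function recallByMeaning(query, projectId) {
  // Embed the query with the same model used at write time
  const embedding = await generateEmbedding(query);
  const results = await vectorStore.search({ embedding, filter: { project: projectId }, limit: 5 });
  return results.map(r => ({ ...r.payload, score: r.score }));
}

async function buildSessionContext(projectId, tokenBudget) {
  // Fetch high-importance memories for the project, then trim to the budget
  const memories = await hermesStore.query({ project: projectId, minImportance: 0.7 });
  return selectWithinBudget(memories, tokenBudget);
}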

Step 3: Hook Into Claude Code Sessions

Claude Code supports custom system prompts and pre-session hooks. Use these to call buildSessionContext before each session and inject the formatted memory block.

For explicit recall, you can either:

Add a /recall [query] command that calls recallByMeaning and returns formatted results
Configure a background watcher that monitors conversation turns and triggers recall automatically when certain patterns appear
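Both options come down to deciding whether a conversation turn is a recall request. A minimal sketch, assuming a hypothetical detectRecallQuery function and a few example trigger patterns that you would tune against your own conversations:

```javascript
// Example trigger patterns for automatic recall. These are illustrative
// guesses, not a tested list; expect to adjust them.
const RECALL_PATTERNS = [
  /\bwhat did we (decide|choose|agree)\b/i,
  /\bhow did we (handle|implement|solve|fix)\b/i,
  /\b(last time|previously|before)\b.*\?/i,
];

// Returns { trigger, query } when the turn should run a recall, else null.
function detectRecallQuery(turn) {
  const text = turn.trim();
  // Explicit command: /recall <query>
  const explicit = /^\/recall\s+(.+)$/s.exec(text);
  if (explicit) return { trigger: "command", query: explicit[1].trim() };
  // Implicit: the turn reads like a question about past work.
  if (RECALL_PATTERNS.some((pattern) => pattern.test(text))) {
    return { trigger: "pattern", query: text };
  }
  return null;
}
```

When the result is not null, pass its query to recallByMeaning and add the formatted results to the next turn. Pattern triggers will misfire sometimes, so it helps to log which trigger fired while you tune them.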

Step 4: Set Retention and Pruning Rules

Memory systems get noisy over time. Define pruning rules upfront:

TTL-based: memories older than 90 days drop to low importance unless explicitly pinned
Deduplication: when very similar memories are stored, keep the newer one and update the embedding
Importance decay: memories that are never retrieved lose importance score over time
Project archival: when a project is marked inactive, its memories move to cold storage

This keeps the system useful as it scales up.
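The TTL and decay rules can be sketched as a function you run on a schedule, for example once a day. The pinned and lastRetrievedAt fields, the 0.2 floor, and the 0.9 decay factor are assumptions for this example; deduplication and archival are left out to keep it short.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns a copy of the memory with its importance adjusted.
// - Pinned entries are left alone.
// - Entries older than ttlDays are capped at lowImportance.
// - Younger entries that were never retrieved decay by a fixed factor per run.
function applyRetention(memory, now, options = {}) {
  const { ttlDays = 90, lowImportance = 0.2, decay = 0.9 } = options;
  if (memory.pinned) return { ...memory };

  const ageDays = (now.getTime() - Date.parse(memory.timestamp)) / DAY_MS;
  let importance = memory.importance;
  if (ageDays > ttlDays) {
    importance = Math.min(importance, lowImportance);
  } else if (!memory.lastRetrievedAt) {
    importance = importance * decay;
  }
  return { ...memory, importance };
}
```

Returning a copy rather than mutating the entry makes it easy to log the before and after values while you are still choosing thresholds.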

How MindStudio Fits Into This Architecture

If you’re building this kind of memory system for Claude Code, the biggest operational headache isn’t the logic — it’s the infrastructure: managing rate limits across multiple APIs, handling retries when vector store operations fail, wiring together embeddings, storage, and retrieval without everything breaking when one piece changes.

The MindStudio Agent Skills Plugin addresses exactly this. The @mindstudio-ai/agent SDK gives Claude Code and any other agent runtime access to 120+ typed capabilities as simple method calls, with the infrastructure layer already handled.

For a memory system specifically, this means:

Calling agent.searchGoogle() to pull in external context worth storing
Using agent.runWorkflow() to trigger memory consolidation or summarization pipelines
Handling retries and rate limiting automatically, so your memory coordinator doesn’t need defensive code for every API call

MindStudio’s no-code builder also lets you build the memory management UI — a dashboard for browsing stored memories, adjusting importance scores, or manually pinning critical context — without writing frontend code. Teams that work with Claude Code often want visibility into what’s in memory; MindStudio makes that dashboard a 30-minute build, not a side project.

You can try it free at mindstudio.ai.

Common Mistakes and How to Avoid Them

Injecting Too Much Context

The most common failure mode is greed: storing lots of memory and injecting most of it every session. This crowds out the actual task content, slows session startup, and often degrades Claude’s performance on the immediate work.

Keep injection selective. Cap memory context at 20-30% of the context window, and start near the low end of that range. Within that budget, prioritize importance score over recency.

Mismatched Embedding Models

If you embed at write time with one model and embed queries with another, similarity scores become meaningless. Lock your embedding model and treat any change as a migration that requires re-embedding your entire memory store.
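A cheap safeguard is to record the model name next to every embedding and check it before comparing vectors. This is a sketch under that assumption; makeEmbeddingRecord and assertComparable are hypothetical names, and the model field is not something a vector database adds for you.

```javascript
// Wraps an embedding with the name of the model that produced it.
function makeEmbeddingRecord(id, embedding, model) {
  return { id, embedding, model, dimensions: embedding.length };
}

// Throws if two records cannot be meaningfully compared.
function assertComparable(queryRecord, storedRecord) {
  if (queryRecord.model !== storedRecord.model) {
    throw new Error(
      `Embedding model mismatch: query used ${queryRecord.model}, ` +
      `store used ${storedRecord.model}. Re-embed before searching.`);
  }
  if (queryRecord.dimensions !== storedRecord.dimensions) {
    throw new Error("Embedding dimensions differ");
  }
}
```

Failing loudly here is the point: a mismatch otherwise shows up only as quietly worse recall results.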

No Source Tracking

Memory without provenance is hard to trust. If Claude says “we decided X,” and you can’t verify where that came from, you’re flying blind. Build source citations in from day one — retrofitting them is painful.

Forgetting to Test Recall Quality

It’s easy to build a system that stores things correctly but retrieves the wrong ones. After initial setup, run a set of recall test queries against your real stored memories. If the top results aren’t what you’d expect, tune your embedding model, similarity threshold, or metadata filters before relying on the system for real work.
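A recall test set can be as simple as a list of queries paired with the memory ID each one should surface. The sketch below measures how often the expected memory appears in the top k results; the search argument stands in for your real recall call, and both it and the function name evaluateRecall are assumptions for this example.

```javascript
// Runs each test query through `search` and checks whether the expected
// memory ID appears in the top k results.
// `search` is any async function (query, k) => [{ id, ... }].
async function evaluateRecall(testCases, search, k = 5) {
  const misses = [];
  for (const { query, expectedId } of testCases) {
    const results = await search(query, k);
    const hit = results.slice(0, k).some((r) => r.id === expectedId);
    if (!hit) misses.push(query);
  }
  const total = testCases.length;
  const hitRate = total === 0 ? 0 : (total - misses.length) / total;
  return { hitRate, misses };
}
```

The list of missed queries is usually more useful than the hit rate itself, because it shows which kinds of questions your embeddings or filters handle badly.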

Frequently Asked Questions

What is a hybrid AI memory system?

A hybrid AI memory system combines two complementary approaches: structured storage with metadata filtering (for precise, rule-based retrieval) and semantic vector search (for meaning-based recall). Neither approach alone handles all retrieval needs well. Structured storage can’t find conceptually similar content; vector search loses chronology and structure. Combining them covers the full range of memory access patterns an AI coding agent needs.

How does semantic recall differ from keyword search?

Keyword search finds exact or near-exact text matches. Semantic recall finds content that is about the same thing, even if different words are used. If you stored a memory about “token-based authentication” and query for “how does login work,” semantic search returns the relevant result; keyword search likely misses it. This matters a lot in coding contexts, where the same concept gets described multiple ways across different files and sessions.

Does this work with Claude Code specifically, or any AI coding tool?

The architecture works with any AI coding assistant that accepts a configurable system prompt or pre-session context injection. Claude Code is a good fit because it’s designed for longer-horizon agentic tasks where persistent memory provides the most value. The same pattern applies to multi-agent workflows where multiple agents share a memory pool.

How many memories can the system handle before performance degrades?

Vector search scales well — modern vector databases handle millions of entries with sub-100ms query times. The bottleneck is usually the injection layer: how many tokens of memory you can include in context without hurting Claude’s performance on the actual task. A well-tuned system with 100,000+ memory entries can still inject only the most relevant 20-30 entries per session, keeping context tight.

How do I handle sensitive information in stored memories?

Don’t store credentials, API keys, or personally identifiable information in the memory system. Use environment variables for secrets as you normally would, and configure your capture layer to redact or skip content that matches sensitive patterns before storage. For team environments, also consider access controls on the vector database — who can read or write memories should match your existing permissions model.
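As a rough sketch of the redaction step, the capture layer can run content through a list of patterns before anything is stored. The patterns below are examples only and deliberately aggressive, so they will also catch some harmless text; extend or relax them for the secrets your stack actually uses.

```javascript
// Example patterns only. They favor false positives over leaks.
const SENSITIVE_PATTERNS = [
  { name: "bearer token", regex: /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  {
    name: "key assignment",
    regex: /\b(api[_-]?key|secret|password|token)\b(\s*[:=]\s*)(["']?)[^\s"']+\3/gi,
  },
  { name: "email address", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
];

// Replaces every match with [REDACTED] and reports which patterns fired.
function redactSensitive(text) {
  const found = [];
  let redacted = text;
  for (const { name, regex } of SENSITIVE_PATTERNS) {
    redacted = redacted.replace(regex, () => {
      found.push(name);
      return "[REDACTED]";
    });
  }
  return { redacted, found };
}
```

If found is not empty, you can either store the redacted text or skip the memory entirely, depending on how cautious you want the capture layer to be.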

Can a team share one memory system?

Yes, with some caveats. Shared memory works well for project-level knowledge: architectural decisions, coding patterns, known bugs, established conventions. Personal preferences and individual workflow patterns should stay in user-scoped memory. Tag memories with both a project identifier and a user identifier, then query both namespaces at session start: project memories are injected for everyone, and user memories only for the relevant user.
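A minimal sketch of that two-namespace query, assuming each memory carries a scope field ("project" or "user") and a user field alongside its project tag. Those field names, and the function name memoriesForSession, are illustrative.

```javascript
// Returns the memories one user should see at session start:
// everything shared at project scope, plus that user's own entries.
function memoriesForSession(memories, { project, user }) {
  return memories.filter((m) => {
    if (m.project !== project) return false;
    if (m.scope === "project") return true; // shared with the whole team
    return m.scope === "user" && m.user === user; // private to one person
  });
}
```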

Key Takeaways

Claude Code’s context window limitation is a real constraint for ongoing development work — a persistent memory system directly addresses it
Hermes handles structured storage, metadata management, and smart injection based on importance, recency, and token budget
MemSearch handles semantic recall using vector embeddings, returning results with source citations for traceability
The hybrid approach covers both structured filtering and meaning-based retrieval — neither alone is sufficient
Source citations and importance scoring are non-negotiable from day one; retrofitting them is significantly harder
MindStudio’s Agent Skills Plugin handles the infrastructure layer, letting you focus on memory logic rather than API plumbing

If you’re building with Claude Code and want persistent, intelligent memory without rebuilding the infrastructure from scratch, MindStudio is worth exploring — especially for teams that also need a management interface or want to connect memory workflows to the rest of their tooling.