How to Build an AI Memory System for Claude Code: Storage, Injection, and Recall

Why Claude Code Keeps Forgetting Everything

Claude Code is a capable coding agent. But every time you start a new session, it starts from zero. It doesn’t know that you renamed a core module last week, that you decided against PostgreSQL in favor of SQLite for this project, or that a particular API endpoint has a known race condition you worked around three months ago.

This isn’t a bug — it’s a fundamental property of how large language models work. The Claude Code memory system as it ships out of the box is context-window-scoped. Once the window closes, the knowledge goes with it.

For personal scripts or short tasks, this barely matters. For production codebases worked on across weeks or months, it’s a serious problem. Agents repeat mistakes, ask redundant questions, and make decisions that contradict earlier choices — because they literally can’t remember making those choices.

This guide walks through how to build a proper memory system for Claude Code: one that stores the right information, injects it cleanly into active sessions, and retrieves relevant context without bloating your prompts. We’ll cover two core architectural patterns — Memarch for storage structure, and Hermes for injection and recall — and walk through an implementation you can adapt to your own workflow.

What Claude Code Actually Remembers (And What It Doesn’t)

Before building anything, it helps to understand exactly where memory breaks down.

The Four Memory Gaps

1. Cross-session amnesia. Each Claude Code session starts fresh. There’s no automatic persistence of decisions, file structures, or context from prior sessions unless you manually supply it.

2. Accumulated context decay. Even within a long session, Claude Code works within a finite context window. As the conversation grows, early decisions and context get compacted or pushed out. The agent may effectively forget something it knew an hour ago.

3. No self-updating knowledge. If you fix a bug, rename a function, or change an architectural pattern, Claude Code won’t automatically update its internal model of your codebase. It works from what’s currently in context, not from a living understanding of your project.

4. File-level vs. project-level understanding. Claude Code reads files when you point it at them. But it doesn’t maintain a persistent model of how files relate to each other, what patterns are intentional vs. accidental, or what constraints apply project-wide.

What CLAUDE.md Gives You

Claude Code does support one native memory mechanism: the CLAUDE.md file. This is a markdown file in your project root (or at ~/.claude/CLAUDE.md for global preferences) that Claude Code automatically reads at session start.

It’s useful. But it has real limits:

You have to maintain it manually
It’s static — it can’t update dynamically based on session learnings
It doesn’t support retrieval or ranking — everything in it gets injected, relevant or not
Large CLAUDE.md files eat context budget fast

For small projects with stable patterns, CLAUDE.md is enough. For anything bigger, you need something more structured.

The Three Tiers of Memory

A well-designed Claude Code memory system works across three tiers. Each has different storage characteristics, injection timing, and retrieval logic.

Tier 1: In-Context Memory

This is what’s actively in Claude Code’s context window right now. It’s the fastest to access, requires no retrieval step, but is also the most limited and the most expensive (in tokens).

In-context memory is where you put:

The immediate task description
Relevant code snippets
The last 2–3 significant decisions made this session
Any errors encountered in the current task

You don’t need to build infrastructure for this tier — it’s just the active conversation. What you do need is discipline about what else gets injected here, because every token you spend on stale context is a token you can’t use for the actual problem.

Tier 2: Session Memory

Session memory persists across a single working session — meaning you can retrieve it within the same Claude Code session, but it’s cleared when the session ends (unless you write it somewhere durable).

This is where you store:

Files modified this session and what changed
Decisions made mid-session that should inform later steps
Intermediate outputs (test results, error logs, function signatures)

Session memory typically lives in a lightweight local store: a SQLite file, a JSON file in /tmp, or an in-memory key-value store your agent can write to and read from during its run.
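
As a minimal sketch of the JSON-file option, assuming a throwaway temp directory per session: the SessionMemory class, its put/get methods, and the file name are illustrative names, not part of Claude Code.

```python
import json
import tempfile
from pathlib import Path


class SessionMemory:
    """Tiny JSON-backed key-value store scoped to one working session."""

    def __init__(self, path: Path):
        self.path = path
        # Start from whatever an earlier step in this session already wrote.
        self.data = json.loads(path.read_text()) if path.exists() else {}

    def put(self, key: str, value) -> None:
        self.data[key] = value
        # Rewrite the whole file on every change; fine for small stores.
        self.path.write_text(json.dumps(self.data, indent=2))

    def get(self, key: str, default=None):
        return self.data.get(key, default)


# Hypothetical location: a fresh temp directory that can be discarded afterwards.
session_path = Path(tempfile.mkdtemp()) / "session_memory.json"
session = SessionMemory(session_path)
session.put("files_modified", {"src/auth/service.py": "extracted token refresh helper"})
session.put("decisions", ["Keep UserAuthService stateless"])
```

Because every write goes straight to disk, a later step in the same session can construct a new SessionMemory on the same path and see the earlier entries.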

Tier 3: Persistent Memory

This is the long-term layer. Persistent memory survives across sessions, across days, and across agents. It’s the institutional knowledge of your project.

Persistent memory stores:

Architectural decisions and the reasoning behind them
Known bugs, workarounds, and their status
Naming conventions, patterns, and style rules
Entity definitions (key classes, modules, APIs and what they do)
Project constraints (things you’ve explicitly decided not to do)

Persistent memory requires a proper storage backend and a retrieval strategy — which is where Memarch and Hermes come in.

The Memarch Pattern: Structuring What You Store

Memarch stands for Memory Architecture — a structured approach to classifying what you store before you store it. Without this, long-term memory becomes a disorganized pile of facts that’s expensive to search and easy to misapply.

Memarch organizes memories into four types:

Entity Memories

These describe the key objects in your codebase: modules, classes, APIs, services, databases, and their relationships.

A well-formed entity memory looks like this:

Entity: UserAuthService
Type: Class (Python)
Location: src/auth/service.py
Purpose: Handles JWT issuance, validation, and refresh logic
Dependencies: UserRepository, RedisSessionStore
Notes: Intentionally stateless — session data lives in Redis, not instance
Last updated: 2025-06-10

Entity memories answer the question: “What is this thing and how does it fit in?”

Decision Memories

These capture the why behind choices — the most valuable and most commonly missing type of project knowledge.

Decision: Use SQLite for local dev, PostgreSQL in production
Date: 2025-05-20
Rationale: Dev/prod parity isn't needed for this service; SQLite removes a 
Docker dependency and speeds up local iteration. Migration handled by Alembic.
Do not reverse: Saves ~15 min of setup per new dev machine.

Decision memories prevent future Claude Code sessions (or future you) from relitigating settled questions.

Error Memories

These log known failure modes, gotchas, and their fixes.

Error: Race condition in OrderProcessor.finalize()
Symptom: Duplicate order records created under concurrent requests
Root cause: Missing transaction lock on lines 47-52
Fix: Added SELECT FOR UPDATE before record creation (commit a3f81)
Status: Resolved - do not remove the lock

Error memories are especially valuable for agents because they prevent the same debugging loop from happening twice.

Context Memories

These capture current project state — what phase you’re in, what’s in progress, what’s blocked.

Context: Auth refactor in progress (branch: feat/auth-v2)
Current task: Migrating token validation to new middleware
Blocked: Waiting on security review for PKCE implementation
Next step: After review, remove legacy /api/v1/auth routes

Context memories help Claude Code pick up where you left off without a long re-briefing.
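
One way to keep the four types consistent in code is a small typed record. The sketch below is an assumption about how you might model it: the MemoryType enum, the Memory fields, and the render method are illustrative names, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class MemoryType(str, Enum):
    ENTITY = "entity"
    DECISION = "decision"
    ERROR = "error"
    CONTEXT = "context"


@dataclass
class Memory:
    type: MemoryType
    title: str
    content: str
    tags: list[str] = field(default_factory=list)
    updated: date = field(default_factory=date.today)

    def render(self) -> str:
        """Format the memory as a labeled block for prompt injection."""
        return f"[{self.type.value.upper()}] {self.title}\n{self.content}"


decision = Memory(
    type=MemoryType.DECISION,
    title="Use SQLite for local dev, PostgreSQL in production",
    content="Rationale: SQLite removes a Docker dependency and speeds up local iteration.",
    tags=["database"],
    updated=date(2025, 5, 20),
)
print(decision.render())
```

Restricting the type to an enum means a typo such as "decison" fails at creation time instead of producing a memory that no query ever finds.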

Storage Backends: Where Memories Live

Choosing a storage backend depends on your project’s scale and how sophisticated your retrieval needs to be.

Flat File Storage

The simplest option. Memories are stored as markdown or JSON files in a /memory directory inside your project.

Pros: No dependencies, human-readable, easy to version control.

Cons: No semantic search, retrieval is limited to file name or grep.

Good for small projects or when you want every memory to always be injected (small enough to fit in context).
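
For the inject-everything case, a loader can be very small. This sketch assumes one markdown file per topic; the load_all_memories name and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def load_all_memories(memory_dir: Path) -> str:
    """Concatenate every markdown file in memory_dir into one injectable block."""
    parts = []
    for path in sorted(memory_dir.glob("*.md")):
        # Use the file name (without extension) as a section heading.
        parts.append(f"## {path.stem}\n{path.read_text().strip()}")
    return "\n\n".join(parts)


# Demo with a throwaway directory; in a project this would be the memory directory.
memory_dir = Path(tempfile.mkdtemp())
(memory_dir / "decisions.md").write_text("Use SQLite for local dev.\n")
(memory_dir / "errors.md").write_text("OrderProcessor.finalize() needs its lock.\n")
block = load_all_memories(memory_dir)
print(block)
```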

SQLite with Full-Text Search

A local SQLite database with a full-text search index is a significant step up from flat files. You can query memories by type, by keyword, and by recency.

CREATE TABLE memories (
  id TEXT PRIMARY KEY,
  type TEXT, -- entity, decision, error, context
  content TEXT,
  embedding BLOB, -- optional, for vector search
  created_at DATETIME,
  updated_at DATETIME,
  tags TEXT
);
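-- Keyword index. The unindexed id column lets matches be joined back to memories;
-- rows must be written to both tables to keep the index current.
CREATE VIRTUAL TABLE memories_fts USING fts5(id UNINDEXED, content, tags);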

This setup works well for medium-sized projects and doesn’t require external services.
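
Querying by type and keyword from Python might look like the sketch below. It assumes your SQLite build includes FTS5 (most Python distributions do), uses a trimmed version of the schema, and adds an unindexed id column to the search table so matches can be joined back to the main table. The add_memory and search helpers are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path such as memory/store.db in a project
conn.executescript("""
CREATE TABLE memories (
  id TEXT PRIMARY KEY,
  type TEXT,
  content TEXT,
  tags TEXT
);
CREATE VIRTUAL TABLE memories_fts USING fts5(id UNINDEXED, content, tags);
""")


def add_memory(mem_id: str, mem_type: str, content: str, tags: str) -> None:
    # Write to both tables so the search index stays in step with the data.
    conn.execute("INSERT INTO memories VALUES (?, ?, ?, ?)", (mem_id, mem_type, content, tags))
    conn.execute("INSERT INTO memories_fts VALUES (?, ?, ?)", (mem_id, content, tags))


def search(keywords: str, mem_type: str, limit: int = 5) -> list[str]:
    rows = conn.execute(
        """
        SELECT m.content
        FROM memories_fts
        JOIN memories AS m ON m.id = memories_fts.id
        WHERE memories_fts MATCH ? AND m.type = ?
        ORDER BY rank
        LIMIT ?
        """,
        (keywords, mem_type, limit),
    ).fetchall()
    return [row[0] for row in rows]


add_memory("e1", "error", "Race condition in OrderProcessor.finalize()", "orders concurrency")
add_memory("d1", "decision", "Use SQLite for local dev", "database")
print(search("race", "error"))
```

Sorting by rank orders results by FTS5's built-in relevance score, best match first.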

Vector Database (Chroma, Qdrant, or Pinecone)

For large projects with thousands of stored memories, keyword search isn’t enough. Semantic search — finding memories that are conceptually relevant, not just keyword-matched — requires vector embeddings.

The flow looks like this:

When a memory is created, generate an embedding (using OpenAI’s embedding API or a local model)
Store the embedding alongside the text in a vector database
At retrieval time, embed the current query and find nearest neighbors

This is more infrastructure to manage, but the retrieval quality improvement is significant for complex projects.

The Hermes Pattern: Injection and Recall

Hermes is the pattern for the other half of the problem: how do stored memories get back into Claude Code’s context at the right time?

Named for the role of a messenger — delivering the right information to the right place — the Hermes pattern has three components: query, rank, and inject.

Step 1: Query

Before each Claude Code session (or task), a retrieval query is constructed from:

The current task description
The files being worked on
Any explicit context the user provides

This query is used to search the memory store. For flat file or SQLite systems, this is a keyword search. For vector systems, this is a semantic similarity search.

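def build_query(task: str, files: list[str]) -> str:
    # Reduce each path to its file name so the query stays short.
    file_names = ", ".join(f.split("/")[-1] for f in files)
    return f"{task} | files: {file_names}"

# memory_store, current_task, and open_files are placeholders for your own
# store object and session inputs.
results = memory_store.search(
    query=build_query(current_task, open_files),
    types=["entity", "decision", "error", "context"],
    limit=20
)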

Step 2: Rank

Not all retrieved memories are equally relevant. The ranking step scores and filters the raw search results before injection.

Ranking factors:

Recency — more recent memories are usually more relevant
Type priority — context memories rank highest for session start; error memories rank highest during debugging
Semantic score — the similarity score from vector search (if applicable)
Tag overlap — memories tagged with the current file or module rank higher

After ranking, you select the top N memories that fit within your available context budget. A reasonable rule: allocate no more than 15–20% of your context window to injected memories.
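
The four factors can be combined into a single score. The sketch below is one possible weighting, not a tuned formula: the Candidate class, the weights, the 30-day half-life, and the session-start type priorities are all assumptions to adjust for your project.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Candidate:
    type: str
    content: str
    updated: date
    tags: set[str] = field(default_factory=set)
    semantic_score: float = 0.0  # 0..1 from vector search, 0 if unavailable


# Illustrative session-start priorities; swap the table when debugging.
TYPE_PRIORITY = {"context": 1.0, "error": 0.8, "decision": 0.6, "entity": 0.4}
HALF_LIFE_DAYS = 30


def score(c: Candidate, current_tags: set[str], today: date) -> float:
    age_days = max((today - c.updated).days, 0)
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS
    overlap = len(c.tags & current_tags) / len(current_tags) if current_tags else 0.0
    return (
        0.3 * recency
        + 0.2 * TYPE_PRIORITY.get(c.type, 0.0)
        + 0.3 * c.semantic_score
        + 0.2 * overlap
    )


def rank(candidates: list[Candidate], current_tags: set[str], today: date, top_n: int = 5) -> list[Candidate]:
    return sorted(candidates, key=lambda c: score(c, current_tags, today), reverse=True)[:top_n]


today = date(2025, 6, 15)
candidates = [
    Candidate("entity", "UserAuthService handles JWT issuance", date(2025, 6, 10), {"auth"}),
    Candidate("decision", "Use SQLite for local dev", date(2025, 5, 20), {"database"}),
]
top = rank(candidates, {"auth"}, today, top_n=1)
print(top[0].content)
```

In this example the entity memory wins because it is both recent and tagged with the module being worked on, even though decisions carry a higher type priority.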

Step 3: Inject

Injection is how memories enter Claude Code’s context. There are two main approaches:

Prepend injection — memories are prepended to the system prompt or first user message before the session starts. This is simple and reliable.

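from dataclasses import dataclass

PROJECT_NAME = "your-project"  # placeholder: set to your project's name

@dataclass
class Memory:
    type: str     # entity, decision, error, or context
    content: str

def build_system_prompt(memories: list[Memory]) -> str:
    # Label each memory with its type so decisions and errors stay distinguishable.
    memory_block = "\n\n".join(
        f"[{m.type.upper()}] {m.content}"
        for m in memories
    )
    return f"""You are working on the {PROJECT_NAME} codebase.

## Relevant Context
{memory_block}

## Current Task
...
"""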

Dynamic injection — memories are injected mid-session in response to detected triggers (e.g., when Claude Code starts discussing a topic that has relevant stored memories). This is more complex but preserves context budget for sessions that evolve in unexpected directions.

For most projects, prepend injection is sufficient and much simpler to implement.

Writing Back: Closing the Loop

Hermes isn’t just about reading memories — it’s also about writing new ones. After a session ends (or periodically during a long session), new memories should be extracted and stored.

This can be automated with a post-session script that:

Reviews the session transcript
Extracts candidate memories (decisions made, errors encountered, entities introduced)
Checks for duplicates or conflicts with existing memories
Writes confirmed memories to the store

def extract_memories(transcript: str) -> list[Memory]:
    # Use a lightweight Claude call to extract structured memories
    # from the session transcript.
    # call_claude is a placeholder for your own API wrapper: it should send the
    # prompt plus transcript and parse the JSON reply into Memory objects.
    extraction_prompt = """
    Review this coding session transcript and extract:
    - New decisions made (with reasoning)
    - Errors encountered and their fixes
    - New entities introduced (classes, modules, APIs)
    - Any updates to current context/status

    Return as structured JSON.
    """
    return call_claude(extraction_prompt, transcript)
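
The duplicate check in the list above can start as plain text similarity. This is a sketch under that assumption: is_duplicate, filter_new, and the 0.85 threshold are illustrative, and string matching will only catch near-identical wording, not memories that contradict each other.

```python
from difflib import SequenceMatcher


def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.85) -> bool:
    """Return True if candidate closely matches any stored memory text."""
    normalized = " ".join(candidate.lower().split())
    for text in existing:
        other = " ".join(text.lower().split())
        if SequenceMatcher(None, normalized, other).ratio() >= threshold:
            return True
    return False


def filter_new(candidates: list[str], existing: list[str]) -> list[str]:
    """Keep candidates that are not near-duplicates of stored or earlier ones."""
    accepted: list[str] = []
    for text in candidates:
        if not is_duplicate(text, existing + accepted):
            accepted.append(text)
    return accepted


existing = ["Use SQLite for local dev, PostgreSQL in production"]
candidates = [
    "use SQLite for local dev,  PostgreSQL in production.",
    "Added SELECT FOR UPDATE before record creation",
]
print(filter_new(candidates, existing))
```

Conflicting memories (for example, two decisions about the same database) need a human or a model in the loop; similarity alone will not flag them.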

Building the Full System: A Step-by-Step Implementation

Here’s a practical path from zero to a working memory system.

Step 1: Set Up Your Memory Store

Start with SQLite. Create a memory/ directory in your project root and initialize a database.

mkdir memory
python scripts/init_memory.py

Your init_memory.py creates memory/store.db with the schema shown earlier. Commit this file to version control (it’s project knowledge), but add SQLite’s WAL side files (store.db-wal and store.db-shm) to .gitignore.

Step 2: Seed Initial Memories

Manually create entity memories for your core modules, and decision memories for any choices you’ve already made. This is a one-time cost that pays off immediately — future sessions start knowing your project’s fundamentals.

Write these as simple JSON files, then run an import script to load them into the database.
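
An import script along these lines would do the loading. It is a sketch that assumes one JSON object per file with type, content, and optional tags keys, and uses the file name as the memory id; the import_seeds name and the seed layout are hypothetical.

```python
import json
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def import_seeds(seed_dir: Path, conn: sqlite3.Connection) -> int:
    """Load every *.json file in seed_dir into the memories table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        "id TEXT PRIMARY KEY, type TEXT, content TEXT, embedding BLOB, "
        "created_at DATETIME, updated_at DATETIME, tags TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    count = 0
    for path in sorted(seed_dir.glob("*.json")):
        seed = json.loads(path.read_text())
        # INSERT OR REPLACE makes the import safe to re-run after editing a seed.
        conn.execute(
            "INSERT OR REPLACE INTO memories "
            "(id, type, content, created_at, updated_at, tags) VALUES (?, ?, ?, ?, ?, ?)",
            (path.stem, seed["type"], seed["content"], now, now, ",".join(seed.get("tags", []))),
        )
        count += 1
    conn.commit()
    return count


# Demo with a throwaway directory and an in-memory database.
seed_dir = Path(tempfile.mkdtemp())
(seed_dir / "user-auth-service.json").write_text(json.dumps({
    "type": "entity",
    "content": "UserAuthService: handles JWT issuance, validation, and refresh logic",
    "tags": ["auth"],
}))
conn = sqlite3.connect(":memory:")
print(import_seeds(seed_dir, conn))
```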

Step 3: Create a Session Launcher Script

Replace direct claude invocations with a wrapper script that:

Accepts a task description as input
Queries the memory store for relevant memories
Builds an augmented system prompt with injected memories
Launches Claude Code with that system prompt

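#!/bin/bash
# start-session.sh
# Usage: ./start-session.sh "task description"
set -euo pipefail
TASK="$1"
# Write the recalled memories to a file, then hand that file to Claude Code.
python scripts/recall.py "$TASK" > /tmp/memory_context.md
# Flag names vary across Claude Code versions; confirm with `claude --help`.
claude --system-prompt-file /tmp/memory_context.md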

Step 4: Add Post-Session Memory Extraction

After each significant session, run the extraction script on the session log. Claude Code can export session transcripts — pipe these through your extractor to capture new memories automatically.

Make this a habit. Five minutes of memory maintenance after a session saves significant re-orientation time in future sessions.

Step 5: Iterate on Your Ranking Logic

After a week or two of use, you’ll notice which memories are getting injected but aren’t useful, and which ones you wish were surfaced. Tune your ranking weights accordingly.

The most common adjustments:

Boost entity memories when working files are passed as context
Boost error memories when the task description contains words like “fix,” “debug,” or “broken”
Deprioritize memories older than 30 days unless they’re architectural decisions
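
These three adjustments can be layered on top of whatever base score your ranker produces. In the sketch below the multipliers (1.5 and 0.5) are arbitrary starting points, the adjust function is a hypothetical name, and every memory of type decision is treated as architectural.

```python
from datetime import date

DEBUG_WORDS = {"fix", "debug", "broken"}


def adjust(base_score: float, mem_type: str, updated: date, task: str,
           files: list[str], today: date) -> float:
    """Apply the three common ranking adjustments to a base relevance score."""
    words = {w.strip(".,:;!?\"'").lower() for w in task.split()}
    score = base_score
    if mem_type == "entity" and files:
        score *= 1.5  # working files were passed as context
    if mem_type == "error" and words & DEBUG_WORDS:
        score *= 1.5  # the task reads like a debugging task
    if (today - updated).days > 30 and mem_type != "decision":
        score *= 0.5  # older than 30 days and not a decision
    return score


today = date(2025, 6, 15)
print(adjust(1.0, "error", date(2025, 6, 10), "Fix duplicate orders", [], today))
```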

How MindStudio Fits Into This Architecture

Building and maintaining memory extraction scripts, post-session pipelines, and recall logic is solid engineering work — but it’s also infrastructure that eats time. If you’re running Claude Code across multiple projects or sharing memory systems across a team, MindStudio offers a faster path.

MindStudio’s Agent Skills Plugin is an npm SDK (@mindstudio-ai/agent) that lets any agent — including Claude Code — call pre-built capabilities as simple method calls. Instead of hand-rolling memory storage, extraction, and retrieval, you can wire these up as MindStudio workflows and call them from your session scripts.

A memory workflow in MindStudio might:

Accept a session transcript as input
Run an extraction step with a Claude model to identify new memories
Deduplicate against existing records in Airtable or Notion
Write confirmed memories back to your store

The same workflow can expose a retrieval endpoint your session launcher calls before each Claude Code session. You get the full Memarch + Hermes architecture without managing separate embedding services, database schemas, or extraction prompts from scratch.

For teams, MindStudio’s integrations with Notion, Airtable, and Google Workspace mean your memory store can live somewhere the whole team can inspect and edit — not in a SQLite file only one person can access.

You can try MindStudio free at mindstudio.ai.

Common Mistakes to Avoid

Injecting Too Much

The most common failure mode is injecting every retrieved memory without regard for context budget. If 30% of your context window is filled with memory before the task description arrives, you’ve undermined the system you built.

Keep injected memories to no more than 15–20% of your context window, and less when you can. Rank ruthlessly. More context isn’t always better; relevant context is.

Storing Outputs Instead of Decisions

Don’t store “Claude Code generated a login function” — store “Decided to use bcrypt with cost factor 12 for password hashing, based on OWASP recommendations.” The output is in your codebase. The reasoning is what’s hard to reconstruct later.

Neglecting Memory Maintenance

Stale memories are worse than no memories. A memory that says “Currently refactoring auth module” from three months ago actively misleads future sessions. Add a review step to your monthly project maintenance — archive or update memories that are no longer accurate.
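
A review step can begin with a simple age filter over context memories. This sketch assumes each memory is a dict with type and an ISO-format updated date; the find_stale name and the 60-day cutoff are illustrative.

```python
from datetime import date, timedelta


def find_stale(memories: list[dict], today: date, max_age_days: int = 60) -> list[dict]:
    """Return context memories not updated within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        m for m in memories
        if m["type"] == "context" and date.fromisoformat(m["updated"]) < cutoff
    ]


memories = [
    {"type": "context", "content": "Currently refactoring auth module", "updated": "2025-03-01"},
    {"type": "context", "content": "Auth refactor in progress", "updated": "2025-06-10"},
    {"type": "decision", "content": "Use SQLite for local dev", "updated": "2025-01-15"},
]
for stale in find_stale(memories, today=date(2025, 6, 15)):
    print("Review:", stale["content"])
```

The filter only surfaces candidates; whether to archive or update each one is still a judgment call.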

Skipping the CLAUDE.md Integration

Your CLAUDE.md file and your memory system should work together, not compete. Use CLAUDE.md for truly stable, always-relevant project-wide conventions (coding style, key commands, project overview). Use your memory system for the evolving, task-specific context that changes session to session.

Frequently Asked Questions

Does Claude Code have built-in long-term memory?

No. Claude Code doesn’t persist memory between sessions by default. The only native mechanism is the CLAUDE.md file, which is static and manually maintained. Anything more sophisticated — including cross-session recall, tiered retrieval, or dynamic injection — requires external tooling.

What’s the difference between CLAUDE.md and a proper memory system?

CLAUDE.md is always-on: its entire contents are injected into every session, regardless of relevance. A proper memory system retrieves only what’s relevant to the current task, ranks by importance, and manages context budget. For small projects, CLAUDE.md is sufficient. For large or long-running projects, a retrieval-based system gives you much better signal-to-noise.

How do I prevent memory systems from bloating my context window?

Use a hard token budget for injected memories — something like 2,000–4,000 tokens depending on your model’s context window. Apply ranking to select the most relevant memories within that budget, and summarize long memories rather than injecting them verbatim. Regularly archive stale memories so they’re excluded from retrieval.
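
A hard budget can be enforced with a greedy pass over the ranked list. The sketch below assumes roughly four characters per token, which is only a heuristic; estimate_tokens and fit_to_budget are illustrative names, and a real tokenizer would give tighter numbers.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer if available.
    return max(1, len(text) // 4)


def fit_to_budget(ranked_memories: list[str], budget_tokens: int = 3000) -> list[str]:
    """Take memories in rank order, skipping any that would exceed the budget."""
    selected: list[str] = []
    used = 0
    for text in ranked_memories:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # a shorter, lower-ranked memory may still fit
        selected.append(text)
        used += cost
    return selected


ranked = ["a" * 400, "b" * 4000, "c" * 200]
kept = fit_to_budget(ranked, budget_tokens=200)
print(len(kept))
```

Skipping an oversized memory instead of stopping at it lets smaller, lower-ranked memories use the remaining budget.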

Can I use vector embeddings without a dedicated vector database?

Yes. For small to medium memory stores (under ~10,000 memories), you can store embeddings as binary blobs in SQLite and compute cosine similarity in Python at query time. Libraries like numpy make this fast enough for most projects. You only need a dedicated vector database like Qdrant or Chroma when you’re managing very large memory collections or need sub-100ms retrieval at scale.
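
A dependency-free sketch of that approach follows. It uses the standard library's array module in place of numpy to stay self-contained (numpy would vectorize the same math), and toy three-dimensional vectors stand in for real embeddings. The to_blob, from_blob, cosine, and nearest helpers are illustrative names.

```python
import math
import sqlite3
from array import array


def to_blob(vector: list[float]) -> bytes:
    return array("f", vector).tobytes()  # pack as 32-bit floats


def from_blob(blob: bytes) -> list[float]:
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(conn: sqlite3.Connection, query_vec: list[float], k: int = 3) -> list[tuple[float, str]]:
    # Brute-force scan: score every stored embedding, then keep the top k.
    rows = conn.execute("SELECT content, embedding FROM memories WHERE embedding IS NOT NULL")
    scored = [(cosine(query_vec, from_blob(blob)), content) for content, blob in rows]
    return sorted(scored, reverse=True)[:k]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id TEXT PRIMARY KEY, content TEXT, embedding BLOB)")
conn.execute("INSERT INTO memories VALUES (?, ?, ?)", ("e1", "auth service notes", to_blob([1.0, 0.0, 0.0])))
conn.execute("INSERT INTO memories VALUES (?, ?, ?)", ("e2", "order race condition", to_blob([0.0, 1.0, 0.0])))
print(nearest(conn, [0.9, 0.1, 0.0], k=1)[0][1])
```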

How often should memory extraction run?

After any session that involved meaningful decisions, new entities, or resolved errors. For exploratory sessions (reading code, asking questions), extraction is optional. A good rule: if a session changed something in your codebase, run extraction before you close the terminal.

Is this approach specific to Claude Code, or does it work with other coding agents?

The patterns — Memarch for storage structure, Hermes for injection and recall — are agent-agnostic. You can apply the same architecture to GitHub Copilot Workspace, Cursor, or any agent that accepts system prompts. The implementation details differ (how you inject context, what format memories take), but the underlying approach is transferable.

Key Takeaways

Claude Code has no persistent memory by default — every session starts from scratch unless you build around this limitation.
A tiered memory approach (in-context, session, and persistent) matches different types of information to appropriate storage and retrieval strategies.
The Memarch pattern classifies memories into four types — entity, decision, error, and context — making stored knowledge easier to retrieve and apply correctly.
The Hermes pattern handles injection via query, rank, and inject steps, ensuring only relevant memories consume context budget.
Start simple: SQLite + a session wrapper script is enough to get meaningful memory persistence on most projects.
Memory systems require maintenance — stale or overly broad injection actively degrades agent performance.

For teams looking to skip the infrastructure work and get straight to the workflows, MindStudio offers a no-code way to build, host, and integrate memory pipelines that connect directly to the tools Claude Code agents already use.
