How to Build a Hybrid AI Memory System: Combining Memarch and Hermes
Learn how to combine Memarch's automatic vector capture with Hermes's curated memory injection for a complete Claude Code memory architecture.
The Memory Problem Every Claude Code User Runs Into
If you’ve used Claude Code for real work, you’ve hit the wall. You spend 20 minutes explaining your codebase structure, naming conventions, and architectural decisions. The session ends. Next time, you start over.
This isn’t a minor inconvenience — it compounds across every session, every developer, every project. And it’s the core reason why building a hybrid AI memory system for Claude Code has become one of the most valuable things you can do for sustained, high-quality AI-assisted development.
The hybrid approach discussed in this guide combines two complementary systems: Memarch, which automatically captures and indexes context through vector embeddings, and Hermes, which handles curated, structured memory injection. Together, they cover the full spectrum of what an AI coding assistant actually needs to remember. Here’s how to build it.
Why Single-Method Memory Falls Short
Most developers who think about AI memory land on one of two approaches: automatic logging or manual notes. Both work, but both fail in predictable ways.
Automatic-only memory captures everything, which sounds great — until your agent is retrieving an outdated architectural decision from six months ago with high confidence because it’s semantically close to a current query. Volume creates noise, and noise degrades retrieval precision.
Manual-only memory (think: a curated CLAUDE.md or a handwritten system prompt) stays clean and accurate, but it requires discipline. Developers don’t update it consistently. It drifts from reality. Key context gets omitted because the developer assumed it was obvious.
The hybrid model solves both problems. Memarch handles the automatic layer — capturing what happened, indexed semantically so it’s retrievable by meaning. Hermes handles the curated layer — injecting what matters most, structured and maintained deliberately. One system learns passively; the other is shaped intentionally. Together, they give Claude Code both breadth and precision.
Understanding Memarch: Automatic Vector Capture
Memarch functions as a persistent vector store that runs alongside your Claude Code sessions. Its job is to observe, encode, and index context automatically — no manual intervention required.
How the Capture Pipeline Works
When you interact with Claude Code, Memarch intercepts meaningful exchanges: code changes, architectural discussions, error resolutions, and decisions made during a session. Each piece of context is chunked, passed through an embedding model (typically a lightweight local model or an OpenAI/Voyage embeddings endpoint), and stored as a vector in a local or hosted vector database.
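The chunking step in that pipeline can be sketched in a few lines. This is a minimal illustration, not Memarch's actual chunker: the `chunk_text` helper, the word-based windows, and the default sizes are all assumptions chosen for readability.

```python
from typing import List


def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    chunks: List[str] = []
    step = max_words - overlap  # how far each new window advances
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a decision that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.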
Common storage backends for Memarch implementations include:
- FAISS (local, fast, free — good for single-developer setups)
- Chroma (local or hosted, with simple APIs and metadata filtering)
- Pinecone or Weaviate (cloud-hosted, better for team environments)
The retrieval side is where it earns its keep. When Claude Code starts a new session or encounters a problem, Memarch runs a semantic similarity search against the indexed store and surfaces relevant past context. This gets injected into Claude’s context window before it begins reasoning.
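To make the similarity search concrete, here is a small pure-Python sketch of ranking stored memories against a query vector. A real vector database does this with an approximate index rather than a full scan; the `top_k` helper and the toy two-dimensional vectors in the usage note are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored (text, vector) pair and return the k most similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with a store of `[("auth fix", [1, 0]), ("db migration", [0, 1]), ("auth refactor", [0.9, 0.1])]` and the query vector `[1, 0]`, the two auth memories rank above the unrelated migration note.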
What Memarch Captures Well
Memarch excels at capturing:
- Implicit patterns — things you do consistently without explicitly stating them (e.g., always using factory functions over class instantiation in a certain module)
- Error resolution history — when a specific bug was solved, how, and what the fix looked like
- Refactoring decisions — why a particular file structure was chosen or changed
- Library and API usage — specific patterns for how your team calls external services
Memarch’s Limitations
The limitation is precision. Because Memarch indexes everything with equal weight, a low-stakes debugging session can pollute retrieval results for high-stakes architectural queries. Retrieval quality degrades as the store grows if you don’t have good chunking strategies and metadata filtering in place.
This is exactly why Hermes exists.
Understanding Hermes: Curated Memory Injection
Hermes takes a different approach. Rather than capturing everything, it gives you explicit control over what Claude Code always knows and how that knowledge is structured.
Think of Hermes as a managed, versioned knowledge layer that sits above your raw session history. It injects memories deliberately — on session start, on context switch, or in response to specific triggers.
The Three Memory Tiers Hermes Manages
A well-implemented Hermes setup typically manages three tiers of curated memory:
1. Project-level constants
These are facts that never change mid-project: the tech stack, the repository structure, the naming conventions, the deployment pipeline. They get injected at session start, every time.
2. Domain knowledge
Business logic, domain-specific terminology, and relationship maps (e.g., “the orders table is the source of truth for fulfillment; the fulfillment_queue table is derived and should never be written to directly”). This layer requires periodic review but changes slowly.
3. Active session context
What the developer is working on right now: the current feature branch, the open issues, the recent decisions. This is injected conditionally based on what Claude Code is about to do.
How Hermes Injects Memory
Hermes typically uses a combination of:
- Structured CLAUDE.md files — placed at the project root and in relevant subdirectories, read automatically by Claude Code at session start
- Tool-injected system prompts — using MCP (Model Context Protocol) servers or custom tool integrations to push structured context into the conversation programmatically
- Conditional injection logic — rules that add memory based on what file is open, what directory is active, or what command was just run
The injection mechanism matters because Claude Code has a finite context window. Hermes is designed to be selective — it injects what’s relevant, not everything it has.
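Conditional injection can be as simple as a table of path patterns. The sketch below is one possible shape, not a Hermes API: the rule table, the file names, and `select_memory_files` are hypothetical placeholders you would adapt to your own layout.

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical rule table: glob pattern on the active file -> extra memory files.
INJECTION_RULES: Dict[str, List[str]] = {
    "api/*": ["rules.md"],
    "models/*": ["domain.md"],
    "tests/*": ["rules.md", "active.md"],
}
# Injected regardless of which file is open.
ALWAYS_INJECT = ["constants.md"]


def select_memory_files(active_file: str) -> List[str]:
    """Return the memory files to inject for the file currently being edited."""
    selected = list(ALWAYS_INJECT)
    for pattern, files in INJECTION_RULES.items():
        if fnmatch(active_file, pattern):
            for name in files:
                if name not in selected:  # keep order, avoid duplicates
                    selected.append(name)
    return selected
```

Keeping the rules in data rather than in branching code makes them easy to review alongside the memory files themselves.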
Designing the Hybrid Architecture
Now that both systems are clear, here’s how they fit together structurally.
┌─────────────────────────────────────────────────────┐
│                 Claude Code Session                 │
├─────────────────────────────────────────────────────┤
│  HERMES LAYER (always-on, curated injection)        │
│    → Project constants                              │
│    → Domain knowledge                               │
│    → Active context                                 │
├─────────────────────────────────────────────────────┤
│  MEMARCH LAYER (semantic retrieval, as-needed)      │
│    → Relevant past decisions                        │
│    → Similar error resolutions                      │
│    → Matching code patterns                         │
├─────────────────────────────────────────────────────┤
│  Claude reasoning + code generation                 │
└─────────────────────────────────────────────────────┘
The Hermes layer is deterministic: Claude always gets this context. The Memarch layer is conditional: Claude gets retrieved context only when a stored memory clears the semantic similarity threshold.
This split is intentional. You don’t want Memarch’s retrieved memories competing with or contradicting Hermes’s curated facts. By injecting Hermes memories first and Memarch memories second (with clear labeling), Claude can treat curated facts as authoritative and retrieved memories as supporting context.
Step-by-Step: Building the Hybrid System
Step 1: Set Up Your Vector Store (Memarch Layer)
Start with a local Chroma instance for simplicity. Install the Python client and create a collection for your project:
pip install chromadb sentence-transformers
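Create a basic capture function that embeds a piece of session content and stores it with its metadata:
import hashlib

import chromadb
from sentence_transformers import SentenceTransformer

# PersistentClient writes to disk; chromadb.Client() keeps data in memory only.
# The ".memarch" path is a placeholder.
client = chromadb.PersistentClient(path=".memarch")
# Use cosine distance so similarity thresholds are meaningful (Chroma defaults to L2).
collection = client.get_or_create_collection(
    "memarch_project", metadata={"hnsw:space": "cosine"}
)
model = SentenceTransformer("all-MiniLM-L6-v2")

def capture_memory(content: str, metadata: dict) -> None:
    embedding = model.encode(content).tolist()
    # Built-in hash() differs between runs for strings, so derive a stable ID instead.
    memory_id = "mem_" + hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    # upsert() replaces an entry with the same ID, so re-capturing identical content is safe.
    collection.upsert(
        documents=[content],
        embeddings=[embedding],
        metadatas=[metadata],
        ids=[memory_id],
    )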
Set up metadata tagging from the start. At minimum, tag each memory with:
- session_date
- file_path (if relevant)
- memory_type (decision, error_fix, pattern, discussion)
- confidence (high/medium/low; set this manually for important memories)
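One way to enforce these tags is to build every metadata dict through a single validating helper. This is a sketch under assumptions: `build_metadata` and its defaults are hypothetical, and the allowed values simply mirror the tag list above.

```python
from datetime import date
from typing import Dict, Optional

MEMORY_TYPES = {"decision", "error_fix", "pattern", "discussion"}
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def build_metadata(
    memory_type: str,
    confidence: str = "medium",
    file_path: Optional[str] = None,
    session_date: Optional[date] = None,
) -> Dict[str, str]:
    """Validate the tags and return a flat dict of string values."""
    if memory_type not in MEMORY_TYPES:
        raise ValueError(f"unknown memory_type: {memory_type!r}")
    if confidence not in CONFIDENCE_LEVELS:
        raise ValueError(f"unknown confidence: {confidence!r}")
    metadata = {
        "session_date": (session_date or date.today()).isoformat(),
        "memory_type": memory_type,
        "confidence": confidence,
    }
    # Leave file_path out entirely when it is not relevant, because
    # Chroma metadata values cannot be None.
    if file_path:
        metadata["file_path"] = file_path
    return metadata
```

Rejecting unknown values at capture time is cheaper than trying to clean up inconsistent tags once the store has grown.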
Step 2: Configure Retrieval with Filtering
Don’t retrieve blindly. Use metadata filters to surface only relevant memories:
from typing import List, Optional

def retrieve_memories(query: str, file_context: Optional[str] = None, n_results: int = 5) -> List[str]:
    query_embedding = model.encode(query).tolist()
    # Restrict the search to memories tagged with the current file, if one is given.
    where_filter = {"file_path": {"$eq": file_context}} if file_context else None
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=where_filter,
    )
    # query() returns one result list per query embedding; we sent exactly one.
    return results["documents"][0]
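Set a similarity threshold. The function above returns the top n_results no matter how weak the match is, so filter on the distances Chroma returns before injecting anything. Results below 0.75 cosine similarity often add more noise than signal; it’s better to return nothing than to inject irrelevant context.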
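A threshold filter might look like the sketch below. It assumes the collection was created with cosine distance (Chroma's default metric is L2), in which case similarity is 1 minus the returned distance, and that the query result includes its `distances` list. The `filter_by_similarity` name is hypothetical.

```python
from typing import Any, Dict, List


def filter_by_similarity(results: Dict[str, Any], min_similarity: float = 0.75) -> List[str]:
    """Keep only documents whose cosine similarity meets the threshold.

    `results` is shaped like a Chroma query result for a single query:
    {"documents": [[...]], "distances": [[...]]}.
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    kept: List[str] = []
    for doc, distance in zip(documents, distances):
        similarity = 1.0 - distance  # only valid for cosine distance
        if similarity >= min_similarity:
            kept.append(doc)
    return kept
```

An empty list is a valid outcome here: it means nothing in the store was close enough to be worth injecting.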
Step 3: Build Your Hermes Knowledge Files
Create a structured directory for your curated memory files:
.hermes/
├── constants.md # Tech stack, repo structure, conventions
├── domain.md # Business logic, domain models
├── active.md # Current sprint context (update weekly)
└── rules.md # Hard constraints Claude must follow
Each file should be concise and structured. Here’s an example constants.md:
# Project Constants
## Tech Stack
- Backend: Python 3.11, FastAPI
- Database: PostgreSQL 15, SQLAlchemy ORM
- Frontend: React 18, TypeScript, TailwindCSS
- Deployment: Docker + AWS ECS, CI via GitHub Actions
## Code Conventions
- Use snake_case for all Python variables and functions
- React components: PascalCase, one component per file
- All database queries go through repository classes — never write raw SQL in routes
- Every async function must have error handling with structured logging
## Repository Structure
- `/api` — FastAPI routes and middleware
- `/services` — Business logic layer
- `/models` — SQLAlchemy models
- `/repositories` — Data access layer
- `/tests` — pytest test suite
This gets injected at session start, every time. Claude Code doesn’t need to infer any of this — it’s told directly.
Step 4: Create the Injection Orchestrator
Write a simple orchestrator that runs before each Claude Code session and assembles the full context injection:
import os
from typing import Optional

def build_context_injection(query: Optional[str] = None, current_file: Optional[str] = None) -> str:
    hermes_context = load_hermes_files()
    memarch_context = []
    if query:
        memarch_context = retrieve_memories(query, file_context=current_file)
    # Curated Hermes context goes first so Claude treats it as authoritative.
    injection = "## Project Context (Hermes)\n\n"
    injection += hermes_context
    # Retrieved memories follow under their own label as supporting context.
    if memarch_context:
        injection += "\n\n## Relevant Past Context (Memarch)\n\n"
        for memory in memarch_context:
            injection += f"- {memory}\n"
    return injection

def load_hermes_files() -> str:
    hermes_dir = ".hermes"
    combined = ""
    # Missing files are skipped, so a project can start with constants.md alone.
    for filename in ["constants.md", "domain.md", "active.md", "rules.md"]:
        filepath = os.path.join(hermes_dir, filename)
        if os.path.exists(filepath):
            with open(filepath, "r", encoding="utf-8") as f:
                combined += f.read() + "\n\n"
    return combined
Step 5: Wire It Into Claude Code via MCP
The cleanest integration point is an MCP (Model Context Protocol) server. This lets you expose your hybrid memory system as a tool that Claude Code can call natively.
Create a simple MCP server with two tools:
- get_project_context: returns the full Hermes injection
- retrieve_relevant_memory: takes a query and returns Memarch results
Register both tools in your Claude Code configuration. Once active, Claude can call these tools automatically at session start or when it needs to recall past context.
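A server exposing those two tools could be sketched as follows, assuming the official MCP Python SDK (`pip install mcp`) and its `FastMCP` helper. The `.hermes` layout matches Step 3; `memory_backend` is a hypothetical hook where you would plug in your own Memarch retrieval function, and the server name is a placeholder.

```python
import os
from typing import Callable, List

HERMES_DIR = ".hermes"
HERMES_FILES = ["constants.md", "domain.md", "active.md", "rules.md"]


def _no_memories(query: str) -> List[str]:
    """Default backend: returns nothing until a real retriever is plugged in."""
    return []


# Hypothetical hook: assign your own retrieval function here.
memory_backend: Callable[[str], List[str]] = _no_memories


def get_project_context() -> str:
    """Return the curated Hermes files that exist on disk, concatenated."""
    parts = []
    for name in HERMES_FILES:
        path = os.path.join(HERMES_DIR, name)
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                parts.append(f.read())
    return "\n\n".join(parts)


def retrieve_relevant_memory(query: str) -> List[str]:
    """Return past memories relevant to the query."""
    return memory_backend(query)


def serve() -> None:
    """Register both functions as MCP tools and serve over stdio."""
    from mcp.server.fastmcp import FastMCP  # imported lazily; needs the mcp package

    server = FastMCP("hybrid-memory")
    server.tool()(get_project_context)
    server.tool()(retrieve_relevant_memory)
    server.run()
```

Call `serve()` from the script's entry point. Keeping the tool functions as plain Python means you can unit-test them without starting a server.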
Step 6: Automate Memory Capture Post-Session
Close the loop by capturing what happened during each session. Set up a post-session hook (or a simple script you run manually) that:
- Extracts key decisions and resolutions from the session log
- Tags them with appropriate metadata
- Adds them to the Memarch vector store
You can automate this extraction step using a lightweight summarization call — pass the session log to a fast model with a prompt like: “Extract the 3-5 most important technical decisions or problem resolutions from this session. Format as bullet points.”
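The extraction step can be sketched without committing to a particular model API. Here `summarize` is any callable that sends a prompt to a model and returns its text (a hypothetical stand-in for your own client), and the default tags are assumptions: automatically extracted items start at low confidence until someone reviews them.

```python
from typing import Callable, Dict, List

EXTRACTION_PROMPT = (
    "Extract the 3-5 most important technical decisions or problem "
    "resolutions from this session. Format as bullet points."
)


def parse_bullets(summary: str) -> List[str]:
    """Pull the text out of lines that start with a bullet marker."""
    items = []
    for line in summary.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- ", "* ", "• ")):
            items.append(stripped[2:].strip())
    return items


def extract_memories(
    session_log: str,
    summarize: Callable[[str], str],
    session_date: str,
) -> List[Dict[str, object]]:
    """Summarize a session log and turn each bullet into a taggable memory record."""
    summary = summarize(f"{EXTRACTION_PROMPT}\n\n{session_log}")
    return [
        {
            "content": item,
            "metadata": {
                "session_date": session_date,
                "memory_type": "decision",  # assumed default; refine on review
                "confidence": "low",        # assumed default for unreviewed items
            },
        }
        for item in parse_bullets(summary)
    ]
```

Each returned record can then be passed to your capture function, content first and metadata second.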
Managing Memory Quality Over Time
Both stores require maintenance. Without it, quality degrades.
Keeping Memarch Clean
- Set a TTL (time-to-live) for low-confidence memories. Anything tagged confidence: low should expire after 90 days unless promoted.
- Deduplicate regularly. Similar memories accumulate fast. Run a periodic dedup job that clusters similar embeddings and keeps only the most recent or highest-confidence version.
- Manual review cadence. Once a month, scan the top-retrieved memories and remove anything outdated or incorrect.
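The first two maintenance jobs can be sketched in plain Python. This is a simplification under assumptions: each memory is a dict with `session_date` (ISO string), `confidence`, and `embedding` keys, and the dedup pass is a greedy newest-first scan rather than true clustering. The 0.95 similarity cutoff is an illustrative value, not a recommendation from any library.

```python
import math
from datetime import date, timedelta
from typing import Any, Dict, List, Sequence


def is_expired(memory: Dict[str, Any], today: date, ttl_days: int = 90) -> bool:
    """True for low-confidence memories older than the TTL."""
    if memory["confidence"] != "low":
        return False
    captured = date.fromisoformat(memory["session_date"])
    return today - captured > timedelta(days=ttl_days)


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(memories: List[Dict[str, Any]], threshold: float = 0.95) -> List[Dict[str, Any]]:
    """Keep the newest memory of each near-duplicate group."""
    kept: List[Dict[str, Any]] = []
    # ISO dates sort correctly as strings, so newest comes first.
    newest_first = sorted(memories, key=lambda m: m["session_date"], reverse=True)
    for memory in newest_first:
        if all(_cosine(memory["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(memory)
    return kept
```

The greedy scan compares each memory against everything already kept, which is fine for a single project's store but would need an index-backed approach at larger scale.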
Keeping Hermes Accurate
- Treat Hermes files like code. Version-control them in your repository. Require PR review for changes to constants.md and domain.md.
- Review active.md weekly. This file should reflect what’s actually being worked on right now, not last sprint’s priorities.
- Add an expiry annotation to anything time-bound:
<!-- Expires: 2025-03-01 -->
The checkout flow is under active refactoring. Avoid modifying
`/api/routes/checkout.py` without checking with the team first.
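Annotations in this form are easy to check mechanically. The sketch below scans a file's text for expired `Expires` comments; the `find_expired_annotations` helper is hypothetical, and it only recognizes the exact comment format shown above.

```python
import re
from datetime import date
from typing import List, Tuple

# Matches comments such as <!-- Expires: 2025-03-01 -->
EXPIRES_PATTERN = re.compile(r"<!--\s*Expires:\s*(\d{4}-\d{2}-\d{2})\s*-->")


def find_expired_annotations(text: str, today: date) -> List[Tuple[int, date]]:
    """Return (line_number, expiry_date) for every annotation already past."""
    expired = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        match = EXPIRES_PATTERN.search(line)
        if match:
            expires_on = date.fromisoformat(match.group(1))
            if expires_on < today:
                expired.append((line_number, expires_on))
    return expired
```

Running a check like this in CI, or as part of the weekly review, turns stale notes into a visible failure instead of silent drift.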
Common Mistakes to Avoid
Injecting too much context. Claude Code has a context window. If your Hermes files and Memarch results together exceed what fits alongside the actual code you’re working on, retrieval quality and reasoning quality both drop. Keep Hermes files under 2,000 tokens total and limit Memarch to 5-7 retrieved chunks max.
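A rough budget check can catch oversized injections before they reach Claude. This sketch assumes about 4 characters per token, which is only a heuristic for English text (use a real tokenizer if you need accuracy). The 1,500-token allowance for retrieved chunks is an illustrative assumption, and both helper names are hypothetical.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def check_hermes_budget(hermes_text: str, limit: int = 2000) -> bool:
    """True if the curated files fit within the token limit."""
    return estimate_tokens(hermes_text) <= limit


def trim_memories(memories: List[str], max_chunks: int = 7, token_budget: int = 1500) -> List[str]:
    """Keep retrieved chunks in rank order until either limit is reached."""
    kept: List[str] = []
    used = 0
    for memory in memories[:max_chunks]:
        cost = estimate_tokens(memory)
        if used + cost > token_budget:
            break  # stop at the first chunk that would overflow the budget
        kept.append(memory)
        used += cost
    return kept
```

Because retrieval results arrive ranked by similarity, trimming from the end drops the weakest matches first.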
Letting Memarch contradict Hermes. If a past session captured an outdated architectural decision that conflicts with your current constants.md, Claude will get confused. When you change something major in Hermes, clean up conflicting Memarch memories explicitly.
Skipping metadata on capture. Untagged memories are nearly impossible to filter or clean later. Enforce metadata at capture time, not retroactively.
Treating this as a set-and-forget system. The hybrid system is a tool, not a solution. It needs regular human review to stay useful.
Where MindStudio Fits Into This Architecture
Building the hybrid system described above requires wiring together a vector store, an MCP server, file-based memory management, and capture automation. That’s several moving parts — manageable for a single developer, but harder to scale across a team.
This is where MindStudio’s Agent Skills Plugin becomes useful. The @mindstudio-ai/agent npm SDK lets Claude Code (and other agents) call managed capabilities as simple method calls, handling the infrastructure layer so your agents focus on reasoning.
Rather than building and hosting your own vector retrieval pipeline, you can expose your hybrid memory architecture as a MindStudio workflow and call it from Claude Code via the Agent Skills Plugin:
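// Assumed named export from the SDK; check the package docs for the exact import.
import { MindStudioAgent } from "@mindstudio-ai/agent";

const agent = new MindStudioAgent();

// Retrieve hybrid context for a new session
const context = await agent.runWorkflow("hybrid-memory-retrieval", {
  query: currentQuery,
  project: projectName,
  currentFile: openFile
});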
MindStudio’s visual workflow builder lets you design the orchestration logic — which Hermes files to load, how to filter Memarch results, how to format the injection — without rewriting code every time requirements change. If you want to add a new memory tier, adjust retrieval thresholds, or swap out your vector store backend, you update the workflow visually rather than modifying and redeploying code.
For teams using Claude Code across multiple projects, MindStudio also makes it straightforward to share memory infrastructure without duplicating setup. You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is a hybrid AI memory system for Claude Code?
A hybrid AI memory system combines two types of memory: automatic semantic capture (as Memarch does) and deliberate curated injection (as Hermes provides). Automatic memory captures what happened in past sessions using vector embeddings and retrieves relevant context by similarity. Curated memory stores structured facts (conventions, domain knowledge, active context) and injects them reliably at session start. Together, they give Claude Code both broad recall and precise, authoritative knowledge.
How is Memarch different from just saving chat logs?
Chat logs are unstructured and unindexed. To use them, you’d have to manually search through text or paste them into a context window, which is impractical at scale. Memarch converts session content into vector embeddings, which enables semantic search — meaning you can retrieve past context based on meaning and relevance rather than keyword matching. It also stores metadata alongside each memory, enabling filtered retrieval.
Do I need to use both Memarch and Hermes, or can I pick one?
You can use either independently, and both are useful on their own. But they address different failure modes. Memarch alone captures too much and can surface outdated or irrelevant memories. Hermes alone stays accurate but misses implicit patterns and history. The hybrid approach is recommended for any project longer than a few days where consistency and recall both matter.
How much does running this system cost?
The core components — Chroma for vector storage, a lightweight local embedding model like all-MiniLM-L6-v2, and file-based Hermes storage — cost nothing to run locally. The main cost is the embedding API if you use a hosted model (e.g., OpenAI’s text-embedding-3-small costs roughly $0.02 per million tokens, which is minimal for typical session volumes). Cloud vector stores like Pinecone add cost at scale but aren’t required for individual developers or small teams.
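To put the hosted-embedding figure in perspective, here is the arithmetic for a made-up workload. The price is the rough figure quoted above; the session count and tokens per session are hypothetical, so substitute your own numbers.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # rough figure quoted above


def embedding_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Cost of embedding the given number of tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 20 sessions a month, about 5,000 captured tokens each.
monthly_tokens = 20 * 5_000
monthly_cost = embedding_cost_usd(monthly_tokens)
print(f"{monthly_tokens:,} tokens -> ${monthly_cost:.4f} per month")
```

At that volume the embedding bill is a fraction of a cent per month, which is why the local-versus-hosted choice is usually about privacy and latency rather than price.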
How do I handle memory across multiple Claude Code projects?
Use separate Chroma collections (or namespaces in a cloud vector store) per project for Memarch, and separate .hermes/ directories per project repository. Each project should have its own retrieval context — cross-project memory retrieval rarely helps and usually adds noise. If you share common knowledge across projects (e.g., company-wide conventions), maintain a shared Hermes file that gets included in each project’s injection alongside the project-specific files.
Can this work with AI agents other than Claude Code?
Yes. The underlying architecture — a vector store for semantic retrieval combined with structured memory injection — is model-agnostic. Memarch-style capture and Hermes-style injection can work with any agent that accepts context injection via system prompts, MCP tools, or similar mechanisms. The specific integrations will differ, but the memory management principles apply regardless of which model or framework you’re using.
Key Takeaways
- Memarch handles automatic context capture via vector embeddings — indexing past sessions and retrieving relevant history by semantic similarity.
- Hermes handles curated memory injection — structured, deliberate, versioned knowledge that Claude Code gets reliably on every session.
- The hybrid architecture layers both systems: Hermes provides authoritative context; Memarch provides supporting history. Neither replaces the other.
- Memory quality requires active maintenance — deduplication, TTL on stale memories, and regular review of Hermes files keep the system accurate over time.
- MCP servers are the cleanest integration point for exposing this system to Claude Code natively.
- MindStudio’s workflow builder can handle the orchestration layer, making it easier to manage and update the system across teams and projects.
A well-built hybrid memory system doesn’t just save time — it changes the quality of what Claude Code can help you build. When the agent actually remembers your codebase, your decisions, and your constraints, it spends less time getting oriented and more time doing useful work.