
How to Build an AI Agent with Persistent Memory Using Claude and Milvus

Learn how to give Claude agents multi-layered memory using Milvus vector search and file system tools for retrieval from complex PDF documents.

MindStudio Team

Why AI Agents Forget Everything (And How to Fix It)

Claude is a capable reasoning engine. But out of the box, every conversation starts from scratch. Ask it something today, and tomorrow it has no idea what you discussed. For simple chatbots, that’s fine. For agents that need to accumulate knowledge, track context across sessions, and retrieve information from complex documents, it’s a serious limitation.

Persistent memory solves this. By pairing Claude with Milvus — an open-source vector database built for high-scale similarity search — you can give your agent a multi-layered memory system that persists across conversations, retrieves relevant context on demand, and handles complex PDFs without overwhelming the context window.

This guide walks through how to build that system: what each layer does, how to wire up Milvus with Claude, and how to make document retrieval actually useful for multi-agent workflows.


What “Persistent Memory” Actually Means for AI Agents

Memory in AI agents isn’t one thing. It’s a stack of different layers, each serving a different purpose.

Short-Term Memory (Context Window)

This is what Claude already has: the active conversation window. Claude can hold and reason over a large amount of text in a single session — Claude 3.5 Sonnet supports up to 200K tokens. But the moment the session ends, everything in that window is gone.

Short-term memory is immediate, with no retrieval step, but it doesn't persist. It also has a ceiling: load in a 500-page PDF and you'll hit limits fast.

Long-Term Semantic Memory (Vector Database)


This is where Milvus comes in. Instead of storing raw text, you convert information into vector embeddings — numerical representations of meaning. Milvus indexes these embeddings so you can search by semantic similarity rather than exact keyword match.

When your agent needs to recall something, it embeds the query and retrieves the most relevant chunks from Milvus — even if the exact words don’t match. This is what makes it useful for dense documents like legal contracts, research papers, or technical manuals.

Episodic Memory (Conversation History)

A third layer tracks what happened in past sessions: what the user asked, what the agent responded, what decisions were made. This can be stored as structured records and retrieved alongside semantic search results to give the agent better context about the relationship history.

A well-built agent combines all three layers — pulling from episodic records, retrieving semantic chunks, and reasoning over both within the active context window.


Why Milvus?

Several vector databases are available — Pinecone, Weaviate, Chroma, Qdrant. Milvus is worth considering for a few specific reasons.

Scale without penalty. Milvus is purpose-built for production-scale deployments. It handles billions of vectors efficiently, with support for GPU acceleration and horizontal scaling.

Hybrid search. Beyond pure vector search, Milvus supports combining dense vector search with sparse keyword filtering (BM25). This matters when you need both semantic relevance and exact term matching — common in legal and technical document retrieval.

Flexible deployment. Milvus runs as a fully managed cloud service (Zilliz Cloud) or as a self-hosted instance. For development, Milvus Lite runs entirely in-process with no external dependencies.

Rich filtering. You can attach metadata to each vector — document name, page number, section, date — and filter results before or after retrieval. This is essential when your agent needs to restrict searches to a specific document or time range.


Architecture Overview: The Memory Stack

Before writing any code, it helps to understand the full data flow.

User Query
        ↓
  [Claude Agent]
    ↓                   ↓
[Episodic Store]   [Milvus Query]
    ↓                   ↓
[Retrieved History] [Relevant Chunks]
          ↓          ↓
     [Assembled Context]
              ↓
      [Claude Response]
              ↓
[Store in Episodic + Milvus]

At query time, the agent runs two parallel retrievals: one against episodic memory (structured logs) and one against the vector index in Milvus. The results are merged, trimmed to fit the context window, and passed to Claude with the current query. Claude reasons over this assembled context and returns a response. That response — along with the query — gets logged back to episodic memory and optionally indexed in Milvus for future retrieval.


Step 1: Set Up Milvus

For local development, Milvus Lite is the fastest path. Install it with pip:

pip install "pymilvus[model]"

For a persistent local server, use Docker:

docker run -d --name milvus_standalone \
  -p 19530:19530 \
  -v milvus_data:/var/lib/milvus \
  milvusdb/milvus:latest

Connect to it from Python:

from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")  # Milvus Lite (file-based)
# Or for the Docker instance:
# client = MilvusClient("http://localhost:19530")

Create a Collection

A collection is Milvus’s equivalent of a table. Define it with a schema that includes the vector field and any metadata you want to filter on:

from pymilvus import MilvusClient, DataType

client.create_collection(
    collection_name="agent_memory",
    dimension=1536,  # Match your embedding model's output (1536 for text-embedding-3-small; voyage-3 produces 1024)
    metric_type="COSINE",
    auto_id=True,
)


For more complex schemas with metadata fields (document name, page number, timestamp), use CollectionSchema to define each field explicitly.
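
A hedged sketch of what that could look like, matching the fields the ingestion pipeline below actually writes (field lengths are illustrative; use this in place of the quick create_collection call above):

from pymilvus import CollectionSchema, DataType, FieldSchema

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="doc_name", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="page", dtype=DataType.INT64),
    FieldSchema(name="chunk_index", dtype=DataType.INT64),
]
# enable_dynamic_field lets you attach extra metadata (e.g. a timestamp) without declaring it up front
schema = CollectionSchema(fields, enable_dynamic_field=True)

index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE")

client.create_collection(
    collection_name="agent_memory",
    schema=schema,
    index_params=index_params,
)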


Step 2: Build the Document Ingestion Pipeline

This is where PDFs enter the system. The pipeline does four things: parse the document, chunk it into segments, embed each chunk, and upsert into Milvus.

Parse PDFs

Use pypdf or pdfplumber for text extraction. For PDFs with tables or complex layouts, pdfplumber gives better results:

import pdfplumber

def extract_pdf_text(path: str) -> list[dict]:
    pages = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                pages.append({"page": i + 1, "text": text})
    return pages

Chunk the Text

Don’t embed entire pages. Break text into overlapping chunks of 300–500 tokens. Overlap (typically 50–100 tokens) ensures that sentences split across chunk boundaries don’t lose context:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=80,
    length_function=len,  # counts characters; pass a tokenizer-based function to size chunks in tokens
)

def chunk_pages(pages: list[dict]) -> list[dict]:
    chunks = []
    for page in pages:
        splits = splitter.split_text(page["text"])
        for j, chunk in enumerate(splits):
            chunks.append({
                "page": page["page"],
                "chunk_index": j,
                "text": chunk,
            })
    return chunks

Embed and Insert

Use OpenAI’s text-embedding-3-small (1536 dimensions) or a local model via sentence-transformers. For production, consider Voyage AI’s embeddings — Anthropic recommends them specifically for use with Claude:

from voyageai import Client as VoyageClient

voyage = VoyageClient(api_key="YOUR_VOYAGE_KEY")

def embed_chunks(chunks: list[dict], doc_name: str) -> list[dict]:
    texts = [c["text"] for c in chunks]
    result = voyage.embed(texts, model="voyage-3")
    
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "vector": result.embeddings[i],
            "text": chunk["text"],
            "doc_name": doc_name,
            "page": chunk["page"],
            "chunk_index": chunk["chunk_index"],
        })
    return records

def ingest_pdf(path: str, doc_name: str):
    pages = extract_pdf_text(path)
    chunks = chunk_pages(pages)
    records = embed_chunks(chunks, doc_name)
    client.insert(collection_name="agent_memory", data=records)
    print(f"Inserted {len(records)} chunks from {doc_name}")

Step 3: Implement the Retrieval Layer

At query time, embed the user’s question and search Milvus for the most relevant chunks. Add metadata filters when the user specifies a particular document:

def retrieve(query: str, doc_name: str = None, top_k: int = 5) -> list[dict]:
    result = voyage.embed([query], model="voyage-3")
    query_vector = result.embeddings[0]
    
    filter_expr = f'doc_name == "{doc_name}"' if doc_name else None
    
    results = client.search(
        collection_name="agent_memory",
        data=[query_vector],
        limit=top_k,
        filter=filter_expr,
        output_fields=["text", "doc_name", "page"],
    )
    
    return [
        {
            "text": hit["entity"]["text"],
            "doc": hit["entity"]["doc_name"],
            "page": hit["entity"]["page"],
            "score": hit["distance"],
        }
        for hit in results[0]
    ]

The score field (cosine similarity) lets you filter out low-confidence results — anything below 0.7 is often noise and not worth passing to Claude.


Step 4: Build the Episodic Memory Store

Episodic memory tracks conversation history across sessions. A simple approach uses SQLite with a table for conversation turns:

import sqlite3
import json
from datetime import datetime

def init_episodic_store(db_path: str = "episodic.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS conversations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            session_id TEXT,
            role TEXT,
            content TEXT,
            timestamp TEXT
        )
    """)
    conn.commit()
    return conn

def log_turn(conn, session_id: str, role: str, content: str):
    conn.execute(
        "INSERT INTO conversations (session_id, role, content, timestamp) VALUES (?, ?, ?, ?)",
        (session_id, role, content, datetime.utcnow().isoformat())
    )
    conn.commit()

def get_history(conn, session_id: str, limit: int = 10) -> list[dict]:
    rows = conn.execute(
        "SELECT role, content FROM conversations WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, limit)
    ).fetchall()
    return [{"role": r[0], "content": r[1]} for r in reversed(rows)]


For more advanced use cases, you can also embed episodic memory entries and store them in Milvus alongside document chunks — then retrieve relevant past conversations the same way you retrieve document content.
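
A minimal sketch of that idea, reusing the same collection and embedding model from the earlier steps (the episodic: doc_name prefix is just an illustrative tagging convention, not a Milvus feature):

def index_turn(session_id: str, role: str, content: str):
    # Embed the conversation turn so it can be retrieved like a document chunk
    turn_text = f"{role}: {content}"
    result = voyage.embed([turn_text], model="voyage-3")
    client.insert(
        collection_name="agent_memory",
        data=[{
            "vector": result.embeddings[0],
            "text": turn_text,
            "doc_name": f"episodic:{session_id}",  # tag so searches can be scoped to this session
            "page": 0,
            "chunk_index": 0,
        }],
    )

# Later, retrieve(query, doc_name=f"episodic:{session_id}") surfaces relevant past turns
# through the same search path used for document chunks.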


Step 5: Connect Everything to Claude

Now wire up the retrieval layers with Claude’s API. The key is assembling a well-structured system prompt that includes retrieved context without overwhelming the token budget:

import anthropic

claude = anthropic.Anthropic(api_key="YOUR_CLAUDE_KEY")
conn = init_episodic_store()  # reuse the episodic store from Step 4

def run_agent(query: str, session_id: str, doc_filter: str = None):
    # Retrieve from both memory layers
    semantic_hits = retrieve(query, doc_name=doc_filter, top_k=5)
    past_turns = get_history(conn, session_id, limit=6)
    
    # Filter low-confidence chunks
    relevant_chunks = [h for h in semantic_hits if h["score"] > 0.70]
    
    # Build context block
    context_text = "\n\n".join([
        f"[Source: {h['doc']}, Page {h['page']}]\n{h['text']}"
        for h in relevant_chunks
    ])
    
    system_prompt = f"""You are a helpful assistant with access to a knowledge base.
Use the following retrieved context to answer the user's question accurately.
If the context doesn't contain the answer, say so clearly.

--- RETRIEVED CONTEXT ---
{context_text}
--- END CONTEXT ---
"""
    
    # Build message history
    messages = past_turns + [{"role": "user", "content": query}]
    
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=messages,
    )
    
    answer = response.content[0].text
    
    # Log to episodic store
    log_turn(conn, session_id, "user", query)
    log_turn(conn, session_id, "assistant", answer)
    
    return answer

This gives Claude both the retrieved document context and the recent conversation history — enough to answer follow-up questions intelligently without needing the full document in memory.


Step 6: Handle Complex PDF Documents

Standard text extraction breaks down with complex PDFs: multi-column layouts, embedded tables, scanned pages, headers and footers that pollute chunk text.

Dealing with Tables

pdfplumber can extract tables as structured data. For each table, convert it to a Markdown-formatted string before chunking — this preserves row/column relationships that plain text extraction destroys:

def extract_tables(page) -> str:
    tables = page.extract_tables()
    result = []
    for table in tables:
        if not table:
            continue
        # Keep empty cells as blanks so columns stay aligned
        rows = [" | ".join("" if cell is None else str(cell) for cell in row) for row in table]
        # Insert a Markdown header separator after the first row
        if len(rows) > 1:
            rows.insert(1, " | ".join(["---"] * len(table[0])))
        result.append("\n".join(rows))
    return "\n\n".join(result)

Handling Scanned PDFs

For image-based PDFs, add OCR via pytesseract or a cloud service like AWS Textract. Process each page as an image and feed the OCR output through the same chunking pipeline.
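
A rough sketch of that approach with pdf2image and pytesseract (both libraries, plus the poppler and tesseract system packages, are assumed to be installed):

import pytesseract
from pdf2image import convert_from_path

def extract_scanned_pdf_text(path: str) -> list[dict]:
    pages = []
    # Render each page to an image, then run OCR on it
    for i, image in enumerate(convert_from_path(path, dpi=300)):
        text = pytesseract.image_to_string(image)
        if text.strip():
            pages.append({"page": i + 1, "text": text})
    return pages

# The output matches extract_pdf_text(), so it can feed chunk_pages() unchanged.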

Metadata-Aware Chunking

For long documents with clear section headers, use a hierarchical chunker that respects document structure. LangChain’s MarkdownHeaderTextSplitter works well if you first convert the PDF to Markdown using a tool like marker-pdf.
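
For instance, assuming the PDF has already been converted to a Markdown string, a structure-aware splitter might look like this (the header levels are one possible configuration):

from langchain.text_splitter import MarkdownHeaderTextSplitter

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")]
)

def chunk_markdown(markdown_text: str) -> list[dict]:
    docs = header_splitter.split_text(markdown_text)
    # Each split keeps its header path as metadata, which can be stored in Milvus
    # and filtered alongside doc_name and page
    return [{"text": d.page_content, "metadata": d.metadata} for d in docs]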

The goal is chunks that are semantically coherent — not just arbitrary 400-token windows that cut sentences mid-thought.


Common Mistakes and How to Avoid Them

Skipping chunk overlap. Without overlap, context at chunk boundaries is lost. Always use a 10–20% overlap relative to chunk size.

Ignoring retrieval scores. Not every result from Milvus will be relevant. Set a minimum similarity threshold (0.65–0.75 for cosine) and discard weak matches rather than passing them to Claude as if they were meaningful.

Stuffing too much context. It’s tempting to retrieve 20 chunks and pass them all to Claude. But more context isn’t always better — irrelevant chunks can confuse the model and dilute the signal from relevant ones. Five high-quality chunks usually beat twenty mixed ones.


Not logging to episodic memory. If you don’t persist conversation history, the agent can’t handle follow-up questions like “what did you say earlier about clause 7?” without re-retrieving everything.

Using the wrong embedding model. The embedding model used at ingestion time must match the one used at query time. Mixing models produces meaningless similarity scores.


How MindStudio Fits Into This Stack

Building the above system from scratch is doable for developers comfortable with Python and infrastructure setup. But if you want to deploy this kind of agent — with persistent memory, document retrieval, and Claude as the reasoning layer — as a usable application without managing servers, MindStudio gives you a faster path.

MindStudio’s visual agent builder lets you connect Claude to retrieval workflows, define multi-step reasoning chains, and deploy the result as a web app, API endpoint, or background agent — all without writing the orchestration layer yourself. The platform handles rate limiting, retries, and auth so you can focus on the logic.

For teams building document-heavy agents — contract review, research synthesis, technical support — MindStudio’s multi-agent workflow capabilities let you chain specialized agents together: one for document parsing, one for retrieval, one for Claude-based reasoning, one for formatting output. Each agent does one thing well, and the platform coordinates the handoffs.

You can also use MindStudio’s Agent Skills Plugin if you prefer coding your own agent and just want to offload capabilities like agent.searchGoogle() or agent.sendEmail() without building those integrations yourself.

Try MindStudio free at mindstudio.ai.


Frequently Asked Questions

What is persistent memory in an AI agent?

Persistent memory means the agent retains information across conversations and sessions — not just within a single context window. It typically involves storing past interactions in a database and retrieving relevant history when needed. For document-based agents, it also means indexing large knowledge bases so the agent can retrieve relevant content on demand rather than loading entire documents into the context window each time.

Why use Milvus instead of a simpler vector store like Chroma?

Chroma is fine for small-scale prototyping. Milvus is better suited when you need to handle large document collections (millions of chunks), require hybrid search (combining dense and sparse retrieval), or want production-grade performance with horizontal scaling. Milvus also offers finer-grained filtering on metadata fields, which matters when agents need to scope searches to specific documents or time ranges.

How many chunks should I retrieve per query?

The right number depends on your documents and query complexity, but 3–7 chunks is a practical range for most use cases. Too few and you risk missing relevant context. Too many and you waste tokens on noise. Use similarity score thresholds alongside a fixed top-k to ensure quality — if your top-5 results all score below 0.65, it’s better to tell the user you don’t have relevant information than to hallucinate from weak context.

Can this approach work for multi-agent systems?


Yes — and this is where it gets useful. In a multi-agent workflow, different agents can share the same Milvus collection. A document ingestion agent parses and indexes PDFs; a retrieval agent queries Milvus and assembles context; a reasoning agent (Claude) generates responses. Each agent has a defined role, and the shared vector store acts as the common memory layer they all read from and write to.
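
As a rough illustration, the functions from Steps 2, 3, and 5 map onto those roles; the orchestration below is schematic rather than a full framework:

def ingestion_agent(path: str, doc_name: str):
    # Parses, chunks, embeds, and writes to the shared Milvus collection
    ingest_pdf(path, doc_name)

def retrieval_agent(query: str, doc_name: str = None) -> str:
    # Queries the shared collection and assembles a context block
    hits = retrieve(query, doc_name=doc_name, top_k=5)
    return "\n\n".join(h["text"] for h in hits if h["score"] > 0.70)

def reasoning_agent(query: str, context: str) -> str:
    # Claude reasons over whatever context the retrieval agent assembled
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=f"Answer using only this context:\n{context}",
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text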

What embedding model should I use with Claude?

Anthropic doesn't provide its own embeddings API and points developers to Voyage AI's embedding models, with voyage-3 for general tasks and voyage-3-large for higher accuracy. OpenAI's text-embedding-3-small is a solid alternative if you're already using the OpenAI SDK. Avoid mixing embedding models: always use the same model at ingestion and query time.

How do I handle documents that are updated over time?

Track document versions using metadata fields in Milvus. When a document is updated, delete all vectors with the old document ID and re-ingest the new version. Use Milvus’s delete method with a filter expression: client.delete(collection_name="agent_memory", filter='doc_name == "contract_v1"'). For frequently updated documents, a scheduled ingestion agent that checks for file changes can automate this.
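
A small helper that sketches this flow, building on the ingest_pdf function from Step 2:

def update_document(path: str, doc_name: str):
    # Drop every chunk belonging to the old version, then re-index the new file
    client.delete(
        collection_name="agent_memory",
        filter=f'doc_name == "{doc_name}"',
    )
    ingest_pdf(path, doc_name)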


Key Takeaways

  • Multi-layered memory beats single-layer approaches. Short-term context, semantic vector search, and episodic history each serve different needs. A useful agent uses all three.
  • Milvus is suited for production-scale retrieval. Its hybrid search, metadata filtering, and scalability make it practical for complex document collections.
  • Chunk quality matters more than chunk quantity. Overlapping, structure-aware chunking produces better retrieval results than splitting by token count alone.
  • Always filter by similarity score. Don’t pass every retrieved result to Claude — only include chunks that clear a meaningful relevance threshold.
  • For deployment without infrastructure overhead, MindStudio lets you build and ship Claude-based agents with retrieval workflows through a visual builder, with no server management required.

Building persistent memory into a Claude agent is a meaningful engineering step — but the result is an agent that actually accumulates knowledge, handles complex documents intelligently, and maintains useful context across sessions. That’s the difference between a demo and a tool people can rely on.

Presented by MindStudio
