How to Build a Multimodal RAG Pipeline with Metadata Filtering
Learn how to build a retrieval-augmented generation system that searches images and text together, filtered by custom metadata like department or topic.
Why Standard RAG Falls Short for Mixed-Content Knowledge Bases
Most teams building retrieval-augmented generation systems start with text. They chunk PDFs, embed the chunks, store them in a vector database, and call it done. That works well until someone asks a question about a product diagram, a scanned invoice, a marketing asset, or anything that lives primarily in image form.
That’s where a multimodal RAG pipeline comes in. By combining image and text retrieval in a single system — and layering metadata filtering on top — you can build knowledge bases that actually reflect how information exists in the real world: mixed, messy, and spread across formats.
This guide walks through how to design and build a multimodal RAG pipeline with metadata filtering from scratch. You’ll learn how to structure your data, choose the right embedding strategy, set up filtering logic, and connect it all to a multimodal language model like Gemini. By the end, you’ll have a working mental model — and a practical roadmap — for building this system yourself.
What Multimodal RAG Actually Means
Standard RAG is a two-step process: retrieve relevant content from a knowledge base, then pass it to an LLM to generate a response. The retrieval step typically uses vector search — text gets converted to embeddings, and the closest matches to your query are returned.
Multimodal RAG extends this to work across different content types — most commonly text and images, but also video frames, audio transcripts, and structured data.
How the Retrieval Side Changes
With text-only RAG, every document and every query gets turned into a single kind of embedding using a text encoder. With multimodal RAG, you have two options:
Option 1: Separate embedding spaces. Use a text encoder for text chunks and an image encoder for images. Store both in the same vector database, but run retrieval separately and merge results.
Option 2: A shared embedding space. Use a model like CLIP or Gemini’s multimodal embedding API to encode both text and images into the same vector space. A text query can then retrieve both text chunks and images based on semantic similarity.
The shared embedding space approach is generally cleaner for search quality, because a text query like “Q3 revenue chart broken down by region” can directly surface an image of that chart — not just documents that describe it.
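If you want to prototype the shared-space approach locally, here is a minimal sketch using the CLIP checkpoint exposed through sentence-transformers. The model name and library are assumptions for illustration; Gemini's multimodal embedding API plays the same role in a managed setup.

```python
# Minimal sketch of a shared text/image embedding space using CLIP via
# sentence-transformers. Model choice and file names are illustrative only.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Text chunks and images land in the same vector space
text_vectors = model.encode(["Q3 revenue grew 12% quarter over quarter."])
image_vectors = model.encode([Image.open("q3_revenue_chart.png")])

# A text query can now score directly against image embeddings
query_vector = model.encode(["Q3 revenue chart broken down by region"])
print(util.cos_sim(query_vector, image_vectors))
```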
How the Generation Side Changes
On the generation side, you need a model capable of processing both text and images as inputs. Gemini 1.5 Pro and Gemini 2.0 Flash are strong choices here — both can accept text chunks, image files, or a mix of both as context and generate coherent, grounded responses. GPT-4o and Claude 3.5 Sonnet also support this pattern.
The key difference from standard RAG: instead of passing retrieved text chunks to the LLM, you pass a combination of text snippets and image files. The model synthesizes across both.
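As a concrete illustration, here is a minimal sketch of that call using the google-generativeai Python SDK; the file name and the retrieved snippet are placeholders.

```python
# Sketch: passing retrieved text and a retrieved image to Gemini in one call.
# The snippet and file name are placeholders for real retrieval results.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

response = model.generate_content([
    "Context retrieved from the knowledge base:",
    "Q3 revenue grew 12% quarter over quarter, driven by EMEA.",  # retrieved text chunk
    Image.open("q3_revenue_chart.png"),                           # retrieved image
    "Question: How did Q3 revenue break down by region?",
])
print(response.text)
```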
Why Metadata Filtering Is Essential at Scale
Semantic similarity alone isn’t enough when your knowledge base grows. If you have 50,000 items across multiple departments, time periods, document types, and topics, a pure vector search will surface the “most similar” results — but those results might span every department, format, and time period in ways that don’t match what the user actually needs.
Metadata filtering solves this by constraining the search space before (or after) the vector search runs.
What Metadata Fields Look Like in Practice
Metadata is structured information attached to each item in your vector store. For a multimodal knowledge base, typical fields include:
- content_type — "text", "image", "table", "chart", "slide"
- department — "engineering", "marketing", "finance", "legal"
- topic — "product_roadmap", "q3_earnings", "brand_guidelines"
- source — "confluence", "google_drive", "sharepoint", "figma"
- date_created — ISO 8601 timestamp
- access_level — "public", "internal", "confidential"
- language — "en", "de", "fr"
When a user submits a query, the system can apply filters before running the vector search. For example: only search items where department = "finance" and date_created > 2024-01-01 and content_type in ["text", "chart"].
This dramatically improves precision, reduces noise, and also enables access control — you can scope retrieval to what a given user is allowed to see.
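As a concrete example, the filter described above could be expressed like this in Pinecone-style metadata filter syntax. Other databases use their own operators, and the epoch timestamp is needed because Pinecone's range operators compare numbers, not date strings.

```python
# The example filter as a Pinecone-style metadata filter object.
# Operator names ($eq, $gt, $in) are Pinecone's; Weaviate and Qdrant
# have their own equivalents.
metadata_filter = {
    "department": {"$eq": "finance"},
    "date_created": {"$gt": 1704067200},  # epoch seconds for 2024-01-01 UTC
    "content_type": {"$in": ["text", "chart"]},
}
```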
Pre-Filtering vs. Post-Filtering
Most vector databases support both:
- Pre-filtering (also called metadata pre-filtering) applies the filter before the ANN (approximate nearest neighbor) search. It narrows the candidate set first, then searches within it. This is faster and more precise, but requires the database to support it natively. Pinecone, Weaviate, and Qdrant all support this.
- Post-filtering applies the filter after retrieval. You fetch the top-k results by similarity and then drop any that don’t match. This is simpler to implement but wasteful — you might discard most of your results, forcing you to fetch a much larger k.
For most production systems, pre-filtering is the better default. If you’re using Weaviate, you can express these as where filters alongside your vector query. Qdrant uses filter objects with must, should, and must_not conditions. Pinecone supports metadata filtering through its filter parameter in query calls.
Choosing Your Stack
Before writing any pipeline logic, you need to make a few key technology decisions. Here’s how to think through each one.
Embedding Model
| Use Case | Recommended Model |
|---|---|
| Text-only retrieval | text-embedding-3-large (OpenAI), text-embedding-004 (Google) |
| Image-only retrieval | CLIP (OpenAI), SigLIP (Google) |
| Unified text + image retrieval | Gemini Multimodal Embedding API, CLIP with text queries against image embeddings |
| On-premise / privacy-sensitive | nomic-embed-text + CLIP via Ollama |
For a true multimodal RAG pipeline, Gemini’s multimodal embedding API is worth evaluating — it encodes both text and images into the same 1408-dimensional space, which simplifies your retrieval architecture considerably.
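For illustration, here is roughly what calling it looks like through the Vertex AI SDK; treat the model name, argument names, and dimension parameter as assumptions to verify against the current documentation.

```python
# Sketch: embedding an image and its caption into the same 1408-dimensional
# space with Vertex AI's multimodal embedding model. Verify model name and
# arguments against current Vertex AI docs before relying on this.
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

embeddings = model.get_embeddings(
    image=Image.load_from_file("q3_revenue_chart.png"),
    contextual_text="Q3 revenue chart broken down by region",
    dimension=1408,
)
print(len(embeddings.image_embedding), len(embeddings.text_embedding))
```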
Vector Database
- Pinecone — Managed, easy setup, strong metadata filtering. Good starting point.
- Weaviate — Open-source, schema-based, native multimodal support through its multi2vec modules.
- Qdrant — Open-source, excellent filtering performance, good for complex nested filters.
- ChromaDB — Simple, good for local prototyping, less suited for large-scale production.
Multimodal LLM
Gemini 1.5 Pro and Gemini 2.0 Flash both accept images and text in a single context window and generate grounded responses across both. They’re the natural fit for this pipeline, especially since you’re likely already using Google’s embedding infrastructure.
Step-by-Step: Building the Pipeline
Step 1 — Design Your Metadata Schema
Start here. Don’t start with the code. Think through:
- What types of content will this knowledge base contain?
- What dimensions will users want to filter by?
- What access control requirements exist?
- Will filters be applied programmatically (based on user role) or user-selected (via a UI)?
Create a schema document that lists every metadata field, its data type, and the allowed values. This becomes the contract between your ingestion pipeline and your retrieval logic.
Example schema for an internal company knowledge base:
```json
{
  "id": "string (uuid)",
  "content_type": "enum: text | image | table | chart | slide",
  "department": "enum: engineering | marketing | finance | hr | legal | product",
  "topic_tags": "array of strings",
  "source_system": "string",
  "document_id": "string (parent document reference)",
  "page_number": "integer (if applicable)",
  "date_created": "ISO 8601 datetime",
  "date_modified": "ISO 8601 datetime",
  "access_level": "enum: public | internal | confidential",
  "language": "string (ISO 639-1)"
}
```
Step 2 — Build the Ingestion Pipeline
Your ingestion pipeline needs to handle two types of content: text chunks and images.
For text documents (PDFs, Notion pages, Google Docs):
- Extract raw text using a parser (PyMuPDF, Unstructured, Docling).
- Chunk the text into segments of roughly 300–600 tokens, with 10–20% overlap.
- Generate a text embedding for each chunk.
- Attach metadata fields to each chunk.
- Upsert to your vector database.
For images (slides, diagrams, charts, photos):
- Extract images from source documents or pull directly from storage.
- Optionally generate a text caption using a vision model — this helps with text-based retrieval and can be stored as a metadata field or embedded alongside the image.
- Generate an image embedding using your chosen model.
- Attach metadata (content_type = “image”, plus all standard fields).
- Upsert to your vector database.
A useful pattern: for mixed-content documents like PowerPoint files, process each slide twice — once as an image (to capture layout, charts, and visuals) and once as extracted text. Store both in the vector database with a shared document_id and page_number, so you can link them during retrieval.
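Here is a sketch of that pattern; embed_image, embed_text, extract_slide_text, and render_slide_png are hypothetical helpers standing in for your own extraction and embedding code.

```python
# Sketch: processing one slide twice (image + text) with a shared
# document_id and page_number so the two records can be linked later.
# embed_image, embed_text, extract_slide_text, and render_slide_png are
# placeholders for your own helpers.
import uuid

def ingest_slide(deck_id: str, page_number: int, slide) -> list[dict]:
    shared = {
        "document_id": deck_id,
        "page_number": page_number,
        "department": "finance",
        "source_system": "google_drive",
    }
    image_path = render_slide_png(slide)
    text = extract_slide_text(slide)
    return [
        {   # the slide as an image: captures layout, charts, visuals
            "id": str(uuid.uuid4()),
            "vector": embed_image(image_path),
            "payload": {**shared, "content_type": "slide", "image_url": image_path},
        },
        {   # the slide as extracted text: captures wording for text search
            "id": str(uuid.uuid4()),
            "vector": embed_text(text),
            "payload": {**shared, "content_type": "text", "text": text},
        },
    ]
```

Both records then get upserted to the vector database, so either representation can surface at query time while still pointing back to the same slide.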
Step 3 — Set Up Retrieval with Metadata Filtering
Your retrieval function needs to accept:
- A query string (or query image)
- A set of optional filter parameters
Here’s a simplified example using Qdrant:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, MatchAny

def retrieve(
    query_embedding: list[float],
    department: str | None = None,
    content_types: list[str] | None = None,
    top_k: int = 10,
) -> list[dict]:
    client = QdrantClient(url="your-qdrant-url")

    # Build metadata filter conditions from whichever parameters were provided
    conditions = []
    if department:
        conditions.append(
            FieldCondition(key="department", match=MatchValue(value=department))
        )
    if content_types:
        conditions.append(
            FieldCondition(key="content_type", match=MatchAny(any=content_types))
        )
    filter_obj = Filter(must=conditions) if conditions else None

    # Pre-filtered vector search: the filter narrows the candidate set
    # before similarity ranking
    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_embedding,
        query_filter=filter_obj,
        limit=top_k,
    )
    return [hit.payload for hit in results]
```
For multimodal queries, you’ll need to handle two query types:
- Text query — embed the query text, retrieve across both text and image items
- Image query — embed the query image, retrieve similar images (and optionally related text)
If you’re using a shared embedding space (like Gemini’s multimodal embeddings), the same query function handles both cases.
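A small sketch of that dispatch, reusing the retrieve function above; embed_text and embed_image are placeholders for your shared-space embedding calls.

```python
# Sketch: one search path handles text and image queries when both are
# embedded into the same space. embed_text and embed_image are placeholders.
def search(query: str | None = None, query_image_path: str | None = None, **filters):
    if query is not None:
        query_embedding = embed_text(query)
    elif query_image_path is not None:
        query_embedding = embed_image(query_image_path)
    else:
        raise ValueError("Provide a text query or a query image")
    return retrieve(query_embedding=query_embedding, **filters)

# Both calls hit the same index and the same filter logic:
# search(query="Q3 revenue chart by region", department="finance")
# search(query_image_path="whiteboard_photo.jpg", content_types=["slide"])
```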
Step 4 — Assemble the Context for Generation
Once you have your retrieved items, you need to construct the context payload for your multimodal LLM.
Text items get concatenated as text blocks. Image items get included as image attachments. Most multimodal APIs accept a list of content parts:
```python
def build_context(retrieved_items: list[dict], query: str) -> list[dict]:
    messages = [{"role": "user", "content": []}]

    # Add retrieved text chunks and images as content parts
    for item in retrieved_items:
        if item["content_type"] == "text":
            messages[0]["content"].append({
                "type": "text",
                "text": f"[Source: {item['source_system']}, {item['department']}]\n{item['text']}"
            })
        elif item["content_type"] in ["image", "chart", "slide"]:
            messages[0]["content"].append({
                "type": "image_url",
                "image_url": {"url": item["image_url"]}
            })

    # Add the user query
    messages[0]["content"].append({
        "type": "text",
        "text": f"\nQuestion: {query}"
    })
    return messages
```
Pass this to Gemini (or another multimodal LLM) and you’ll get a response that draws on both text and visual context.
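Because build_context produces OpenAI-style content parts, one option is to send the result through Gemini's OpenAI-compatible endpoint. This is a sketch under that assumption: check the base URL and model name against current Google documentation, and note that image parts may need to be base64 data URIs rather than remote URLs depending on the provider.

```python
# Sketch: sending the assembled context to Gemini via its OpenAI-compatible
# endpoint, since build_context emits OpenAI-style content parts.
# Base URL and model name are assumptions to verify against current docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# retrieved_items is the list returned by the retrieval step in Step 3
messages = build_context(retrieved_items, query="How did Q3 revenue break down by region?")
completion = client.chat.completions.create(model="gemini-2.0-flash", messages=messages)
print(completion.choices[0].message.content)
```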
Step 5 — Add a Query Parsing Layer
For a production system, you don’t want users to manually specify their metadata filters. Instead, build a query parsing step that:
- Takes the raw user query.
- Uses an LLM to extract any implicit filter signals (e.g., “show me the finance team’s Q3 slides” → department: finance, content_type: slide).
- Returns a structured filter object to pass to your retrieval function.
This can be as simple as a structured output prompt:
Given this user query, extract any filter parameters that apply.
Return a JSON object with these optional fields:
- department: one of [engineering, marketing, finance, hr, legal, product]
- content_types: array of [text, image, table, chart, slide]
- date_after: ISO 8601 date
- access_level: one of [public, internal, confidential]
If no filter applies for a field, omit it.
Query: {user_query}
This step separates “what the user is asking” from “where to look,” which improves both retrieval precision and response quality.
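Here is a minimal sketch of that parsing step using Gemini's JSON output mode; the prompt is an abbreviated version of the one above, and the field names match the retrieve function from Step 3.

```python
# Sketch: turning a raw user query into a structured filter object with
# Gemini's JSON output mode. The prompt is the one shown above, abbreviated.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
parser = genai.GenerativeModel("gemini-1.5-flash")

def parse_filters(user_query: str) -> dict:
    prompt = (
        "Given this user query, extract any filter parameters that apply. "
        "Return a JSON object with optional fields: department, content_types, "
        f"date_after, access_level. Omit fields that do not apply.\n\nQuery: {user_query}"
    )
    response = parser.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)

# parse_filters("show me the finance team's Q3 slides")
# might return: {"department": "finance", "content_types": ["slide"]}
```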
Step 6 — Handle Re-Ranking (Optional but Recommended)
At scale, your top-k results from vector search won’t always be perfectly ordered by relevance. A cross-encoder re-ranking step can improve this significantly.
After retrieval, pass your query and retrieved items to a cross-encoder model (like Cohere’s Rerank API or a local cross-encoder via sentence-transformers). This re-scores each result based on the full query-document pair rather than just embedding similarity.
For multimodal results, apply re-ranking separately to text and image results, then merge. A common approach is to score image results using a vision-language model prompt like: “On a scale of 1–10, how relevant is this image to the query: ‘{query}’?” — though this adds latency.
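For the text side, here is a minimal sketch using a local cross-encoder via sentence-transformers; the checkpoint name is just one commonly used public model, not a specific recommendation.

```python
# Sketch: re-scoring retrieved text results with a local cross-encoder.
# The checkpoint name is one common public model, shown as an example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_text_results(query: str, items: list[dict], top_n: int = 5) -> list[dict]:
    text_items = [item for item in items if item["content_type"] == "text"]
    scores = reranker.predict([(query, item["text"]) for item in text_items])
    ranked = sorted(zip(scores, text_items), key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked[:top_n]]
```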
Common Mistakes to Avoid
Inconsistent Metadata at Ingestion
The most common failure mode is metadata that gets populated inconsistently. If some documents have department = "Finance" and others have department = "finance", your filters will miss results. Enforce schema validation at ingestion time and normalize all metadata values to a canonical format.
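A small sketch of what that normalization can look like at ingestion time; the allowed values follow the Step 1 schema.

```python
# Sketch: normalize and validate metadata at ingestion so filters always
# see canonical values. Allowed values follow the Step 1 schema.
ALLOWED_DEPARTMENTS = {"engineering", "marketing", "finance", "hr", "legal", "product"}

def normalize_department(value: str) -> str:
    canonical = value.strip().lower()
    if canonical not in ALLOWED_DEPARTMENTS:
        raise ValueError(f"Unknown department: {value!r}")
    return canonical
```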
Forgetting to Handle Image Storage Separately
Your vector database stores embeddings and metadata — not the actual image files. You need a separate image storage layer (S3, GCS, Cloudflare R2) and a reliable way to reference image URLs in your metadata. Make sure image URLs are stable and accessible from wherever your generation layer runs.
Ignoring Chunk-to-Image Linkage
When you process a document that contains both text and images, keep track of which text chunks and images came from the same source. This lets you surface the image when a user asks about content from that page, even if the text embedding scored higher than the image embedding.
Setting top_k Too Low When Filtering Heavily
If you apply aggressive metadata filters and your top_k is too small, you might return very few results — or none. Test your retrieval pipeline with representative queries across different filter combinations and adjust your top_k accordingly.
Not Caching Embeddings
Embedding generation is the slowest and most expensive step in your ingestion pipeline. If you re-ingest documents (for updates or maintenance), you’ll regenerate embeddings you already have. Cache embeddings by content hash so you only generate new ones for content that has actually changed.
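A minimal sketch of that caching pattern, using an in-memory dict as a stand-in for a persistent key-value store and a placeholder embed_text call.

```python
# Sketch: cache embeddings by content hash so re-ingestion only embeds
# content that actually changed. embed_text is a placeholder embedding call;
# the dict stands in for a persistent cache (Redis, SQLite, etc.).
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def embed_with_cache(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_text(text)  # placeholder embedding call
    return _embedding_cache[key]
```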
Building Multimodal RAG Workflows in MindStudio
If you want to build and deploy a multimodal RAG pipeline without managing infrastructure directly, MindStudio is worth a look.
MindStudio is a no-code platform for building AI agents and automated workflows. Its visual workflow builder lets you connect embedding APIs, vector databases, and multimodal models — including Gemini — into working pipelines without writing backend code. The average workflow takes 15 minutes to an hour to set up.
For a multimodal RAG system specifically, you’d use MindStudio to:
- Wire up your vector database — Connect Pinecone, Weaviate, or Qdrant through MindStudio’s 1,000+ pre-built integrations or via webhook/API calls.
- Process queries with Gemini — Gemini is available as a built-in model with no API key setup required. You can build a query parsing step, retrieval step, and generation step as sequential nodes in a workflow.
- Add metadata filter logic — Use MindStudio’s conditional logic and JavaScript function support to extract filter parameters from user queries and pass them to your retrieval calls.
- Build a frontend — MindStudio supports custom UIs, so you can ship a polished chat interface for your knowledge base without a separate frontend project.
This is particularly useful for teams that want to prototype quickly or that don’t have dedicated ML infrastructure. You get access to Gemini’s multimodal capabilities directly, without setting up embedding APIs, managing vector database credentials, or building a retrieval layer from scratch.
You can try MindStudio free at mindstudio.ai. If you’re already building workflows on the platform, the AI agent builder documentation covers how to structure multi-step retrieval pipelines.
FAQ
What is multimodal RAG?
Multimodal RAG (retrieval-augmented generation) is a system that retrieves relevant content from a knowledge base containing multiple content types — typically text and images — and passes that content to a multimodal language model to generate a response. Unlike standard RAG, which works only with text, multimodal RAG can surface charts, diagrams, photos, and slides alongside text passages as retrieval results.
How does metadata filtering work in a RAG pipeline?
Metadata filtering lets you constrain which items are searched in your vector database before (or after) the similarity search runs. Each item stored in the database has structured metadata fields attached to it — things like department, content type, date, or topic. When a user submits a query, the system applies filter conditions to narrow the search to only relevant items. This improves precision, reduces noise, and enables access control.
What embedding model should I use for multimodal RAG?
It depends on your architecture. If you want a unified embedding space for text and images (so a text query can retrieve images directly), Gemini’s multimodal embedding API or OpenAI’s CLIP are good choices. If you’re embedding text and images separately, use a strong text model like text-embedding-3-large for text and CLIP or SigLIP for images. For privacy-sensitive use cases, open models such as nomic-embed-text (for text) and CLIP (for images) can be run locally instead.
What vector databases support metadata filtering?
Pinecone, Weaviate, Qdrant, and Milvus all support metadata filtering. Qdrant offers particularly expressive filtering with nested must, should, and must_not conditions. Weaviate supports metadata filtering natively as part of its GraphQL query interface. Pinecone supports filtering via the filter parameter in its query API. ChromaDB supports basic where filters and is well-suited for local prototyping.
Can Gemini handle both text and image context in a single request?
Yes. Gemini 1.5 Pro and Gemini 2.0 Flash both support multimodal inputs — you can pass a mix of text snippets and image files in a single API call. The model processes both in a shared context window and generates a response that draws on the full context. This makes Gemini a natural fit for the generation layer of a multimodal RAG pipeline.
How do I handle access control in a multimodal RAG system?
The cleanest approach is to store an access_level or user_role field in each item’s metadata and apply it as a mandatory filter on every retrieval call based on the authenticated user’s permissions. This way, the vector search is scoped to what the user is allowed to see — not just what’s most similar to their query. Never rely solely on post-retrieval filtering for access control, as that approach can expose restricted content if implemented incorrectly.
Key Takeaways
- A multimodal RAG pipeline extends standard RAG to handle both text and images, using shared or separate embedding spaces and a multimodal LLM for generation.
- Metadata filtering is essential for precision at scale — it narrows the search space by department, content type, date, or any other structured attribute before the vector search runs.
- Design your metadata schema before writing any pipeline code — inconsistent metadata is the most common cause of retrieval failures.
- Gemini is a strong default for both the embedding layer (via the Multimodal Embedding API) and the generation layer, with native support for mixed text-and-image context.
- For teams that want to prototype or deploy without building infrastructure from scratch, MindStudio’s workflow builder lets you connect vector databases, Gemini, and custom logic into a working RAG pipeline — without managing backend services directly.
If you’re ready to start building, MindStudio gives you access to Gemini, 1,000+ integrations, and a visual pipeline builder — all in one place. You can build a working prototype in an afternoon and scale from there.