How to Build a Unified Multimodal Search System with Gemini Embedding 2 and LangChain
Use Gemini Embedding 2 with LangChain and ChromaDB to build a single search index that handles text, images, audio, video, and PDFs in one query.
The Problem with Siloed Search
Real business data doesn’t live in one format. A company’s knowledge base might span engineering specs in PDFs, customer call recordings, training videos, dashboard screenshots, and plain text notes. Gemini Embedding 2 changes how you can search across all of it — by creating a single index where one query can surface relevant content regardless of format.
The traditional answer to multimodal data is to build separate indexes and merge the results downstream. It works, but it’s expensive to maintain, inconsistent in quality, and often misses connections between related content in different formats.
This guide walks through building a unified multimodal search system with Gemini’s embedding model, LangChain, and ChromaDB — one pipeline that ingests text, images, audio, video, and PDFs and makes them all queryable with a single natural language search.
What Gemini Embedding 2 Brings to the Table
gemini-embedding-001 — Google’s second-generation embedding model, often referred to as Gemini Embedding 2 — is a significant upgrade from text-embedding-004. The key differences:
- 3072-dimensional vectors (versus 768 in the previous generation), configurable down to 1536, 768, or 256 via Matryoshka Representation Learning
- 8,192 token context window (up from 2,048)
- Task-aware embeddings — specify retrieval_document or retrieval_query as the task type and the model optimizes accordingly
- Improved multilingual retrieval across dozens of languages
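The Matryoshka property means a lower-dimensional embedding is essentially a truncated prefix of the full vector, re-normalized to unit length. As a rough illustration of what happens under the hood (the vector below is synthetic, not a real embedding, and the exact truncation behavior is the model's, not this helper's):

```python
import math

def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components, then L2-normalize so cosine
    similarity remains meaningful after truncation."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Synthetic stand-in for a 3072-dim embedding
full = [1.0, 2.0, 2.0, 0.5, 0.5]
small = truncate_and_normalize(full, 3)
print(len(small))                           # 3
print(round(sum(x * x for x in small), 6))  # 1.0 (unit length)
```

This is why you can request 1536, 768, or 256 dimensions from the same model without retraining: the leading components carry most of the signal.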
The model itself is text-only. But Gemini’s broader ecosystem — specifically the vision and audio capabilities in Gemini 2.0 Flash — is what makes true multimodal search practical. The pattern is: use Gemini to convert non-text content into rich text descriptions, then embed everything with the same model. Everything ends up in a shared vector space because a single embedding model produced it all.
Note: Google also offers multimodalembedding@001 on Vertex AI, which embeds text, images, and video directly without the intermediate description step. That's a valid alternative for image-heavy use cases. The approach in this guide uses gemini-embedding-001 and requires only an AI Studio API key.
You can verify the model’s performance characteristics against the MTEB leaderboard, where it ranks among the top retrieval models.
Architecture Overview
The pipeline has three stages:
Stage 1 — Ingest and preprocess. Each content type gets converted to text. Plain text and PDFs are chunked. Images are described by a Gemini vision model. Audio files are transcribed. Videos are analyzed for visual content and spoken words.
Stage 2 — Embed and index. Every chunk or description is embedded with gemini-embedding-001 and stored in ChromaDB alongside metadata: source file path, content type, file format, and any additional context.
Stage 3 — Query. A user submits a text query. It’s embedded with the same model. ChromaDB returns the closest vectors. The metadata tells you where the original content came from.
The query step is identical regardless of what’s in the index. That’s the point.
Setting Up Your Environment
You’ll need Python 3.9+ and a Google AI Studio API key.
pip install langchain langchain-google-genai langchain-chroma chromadb \
    pypdf pillow google-generativeai
export GOOGLE_API_KEY="your-key-here"
Set up your base imports and initialize the embedding model and vector store:
import os
import base64
from pathlib import Path

import google.generativeai as genai
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_chroma import Chroma
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    google_api_key=os.environ["GOOGLE_API_KEY"],
    task_type="retrieval_document"
)

vectorstore = Chroma(
    collection_name="multimodal_index",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
One thing to note upfront: pick your output_dimensionality before you start indexing. You can’t mix dimensions in the same ChromaDB collection. For most use cases, 768 dimensions is a good balance between storage cost and retrieval quality. For high-precision retrieval, go with 3072.
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    google_api_key=os.environ["GOOGLE_API_KEY"],
    task_type="retrieval_document",
    output_dimensionality=768
)
Building the Indexing Pipeline
Processing Text and PDFs
Text is the baseline case. Load, chunk, tag with metadata, index.
from langchain_community.document_loaders import PyPDFLoader, TextLoader

def index_text_file(file_path: str) -> int:
    loader = TextLoader(file_path)
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    for chunk in chunks:
        chunk.metadata.update({"file_type": "text", "source_file": file_path})
    vectorstore.add_documents(chunks)
    return len(chunks)

def index_pdf(file_path: str) -> int:
    loader = PyPDFLoader(file_path)
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    for chunk in chunks:
        chunk.metadata.update({"file_type": "pdf", "source_file": file_path})
    vectorstore.add_documents(chunks)
    return len(chunks)
A chunk size of 800 characters with 100-character overlap is a reasonable starting point (RecursiveCharacterTextSplitter measures length in characters by default, not tokens). For dense technical content, go lower (400–600). For narrative writing, go higher (1000–1200).
Processing Images
For images, you send the file to Gemini and ask for a comprehensive description. That description becomes the indexed content.
vision_model = genai.GenerativeModel("gemini-2.0-flash")

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    suffix = Path(image_path).suffix.lower()
    mime_map = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".webp": "image/webp"
    }
    mime_type = mime_map.get(suffix, "image/jpeg")
    response = vision_model.generate_content([
        {"mime_type": mime_type, "data": image_data},
        """Describe this image in detail for search indexing. Include:
        - What is shown (objects, people, scenes, activities)
        - Any visible text or labels (read them exactly)
        - Colors, layout, and composition if relevant
        - Any charts, graphs, or data shown (describe the values)
        - Context clues about the purpose or setting
        Be specific and comprehensive."""
    ])
    return response.text

def index_image(image_path: str) -> int:
    description = describe_image(image_path)
    doc = Document(
        page_content=description,
        metadata={
            "file_type": "image",
            "source_file": image_path,
            "content_type": "image_description"
        }
    )
    vectorstore.add_documents([doc])
    return 1
The prompt is doing real work here. Asking Gemini to read visible text, describe charts with actual values, and interpret the purpose of the image creates a much richer search surface than a generic caption. Spend time on this prompt before you index anything.
Processing Audio and Video
Gemini 2.0 Flash handles audio natively. You can send MP3, WAV, M4A, and OGG files directly.
def transcribe_audio(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")
    suffix = Path(audio_path).suffix.lower()
    mime_map = {
        ".mp3": "audio/mp3", ".wav": "audio/wav",
        ".m4a": "audio/mp4", ".ogg": "audio/ogg"
    }
    mime_type = mime_map.get(suffix, "audio/mp3")
    response = vision_model.generate_content([
        {"mime_type": mime_type, "data": audio_data},
        """Process this audio for search indexing:
        [TRANSCRIPT] Transcribe the spoken content verbatim.
        [SUMMARY] Summarize key topics, decisions, action items, and named entities.
        [SPEAKERS] Identify speakers if distinguishable."""
    ])
    return response.text

def index_audio(audio_path: str) -> int:
    content = transcribe_audio(audio_path)
    if len(content) > 800:
        splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
        chunks = splitter.create_documents(
            [content],
            metadatas=[{"file_type": "audio", "source_file": audio_path}]
        )
        vectorstore.add_documents(chunks)
        return len(chunks)
    doc = Document(
        page_content=content,
        metadata={"file_type": "audio", "source_file": audio_path}
    )
    vectorstore.add_documents([doc])
    return 1
For video, short clips (under ~20MB) can be sent directly. Longer files should be split with ffmpeg before processing.
def index_video(video_path: str) -> int:
    file_size = Path(video_path).stat().st_size
    if file_size < 20 * 1024 * 1024:
        with open(video_path, "rb") as f:
            video_data = base64.b64encode(f.read()).decode("utf-8")
        response = vision_model.generate_content([
            {"mime_type": "video/mp4", "data": video_data},
            """Analyze this video for search indexing:
            1. Main topic and purpose
            2. Visual content — scenes, objects, people, on-screen text
            3. Spoken content — transcribe or summarize
            4. Key moments with approximate timestamps
            5. Any brands, products, or entities mentioned"""
        ])
        content = response.text
    else:
        content = f"[Large video — manual segmentation required]: {video_path}"
    doc = Document(
        page_content=content,
        metadata={"file_type": "video", "source_file": video_path}
    )
    vectorstore.add_documents([doc])
    return 1
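For the large-file branch, one workable approach is ffmpeg's segment muxer with stream copy, which splits without re-encoding. A sketch that builds the command (the `segment_command` helper and file names are illustrative, not part of the pipeline above; running it assumes ffmpeg is on your PATH):

```python
import subprocess
from pathlib import Path

def segment_command(video_path: str, out_dir: str, seconds: int = 180) -> list[str]:
    """Build an ffmpeg command that splits a video into fixed-length
    segments named <stem>_part_000.mp4, <stem>_part_001.mp4, ..."""
    pattern = str(Path(out_dir) / (Path(video_path).stem + "_part_%03d.mp4"))
    return [
        "ffmpeg", "-i", video_path,
        "-c", "copy",                  # stream copy: fast, no quality loss
        "-map", "0",                   # keep all streams (video + audio)
        "-f", "segment",
        "-segment_time", str(seconds), # target segment length in seconds
        "-reset_timestamps", "1",
        pattern,
    ]

cmd = segment_command("all_hands_recording.mp4", "./segments")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually split
```

Each resulting segment can then go through index_video individually, with the timestamp range recorded in metadata.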
Running Unified Queries
Once the index is built, querying it is simple. One function, any content type.
def search(query: str, k: int = 5, file_type: str = None) -> list[dict]:
    """
    Query the unified index.

    Args:
        query: Natural language query
        k: Number of results
        file_type: Optional filter — "text", "pdf", "image", "audio", "video"
    """
    where_filter = {"file_type": {"$eq": file_type}} if file_type else None

    # Use query-optimized embeddings for retrieval
    query_embeddings = GoogleGenerativeAIEmbeddings(
        model="models/gemini-embedding-001",
        google_api_key=os.environ["GOOGLE_API_KEY"],
        task_type="retrieval_query",
        output_dimensionality=768
    )
    query_store = Chroma(
        collection_name="multimodal_index",
        embedding_function=query_embeddings,
        persist_directory="./chroma_db"
    )

    results = query_store.similarity_search_with_score(query, k=k, filter=where_filter)
    return [
        {
            "content": doc.page_content[:400],
            "source_file": doc.metadata.get("source_file"),
            "file_type": doc.metadata.get("file_type"),
            # Chroma returns a distance (lower = closer), so 1 - distance
            # is only a rough relevance score for display
            "score": round(1 - score, 4)
        }
        for doc, score in results
    ]
Run a query:
results = search("supply chain delays in Southeast Asia", k=5)

for r in results:
    print(f"[{r['file_type'].upper()}] {r['source_file']} — Score: {r['score']}")
    print(r['content'][:200])
    print("---")
A query like this can return a PDF analyst report, a screenshot of a supplier dashboard, a clip from a logistics team meeting recording, and a segment from a training video — all ranked by semantic relevance in a single result set.
Adding a QA Layer
To turn search into question-answering with source attribution, add a retrieval chain:
from langchain.chains import RetrievalQA

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    google_api_key=os.environ["GOOGLE_API_KEY"]
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

response = qa_chain.invoke({
    "query": "What were the top customer complaints from last quarter?"
})

print(response["result"])
for doc in response["source_documents"]:
    print(f"  Source: [{doc.metadata['file_type']}] {doc.metadata['source_file']}")
The LLM can now synthesize an answer from content that originally lived in call recordings, PDFs, and slide decks — without any per-type handling in your query code.
Production Considerations
Rate Limits and Batching
Gemini’s API has per-minute request and token limits. For large indexing jobs, add exponential backoff. The tenacity library handles this cleanly.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=30))
def embed_with_retry(text: str) -> list[float]:
    return embeddings.embed_query(text)
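Pairing the retry decorator with fixed-size batches also helps: each request stays small, and a failure only retries one slice rather than the whole corpus. A minimal batching helper (the loop at the bottom is a sketch of how it would plug into the pipeline above):

```python
from typing import Iterator

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches so each API call stays within
    per-request limits and failures retry only a small slice."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"chunk {i}" for i in range(250)]
batches = list(batched(texts, 100))
print([len(b) for b in batches])  # [100, 100, 50]

# Indexing loop (sketch):
# for batch in batched(texts, 100):
#     vectorstore.add_texts(batch)
```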
Handling Index Updates
When source files change, delete their old vectors and re-index. Keep a manifest file (JSON or SQLite) tracking what’s been indexed with timestamps. On each run, compare file modification times against the manifest and only reprocess changed files.
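A minimal sketch of that manifest logic, using a JSON file and modification times (the file name and helper names here are illustrative, and the delete-then-reindex step against ChromaDB is left out):

```python
import json
from pathlib import Path

MANIFEST = Path("./index_manifest.json")  # hypothetical manifest location

def load_manifest() -> dict:
    """Map of file path -> mtime recorded at last indexing run."""
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def files_to_reindex(paths: list[str], manifest: dict) -> list[str]:
    """Return only files that are new or modified since the last run."""
    stale = []
    for p in paths:
        if manifest.get(p) != Path(p).stat().st_mtime:
            stale.append(p)
    return stale

def record_indexed(paths: list[str], manifest: dict) -> None:
    """Update the manifest after successful indexing."""
    for p in paths:
        manifest[p] = Path(p).stat().st_mtime
    MANIFEST.write_text(json.dumps(manifest, indent=2))
```

On each run: load the manifest, reprocess only what files_to_reindex returns, then record the new timestamps.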
Dimension and Task Type Consistency
Set output_dimensionality once at collection creation time. Use task_type="retrieval_document" when indexing and task_type="retrieval_query" when querying — Gemini’s embedding model is instruction-tuned and using the right task type meaningfully improves retrieval precision.
For more detail on LangChain’s retriever options, the LangChain retrieval documentation covers hybrid search, MMR reranking, and multi-query retrieval strategies that pair well with this setup.
Where MindStudio Fits
Building this pipeline in Python gives you full control. But if the goal is getting multimodal search into the hands of non-technical teammates — or wiring it into business workflows without maintaining infrastructure — MindStudio offers a different path.
MindStudio is a no-code platform for building AI agents and automated workflows. It has native Gemini support across the full model family, including vision and embedding, without requiring API keys or separate accounts. You can build a workflow that accepts a file upload, passes it through a Gemini processing step, stores results in a connected knowledge base, and exposes a search interface through a custom UI — without writing a line of code.
For teams building AI-powered knowledge management tools or wanting to automate document workflows, this approach cuts the setup time from hours to minutes. Because MindStudio handles the infrastructure layer — rate limiting, retries, auth — you can focus on what the agent does, not how it runs.
You can try it free at mindstudio.ai.
Frequently Asked Questions
What is Gemini Embedding 2 and how is it different from text-embedding-004?
gemini-embedding-001 (Gemini Embedding 2) produces 3072-dimensional vectors vs. 768 for text-embedding-004, supports 8,192 input tokens vs. 2,048, and includes Matryoshka Representation Learning for flexible output dimensions. It also supports explicit task types (retrieval_document, retrieval_query, semantic_similarity) that improve precision for specific use cases. For new projects, gemini-embedding-001 is the right choice.
Can Gemini Embedding 2 embed images directly?
No — gemini-embedding-001 is a text-only model. For native multimodal embedding in a shared vector space, Google’s multimodalembedding@001 on Vertex AI handles text, images, and video directly. The approach in this guide converts images and other media to text descriptions via Gemini’s vision capabilities first, then embeds with the same text model, achieving comparable retrieval quality without Vertex AI.
What vector database should I use with Gemini embeddings?
ChromaDB works well for local and small-scale projects. For production at scale, Pinecone (managed, low latency), Weaviate (supports hybrid BM25 + vector search), and pgvector (if you’re already on PostgreSQL) are all solid options. All integrate with LangChain and accept Gemini embeddings without any special configuration.
How do I handle large video files that exceed Gemini’s direct upload limit?
Split them into segments using ffmpeg — typically 2–5 minute chunks — and process each segment independently. Store each chunk as a separate document with metadata referencing the original file and the timestamp range it covers. This also improves search granularity, since a query can surface the specific segment of a long video rather than just the file.
Is LangChain required to build this pipeline?
No. LangChain simplifies document loading, text splitting, and retrieval chain construction, but you can replicate the same pipeline using the Gemini Python SDK and ChromaDB’s client directly. LangChain earns its place when you want to connect retrieval to downstream reasoning steps — QA chains, agents, tool use — without writing all that orchestration logic yourself.
How do I validate that cross-modal search is working correctly?
Build a small evaluation set: 20–30 queries where you know which documents should appear in the top results. Include cross-modal cases (e.g., a text query that should surface a specific image). Run the queries, measure precision at 5 (P@5), and iterate on your description prompts for any modality that underperforms. Description quality drives retrieval quality — better prompts produce better results.
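The metric itself is a few lines. A sketch with a made-up evaluation case (in practice, `retrieved` would come from the search function's source_file fields, not a hard-coded list):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

# Hypothetical cross-modal case: a text query expected to surface an image
relevant = {"dashboards/q3_revenue.png", "reports/q3_summary.pdf"}
retrieved = ["dashboards/q3_revenue.png", "notes/misc.txt",
             "reports/q3_summary.pdf", "audio/standup.mp3", "other.txt"]

print(precision_at_k(retrieved, relevant, k=5))  # 0.4
```

Averaging P@5 across the full query set gives a single number to compare before and after each prompt revision.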
Key Takeaways
- One model, one index: Routing all content types through Gemini’s description-then-embed pipeline creates a single vector space where text, images, audio, and video are semantically comparable.
- The preprocessing step is critical: The quality of Gemini’s descriptions determines retrieval accuracy. Invest time in prompts before you index at scale.
- Tag everything at index time: Source type, file path, date, and context metadata make filtering and debugging much easier later.
- Specify task types: Using retrieval_document and retrieval_query appropriately is a small change with measurable retrieval quality impact.
- LangChain bridges search to reasoning: The retrieval chain pattern turns your index into a full QA system with source attribution, not just a list of results.
If you’d rather deploy this capability without managing the infrastructure, MindStudio lets you build Gemini-powered agents with multimodal processing in a fraction of the time — no servers, no API key management, no maintenance overhead.