
How to Build a Multimodal Document Intelligence Agent with Gemini Embedding 2

Gemini Embedding 2 embeds PDFs, audio, video, and text in one vector space. Learn how to build a document search agent that retrieves across all content types.

MindStudio Team

What Makes Gemini Embedding 2 Different for Document Intelligence

Most embedding models speak one language: text. Feed them a PDF and they’ll process the extracted text. Feed them an image and they’ll either fail or require a separate vision pipeline bolted on. This creates a fundamental problem for document intelligence — real documents aren’t just text. They’re tables, charts, scanned pages, annotated slides, audio recordings of meetings, and video walkthroughs.

Gemini Embedding 2 (specifically the experimental gemini-embedding-exp-03-07 model and its newer successors) changes the architecture here. It maps text, images, audio, and video into a shared vector space, meaning a question asked in plain text can retrieve a relevant chart, a scanned invoice, or a spoken explanation from a recorded meeting — all through the same similarity search.

This article walks through exactly how to build a multimodal document intelligence agent using Gemini Embedding 2. You’ll learn how the embedding model works, how to structure your vector store, how to build the retrieval layer, and how to wire it all into an agent that can answer questions across content types. Every section includes working code and practical design decisions.


Understanding Gemini Embedding 2 and Its Multimodal Architecture

How the Embedding Model Works

Gemini Embedding 2 is a dense retrieval model built on top of Google’s Gemini architecture. Unlike older embedding approaches that used separate encoders for different modalities (one for text, one for images), Gemini Embedding 2 uses a unified encoder that processes all input types through a shared representation layer.

The result is a 3072-dimensional embedding vector — the same dimensionality regardless of whether the input is a paragraph of text, a PNG screenshot, a 30-second audio clip, or a short video segment. Because these vectors exist in the same space, you can compute cosine similarity between a text query and an image, and get a meaningful result.

This matters for document intelligence because it means:

  • A user asking “what does the Q3 revenue breakdown look like?” can retrieve a bar chart embedded directly from a PDF page
  • A query about “onboarding steps” can surface both a written procedure document and a screen-recorded walkthrough video
  • Audio transcripts and written documents compete for relevance in the same search index
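Because all modalities land in one space, relevance becomes plain vector math. A minimal sketch using toy 3-dimensional vectors as stand-ins for real 1536- or 3072-dimensional Gemini embeddings (the vectors and their values are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy vectors standing in for embeddings of different modalities
query_vec = [0.12, 0.31, 0.50]   # "Q3 revenue breakdown" (text query)
chart_vec = [0.10, 0.30, 0.52]   # bar chart image extracted from a PDF page
memo_vec  = [0.90, -0.20, 0.05]  # unrelated text chunk

# The chart outranks the unrelated text for this query
print(cosine_similarity(query_vec, chart_vec) > cosine_similarity(query_vec, memo_vec))  # True
```

The same function scores a text query against an image, an audio clip, or another text chunk, which is exactly why one index can serve all content types.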

Embedding Dimensions and Task Types

Gemini Embedding 2 supports Matryoshka Representation Learning (MRL), which means you can truncate the embedding dimensions without a significant drop in quality. The supported output dimensions are:

  • 3072 — Maximum accuracy, full storage cost
  • 1536 — Good balance for most production systems
  • 768 — Lightweight, mobile/edge use cases
  • 256 — Ultra-compact, very high-throughput systems
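If you truncate an MRL embedding yourself (rather than requesting a smaller output_dimensionality from the API), the usual pattern is to keep the leading values and re-normalize to unit length so cosine scores stay meaningful. A sketch:

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the leading `dims` values of an MRL embedding and
    re-normalize to unit length so cosine similarity stays meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a full 3072-dim embedding
short = truncate_embedding(full, 2)  # stand-in for truncating to 1536
print(short)
```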

The model also supports task-type hints that shift its optimization target:

  • RETRIEVAL_DOCUMENT — for indexing documents at ingest time
  • RETRIEVAL_QUERY — for embedding the user query at search time
  • SEMANTIC_SIMILARITY — for clustering and deduplication tasks
  • CLASSIFICATION — for category assignment

For a document intelligence agent, you’ll always use RETRIEVAL_DOCUMENT when ingesting content and RETRIEVAL_QUERY when embedding user questions.

Rate Limits and Model Availability

As of mid-2025, gemini-embedding-exp-03-07 is available through Google AI Studio and Vertex AI. The experimental version has a context window of up to 8,192 tokens for text inputs and supports images up to 20MB. Rate limits vary by tier — the free tier allows 1,500 requests per day at 5 requests per minute; production tier limits are much higher.

For multimodal inputs (images, audio, video), the model accepts content via base64-encoded inline data or Google Cloud Storage URIs. Text inputs are passed as plain strings.


Designing the Storage and Retrieval Architecture

Before writing any code, you need to make clear architectural decisions about how you’ll store, index, and retrieve multimodal content. Getting this wrong early is expensive to fix later.

Choosing a Vector Store

Not all vector stores handle multimodal workloads equally well. The key requirements for this use case are:

  1. Support for high-dimensional vectors (3072 dims if you’re using full Gemini embeddings)
  2. Metadata filtering — you’ll need to filter by content type, document source, date, etc.
  3. Hybrid search — combining dense (embedding) search with sparse (keyword) search improves recall
  4. Namespace or collection separation — useful if you’re building multi-tenant systems

Good options include:

  • Pinecone — managed, easy setup, strong metadata filtering, supports MRL truncation
  • Weaviate — open-source or managed, good multimodal support, built-in hybrid search
  • Qdrant — open-source, fast, supports payload filtering and multi-vector per document
  • ChromaDB — lightweight, great for local development and prototyping
  • pgvector (PostgreSQL) — if you’re already on Postgres and want to avoid a separate service

For this guide, we’ll use Qdrant as the vector store, primarily because it supports storing multiple vector types per point (useful for storing both full and truncated embeddings), has strong filtering, and runs well locally via Docker for development.

Data Model Design

Each document chunk stored in your vector database needs a well-designed payload. Here’s the schema you’ll use:

{
  "id": "uuid",
  "embedding": [3072 floats],
  "payload": {
    "content_type": "image | text | audio | video",
    "source_document": "q3-report.pdf",
    "page_number": 4,
    "chunk_index": 2,
    "raw_content": "base64 or text string",
    "content_url": "gs://bucket/path",
    "summary": "LLM-generated summary of this chunk",
    "created_at": "2025-06-01T00:00:00Z",
    "tags": ["financial", "Q3", "revenue"],
    "parent_doc_id": "uuid of parent document"
  }
}

A few design notes:

  • Always store a summary generated by a Gemini model at ingest time. You’ll use this for re-ranking and for the LLM’s context window.
  • Store the raw_content or content_url so you can fetch the original for the LLM at query time.
  • The parent_doc_id links chunks back to their source document, enabling “fetch surrounding context” patterns.

Chunking Strategy for Each Content Type

Chunking rules differ significantly by content type:

Text documents (PDFs, Word, Markdown)

Use semantic chunking over fixed-size chunking where possible. A good default is:

  • Target chunk size: 512 tokens
  • Overlap: 64 tokens
  • Break at paragraph or sentence boundaries, never mid-sentence
  • Keep headings with their section content (don’t split a heading from its body)
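One way to honor the "keep headings with their section content" rule is a pre-pass that groups each heading with its body before token-level splitting. A minimal sketch for Markdown-style headings (split_keep_headings is a hypothetical helper, not part of any library):

```python
def split_keep_headings(markdown_text: str) -> list[str]:
    """Group each heading with the body beneath it, so later chunking
    never separates a heading from its section content."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]

doc = "# Revenue\nQ3 grew 12%.\n# Costs\nFlat vs Q2."
print(split_keep_headings(doc))  # ['# Revenue\nQ3 grew 12%.', '# Costs\nFlat vs Q2.']
```

Each section can then be fed to a token-aware splitter so a heading always travels with at least the start of its body.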

Images (charts, diagrams, scanned pages)

For image-rich PDFs, extract each page as a PNG and embed the full page image. For standalone images, embed as-is. Also generate a text caption using Gemini Vision and store it as metadata — this improves hybrid search recall.

Audio

Chunk at natural pause points or every 60–90 seconds. Use Gemini’s speech-to-text to generate a transcript and store it alongside the audio embedding. The transcript helps with exact-match queries that embedding search might miss.
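The timing rule above can be sketched as a boundary calculator; real ingestion code would then slice the file (for example with pydub) at these offsets. The 75-second default is just a midpoint of the suggested 60–90 second range:

```python
def chunk_boundaries(duration_s: float, chunk_s: float = 75.0) -> list[tuple[float, float]]:
    """Compute (start, end) windows for slicing audio into ~60-90 s chunks."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)  # final chunk may be shorter
        bounds.append((start, end))
        start = end
    return bounds

print(chunk_boundaries(200.0))  # [(0.0, 75.0), (75.0, 150.0), (150.0, 200.0)]
```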

Video

Extract keyframes every 5–10 seconds and embed them as images. For the audio track, process it as audio. Store both types of embeddings linked to the same parent_doc_id.


Setting Up the Environment and Dependencies

Prerequisites

You’ll need:

  • Python 3.10+
  • A Google AI Studio API key (or Vertex AI credentials)
  • Docker (for running Qdrant locally)
  • The following Python packages:

pip install google-generativeai google-genai qdrant-client \
  pypdf2 pillow pydub moviepy langchain-text-splitters \
  python-dotenv httpx pymupdf

Environment Setup

Create a .env file:

GOOGLE_API_KEY=your_google_ai_studio_key
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION=doc_intelligence
GEMINI_MODEL=gemini-2.0-flash
EMBEDDING_MODEL=gemini-embedding-exp-03-07
EMBEDDING_DIMENSIONS=1536

Start Qdrant with Docker

docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant

Initialize the Qdrant Collection

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import os

client = QdrantClient(url=os.getenv("QDRANT_URL"))

collection = os.getenv("QDRANT_COLLECTION")

# Create the collection only if it doesn't already exist
if not client.collection_exists(collection):
    client.create_collection(
        collection_name=collection,
        vectors_config=VectorParams(
            size=int(os.getenv("EMBEDDING_DIMENSIONS")),
            distance=Distance.COSINE
        )
    )
print("Collection ready")

Building the Multimodal Ingestion Pipeline

This is the core of the system. The ingestion pipeline processes incoming documents of any type, chunks them appropriately, generates Gemini Embedding 2 vectors, and stores everything in Qdrant.

The Embedding Client

First, create a reusable embedding client that handles both text and multimodal inputs:

import google.generativeai as genai
import base64
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

class GeminiEmbeddingClient:
    def __init__(self):
        self.model = os.getenv("EMBEDDING_MODEL", "gemini-embedding-exp-03-07")
        self.dimensions = int(os.getenv("EMBEDDING_DIMENSIONS", 1536))
    
    def embed_text(self, text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
        result = genai.embed_content(
            model=self.model,
            content=text,
            task_type=task_type,
            output_dimensionality=self.dimensions
        )
        return result["embedding"]
    
    def embed_image(self, image_path: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode("utf-8")
        
        # Determine MIME type
        suffix = Path(image_path).suffix.lower()
        mime_map = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}
        mime_type = mime_map.get(suffix, "image/png")
        
        content = {
            "parts": [
                {
                    "inline_data": {
                        "mime_type": mime_type,
                        "data": image_data
                    }
                }
            ]
        }
        
        result = genai.embed_content(
            model=self.model,
            content=content,
            task_type=task_type,
            output_dimensionality=self.dimensions
        )
        return result["embedding"]
    
    def embed_query(self, query: str) -> list[float]:
        return self.embed_text(query, task_type="RETRIEVAL_QUERY")

Text Document Ingestion

import PyPDF2
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client.models import PointStruct
from pathlib import Path
import uuid
import json
from datetime import datetime

class TextDocumentIngestor:
    def __init__(self, embedding_client, qdrant_client, collection_name):
        self.embedder = embedding_client
        self.qdrant = qdrant_client
        self.collection = collection_name
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=512,
            chunk_overlap=64,
            separators=["\n\n", "\n", ".", "!", "?", " "]
        )
    
    def ingest_pdf(self, file_path: str, metadata: dict = None) -> int:
        """Ingest a PDF file, returns number of chunks ingested."""
        
        reader = PyPDF2.PdfReader(file_path)
        all_chunks = []
        parent_id = str(uuid.uuid4())
        
        for page_num, page in enumerate(reader.pages):
            text = page.extract_text()
            if not text.strip():
                continue
            
            chunks = self.splitter.split_text(text)
            
            for chunk_idx, chunk in enumerate(chunks):
                if len(chunk.strip()) < 50:  # Skip very short chunks
                    continue
                    
                embedding = self.embedder.embed_text(chunk)
                
                point = PointStruct(
                    id=str(uuid.uuid4()),
                    vector=embedding,
                    payload={
                        "content_type": "text",
                        "source_document": Path(file_path).name,
                        "page_number": page_num + 1,
                        "chunk_index": chunk_idx,
                        "raw_content": chunk,
                        "summary": chunk[:200],  # Use first 200 chars as quick summary
                        "parent_doc_id": parent_id,
                        "created_at": datetime.utcnow().isoformat(),
                        **(metadata or {})
                    }
                )
                all_chunks.append(point)
        
        # Batch upsert to Qdrant (skip the call if nothing was extracted)
        if all_chunks:
            self.qdrant.upsert(
                collection_name=self.collection,
                points=all_chunks
            )
        
        print(f"Ingested {len(all_chunks)} text chunks from {file_path}")
        return len(all_chunks)

Image and Visual Content Ingestion

For PDFs with significant visual content (charts, diagrams, scanned pages), extract each page as an image and embed it directly:

from PIL import Image
import io
import fitz  # PyMuPDF - install with: pip install pymupdf

class ImageIngestor:
    def __init__(self, embedding_client, qdrant_client, collection_name, gemini_model):
        self.embedder = embedding_client
        self.qdrant = qdrant_client
        self.collection = collection_name
        self.gemini = gemini_model  # For generating captions
    
    def generate_image_caption(self, image_path: str) -> str:
        """Use Gemini to generate a text description of the image."""
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode("utf-8")
        
        model = genai.GenerativeModel(os.getenv("GEMINI_MODEL", "gemini-2.0-flash"))
        
        response = model.generate_content([
            {
                "inline_data": {
                    "mime_type": "image/png",
                    "data": image_data
                }
            },
            "Describe this image in detail. If it's a chart or graph, describe the data it shows. "
            "If it's a table, summarize the key information. Be specific and factual."
        ])
        
        return response.text
    
    def ingest_pdf_as_images(self, file_path: str, dpi: int = 150) -> int:
        """Extract each PDF page as an image and embed it."""
        
        doc = fitz.open(file_path)
        parent_id = str(uuid.uuid4())
        points = []
        
        for page_num in range(len(doc)):
            page = doc[page_num]
            
            # Render page to image
            mat = fitz.Matrix(dpi/72, dpi/72)
            pix = page.get_pixmap(matrix=mat)
            
            # Save temp image
            temp_path = f"/tmp/page_{page_num}.png"
            pix.save(temp_path)
            
            # Generate embedding and caption
            embedding = self.embedder.embed_image(temp_path)
            caption = self.generate_image_caption(temp_path)
            
            point = PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={
                    "content_type": "image",
                    "source_document": Path(file_path).name,
                    "page_number": page_num + 1,
                    "raw_content": temp_path,
                    "summary": caption,
                    "parent_doc_id": parent_id,
                    "created_at": datetime.utcnow().isoformat()
                }
            )
            points.append(point)
        
        self.qdrant.upsert(collection_name=self.collection, points=points)
        print(f"Ingested {len(points)} page images from {file_path}")
        return len(points)

Audio Ingestion

import google.generativeai as genai

class AudioIngestor:
    def __init__(self, embedding_client, qdrant_client, collection_name):
        self.embedder = embedding_client
        self.qdrant = qdrant_client
        self.collection = collection_name
    
    def transcribe_audio(self, audio_path: str) -> str:
        """Use Gemini to transcribe audio."""
        model = genai.GenerativeModel(os.getenv("GEMINI_MODEL"))
        
        with open(audio_path, "rb") as f:
            audio_data = base64.b64encode(f.read()).decode("utf-8")
        
        suffix = Path(audio_path).suffix.lower()
        mime_map = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".m4a": "audio/mp4"}
        mime_type = mime_map.get(suffix, "audio/mpeg")
        
        response = model.generate_content([
            {"inline_data": {"mime_type": mime_type, "data": audio_data}},
            "Transcribe this audio recording accurately. Include speaker labels if multiple speakers."
        ])
        
        return response.text
    
    def embed_audio_file(self, audio_path: str, metadata: dict = None) -> list[float]:
        """Embed an audio file directly using Gemini Embedding 2."""
        with open(audio_path, "rb") as f:
            audio_data = base64.b64encode(f.read()).decode("utf-8")
        
        suffix = Path(audio_path).suffix.lower()
        mime_map = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".m4a": "audio/mp4"}
        mime_type = mime_map.get(suffix, "audio/mpeg")
        
        content = {
            "parts": [
                {"inline_data": {"mime_type": mime_type, "data": audio_data}}
            ]
        }
        
        result = genai.embed_content(
            model=os.getenv("EMBEDDING_MODEL"),
            content=content,
            task_type="RETRIEVAL_DOCUMENT",
            output_dimensionality=int(os.getenv("EMBEDDING_DIMENSIONS", 1536))
        )
        return result["embedding"]
    
    def ingest_audio(self, audio_path: str, metadata: dict = None) -> str:
        """Ingest an audio file: embed it and store with transcript."""
        
        transcript = self.transcribe_audio(audio_path)
        embedding = self.embed_audio_file(audio_path)
        
        point = PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "content_type": "audio",
                "source_document": Path(audio_path).name,
                "raw_content": audio_path,
                "summary": transcript[:500],
                "full_transcript": transcript,
                "created_at": datetime.utcnow().isoformat(),
                **(metadata or {})
            }
        )
        
        self.qdrant.upsert(collection_name=self.collection, points=[point])
        print(f"Ingested audio file: {audio_path}")
        return transcript

Building the Retrieval and Re-ranking Layer

The retrieval layer is where most document intelligence systems fall apart. A single nearest-neighbor search works in demos but fails in production. Real users ask ambiguous questions, use different vocabulary than what’s in the documents, and often need context from multiple sources.

Multi-Stage Retrieval

The architecture here uses three stages:

  1. Dense retrieval — Gemini Embedding 2 cosine similarity search (top 20 candidates)
  2. Keyword re-ranking — Filter and boost based on exact term matches in metadata
  3. Cross-encoder re-ranking — Use a small LLM to score each candidate against the query

from qdrant_client.models import Filter, FieldCondition, MatchValue
from typing import Optional

class MultimodalRetriever:
    def __init__(self, embedding_client, qdrant_client, collection_name, gemini_model):
        self.embedder = embedding_client
        self.qdrant = qdrant_client
        self.collection = collection_name
        self.gemini = gemini_model
    
    def retrieve(
        self, 
        query: str, 
        top_k: int = 5, 
        content_type_filter: Optional[str] = None,
        source_filter: Optional[str] = None
    ) -> list[dict]:
        """
        Retrieve relevant chunks for a query using multi-stage search.
        Returns ranked list of results with metadata.
        """
        
        # Stage 1: Dense retrieval
        query_embedding = self.embedder.embed_query(query)
        
        # Build optional filters
        must_conditions = []
        if content_type_filter:
            must_conditions.append(
                FieldCondition(key="content_type", match=MatchValue(value=content_type_filter))
            )
        if source_filter:
            must_conditions.append(
                FieldCondition(key="source_document", match=MatchValue(value=source_filter))
            )
        
        search_filter = Filter(must=must_conditions) if must_conditions else None
        
        results = self.qdrant.search(
            collection_name=self.collection,
            query_vector=query_embedding,
            limit=20,  # Retrieve more than needed for re-ranking
            query_filter=search_filter,
            with_payload=True
        )
        
        if not results:
            return []
        
        # Stage 2: Re-rank using Gemini
        reranked = self._rerank_results(query, results, top_k)
        
        return reranked
    
    def _rerank_results(self, query: str, candidates: list, top_k: int) -> list[dict]:
        """Use Gemini to re-rank candidates by relevance to the query."""
        
        model = genai.GenerativeModel(os.getenv("GEMINI_MODEL"))
        
        # Build the re-ranking prompt
        candidate_text = "\n\n".join([
            f"[{i}] (Type: {r.payload.get('content_type', 'unknown')}) "
            f"{r.payload.get('summary', r.payload.get('raw_content', ''))[:300]}"
            for i, r in enumerate(candidates)
        ])
        
        prompt = f"""You are a relevance judge. Given a query and a list of document excerpts, 
rank the TOP {top_k} most relevant excerpts by their index number.

Query: {query}

Candidates:
{candidate_text}

Return ONLY a JSON array of the top {top_k} index numbers, ordered by relevance (most relevant first).
Example: [3, 0, 7, 1, 5]
Return only the JSON array, no other text."""
        
        response = model.generate_content(prompt)
        
        try:
            # Parse the ranking
            ranking = json.loads(response.text.strip())
            ranked_results = []
            for idx in ranking:
                if 0 <= idx < len(candidates):
                    result = candidates[idx]
                    ranked_results.append({
                        "id": result.id,
                        "score": result.score,
                        "content_type": result.payload.get("content_type"),
                        "source": result.payload.get("source_document"),
                        "page": result.payload.get("page_number"),
                        "summary": result.payload.get("summary", ""),
                        "raw_content": result.payload.get("raw_content", ""),
                        "full_transcript": result.payload.get("full_transcript", ""),
                        "parent_doc_id": result.payload.get("parent_doc_id")
                    })
            return ranked_results[:top_k]
        except (json.JSONDecodeError, IndexError):
            # Fallback: return top_k by original score
            return [
                {
                    "id": r.id,
                    "score": r.score,
                    "content_type": r.payload.get("content_type"),
                    "source": r.payload.get("source_document"),
                    "summary": r.payload.get("summary", ""),
                    "raw_content": r.payload.get("raw_content", ""),
                }
                for r in candidates[:top_k]
            ]

Handling Cross-Modal Context Retrieval

One of the most powerful patterns in multimodal retrieval is fetching the full context around a retrieved chunk. If you retrieve an image from page 7 of a document, you probably also want the text from that page. The method below, added to MultimodalRetriever, does exactly that:

def get_surrounding_context(self, parent_doc_id: str, page_number: int) -> list[dict]:
    """Fetch all chunks from the same page of a document."""
    
    results = self.qdrant.scroll(
        collection_name=self.collection,
        scroll_filter=Filter(
            must=[
                FieldCondition(key="parent_doc_id", match=MatchValue(value=parent_doc_id)),
                FieldCondition(key="page_number", match=MatchValue(value=page_number))
            ]
        ),
        limit=50
    )
    
    return [
        {
            "content_type": r.payload.get("content_type"),
            "summary": r.payload.get("summary", ""),
            "raw_content": r.payload.get("raw_content", "")
        }
        for r in results[0]
    ]

Assembling the Document Intelligence Agent

Now that you have ingestion and retrieval working, you can build the agent layer on top. The agent handles user queries, calls the retrieval pipeline, assembles context, and generates structured responses.

Agent Architecture

The agent follows a simple loop:

  1. Receive user query
  2. Classify query intent (search all types, or specific content type?)
  3. Retrieve top candidates
  4. Fetch surrounding context if needed
  5. Assemble context window
  6. Generate response with source citations

import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentResponse:
    answer: str
    sources: list[dict]
    content_types_used: list[str]
    confidence: str

class DocumentIntelligenceAgent:
    def __init__(
        self, 
        retriever: MultimodalRetriever,
        embedding_client: GeminiEmbeddingClient
    ):
        self.retriever = retriever
        self.embedder = embedding_client
        self.model = genai.GenerativeModel(os.getenv("GEMINI_MODEL"))
        self.conversation_history = []
    
    def classify_query(self, query: str) -> dict:
        """Determine what type of content the query is looking for."""
        
        prompt = f"""Classify this search query for a document intelligence system.

Query: "{query}"

Determine:
1. content_type_hint: "text", "image", "audio", "video", or "all" (if unclear or spans types)
2. query_type: "factual", "analytical", "comparison", "summary", or "visual"
3. needs_full_document: true if the user needs a full document vs. a specific excerpt

Respond with valid JSON only.
Example: {{"content_type_hint": "image", "query_type": "visual", "needs_full_document": false}}"""
        
        response = self.model.generate_content(prompt)
        
        try:
            return json.loads(response.text.strip())
        except json.JSONDecodeError:
            return {"content_type_hint": "all", "query_type": "factual", "needs_full_document": False}
    
    def assemble_context(self, retrieved_chunks: list[dict]) -> str:
        """Build the context string for the LLM from retrieved chunks."""
        
        context_parts = []
        
        for i, chunk in enumerate(retrieved_chunks):
            content_type = chunk.get("content_type", "unknown")
            source = chunk.get("source", "unknown source")
            page = chunk.get("page", "")
            page_str = f", page {page}" if page else ""
            
            if content_type == "text":
                content = chunk.get("raw_content", chunk.get("summary", ""))
                context_parts.append(
                    f"[Source {i+1}: {source}{page_str} | Type: text]\n{content}"
                )
            
            elif content_type == "image":
                caption = chunk.get("summary", "No description available")
                context_parts.append(
                    f"[Source {i+1}: {source}{page_str} | Type: image/visual]\n"
                    f"Visual content description: {caption}"
                )
            
            elif content_type == "audio":
                transcript = chunk.get("full_transcript", chunk.get("summary", ""))
                context_parts.append(
                    f"[Source {i+1}: {source} | Type: audio recording]\n"
                    f"Transcript: {transcript[:1000]}"
                )
            
            elif content_type == "video":
                description = chunk.get("summary", "No description available")
                context_parts.append(
                    f"[Source {i+1}: {source} | Type: video]\n"
                    f"Video content: {description}"
                )
        
        return "\n\n---\n\n".join(context_parts)
    
    def answer(self, query: str, top_k: int = 5) -> AgentResponse:
        """Process a user query and return a structured response."""
        
        # Step 1: Classify the query
        classification = self.classify_query(query)
        content_type_hint = classification.get("content_type_hint")
        
        # Step 2: Retrieve relevant chunks
        content_filter = content_type_hint if content_type_hint != "all" else None
        
        retrieved = self.retriever.retrieve(
            query=query,
            top_k=top_k,
            content_type_filter=content_filter
        )
        
        if not retrieved:
            return AgentResponse(
                answer="I couldn't find relevant information in the indexed documents.",
                sources=[],
                content_types_used=[],
                confidence="low"
            )
        
        # Step 3: Assemble context
        context = self.assemble_context(retrieved)
        content_types_used = list(set(c.get("content_type") for c in retrieved))
        
        # Step 4: Generate answer
        system_prompt = """You are a document intelligence assistant. 
You have access to content from multiple document types: text documents, images, audio recordings, and videos.
When answering, cite your sources using [Source N] notation.
If visual content is relevant, describe what it shows.
Be concise and factual. If you're uncertain, say so."""
        
        full_prompt = f"""{system_prompt}

Context from indexed documents:
{context}

User question: {query}

Provide a clear, accurate answer based on the context above. Cite specific sources."""
        
        response = self.model.generate_content(full_prompt)
        
        # Step 5: Package response
        sources = [
            {
                "source": c.get("source"),
                "page": c.get("page"),
                "type": c.get("content_type"),
                "excerpt": c.get("summary", "")[:150]
            }
            for c in retrieved
        ]
        
        return AgentResponse(
            answer=response.text,
            sources=sources,
            content_types_used=content_types_used,
            confidence="high" if len(retrieved) >= 3 else "medium"
        )
    
    def chat(self, query: str) -> str:
        """Simple conversational interface."""
        
        self.conversation_history.append({"role": "user", "content": query})
        result = self.answer(query)
        self.conversation_history.append({"role": "assistant", "content": result.answer})
        
        # Format output
        output = f"{result.answer}\n\n"
        if result.sources:
            output += "**Sources:**\n"
            for s in result.sources:
                type_label = s.get("type", "document")
                page = f", p.{s['page']}" if s.get("page") else ""
                output += f"- {s['source']}{page} [{type_label}]\n"
        
        return output

Running the Full Pipeline

# Wire everything together

import os

from qdrant_client import QdrantClient

# Initialize clients
embedding_client = GeminiEmbeddingClient()
qdrant_client = QdrantClient(url=os.getenv("QDRANT_URL"))
collection_name = os.getenv("QDRANT_COLLECTION")

# Ingestion
text_ingestor = TextDocumentIngestor(embedding_client, qdrant_client, collection_name)
image_ingestor = ImageIngestor(embedding_client, qdrant_client, collection_name, None)
audio_ingestor = AudioIngestor(embedding_client, qdrant_client, collection_name)

# Ingest documents
text_ingestor.ingest_pdf("documents/q3_financial_report.pdf")
image_ingestor.ingest_pdf_as_images("documents/product_presentation.pdf")
audio_ingestor.ingest_audio("recordings/team_meeting_june.mp3")

# Build retriever and agent
retriever = MultimodalRetriever(
    embedding_client, qdrant_client, collection_name, 
    os.getenv("GEMINI_MODEL")
)

agent = DocumentIntelligenceAgent(retriever, embedding_client)

# Query the agent
print(agent.chat("What were the main revenue drivers in Q3?"))
print(agent.chat("What did the team discuss about the product roadmap?"))
print(agent.chat("Show me the chart comparing sales by region"))

Production Considerations and Performance Tuning

Managing Embedding Costs

Gemini Embedding 2 is not free at scale. At production volumes, embedding costs can add up quickly, especially if you’re re-embedding content unnecessarily.

Key strategies to manage costs:

  • Cache embeddings — Store embeddings in your vector database and only re-embed when content changes. A simple hash of the content determines if re-embedding is needed.
  • Use MRL truncation — If 1536 dimensions gives equivalent recall on your dataset to 3072, use 1536. This halves your storage and speeds up search.
  • Batch embedding requests — Process chunks in batches of 10–25 rather than one at a time. This improves throughput significantly.
  • Skip redundant content — Don’t embed the same document twice. Maintain an ingestion log with content hashes.
import hashlib

from qdrant_client.models import FieldCondition, Filter, MatchValue

def content_hash(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()

# Before ingesting, check if this content hash already exists
def is_already_ingested(qdrant_client, collection, content_hash_val: str) -> bool:
    results = qdrant_client.scroll(
        collection_name=collection,
        scroll_filter=Filter(
            must=[FieldCondition(key="content_hash", match=MatchValue(value=content_hash_val))]
        ),
        limit=1
    )
    return len(results[0]) > 0
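
The MRL truncation strategy above can be sketched as a small helper. Because the embeddings are trained with Matryoshka Representation Learning, keeping only the first 1536 values preserves most of the signal, but the truncated vector should be re-normalized before cosine search. This is a minimal sketch, assuming you apply it identically to document and query vectors:

```python
import math

def truncate_embedding(vector: list[float], dims: int = 1536) -> list[float]:
    """Truncate an MRL-trained embedding to its first `dims` values,
    then re-normalize to unit length so cosine similarity stays meaningful."""
    head = vector[:dims]
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0:
        return head
    return [v / norm for v in head]
```

Remember to create the Qdrant collection with the truncated dimension, and to truncate query embeddings the same way — mixing 3072- and 1536-dimensional vectors in one collection will fail.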

Improving Retrieval Quality

A few patterns that consistently improve retrieval accuracy in production:

HyDE (Hypothetical Document Embeddings) — Instead of embedding the raw query, ask Gemini to generate a hypothetical document that would answer the query, then embed that. This often improves recall because it shifts the query embedding closer to document-style language.

def embed_with_hyde(query: str, embedding_client: GeminiEmbeddingClient) -> list[float]:
    model = genai.GenerativeModel(os.getenv("GEMINI_MODEL"))
    
    hypothetical_doc = model.generate_content(
        f"Write a short document excerpt that would directly answer this question: {query}"
    )
    
    return embedding_client.embed_text(hypothetical_doc.text, task_type="RETRIEVAL_DOCUMENT")

Query expansion — Generate 2–3 variations of the user query and retrieve against all of them, then deduplicate:

import json

def expand_query(query: str) -> list[str]:
    model = genai.GenerativeModel(os.getenv("GEMINI_MODEL"))
    
    response = model.generate_content(
        f"""Generate 2 alternative phrasings of this search query.
        
Original: {query}
        
Return a JSON array with 2 alternative queries.
Example: ["alternative 1", "alternative 2"]"""
    )
    
    try:
        alternatives = json.loads(response.text.strip())
        return [query] + alternatives[:2]
    except json.JSONDecodeError:
        # Fall back to the original query if the model's output isn't valid JSON
        return [query]
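
Retrieving against each expanded query and then deduplicating can be done with a small merge step. This sketch assumes each hit is a dict carrying an `id` and a `score` (the shape Qdrant results reduce to); when the same point matches multiple query variants, we keep its best score:

```python
def merge_hits(result_sets: list[list[dict]], top_k: int = 5) -> list[dict]:
    """Merge hits from multiple query variants, deduplicating by id
    and keeping the highest score per point."""
    best: dict = {}
    for hits in result_sets:
        for hit in hits:
            hid = hit["id"]
            if hid not in best or hit["score"] > best[hid]["score"]:
                best[hid] = hit
    merged = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return merged[:top_k]
```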

Scaling Beyond a Single Collection

When document volumes grow past a few thousand documents, consider these architecture changes:

  • Namespace by document type — Create separate collections for text, image, audio, and video. This makes targeted queries faster and allows different index configurations per type.
  • Shard by department or project — If this is a multi-team system, namespace collections by team to avoid cross-contamination and simplify access control.
  • Add a sparse vector index — Qdrant supports sparse vectors alongside dense vectors. Adding a BM25-style sparse index enables true hybrid search, which outperforms pure dense search on factual, keyword-heavy queries.
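
Namespacing by document type can be as simple as a routing function that maps a content type to its collection. A minimal sketch — the collection names here are illustrative, not part of the pipeline code above:

```python
# Hypothetical routing table — collection names are illustrative.
COLLECTIONS = {
    "text": "docs_text",
    "image": "docs_image",
    "audio": "docs_audio",
    "video": "docs_video",
}

def collection_for(content_type: str) -> str:
    """Route a chunk to its per-modality collection, falling back to text."""
    return COLLECTIONS.get(content_type, COLLECTIONS["text"])
```

Targeted queries ("show me the chart comparing sales by region") can then search only the image collection, while broad queries fan out across all four and merge results.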

Where MindStudio Fits Into This Architecture

Building this agent in pure Python is instructive — it shows exactly what’s happening at each layer. But running it in production means dealing with infrastructure that has nothing to do with the actual intelligence: hosting, authentication, rate limiting, scheduling, user interfaces, and integrations with the rest of your toolchain.

This is where MindStudio is useful. MindStudio is a no-code platform for building and deploying AI agents, and it supports Gemini models — including access to Gemini’s embedding and generation capabilities — without requiring you to manage your own API keys, hosting, or scaling infrastructure.

If you want to deploy the document intelligence agent described here as a production tool for a team, MindStudio lets you:

  • Wrap the agent in a web app with a proper UI — no frontend development needed
  • Connect it to Google Drive, Notion, or SharePoint natively to pull in documents automatically
  • Schedule background ingestion jobs that keep the index current as new documents are added
  • Build Slack or email interfaces so teammates can query the agent without opening a separate app
  • Share it across a team with access controls, without exposing API keys or infrastructure

The core retrieval and generation logic you’ve built here maps cleanly onto MindStudio’s workflow model. You can use their Gemini integrations for the generation layer while connecting your Qdrant instance via webhook, or migrate the whole ingestion pipeline into automated background agents that run on a schedule.

MindStudio’s average build takes under an hour — which is significantly less than setting up auth, rate limiting, a frontend, and deployment infrastructure from scratch. You can try it free at mindstudio.ai.

If you’re already comfortable managing the backend Python code, MindStudio also offers an Agent Skills Plugin — an npm SDK that lets any AI agent call MindStudio’s 120+ typed capabilities (including email, image generation, Google search, and workflow execution) as simple method calls, so you can extend your agent’s capabilities without rebuilding everything.


Common Issues and How to Fix Them

Embeddings Are Being Generated but Retrieval Is Poor

Problem: The vector search returns results with low cosine similarity scores (< 0.5), and results feel irrelevant.

Likely causes and fixes:

  • Make sure you’re using RETRIEVAL_QUERY for queries and RETRIEVAL_DOCUMENT for documents. Using the wrong task type significantly degrades retrieval quality.
  • Check your chunking. Chunks that are too short (< 50 tokens) or too long (> 600 tokens) both hurt embedding quality.
  • Verify that your collection was created with the same dimensions as your embeddings. A mismatch causes Qdrant to reject or silently mishandle vectors.
  • For image retrieval specifically, poor results often mean the images are too low-resolution or heavily compressed. Try re-ingesting at higher DPI.
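
A quick diagnostic for the chunking problem is to scan your chunks' approximate token counts before ingesting. This sketch uses a rough ~1.3 tokens-per-word estimate (an assumption — use a real tokenizer for precise counts):

```python
def flag_bad_chunks(chunks: list[str], min_tokens: int = 50, max_tokens: int = 600) -> list[int]:
    """Return indices of chunks whose approximate token count falls
    outside the recommended range, using ~1.3 tokens per word."""
    bad = []
    for i, chunk in enumerate(chunks):
        approx_tokens = int(len(chunk.split()) * 1.3)
        if approx_tokens < min_tokens or approx_tokens > max_tokens:
            bad.append(i)
    return bad
```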

API Errors on Multimodal Inputs

Problem: 400 Invalid argument errors when embedding images or audio.

Fixes:

  • Image files must be under 20MB. Large PDFs rendered at high DPI can exceed this. Reduce DPI from 300 to 150.
  • Audio files have a 9.5MB inline data limit. For longer recordings, upload to Google Cloud Storage and pass the URI instead of base64-encoded data.
  • Check MIME types. audio/mpeg is correct for MP3, but audio/mp4 is correct for M4A — passing the wrong MIME type causes silent failures.

Context Window Overflow When Generating Responses

Problem: Your assembled context is too long for the Gemini generation model’s context window.

Fixes:

  • Reduce top_k from 5 to 3 during response generation
  • Store shorter summaries at ingest time and use those for the context window instead of raw content
  • Implement a context window budget — allocate a max token count per source and truncate accordingly
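
A simple version of that budget splits a fixed character allowance evenly across retrieved sources and truncates each excerpt to its share (characters are a crude stand-in for tokens — roughly 4 characters per token for English text). The `content` key is an assumption about the retrieved-chunk shape:

```python
def build_context(sources: list[dict], budget_chars: int = 6000) -> str:
    """Assemble context under a fixed character budget, split evenly
    across sources, truncating each excerpt to its share."""
    if not sources:
        return ""
    per_source = budget_chars // len(sources)
    parts = []
    for i, src in enumerate(sources, start=1):
        excerpt = src.get("content", "")[:per_source]
        parts.append(f"[Source {i}] {excerpt}")
    return "\n\n".join(parts)
```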

Re-ranking Is Slow in Production

Problem: The Gemini re-ranking step adds 2–4 seconds of latency per query.

Fixes:

  • Use a smaller, faster model for re-ranking than the one used for answer generation (e.g., a Flash-tier model rather than a larger Pro-tier model)
  • Move re-ranking to an async background task and return initial dense search results immediately, then update when re-ranking completes
  • For non-critical use cases, skip re-ranking entirely and rely on cosine similarity scores with a minimum threshold (e.g., only return results with score > 0.6)
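
The threshold fallback is worth making explicit as a helper. This sketch assumes each hit carries a `score` field, as Qdrant search results do:

```python
def filter_by_score(hits: list[dict], min_score: float = 0.6) -> list[dict]:
    """Drop results below a minimum cosine similarity threshold,
    as a cheap substitute for LLM re-ranking."""
    return [h for h in hits if h["score"] > min_score]
```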

Frequently Asked Questions

What content types does Gemini Embedding 2 support?

Gemini Embedding 2 (specifically gemini-embedding-exp-03-07) supports text, images (PNG, JPEG, GIF, WebP), audio (MP3, WAV, M4A, OGG), and video. All inputs are encoded into the same 3072-dimensional vector space, enabling cross-modal similarity search. Text has the broadest support and the most mature performance characteristics; image and audio multimodal embedding is newer and still marked as experimental in some contexts.

How does Gemini Embedding 2 compare to OpenAI’s text-embedding-3-large?

The key difference is modality. OpenAI’s text-embedding-3-large is text-only, which means any document intelligence system using it requires separate pipelines for vision and audio content. Gemini Embedding 2 handles all modalities in a single API call with a unified vector space. For pure text retrieval tasks, benchmark performance is comparable between the two models. Gemini Embedding 2 scores notably higher on multilingual benchmarks like MTEB, making it a better choice for non-English or mixed-language document sets.

Can I use Gemini Embedding 2 without Google Cloud?

Yes. Gemini Embedding 2 is available through Google AI Studio using a standard API key — no Google Cloud account, no Vertex AI setup required. The free tier allows 1,500 requests per day. For production workloads, you’ll either need a paid AI Studio plan or migrate to Vertex AI for higher rate limits and enterprise SLAs.

What vector database works best with Gemini embeddings?

There’s no single best choice — it depends on your scale and constraints. For local development and small projects, ChromaDB is the fastest to set up. For production systems with moderate scale (up to a few million vectors), Qdrant offers excellent performance and strong filtering. For enterprise-scale deployments, Pinecone or Weaviate are good managed options with strong support and SLAs. The most important thing is that your chosen database supports 1536 or 3072-dimensional vectors and metadata filtering.

What’s the maximum document size I can process?

For text, the model handles inputs up to 8,192 tokens per call — roughly 6,000 words. Longer documents need chunking. For images, the file size limit is 20MB. For audio, inline base64 encoding supports files up to ~9.5MB; larger audio files should be stored in Google Cloud Storage and passed as URIs. Video support has additional constraints — short clips under 60 seconds work best when processed as sequences of keyframes.

How accurate is cross-modal retrieval in practice?

Cross-modal retrieval works well for direct semantic matches — asking “what does the revenue chart show?” should reliably retrieve a chart image if it exists and is correctly captioned. It’s weaker for highly specific factual queries (“what was the exact revenue figure in Q3 2024?”) where the answer is embedded in a chart image rather than text. A practical fix is to always generate text captions for visual content at ingest time and store them as metadata, giving you both dense embedding search and keyword fallback.


Key Takeaways

Building a multimodal document intelligence agent with Gemini Embedding 2 is genuinely practical today, not just a research exercise. Here’s a summary of what you’ve covered:

  • Gemini Embedding 2 uses a unified vector space for text, images, audio, and video — this is the architectural foundation that makes cross-modal retrieval possible without separate pipelines.
  • The ingestion pipeline is the most important part. Good chunking, rich metadata, and auto-generated captions at ingest time directly determine retrieval quality.
  • Multi-stage retrieval (dense + re-ranking) consistently outperforms single-stage search in production, especially for ambiguous or analytical queries.
  • Task types matter — always use RETRIEVAL_DOCUMENT for indexing and RETRIEVAL_QUERY for search. Getting this wrong degrades quality significantly.
  • HyDE and query expansion are high-impact, low-effort improvements for retrieval quality that don’t require changes to your index.
  • Production deployment introduces non-trivial infrastructure overhead — authentication, scheduling, UIs, integrations. Tools like MindStudio can handle this layer so you can focus on the intelligence layer.

If you want to take this further without building infrastructure from scratch, MindStudio lets you deploy agents like this one as production-ready tools with built-in Gemini model access, integrations, and workflow automation. Start building free at mindstudio.ai.