Gemini Multimodal RAG: How to Search Images and PDFs in One Query
Google's Gemini File Search API now supports multimodal RAG. Learn how to embed images and text together and query both with page-level citations.
What Multimodal RAG Actually Means
Most retrieval-augmented generation (RAG) systems treat documents as text. They chunk the content, embed the chunks as vectors, and retrieve the most relevant pieces when a query comes in. That works well for plain text — but it falls apart the moment your documents contain charts, diagrams, scanned pages, or image-heavy PDFs.
Gemini multimodal RAG changes that. By combining Google’s native document understanding with semantic search, you can now query across images and text in a single pipeline — and get back answers with page-level citations that point to exactly where the information came from.
This post covers how multimodal RAG works with the Gemini API, what the File Search API enables, how to embed images and text together, and how to build this into a usable workflow.
Why Standard RAG Fails on Rich Documents
Before getting into Gemini specifics, it helps to understand why traditional RAG breaks on real-world documents.
The text-only extraction problem
When you extract text from a PDF using a standard parser, you lose a lot. Tables get flattened into a stream of numbers. Captions get separated from their images. Infographics become blank spaces. Diagrams disappear entirely.
For many business documents — financial reports, technical manuals, research papers, product catalogs — the meaning is partially or entirely in the visual elements. A RAG system that ignores those elements gives incomplete or wrong answers.
Chunking doesn’t respect visual context
Standard chunking splits documents into fixed-size text segments. A sentence referring to a chart on the same page might get split into a different chunk than the surrounding context. The retrieval system then surfaces that chunk without the visual that makes it meaningful.
OCR is a workaround, not a solution
Some pipelines run OCR over scanned documents to convert images to text. OCR quality has improved significantly, but it still drops formatting, misreads handwriting, and can’t interpret charts or non-text visuals at all.
Multimodal RAG addresses all three of these problems by treating documents as pages — preserving layout, text, and images together.
How Gemini Handles Multimodal Documents Natively
Gemini 1.5 Pro and Gemini 2.0 models are natively multimodal, which means they can process text, images, PDFs, audio, and video as input. When you send a PDF to Gemini, it doesn’t just extract the text — it processes each page visually, seeing the layout, images, tables, and text as a unified whole.
This is the foundation of Gemini multimodal RAG. Instead of stripping documents down to text before retrieval, you keep the full page representation intact.
The Gemini File API
The Gemini File API allows you to upload documents and media files that can then be referenced in API requests. Files are stored temporarily (up to 48 hours) and can be referenced by URI in subsequent requests.
Supported file types include:
- PDFs (processed page-by-page as images)
- Images (JPEG, PNG, WebP, GIF, HEIC)
- Audio and video files
- Plain text
Once a file is uploaded, you reference it in a prompt alongside your query. Gemini can then analyze the file content in context with your question.
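For example, a single-file query with the google-generativeai SDK might look like this (a minimal sketch; the file name and question are placeholders):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload once, then reference the returned file object directly in the prompt
report = genai.upload_file(path="report.pdf", mime_type="application/pdf")
model = genai.GenerativeModel("gemini-1.5-pro")

response = model.generate_content([report, "Summarize the key findings in this document."])
print(response.text)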
What page-level processing looks like
When Gemini processes a PDF, each page is rendered as an image. The model can see:
- Text and its position on the page
- Images and figures with their associated captions
- Tables with their structure preserved
- Headers, footers, and section markers
- Visual layouts that carry meaning (e.g., sidebars, callout boxes)
This means a query like “What does the revenue chart on page 4 show?” actually works — because Gemini can see the chart.
Building a Multimodal RAG Pipeline with Gemini
Here’s how a working multimodal RAG system with Gemini is structured, from document ingestion to query response.
Step 1: Upload documents via the File API
Start by uploading your PDFs and images using the File API. Each upload returns a file URI and metadata including the MIME type and file name.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload a PDF
uploaded_file = genai.upload_file(
    path="financial_report.pdf",
    mime_type="application/pdf",
    display_name="Q3 Financial Report"
)

print(uploaded_file.uri)  # e.g., files/abc123
For a large document corpus, you’d loop through files and store the returned URIs alongside metadata (document name, page count, source) in a database.
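A rough sketch of that ingestion loop follows; the corpus directory and record fields are illustrative, and pypdf is just one way to read the page count:

from pathlib import Path
from datetime import datetime, timezone
import google.generativeai as genai
from pypdf import PdfReader  # any PDF library that exposes the page count works

records = []
for pdf_path in Path("corpus/").glob("*.pdf"):
    uploaded = genai.upload_file(path=str(pdf_path), mime_type="application/pdf")
    records.append({
        "document_id": pdf_path.stem,
        "file_uri": uploaded.uri,
        "page_count": len(PdfReader(str(pdf_path)).pages),
        "source": str(pdf_path),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),  # used later to handle expiration
    })

# Persist `records` in your database of choice instead of keeping them in memory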
Step 2: Generate embeddings for retrieval
The File API alone lets you query a single file at a time. For multi-document retrieval, you need a separate embedding layer.
Google’s text-embedding-004 model generates semantic embeddings for text. For multimodal documents, you have a few options:
Option A: Extract text per page, embed text
Use the Gemini API to extract text from each page (including OCR’d text from images), then embed each page’s text content. This is text-only but leverages Gemini’s superior text extraction.
Option B: Use Vertex AI Multimodal Embeddings
Google’s Vertex AI offers a multimodal embedding model that can embed both image and text inputs into the same vector space. This means an image of a chart and a text query about that chart will have similar embeddings — enabling true cross-modal retrieval.
Option C: Embed page screenshots directly
Render each PDF page as an image, then use the multimodal embedding API to embed those images. Your retrieval layer then works at the page-image level.
Options B and C are more powerful for image-heavy documents. Option A is simpler and works well when documents are mostly text with occasional visuals; a sketch of Option A follows below.
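As a rough sketch of Option A, assuming each PDF page has already been rendered to a PNG with a PDF rendering library, you could have Gemini transcribe the page and then embed the result with text-embedding-004. The helper names here are illustrative, not part of the SDK:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def describe_page(page_image_path: str) -> str:
    """Ask Gemini to transcribe a page, including text in charts, tables, and figures."""
    page = genai.upload_file(path=page_image_path, mime_type="image/png")
    response = model.generate_content([
        page,
        "Transcribe all text on this page and describe any charts, tables, or figures."
    ])
    return response.text

def embed_page_text(text: str) -> list[float]:
    """Embed the extracted page text for retrieval."""
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=text,
        task_type="retrieval_document",
    )
    return result["embedding"]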
Step 3: Store embeddings with page references
Store each embedding alongside metadata that lets you reconstruct context later:
# Example metadata structure
page_record = {
    "document_id": "financial_report_q3",
    "file_uri": "files/abc123",
    "page_number": 4,
    "embedding": [...],  # 768-dim vector
    "text_extract": "Revenue increased 23% YoY...",
    "has_images": True
}
Use a vector database like Pinecone, Weaviate, Qdrant, or pgvector to store and query these records.
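As one illustration, upserting a page record into Pinecone might look roughly like this; the index name is a placeholder, and the exact calls differ slightly across vector databases and client versions:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("gemini-rag-pages")  # placeholder index name

index.upsert(vectors=[{
    "id": f"{page_record['document_id']}-p{page_record['page_number']}",
    "values": page_record["embedding"],
    "metadata": {
        "document_id": page_record["document_id"],
        "file_uri": page_record["file_uri"],
        "page_number": page_record["page_number"],
        "text_extract": page_record["text_extract"],
        "has_images": page_record["has_images"],
    },
}])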
Step 4: Retrieve relevant pages
When a user submits a query, embed the query using the same embedding model, then perform an approximate nearest-neighbor search to retrieve the top-k most relevant page records.
# Embed the query with the same model used for the documents
# (embed_text wraps the embedding call from Step 2, e.g. text-embedding-004)
query_embedding = embed_text("What drove the revenue increase in Q3?")

# Retrieve the top 5 pages from the vector database
results = vector_db.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
The results include page numbers and document references — this is what enables page-level citations.
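For example, you might collect the page numbers and source file URIs from the matches before handing them to the generation step (the field names follow the metadata structure above; the exact response shape depends on your vector database client):

matches = results["matches"]  # or results.matches, depending on the client
page_numbers = [str(m["metadata"]["page_number"]) for m in matches]
file_uris = {m["metadata"]["file_uri"] for m in matches}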
Step 5: Send retrieved pages to Gemini for answer generation
With the relevant pages identified, you now construct a Gemini API request that includes the actual file content alongside the query.
model = genai.GenerativeModel("gemini-1.5-pro")

# Reference the uploaded file and specify relevant pages
# (page_numbers: the list of page-number strings collected in the retrieval step)
response = model.generate_content([
    uploaded_file,  # The full PDF (Gemini can navigate to specific pages)
    f"Based on pages {', '.join(page_numbers)}, answer the following question. "
    f"Include the page number where you found each piece of information.\n\n"
    f"Question: What drove the revenue increase in Q3?"
])

print(response.text)
Because Gemini processes the PDF natively, it can read charts, tables, and visuals on those specific pages — not just extracted text.
Step 6: Return answers with citations
The output includes an answer that references specific pages, giving users a clear audit trail back to the source material.
Example response:
Revenue increased 23% year-over-year in Q3, driven primarily by growth in the enterprise segment (page 4) and international expansion into APAC markets (page 7). The breakdown by product line is shown in the revenue waterfall chart on page 5.
Embedding Images and Text in the Same Vector Space
The most technically interesting aspect of multimodal RAG is co-embedding — placing text and images into the same vector space so you can retrieve images with text queries and vice versa.
Why co-embedding matters
Without co-embedding, you’d need separate retrieval pipelines: one for text chunks, one for images. That means running two queries and merging results — which introduces complexity and misses connections between related image and text content.
With co-embedding, a single query retrieves both text passages and relevant images ranked by semantic similarity.
Vertex AI Multimodal Embeddings
Google’s multimodal embedding model (available through Vertex AI) accepts image and text inputs and outputs embeddings in a shared 1408-dimensional space. This allows:
- Text queries that find images: “Show me the product architecture diagram” returns the relevant diagram even if no surrounding text describes it
- Image queries that find text: Submit an image of a chart and retrieve text passages that reference or explain it
- Cross-modal coherence: A caption and its image will have similar embeddings, so retrieval respects that relationship
Practical implementation note
Vertex AI Multimodal Embeddings requires a Google Cloud project with Vertex AI enabled. The API differs slightly from the standard Gemini API — you’ll use the google-cloud-aiplatform SDK rather than google-generativeai. Both can be combined in the same pipeline: Vertex AI for embedding, Gemini API for generation.
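A minimal sketch of generating a multimodal embedding with the google-cloud-aiplatform SDK follows; the project ID, region, and file names are placeholders:

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Embed a rendered page image plus a short contextual caption into the shared 1408-dim space
embeddings = model.get_embeddings(
    image=Image.load_from_file("page_4.png"),
    contextual_text="Q3 revenue waterfall chart",
    dimension=1408,
)
page_vector = embeddings.image_embedding  # 1408 floats
text_vector = embeddings.text_embedding   # 1408 floats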
Page-Level Citations in Practice
Citations are one of the most important features in a production RAG system. Without them, users can’t verify answers, and hallucinations are harder to catch.
How Gemini generates citations
When you include page numbers in your prompt and instruct Gemini to cite sources, it references specific pages in its response. You can structure this more formally by asking for JSON output with citation metadata:
response = model.generate_content([
    uploaded_file,
    """Answer the question below. Return your response in JSON with the following structure:

{
  "answer": "...",
  "citations": [
    {"page": 4, "quote": "...", "relevance": "..."}
  ]
}

Question: What are the key risks identified in this report?"""
])
The structured output makes it easy to display citations in a UI and link them back to specific pages in the original document.
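If you request JSON, the response can be parsed directly, and setting response_mime_type in the generation config helps keep the output valid JSON. A sketch, where citation_prompt stands in for the JSON-structure prompt shown above:

import json

response = model.generate_content(
    [uploaded_file, citation_prompt],  # citation_prompt: the JSON-structure prompt above
    generation_config={"response_mime_type": "application/json"},
)

data = json.loads(response.text)
for citation in data["citations"]:
    print(f'Page {citation["page"]}: {citation["quote"]}')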
Handling multi-document citations
For pipelines that span multiple documents, include document identifiers alongside page numbers. At retrieval time, store which document each page came from, then pass that metadata into your prompt:
Sources provided:
- Document: Q3_Financial_Report.pdf, Pages: 3, 4, 7
- Document: Annual_Investor_Deck.pdf, Pages: 12
Please answer the question and cite the document name and page number for each claim.
This approach scales to large document libraries while keeping citations traceable.
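Building that sources block from retrieval results can be as simple as grouping the retrieved pages by document; here is a sketch using the metadata fields from the earlier steps:

from collections import defaultdict

# Group retrieved pages by source document
pages_by_doc = defaultdict(list)
for m in matches:
    meta = m["metadata"]
    pages_by_doc[meta["document_id"]].append(str(meta["page_number"]))

sources_block = "Sources provided:\n" + "\n".join(
    f"- Document: {doc}, Pages: {', '.join(sorted(set(pages), key=int))}"
    for doc, pages in pages_by_doc.items()
)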
Where MindStudio Fits Into This
Building a multimodal RAG pipeline from scratch involves a lot of plumbing: file upload handling, embedding generation, vector database management, Gemini API calls, structured output parsing, and UI for displaying results. Most teams want the capability without building all that infrastructure.
MindStudio’s visual workflow builder lets you assemble a multimodal RAG pipeline as a deployable AI agent — without writing the infrastructure layer yourself. You can connect to the Gemini API (one of 200+ models available on the platform), configure file inputs, add retrieval steps, and build a custom interface for querying documents — typically in under an hour.
For teams that want to expose this as an internal tool, MindStudio agents can be deployed as web apps with custom UIs, webhook endpoints, or even email-triggered agents that process attached documents automatically. You can connect the output to Google Workspace, Slack, or Notion directly, so retrieved information flows into the tools your team already uses.
If you’re iterating on a multimodal RAG setup and want to test different models or retrieval configurations without rewriting code each time, MindStudio’s no-code environment makes that practical. You can try it free at mindstudio.ai.
Common Mistakes and How to Avoid Them
Chunking PDFs before sending to Gemini
One of the most common mistakes is treating Gemini like a text-only model and pre-processing PDFs with a standard text extractor before sending to the API. This throws away the visual context that makes Gemini’s native PDF processing valuable.
Send the full PDF file via the File API and let Gemini process the pages natively.
Using only text embeddings for image-heavy documents
If your documents are heavy on charts, diagrams, or scanned images, text embeddings of extracted content will miss most of the signal. Consider multimodal embeddings or at minimum use Gemini to generate rich text descriptions of each page (including visual content) before embedding.
Not specifying page numbers in prompts
Gemini’s native PDF support can reference specific pages, but it needs guidance. Always instruct the model to cite page numbers in its answers, and consider specifying which pages to focus on based on your retrieval results.
Ignoring file expiration
Files uploaded via the Gemini File API expire after 48 hours. For production systems, you need a re-upload mechanism or use Vertex AI’s more persistent storage options. Store file URIs with timestamps and refresh them before expiration.
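A simple refresh check before each query might look like this; the record fields follow the earlier sketches, and local_path is an assumed field pointing at the source PDF on disk:

from datetime import datetime, timedelta, timezone
import google.generativeai as genai

FILE_TTL = timedelta(hours=48)

def fresh_file_uri(record: dict) -> str:
    """Re-upload the source document if its File API entry is close to expiring."""
    uploaded_at = datetime.fromisoformat(record["uploaded_at"])
    if datetime.now(timezone.utc) - uploaded_at > FILE_TTL - timedelta(hours=1):
        uploaded = genai.upload_file(path=record["local_path"], mime_type="application/pdf")
        record["file_uri"] = uploaded.uri
        record["uploaded_at"] = datetime.now(timezone.utc).isoformat()
    return record["file_uri"]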
Treating all documents the same
Some documents are mostly text with occasional images (many legal or policy documents). Others are image-first (product catalogs, engineering drawings). Tailor your embedding and retrieval strategy to the document type — a single approach won’t be optimal for both.
Frequently Asked Questions
What is multimodal RAG?
Multimodal RAG is a retrieval-augmented generation approach where the knowledge base includes both text and non-text content — images, charts, diagrams, tables — and the retrieval layer can surface relevant content from any modality. The generation model then uses that multimodal context to produce answers. Standard RAG only handles text; multimodal RAG works across content types.
Can Gemini search through PDFs natively?
Yes. Gemini 1.5 Pro and Gemini 2.0 models process PDFs natively through the File API. When you upload a PDF, Gemini processes each page visually — seeing the layout, images, and text as rendered — rather than extracting only the text. This means it can read charts, tables, and diagrams that standard text extraction misses.
What’s the difference between the Gemini File API and standard RAG?
The Gemini File API handles document upload and enables single-file querying. Standard RAG adds a retrieval layer on top: you embed document content into a vector database, retrieve the most relevant segments based on a user query, and then pass those segments to the model for answer generation. Multimodal RAG with Gemini combines both — the File API handles document understanding, while a vector retrieval layer enables multi-document search.
How do page-level citations work in Gemini RAG?
Page-level citations work by tracking which pages are retrieved during the search step, then prompting Gemini to reference specific page numbers in its answers. You can structure the prompt to request JSON output that includes citation objects with page numbers and supporting quotes. This gives users a direct path back to the source material.
Is Vertex AI required for multimodal embeddings?
For true cross-modal embeddings (where images and text share the same vector space), yes — Vertex AI’s multimodal embedding model is Google’s primary offering for this. The standard Gemini API’s text-embedding-004 model handles text only. That said, you can approximate multimodal retrieval by using Gemini to generate rich text descriptions of each page (including visual content) and then embedding those descriptions with a text embedding model.
What file types does the Gemini File API support?
The API supports PDFs, JPEG, PNG, WebP, HEIC, and GIF images, as well as MP3, WAV, and other audio formats, video files, and plain text. MIME types must be specified at upload time. PDF is the most commonly used format for document RAG applications.
Key Takeaways
- Gemini processes PDFs natively as visual page representations, not just extracted text — which means charts, tables, and images are included in what the model can see and reason over.
- A multimodal RAG pipeline with Gemini has two layers: the File API for document understanding, and a vector retrieval layer (using Vertex AI multimodal embeddings or text embeddings of extracted content) for multi-document search.
- Co-embedding images and text in the same vector space enables single-query retrieval across both modalities — no need for separate pipelines.
- Page-level citations are achievable by tracking page metadata through the retrieval step and prompting Gemini to reference specific pages in structured output.
- The biggest implementation mistakes are pre-processing PDFs before sending to Gemini, using text-only embeddings for visual documents, and not handling file expiration in production systems.
- Platforms like MindStudio let you build and deploy multimodal RAG agents without writing infrastructure from scratch — connecting Gemini’s document understanding to custom UIs and business tool integrations.