
How to Build a Multimodal RAG Chatbot for Product Manuals with Gemini Embedding 2

Learn how to build a chatbot that searches PDFs, images, and diagrams using Gemini Embedding 2 and Pinecone — no complex pipeline required.

MindStudio Team

Why Product Manual Chatbots Usually Fall Short

Product manuals are not like blog posts. They contain wiring diagrams, parts tables, installation flowcharts, safety warning icons, and labeled technical drawings — all embedded inside PDFs. Standard RAG setups extract the text, chunk it, embed it, and call it done. The result is a chatbot that can answer questions about numbered steps but has no idea what’s in Figure 3-B or why a specific torque value appears next to an arrow in a diagram.

This article walks through how to build a multimodal RAG chatbot that handles all of it — text, tables, images, and technical diagrams — using Gemini Embedding 2 and Pinecone as the vector store. The approach is cleaner than most multimodal pipelines you’ll find online, and it works well in production without requiring a GPU cluster or a research background.

By the end, you’ll have a working pipeline that ingests product manuals as PDFs, indexes both textual and visual content, and answers natural-language questions with grounded, accurate responses.


Multimodal RAG: What It Actually Means

The word “multimodal” gets used loosely. Before building anything, it’s worth being precise about what multimodal RAG means in this context.

The Standard RAG Pipeline

A typical RAG system:

  1. Splits a document into text chunks
  2. Embeds each chunk as a vector
  3. Stores vectors in a database like Pinecone
  4. At query time, embeds the question, retrieves the most similar chunks, and passes them to an LLM for generation

This works well for text-heavy documents. It breaks down for technical manuals because a significant portion of the useful information lives in non-text form.

What Makes It Multimodal

Multimodal RAG extends the pipeline to handle non-text content — specifically images and diagrams. There are two main approaches:

Approach 1 — Convert and embed: Extract images from the PDF, use a vision model to generate text descriptions of those images, then embed those descriptions alongside regular text. Everything ends up in the same vector space.

Approach 2 — Native multimodal embeddings: Use an embedding model that can directly encode both text and images into the same vector space, so you can retrieve images and text chunks together using a single query.

This article uses a hybrid: Gemini’s vision capabilities convert each image into a detailed text description, and Gemini Embedding 2 then embeds everything — ordinary text chunks and image descriptions alike — into a single vector space. This keeps the pipeline simple without sacrificing retrieval quality for visual content.

Why This Matters for Product Manuals Specifically

Consider a user asking: “What does the error light pattern mean when the motor overheats?” The answer might live entirely in a diagram on page 12, with a small text label that says “Error Code E3” next to a specific LED icon. Text extraction alone would pull in that label but lose the visual context entirely. With multimodal RAG, the diagram gets described in detail — “a diagram showing LED indicator patterns, with three red blinks indicating motor thermal overload, error code E3” — and that description becomes searchable.


Gemini Embedding 2: What Sets It Apart

Google has steadily improved its embedding models. Gemini Embedding 2 (available as models/gemini-embedding-2-preview-05-14 at the time of writing — verify the current model ID at ai.google.dev) represents a meaningful step forward from older models like text-embedding-004.

Key Characteristics

Output dimensions: Gemini Embedding 2 produces 3072-dimensional vectors by default. You can lower this with the output_dimensionality parameter for storage efficiency, though higher dimensions generally improve retrieval precision for complex technical content.

Task type specification: The model accepts a task_type parameter that tells it how to optimize the embedding. For RAG:

  • RETRIEVAL_DOCUMENT — use this when embedding chunks from your manual
  • RETRIEVAL_QUERY — use this when embedding the user’s question
  • SEMANTIC_SIMILARITY — useful for clustering or deduplication

This distinction matters. The same text embedded as a document versus a query produces slightly different vectors, optimized for asymmetric retrieval (short question finding long passages). Most older embedding models don’t support this natively.
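
You can see the asymmetry concretely by embedding the same string under both task types and comparing the vectors. A minimal sketch — the model ID is the preview name mentioned above (verify it is still current), and the cosine helper is plain Python:

```python
EMBEDDING_MODEL = "models/gemini-embedding-2-preview-05-14"  # verify at ai.google.dev

def embed(text: str, task_type: str) -> list[float]:
    # Lazy import so the cosine helper below stays dependency-free;
    # assumes genai.configure(api_key=...) has already been called.
    import google.generativeai as genai
    result = genai.embed_content(
        model=EMBEDDING_MODEL,
        content=text,
        task_type=task_type,  # "RETRIEVAL_DOCUMENT" or "RETRIEVAL_QUERY"
    )
    return result["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    # Plain-Python cosine similarity for comparing the two embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Usage (requires an API key):
#   doc_vec = embed("Torque the terminal screws to 1.2 Nm", "RETRIEVAL_DOCUMENT")
#   qry_vec = embed("Torque the terminal screws to 1.2 Nm", "RETRIEVAL_QUERY")
#   cosine(doc_vec, qry_vec)  # expect high similarity, but not exactly 1.0
```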

Performance: Gemini Embedding 2 achieves top-tier scores on the MTEB (Massive Text Embedding Benchmark), particularly for retrieval tasks. For technical documentation with specialized vocabulary — part numbers, torque specs, error codes — this translates to better results than general-purpose embeddings.

Multilingual support: If you’re building for a global product line with manuals in multiple languages, the model handles multilingual retrieval without separate per-language models.

What It Doesn’t Do Natively (In This Pipeline)

Gemini Embedding 2 through the standard Google AI API primarily accepts text input. For image content, you’ll use Gemini’s vision models (like gemini-2.0-flash) to generate rich text descriptions first, then embed those. This keeps your pipeline portable and avoids Vertex AI dependencies for smaller projects.


Set Up Your Development Environment

Prerequisites

Before writing any code, you need:

  • Python 3.10 or later
  • A Google AI API key (get one at aistudio.google.com)
  • A Pinecone account and API key (pinecone.io)
  • A product manual PDF to test with (use any multi-page technical manual that includes images)

Install Dependencies

pip install google-generativeai pinecone PyMuPDF Pillow python-dotenv tqdm

A note on these libraries:

  • pinecone is the current name of the official client on PyPI (it replaced the deprecated pinecone-client package; both expose from pinecone import Pinecone)
  • PyMuPDF (imported as fitz) handles PDF parsing — it’s the most reliable option for extracting both text and images
  • Pillow handles image processing before sending to Gemini
  • tqdm gives you progress bars during batch indexing — useful when processing large manuals

Configure Your Environment

Create a .env file:

GOOGLE_API_KEY=your_google_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=product-manual-index

Then load these in your script:

import os
from dotenv import load_dotenv
import google.generativeai as genai
from pinecone import Pinecone

load_dotenv()

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

Extract Text and Visual Content from PDFs

This is where most multimodal pipelines get complicated. PyMuPDF makes it manageable.

Extract Text Chunks

You want each chunk to be semantically coherent — not cut mid-sentence or in the middle of a numbered step. A simple approach is to chunk by page first, then split by paragraph breaks. For most product manuals, page-level chunking with a maximum character limit works well.

import fitz  # PyMuPDF

def extract_text_chunks(pdf_path: str, max_chunk_size: int = 1000) -> list[dict]:
    """
    Extract text from each page of a PDF.
    Returns a list of chunks with page metadata.
    """
    doc = fitz.open(pdf_path)
    chunks = []

    for page_num, page in enumerate(doc):
        text = page.get_text("text").strip()
        
        if not text:
            continue
        
        # Split into paragraphs
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        
        current_chunk = ""
        for para in paragraphs:
            if len(current_chunk) + len(para) < max_chunk_size:
                current_chunk += para + "\n\n"
            else:
                if current_chunk:
                    chunks.append({
                        "content": current_chunk.strip(),
                        "page": page_num + 1,
                        "content_type": "text",
                        "source": pdf_path
                    })
                current_chunk = para + "\n\n"
        
        if current_chunk:
            chunks.append({
                "content": current_chunk.strip(),
                "page": page_num + 1,
                "content_type": "text",
                "source": pdf_path
            })
    
    doc.close()
    return chunks

Extract Images

PyMuPDF can extract embedded images from PDFs. Each image gets saved temporarily so you can pass it to Gemini’s vision model.

import io
from PIL import Image

def extract_images_from_pdf(pdf_path: str, min_size: int = 100) -> list[dict]:
    """
    Extract images from PDF pages.
    Skips very small images (likely decorative icons).
    Returns list of image dicts with page metadata and PIL Image objects.
    """
    doc = fitz.open(pdf_path)
    images = []

    for page_num, page in enumerate(doc):
        image_list = page.get_images(full=True)

        for img_index, img in enumerate(image_list):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]

            # Load as PIL Image
            pil_image = Image.open(io.BytesIO(image_bytes))
            width, height = pil_image.size

            # Skip tiny images
            if width < min_size or height < min_size:
                continue

            images.append({
                "image": pil_image,
                "image_bytes": image_bytes,
                "page": page_num + 1,
                "image_index": img_index,
                "content_type": "image",
                "source": pdf_path,
                "width": width,
                "height": height
            })

    doc.close()
    return images

Generate Image Descriptions with Gemini Vision

This is the critical step for multimodal retrieval. You’re converting visual content into semantic text that can be embedded and searched.

The quality of your prompt here directly affects retrieval quality. Be specific about what you want Gemini to describe.

import google.generativeai as genai
from PIL import Image
import io

vision_model = genai.GenerativeModel("gemini-2.0-flash")

DIAGRAM_DESCRIPTION_PROMPT = """
You are analyzing a technical image from a product manual.

Describe this image in detail, covering:
1. What type of image it is (diagram, photograph, chart, warning label, parts diagram, etc.)
2. What components, parts, or elements are visible, including any labels or text visible in the image
3. Any numbered steps, arrows, or indicators showing sequence or direction
4. Any measurements, specifications, or technical values shown
5. What the image appears to be instructing or explaining
6. Any safety warnings or caution symbols present

Write your description as a clear, informative paragraph that would help someone understand what this image shows without being able to see it.
"""

def describe_image(image: Image.Image) -> str:
    """
    Use Gemini Vision to generate a detailed description of a technical image.
    """
    try:
        response = vision_model.generate_content([
            DIAGRAM_DESCRIPTION_PROMPT,
            image
        ])
        return response.text.strip()
    except Exception as e:
        print(f"Error describing image: {e}")
        return ""


def process_images(image_dicts: list[dict]) -> list[dict]:
    """
    Add text descriptions to extracted images.
    """
    processed = []
    for img_dict in image_dicts:
        description = describe_image(img_dict["image"])
        if description:
            processed.append({
                "content": description,
                "page": img_dict["page"],
                "content_type": "image",
                "source": img_dict["source"],
                "image_index": img_dict["image_index"],
                "original_dimensions": f"{img_dict['width']}x{img_dict['height']}"
            })
    return processed

Index Everything Into Pinecone

With text chunks and image descriptions ready, you can embed them all using Gemini Embedding 2 and store them in Pinecone.

Create a Pinecone Index

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = os.getenv("PINECONE_INDEX_NAME", "product-manual-index")

# Create index if it doesn't exist
if index_name not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=3072,  # Gemini Embedding 2 default dimension
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(index_name)

One important note: if you adjust the output_dimensionality parameter when generating embeddings, make sure your Pinecone index dimension matches. Using 3072 is safest for technical documentation where small semantic differences matter.
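A cheap guard against this class of bug is to check vector length against the index dimension before upserting; the index reports its dimension via index.describe_index_stats(). A minimal sketch (the function name is illustrative):

```python
def check_dimensions(index_dimension: int, embedding: list[float]) -> None:
    # Fail fast: upserting a vector of the wrong length otherwise surfaces
    # as an opaque Pinecone API error much later in the pipeline.
    if len(embedding) != index_dimension:
        raise ValueError(
            f"Embedding has {len(embedding)} dims, "
            f"but the index expects {index_dimension}"
        )
```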

Embed Content with Gemini Embedding 2

import time

EMBEDDING_MODEL = "models/gemini-embedding-2-preview-05-14"
# Verify the exact model ID at ai.google.dev — Google updates preview model names

def embed_text(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
    """
    Embed a text string using Gemini Embedding 2.
    """
    result = genai.embed_content(
        model=EMBEDDING_MODEL,
        content=text,
        task_type=task_type,
        output_dimensionality=3072
    )
    return result["embedding"]


def embed_batch(chunks: list[dict], task_type: str = "RETRIEVAL_DOCUMENT", 
                batch_size: int = 5, delay: float = 0.5) -> list[dict]:
    """
    Embed a list of content chunks.
    Includes rate limiting to avoid API quota issues.
    """
    embedded = []
    
    for i, chunk in enumerate(chunks):
        if i > 0 and i % batch_size == 0:
            time.sleep(delay)
        
        try:
            embedding = embed_text(chunk["content"], task_type=task_type)
            embedded.append({
                **chunk,
                "embedding": embedding
            })
        except Exception as e:
            print(f"Error embedding chunk {i}: {e}")
            continue
    
    return embedded

Upsert Vectors to Pinecone

import uuid
from tqdm import tqdm

def upsert_to_pinecone(embedded_chunks: list[dict], batch_size: int = 50):
    """
    Store embedded content in Pinecone.
    Metadata includes page number, content type, and source for filtering.
    """
    vectors = []
    
    for chunk in embedded_chunks:
        vector_id = str(uuid.uuid4())
        
        metadata = {
            "content": chunk["content"][:2000],  # Pinecone metadata limit
            "page": chunk["page"],
            "content_type": chunk["content_type"],
            "source": chunk.get("source", ""),
        }
        
        # Add image-specific metadata
        if chunk["content_type"] == "image":
            metadata["image_index"] = chunk.get("image_index", 0)
        
        vectors.append({
            "id": vector_id,
            "values": chunk["embedding"],
            "metadata": metadata
        })
    
    # Upsert in batches
    for i in tqdm(range(0, len(vectors), batch_size), desc="Uploading to Pinecone"):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)
    
    print(f"Uploaded {len(vectors)} vectors to Pinecone index: {index_name}")

The Full Ingestion Pipeline

Tie everything together:

def ingest_manual(pdf_path: str):
    """
    Full pipeline: PDF → text chunks + image descriptions → embeddings → Pinecone
    """
    print(f"Processing: {pdf_path}")
    
    # Step 1: Extract text
    print("Extracting text chunks...")
    text_chunks = extract_text_chunks(pdf_path)
    print(f"Found {len(text_chunks)} text chunks")
    
    # Step 2: Extract and describe images
    print("Extracting images...")
    raw_images = extract_images_from_pdf(pdf_path)
    print(f"Found {len(raw_images)} images. Generating descriptions...")
    image_chunks = process_images(raw_images)
    print(f"Generated {len(image_chunks)} image descriptions")
    
    # Step 3: Combine all content
    all_chunks = text_chunks + image_chunks
    print(f"Total content chunks: {len(all_chunks)}")
    
    # Step 4: Embed
    print("Embedding with Gemini Embedding 2...")
    embedded_chunks = embed_batch(all_chunks)
    
    # Step 5: Store in Pinecone
    print("Uploading to Pinecone...")
    upsert_to_pinecone(embedded_chunks)
    
    print("Ingestion complete.")
    return len(embedded_chunks)

# Run it
ingest_manual("product_manual.pdf")

Build the Query and Answer Pipeline

Ingestion is half the job. Now you need to retrieve relevant content and generate answers.

The Query Function

def retrieve_context(question: str, top_k: int = 6,
                     filter_content_type: str | None = None) -> list[dict]:
    """
    Embed the user's question and retrieve the most relevant chunks from Pinecone.
    
    Args:
        question: The user's natural language question
        top_k: Number of chunks to retrieve
        filter_content_type: Optional - filter by "text" or "image" only
    """
    # Use RETRIEVAL_QUERY task type for questions
    question_embedding = embed_text(question, task_type="RETRIEVAL_QUERY")
    
    # Optional metadata filter
    filter_dict = {}
    if filter_content_type:
        filter_dict["content_type"] = {"$eq": filter_content_type}
    
    query_kwargs = {
        "vector": question_embedding,
        "top_k": top_k,
        "include_metadata": True
    }
    
    if filter_dict:
        query_kwargs["filter"] = filter_dict
    
    results = index.query(**query_kwargs)
    
    retrieved = []
    for match in results["matches"]:
        retrieved.append({
            "content": match["metadata"]["content"],
            "page": match["metadata"]["page"],
            "content_type": match["metadata"]["content_type"],
            "score": match["score"]
        })
    
    return retrieved

Generate the Answer

generation_model = genai.GenerativeModel("gemini-2.0-flash")

SYSTEM_PROMPT = """
You are a helpful technical support assistant for product manuals.
Answer questions based only on the provided context from the manual.
If the answer involves a diagram or visual element, describe what the visual shows.
If you're not sure about something, say so — don't guess at specifications or safety values.

Format your responses clearly. Use numbered steps when describing procedures.
Always mention which page the information comes from if available.
"""

def answer_question(question: str, context_chunks: list[dict]) -> str:
    """
    Generate an answer using retrieved context from the manual.
    """
    # Build context string with source references
    context_parts = []
    for chunk in context_chunks:
        source_label = f"[Page {chunk['page']}, {chunk['content_type'].capitalize()}]"
        context_parts.append(f"{source_label}\n{chunk['content']}")
    
    context_str = "\n\n---\n\n".join(context_parts)
    
    prompt = f"""
Context from the product manual:

{context_str}

---

User question: {question}

Answer based on the context above:
"""
    
    response = generation_model.generate_content([
        {"role": "user", "parts": [SYSTEM_PROMPT]},
        {"role": "model", "parts": ["Understood. I'll answer based only on the provided manual context."]},
        {"role": "user", "parts": [prompt]}
    ])
    
    return response.text.strip()


def chat(question: str) -> str:
    """
    Main chat function: retrieve context + generate answer.
    """
    print(f"Question: {question}")
    context = retrieve_context(question, top_k=6)
    
    if not context:
        return "I couldn't find relevant information in the manual for that question."
    
    answer = answer_question(question, context)
    return answer

Test It

# Test queries
questions = [
    "How do I reset the device to factory settings?",
    "What does the red blinking LED indicate?",
    "What is the maximum operating temperature?",
    "How do I connect the output terminals?",
    "What tools are required for installation?"
]

for q in questions:
    print(f"\nQ: {q}")
    print(f"A: {chat(q)}\n")
    print("-" * 60)

Improve Retrieval Quality

A basic pipeline will work, but a few adjustments make a significant difference for technical documentation.

Hybrid Metadata Filtering

Pinecone supports filtering by metadata fields. For product manuals, this is useful when:

  • A user asks specifically about installation (you can filter for pages in a known range)
  • A user wants to see only diagrams
  • You’re supporting multiple manual versions and need to route queries to the right document

# Example: retrieve only from the first 20 pages
results = index.query(
    vector=question_embedding,
    top_k=5,
    include_metadata=True,
    filter={
        "page": {"$lte": 20}
    }
)

# Example: prioritize image descriptions for visual questions
if any(word in question.lower() for word in ["diagram", "shows", "look like", "illustration"]):
    context = retrieve_context(question, top_k=8, filter_content_type="image")
else:
    context = retrieve_context(question, top_k=6)

Rerank Results

If you’re finding that lower-scoring chunks still contain relevant information, consider adding a simple reranking step using Gemini itself:

def rerank_chunks(question: str, chunks: list[dict], top_n: int = 4) -> list[dict]:
    """
    Use Gemini to score retrieved chunks for relevance to the question.
    Useful when semantic similarity isn't perfectly capturing relevance.
    """
    scored = []
    
    for chunk in chunks:
        prompt = f"""
Rate how relevant this text is to answering the question below.
Return only a number from 1-10.

Question: {question}
Text: {chunk['content'][:500]}

Relevance score (1-10):"""
        
        try:
            response = generation_model.generate_content(prompt)
            score_text = response.text.strip()
            score = int(''.join(filter(str.isdigit, score_text[:3])))
            scored.append({**chunk, "rerank_score": score})
        except Exception:
            scored.append({**chunk, "rerank_score": 5})
    
    scored.sort(key=lambda x: x["rerank_score"], reverse=True)
    return scored[:top_n]

Handle Multi-Page Diagrams

Some product manuals have fold-out diagrams or diagrams that span multiple pages. When you detect that an image description references “continued on next page” or “see adjacent diagram,” retrieve adjacent pages automatically:

def retrieve_with_adjacency(question: str, top_k: int = 5) -> list[dict]:
    """
    Retrieve context and include adjacent pages for continuity.
    """
    primary_results = retrieve_context(question, top_k=top_k)
    
    # Collect page numbers
    pages_to_include = set()
    for r in primary_results:
        page = r["page"]
        pages_to_include.update([page - 1, page, page + 1])
    
    pages_to_include.discard(0)
    
    # Fetch adjacent page content
    adjacent = index.query(
        vector=embed_text(question, task_type="RETRIEVAL_QUERY"),
        top_k=20,
        include_metadata=True,
        filter={"page": {"$in": list(pages_to_include)}}
    )
    
    seen_content = set()
    combined = []
    
    for match in adjacent.matches:
        content = match.metadata["content"]
        if content not in seen_content:
            seen_content.add(content)
            combined.append({
                "content": content,
                "page": match.metadata["page"],
                "content_type": match.metadata["content_type"],
                "score": match.score
            })
    
    combined.sort(key=lambda x: x["score"], reverse=True)
    return combined[:top_k + 3]

Troubleshooting Common Problems

Embeddings Are Returning Irrelevant Results

Likely cause: Chunk sizing. If chunks are too small (under 100 tokens), they lose context. If they’re too large (over 600 tokens), the embedding averages over too much content and specificity suffers.

Fix: Try chunk sizes between 200–500 tokens for technical content. Add a 50–100 token overlap between adjacent chunks so sentences that straddle chunk boundaries don’t get split semantically.
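The extraction code earlier chunks by paragraph without any overlap; the sketch below shows one way to add it. It approximates tokens with whitespace-split words (a rough stand-in, not a real tokenizer), and the function name and defaults are illustrative:

```python
def chunk_with_overlap(text: str, chunk_size: int = 400, overlap: int = 75) -> list[str]:
    # Sliding window over words: each chunk repeats the last `overlap`
    # words of the previous one, so boundary sentences appear in both.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```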

Image Descriptions Are Too Generic

Likely cause: The vision prompt isn’t specific enough, or the images extracted from the PDF are low resolution.

Fix: Increase the resolution when rendering pages as images. PyMuPDF can render a page at higher resolution by passing a scaling matrix to get_pixmap via the matrix parameter:

page = doc.load_page(page_num)
mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for higher resolution
pix = page.get_pixmap(matrix=mat)
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

Also, make your vision prompt more specific to the domain — for HVAC manuals, ask Gemini to look for refrigerant lines; for electronics, look for connector pinouts.

Pinecone Upsert Errors

Likely cause: Metadata values exceeding Pinecone’s limits, or embedding dimensions mismatch.

Fixes:

  • Truncate metadata strings to 2000 characters max
  • Verify your index dimension matches output_dimensionality in your embed calls
  • Don’t include None values in metadata dictionaries — Pinecone rejects null values
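The first and third fixes can be folded into one small sanitizer applied to every metadata dict before upserting. A sketch (the function name is an assumption, not part of the Pinecone API):

```python
def sanitize_metadata(metadata: dict, max_str_len: int = 2000) -> dict:
    # Drop None values (Pinecone rejects nulls) and truncate long strings
    # so the record stays within the metadata size limit.
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, str):
            value = value[:max_str_len]
        clean[key] = value
    return clean
```

Call it right before appending each vector dict, e.g. metadata = sanitize_metadata(metadata).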

Rate Limiting on Gemini API

Likely cause: Sending too many embedding or generation requests too quickly.

Fix: Add exponential backoff:

import time
import random

def embed_with_retry(text: str, task_type: str, max_retries: int = 3) -> list[float]:
    for attempt in range(max_retries):
        try:
            result = genai.embed_content(
                model=EMBEDDING_MODEL,
                content=text,
                task_type=task_type
            )
            return result["embedding"]
        except Exception as e:
            if "quota" in str(e).lower() or "rate" in str(e).lower():
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s...")
                time.sleep(wait)
            else:
                raise e
    raise Exception(f"Failed after {max_retries} retries")

Answers Citing Wrong Page Numbers

Likely cause: The PDF page numbering doesn’t match PyMuPDF’s 0-indexed pages.

Fix: Store both the PDF’s printed page number (extracted from the footer text) and the logical page index. Use a simple heuristic: scan the first 100 characters of each page for a standalone number that looks like a page number.


How MindStudio Fits Into This

Building this pipeline from scratch takes a few days of work — API setup, chunking logic, embedding batching, Pinecone configuration, frontend. That’s fine for a production system you own fully. But if you’re a product team that wants this chatbot running on your documentation site without maintaining Python infrastructure, there’s a faster path.

MindStudio lets you build AI agents visually, without code. You can wire together the same logic — PDF ingestion, Gemini calls, vector retrieval, answer generation — as a workflow in the visual builder. MindStudio connects to 200+ AI models out of the box, including Gemini, so you don’t need to manage API keys or rate limiting manually.

For a product manual chatbot specifically, the most practical use case is: upload PDFs through a MindStudio workflow trigger, run the ingestion steps as sequential actions, and expose the chatbot as an embeddable web app or a Slack bot that your support team can query directly.

If you’ve already built the Python pipeline above and just want a UI without building one, you can wrap it as a webhook endpoint and call it from a MindStudio agent. The Agent Skills Plugin handles the integration layer so your existing code doesn’t need to change.

You can try MindStudio free at mindstudio.ai — no credit card required to start.


Scaling to Multiple Manuals

The pipeline above handles a single PDF. For a product library with dozens or hundreds of manuals, you need a few more things.

Use Pinecone Namespaces

Each product manual should get its own Pinecone namespace. This lets you route queries to the right manual without cross-contamination between, say, a dishwasher manual and an HVAC system guide.

# When upserting
index.upsert(vectors=vectors, namespace="product-model-xyz-v2")

# When querying
results = index.query(
    vector=question_embedding,
    top_k=6,
    include_metadata=True,
    namespace="product-model-xyz-v2"
)

Track Manual Versions

Product manuals get updated. When a new version arrives, you want to replace the old vectors without leaving stale content. The cleanest approach:

  1. Delete the old namespace: index.delete(delete_all=True, namespace="product-model-xyz-v1")
  2. Re-ingest the updated PDF into the new namespace
  3. Update your routing logic to point to the new namespace
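
The three steps can be wrapped in one helper. This sketch assumes you add a namespace parameter to ingest_manual (the version above writes to the default namespace), so treat the signature as an assumption:

```python
def swap_manual_version(index, ingest_fn, pdf_path: str,
                        old_namespace: str, new_namespace: str) -> str:
    # Step 1: delete the stale vectors for the old version.
    index.delete(delete_all=True, namespace=old_namespace)
    # Step 2: re-ingest the updated PDF into the new namespace.
    ingest_fn(pdf_path, namespace=new_namespace)
    # Step 3: return the namespace the routing layer should use from now on.
    return new_namespace
```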

Build a Simple Routing Layer

When users don’t specify which product they’re asking about, a routing layer helps:

def route_question(question: str, available_products: list[str]) -> str:
    """
    Determine which product manual namespace to query.
    Uses Gemini to classify the question.
    """
    product_list = "\n".join(f"- {p}" for p in available_products)
    
    prompt = f"""
Given the following question, determine which product it refers to.
Return only the product name exactly as listed, or "unknown" if unclear.

Available products:
{product_list}

Question: {question}

Product:"""
    
    response = generation_model.generate_content(prompt)
    result = response.text.strip()
    
    if result in available_products:
        return result
    return "all"  # Fall back to searching all namespaces
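
When the router falls back to "all", one simple approach is to query each namespace separately and merge the hits by score. A sketch, reusing the metadata layout from the ingestion code (the function name is illustrative):

```python
def search_all_namespaces(index, question_embedding: list[float],
                          namespaces: list[str], top_k: int = 6) -> list[dict]:
    # Query every product namespace, then keep the globally best-scoring
    # chunks. Cosine scores from the same index are directly comparable.
    merged = []
    for ns in namespaces:
        results = index.query(
            vector=question_embedding,
            top_k=top_k,
            include_metadata=True,
            namespace=ns,
        )
        for match in results["matches"]:
            merged.append({
                "namespace": ns,
                "content": match["metadata"]["content"],
                "page": match["metadata"]["page"],
                "content_type": match["metadata"]["content_type"],
                "score": match["score"],
            })
    merged.sort(key=lambda m: m["score"], reverse=True)
    return merged[:top_k]
```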

Frequently Asked Questions

What is multimodal RAG?

Multimodal RAG (Retrieval-Augmented Generation) extends the standard RAG pipeline to handle multiple content types — not just text, but also images, diagrams, tables, and other visual formats. In a standard RAG system, documents are chunked, embedded as vectors, and searched using semantic similarity. Multimodal RAG adds steps to process non-text content: typically converting images to descriptive text using a vision model, or using a multimodal embedding model that can represent text and images in the same vector space. The result is a system that can retrieve and cite visual content like technical diagrams when answering questions.

Does Gemini Embedding 2 support image inputs directly?

Gemini Embedding 2, accessed through the Google AI API (google.generativeai), primarily accepts text. For true native image embedding in Google’s ecosystem, you’d use the multimodalembedding@001 model on Vertex AI. However, the pipeline described in this article uses a practical alternative: pass images to a Gemini vision model (like gemini-2.0-flash) to generate detailed text descriptions, then embed those descriptions using Gemini Embedding 2. This produces strong retrieval results for technical documentation without requiring Vertex AI.

How many pages can this pipeline handle?

There’s no hard limit from the pipeline itself. The constraints are practical:

  • Pinecone: Free tier allows 100K vectors. A 100-page manual with images might generate 500–800 vectors. That’s plenty for dozens of manuals on the free tier.
  • Gemini API rate limits: The embedding API has per-minute quotas. For large ingestion jobs (1,000+ chunks), add rate limiting and batching as shown in the code above.
  • Image processing time: Generating descriptions for many images is slow (about 1–2 seconds per image). For a 200-page illustrated manual, expect 10–20 minutes for the first ingestion run. After that, queries are fast.

Can I use this with manuals in languages other than English?

Yes. Gemini Embedding 2 is multilingual and handles retrieval across languages well. Gemini’s vision model can also describe images and generate text in multiple languages. If you’re building for a global user base, use the same pipeline with prompts in the target language. For cross-lingual retrieval (user asks in English about a Spanish manual), the embedding model generally handles this reasonably well, though adding a translation step before retrieval improves accuracy.

How is this different from using a PDF reader with GPT-4?

Tools like ChatGPT’s file upload or Claude’s document mode send the entire PDF as context to the model. For short documents, that works fine. For long technical manuals (50+ pages), it runs into context window limits and the model struggles to focus on the right section. A RAG pipeline retrieves only the 5–10 most relevant chunks before generation, which stays within context limits and tends to produce more precise, page-referenced answers. It also scales to thousands of documents without changing the query cost, since retrieval is cheap regardless of corpus size.

Is Pinecone the only vector database option?

No. The embedding pipeline works with any vector database that supports cosine or dot product similarity. Popular alternatives:

  • Weaviate — open source, self-hostable, built-in multimodal support
  • Qdrant — open source, good filtering capabilities
  • ChromaDB — easy for local development and smaller projects
  • pgvector — if you’re already on Postgres and want to avoid a separate vector DB
  • Supabase — managed pgvector with a REST API

Pinecone’s serverless offering makes it easy to get started without provisioning infrastructure, which is why it’s used here.

What chunk size works best for product manuals?

For technical manuals, 300–500 tokens per chunk with a 50-token overlap tends to work well. Technical specifications (lists of values, error code tables) benefit from staying together, so it’s worth adding logic to avoid splitting tables mid-row. Procedural steps should also stay intact — don’t split “Step 3” from its accompanying instruction. If you’re seeing poor retrieval results, increasing chunk size is often more effective than switching embedding models.


Key Takeaways

  • Standard text-only RAG misses critical information in technical manuals that lives in diagrams, tables, and illustrations — multimodal RAG closes that gap by converting visual content to searchable descriptions.
  • Gemini Embedding 2’s task type parameter (RETRIEVAL_DOCUMENT vs. RETRIEVAL_QUERY) meaningfully improves retrieval quality for asymmetric search scenarios like Q&A.
  • The practical approach for PDFs: extract images with PyMuPDF, describe them with Gemini Vision, embed everything with Gemini Embedding 2, and store in Pinecone with page and content type metadata.
  • Pinecone namespaces let you cleanly separate multiple products and manual versions in a single index.
  • Rate limiting, retry logic, and chunk sizing are the three areas most likely to cause problems in production — address them early.

If you want to run this pipeline without managing the infrastructure yourself, MindStudio offers a no-code way to build the same agent and deploy it as a web app, Slack bot, or API endpoint. You can get started free at mindstudio.ai.