How to Build a Multimodal Search System with Gemini Embedding 2

Step-by-step guide to building a unified search pipeline using Gemini Embedding 2 to index and query text, images, audio, video, and PDFs in one vector store.

MindStudio Team

Why Unified Search Across Modalities Is Harder Than It Looks

Most search systems are designed around a single content type. Document search handles text. Image galleries have their own separate indexing. Videos get manually tagged. Audio files often just have a filename.

This works until your knowledge base contains everything at once — meeting recordings, product photos, slide decks, customer emails, and training videos — and you need to find the one piece of content that answers a question, regardless of what format it’s in.

The usual fix is to build separate pipelines for each content type and stitch the results together at query time. But that approach makes cross-modal search nearly impossible. You can’t ask “show me content related to this product photo” and get back matching text documents, audio mentions, and video clips — not without a unified embedding space.

That’s what a multimodal search system built on Gemini Embedding 2 solves. By encoding different content types into the same vector space, you can run a single query against text, images, audio, video, and PDFs and get ranked results across all of them.

This guide walks through building that system end-to-end: ingestion, embedding, indexing, and retrieval — including all five content modalities.


What Gemini Embedding 2 Actually Does

Before writing any code, it helps to understand what multimodal embeddings actually mean — because the architecture decisions follow from this.

Embeddings and Vector Spaces

An embedding is a list of floating-point numbers that represents the meaning of a piece of content. Two pieces of content with similar meaning will have embeddings that are close together in vector space, regardless of exact wording.

For text, this is well understood. “The dog chased the cat” and “a canine pursued a feline” share no words but will have nearby embeddings because they describe the same event.

Multimodal embeddings extend this across content types. A photo of a dog chasing a cat should land near both of those sentences in the embedding space — because they describe the same thing. This is the core idea that makes cross-modal retrieval work.
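To make the idea concrete, here is a toy cosine-similarity calculation over made-up four-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and the numbers below are purely illustrative, but the geometry is the same: similar meaning means higher cosine similarity.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real ones have far more dimensions)
dog_chase = [0.90, 0.80, 0.10, 0.00]  # "The dog chased the cat"
canine    = [0.85, 0.75, 0.20, 0.10]  # "a canine pursued a feline"
invoice   = [0.00, 0.10, 0.90, 0.80]  # "Q3 invoice summary"

print(cosine_similarity(dog_chase, canine))   # high: similar meaning
print(cosine_similarity(dog_chase, invoice))  # low: unrelated content
```

A multimodal model applies the same trick across formats: the image of the dog chase would be a third vector near the first two.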

How the Model Handles Multiple Modalities

Google’s multimodal embedding model (multimodalembedding@001 on Vertex AI) generates 1408-dimensional vectors across a shared embedding space for text, images, and video. The Gemini text embedding models (text-embedding-004, gemini-embedding-001) extend this further for longer text documents.

This shared space means:

  • A text query can retrieve matching images
  • An image can retrieve matching documents
  • A video frame can retrieve matching audio transcripts
  • Everything lives in one index, searched with one query

The model was trained on massive amounts of paired content — images with captions, videos with transcripts, documents with descriptions — so semantically equivalent content from different modalities ends up close together in the space.

What This System Handles

Content Type                 Processing Method                Embedding Approach
Text / Markdown              Direct input, chunked            Gemini text embedding
Images (JPEG, PNG, WebP)     Direct input                     Gemini multimodal embedding
PDFs                         Text extraction + page render    Both, per page
Audio (MP3, WAV)             Transcribed via Whisper          Gemini text embedding
Video (MP4)                  Sampled frames + transcript      Multimodal + text

All outputs go into the same vector index with metadata. One query can return results from any format.


Architecture: How the Pipeline Fits Together

The system has three distinct layers: ingestion, indexing, and retrieval.

Content Sources (files, URLs, uploads)


    [Ingestion Layer]
     - File type detection
     - Content extraction (text, frames, audio)
     - Chunking and preprocessing


    [Embedding Layer]
     - Gemini multimodal embedding API
     - Per-chunk embedding generation
     - Metadata tagging (source, type, page, timestamp)


    [Vector Store]
     - ChromaDB / Pinecone / Qdrant
     - Stores vectors + metadata
     - ANN index for fast retrieval


    [Query Interface]
     - Accept text or image query
     - Embed with same model
     - Retrieve top-k nearest neighbors
     - Return results with source and type

You can run this entirely locally for development using ChromaDB. Swapping to a managed vector store for production is a matter of changing a few lines — the embedding logic stays the same.


Prerequisites and Environment Setup

What You’ll Need

  • Python 3.10 or later
  • A Google Cloud project with Vertex AI enabled (for multimodal embeddings)
  • A Google AI API key (for text-only embeddings, simpler setup)
  • ffmpeg installed locally (for audio/video processing)
  • Enough disk space for your vector index — typically a few hundred MB for small datasets

You’ll use both the Vertex AI SDK and the google-generativeai package in this guide, depending on content type.

Install Dependencies

pip install google-cloud-aiplatform google-generativeai \
            chromadb pdfplumber pdf2image pillow \
            opencv-python openai-whisper numpy fastapi uvicorn

Install ffmpeg via your system package manager:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

Authenticate with Google Cloud:

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

For the Google AI text embedding API:

export GOOGLE_API_KEY="your_api_key_here"

Initialize the Clients

import google.generativeai as genai
from google.cloud import aiplatform
import chromadb
import os

# Text embedding client (Google AI)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Multimodal embedding client (Vertex AI)
aiplatform.init(project="your-project-id", location="us-central1")

# Vector store — persistent local ChromaDB
chroma_client = chromadb.PersistentClient(path="./multimodal_index")
collection = chroma_client.get_or_create_collection(
    name="unified_search",
    metadata={"hnsw:space": "cosine"}
)

Building the Ingestion Layer

The ingestion layer converts raw files into embeddable chunks. Each modality needs different preprocessing, but the output format is consistent: a list of dicts with content, type, and source keys.

Text and Markdown

Text is the most straightforward. Chunk it into overlapping segments so context at chunk boundaries isn’t lost during retrieval.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

def ingest_text_file(filepath: str) -> list[dict]:
    """Load a text or Markdown file and return chunks with metadata."""
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    chunks = chunk_text(text)
    return [
        {
            "content": chunk,
            "type": "text",
            "source": filepath,
            "chunk_index": i
        }
        for i, chunk in enumerate(chunks)
    ]

The 50-word overlap is a reasonable default. Increase it if your documents have important cross-boundary context (e.g., legal documents or technical specifications).

Images

Images go directly to the multimodal embedding model without preprocessing. Just load the file and tag it with metadata.

def ingest_image(filepath: str) -> dict:
    """Prepare an image file for embedding."""
    return {
        "content": filepath,  # Path passed to embedding model
        "type": "image",
        "source": filepath
    }

For very large image collections, resize images to around 512×512 before embedding. The model doesn’t require specific dimensions, but oversized images increase API payload size without meaningfully improving embedding quality.
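As a sketch of that preprocessing step (the helper name is an assumption for illustration; Pillow's Image.thumbnail shrinks in place while preserving aspect ratio, so a 2048×1536 image becomes 512×384):

```python
from PIL import Image

def prepare_image(filepath: str, max_side: int = 512) -> str:
    """Downscale a copy of the image to fit within max_side x max_side
    (aspect ratio preserved) and return the path of the resized copy."""
    img = Image.open(filepath)
    img.thumbnail((max_side, max_side))  # no-op if the image already fits
    out_path = filepath + ".resized.jpg"
    img.convert("RGB").save(out_path, "JPEG", quality=90)
    return out_path
```

You would then pass the returned path to the embedding call instead of the original file.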

PDFs

PDFs contain both text (paragraphs, headers) and visual content (charts, diagrams, photos). Process both.

The text extraction handles keyword-rich content. The page-as-image path handles visual content that wouldn’t appear in the text layer.

import pdfplumber
from pdf2image import convert_from_path

def ingest_pdf(filepath: str) -> list[dict]:
    """Extract text chunks and page images from a PDF."""
    items = []

    # Extract text per page
    with pdfplumber.open(filepath) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text and text.strip():
                for i, chunk in enumerate(chunk_text(text)):
                    items.append({
                        "content": chunk,
                        "type": "text",
                        "source": filepath,
                        "page": page_num + 1,
                        "chunk_index": i
                    })

    # Render each page as an image (captures charts, diagrams, layout)
    page_images = convert_from_path(filepath, dpi=150)
    for page_num, page_img in enumerate(page_images):
        # Include the source filename so pages from different PDFs don't
        # overwrite each other in /tmp during a batch ingestion run
        img_path = f"/tmp/pdf_page_{os.path.basename(filepath)}_{page_num}.jpg"
        page_img.save(img_path, "JPEG")
        items.append({
            "content": img_path,
            "type": "image",
            "source": filepath,
            "page": page_num + 1,
            "note": "rendered_pdf_page"
        })

    return items

This dual approach means a query about “Q3 revenue breakdown” can match both the text of the financial summary and the image of the bar chart on the same page.

Audio Files

Audio needs transcription before it can be embedded. Whisper (OpenAI’s open-source transcription model) works well locally and handles 90+ languages.

import whisper

whisper_model = whisper.load_model("base")  # Use "small" or "medium" for better accuracy

def ingest_audio(filepath: str) -> list[dict]:
    """Transcribe an audio file and chunk the transcript with timestamps."""
    result = whisper_model.transcribe(filepath)

    items = []
    current_chunk = []
    current_start = 0.0
    word_count = 0

    for segment in result["segments"]:
        current_chunk.append(segment["text"])
        word_count += len(segment["text"].split())

        if word_count >= 200:
            items.append({
                "content": " ".join(current_chunk).strip(),
                "type": "audio_transcript",
                "source": filepath,
                "start_time": current_start,
                "end_time": segment["end"]
            })
            current_chunk = []
            current_start = segment["end"]
            word_count = 0

    if current_chunk:
        items.append({
            "content": " ".join(current_chunk).strip(),
            "type": "audio_transcript",
            "source": filepath,
            "start_time": current_start,
            "end_time": result["segments"][-1]["end"] if result["segments"] else 0.0
        })

    return items

The timestamps in the metadata are important. When search returns a match from an audio file, you want to know exactly where in the recording the relevant content appears — not just that the file matches.
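When presenting results, those raw second offsets read better as clock-style timestamps. A small helper (the name is hypothetical) you could call from the display layer:

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as H:MM:SS, or M:SS for short recordings."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"

print(format_timestamp(754.0))   # 12:34
print(format_timestamp(3905.0))  # 1:05:05
```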

Video Files

Video is the most complex modality. The approach is to sample frames at a regular interval (for visual content) and transcribe the audio track (for spoken content). Both go into the same index.

import cv2
import subprocess

def extract_audio_from_video(video_path: str, output_path: str) -> str:
    """Extract the audio track from a video file."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        output_path, "-y"
    ], check=True, capture_output=True)
    return output_path

def ingest_video(filepath: str, frame_interval: int = 5) -> list[dict]:
    """Sample frames and transcribe audio from a video."""
    items = []

    # Sample frames at regular intervals
    cap = cv2.VideoCapture(filepath)
    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps == 0:
        fps = 24  # Fallback
    frame_step = int(fps * frame_interval)
    frame_count = 0

    while True:
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count)
        ret, frame = cap.read()
        if not ret:
            break

        timestamp = frame_count / fps
        # Include the source filename so frames from different videos don't
        # overwrite each other in /tmp during a batch ingestion run
        img_path = f"/tmp/{os.path.basename(filepath)}_frame_{frame_count}.jpg"

        # Resize to reduce memory and API payload size
        frame_resized = cv2.resize(frame, (640, 360))
        cv2.imwrite(img_path, frame_resized)

        items.append({
            "content": img_path,
            "type": "video_frame",
            "source": filepath,
            "timestamp": timestamp
        })

        frame_count += frame_step

    cap.release()

    # Transcribe audio track
    audio_path = "/tmp/extracted_audio.wav"
    try:
        extract_audio_from_video(filepath, audio_path)
        for item in ingest_audio(audio_path):
            item["type"] = "video_transcript"
            item["source"] = filepath
            items.append(item)
    except Exception as e:
        print(f"Audio extraction failed for {filepath}: {e}")

    return items

For a 1-hour video at 5-second intervals, you get 720 frame embeddings. Increase frame_interval for longer videos — 10–15 seconds works well for most presentations and recorded meetings.
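Before ingesting a large video library, it helps to sanity-check the embedding volume. A trivial helper (the name is made up for illustration) makes the arithmetic explicit:

```python
def frame_budget(duration_s: float, frame_interval_s: float) -> int:
    """Number of frames sampled from a video at a fixed interval."""
    return int(duration_s // frame_interval_s)

print(frame_budget(3600, 5))   # 720 frame embeddings for a 1-hour video
print(frame_budget(3600, 15))  # 240 with a coarser 15-second interval
```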


Generating Embeddings and Indexing

With your chunks prepared, you need to embed each one and store it in the vector index.

The Embedding Functions

from vertexai.vision_models import MultiModalEmbeddingModel, Image as VertexImage

mm_model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

def embed_text_chunk(text: str) -> list[float]:
    """Embed a text chunk using the dedicated text embedding model."""
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=text,
        task_type="retrieval_document"
    )
    return result["embedding"]

def embed_image_file(image_path: str) -> list[float]:
    """Embed an image using the multimodal embedding model."""
    image = VertexImage.load_from_file(image_path)
    embeddings = mm_model.get_embeddings(image=image)
    return embeddings.image_embedding

One critical note: text-embedding-004 produces 768-dimensional vectors, while multimodalembedding@001 produces 1408-dimensional vectors. These cannot live in the same ChromaDB collection — dimensions must match across all indexed items.

For true cross-modal retrieval (text queries finding images, image queries finding documents), you need everything in the same vector space. Use the multimodal model for all content types:

def embed_unified(item: dict) -> list[float]:
    """
    Embed any content type using the multimodal model.
    Text items use the text channel; image items use the image channel.
    Both produce vectors in the same 1408-dimensional space.
    """
    if item["type"] in ("text", "audio_transcript", "video_transcript"):
        embeddings = mm_model.get_embeddings(contextual_text=item["content"])
        return embeddings.text_embedding
    elif item["type"] in ("image", "video_frame"):
        image = VertexImage.load_from_file(item["content"])
        embeddings = mm_model.get_embeddings(image=image)
        return embeddings.image_embedding
    else:
        raise ValueError(f"Unknown content type: {item['type']}")

If you only need text-to-text retrieval (no image queries, no cross-modal matching), use the text-only model — it produces better text embeddings and doesn’t require Vertex AI. For full multimodal retrieval, embed_unified with the multimodal model is the right choice.

Indexing with Retry and Rate Limiting

import uuid
import time
import random

def embed_with_retry(item: dict, max_retries: int = 5) -> list[float]:
    """Embed with exponential backoff on rate limit errors."""
    for attempt in range(max_retries):
        try:
            return embed_unified(item)
        except Exception as e:
            err = str(e).lower()
            if "429" in err or "quota" in err or "rate" in err:
                wait = (2 ** attempt) + random.random()
                print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})...")
                time.sleep(wait)
            else:
                raise
    raise Exception(f"Embedding failed after {max_retries} attempts")

def index_items(items: list[dict], batch_size: int = 50):
    """Embed and store a list of content items in batches."""
    embeddings_batch, ids_batch, metadata_batch, documents_batch = [], [], [], []

    for i, item in enumerate(items):
        try:
            embedding = embed_with_retry(item)

            # ChromaDB requires string metadata values
            metadata = {
                "type": str(item.get("type", "")),
                "source": str(item.get("source", "")),
                "page": str(item.get("page", "")),
                "timestamp": str(item.get("timestamp", "")),
                "start_time": str(item.get("start_time", "")),
                "end_time": str(item.get("end_time", "")),
                "chunk_index": str(item.get("chunk_index", ""))
            }

            doc = item["content"] if item["type"] not in ("image", "video_frame") \
                  else f"[Image: {item['content']}]"

            embeddings_batch.append(embedding)
            ids_batch.append(str(uuid.uuid4()))
            metadata_batch.append(metadata)
            documents_batch.append(doc)

            if len(embeddings_batch) >= batch_size:
                collection.add(
                    embeddings=embeddings_batch,
                    documents=documents_batch,
                    metadatas=metadata_batch,
                    ids=ids_batch
                )
                embeddings_batch, ids_batch, metadata_batch, documents_batch = [], [], [], []
                print(f"Indexed {i + 1}/{len(items)} items...")

            time.sleep(0.05)  # Gentle rate limiting between requests

        except Exception as e:
            print(f"Skipping item {i} ({item.get('source', 'unknown')}): {e}")
            continue

    # Flush remaining
    if embeddings_batch:
        collection.add(
            embeddings=embeddings_batch,
            documents=documents_batch,
            metadatas=metadata_batch,
            ids=ids_batch
        )

    print(f"Done. Finished indexing pass over {len(items)} items.")

Running the Full Ingestion Pipeline

import glob

SUPPORTED_EXTENSIONS = {
    "txt": ingest_text_file,
    "md": ingest_text_file,
    "jpg": lambda p: [ingest_image(p)],
    "jpeg": lambda p: [ingest_image(p)],
    "png": lambda p: [ingest_image(p)],
    "webp": lambda p: [ingest_image(p)],
    "pdf": ingest_pdf,
    "mp3": ingest_audio,
    "wav": ingest_audio,
    "m4a": ingest_audio,
    "mp4": ingest_video,
    "mov": ingest_video,
    "mkv": ingest_video,
}

def ingest_directory(directory: str):
    """Ingest all supported files from a directory recursively."""
    all_items = []

    for filepath in sorted(glob.glob(f"{directory}/**/*", recursive=True)):
        ext = filepath.lower().rsplit(".", 1)[-1]
        handler = SUPPORTED_EXTENSIONS.get(ext)
        if not handler:
            continue
        try:
            result = handler(filepath)
            all_items.extend(result if isinstance(result, list) else [result])
            print(f"Prepared: {filepath} ({len(result)} item(s))")
        except Exception as e:
            print(f"Failed to ingest {filepath}: {e}")

    print(f"\nPrepared {len(all_items)} items. Starting indexing...")
    index_items(all_items)

# Run it
ingest_directory("./content")

Building the Search Interface

With everything indexed, the retrieval layer is relatively straightforward. The key is to embed the query with the same model used for indexing.

def search_text(
    query: str,
    top_k: int = 10,
    filter_type: str | None = None
) -> list[dict]:
    """
    Search the index with a text query.
    Optionally filter to a specific content type.
    """
    # Embed the query using the multimodal model's text channel
    query_embedding = mm_model.get_embeddings(contextual_text=query).text_embedding

    where = {"type": filter_type} if filter_type else None

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=where,
        include=["documents", "metadatas", "distances"]
    )

    output = []
    for i in range(len(results["ids"][0])):
        output.append({
            "id": results["ids"][0][i],
            "document": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            "score": 1 - results["distances"][0][i]  # Distance → similarity
        })

    return output

def search_image(image_path: str, top_k: int = 10) -> list[dict]:
    """Search the index using an image as the query."""
    image = VertexImage.load_from_file(image_path)
    query_embedding = mm_model.get_embeddings(image=image).image_embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    output = []
    for i in range(len(results["ids"][0])):
        output.append({
            "id": results["ids"][0][i],
            "document": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            "score": 1 - results["distances"][0][i]
        })

    return output

Displaying Results

def display_results(results: list[dict]):
    for i, result in enumerate(results):
        meta = result["metadata"]
        print(f"\n{'─' * 60}")
        print(f"Result {i+1}  |  Score: {result['score']:.3f}  |  Type: {meta['type']}")
        print(f"Source: {meta['source']}")
        if meta.get("page") and meta["page"] != "":
            print(f"Page: {meta['page']}")
        if meta.get("timestamp") and meta["timestamp"] != "":
            print(f"Timestamp: {float(meta['timestamp']):.1f}s")
        if meta.get("start_time") and meta["start_time"] != "":
            print(f"Time range: {meta['start_time']}s – {meta['end_time']}s")
        print(f"\n{result['document'][:300]}")

Example Queries

# Find customer onboarding content across all modalities
results = search_text("customer onboarding process")
display_results(results)

# Look only in video transcripts
results = search_text("quarterly revenue breakdown", filter_type="video_transcript")
display_results(results)

# Use an image as the query — find similar images and related documents
results = search_image("./reference_product_photo.jpg")
display_results(results)

Wrapping It in an HTTP API

For team use or application integration, expose your search as a REST API:

from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="Multimodal Search API")

class TextQuery(BaseModel):
    query: str
    top_k: int = 10
    filter_type: str | None = None

@app.post("/search/text")
async def text_search(body: TextQuery):
    results = search_text(body.query, body.top_k, body.filter_type)
    return {"results": results, "count": len(results)}

@app.post("/search/image")
async def image_search(file: UploadFile = File(...), top_k: int = 10):
    img_path = f"/tmp/query_{file.filename}"
    with open(img_path, "wb") as f:
        f.write(await file.read())
    results = search_image(img_path, top_k)
    return {"results": results, "count": len(results)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Production Considerations and Scaling

The prototype above works well for development and moderate-scale deployments. A few things to address before production.

Choosing a Production Vector Store

ChromaDB is excellent for development but not built for large-scale production. When you need to scale:

  • Pinecone: Fully managed, scales automatically, straightforward REST API. Best for teams that don’t want to manage infrastructure. Pinecone’s documentation has good guides for migrating from local indexes.
  • Weaviate: Open-source with a cloud option. Has native multimodal indexing support and strong filtering. Good for complex metadata queries.
  • Qdrant: Rust-based, fast, self-hosted or cloud. Excellent for payload filtering and large datasets.
  • pgvector: PostgreSQL extension. Best if you’re already running Postgres and want everything in one place.

Switching vector stores only changes the add() and query() calls — all the embedding code stays the same.
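One way to keep that swap cheap is to depend on a minimal store interface. The sketch below is an assumption, not any vector store's actual API: an in-memory reference implementation that the indexing and search code could target, with Chroma, Pinecone, or Qdrant adapters each implementing the same two methods.

```python
import math
from typing import Protocol

class VectorStore(Protocol):
    def add(self, ids, embeddings, metadatas, documents) -> None: ...
    def query(self, embedding, top_k: int = 10) -> list[dict]: ...

class InMemoryStore:
    """Toy reference implementation of the VectorStore interface."""

    def __init__(self):
        self.rows = []  # (id, embedding, metadata, document) tuples

    def add(self, ids, embeddings, metadatas, documents) -> None:
        self.rows.extend(zip(ids, embeddings, metadatas, documents))

    def query(self, embedding, top_k: int = 10) -> list[dict]:
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        # Rank all rows by cosine similarity to the query embedding
        ranked = sorted(self.rows, key=lambda r: cos(embedding, r[1]), reverse=True)
        return [{"id": r[0], "metadata": r[2], "document": r[3]} for r in ranked[:top_k]]
```

The embedding pipeline never changes; only the adapter behind this interface does.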

Handling Dimension Mismatches

Different models produce different vector sizes:

Model                      Output Dimensions            Use Case
text-embedding-004         768                          Text-only retrieval
multimodalembedding@001    1408                         Cross-modal retrieval
gemini-embedding-001       Configurable (up to 3072)    Text, long context

Vectors of different dimensions cannot live in the same collection. Pick one model and use it for everything. For a unified multimodal index, multimodalembedding@001 is the right choice. For text-only workloads, the dedicated text embedding models perform better.

Keeping the Index Fresh

Content changes. You need a strategy for updates:

  • Full re-index: Delete and rebuild. Simple but slow for large datasets.
  • Incremental updates: Track file hashes, re-embed only changed files. More complex, much faster.
  • Soft deletes: Mark records with a deleted: true metadata flag, filter them at query time. Simplest for high-frequency updates.
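A sketch of the soft-delete option: ChromaDB's where filters support comparison operators such as $ne (check your version's documentation), and a store-agnostic fallback is to filter after retrieval. Both are illustrative, not production code.

```python
# Query-time filter in ChromaDB (assumes records were indexed with a
# "deleted" metadata key; operator support may vary by Chroma version):
#
#   results = collection.query(
#       query_embeddings=[query_embedding],
#       n_results=top_k,
#       where={"deleted": {"$ne": "true"}},
#   )

def drop_deleted(results: list[dict]) -> list[dict]:
    """Store-agnostic fallback: strip soft-deleted rows after retrieval."""
    return [
        r for r in results
        if r.get("metadata", {}).get("deleted") != "true"
    ]
```

If you rely on the post-retrieval filter, over-fetch (e.g. 2x top_k) so deletions don't thin out the result list.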

A basic hash-based incremental indexer:

import hashlib, json

def file_hash(filepath: str) -> str:
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def load_state(path: str = ".index_state.json") -> dict:
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_state(state: dict, path: str = ".index_state.json"):
    with open(path, "w") as f:
        json.dump(state, f)

def ingest_directory_incremental(directory: str):
    state = load_state()
    new_state = {}
    items_to_index = []

    for filepath in glob.glob(f"{directory}/**/*", recursive=True):
        ext = filepath.lower().rsplit(".", 1)[-1]
        if ext not in SUPPORTED_EXTENSIONS:
            continue
        h = file_hash(filepath)
        new_state[filepath] = h
        if state.get(filepath) == h:
            continue  # File hasn't changed
        items = SUPPORTED_EXTENSIONS[ext](filepath)
        items_to_index.extend(items if isinstance(items, list) else [items])

    if items_to_index:
        index_items(items_to_index)

    save_state(new_state)
    print(f"Processed {len(items_to_index)} new/changed items.")

Embedding Cost and API Quotas

The Vertex AI Multimodal Embedding API has default quota limits that you may hit with large batch jobs. Check your project quotas in the Google Cloud Console before starting a large ingestion run.

Embedding cost is generally low compared to generative inference — but it adds up across large collections. Cache embeddings to avoid re-computing them unnecessarily, and batch requests efficiently.
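A minimal sketch of such a cache, assuming a JSON file keyed by a content hash. The file name and helper names here are hypothetical; for images you would hash the file bytes rather than the path if files can change in place.

```python
import hashlib
import json
import os

CACHE_PATH = ".embedding_cache.json"  # hypothetical cache location

def _cache_key(item: dict) -> str:
    """Stable key from the item's type and content (text or file path)."""
    return hashlib.md5(f"{item['type']}:{item['content']}".encode("utf-8")).hexdigest()

def embed_cached(item: dict, embed_fn) -> list[float]:
    """Return a cached embedding if present; otherwise compute and store it."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    key = _cache_key(item)
    if key not in cache:
        cache[key] = embed_fn(item)  # e.g. embed_with_retry from above
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[key]
```

Re-running ingestion over an unchanged corpus then costs zero API calls.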


How MindStudio Fits Into This Architecture

The pipeline you’ve built above is solid infrastructure — but infrastructure still needs a usable interface. Someone has to call the API, interpret the results, and act on them.

MindStudio is a no-code platform for building AI agents, and it’s a practical fit here. You can use it to wrap your multimodal search backend in a full application — with a conversational UI, authentication, and downstream actions — without building a frontend from scratch.

A Concrete Integration Pattern

Deploy your FastAPI search endpoint with a public URL (Railway, Fly.io, or any cloud provider). Then in MindStudio, build an agent that:

  1. Accepts a user question through a custom chat interface
  2. Calls your /search/text endpoint via an HTTP action
  3. Receives the ranked results (with source, type, and content)
  4. Passes those results to a Gemini or Claude model to synthesize a natural language answer with citations
  5. Returns the response to the user

The result is a conversational search experience. A user asks “what did we decide about the packaging redesign?” and the agent returns a synthesized answer sourced from the relevant meeting transcript, design document, and Slack export — all in one response.

What MindStudio Adds

MindStudio handles the parts that are tedious to build yourself:

  • Authentication and user sessions without writing a login system
  • Conversation history so the agent can reference earlier queries in a session
  • Multi-step reasoning — if the first search doesn’t return strong results, the agent can refine the query and retry
  • Integrations — search results can be pushed to Slack, saved to Notion, or trigger a workflow in any of MindStudio’s 1,000+ connected tools

You could also run the ingestion pipeline itself as a scheduled MindStudio agent — for example, watching a shared Google Drive folder for new uploads and automatically triggering the embedding and indexing steps when new content arrives.

You can try MindStudio free at mindstudio.ai.


Troubleshooting Common Problems

Search Returns Poor Results Even With Correct Embeddings

This almost always means you’re comparing embeddings from different models or different task types. Check:

  • Your query embedding uses the exact same model as your document embeddings
  • If using the text-only model, query embeddings should use task_type="retrieval_query" and document embeddings should use task_type="retrieval_document"
  • ChromaDB returns distances (lower = more similar), not similarities. Convert with score = 1 - distance for cosine space

Video Frame Extraction Is Very Slow

A loop that calls cap.read() on every frame decodes far more frames than you need. Seek directly to each target frame instead (the ingestion code above already does this):

cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame_number)
ret, frame = cap.read()

This is significantly faster for long videos at large frame intervals.

PDF Text Extraction Returns Nothing

Some PDFs are scanned images with no text layer — pdfplumber will return empty strings for every page. Two options:

  1. Run OCR on the rendered page images using Tesseract:
    import pytesseract
    text = pytesseract.image_to_string(page_img)
  2. Skip text extraction entirely and rely on the image embeddings for those pages. They’ll still be searchable via visual similarity, just not via text queries.

For mixed PDFs (some pages have text, some don’t), combine both approaches: use extracted text where available, OCR where the text layer is empty.
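The combined strategy can be sketched as a small dispatcher. The ocr_fn parameter is an assumption made for illustration; in practice it would be pytesseract.image_to_string as shown in option 1.

```python
def page_text_with_fallback(extracted_text, page_img, ocr_fn) -> str:
    """Use the PDF text layer when present; fall back to OCR otherwise.

    extracted_text: result of page.extract_text() (may be None or blank)
    page_img:       the rendered page image for this page
    ocr_fn:         image -> text, e.g. pytesseract.image_to_string
    """
    if extracted_text and extracted_text.strip():
        return extracted_text
    return ocr_fn(page_img)
```

Wired into ingest_pdf, this replaces the bare extract_text() call so scanned pages still produce text chunks.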

Memory Crashes During Video Processing

High-resolution video frames (1080p+) can exhaust RAM quickly when processed in sequence. Resize frames before saving:

frame_resized = cv2.resize(frame, (640, 360))

The multimodal embedding model is robust to resolution changes. You won’t lose meaningful semantic content by downscaling.

Whisper Transcription Is Too Slow

Whisper’s base model is faster but less accurate. For production, use small or medium on a GPU-enabled machine. Alternatively, use Google’s Speech-to-Text API for faster transcription if you’re already in the GCP ecosystem — it handles audio natively and supports batch processing.


Frequently Asked Questions

What is Gemini Embedding 2 and how does it differ from earlier embedding models?

Gemini Embedding 2 refers to Google’s latest generation of embedding models built on the Gemini architecture. The key difference from earlier models like text-embedding-gecko is a combination of longer input context, stronger multilingual performance, and — in the multimodal variant — the ability to encode images, text, and video into a single shared vector space. Earlier text-only models could only compare text against text. The multimodal model allows cross-modal retrieval, where a text query can match an image based on semantic meaning rather than metadata or tags.

Can I run this system without a Google Cloud account?

You can reduce GCP dependency but not eliminate it for full cross-modal retrieval. The multimodalembedding@001 model runs on Vertex AI and requires a GCP project. However, you can take a lighter approach: use text-embedding-004 (available through the Google AI API without a GCP project) for all text content, and use Gemini’s vision capabilities to generate text descriptions of images before embedding them as text. This loses some cross-modal retrieval quality but avoids the Vertex AI setup entirely.

How many documents can this system handle?

Scaling limits come from your vector store, not the embedding code. ChromaDB handles millions of vectors on a single machine with enough RAM. For larger datasets:

  • Move to a managed vector store (Pinecone, Weaviate, or Qdrant)
  • Use namespacing or collection-level filtering to scope searches to relevant subsets
  • Consider ANN (approximate nearest neighbor) indexes that trade a small amount of recall for much faster query times at scale

The embedding pipeline itself can be parallelized using a task queue like Celery or Ray if ingestion speed becomes a bottleneck.
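As a minimal sketch of that parallelization step: before handing work to Celery tasks or Ray remote functions, you typically split the document list into fixed-size batches so each worker embeds one batch per call. The helper name `batched` is illustrative, not from any library assumed above.

```python
def batched(items, batch_size):
    """Split a list of documents into fixed-size batches for parallel
    embedding workers. The final batch may be smaller than batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Each batch would then be dispatched to a worker, e.g.:
# for batch in batched(documents, 100):
#     embed_batch_task.delay(batch)   # hypothetical Celery task
```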

Does this support multilingual content?

Yes. Gemini’s embedding models support 100+ languages and embed content into a language-agnostic semantic space. An English query can retrieve a matching German or Japanese document if they describe the same concept. For audio and video in other languages, Whisper supports multilingual transcription — omit the language parameter to auto-detect, or pass it explicitly (e.g., whisper_model.transcribe(filepath, language="ja")).

What chunk size works best for each content type?

Chunk size balances retrieval precision (smaller chunks = more specific matches) against context (larger chunks = more complete information per result).

  • Text documents: 300–500 words with 50-word overlap. Good default for most prose.
  • Audio transcripts: 150–200 words with timestamps. Audio is less dense than written text, so smaller chunks work better.
  • Video transcripts: Same as audio. Align chunk boundaries with natural pauses when possible.
  • PDF pages: One chunk per page for visual-heavy documents; standard word chunking for text-heavy ones.

Start with these defaults and tune based on the retrieval quality you observe in practice.
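A word-based chunker with overlap, matching the defaults above, might look like the following sketch (`chunk_words` is a hypothetical helper, not part of any library used earlier):

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into word-based chunks with overlap between neighbors.

    Defaults match the text-document guidance (300-500 words, 50-word
    overlap); pass chunk_size=175, overlap=25 for audio/video transcripts.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)] if words else []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which is what keeps boundary content retrievable.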

How do I handle slide decks or presentations?

Export the deck as a PDF first, then use the PDF ingestion pipeline above. Each slide gets:

  • Rendered as an image (captures charts, diagrams, visual layout, brand elements)
  • Text extracted from the slide content (captures speaker notes, bullet points, headings)

Both go into the index. A query about “Q3 performance summary” can then match via the chart image (visual similarity) and the text on that slide (keyword + semantic match). This dual representation usually gives better results than either approach alone.


Key Takeaways

Building a multimodal search system with Gemini Embedding 2 is a real engineering task, but one with a clear path from prototype to production.

  • Shared vector space is everything. When text, images, audio, video, and PDF content all live in the same embedding space, a single query retrieves across all of them. Use the same embedding model for both indexing and querying — mixing models breaks retrieval.
  • Each modality needs its own preprocessing. Text needs chunking. Images go in directly. PDFs need both text extraction and page rendering. Audio and video need transcription before text embedding. Getting this layer right has the biggest impact on retrieval quality.
  • Metadata makes results usable. Store source paths, timestamps, page numbers, and content types alongside every vector. Without metadata, you know something matched — but not where to find it or what to do with it.
  • Start local, scale incrementally. ChromaDB is fine for development and small production deployments. Switching to a managed vector store later is straightforward — the embedding code doesn’t change.
  • The pipeline needs an interface. Search infrastructure only delivers value when people can use it. Whether you build a custom frontend or use a platform like MindStudio to wrap the backend in a conversational agent, the retrieval layer alone isn’t enough.

If you want to skip the frontend work entirely, MindStudio lets you build a fully functional search application on top of this backend — with a custom UI, conversation history, and integrations — without writing additional code. Start for free at mindstudio.ai.