How to Build a Multimodal Vector Database with Gemini Embedding 2 and Pinecone
Step-by-step guide to building a multimodal vector database using Gemini Embedding 2 and Pinecone — covering text, images, video, audio, and PDFs.
What Is a Multimodal Vector Database — and Why Should You Build One?
Most search systems are built around one data type. You search text with text. You search images with image queries. But the real world doesn’t work that way — a product catalog has images, descriptions, and spec sheets. A media library has videos, transcripts, and thumbnails. A support system has ticket text, screenshots, and voice recordings.
A multimodal vector database lets you search and retrieve across all of these using a single, unified index. You embed text, images, audio, video, and documents into the same vector space, then run queries that can match across all of them — regardless of modality.
This guide walks through building exactly that using Gemini Embedding 2 (Google’s multimodal embedding model) and Pinecone (a managed vector database). You’ll cover the full stack: setting up Pinecone, generating multimodal embeddings with the Gemini API, indexing mixed media, and querying across modalities.
Understanding Gemini Embedding 2
Google’s Gemini Embedding 2 refers to the gemini-embedding-exp-03-07 experimental model released in early 2025, as well as the production-ready text-embedding-004 for text and the multimodalembedding@001 model available via Vertex AI for image and video use cases. The newer Gemini 2.0 embedding lineup extends this to true cross-modal embedding.
What Makes It Multimodal
Traditional embedding models map text to vectors. Gemini’s multimodal embedding models map text, images, and video clips into the same latent space. This means a text query like “a golden retriever running on a beach” can retrieve a relevant image, even if that image has no caption. The semantic meaning is shared across modalities.
Key capabilities of the Gemini multimodal embedding API:
- Supports text inputs up to 32,768 tokens
- Supports image inputs (JPEG, PNG, GIF, BMP, WebP)
- Supports video inputs (MP4, MOV, AVI, and others)
- Returns 1408-dimensional embedding vectors by default
- Cross-modal retrieval: query with one modality, retrieve another
- Available via the generativelanguage.googleapis.com API or Vertex AI
Text-Specific Options: text-embedding-004 vs. gemini-embedding-exp
For text-only use cases, Google offers text-embedding-004, which produces 768-dimensional vectors and is fast and cost-effective. The experimental gemini-embedding-exp-03-07 model sits near the top of the MTEB benchmark (scoring around 72.4 at the time of writing), supports up to 8,192 input tokens, and offers variable output dimensions from 64 to 3072.
For multimodal use cases specifically, you’ll want to use either:
- multimodalembedding@001 on Vertex AI (supports text + image + video)
- Or combine gemini-embedding-exp for text with vision model features from gemini-2.0-flash
The exact model choice depends on whether you’re using the Gemini Developer API (AI Studio) or Vertex AI. This guide covers both paths.
Embedding Dimensions and Cost
| Model | Modalities | Default Dimensions | Notes |
|---|---|---|---|
| text-embedding-004 | Text | 768 | Production, stable |
| gemini-embedding-exp-03-07 | Text | 3072 (variable) | Experimental, high MTEB score |
| multimodalembedding@001 | Text, Image, Video | 1408 | Vertex AI only |
| embedding-001 | Text | 768 | Legacy |
For a unified multimodal index in Pinecone, all vectors must have the same dimension. Plan this before you start.
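Since a dimension mismatch only surfaces at upsert time, it can help to guard for it explicitly before anything reaches Pinecone. A minimal sketch; the helper name and the EXPECTED_DIM value are illustrative, not part of either SDK:

```python
# Fail fast if an embedding doesn't match the index dimension.
# EXPECTED_DIM is an assumption -- set it to whatever your index uses.
EXPECTED_DIM = 1408

def check_dimension(vector: list[float], expected: int = EXPECTED_DIM) -> list[float]:
    """Raise early, before upsert, if an embedding has the wrong length."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, index expects {expected}"
        )
    return vector
```

Wrapping every embedding call in a check like this turns a confusing upsert failure (or silent retrieval degradation) into an immediate, local error.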
Setting Up Pinecone
Pinecone is a fully managed vector database built for similarity search. It handles indexing, querying, and scaling — you just push vectors and metadata, then query.
Create a Pinecone Account and Index
- Go to pinecone.io and create a free account.
- From the console, click Create Index.
- Choose a name (e.g., multimodal-index).
- Set the dimension to match your embedding model output. For multimodalembedding@001, use 1408. For gemini-embedding-exp-03-07 with reduced dimensions, use whatever you configure (e.g., 1024).
- Choose Cosine as the distance metric — it works well for semantic similarity.
- Select the Serverless tier and a cloud region (e.g., us-east-1 on AWS).
- Copy your API key from the API Keys tab.
Install the Pinecone SDK
pip install pinecone-client google-generativeai vertexai python-dotenv pillow
For Vertex AI specifically:
pip install google-cloud-aiplatform
Initialize the Pinecone Client
import os
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv
load_dotenv()
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index_name = "multimodal-index"
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=1408, # Match your embedding model
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index(index_name)
Generating Embeddings with the Gemini API
There are two paths here depending on your use case.
Path 1: Text Embeddings via Gemini Developer API
This uses the google-generativeai SDK with your AI Studio API key.
import google.generativeai as genai
import os
genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
def embed_text(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
"""
task_type options:
- RETRIEVAL_DOCUMENT: for indexing
- RETRIEVAL_QUERY: for querying
- SEMANTIC_SIMILARITY: for similarity tasks
- CLASSIFICATION: for classification
- CLUSTERING: for clustering
"""
result = genai.embed_content(
model="models/gemini-embedding-exp-03-07",
content=text,
task_type=task_type,
output_dimensionality=1408 # Match Pinecone index dimension
)
return result["embedding"]
# Test it
embedding = embed_text("A cat sitting on a red chair")
print(f"Vector length: {len(embedding)}") # Should print 1408
The task_type parameter matters. Use RETRIEVAL_DOCUMENT when indexing content and RETRIEVAL_QUERY when embedding search queries. This distinction improves retrieval quality significantly.
Path 2: Multimodal Embeddings via Vertex AI
For true multimodal embedding (text + images + video in the same space), use Vertex AI’s multimodalembedding@001 model.
from vertexai.vision_models import MultiModalEmbeddingModel, Image, Video
import vertexai
vertexai.init(project=os.environ.get("GCP_PROJECT_ID"), location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
def embed_image(image_path: str) -> list[float]:
image = Image.load_from_file(image_path)
embeddings = model.get_embeddings(
image=image,
dimension=1408
)
return embeddings.image_embedding
def embed_text_vertex(text: str) -> list[float]:
embeddings = model.get_embeddings(
contextual_text=text,
dimension=1408
)
return embeddings.text_embedding
def embed_video(video_path: str, video_segment_config=None) -> list[dict]:
video = Video.load_from_file(video_path)
embeddings = model.get_embeddings(
video=video,
video_segment_config=video_segment_config,
dimension=1408
)
return [
{
"start_offset_sec": segment.start_offset_sec,
"end_offset_sec": segment.end_offset_sec,
"embedding": segment.embedding
}
for segment in embeddings.video_embeddings
]
Note that multimodalembedding@001 requires a Google Cloud project with the Vertex AI API enabled. You’ll need to authenticate via gcloud auth application-default login or a service account.
Indexing Mixed Media into Pinecone
Now the core of it: getting text, images, videos, audio, and PDFs into Pinecone as vectors with rich metadata. One caveat before the examples: the text helpers below use embed_text (the Gemini Developer API path) for illustration. If your index also holds image or video vectors from multimodalembedding@001, embed your text with embed_text_vertex instead, so that every vector in the index shares the same space.
Indexing Text Documents
import uuid
def index_text_document(text: str, metadata: dict) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_text(text, task_type="RETRIEVAL_DOCUMENT")
index.upsert(vectors=[{
"id": vector_id,
"values": embedding,
"metadata": {
"type": "text",
"content": text[:1000], # Store a preview, not the full text
**metadata
}
}])
return vector_id
# Example usage
doc_id = index_text_document(
text="Pinecone is a vector database optimized for machine learning applications.",
metadata={
"source": "pinecone_docs",
"title": "Pinecone Overview",
"url": "https://docs.pinecone.io"
}
)
For large documents, chunk the text first. A good rule of thumb is 512–1024 tokens per chunk with 10–20% overlap between chunks.
Chunking and Indexing PDFs
PDFs need preprocessing. Use PyMuPDF or pypdf to extract text, then chunk and embed.
import fitz  # PyMuPDF
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
words = text.split()
chunks = []
i = 0
while i < len(words):
chunk = " ".join(words[i:i + chunk_size])
chunks.append(chunk)
i += chunk_size - overlap
return chunks
def index_pdf(pdf_path: str, metadata: dict) -> list[str]:
doc = fitz.open(pdf_path)
full_text = ""
for page_num in range(len(doc)):
page = doc.load_page(page_num)
full_text += page.get_text()
chunks = chunk_text(full_text)
vector_ids = []
vectors_to_upsert = []
for i, chunk in enumerate(chunks):
embedding = embed_text(chunk, task_type="RETRIEVAL_DOCUMENT")
vector_id = f"{metadata.get('doc_id', str(uuid.uuid4()))}_chunk_{i}"
vectors_to_upsert.append({
"id": vector_id,
"values": embedding,
"metadata": {
"type": "pdf",
"chunk_index": i,
"total_chunks": len(chunks),
"content": chunk[:500],
**metadata
}
})
vector_ids.append(vector_id)
# Batch upsert for efficiency
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
batch = vectors_to_upsert[i:i + batch_size]
index.upsert(vectors=batch)
return vector_ids
# Example
ids = index_pdf(
pdf_path="./research_paper.pdf",
metadata={
"doc_id": "paper_001",
"title": "Vector Search at Scale",
"author": "Research Team",
"year": 2024
}
)
print(f"Indexed {len(ids)} chunks from PDF")
Indexing Images
For images, use the Vertex AI multimodal model. Store the image URL or path in metadata so you can retrieve the actual image after a search.
def index_image(image_path: str, metadata: dict) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_image(image_path)
index.upsert(vectors=[{
"id": vector_id,
"values": embedding,
"metadata": {
"type": "image",
"file_path": image_path,
**metadata
}
}])
return vector_id
# Batch index a directory of images
import os
from pathlib import Path
def index_image_directory(directory: str, base_metadata: dict = {}) -> list[str]:
image_extensions = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".bmp"}
image_files = [
f for f in Path(directory).iterdir()
if f.suffix.lower() in image_extensions
]
vector_ids = []
for img_file in image_files:
try:
vec_id = index_image(
image_path=str(img_file),
metadata={
"filename": img_file.name,
"directory": directory,
**base_metadata
}
)
vector_ids.append(vec_id)
print(f"Indexed: {img_file.name}")
except Exception as e:
print(f"Failed to index {img_file.name}: {e}")
return vector_ids
One thing to be aware of: images with text in them benefit from a two-pass approach — embed the image itself, and also extract and embed any visible text using OCR (Tesseract or Gemini Vision). Store both vectors and link them via a shared document_id in metadata.
Indexing Videos
Video indexing is more nuanced because a single video contains multiple semantic moments. multimodalembedding@001 handles this by generating segment-level embeddings — each covering a configurable window of time.
from vertexai.vision_models import VideoSegmentConfig
def index_video(video_path: str, metadata: dict, interval_sec: int = 10) -> list[str]:
"""
Indexes a video by splitting into segments and embedding each.
Default: one embedding per 10-second segment.
"""
segment_config = VideoSegmentConfig(
start_offset_sec=0,
end_offset_sec=None, # Embed entire video
interval_sec=interval_sec
)
segments = embed_video(video_path, video_segment_config=segment_config)
vector_ids = []
vectors_to_upsert = []
for segment in segments:
vector_id = f"{metadata.get('video_id', str(uuid.uuid4()))}_{segment['start_offset_sec']}_{segment['end_offset_sec']}"
vectors_to_upsert.append({
"id": vector_id,
"values": segment["embedding"],
"metadata": {
"type": "video",
"start_sec": segment["start_offset_sec"],
"end_sec": segment["end_offset_sec"],
"file_path": video_path,
**metadata
}
})
vector_ids.append(vector_id)
index.upsert(vectors=vectors_to_upsert)
return vector_ids
With 10-second segments, a 5-minute video produces ~30 vectors. Each points back to the same file with a time range in metadata — so when you retrieve it, you know exactly which part of the video matched the query.
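One convenient way to surface those time ranges is to build a deep link per match. A small sketch using the W3C Media Fragments syntax (#t=start,end), which browsers honor for direct video file URLs; the helper name is illustrative:

```python
def segment_url(base_url: str, start_sec: float, end_sec: float) -> str:
    """Build a link that jumps straight to the matched segment using
    the W3C Media Fragments syntax (#t=start,end)."""
    return f"{base_url}#t={start_sec:g},{end_sec:g}"

# Example: turning a Pinecone match's metadata into a clickable segment link
match_metadata = {"file_path": "https://cdn.example.com/demo.mp4",
                  "start_sec": 30, "end_sec": 40}
link = segment_url(match_metadata["file_path"],
                   match_metadata["start_sec"],
                   match_metadata["end_sec"])
```

If you serve videos through a custom player instead, the same metadata can seed a seek-to-timestamp call.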
Indexing Audio
The multimodalembedding@001 model doesn’t directly embed audio. The standard approach is to transcribe audio first (using Gemini’s audio understanding or Whisper), then embed the transcript. You can optionally chunk the transcript with timestamps to preserve fine-grained retrieval.
import google.generativeai as genai
def transcribe_audio(audio_path: str) -> str:
"""Transcribe audio using Gemini's multimodal capabilities."""
model = genai.GenerativeModel("gemini-2.0-flash")
with open(audio_path, "rb") as audio_file:
audio_data = audio_file.read()
# Detect MIME type from extension
ext = Path(audio_path).suffix.lower()
mime_map = {
".mp3": "audio/mp3",
".wav": "audio/wav",
".m4a": "audio/mp4",
".ogg": "audio/ogg",
".flac": "audio/flac"
}
mime_type = mime_map.get(ext, "audio/mp3")
response = model.generate_content([
{"mime_type": mime_type, "data": audio_data},
"Transcribe this audio accurately. Include speaker labels if multiple speakers are present."
])
return response.text
def index_audio(audio_path: str, metadata: dict) -> list[str]:
transcript = transcribe_audio(audio_path)
chunks = chunk_text(transcript, chunk_size=500, overlap=50)
vectors_to_upsert = []
vector_ids = []
audio_id = metadata.get("audio_id", str(uuid.uuid4()))
for i, chunk in enumerate(chunks):
embedding = embed_text(chunk, task_type="RETRIEVAL_DOCUMENT")
vector_id = f"{audio_id}_transcript_chunk_{i}"
vectors_to_upsert.append({
"id": vector_id,
"values": embedding,
"metadata": {
"type": "audio",
"source": "transcript",
"chunk_index": i,
"content": chunk[:500],
"file_path": audio_path,
**metadata
}
})
vector_ids.append(vector_id)
index.upsert(vectors=vectors_to_upsert)
return vector_ids
Querying Across Modalities
Once everything is in Pinecone, you can search across all modalities with a single query. The key insight: if your text and image embeddings share the same vector space (achieved via multimodalembedding@001), a text query can retrieve images, and an image query can retrieve text.
Text Query Across All Content Types
def search(
query_text: str,
top_k: int = 10,
filter_type: str = None
) -> list[dict]:
"""
Search across all indexed content using a text query.
Optionally filter by content type: 'text', 'image', 'video', 'audio', 'pdf'
"""
query_embedding = embed_text_vertex(query_text)
filter_params = None
if filter_type:
filter_params = {"type": {"$eq": filter_type}}
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True,
filter=filter_params
)
return results["matches"]
# Search everything
results = search("golden retriever playing fetch on a beach", top_k=5)
for match in results:
print(f"Score: {match['score']:.4f} | Type: {match['metadata']['type']} | ID: {match['id']}")
if "content" in match["metadata"]:
print(f" Content: {match['metadata']['content'][:100]}...")
if "file_path" in match["metadata"]:
print(f" File: {match['metadata']['file_path']}")
Image Query to Find Similar Images or Related Text
def search_by_image(
image_path: str,
top_k: int = 10,
filter_type: str = None
) -> list[dict]:
"""Use an image to find semantically similar content."""
image_embedding = embed_image(image_path)
filter_params = None
if filter_type:
filter_params = {"type": {"$eq": filter_type}}
results = index.query(
vector=image_embedding,
top_k=top_k,
include_metadata=True,
filter=filter_params
)
return results["matches"]
# Find text documents related to an image
text_matches = search_by_image("./my_photo.jpg", filter_type="text")
Hybrid Search with Metadata Filtering
Pinecone supports metadata filtering alongside vector search. This is useful for restricting search to a specific date range, source, or content category.
def search_with_filters(
query_text: str,
top_k: int = 10,
filters: dict = None
) -> list[dict]:
"""
Advanced search with metadata filters.
Example filters:
- {"type": {"$in": ["image", "video"]}}
- {"year": {"$gte": 2023}}
- {"source": {"$eq": "product_catalog"}}
"""
query_embedding = embed_text_vertex(query_text)
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True,
filter=filters
)
return results["matches"]
# Find only product images from 2024
product_results = search_with_filters(
query_text="running shoes",
top_k=10,
filters={
"type": {"$eq": "image"},
"category": {"$eq": "footwear"},
"year": {"$gte": 2024}
}
)
Building a Simple Retrieval API
With the indexing and querying functions in place, wrapping everything in a simple FastAPI service makes it usable by other applications.
import os
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import List, Optional

from fastapi import FastAPI, File, HTTPException, UploadFile
from pydantic import BaseModel
app = FastAPI(title="Multimodal Search API")
class TextSearchRequest(BaseModel):
query: str
top_k: int = 10
filter_type: Optional[str] = None
filters: Optional[dict] = None
class SearchResult(BaseModel):
id: str
score: float
type: str
metadata: dict
@app.post("/search/text", response_model=List[SearchResult])
async def text_search(request: TextSearchRequest):
try:
matches = search_with_filters(
query_text=request.query,
top_k=request.top_k,
filters=request.filters or (
{"type": {"$eq": request.filter_type}}
if request.filter_type else None
)
)
return [
SearchResult(
id=m["id"],
score=m["score"],
type=m["metadata"].get("type", "unknown"),
metadata=m["metadata"]
)
for m in matches
]
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.post("/search/image", response_model=List[SearchResult])
async def image_search(
file: UploadFile = File(...),
top_k: int = 10,
filter_type: Optional[str] = None
):
with tempfile.NamedTemporaryFile(
delete=False,
suffix=Path(file.filename).suffix
) as tmp:
shutil.copyfileobj(file.file, tmp)
tmp_path = tmp.name
try:
matches = search_by_image(tmp_path, top_k=top_k, filter_type=filter_type)
return [
SearchResult(
id=m["id"],
score=m["score"],
type=m["metadata"].get("type", "unknown"),
metadata=m["metadata"]
)
for m in matches
]
finally:
os.unlink(tmp_path)
@app.post("/index/text")
async def index_text(text: str, metadata: dict = {}):
vector_id = index_text_document(text=text, metadata=metadata)
return {"id": vector_id, "status": "indexed"}
@app.post("/index/pdf")
async def index_pdf_endpoint(
file: UploadFile = File(...),
doc_id: Optional[str] = None
):
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
shutil.copyfileobj(file.file, tmp)
tmp_path = tmp.name
try:
ids = index_pdf(
pdf_path=tmp_path,
metadata={
"doc_id": doc_id or str(uuid.uuid4()),
"filename": file.filename
}
)
return {"chunk_count": len(ids), "status": "indexed"}
finally:
os.unlink(tmp_path)
Run it with:
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Practical Architecture Patterns
Namespace Separation in Pinecone
Pinecone supports namespaces — logical partitions within an index. This is useful for multi-tenant applications or separating content by type while keeping the same index.
# Index with namespaces
def index_with_namespace(text: str, metadata: dict, namespace: str) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_text(text)
index.upsert(
vectors=[{"id": vector_id, "values": embedding, "metadata": metadata}],
namespace=namespace
)
return vector_id
# Search within a specific namespace
def search_namespace(query: str, namespace: str, top_k: int = 10):
query_embedding = embed_text_vertex(query)
return index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True,
namespace=namespace
)["matches"]
# Example: separate namespaces per customer
index_with_namespace(text="...", metadata={}, namespace="customer_acme")
index_with_namespace(text="...", metadata={}, namespace="customer_globex")
Handling Embedding Drift and Updates
When your embedding model changes, old vectors in Pinecone become stale — they’re in a different vector space than new queries. The cleanest way to handle this is to track the model version in metadata and re-embed when needed.
from datetime import datetime

MODEL_VERSION = "gemini-embedding-exp-03-07-v1"
def index_with_version(text: str, metadata: dict) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_text(text)
index.upsert(vectors=[{
"id": vector_id,
"values": embedding,
"metadata": {
**metadata,
"embedding_model": MODEL_VERSION,
"indexed_at": str(datetime.now().isoformat())
}
}])
return vector_id
When you switch models, query for vectors with the old model version and re-embed them. Pinecone’s upsert will overwrite existing IDs, so no cleanup is needed.
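The version check itself is simple once you have metadata in hand (for example, via index.fetch on ids you enumerate). A sketch of the filtering step, with illustrative names:

```python
def stale_vector_ids(fetched_metadata: dict[str, dict], current_version: str) -> list[str]:
    """Given {vector_id: metadata} (e.g., assembled from index.fetch),
    return the ids whose embedding_model tag is missing or outdated."""
    return [
        vid for vid, meta in fetched_metadata.items()
        if meta.get("embedding_model") != current_version
    ]
```

Feed the returned ids through your re-embedding pipeline and upsert them under the same ids with the new MODEL_VERSION tag.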
Two-Stage Retrieval: Dense + Re-ranking
For production systems, a two-stage pipeline improves precision:
- Stage 1 — Vector retrieval: Fetch the top 50 candidates from Pinecone using dense vector similarity.
- Stage 2 — Re-ranking: Re-score candidates using a cross-encoder model (e.g., Cohere Rerank or a custom Gemini re-ranking prompt).
import cohere
co = cohere.Client(os.environ.get("COHERE_API_KEY"))
def search_and_rerank(query: str, top_k: int = 10, candidate_multiplier: int = 5):
# Stage 1: get more candidates than needed
candidates = search(query, top_k=top_k * candidate_multiplier)
    # Prepare documents for reranking, remembering which candidate each doc
    # came from (candidates without text content are skipped, so indexes into
    # the reranked results don't line up with the original candidates list)
    docs = []
    doc_candidates = []
    for c in candidates:
        content = c["metadata"].get("content", "")
        if content:
            docs.append(content)
            doc_candidates.append(c)
    if not docs:
        return candidates[:top_k]
    # Stage 2: rerank
    reranked = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v3.0"
    )
    # Map reranked results back to the original candidates
    results = []
    for result in reranked.results:
        original = doc_candidates[result.index]
        original["rerank_score"] = result.relevance_score
        results.append(original)
    return results
Common Mistakes and How to Avoid Them
Mixing Embedding Models in the Same Index
This is the most common mistake, and it fails in two ways. Pinecone rejects upserts whose dimension doesn't match the index, so pushing text-embedding-004 vectors (768 dims) into a 1408-dim index errors out immediately. The subtler failure is mixing models at the same dimension (say, gemini-embedding-exp-03-07 configured to 768): the upserts succeed, but retrieval quality collapses because the vectors come from different embedding spaces. Every vector in a Pinecone index must have the same dimensionality, and they should all come from the same model.
Fix: Decide on one model before you start. If you need to change models later, create a new index, re-embed everything, and migrate.
Not Using Task Types for Text Embeddings
Gemini’s text embedding models differentiate between documents (things you’re indexing) and queries (things you’re searching with). Using RETRIEVAL_DOCUMENT for indexing and RETRIEVAL_QUERY for search queries meaningfully improves recall.
Fix: Always specify the task_type parameter.
Embedding Too Much Text Per Chunk
Larger chunks produce more averaged-out, less specific embeddings. A 5,000-word document embedded as a single vector loses the nuance of individual sections. Conversely, chunks that are too small lose context.
Fix: Use 512–1024 token chunks with 10–20% overlap for most document types.
Storing Raw Binary Data in Metadata
Pinecone metadata values must be strings, numbers, booleans, or arrays of strings. You can’t store binary data, base64-encoded images, or large blobs.
Fix: Store file paths, URLs, or object storage keys in metadata. Retrieve the actual media from your storage layer (S3, GCS, etc.) after a search.
Ignoring Rate Limits
The Gemini API has per-minute rate limits. When batch-indexing thousands of documents, you’ll hit them.
Fix: Add rate limiting and exponential backoff.
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=4, max=60)
)
def embed_with_retry(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
return embed_text(text, task_type=task_type)
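Backoff reacts to failures after they happen; pacing requests proactively keeps you under the quota in the first place. A minimal client-side pacer, sketched with an injectable clock so it can be tested; the 60-per-minute figure is an example, not a documented quota:

```python
import time

class RateLimiter:
    """Naive fixed-rate pacer: spaces successive calls at least
    min_interval seconds apart. Single-threaded use only."""

    def __init__(self, per_minute: int = 60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self) -> float:
        """Sleep just long enough to honor the rate; return the delay applied."""
        now = self.clock()
        delay = 0.0
        if self.last_call is not None:
            delay = max(0.0, self.min_interval - (now - self.last_call))
            if delay:
                self.sleep(delay)
        self.last_call = now + delay
        return delay
```

Call limiter.wait() before each embed request in a batch loop, alongside the retry decorator above, and bursts smooth out into a steady request stream.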
Skipping Metadata Planning
Metadata is what makes your vector database queryable beyond pure similarity. Deciding upfront what metadata to store on each vector type (content type, source, date, category, IDs for linking) saves major refactoring later.
Recommended metadata fields by type:
| Content Type | Required | Recommended |
|---|---|---|
| Text | type, content | source, title, date, author |
| Image | type, file_path | filename, category, dimensions |
| Video | type, file_path, start_sec, end_sec | duration, title, scene_description |
| Audio | type, file_path, chunk_index | speaker, duration, language |
| PDF | type, doc_id, chunk_index | title, page_range, author |
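The required columns from this table can be enforced with a small pre-upsert validator; the function name and field sets below mirror the table and are otherwise illustrative:

```python
# Required metadata fields per content type, taken from the table above.
REQUIRED_FIELDS = {
    "text": {"type", "content"},
    "image": {"type", "file_path"},
    "video": {"type", "file_path", "start_sec", "end_sec"},
    "audio": {"type", "file_path", "chunk_index"},
    "pdf": {"type", "doc_id", "chunk_index"},
}

def missing_metadata(metadata: dict) -> set[str]:
    """Return the required fields missing for this record's content type."""
    required = REQUIRED_FIELDS.get(metadata.get("type"), {"type"})
    return required - set(metadata)
```

Running this check in your indexing functions and refusing to upsert when it returns a non-empty set prevents the schema drift that makes filtered queries unreliable later.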
Where MindStudio Fits in This Stack
Once you’ve built a multimodal vector database, the natural next step is exposing it to AI agents that can actually use it. That’s where MindStudio becomes relevant.
MindStudio is a no-code platform for building AI agents and automated workflows. It connects to 200+ AI models and 1,000+ business tools out of the box. More importantly for this context, it supports webhook and API endpoint agents — which means you can point a MindStudio workflow directly at the FastAPI retrieval service you built above.
Here’s a concrete use case: say you’ve indexed a company’s entire asset library — product images, spec sheet PDFs, training videos, and support audio recordings — into Pinecone using Gemini multimodal embeddings. You then want a support agent that can answer customer questions by pulling from all of these.
In MindStudio, you’d wire up:
- A trigger (incoming support ticket or chat message)
- A call to your Pinecone search API to retrieve relevant text, images, and video segments
- A Gemini or GPT-4o step that uses the retrieved content to generate a response
- An output step that sends the response back
The whole thing takes under an hour to build, no custom server infrastructure required. The agent can call your vector search endpoint as a webhook, format the retrieved results as context, and pass it to any model for synthesis.
If you want to skip managing embedding infrastructure entirely, MindStudio’s built-in workflow steps can also handle chunking, embedding, and retrieval using managed integrations — useful when you’re prototyping and don’t want to maintain a separate service.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is Gemini Embedding 2?
Gemini Embedding 2 refers to Google’s second generation of embedding models under the Gemini family, including gemini-embedding-exp-03-07 for text (which ranks highly on the MTEB benchmark) and multimodalembedding@001 on Vertex AI, which supports text, image, and video in a shared vector space. The key advancement over earlier models is cross-modal retrieval: text queries can retrieve images, and vice versa, without any paired training data for your specific content.
Can Pinecone store multimodal data directly?
No — Pinecone stores vectors (arrays of floats) and structured metadata. It doesn’t store the raw images, audio, or video files themselves. You embed your media into vectors using a model like Gemini’s multimodal embedding API, store those vectors in Pinecone, and keep the actual media files in a separate storage system (like Google Cloud Storage, AWS S3, or a CDN). Metadata fields like file_path or url link vectors back to the original media.
What’s the difference between multimodalembedding@001 and text-embedding-004?
text-embedding-004 is a text-only model that produces 768-dimensional vectors and is available via the Gemini Developer API. It’s fast, stable, and great for pure text retrieval tasks. multimodalembedding@001 is available only on Vertex AI and embeds text, images, and video into the same 1408-dimensional space. For a truly multimodal index where a text query can retrieve images (and vice versa), you need multimodalembedding@001. For text-only retrieval, text-embedding-004 or the experimental gemini-embedding-exp-03-07 are more cost-effective.
How do I handle large-scale indexing without hitting API rate limits?
Use batching, concurrency controls, and retry logic with exponential backoff. The tenacity library makes retry logic straightforward in Python. For very large datasets (millions of records), consider using a queue system (like Google Cloud Tasks, Redis Queue, or AWS SQS) to spread the embedding workload across time and avoid bursting. Pinecone itself has upsert rate limits too — batch upserts in groups of 100 vectors, not single vectors.
Can I search with an image query in Pinecone?
Yes, as long as your index contains vectors generated by the same multimodal embedding model. Embed the query image using multimodalembedding@001, then run a standard Pinecone query with that vector. You’ll get back all matching content — whether it was originally indexed as an image, text, or video — that is semantically similar in that shared embedding space. This is what makes true cross-modal retrieval possible.
How do I update or delete indexed content?
Pinecone supports delete by vector ID and upsert to overwrite existing vectors. For updates, use the same vector ID as the original and call upsert with the new embedding. For deletions, call index.delete(ids=["vector_id_1", "vector_id_2"]). If you’re using namespaces, include the namespace in both operations. For bulk deletions (e.g., removing all content from a specific source), use metadata filtering with delete_by_filter if your Pinecone plan supports it.
Key Takeaways
- Gemini’s multimodal embedding models map text, images, and video into a shared vector space — enabling cross-modal retrieval without paired training data.
- Pinecone handles the storage and retrieval layer. Plan your index dimensions and metadata schema before you start indexing.
- Content type matters for preprocessing: PDFs and audio require text extraction/transcription; video requires segment-level splitting; images can be embedded directly.
- Always match task types when embedding text — use
RETRIEVAL_DOCUMENTfor indexing andRETRIEVAL_QUERYfor search queries. - Production readiness requires retry logic, rate limiting, metadata versioning, and a two-stage retrieval strategy for best precision.
- Agents built on top of this stack — using tools like MindStudio — can expose your multimodal search capability to business workflows without additional backend work.
Building this pipeline takes real effort to get right, but the payoff is significant: a single search interface that understands what your content means, not just what it says. That’s a fundamentally different capability than keyword search or file-type-specific retrieval — and it opens the door to AI agents that can genuinely reason over your entire knowledge base.