How to Build a Multimodal Vector Database with Gemini Embedding 2 and Pinecone
Step-by-step guide to building a multimodal vector database using Gemini Embedding 2 and Pinecone — covering text, images, video, audio, and PDFs.
What Is a Multimodal Vector Database — and Why Should You Build One?
Most search systems are built around one data type. You search text with text. You search images with image queries. But the real world doesn’t work that way — a product catalog has images, descriptions, and spec sheets. A media library has videos, transcripts, and thumbnails. A support system has ticket text, screenshots, and voice recordings.
A multimodal vector database lets you search and retrieve across all of these using a single, unified index. You embed text, images, audio, video, and documents into the same vector space, then run queries that can match across all of them — regardless of modality.
This guide walks through building exactly that using Gemini Embedding 2 (Google’s multimodal embedding model) and Pinecone (a managed vector database). You’ll cover the full stack: setting up Pinecone, generating multimodal embeddings with the Gemini API, indexing mixed media, and querying across modalities.
Understanding Gemini Embedding 2
Google’s Gemini Embedding 2 refers to the gemini-embedding-exp-03-07 experimental model released in early 2025, as well as the production-ready text-embedding-004 for text and the multimodalembedding@001 model available via Vertex AI for image and video use cases. The newer Gemini 2.0 embedding lineup extends this to true cross-modal embedding.
What Makes It Multimodal
Traditional embedding models map text to vectors. Gemini’s multimodal embedding models map text, images, and video clips into the same latent space. This means a text query like “a golden retriever running on a beach” can retrieve a relevant image, even if that image has no caption. The semantic meaning is shared across modalities.
Key capabilities of the Gemini multimodal embedding API:
- Supports text inputs up to 32,768 tokens
- Supports image inputs (JPEG, PNG, GIF, BMP, WebP)
- Supports video inputs (MP4, MOV, AVI, and others)
- Returns 1408-dimensional embedding vectors by default
- Cross-modal retrieval: query with one modality, retrieve another
- Available via the generativelanguage.googleapis.com API or Vertex AI
Text-Specific Options: text-embedding-004 vs. gemini-embedding-exp
For text-only use cases, Google offers text-embedding-004, which produces 768-dimensional vectors and is fast and cost-effective. The experimental gemini-embedding-exp-03-07 model sits near the top of the MTEB benchmark (scoring around 72.4 at the time of writing), supports up to 8,192 input tokens, and offers variable output dimensions from 64 to 3072.
For multimodal use cases specifically, you’ll want to use either:
- multimodalembedding@001 on Vertex AI (supports text + image + video)
- Or combine gemini-embedding-exp for text with vision model features from gemini-2.0-flash
The exact model choice depends on whether you’re using the Gemini Developer API (AI Studio) or Vertex AI. This guide covers both paths.
Embedding Dimensions and Cost
| Model | Modalities | Default Dimensions | Notes |
|---|---|---|---|
| text-embedding-004 | Text | 768 | Production, stable |
| gemini-embedding-exp-03-07 | Text | 3072 (variable) | Experimental, high MTEB score |
| multimodalembedding@001 | Text, Image, Video | 1408 | Vertex AI only |
| embedding-001 | Text | 768 | Legacy |
For a unified multimodal index in Pinecone, all vectors must have the same dimension. Plan this before you start.
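Since a dimension mismatch only surfaces at upsert time, it can help to guard for it explicitly before anything reaches Pinecone. A minimal sketch; the helper name and the EXPECTED_DIM value are illustrative, not part of either SDK:

```python
# Fail fast if an embedding doesn't match the index dimension.
# EXPECTED_DIM is an assumption -- set it to whatever your index uses.
EXPECTED_DIM = 1408

def check_dimension(vector: list[float], expected: int = EXPECTED_DIM) -> list[float]:
    """Raise early, before upsert, if an embedding has the wrong length."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, index expects {expected}"
        )
    return vector
```

Wrapping every embedding call in a check like this turns a confusing upsert failure (or silent retrieval degradation) into an immediate, local error.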
Setting Up Pinecone
Pinecone is a fully managed vector database built for similarity search. It handles indexing, querying, and scaling — you just push vectors and metadata, then query.
Create a Pinecone Account and Index
- Go to pinecone.io and create a free account.
- From the console, click Create Index.
- Choose a name (e.g., multimodal-index).
- Set the dimension to match your embedding model output. For multimodalembedding@001, use 1408. For gemini-embedding-exp-03-07 with reduced dimensions, use whatever you configure (e.g., 1024).
- Choose Cosine as the distance metric — it works well for semantic similarity.
- Select the Serverless tier and a cloud region (e.g., us-east-1 on AWS).
- Copy your API key from the API Keys tab.
Install the Pinecone SDK
pip install pinecone-client google-generativeai vertexai python-dotenv pillow
For Vertex AI specifically:
pip install google-cloud-aiplatform
Initialize the Pinecone Client
import os
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv
load_dotenv()
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index_name = "multimodal-index"
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=1408, # Match your embedding model
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index(index_name)
Generating Embeddings with the Gemini API
There are two paths here depending on your use case.
Path 1: Text Embeddings via Gemini Developer API
This uses the google-generativeai SDK with your AI Studio API key.
import google.generativeai as genai
import os
genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
def embed_text(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
"""
task_type options:
- RETRIEVAL_DOCUMENT: for indexing
- RETRIEVAL_QUERY: for querying
- SEMANTIC_SIMILARITY: for similarity tasks
- CLASSIFICATION: for classification
- CLUSTERING: for clustering
"""
result = genai.embed_content(
model="models/gemini-embedding-exp-03-07",
content=text,
task_type=task_type,
output_dimensionality=1408 # Match Pinecone index dimension
)
return result["embedding"]
# Test it
embedding = embed_text("A cat sitting on a red chair")
print(f"Vector length: {len(embedding)}") # Should print 1408
The task_type parameter matters. Use RETRIEVAL_DOCUMENT when indexing content and RETRIEVAL_QUERY when embedding search queries. This distinction improves retrieval quality significantly.
Path 2: Multimodal Embeddings via Vertex AI
For true multimodal embedding (text + images + video in the same space), use Vertex AI’s multimodalembedding@001 model.
from vertexai.vision_models import MultiModalEmbeddingModel, Image, Video
import vertexai
vertexai.init(project=os.environ.get("GCP_PROJECT_ID"), location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
def embed_image(image_path: str) -> list[float]:
image = Image.load_from_file(image_path)
embeddings = model.get_embeddings(
image=image,
dimension=1408
)
return embeddings.image_embedding
def embed_text_vertex(text: str) -> list[float]:
embeddings = model.get_embeddings(
contextual_text=text,
dimension=1408
)
return embeddings.text_embedding
def embed_video(video_path: str, video_segment_config=None) -> list[dict]:
video = Video.load_from_file(video_path)
embeddings = model.get_embeddings(
video=video,
video_segment_config=video_segment_config,
dimension=1408
)
return [
{
"start_offset_sec": segment.start_offset_sec,
"end_offset_sec": segment.end_offset_sec,
"embedding": segment.embedding
}
for segment in embeddings.video_embeddings
]
Note that multimodalembedding@001 requires a Google Cloud project with the Vertex AI API enabled. You’ll need to authenticate via gcloud auth application-default login or a service account.
Indexing Mixed Media into Pinecone
Now the core of it: getting text, images, videos, audio, and PDFs into Pinecone as vectors with rich metadata. One caveat before the examples: the text helpers below use embed_text (the Gemini Developer API path) for illustration. If your index also holds image or video vectors from multimodalembedding@001, embed your text with embed_text_vertex instead, so that every vector in the index shares the same space.
Indexing Text Documents
import uuid
def index_text_document(text: str, metadata: dict) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_text(text, task_type="RETRIEVAL_DOCUMENT")
index.upsert(vectors=[{
"id": vector_id,
"values": embedding,
"metadata": {
"type": "text",
"content": text[:1000], # Store a preview, not the full text
**metadata
}
}])
return vector_id
# Example usage
doc_id = index_text_document(
text="Pinecone is a vector database optimized for machine learning applications.",
metadata={
"source": "pinecone_docs",
"title": "Pinecone Overview",
"url": "https://docs.pinecone.io"
}
)
For large documents, chunk the text first. A good rule of thumb is 512–1024 tokens per chunk with 10–20% overlap between chunks.
Chunking and Indexing PDFs
PDFs need preprocessing. Use PyMuPDF or pypdf to extract text, then chunk and embed.
import fitz  # PyMuPDF
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
words = text.split()
chunks = []
i = 0
while i < len(words):
chunk = " ".join(words[i:i + chunk_size])
chunks.append(chunk)
i += chunk_size - overlap
return chunks
def index_pdf(pdf_path: str, metadata: dict) -> list[str]:
doc = fitz.open(pdf_path)
full_text = ""
for page_num in range(len(doc)):
page = doc.load_page(page_num)
full_text += page.get_text()
chunks = chunk_text(full_text)
vector_ids = []
vectors_to_upsert = []
for i, chunk in enumerate(chunks):
embedding = embed_text(chunk, task_type="RETRIEVAL_DOCUMENT")
vector_id = f"{metadata.get('doc_id', str(uuid.uuid4()))}_chunk_{i}"
vectors_to_upsert.append({
"id": vector_id,
"values": embedding,
"metadata": {
"type": "pdf",
"chunk_index": i,
"total_chunks": len(chunks),
"content": chunk[:500],
**metadata
}
})
vector_ids.append(vector_id)
# Batch upsert for efficiency
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
batch = vectors_to_upsert[i:i + batch_size]
index.upsert(vectors=batch)
return vector_ids
# Example
ids = index_pdf(
pdf_path="./research_paper.pdf",
metadata={
"doc_id": "paper_001",
"title": "Vector Search at Scale",
"author": "Research Team",
"year": 2024
}
)
print(f"Indexed {len(ids)} chunks from PDF")
Indexing Images
For images, use the Vertex AI multimodal model. Store the image URL or path in metadata so you can retrieve the actual image after a search.
def index_image(image_path: str, metadata: dict) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_image(image_path)
index.upsert(vectors=[{
"id": vector_id,
"values": embedding,
"metadata": {
"type": "image",
"file_path": image_path,
**metadata
}
}])
return vector_id
# Batch index a directory of images
import os
from pathlib import Path
def index_image_directory(directory: str, base_metadata: dict = {}) -> list[str]:
image_extensions = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".bmp"}
image_files = [
f for f in Path(directory).iterdir()
if f.suffix.lower() in image_extensions
]
vector_ids = []
for img_file in image_files:
try:
vec_id = index_image(
image_path=str(img_file),
metadata={
"filename": img_file.name,
"directory": directory,
**base_metadata
}
)
vector_ids.append(vec_id)
print(f"Indexed: {img_file.name}")
except Exception as e:
print(f"Failed to index {img_file.name}: {e}")
return vector_ids
One thing to be aware of: images with text in them benefit from a two-pass approach — embed the image itself, and also extract and embed any visible text using OCR (Tesseract or Gemini Vision). Store both vectors and link them via a shared document_id in metadata.
Indexing Videos
Video indexing is more nuanced because a single video contains multiple semantic moments. multimodalembedding@001 handles this by generating segment-level embeddings — each covering a configurable window of time.
from vertexai.vision_models import VideoSegmentConfig
def index_video(video_path: str, metadata: dict, interval_sec: int = 10) -> list[str]:
"""
Indexes a video by splitting into segments and embedding each.
Default: one embedding per 10-second segment.
"""
segment_config = VideoSegmentConfig(
start_offset_sec=0,
end_offset_sec=None, # Embed entire video
interval_sec=interval_sec
)
segments = embed_video(video_path, video_segment_config=segment_config)
vector_ids = []
vectors_to_upsert = []
for segment in segments:
vector_id = f"{metadata.get('video_id', str(uuid.uuid4()))}_{segment['start_offset_sec']}_{segment['end_offset_sec']}"
vectors_to_upsert.append({
"id": vector_id,
"values": segment["embedding"],
"metadata": {
"type": "video",
"start_sec": segment["start_offset_sec"],
"end_sec": segment["end_offset_sec"],
"file_path": video_path,
**metadata
}
})
vector_ids.append(vector_id)
index.upsert(vectors=vectors_to_upsert)
return vector_ids
With 10-second segments, a 5-minute video produces ~30 vectors. Each points back to the same file with a time range in metadata — so when you retrieve it, you know exactly which part of the video matched the query.
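One convenient way to surface those time ranges is to build a deep link per match. A small sketch using the W3C Media Fragments syntax (#t=start,end), which browsers honor for direct video file URLs; the helper name is illustrative:

```python
def segment_url(base_url: str, start_sec: float, end_sec: float) -> str:
    """Build a link that jumps straight to the matched segment using
    the W3C Media Fragments syntax (#t=start,end)."""
    return f"{base_url}#t={start_sec:g},{end_sec:g}"

# Example: turning a Pinecone match's metadata into a clickable segment link
match_metadata = {"file_path": "https://cdn.example.com/demo.mp4",
                  "start_sec": 30, "end_sec": 40}
link = segment_url(match_metadata["file_path"],
                   match_metadata["start_sec"],
                   match_metadata["end_sec"])
```

If you serve videos through a custom player instead, the same metadata can seed a seek-to-timestamp call.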
Indexing Audio
The multimodalembedding@001 model doesn’t directly embed audio. The standard approach is to transcribe audio first (using Gemini’s audio understanding or Whisper), then embed the transcript. You can optionally chunk the transcript with timestamps to preserve fine-grained retrieval.
import google.generativeai as genai
def transcribe_audio(audio_path: str) -> str:
"""Transcribe audio using Gemini's multimodal capabilities."""
model = genai.GenerativeModel("gemini-2.0-flash")
with open(audio_path, "rb") as audio_file:
audio_data = audio_file.read()
# Detect MIME type from extension
ext = Path(audio_path).suffix.lower()
mime_map = {
".mp3": "audio/mp3",
".wav": "audio/wav",
".m4a": "audio/mp4",
".ogg": "audio/ogg",
".flac": "audio/flac"
}
mime_type = mime_map.get(ext, "audio/mp3")
response = model.generate_content([
{"mime_type": mime_type, "data": audio_data},
"Transcribe this audio accurately. Include speaker labels if multiple speakers are present."
])
return response.text
def index_audio(audio_path: str, metadata: dict) -> list[str]:
transcript = transcribe_audio(audio_path)
chunks = chunk_text(transcript, chunk_size=500, overlap=50)
vectors_to_upsert = []
vector_ids = []
audio_id = metadata.get("audio_id", str(uuid.uuid4()))
for i, chunk in enumerate(chunks):
embedding = embed_text(chunk, task_type="RETRIEVAL_DOCUMENT")
vector_id = f"{audio_id}_transcript_chunk_{i}"
vectors_to_upsert.append({
"id": vector_id,
"values": embedding,
"metadata": {
"type": "audio",
"source": "transcript",
"chunk_index": i,
"content": chunk[:500],
"file_path": audio_path,
**metadata
}
})
vector_ids.append(vector_id)
index.upsert(vectors=vectors_to_upsert)
return vector_ids
Querying Across Modalities
Once everything is in Pinecone, you can search across all modalities with a single query. The key insight: if your text and image embeddings share the same vector space (achieved via multimodalembedding@001), a text query can retrieve images, and an image query can retrieve text.
Text Query Across All Content Types
def search(
query_text: str,
top_k: int = 10,
filter_type: str = None
) -> list[dict]:
"""
Search across all indexed content using a text query.
Optionally filter by content type: 'text', 'image', 'video', 'audio', 'pdf'
"""
query_embedding = embed_text_vertex(query_text)
filter_params = None
if filter_type:
filter_params = {"type": {"$eq": filter_type}}
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True,
filter=filter_params
)
return results["matches"]
# Search everything
results = search("golden retriever playing fetch on a beach", top_k=5)
for match in results:
print(f"Score: {match['score']:.4f} | Type: {match['metadata']['type']} | ID: {match['id']}")
if "content" in match["metadata"]:
print(f" Content: {match['metadata']['content'][:100]}...")
if "file_path" in match["metadata"]:
print(f" File: {match['metadata']['file_path']}")
Image Query to Find Similar Images or Related Text
def search_by_image(
image_path: str,
top_k: int = 10,
filter_type: str = None
) -> list[dict]:
"""Use an image to find semantically similar content."""
image_embedding = embed_image(image_path)
filter_params = None
if filter_type:
filter_params = {"type": {"$eq": filter_type}}
results = index.query(
vector=image_embedding,
top_k=top_k,
include_metadata=True,
filter=filter_params
)
return results["matches"]
# Find text documents related to an image
text_matches = search_by_image("./my_photo.jpg", filter_type="text")
Hybrid Search with Metadata Filtering
Pinecone supports metadata filtering alongside vector search. This is useful for restricting search to a specific date range, source, or content category.
def search_with_filters(
query_text: str,
top_k: int = 10,
filters: dict = None
) -> list[dict]:
"""
Advanced search with metadata filters.
Example filters:
- {"type": {"$in": ["image", "video"]}}
- {"year": {"$gte": 2023}}
- {"source": {"$eq": "product_catalog"}}
"""
query_embedding = embed_text_vertex(query_text)
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True,
filter=filters
)
return results["matches"]
# Find only product images from 2024
product_results = search_with_filters(
query_text="running shoes",
top_k=10,
filters={
"type": {"$eq": "image"},
"category": {"$eq": "footwear"},
"year": {"$gte": 2024}
}
)
Building a Simple Retrieval API
With the indexing and querying functions in place, wrapping everything in a simple FastAPI service makes it usable by other applications.
import os
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import List, Optional

from fastapi import FastAPI, File, HTTPException, UploadFile
from pydantic import BaseModel
app = FastAPI(title="Multimodal Search API")
class TextSearchRequest(BaseModel):
query: str
top_k: int = 10
filter_type: Optional[str] = None
filters: Optional[dict] = None
class SearchResult(BaseModel):
id: str
score: float
type: str
metadata: dict
@app.post("/search/text", response_model=List[SearchResult])
async def text_search(request: TextSearchRequest):
try:
matches = search_with_filters(
query_text=request.query,
top_k=request.top_k,
filters=request.filters or (
{"type": {"$eq": request.filter_type}}
if request.filter_type else None
)
)
return [
SearchResult(
id=m["id"],
score=m["score"],
type=m["metadata"].get("type", "unknown"),
metadata=m["metadata"]
)
for m in matches
]
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.post("/search/image", response_model=List[SearchResult])
async def image_search(
file: UploadFile = File(...),
top_k: int = 10,
filter_type: Optional[str] = None
):
with tempfile.NamedTemporaryFile(
delete=False,
suffix=Path(file.filename).suffix
) as tmp:
shutil.copyfileobj(file.file, tmp)
tmp_path = tmp.name
try:
matches = search_by_image(tmp_path, top_k=top_k, filter_type=filter_type)
return [
SearchResult(
id=m["id"],
score=m["score"],
type=m["metadata"].get("type", "unknown"),
metadata=m["metadata"]
)
for m in matches
]
finally:
os.unlink(tmp_path)
@app.post("/index/text")
async def index_text(text: str, metadata: dict = {}):
vector_id = index_text_document(text=text, metadata=metadata)
return {"id": vector_id, "status": "indexed"}
@app.post("/index/pdf")
async def index_pdf_endpoint(
file: UploadFile = File(...),
doc_id: Optional[str] = None
):
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
shutil.copyfileobj(file.file, tmp)
tmp_path = tmp.name
try:
ids = index_pdf(
pdf_path=tmp_path,
metadata={
"doc_id": doc_id or str(uuid.uuid4()),
"filename": file.filename
}
)
return {"chunk_count": len(ids), "status": "indexed"}
finally:
os.unlink(tmp_path)
Run it with:
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Practical Architecture Patterns
Namespace Separation in Pinecone
Pinecone supports namespaces — logical partitions within an index. This is useful for multi-tenant applications or separating content by type while keeping the same index.
# Index with namespaces
def index_with_namespace(text: str, metadata: dict, namespace: str) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_text(text)
index.upsert(
vectors=[{"id": vector_id, "values": embedding, "metadata": metadata}],
namespace=namespace
)
return vector_id
# Search within a specific namespace
def search_namespace(query: str, namespace: str, top_k: int = 10):
query_embedding = embed_text_vertex(query)
return index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True,
namespace=namespace
)["matches"]
# Example: separate namespaces per customer
index_with_namespace(text="...", metadata={}, namespace="customer_acme")
index_with_namespace(text="...", metadata={}, namespace="customer_globex")
Handling Embedding Drift and Updates
When your embedding model changes, old vectors in Pinecone become stale — they’re in a different vector space than new queries. The cleanest way to handle this is to track the model version in metadata and re-embed when needed.
from datetime import datetime

MODEL_VERSION = "gemini-embedding-exp-03-07-v1"
def index_with_version(text: str, metadata: dict) -> str:
vector_id = str(uuid.uuid4())
embedding = embed_text(text)
index.upsert(vectors=[{
"id": vector_id,
"values": embedding,
"metadata": {
**metadata,
"embedding_model": MODEL_VERSION,
"indexed_at": str(datetime.now().isoformat())
}
}])
return vector_id
When you switch models, query for vectors with the old model version and re-embed them. Pinecone’s upsert will overwrite existing IDs, so no cleanup is needed.
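The version check itself is simple once you have metadata in hand (for example, via index.fetch on ids you enumerate). A sketch of the filtering step, with illustrative names:

```python
def stale_vector_ids(fetched_metadata: dict[str, dict], current_version: str) -> list[str]:
    """Given {vector_id: metadata} (e.g., assembled from index.fetch),
    return the ids whose embedding_model tag is missing or outdated."""
    return [
        vid for vid, meta in fetched_metadata.items()
        if meta.get("embedding_model") != current_version
    ]
```

Feed the returned ids through your re-embedding pipeline and upsert them under the same ids with the new MODEL_VERSION tag.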
Two-Stage Retrieval: Dense + Re-ranking
For production systems, a two-stage pipeline improves precision:
- Stage 1 — Vector retrieval: Fetch the top 50 candidates from Pinecone using dense vector similarity.
- Stage 2 — Re-ranking: Re-score candidates using a cross-encoder model (e.g., Cohere Rerank or a custom Gemini re-ranking prompt).
import cohere
co = cohere.Client(os.environ.get("COHERE_API_KEY"))
def search_and_rerank(query: str, top_k: int = 10, candidate_multiplier: int = 5):
# Stage 1: get more candidates than needed
candidates = search(query, top_k=top_k * candidate_multiplier)
    # Prepare documents for reranking, remembering which candidate each doc
    # came from (candidates without text content are skipped, so indexes into
    # the reranked results don't line up with the original candidates list)
    docs = []
    doc_candidates = []
    for c in candidates:
        content = c["metadata"].get("content", "")
        if content:
            docs.append(content)
            doc_candidates.append(c)
    if not docs:
        return candidates[:top_k]
    # Stage 2: rerank
    reranked = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v3.0"
    )
    # Map reranked results back to the original candidates
    results = []
    for result in reranked.results:
        original = doc_candidates[result.index]
        original["rerank_score"] = result.relevance_score
        results.append(original)
    return results
Common Mistakes and How to Avoid Them
Mixing Embedding Models in the Same Index
This is the most common mistake, and it fails in two ways. Pinecone rejects upserts whose dimension doesn't match the index, so pushing text-embedding-004 vectors (768 dims) into a 1408-dim index errors out immediately. The subtler failure is mixing models at the same dimension (say, gemini-embedding-exp-03-07 configured to 768): the upserts succeed, but retrieval quality collapses because the vectors come from different embedding spaces. Every vector in a Pinecone index must have the same dimensionality, and they should all come from the same model.
Fix: Decide on one model before you start. If you need to change models later, create a new index, re-embed everything, and migrate.
Not Using Task Types for Text Embeddings
Gemini’s text embedding models differentiate between documents (things you’re indexing) and queries (things you’re searching with). Using RETRIEVAL_DOCUMENT for indexing and RETRIEVAL_QUERY for search queries meaningfully improves recall.
Fix: Always specify the task_type parameter.
Embedding Too Much Text Per Chunk
Larger chunks produce more averaged-out, less specific embeddings. A 5,000-word document embedded as a single vector loses the nuance of individual sections. Conversely, chunks that are too small lose context.
Fix: Use 512–1024 token chunks with 10–20% overlap for most document types.
Storing Raw Binary Data in Metadata
Pinecone metadata values must be strings, numbers, booleans, or arrays of strings. You can’t store binary data, base64-encoded images, or large blobs.
Fix: Store file paths, URLs, or object storage keys in metadata. Retrieve the actual media from your storage layer (S3, GCS, etc.) after a search.
Ignoring Rate Limits
The Gemini API has per-minute rate limits. When batch-indexing thousands of documents, you’ll hit them.
Fix: Add rate limiting and exponential backoff.
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=4, max=60)
)
def embed_with_retry(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> list[float]:
return embed_text(text, task_type=task_type)
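Backoff reacts to failures after they happen; pacing requests proactively keeps you under the quota in the first place. A minimal client-side pacer, sketched with an injectable clock so it can be tested; the 60-per-minute figure is an example, not a documented quota:

```python
import time

class RateLimiter:
    """Naive fixed-rate pacer: spaces successive calls at least
    min_interval seconds apart. Single-threaded use only."""

    def __init__(self, per_minute: int = 60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self) -> float:
        """Sleep just long enough to honor the rate; return the delay applied."""
        now = self.clock()
        delay = 0.0
        if self.last_call is not None:
            delay = max(0.0, self.min_interval - (now - self.last_call))
            if delay:
                self.sleep(delay)
        self.last_call = now + delay
        return delay
```

Call limiter.wait() before each embed request in a batch loop, alongside the retry decorator above, and bursts smooth out into a steady request stream.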
Skipping Metadata Planning
Metadata is what makes your vector database queryable beyond pure similarity. Deciding upfront what metadata to store on each vector type (content type, source, date, category, IDs for linking) saves major refactoring later.
Recommended metadata fields by type:
| Content Type | Required | Recommended |
|---|---|---|
| Text | type, content | source, title, date, author |
| Image | type, file_path | filename, category, dimensions |
| Video | type, file_path, start_sec, end_sec | duration, title, scene_description |
| Audio | type, file_path, chunk_index | speaker, duration, language |
| PDF | type, doc_id, chunk_index | title, page_range, author |
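The required columns from this table can be enforced with a small pre-upsert validator; the function name and field sets below mirror the table and are otherwise illustrative:

```python
# Required metadata fields per content type, taken from the table above.
REQUIRED_FIELDS = {
    "text": {"type", "content"},
    "image": {"type", "file_path"},
    "video": {"type", "file_path", "start_sec", "end_sec"},
    "audio": {"type", "file_path", "chunk_index"},
    "pdf": {"type", "doc_id", "chunk_index"},
}

def missing_metadata(metadata: dict) -> set[str]:
    """Return the required fields missing for this record's content type."""
    required = REQUIRED_FIELDS.get(metadata.get("type"), {"type"})
    return required - set(metadata)
```

Running this check in your indexing functions and refusing to upsert when it returns a non-empty set prevents the schema drift that makes filtered queries unreliable later.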
Where MindStudio Fits in This Stack
Once you’ve built a multimodal vector database, the natural next step is exposing it to AI agents that can actually use it. That’s where MindStudio becomes relevant.
MindStudio is a no-code platform for building AI agents and automated workflows. It connects to 200+ AI models and 1,000+ business tools out of the box. More importantly for this context, it supports webhook and API endpoint agents — which means you can point a MindStudio workflow directly at the FastAPI retrieval service you built above.
Here’s a concrete use case: say you’ve indexed a company’s entire asset library — product images, spec sheet PDFs, training videos, and support audio recordings — into Pinecone using Gemini multimodal embeddings. You then want a support agent that can answer customer questions by pulling from all of these.
In MindStudio, you’d wire up:
- A trigger (incoming support ticket or chat message)
- A call to your Pinecone search API to retrieve relevant text, images, and video segments
- A Gemini or GPT-4o step that uses the retrieved content to generate a response
- An output step that sends the response back
The whole thing takes under an hour to build, no custom server infrastructure required. The agent can call your vector search endpoint as a webhook, format the retrieved results as context, and pass it to any model for synthesis.
If you want to skip managing embedding infrastructure entirely, MindStudio’s built-in workflow steps can also handle chunking, embedding, and retrieval using managed integrations — useful when you’re prototyping and don’t want to maintain a separate service.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is Gemini Embedding 2?
Gemini Embedding 2 refers to Google’s second generation of embedding models under the Gemini family, including gemini-embedding-exp-03-07 for text (which ranks highly on the MTEB benchmark) and multimodalembedding@001 on Vertex AI, which supports text, image, and video in a shared vector space. The key advancement over earlier models is cross-modal retrieval: text queries can retrieve images, and vice versa, without any paired training data for your specific content.
Can Pinecone store multimodal data directly?
No — Pinecone stores vectors (arrays of floats) and structured metadata. It doesn’t store the raw images, audio, or video files themselves. You embed your media into vectors using a model like Gemini’s multimodal embedding API, store those vectors in Pinecone, and keep the actual media files in a separate storage system (like Google Cloud Storage, AWS S3, or a CDN). Metadata fields like file_path or url link vectors back to the original media.
What’s the difference between multimodalembedding@001 and text-embedding-004?
text-embedding-004 is a text-only model that produces 768-dimensional vectors and is available via the Gemini Developer API. It’s fast, stable, and great for pure text retrieval tasks. multimodalembedding@001 is available only on Vertex AI and embeds text, images, and video into the same 1408-dimensional space. For a truly multimodal index where a text query can retrieve images (and vice versa), you need multimodalembedding@001. For text-only retrieval, text-embedding-004 or the experimental gemini-embedding-exp-03-07 are more cost-effective.
How do I handle large-scale indexing without hitting API rate limits?
Use batching, concurrency controls, and retry logic with exponential backoff. The tenacity library makes retry logic straightforward in Python. For very large datasets (millions of records), consider using a queue system (like Google Cloud Tasks, Redis Queue, or AWS SQS) to spread the embedding workload across time and avoid bursting. Pinecone itself has upsert rate limits too — batch upserts in groups of 100 vectors, not single vectors.
Can I search with an image query in Pinecone?
Yes, as long as your index contains vectors generated by the same multimodal embedding model. Embed the query image using multimodalembedding@001, then run a standard Pinecone query with that vector. You’ll get back all matching content — whether it was originally indexed as an image, text, or video — that is semantically similar in that shared embedding space. This is what makes true cross-modal retrieval possible.
How do I update or delete indexed content?
Pinecone supports delete by vector ID and upsert to overwrite existing vectors. For updates, use the same vector ID as the original and call upsert with the new embedding. For deletions, call index.delete(ids=["vector_id_1", "vector_id_2"]). If you’re using namespaces, include the namespace in both operations. For bulk deletions (e.g., removing all content from a specific source), use metadata filtering with delete_by_filter if your Pinecone plan supports it.
Key Takeaways
- Gemini’s multimodal embedding models map text, images, and video into a shared vector space — enabling cross-modal retrieval without paired training data.
- Pinecone handles the storage and retrieval layer. Plan your index dimensions and metadata schema before you start indexing.
- Content type matters for preprocessing: PDFs and audio require text extraction/transcription; video requires segment-level splitting; images can be embedded directly.
- Always match task types when embedding text — use
RETRIEVAL_DOCUMENTfor indexing andRETRIEVAL_QUERYfor search queries. - Production readiness requires retry logic, rate limiting, metadata versioning, and a two-stage retrieval strategy for best precision.
- Agents built on top of this stack — using tools like MindStudio — can expose your multimodal search capability to business workflows without additional backend work.
Building this pipeline takes real effort to get right, but the payoff is significant: a single search interface that understands what your content means, not just what it says. That’s a fundamentally different capability than keyword search or file-type-specific retrieval — and it opens the door to AI agents that can genuinely reason over your entire knowledge base.