
How to Build an Image-to-Image Search System for Business Using Gemini Embedding 2

Learn how to build an image similarity search system for business use cases like roofing, real estate, or e-commerce using Gemini Embedding 2.

MindStudio Team

Why Image Similarity Search Is Becoming a Business Essential

Finding visually similar images used to require expert human eyes or expensive custom software. Now it can be done in seconds with an embedding model and a vector database. And with Google’s Gemini Embedding 2 — a multimodal model that produces embeddings for both text and images in the same vector space — businesses can build image-to-image search systems that actually work at scale.

This article walks through exactly how to build one. You’ll learn what Gemini Embedding 2 is and why it matters, how image embeddings enable similarity search, how to set up the technical stack, and how to apply the system to real business problems like roofing inspections, real estate listings, e-commerce product matching, and more.

Whether you’re a developer building this from scratch or a product manager evaluating what’s possible, this guide covers the full picture.


What Gemini Embedding 2 Actually Is

Google’s Gemini Embedding 2 (also referred to in the API as gemini-embedding-exp-03-07 or its stable variants) is a multimodal embedding model. That means it can take both text and images as input and convert them into dense numerical vectors — embeddings — in a shared semantic space.

This is a significant shift from older image embedding approaches that required separate models for text and images, and then some bridging layer to make them comparable.

How It Differs from Earlier Embedding Models

Previous approaches often used:

  • CLIP-style models (like OpenAI’s CLIP or Google’s own ALIGN) — contrastive learning models trained to align image and text embeddings
  • Vision transformers (ViTs) — great for images but not natively multimodal
  • Separate pipelines — text embeddings from one model, image embeddings from another, with cross-modal retrieval requiring special handling

Gemini Embedding 2 handles all of this in a single model. You send it an image, it returns a vector. You send it text, it returns a vector in the same space. You can then compare those vectors directly using cosine similarity or dot product.
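The vector comparison at the core of this is simple enough to sketch in plain Python (a minimal illustration only; in practice the vector database computes this with optimized, vectorized code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity ignores vector magnitude, which is why embeddings of different "strengths" remain directly comparable.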

Key Technical Specs

  • Output dimensionality: Gemini Embedding 2 supports flexible output dimensions. Developers can specify vector size depending on their storage and precision requirements — from 768 dimensions up to 3072 dimensions.
  • Input types: Accepts images (JPEG, PNG, WebP, etc.) and text strings.
  • Task types: The API accepts task type hints like RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, which help the model calibrate embeddings for the specific use case.
  • API access: Available through the Google AI Gemini API and Vertex AI.

For pure image-to-image search, you don’t need the cross-modal capabilities at all — but having them in the same model means you can easily extend your system later. Start with image-to-image matching, then add text-to-image search, or mixed queries, without rebuilding your pipeline.


The Core Concept: How Image Similarity Search Works

Before building anything, it’s worth understanding the mechanics. Image similarity search works in two phases: indexing and querying.

Phase 1: Indexing (Building the Database)

  1. You have a collection of reference images (a product catalog, a library of inspected roofs, a real estate photo archive, etc.)
  2. Each image is passed through the embedding model — in this case, Gemini Embedding 2
  3. The model returns a vector for each image
  4. These vectors are stored in a vector database alongside metadata (image ID, URL, category, price, etc.)

This is a one-time operation per image. When new images are added, you embed them and add them to the index.

Phase 2: Querying (Finding Matches)

  1. A user submits a query image
  2. You pass the query image through the same embedding model
  3. You get a vector for the query image
  4. You search the vector database for the nearest neighbors — vectors that are mathematically close to the query vector
  5. The database returns the top K most similar images

The “closeness” between vectors corresponds to visual and semantic similarity. Two images of similar-looking roof damage will have vectors that are near each other in the embedding space. Two images of completely different things — a roof and a sofa — will have vectors far apart.

What “Similarity” Actually Means

This is an important nuance. Embedding-based similarity isn’t pixel-level matching (that would be perceptual hashing). It’s semantic similarity. The model has learned representations of what things are, not just what they look like at a pixel level.

This means:

  • A photo of a cracked tile taken from the left and one taken from the right will still be considered similar
  • An image of a red dress on a hanger and a red dress on a mannequin will be considered similar
  • Two structurally different but visually similar roof shapes will match

This is what makes it useful for business. Real-world photos are messy — different angles, lighting conditions, cameras. Semantic embeddings handle this naturally.


Setting Up the Technical Stack

Here’s the component list for a functional image-to-image search system:

  • Embedding model: Gemini Embedding 2 (via Google AI API or Vertex AI)
  • Vector database: Pinecone, Weaviate, Qdrant, Chroma, or pgvector (PostgreSQL extension)
  • Object storage: Google Cloud Storage, AWS S3, or similar for storing images
  • Backend API: Python (FastAPI or Flask) or Node.js
  • Frontend (optional): A simple web UI for uploading query images and displaying results

For small-scale pilots (< 100,000 images), Chroma or Qdrant are great free options you can run locally. For production at scale, Pinecone or Weaviate managed cloud instances are more practical.

Step 1: Get Access to the Gemini API

First, you need a Google AI API key. Go to Google AI Studio and create an API key. For production or high-volume use, Vertex AI is the preferred route with enterprise SLA.

Install the SDK:

pip install google-generativeai

Or for Vertex AI:

pip install google-cloud-aiplatform

Step 2: Generate Embeddings for Your Images

Here’s a basic Python function to embed an image using Gemini Embedding 2 via the Google AI SDK:

import google.generativeai as genai
import base64
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

def embed_image(image_path: str) -> list[float]:
    """
    Generate a Gemini Embedding 2 vector for an image.
    Returns a list of floats (the embedding vector).
    """
    image_data = Path(image_path).read_bytes()
    image_b64 = base64.b64encode(image_data).decode("utf-8")
    
    # Detect MIME type from extension
    ext = Path(image_path).suffix.lower()
    mime_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", 
                ".png": "image/png", ".webp": "image/webp"}
    mime_type = mime_map.get(ext, "image/jpeg")
    
    result = genai.embed_content(
        model="models/gemini-embedding-exp-03-07",
        content={
            "parts": [
                {
                    "inline_data": {
                        "mime_type": mime_type,
                        "data": image_b64
                    }
                }
            ]
        },
        task_type="RETRIEVAL_DOCUMENT",
        output_dimensionality=1024
    )
    
    return result["embedding"]

A few things to note:

  • task_type="RETRIEVAL_DOCUMENT" tells the model these are database items to be indexed
  • For query images, use task_type="RETRIEVAL_QUERY" instead
  • output_dimensionality=1024 is a reasonable balance between accuracy and storage cost

Step 3: Set Up a Vector Database

Using Qdrant locally as an example (free, runs in Docker):

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant

Then in Python:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="images",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

Step 4: Index Your Images

import os
from uuid import uuid4

def index_images(image_folder: str):
    """
    Embed all images in a folder and insert them into the vector database.
    """
    points = []
    
    for filename in os.listdir(image_folder):
        if not filename.lower().endswith((".jpg", ".jpeg", ".png", ".webp")):
            continue
        
        image_path = os.path.join(image_folder, filename)
        vector = embed_image(image_path)
        
        point = PointStruct(
            id=str(uuid4()),
            vector=vector,
            payload={
                "filename": filename,
                "path": image_path,
                # Add any metadata you want to return with results
                "category": "roofing",
                "date_added": "2025-01-01"
            }
        )
        points.append(point)
    
    client.upsert(collection_name="images", points=points)
    print(f"Indexed {len(points)} images.")

For large collections, batch your upserts (100–500 images per batch) to avoid memory issues and respect API rate limits.
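A minimal sketch of that batching, using a generic chunking helper (the commented upsert loop assumes the `client` and `points` objects from the indexing code above):

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Sketch: upsert points 200 at a time instead of all at once.
# for batch in batched(points, 200):
#     client.upsert(collection_name="images", points=batch)

print([len(b) for b in batched(list(range(450)), 200)])  # [200, 200, 50]
```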

Step 5: Search for Similar Images

def find_similar_images(query_image_path: str, top_k: int = 5):
    """
    Find the top K most visually similar images to the query.
    """
    # Use RETRIEVAL_QUERY task type for the query image
    query_vector = embed_image_for_query(query_image_path)
    
    results = client.search(
        collection_name="images",
        query_vector=query_vector,
        limit=top_k
    )
    
    return [
        {
            "score": hit.score,
            "filename": hit.payload["filename"],
            "path": hit.payload["path"],
            "metadata": hit.payload
        }
        for hit in results
    ]

def embed_image_for_query(image_path: str) -> list[float]:
    """Same as embed_image but with RETRIEVAL_QUERY task type."""
    image_data = Path(image_path).read_bytes()
    image_b64 = base64.b64encode(image_data).decode("utf-8")
    
    result = genai.embed_content(
        model="models/gemini-embedding-exp-03-07",
        content={
            "parts": [{"inline_data": {"mime_type": "image/jpeg", "data": image_b64}}]
        },
        task_type="RETRIEVAL_QUERY",
        output_dimensionality=1024
    )
    
    return result["embedding"]

The score returned by Qdrant for cosine similarity ranges from -1 to 1, with 1 meaning identical. For business use cases, you’ll typically consider scores above 0.85 as strong matches.
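One way to apply that cutoff is a thin post-filter over the result dicts returned by `find_similar_images` (the 0.85 threshold is a starting point to tune against your own data, not a universal constant):

```python
STRONG_MATCH_THRESHOLD = 0.85  # tune per use case and dataset

def filter_strong_matches(results: list[dict],
                          threshold: float = STRONG_MATCH_THRESHOLD) -> list[dict]:
    """Keep only hits whose cosine similarity score clears the threshold."""
    return [r for r in results if r["score"] >= threshold]

hits = [{"filename": "a.jpg", "score": 0.91},
        {"filename": "b.jpg", "score": 0.78}]
print(filter_strong_matches(hits))  # only a.jpg survives
```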

Step 6: Build a Simple API Endpoint

Wrap this in a FastAPI endpoint so your frontend or other services can call it:

from fastapi import FastAPI, UploadFile, File
import tempfile
import shutil

app = FastAPI()

@app.post("/search")
async def search_similar_images(file: UploadFile = File(...), top_k: int = 5):
    # Save uploaded file temporarily
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    
    # Search
    results = find_similar_images(tmp_path, top_k=top_k)
    
    return {"results": results}

Run with uvicorn main:app --reload and you have a working search endpoint.


Business Use Cases: Where This Actually Gets Valuable

The technical implementation is only part of the story. The real value comes from applying image similarity search to specific business problems. Here are four high-impact use cases with concrete implementation notes.

Roofing and Property Inspection

Roofing companies, insurance adjusters, and home inspection services often need to:

  • Match a damaged roof photo to known defect categories (hail damage, flashing failures, moss/algae, etc.)
  • Compare current inspection photos to historical photos of the same property
  • Quickly surface similar past damage cases to estimate repair costs

How you’d build this:

  1. Create a labeled reference library of roof defect images, organized by type and severity
  2. Embed the entire library and store vectors with metadata: defect type, severity score, typical repair cost range, material type
  3. When a field inspector uploads a photo, the system returns the top 5 most similar defects with their associated metadata

This effectively gives every inspector in the field access to institutional knowledge accumulated from thousands of past jobs.

Practical notes:

  • Field photos vary wildly in quality — but Gemini Embedding 2 handles poor lighting and angle variation well because the embeddings are semantic, not pixel-based
  • Filter results by metadata (e.g., only match against similar roof materials or geographic regions) to improve relevance

Real Estate and Property Management

Real estate platforms can use image similarity search for:

  • Finding properties with visually similar kitchens, bathrooms, or architectural styles
  • Automatically categorizing listing photos (kitchen, bedroom, exterior, etc.)
  • Alerting agents when a new listing appears that matches a buyer’s saved visual preferences

How you’d build this:

Index each listing’s photos with metadata including room type, property price range, location, and listing ID. When a user finds a property they like, offer a “Find similar homes” button — the system embeds their saved photos and returns listings with the most visually similar interiors or exteriors.

This is more useful than keyword-based searches for style preferences. A buyer looking for “mid-century modern kitchen” can’t always describe what they want in text — but they know it when they see it.

E-Commerce Product Matching

E-commerce is probably the highest-volume use case for image similarity search:

  • “Shop the look” features — let users upload a photo of a product they want and find similar items in your catalog
  • Duplicate product detection — identify near-identical product listings across a marketplace
  • Visual upsell recommendations — show complementary products that visually match what’s in a shopper’s cart

How you’d build this:

Index your entire product catalog with embeddings. For duplicate detection, set a high similarity threshold (e.g., 0.95+) and flag any pairs above that threshold for review. For “shop the look,” a lower threshold (0.75–0.85) surfaces a wider range of similar items.

The key metadata to store alongside embeddings:

  • Product ID and SKU
  • Category and subcategory
  • Price
  • Inventory status (so you don’t surface out-of-stock items)
  • Seller ID (for deduplication workflows)

Manufacturing Quality Control

Manufacturers can use image similarity search to flag defective parts by comparing production line photos against a library of known defects:

  • A camera on the production line captures an image of each unit
  • The image is embedded and compared against defect examples
  • High-similarity matches to defect patterns trigger an alert or automatic rejection

This is simpler to implement than a fully trained defect detection model, and it’s easier to update — just add new defect examples to the index without retraining anything.

Important caveat: For safety-critical applications, this approach works best as a first-pass filter combined with human review, not as a standalone decision system.


Handling Scale, Performance, and Cost

A system that works for 1,000 images may behave differently at 1,000,000. Here’s what to think about before scaling up.

Embedding Costs

The Gemini embedding API charges per image or per 1,000 characters of text. At the time of writing, pricing via the Google AI API for the experimental embedding model is in the range of $0.0001–$0.0004 per image, but check Google’s current pricing pages for exact rates, as these can change.

For a catalog of 500,000 product images, you’d pay somewhere in the range of $50–$200 to embed the entire catalog once. Re-embedding only happens when new images are added, so ongoing costs are incremental.

Vector Database Sizing

At 1024 dimensions with float32 precision:

  • Each vector takes 4KB of memory
  • 100,000 images = ~400MB RAM
  • 1,000,000 images = ~4GB RAM
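The arithmetic behind those figures (raw vector storage only; real deployments add index structures and payload metadata on top):

```python
def index_memory_bytes(num_vectors: int, dims: int = 1024,
                       bytes_per_float: int = 4) -> int:
    """Raw float32 vector storage, excluding index overhead and payloads."""
    return num_vectors * dims * bytes_per_float

print(index_memory_bytes(100_000) / 1e9)    # ~0.41 GB
print(index_memory_bytes(1_000_000) / 1e9)  # ~4.1 GB
```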

Most managed vector databases (Pinecone, Weaviate Cloud, Qdrant Cloud) handle this easily at low cost. For self-hosted options, plan your infrastructure accordingly.

Query Latency

A typical similarity search in a well-indexed vector database at 1M vectors returns results in 50–200ms. The embedding step itself (calling the Gemini API) adds 200–500ms depending on image size and network latency.

For user-facing applications, you’ll want total response time under 2 seconds. Consider:

  • Caching embeddings for recently queried images (if users tend to search the same images)
  • Pre-warming your vector database (keeping it in memory rather than disk-backed)
  • Running the embedding API call asynchronously alongside any UI loading

Handling Large Image Collections

When indexing millions of images, run your embedding jobs in parallel with rate limit awareness:

  • The Gemini API has rate limits (requests per minute and tokens per minute) — check your quota in Google AI Studio or Vertex AI
  • Use async Python (aiohttp or asyncio) with a semaphore to cap concurrent requests
  • Process images in batches, checkpoint progress, and build in retry logic for failed API calls

Extending the System: Text + Image Search and Filtering

One advantage of Gemini Embedding 2 being multimodal is that you can extend your search system to support text queries against the same image index.

Because text and images share the same vector space, you can embed a text query and search for visually relevant images:

def search_by_text(query_text: str, top_k: int = 5):
    result = genai.embed_content(
        model="models/gemini-embedding-exp-03-07",
        content=query_text,
        task_type="RETRIEVAL_QUERY",
        output_dimensionality=1024
    )
    
    query_vector = result["embedding"]
    
    results = client.search(
        collection_name="images",
        query_vector=query_vector,
        limit=top_k
    )
    
    return results

This lets a real estate buyer type “open-plan kitchen with wood countertops” and get back visually matching property photos from your index — no manual tagging required.

Combining Similarity Search with Metadata Filters

Most vector databases support filtering by metadata before or during the vector search. In Qdrant, you’d filter like this:

from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="images",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="roofing")),
            FieldCondition(key="severity", match=MatchValue(value="high"))
        ]
    ),
    limit=10
)

This is essential in business applications where you don’t want a product similarity search returning results from a completely different product category, or a roofing search returning HVAC inspection photos.

Re-ranking with Cross-Encoders

For high-precision applications, consider a two-stage approach:

  1. First pass: Fast approximate nearest neighbor search returning top 50–100 candidates
  2. Second pass: Re-rank with a cross-encoder model (e.g., a fine-tuned CLIP or sentence transformer) that scores each pair directly

This adds latency but significantly improves result quality. For most business applications, the first pass alone is sufficient. Only add re-ranking if you’re measuring relevance carefully and finding the first pass isn’t precise enough.
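The second stage can be sketched independently of the vector database: take the first-pass candidates, re-score each with a more expensive pairwise function, and keep the best few. Here `toy_rescore` is a hypothetical placeholder; in a real system it would invoke a cross-encoder on the query/candidate pair:

```python
def rerank(candidates: list[dict], rescore_fn, final_k: int = 5) -> list[dict]:
    """Second pass: re-score each first-pass candidate, keep the top final_k."""
    return sorted(candidates, key=rescore_fn, reverse=True)[:final_k]

def toy_rescore(hit: dict) -> float:
    """Placeholder scorer; swap in a real cross-encoder call here."""
    return hit["ann_score"]

first_pass = [{"id": i, "ann_score": s} for i, s in enumerate([0.7, 0.9, 0.8])]
print([h["id"] for h in rerank(first_pass, toy_rescore, final_k=2)])  # [1, 2]
```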


Building This Without Code Using MindStudio

The implementation above works well for developers. But many business teams — property managers, e-commerce operators, inspection companies — don’t have a dedicated engineering team to build and maintain this kind of pipeline.

MindStudio offers a practical alternative. It’s a no-code platform for building AI-powered workflows, and it includes access to Google’s Gemini models (including Gemini’s multimodal capabilities) directly in its visual builder — no API keys or infrastructure setup required.

Here’s how a non-technical team could build a functional image similarity workflow in MindStudio:

  1. Set up an AI agent that accepts an uploaded image as input
  2. Connect Gemini — select it from MindStudio’s 200+ available models — to analyze and process the image
  3. Chain a Google Cloud Vision or Gemini Vision step to extract features, classifications, or descriptions from the image
  4. Store and retrieve results using MindStudio’s integrations with Airtable, Google Sheets, or any database — creating a lightweight reference catalog without a dedicated vector database
  5. Expose the workflow as a web app or API endpoint so field inspectors, real estate agents, or e-commerce managers can use it directly from a browser

For teams that don’t need high-volume vector search (say, under a few thousand reference images), this no-code approach can get you 80% of the way there in an afternoon instead of days.

MindStudio also handles the operational overhead — rate limiting, retries, model versioning — so the team can focus on the business problem rather than infrastructure. You can try it free at mindstudio.ai.

For developers who do want to build the full pipeline but would rather automate the surrounding workflows (notifying a Slack channel when a defect is found, updating a CRM when a similar listing is matched, sending an email report of inspection results), MindStudio’s Agent Skills Plugin lets any external agent call MindStudio’s pre-built integrations as simple method calls. The core search logic stays in your own code; the communication and data-sync steps are handled for you.


Frequently Asked Questions

What is Gemini Embedding 2 and how does it work?

Gemini Embedding 2 is a multimodal embedding model from Google. It converts images and text into numerical vectors (embeddings) in a shared vector space. Images that are semantically similar end up with vectors that are mathematically close to each other. This enables similarity-based retrieval: you embed a query image, then find the closest vectors in a pre-built database. The model is available through the Google AI Gemini API and Vertex AI.

How accurate is image-to-image search using Gemini Embedding 2?

Accuracy depends on several factors: the quality and size of your reference image library, whether your images are semantically consistent (e.g., all product photos on white backgrounds vs. lifestyle shots), and the similarity threshold you set. In practice, well-tuned systems using Gemini Embedding 2 achieve retrieval accuracy comparable to fine-tuned CLIP models on many common tasks. For specialized domains — like rare defect types or niche product categories — supplementing with metadata filtering significantly improves relevance.

Which vector database should I use?

For development and small-scale projects (up to ~100K images), Qdrant or Chroma are free, easy to run locally, and well-documented. For production deployments:

  • Pinecone — fully managed, serverless option, easiest to scale
  • Weaviate — open-source with strong multimodal support and a managed cloud option
  • Qdrant Cloud — managed version of Qdrant, good performance and pricing
  • pgvector — if you’re already on PostgreSQL and want to avoid a separate service

Can I use this for real-time search, or is it only for batch processing?

Both. The indexing step (embedding your reference library) is typically a batch process you run once, then update incrementally as new images are added. The querying step (embedding a user’s query image and searching the index) can run in real time — total latency is typically 300–700ms, well within acceptable range for most user-facing applications.

How much does it cost to build and run this system?

The main costs are:

  • Gemini Embedding API: Roughly $0.0001–$0.0004 per image depending on model and tier
  • Vector database: Free for self-hosted Qdrant/Chroma; $50–$200/month for managed plans depending on collection size
  • Object storage: Typically $0.02/GB/month on Google Cloud Storage or AWS S3
  • Compute: Minimal if you’re using serverless hosting (Cloud Run, AWS Lambda)

A system handling 100,000 reference images and 10,000 queries per day would likely cost $50–$300/month total, depending on infrastructure choices.

What’s the difference between image embeddings and perceptual hashing?

Perceptual hashing (tools like pHash or ImageHash) detects near-identical images — copies, slight crops, minor compression differences. It’s great for deduplication but fails when images are semantically similar but not visually identical (e.g., two different photos of the same defect type).

Embedding-based search captures semantic similarity. It understands that a cracked tile in bright sunlight and a cracked tile in overcast conditions represent the same type of damage, even though the pixel-level similarity is low. For business applications where images come from varied real-world conditions, embeddings are far more useful than perceptual hashing.
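To make the contrast concrete, here is a toy average hash on tiny grayscale pixel grids (a deliberately simplified version of what libraries like ImageHash do; real perceptual hashes work on downscaled 8x8 images). Near-identical images get identical bits; a semantically similar but visually different image does not:

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Toy perceptual hash: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; low means near-identical images."""
    return bin(a ^ b).count("1")

original  = [[200, 200], [10, 10]]   # bright top, dark bottom
recompressed = [[190, 210], [20, 5]] # same image, slight pixel noise
different = [[10, 200], [200, 10]]   # different content entirely

print(hamming(average_hash(original), average_hash(recompressed)))  # 0
print(hamming(average_hash(original), average_hash(different)))     # 2
```

Embeddings invert this behavior: the two cracked-tile photos would land close together even though their hashes differ completely.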


Common Mistakes and How to Avoid Them

Using the Same Task Type for Indexing and Querying

This is one of the most common mistakes when setting up embedding-based search. Gemini Embedding 2 uses different task type hints for documents being indexed (RETRIEVAL_DOCUMENT) versus queries (RETRIEVAL_QUERY). Using the wrong task type — or using SEMANTIC_SIMILARITY for both — can degrade retrieval accuracy noticeably.

Always use RETRIEVAL_DOCUMENT when embedding your reference library and RETRIEVAL_QUERY when embedding user query images.

Not Normalizing Your Reference Library

If your reference library mixes very different image types — product photos with white backgrounds, lifestyle shots, thumbnails, high-resolution originals — your similarity scores will be noisy. Embeddings reflect whatever the model sees, including background, framing, and image quality.

For best results, standardize your reference images: consistent backgrounds where possible, consistent framing, similar resolution. For roofing, keep aerial shots separate from ground-level close-ups. For e-commerce, separate catalog images from user-generated content.

Choosing Too High a Dimensionality

Higher output dimensionality (e.g., 3072 vs. 1024) doesn’t always mean better results for your specific task. It means higher storage costs, slower indexing, and slower queries — with marginal or no accuracy improvement for many common use cases.

Start with 1024 dimensions. Only increase it if you benchmark a clear accuracy improvement for your specific data and use case.

Skipping Metadata Filtering

A pure similarity search with no filtering is rarely what you want in production. Without filters, a search for a damaged roof tile might return photos of damaged floor tiles, decorative stone, or bark texture — all semantically similar in the embedding space.

Store meaningful metadata with every embedding (category, subcategory, material type, date, source, etc.) and apply filters at query time. The combination of vector similarity and metadata filtering is what makes these systems genuinely useful in business contexts.

Not Monitoring Result Quality Over Time

Embedding models improve over time, but your index is static. If Google releases a new version of Gemini Embedding 2, embeddings generated by the old model aren’t directly comparable to those from the new model.

Keep track of which model version was used to generate each embedding. When you upgrade models, plan for a full re-indexing of your reference library. This is a known operational overhead of embedding-based systems.
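A lightweight way to do that is to stamp the model identifier into every point's payload at index time, then compare it at upgrade time (a sketch; the "newer model" string below is a hypothetical placeholder):

```python
EMBEDDING_MODEL = "models/gemini-embedding-exp-03-07"  # pin what you indexed with

def make_payload(filename: str, **metadata) -> dict:
    """Attach the embedding model version to every stored point so a
    future model upgrade can target stale vectors for re-indexing."""
    return {"filename": filename, "embedding_model": EMBEDDING_MODEL, **metadata}

def needs_reindex(payload: dict, current_model: str) -> bool:
    return payload.get("embedding_model") != current_model

p = make_payload("roof_017.jpg", category="roofing")
print(needs_reindex(p, EMBEDDING_MODEL))               # False
print(needs_reindex(p, "hypothetical-newer-model"))    # True
```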


Key Takeaways

  • Gemini Embedding 2 is a multimodal model that embeds both images and text in the same vector space, making it well-suited for building image-to-image search systems that can later be extended to support text queries.
  • The core pipeline is straightforward: embed your reference images, store vectors in a vector database with metadata, embed query images with the RETRIEVAL_QUERY task type, and find nearest neighbors.
  • Business use cases are concrete and high-value: roofing inspection, real estate matching, e-commerce product search, and manufacturing quality control are all strong fits.
  • Metadata filtering is essential: vector similarity alone produces noisy results in production; combining it with structured filters dramatically improves relevance.
  • Scaling is manageable: a system handling 1M images at 1024 dimensions costs a few hundred dollars per month in infrastructure and delivers query results in under a second.
  • No-code paths exist: platforms like MindStudio let non-technical teams build Gemini-powered image workflows without managing API keys, vector databases, or infrastructure.

If you’re evaluating whether to build this kind of system, the short answer is: the barrier has dropped considerably. The embedding model is available via API, the vector databases are mature and affordable, and the use cases are proven. The main investment is getting your reference library organized and your metadata structured — which is a data challenge, not an AI challenge.

For teams that want to build this without engineering resources, MindStudio is worth exploring as a starting point.