What Is Matryoshka Representation Learning? How Flexible Embedding Sizes Work
Matryoshka representation learning lets you get full or reduced-size embeddings from one model. Learn how it works and when to use smaller embeddings for speed.
The Problem With Fixed-Size Embeddings
Every time you generate a text embedding today, you get back a vector with a fixed number of dimensions — 1536, 3072, 768, or whatever the model was designed to produce. You use all of them or none of them. There’s no in-between.
That rigidity creates real problems in production. If you’re running a large-scale semantic search system with millions of documents, those full-size vectors consume significant storage and slow down similarity searches. The obvious solution — use a smaller embedding model — usually means sacrificing meaningful accuracy. You’re forced to pick a point on the accuracy/efficiency curve and stay there forever.
Matryoshka Representation Learning changes this. It’s a training approach that lets a single embedding model produce useful representations at many different sizes — from the full dimension count down to a small fraction of it. You can run a fast, coarse search with 64-dimensional embeddings, then re-rank with 512-dimensional ones, all from the same model. This article explains how it works, where it’s available, and when it’s worth using.
What Matryoshka Representation Learning Actually Is
The name comes from Russian nesting dolls. In a traditional Matryoshka set, each doll contains a smaller, complete doll inside it. The innermost doll is still a doll — fully formed, just smaller. The analogy maps directly to MRL embeddings: the first 64 dimensions of a 1024-dimensional MRL embedding are themselves a meaningful 64-dimensional embedding. The first 128 dimensions form a useful 128-dimensional embedding. And so on.
This is fundamentally different from what happens when you truncate a standard embedding. If you take a normal 1536-dimensional OpenAI embedding and chop off everything after dimension 256, you’ll get worse results than a model specifically trained to produce 256-dimensional embeddings. The model never learned to organize information that way. Important features might be scattered anywhere across the full 1536 dimensions.
With matryoshka representation learning, the model is explicitly trained to concentrate the most important information in the earliest dimensions. By the time you reach later dimensions, those are adding refinement and nuance — helpful at full size, but not critical for many tasks.
The Original Research
MRL was introduced in a 2022 paper — “Matryoshka Representation Learning” — from researchers at the University of Washington and Google Research. The paper was presented at NeurIPS 2022. The core insight wasn’t that smaller embeddings are useful (that was already known), but that you could train a single model to produce good embeddings at multiple scales simultaneously, without a meaningful penalty at full size.
The paper demonstrated that MRL models could match or exceed the performance of independently trained fixed-size models across a range of downstream tasks. That’s a significant finding: you’re not paying a tax for the flexibility. In many benchmarks, MRL embeddings at reduced dimensions outperform non-MRL models specifically trained at those same dimensions.
Why the Nesting Property Doesn’t Come for Free
A standard embedding model is trained to minimize a single loss computed on the full output vector. Every dimension is treated equally in that process. There’s no pressure for the model to organize information hierarchically — the model just needs to get the answer right at full size.
MRL models break this assumption during training. The model is forced to produce good representations not just at the full dimension count, but at multiple smaller sizes too. This restructures how the model uses its output dimensions. Early dimensions end up encoding the most broadly useful features. Later dimensions handle finer details that only matter when you need high precision.
Think of it like writing a summary with nested detail levels: you write a one-sentence summary first, then a paragraph, then a full document. Each level is complete on its own, and each builds on what came before.
How MRL Training Works
The technical mechanism behind matryoshka representation learning is a multi-scale loss function. During a standard training step for a contrastive or retrieval model, you’d compute a loss once, based on the full embedding. MRL training computes that same loss multiple times — once for each target dimension size.
The Multi-Scale Loss Function
If your model outputs 1024-dimensional embeddings, you might define matryoshka scales at [64, 128, 256, 512, 1024]. At each training step:
- Generate the full 1024-dimensional embedding.
- Truncate it to each scale: 512, 256, 128, 64 dimensions.
- Compute the contrastive (or other) loss at each scale.
- Average or weight these losses together.
- Backpropagate through the combined loss.
Each scale’s loss sends gradient signals back through the model. The gradient from the 64-dimensional loss only flows through the first 64 output neurons, while the gradient from the full 1024-dimensional loss flows through all of them. Over time, this creates competitive pressure on the first 64 dimensions to do a lot of heavy lifting.
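The steps above can be sketched in NumPy. This is a simplified illustration rather than the paper's training code: it assumes an InfoNCE-style contrastive loss where each query's positive document sits at the same batch index (sentence-transformers ships a production version of this idea as MatryoshkaLoss):

```python
import numpy as np

def info_nce(queries, docs, temperature=0.05):
    # Contrastive loss: each query's positive doc shares its batch index.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_loss(queries, docs, scales=(64, 128, 256, 512, 1024), weights=None):
    # One loss per truncation scale, combined into a single objective.
    if weights is None:
        weights = [1.0] * len(scales)
    losses = [info_nce(queries[:, :dim], docs[:, :dim]) for dim in scales]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 1024))
d = q + 0.1 * rng.normal(size=(8, 1024))  # noisy positives for the demo
print(matryoshka_loss(q, d))
```

In a real trainer this loss is computed on model outputs every step and backpropagated; the loss at each truncated scale only touches the dimensions it sees, which is what creates the nesting pressure described above.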
Weighted Matryoshka Loss
The standard approach weights each scale’s contribution equally, but there are variations. Some implementations use a weighted sum that emphasizes the full-size loss more heavily, since that typically corresponds to the model’s best performance target. Others weight smaller scales more heavily to force aggressive information compression.
The paper also describes a “relative” variant of MRL that adds a secondary training objective: ensuring that the ordering of nearest neighbors remains consistent as you move from smaller to larger embeddings. This helps with re-ranking use cases, where you first retrieve candidates with small embeddings and then re-score with larger ones.
What the Model Learns
Through this training process, the model develops a specific internal structure. The first few dimensions become generalist representations — they capture the most salient features that differentiate documents. Think broad topic category, general sentiment, domain-level features.
Later dimensions add specificity. They capture nuances that matter for fine-grained similarity — subtle semantic differences, writing style, domain-specific terminology. These dimensions are valuable when you need precise ranking but wasteful when approximate matching is fine.
This ordering isn’t explicitly prescribed by the training setup. It emerges naturally from the gradient dynamics. The earliest dimensions face the most pressure across all loss scales, so the model learns to encode the highest-value information there.
Embedding Size vs. Quality: What the Numbers Look Like
The practical question with matryoshka representation learning is always: how much quality do you actually lose when you use smaller embeddings? The answer depends on the model, the task, and what “quality” means for your use case.
MTEB Benchmark Numbers
The Massive Text Embedding Benchmark (MTEB) is the standard leaderboard for text embedding models. It covers retrieval, clustering, classification, semantic similarity, and other tasks across many datasets. MRL-trained models tend to show a specific pattern on MTEB:
- At full size, they match or slightly underperform compared to the best fixed-size models of equivalent parameter count. The multi-scale training adds a small regularization effect.
- At 50% of full size, quality drops by a few percentage points — usually 1–4 points on retrieval metrics like NDCG@10.
- At 25% of full size, the drop is more noticeable but often still acceptable for many applications — typically 5–10 points below full-size performance.
- At very small sizes (below 10% of full dimensions), quality degrades significantly for complex tasks.
For OpenAI’s text-embedding-3-large model (3072 dimensions natively), the model maintains strong performance down to 256 dimensions — roughly 8% of full size. That’s a significant compression with a modest accuracy cost.
Comparing MRL vs. Truncated Standard Embeddings
The more useful comparison isn’t MRL at reduced size versus MRL at full size — it’s MRL at reduced size versus a standard embedding model truncated to that same size.
When you truncate a standard embedding to 25% of its dimensions, you typically lose much more quality than when you truncate an MRL embedding to the same size. The MRL model was trained to make those first dimensions count. The standard model wasn’t.
Empirically, MRL embeddings at 128 dimensions often match or beat standard embeddings at 512 dimensions trained independently. That’s a 4x reduction in storage and computation for equivalent quality. This is the core efficiency argument for MRL.
When Small Is Good Enough
The right embedding size depends on your task:
Tasks where small embeddings work well:
- Coarse document retrieval where you’re fetching a top-100 candidate set for re-ranking
- Clustering large document collections by broad topic
- Deduplication where you’re catching near-identical content
- Real-time classification with tight latency budgets
Tasks where you should use full-size embeddings:
- Fine-grained semantic similarity where subtle differences matter
- Question-answering retrieval where missing the exact right paragraph has consequences
- Final re-ranking step after an initial coarse retrieval
- Multilingual tasks where fine distinctions in meaning are load-bearing
A practical approach: use 256-dimensional embeddings for fast approximate nearest neighbor search to pull 100–200 candidates, then re-score the top-K using full embeddings for final ranking. This gets you most of the speed benefit while preserving accuracy where it counts.
Where to Find MRL Embedding Models
Matryoshka representation learning has moved from a research technique to a practical tool available in most mainstream embedding APIs and open-source model hubs.
OpenAI’s text-embedding-3 Models
In January 2024, OpenAI released text-embedding-3-small and text-embedding-3-large — both trained using MRL. These replaced the older text-embedding-ada-002 model.
Using them is straightforward. You pass a dimensions parameter to the API:
```python
from openai import OpenAI

client = OpenAI()

# Full size: 3072 dimensions (the default)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Semantic search is useful for retrieval tasks."
)

# Reduced size: 256 dimensions
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Semantic search is useful for retrieval tasks.",
    dimensions=256
)
```
OpenAI handles the truncation on their end — you’re not just receiving a truncated 3072-vector, you’re getting a normalized vector at the requested size. The text-embedding-3-small model supports up to 1536 dimensions, while text-embedding-3-large goes up to 3072.
One notable data point from OpenAI: their text-embedding-3-large model at 256 dimensions outperforms text-embedding-ada-002 at its full 1536 dimensions on MTEB. That’s a smaller embedding beating a larger one because of how the model was trained.
Open-Source MRL Models
The Hugging Face model hub has a growing collection of MRL-trained embedding models. Several notable options:
nomic-ai/nomic-embed-text-v1.5 — A 768-dimensional model trained with matryoshka loss. It supports dimensions from 64 to 768 and is Apache 2.0 licensed, making it usable in commercial applications without restrictions.
mixedbread-ai/mxbai-embed-large-v1 — A 1024-dimensional model from Mixedbread AI. Strong MTEB scores at multiple dimension settings.
BAAI/bge-m3 — From the Beijing Academy of Artificial Intelligence. Designed for multilingual retrieval with MRL support, covering 100+ languages.
tomaarsen/mpnet-base-nli-matryoshka — A sentence-transformers fine-tuned model that demonstrates MRL on a smaller base architecture.
Using these through the sentence-transformers library requires a simple truncation step:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

text = "Matryoshka embeddings support flexible dimensionality."
embedding = model.encode(text)

# Use first 256 dimensions
embedding_256 = embedding[:256]

# Normalize after truncation (important for cosine similarity)
embedding_256 = embedding_256 / np.linalg.norm(embedding_256)
```
Normalization after truncation matters. Raw truncated vectors aren’t normalized, so cosine similarity calculations will be off if you skip this step.
Cohere’s Embedding Models
Cohere’s embed-english-v3.0 and embed-multilingual-v3.0 also support variable output dimensions through a similar mechanism. Their API lets you specify the output size when making an embedding request, and the models were trained to support this efficiently.
How to Use MRL Embeddings in Practice
The theory is straightforward. The practical implementation involves a few choices that can significantly affect whether you get the benefits you’re expecting.
Choosing Your Embedding Size
Start by profiling your accuracy/latency needs, not by picking a number arbitrarily. Here’s a reasonable approach:
- Generate full-size embeddings for a representative sample of your data.
- Evaluate retrieval quality at multiple dimension settings: 25%, 50%, full size.
- Set a minimum acceptable quality threshold (e.g., NDCG@10 ≥ 0.82).
- Use the smallest dimension that meets that threshold.
This process takes maybe a few hours with a good evaluation set, and it tells you the actual trade-off curve for your specific task and domain. Generic benchmarks don’t perfectly predict task-specific performance.
For most standard retrieval tasks over English-language text, 256–512 dimensions is a reasonable starting point. For multilingual content, fine-grained similarity tasks, or domains where subtle semantic differences are critical, start at 50% of full size and see if you can go lower.
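The profiling loop described above can be sketched with a simple recall metric. This assumes you already have full-size query and document embeddings plus labeled relevant documents; recall@k stands in here for whatever metric you actually track (such as NDCG@10), and the synthetic data at the bottom is only a stand-in for a real evaluation set:

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant_ids, dims, k=10):
    # Truncate both sides to `dims`, renormalize, and check how often the
    # known-relevant document for each query lands in the top-k.
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(query_embs[:, :dims]) @ norm(doc_embs[:, :dims]).T
    hits = sum(
        rel in np.argsort(row)[::-1][:k]
        for row, rel in zip(sims, relevant_ids)
    )
    return hits / len(relevant_ids)

# Synthetic stand-in for "full-size embeddings of a representative sample":
rng = np.random.default_rng(7)
doc_embs = rng.normal(size=(500, 1024))
relevant_ids = np.arange(20)
query_embs = doc_embs[relevant_ids] + 0.1 * rng.normal(size=(20, 1024))

# Sweep dimension settings to find the smallest acceptable size.
for dims in (64, 256, 512, 1024):
    print(dims, recall_at_k(query_embs, doc_embs, relevant_ids, dims))
```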
Integrating MRL Into a Vector Database
Most modern vector databases support arbitrary embedding dimensions. You configure the index dimension once at setup. When switching to smaller MRL embeddings, you’re just creating an index with a smaller dimension parameter:
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Standard full-size index
pc.create_index(
    name="documents-full",
    dimension=3072,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

# MRL-optimized smaller index
pc.create_index(
    name="documents-small",
    dimension=256,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

# Connect for reads and writes:
index = pc.Index("documents-small")
```
The storage savings compound significantly at scale. A million 3072-dimensional float32 vectors require about 12 GB of storage. The same million vectors at 256 dimensions take about 1 GB. For a 10-million document corpus, that’s the difference between 120 GB and 10 GB.
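The arithmetic is simple: float32 costs 4 bytes per dimension. A small helper (the function name here is ours, not from any library) makes the scaling concrete:

```python
def storage_gb(n_vectors, dims, bytes_per_value=4):
    # float32 = 4 bytes per value; raw vector data only, no index overhead.
    return n_vectors * dims * bytes_per_value / 1e9

print(storage_gb(1_000_000, 3072))  # 12.288 GB for full-size vectors
print(storage_gb(1_000_000, 256))   # 1.024 GB after MRL truncation
```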
Query latency benefits are also substantial. Cosine similarity computation scales with dimension count. Smaller embeddings mean faster per-query computation, which matters when you’re searching across millions of vectors in real time.
The Two-Stage Retrieval Pattern
The most powerful practical application of MRL is two-stage retrieval. This pattern uses small embeddings for fast candidate retrieval and large embeddings for accurate re-ranking.
Stage 1: Fast retrieval
- Use 128–256 dimensional embeddings for your full index.
- Run approximate nearest neighbor search to pull top-200 candidates.
- This is fast and cheap — the index is small, queries are quick.
Stage 2: Precise re-ranking
- Generate full-size embeddings for only the 200 candidates retrieved in stage 1.
- Re-score each candidate against the query using full-dimensional cosine similarity.
- Return the top-K results by re-ranked score.
Stage 2 runs on 200 vectors instead of millions, so using full-size embeddings here costs almost nothing. The overall pipeline gets the speed of small embeddings with most of the accuracy of large ones.
This pattern works well because the recall at stage 1 only needs to be reasonable, not perfect. If you’re pulling 200 candidates for 10 final results, you can afford for the small-embedding retrieval to miss a few edge cases — the re-ranking step recovers most of those.
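The two stages can be sketched with NumPy on a precomputed embedding matrix. This assumes MRL embeddings (so prefixes are meaningful after renormalization) and uses brute-force similarity in place of a real ANN index:

```python
import numpy as np

def normalize(v):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def two_stage_search(query_emb, doc_embs, small_dim=256, n_candidates=200, top_k=10):
    # Stage 1: coarse search over truncated, renormalized prefixes.
    q_small = normalize(query_emb[:small_dim])
    d_small = normalize(doc_embs[:, :small_dim])
    candidates = np.argsort(d_small @ q_small)[::-1][:n_candidates]
    # Stage 2: exact re-ranking of just the candidates at full dimension.
    q_full = normalize(query_emb)
    d_full = normalize(doc_embs[candidates])
    order = np.argsort(d_full @ q_full)[::-1][:top_k]
    return candidates[order]
```

In production, stage 1 would hit a vector database’s ANN index built on the small vectors, and stage 2 would re-score against full-size vectors stored for just the candidate set.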
Adaptive Retrieval for Dynamic Workloads
A less common but powerful pattern: dynamically select embedding size based on query complexity or system load. For simple keyword-like queries (short, high-confidence), use 128 dimensions. For complex multi-clause queries where semantic nuance matters, use full-size embeddings.
This requires some query analysis logic — checking query length, entity density, presence of negations or qualifiers — but it can give you the right accuracy level for each query type without paying full cost for simple cases.
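As a toy illustration of that routing logic (the thresholds and keyword list here are hypothetical, chosen for the example rather than taken from any paper or API):

```python
def pick_dims(query: str, full_dims: int = 1024) -> int:
    # Hypothetical heuristic: short keyword-style queries get small
    # embeddings; queries with negations or qualifiers get full size.
    words = query.lower().split()
    if any(w in {"not", "without", "except", "unless", "versus"} for w in words):
        return full_dims
    if len(words) <= 3:
        return 128
    return 512
```

A real implementation would tune these rules against query logs, and might also fall back to smaller dimensions under heavy system load.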
Matryoshka Representation Learning vs. Other Dimension Reduction Methods
MRL isn’t the only way to work with smaller embedding vectors. It’s worth understanding how it compares to other approaches.
PCA and Post-Hoc Dimensionality Reduction
Principal Component Analysis (PCA) is the most common alternative. You generate full-size embeddings, then apply PCA to compress them down to a target dimension. This approach has real disadvantages:
- PCA requires fitting on a representative dataset. If your data distribution shifts, the PCA projection may degrade.
- PCA-reduced embeddings often don’t perform as well at small sizes as MRL embeddings do, because PCA is maximizing variance explained, not retrieval quality.
- You have to store the PCA transformation matrix and apply it at query time, adding latency and complexity.
MRL sidesteps all of this. There’s no post-processing step, no transformation matrix, no fitting required. You truncate and normalize. That’s it.
Product Quantization
Product quantization (PQ) compresses embeddings by representing them as sequences of codebook indices rather than float32 values. It’s a different axis of optimization — it reduces the bit depth of storage rather than the number of dimensions.
PQ and MRL aren’t mutually exclusive. You can use product quantization on top of MRL-truncated embeddings for even more aggressive compression. The combination is common in production retrieval systems that need both small storage and fast search.
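For intuition, here is a sketch of the simpler scalar-quantization variant of that idea: MRL truncation followed by int8 quantization. Full product quantization uses learned codebooks instead, typically via a library such as faiss:

```python
import numpy as np

def truncate_and_quantize(emb, dims=256):
    # MRL truncation first, then scalar int8 quantization of each value.
    v = emb[:dims]
    v = v / np.linalg.norm(v)
    scale = np.abs(v).max() / 127.0
    q = np.round(v / scale).astype(np.int8)
    return q, scale  # store 1 byte per dimension plus one float scale
```

Against raw float32 at full size, truncating 1024 dimensions to 256 and dropping to int8 cuts storage by 16x, at the cost of some accuracy on both axes.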
Autoencoder-Based Compression
You can train an autoencoder to compress embeddings from a fixed-size model down to a smaller latent space. This is more flexible than PCA but shares similar downsides: requires training a separate model, adds inference-time latency, and typically underperforms MRL at equivalent compression ratios.
The core advantage of matryoshka representation learning over all post-hoc methods: the model learns to organize information in dimension-size-friendly ways during training. Post-hoc methods are working against the model’s internal structure. MRL bakes the structure in from the start.
MRL for Non-Text Modalities
The original MRL paper focused on image embeddings as well as text. The technique is modality-agnostic — the multi-scale loss applies equally to any embedding model.
Image Retrieval
For image embeddings, MRL allows the same trade-offs: faster approximate image search with small embeddings, precise re-ranking with large ones. The paper showed strong results on ImageNet classification and retrieval benchmarks using a ResNet backbone trained with matryoshka loss.
Multimodal Models
CLIP-style models that learn joint image-text embeddings can also be trained with MRL. This enables flexible retrieval across modalities — search images with text, or search text with images — at multiple scale points. Some fine-tuned CLIP variants on Hugging Face include matryoshka training.
Audio and Other Sequences
Any sequence-to-vector embedding pipeline — audio models, time series encoders, molecular fingerprint models — can be trained with MRL. The technique doesn’t require anything specific to language or images. If you can define a contrastive or reconstruction loss on the embedding, you can add matryoshka scales to it.
Building AI Workflows With MRL Embeddings
Understanding MRL matters most when you’re building systems that actually use embeddings at scale — retrieval pipelines, RAG agents, semantic search applications, and recommendation systems. The design decisions you make around embedding size affect the speed, cost, and accuracy of everything downstream.
Where Embedding Choices Show Up in Real Applications
In a typical RAG pipeline, embeddings are used in two places: at indexing time (when you embed your document chunks and store them) and at query time (when you embed the user’s question and search for relevant chunks). The embedding model and dimension you choose affects:
- How much storage your vector index needs
- How long each query takes to return results
- How accurately the retrieved chunks match the user’s intent
- How much the embedding API costs per query
These aren’t abstract concerns. In a production RAG system handling thousands of daily queries across a large document corpus, the difference between 3072-dimensional and 256-dimensional embeddings can mean 10x lower retrieval latency and significantly reduced API costs — with acceptable accuracy if you’ve chosen the dimension size thoughtfully.
Connecting MRL to MindStudio
MindStudio is a no-code platform for building AI agents and workflows. If you’re building a retrieval-augmented agent in MindStudio — something that searches a knowledge base, answers questions, or processes documents — the embedding strategy you use directly affects how well that agent performs.
MindStudio supports 200+ AI models out of the box, including the OpenAI text-embedding-3 models that use matryoshka training. This means you can build agents that use flexible-size embeddings without writing custom infrastructure. You can configure retrieval workflows that pull candidates using smaller embeddings and re-rank with larger ones, or you can start with a full-size embedding model and tune down based on your latency requirements.
For teams building semantic search agents, document Q&A systems, or knowledge management tools in MindStudio, understanding MRL helps you make better decisions about which embedding model to select and how to structure the retrieval step. A smaller embedding dimension on a well-trained MRL model often delivers better results than a larger dimension on an older fixed-size model — which counterintuitively means better performance at lower cost.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is Matryoshka Representation Learning?
Matryoshka Representation Learning (MRL) is a training technique for embedding models. It trains a single model to produce useful representations at multiple dimension sizes simultaneously. The name comes from Russian nesting dolls — just as smaller dolls nest inside larger ones, smaller sub-embeddings are nested within larger MRL embeddings. The first N dimensions of a larger MRL embedding form a complete, usable N-dimensional embedding on their own.
How much quality do you lose with smaller MRL embeddings?
It depends on the model and the task, but the losses are typically modest at reasonable compression levels. For OpenAI’s text-embedding-3-large, embeddings at 256 dimensions (roughly 8% of the full 3072) still outperform the older text-embedding-ada-002 at its full 1536 dimensions on MTEB benchmarks. For most retrieval tasks, 50% of full size shows only 1–4 percentage points of quality degradation. Very aggressive compression — below 10% of full dimensions — causes more significant quality loss and is best used only for coarse filtering stages.
Which embedding models support MRL?
Several widely used models include MRL training:
- OpenAI — text-embedding-3-small and text-embedding-3-large both support the dimensions parameter
- Cohere — embed-english-v3.0 and embed-multilingual-v3.0
- Nomic AI — nomic-embed-text-v1.5
- Mixedbread AI — mxbai-embed-large-v1
- BAAI — bge-m3 (multilingual)
The list is growing. Most state-of-the-art embedding models released in 2024 and 2025 include some form of matryoshka training.
How does MRL compare to PCA for reducing embedding size?
MRL almost always outperforms post-hoc dimensionality reduction like PCA at equivalent compression ratios. PCA maximizes variance explained in the original embedding space, which doesn’t directly optimize for retrieval quality. MRL trains the model to produce high-quality representations at small sizes as a primary objective. Additionally, PCA requires fitting a transformation on data and applying it at inference time, adding complexity and latency. MRL requires only truncation and normalization — a trivial operation.
Should I always use the smallest possible MRL embedding?
No. Smaller embeddings mean faster search and cheaper storage, but they also mean lower accuracy on fine-grained tasks. The right size depends on your use case. Use benchmarking on your specific task and dataset to identify the smallest dimension that meets your quality threshold. For simple retrieval tasks over general-topic documents, 256–512 dimensions is often sufficient. For precise semantic similarity over specialized domains — legal, medical, scientific — you’ll likely want to use larger dimensions or the full embedding.
Can MRL be used for images and other modalities?
Yes. The original MRL paper demonstrated results on image classification and retrieval using ResNet-based models. The training technique is modality-agnostic: any model that produces an embedding vector can be trained with matryoshka loss. Multimodal CLIP-style models, audio encoders, and other sequence-to-vector models can all use MRL. The core requirement is just that you can define a loss function on the embedding and backpropagate through it.
Does using a smaller MRL embedding always cost less with the API?
Not on the API bill itself. OpenAI prices embeddings by input tokens, not output dimensions, so requesting 256 dimensions costs the same as requesting 3072. The real savings from smaller embeddings show up in your own infrastructure: vector database storage, index memory, and query latency. Treat the dimensions parameter as an efficiency lever for your stack rather than a way to reduce API spend.
What is the difference between MRL and binary or int8 quantization?
These are different optimization techniques that can be combined. MRL reduces the number of dimensions in the embedding vector. Quantization reduces the precision of each dimension — storing values as int8 instead of float32, for example. You can apply quantization to full-size or MRL-truncated embeddings. Using MRL truncation plus int8 quantization together gives you both storage compression axes simultaneously, at the cost of some accuracy on both fronts.
Key Takeaways
- Matryoshka representation learning trains a single embedding model to produce useful representations at multiple dimension sizes. The first N dimensions of a larger MRL embedding form a valid N-dimensional embedding on their own.
- The quality/size trade-off is favorable. MRL embeddings at reduced sizes typically outperform standard embeddings of equivalent size — and often match full-size legacy models at a fraction of the dimensions.
- The two-stage retrieval pattern — small embeddings for candidate retrieval, full-size embeddings for re-ranking — is the most practical way to use MRL in production. It gives you most of the accuracy with a fraction of the computational cost.
- MRL is available in major APIs — including OpenAI’s text-embedding-3 models, Cohere’s v3 models, and numerous open-source options on Hugging Face — making it easy to adopt without custom training.
- Post-hoc dimensionality reduction (PCA, autoencoders) is almost always worse than using a natively MRL-trained model at the same target dimension, because MRL trains the model to organize information hierarchically from the start.
If you’re building retrieval-augmented agents or any system that uses embeddings at scale, considering MRL is worth the time. Start by choosing an MRL-capable model, profile your accuracy at a few dimension settings, and pick the smallest size that meets your task requirements. The storage savings and latency improvements are real — and with the right setup, they come with minimal accuracy cost.
To build AI agents that use these techniques in practice, MindStudio gives you access to the major embedding models — including OpenAI’s MRL-trained options — through a no-code workflow builder. You can start free and have a working retrieval agent running in well under an hour.