What Is Index Share? How GLM 5.2 Achieves 2.9x Fewer Compute Operations at 1M Token Context

The Problem With Long-Context Attention (And Why It’s Expensive)

Running a large language model on a million-token context sounds impressive until you look at the compute bill. Standard transformer attention scales quadratically with sequence length — double the context, quadruple the compute. At one million tokens, that’s not a slowdown. It’s a wall.

GLM 5.2, developed by Zhipu AI, attacks this problem with a technique called Index Share. The result: 2.9x fewer compute operations at 1M token context compared to naive sparse attention implementations. That number matters not just for benchmarks, but for whether models with genuinely long context windows can ever be affordable to run in production.

This post explains what Index Share is, how it works mechanically, and why it makes a measurable difference at scale.

Why Long-Context Attention Is So Computationally Expensive

To understand Index Share, you first need to understand what makes attention expensive at long sequences.

The Quadratic Problem

Standard self-attention computes a similarity score between every token and every other token in the sequence. For a sequence of length n, that’s n² comparisons. At 1,000 tokens, that’s 1 million comparisons. At 1,000,000 tokens, that’s 1 trillion. The compute cost explodes.
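To make that growth concrete, here is a minimal sketch that counts pairwise score computations for a few sequence lengths. The function name `attention_pairs` is illustrative only, and the count ignores heads, layers, and hidden dimension.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score computations in full self-attention."""
    return n_tokens * n_tokens


# Print the comparison count for a few context lengths.
for n in (1_000, 8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} comparisons")
```

Doubling `n_tokens` quadruples the result, which is the quadratic scaling described above.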

This is why most models have practical context limits well below their advertised maximums. Even if the architecture technically supports 1M tokens, actually running inference at that length is prohibitively expensive without modifications.

Sparse Attention as the Standard Fix

The obvious solution is sparse attention: instead of attending to every token, each token only attends to a subset of the sequence. Common strategies include:

Local windows: attend only to nearby tokens
Strided patterns: attend to every kth token
Top-k selection: attend to the k most relevant tokens, wherever they are

Top-k selection is the most flexible — it lets the model attend to any tokens, not just nearby ones. But it introduces a new problem: figuring out which tokens to attend to.
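As a rough illustration, the three strategies can each be written as a function that returns the positions a token is allowed to attend to. This is a sketch, not any model's actual code: the function names are made up for this post, the window is symmetric (a causal model would only look backward), and `top_k` assumes the relevance scores have already been computed by something else.

```python
import heapq


def local_window(pos: int, n: int, radius: int) -> list[int]:
    """Positions within `radius` of `pos`, clipped to the sequence bounds."""
    return list(range(max(0, pos - radius), min(n, pos + radius + 1)))


def strided(n: int, stride: int) -> list[int]:
    """Every `stride`-th position, starting from 0."""
    return list(range(0, n, stride))


def top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest scores, returned in ascending position order."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.0]
print(local_window(4, n=8, radius=1))  # [3, 4, 5]
print(strided(8, stride=3))            # [0, 3, 6]
print(top_k(scores, k=3))              # [1, 3, 5]
```

Only `top_k` depends on the content of the sequence. The other two are fixed patterns that cost nothing to decide, which is why the selection cost discussed next is specific to top-k.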

The Hidden Cost of Selection

In top-k sparse attention, you need an “indexer”: a mechanism that scans the full sequence, scores every position, and selects the top candidates. For each query token, that scoring step is itself linear in the sequence length. At 1M tokens, running this indexer once per attention layer per forward pass adds up fast.
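A minimal sketch of such an indexer for a single query is shown below. It assumes relevance is a plain dot product against one small "index key" per position. That scoring rule, and the names `indexer` and `index_keys`, are assumptions for illustration and not GLM 5.2's published design.

```python
import heapq


def indexer(query: list[float], index_keys: list[list[float]],
            k: int) -> tuple[list[int], int]:
    """Score every position against `query` and keep the top k.

    Returns the selected positions and the number of score evaluations,
    which always equals the sequence length.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in index_keys]
    selected = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(selected), len(scores)


index_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
positions, evaluations = indexer([1.0, 0.5], index_keys, k=2)
print(positions, evaluations)  # [0, 2] 4
```

However small `k` is, the number of score evaluations equals the sequence length. That is the cost that grows with context.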

This is exactly where Index Share intervenes.

What Index Share Actually Does

Index Share is a specific optimization for sparse attention systems. The core idea is simple: compute the attention index once, then reuse it across multiple consecutive layers.

The Basic Mechanic

In a standard sparse attention model, each transformer layer runs its own indexer to determine which tokens to attend to. Every layer independently scores the sequence and selects its own top-k candidates.

Index Share groups layers into blocks — in GLM 5.2, each group is four layers. Instead of running the indexer four times (once per layer), the model runs it once for the first layer of the group, then shares those same indices with the remaining three layers.

The attention computation itself still runs independently in each layer. What’s being shared is just the selection decision — the list of which token positions are worth attending to.
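The grouping rule can be sketched as a simple schedule over layer numbers. The 8-layer stack below is a placeholder, since the layer count of GLM 5.2 isn't given here, and the function names are hypothetical.

```python
def indexer_schedule(n_layers: int, group_size: int = 4) -> list[bool]:
    """True where a layer runs its own indexer, False where it reuses one."""
    return [layer % group_size == 0 for layer in range(n_layers)]


def run_layers(n_layers: int, group_size: int = 4) -> tuple[int, list[int]]:
    """Walk the stack and record which layer's indices each layer uses."""
    indexer_runs = 0
    source_layer = []       # for each layer, the layer that produced its indices
    current_source = -1
    for layer, runs_indexer in enumerate(indexer_schedule(n_layers, group_size)):
        if runs_indexer:
            indexer_runs += 1
            current_source = layer
        source_layer.append(current_source)
    return indexer_runs, source_layer


runs, sources = run_layers(n_layers=8)
print(runs)     # 2
print(sources)  # [0, 0, 0, 0, 4, 4, 4, 4]
```

With a group size of 1 the same code falls back to one indexer run per layer, which is the standard sparse attention baseline.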

Why This Works Theoretically

The assumption behind Index Share is that the most relevant positions in a long sequence don’t change dramatically from one layer to the next. A passage that’s contextually relevant in layer 12 is likely still relevant in layer 13, 14, and 15.

This holds up for most inference scenarios. Early layers and late layers might develop different representational focuses, but within a short run of consecutive layers, the relevant “anchors” in a long context tend to remain stable.

What Gets Reused vs. What Doesn’t

It’s worth being precise about what Index Share shares and what it doesn’t:

Shared across 4 layers:

The index set (which token positions were selected)
The gather operation (physically fetching those token vectors)

NOT shared — still computed independently per layer:

The Q, K, and V projections, which use each layer’s own weights
The attention matrix computation over selected tokens
The output projections

So each layer still does its own reasoning. It just doesn’t redo the expensive search for which tokens to reason about.
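Here is a toy sketch of that split: two layers receive the same selected positions but bring their own keys and values, so their outputs differ. Everything in it is invented for illustration (single head, single query, tiny vectors), and the query is held fixed across layers only to keep the example short.

```python
import math


def sparse_attention(query, keys, values, indices):
    """Softmax attention for one query, restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]
    peak = max(logits)                       # subtract the max for stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w / total * values[i][d] for w, i in zip(weights, indices))
        for d in range(dim)
    ]


shared_indices = [0, 2]   # chosen once by the group's indexer (hypothetical)

# Two layers in the same group: same indices, different per-layer K and V.
layer_a = {"keys": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
           "values": [[1.0], [2.0], [3.0]]}
layer_b = {"keys": [[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]],
           "values": [[10.0], [20.0], [30.0]]}

query = [1.0, 0.0]
out_a = sparse_attention(query, layer_a["keys"], layer_a["values"], shared_indices)
out_b = sparse_attention(query, layer_b["keys"], layer_b["values"], shared_indices)
print(out_a, out_b)
```

Position 1 never enters either computation, because it isn't in the shared index set. Only the weighting over positions 0 and 2 changes from layer to layer.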

The 2.9x Figure: Where It Comes From

The 2.9x reduction in compute operations at 1M token context isn’t a theoretical upper bound — it reflects the practical overhead structure of sparse attention at very long sequences.

Breaking Down the Compute Budget

At 1M tokens, a large fraction of total attention compute time is consumed by the indexing step, not the attention computation itself. This is counterintuitive if you’re used to thinking about short contexts where the attention matrix dominates.

At short context lengths (say, 8K tokens), the indexer is fast — scanning 8K positions is cheap. But at 1M tokens, the indexer must score 1M positions, making it proportionally much more expensive relative to the subsequent attention computation over the small selected subset.

By running this indexer once per 4-layer group instead of four times, GLM 5.2 cuts roughly three-quarters of the indexing work. Combined with the already-sparse attention computation, total operations drop by approximately 2.9x compared to running standard sparse attention naively.
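A back-of-the-envelope model shows how the two numbers relate. Assume each layer's sparse attention cost splits into an indexer part and an attention part, and that Index Share divides only the indexer part by the group size. This breakdown is an assumption for illustration, since no cost split is published here, and the function names are hypothetical.

```python
def speedup(indexer_fraction: float, group_size: int = 4) -> float:
    """Speedup from running the indexer once per group instead of per layer.

    `indexer_fraction` is the share of per-layer cost spent in the indexer
    before sharing; the remainder is assumed to be unaffected.
    """
    remaining = (1.0 - indexer_fraction) + indexer_fraction / group_size
    return 1.0 / remaining


def fraction_for(target_speedup: float, group_size: int = 4) -> float:
    """Indexer fraction this model needs in order to reach `target_speedup`."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / group_size)


for f in (0.25, 0.50, 0.75, 0.90):
    print(f"indexer share {f:.0%} -> {speedup(f):.2f}x")
print(f"share implied by 2.9x: {fraction_for(2.9):.1%}")
```

Under this simplified model the speedup can never exceed the group size (4x), and reaching 2.9x requires the indexer to account for roughly 87% of the per-layer cost before sharing. That figure is what the model implies, not a measured number, but it is consistent with the claim that indexing dominates at 1M tokens.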

Why the Gains Are Larger at Longer Contexts

The savings from Index Share scale with context length. At 4K tokens, the indexing step is a minor cost. At 1M tokens, it dominates. This means Index Share is specifically designed for — and delivers the most value at — the extreme end of context length.
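The trend can be sketched with a toy per-query cost model: indexer cost grows with the number of tokens, while attention cost depends only on the fixed number of selected tokens. All constants below (`k`, `idx_cost`, `attn_cost`) are placeholders rather than measurements of GLM 5.2, so only the direction of the curve is meaningful.

```python
def group_speedup(n_tokens: int, k: int = 2048, idx_cost: float = 1.0,
                  attn_cost: float = 128.0, group_size: int = 4) -> float:
    """Per-layer cost without sharing divided by per-layer cost with sharing."""
    per_layer_index = idx_cost * n_tokens   # indexer scores every position
    per_layer_attn = attn_cost * k          # attention touches only k positions
    baseline = per_layer_index + per_layer_attn
    shared = per_layer_index / group_size + per_layer_attn
    return baseline / shared


for n in (4_096, 32_768, 131_072, 1_000_000):
    print(f"{n:>9,} tokens -> {group_speedup(n):.2f}x")
```

With any positive constants the ratio starts near 1x at short contexts and climbs toward the group size as the sequence grows, because the indexer term eventually swamps the fixed attention term.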

This is intentional. Long-context models are only useful if they can actually run at long contexts in reasonable time. A model that technically supports 1M tokens but requires 10 minutes per inference call isn’t practically deployable.

Accuracy Tradeoffs

Sharing indices across layers does introduce a small accuracy tradeoff. Letting each layer independently select its own top-k tokens is theoretically optimal. Forced sharing means some layers attend to tokens they might not have independently chosen.

Zhipu AI’s design choice — a group size of 4 layers — appears to be a sweet spot where the accuracy degradation is minimal but the compute savings are substantial. Larger groups would save more compute but risk meaningful quality loss. Smaller groups reduce the savings.
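One way to reason about that tradeoff is to measure how much of each layer's own preferred top-k set survives when it is forced to use the first layer's selection. The sketch below does this on invented scores. The numbers are hypothetical and say nothing about GLM 5.2's real behavior.

```python
import heapq


def top_k(scores, k):
    """Set of indices holding the k highest scores."""
    return set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


def overlap_with_shared(per_layer_scores, k):
    """Fraction of each layer's own top-k that the shared set also contains."""
    shared = top_k(per_layer_scores[0], k)   # chosen by the group's first layer
    return [len(shared & top_k(scores, k)) / k for scores in per_layer_scores]


# Made-up relevance scores over 6 positions, drifting a little at each layer.
scores_by_layer = [
    [0.9, 0.8, 0.10, 0.7, 0.2, 0.0],
    [0.8, 0.9, 0.20, 0.6, 0.1, 0.0],
    [0.7, 0.9, 0.65, 0.6, 0.1, 0.0],
    [0.2, 0.9, 0.80, 0.1, 0.7, 0.0],
]
print(overlap_with_shared(scores_by_layer, k=3))
```

In this toy data the overlap decays the further a layer sits from the one that ran the indexer. That is the intuition behind keeping groups short: the slower relevance drifts across layers, the larger the group can be before quality suffers.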

How GLM 5.2 Implements This in Practice

Index Share doesn’t exist in isolation. It’s one component of GLM 5.2’s overall long-context architecture.

Hierarchical Sparse Attention

GLM 5.2 uses a hierarchical attention design that combines multiple sparse patterns:

Local window attention: Every token attends to its nearby neighbors, preserving fine-grained local context
Strided global tokens: Periodically selected tokens attend broadly, enabling long-range communication
Top-k content-based selection: Dynamic selection of the most semantically relevant tokens anywhere in the sequence

Index Share applies specifically to the top-k content-based selection component — that’s where the indexer overhead is highest, and where reuse is most justified by the stability of relevance across adjacent layers.

Layer Group Structure

In GLM 5.2’s implementation, the model’s transformer layers are divided into groups of four. Within each group:

Layer 1: Runs the full indexer, computes the index set, runs attention
Layers 2–4: Receive the same index set, run attention without re-indexing

The indexer in layer 1 uses the current hidden states (not the original input), so it has access to the representations built up through previous layer groups. This means the shared index isn’t blind — it’s informed by what the model has already processed.

Memory Implications

Sharing indices doesn’t significantly increase memory usage. The index set for one group of 4 layers is the same size as the index set for one layer — you’re keeping it alive a bit longer, but you’re not multiplying storage. This is an important practical point: some attention optimizations trade compute for memory, which can be problematic at long contexts. Index Share doesn’t do this.

Why Serving Costs Matter as Much as Benchmark Numbers

Model efficiency benchmarks often feel abstract. 2.9x fewer operations sounds good, but what does it mean in practice?

The Economics of Long-Context Inference

Running a model at 1M token context is expensive by any measure. At standard transformer compute costs, a 1M-token inference call might cost several dollars in GPU time. At that price, most real-world applications — customer support, document analysis, code review — become uneconomical for companies to deploy at scale.

To a first approximation, a 2.9x reduction in compute operations means a similar reduction in GPU time per call, although memory bandwidth and other overheads keep the mapping from being exact. If raw inference cost drops by roughly 2–3x, applications that were previously too expensive to run continuously become viable.

The Difference Between Research Models and Production Models

There’s a meaningful distinction between a model that can handle 1M tokens and one that can be deployed at 1M tokens. Research models demonstrate capability; production models are constrained by what an API provider can actually serve profitably.

Techniques like Index Share are what close that gap. They don’t just make models faster — they make them worth operating at long context lengths as a standard offering rather than a special-case feature.

Real-World Use Cases That Become Practical

With the compute reduction Index Share provides, long-context applications become more accessible:

Legal document review: Entire contracts, case files, and regulatory documents processed in a single context window
Codebase analysis: A full repository loaded as context for debugging or refactoring
Long-form research synthesis: Dozens of papers read and cross-referenced in one pass
Extended conversation memory: Agents that maintain coherent context across very long sessions

These use cases existed before — they just required expensive workarounds like chunking and retrieval. At affordable 1M-token inference, some of those workarounds become unnecessary.

Using Long-Context Models in Your Own Workflows

Understanding how Index Share works is useful context, but most developers and teams don’t need to implement it themselves. What they need is access to models that have done this work — and a way to put those models into production applications without building infrastructure from scratch.

Where MindStudio Fits

MindStudio gives you access to 200+ AI models — including long-context models — without managing API keys, rate limits, or model versioning separately. You can build agents that call on whichever model best fits your task, whether that’s a long-context model for document processing or a faster model for quick responses.

For teams building workflows that process long documents — contracts, research papers, codebase reviews, extended customer conversations — this matters. You can build an agent that:

Accepts a long document as input
Routes it to a long-context model
Synthesizes the result and triggers downstream actions (Slack notification, CRM update, email follow-up)

All of this can be set up in MindStudio’s visual builder in under an hour, without writing infrastructure code. The average agent build takes 15 minutes to an hour, and you can connect to 1,000+ business tools without custom integrations.

If you’re working with long-context models as part of an automated pipeline — not just a one-off chat interface — that kind of tooling makes the difference between a prototype and a deployed product. You can try it free at mindstudio.ai.

Index Share in the Broader Context of Attention Research

Index Share isn’t an isolated invention. It fits into a broader research trajectory around making attention efficient at long contexts.

Several techniques address overlapping problems:

FlashAttention: Optimizes the memory access pattern during attention computation, reducing memory bandwidth cost without changing which tokens are attended to. Complementary to Index Share, not competing.
Sliding window attention (used in models like Mistral): Restricts attention to a fixed local window, losing long-range connectivity. Simpler but less flexible than top-k selection.
Linear attention approximations: Replace the softmax attention with kernel-based approximations that scale linearly. Higher approximation error, but very fast.
Ring Attention: Distributes long-sequence attention across multiple GPUs. Infrastructure solution rather than algorithmic.

Index Share is most comparable to other indexer-sharing or pattern-reuse strategies. What distinguishes it is the specific claim about accuracy preservation — the 4-layer grouping appears chosen to retain model quality while maximizing compute reduction.

The Trend Toward Longer Contexts

The research community has been steadily pushing context lengths upward. GPT-4 launched with an 8K context window (alongside a limited-access 32K variant); current frontier models commonly offer 128K or more. The race to 1M and beyond is underway.

Each jump in context length requires new efficiency techniques, because approaches that work at 128K don’t necessarily scale to 1M. Index Share is an example of an optimization that was specifically designed for the extreme end of this range, where indexer overhead becomes the dominant compute cost.

Frequently Asked Questions

What is Index Share?

Index Share is a sparse attention optimization in which the “indexer” (the component that determines which tokens to attend to) runs once per group of four transformer layers instead of once per layer. The resulting index (the set of selected token positions) is shared across those four layers. Each layer still computes its own attention weights; only the selection step is reused. At 1M token context, this reduces total compute operations by approximately 2.9x.

How does sparse attention work in long-context models?

Sparse attention reduces the quadratic cost of standard attention by having each token attend only to a selected subset of the sequence rather than all tokens. Common selection strategies include local windows (nearby tokens), strided patterns (every k-th token), and content-based top-k selection (the most semantically relevant tokens). The tradeoff is that selection itself adds overhead — which Index Share reduces by reusing selections across layers.

Does Index Share hurt model accuracy?

Based on Zhipu AI’s implementation, the quality tradeoff from sharing indices across 4 consecutive layers appears to be minimal. The justification is that the most contextually relevant positions in a long sequence remain stable across adjacent transformer layers. Larger groups (sharing across more layers) would save more compute but risk more quality degradation. Four layers appears to be the practical sweet spot.

Why does the compute savings get larger at longer contexts?

At short context lengths, the indexer step is fast — scanning a few thousand positions is cheap. The attention computation over selected tokens is the dominant cost. At very long contexts (1M tokens), the indexer must score 1M positions, making it proportionally expensive. Index Share eliminates most of that indexer cost, and since indexer cost scales with sequence length, the absolute savings grow as context length grows.

What kinds of applications actually need 1M token context?

The clearest use cases are those that benefit from having large corpora in context simultaneously: full legal contract review, entire codebase analysis, synthesizing dozens of research papers, processing long customer interaction histories, or any task where chunking and retrieval would lose important cross-document relationships. At practical inference costs, these become routine rather than expensive one-off operations.

How is Index Share different from FlashAttention?

They address different parts of the attention pipeline. FlashAttention optimizes how attention is computed, specifically the memory access pattern, to reduce bandwidth cost during the matrix operations. Index Share optimizes what is attended to by amortizing the selection step across multiple layers. The two techniques are complementary and can be combined in the same model.

Key Takeaways

Standard transformer attention scales quadratically with sequence length, making 1M-token contexts prohibitively expensive without modifications.
Sparse attention solves the quadratic problem but introduces indexer overhead — the cost of deciding which tokens to attend to — which becomes dominant at very long sequences.
Index Share amortizes indexer cost by computing the selection once per 4-layer group and reusing those indices across layers, without sharing the attention computation itself.
At 1M token context, this yields 2.9x fewer compute operations compared to naive sparse attention, making genuinely long-context inference economically viable.
The technique scales: savings are largest exactly where they’re most needed, at the longest context lengths.
For teams building workflows on top of long-context models, platforms like MindStudio let you access these models and wire them into production pipelines without managing infrastructure yourself.
