Cache-Aware Streaming ASR: How NVIDIA Nemotron 3.5 Cuts Transcription Latency

Why Real-Time Transcription Is Harder Than It Looks

Transcribing speech sounds straightforward until you need it fast. Cache-aware streaming ASR is the technique that makes real-time, low-latency transcription practical at scale — and NVIDIA’s Nemotron 3.5 is one of the clearest demonstrations of how far this approach has come.

The core problem with automatic speech recognition (ASR) has always been a tradeoff: offline models are accurate but cannot produce output until the whole utterance has arrived, while streaming models respond quickly but sacrifice quality. Cache-aware streaming breaks that tradeoff by reusing intermediate encoder computations instead of throwing them away and starting over with each new audio chunk.

The result? Latency reductions of up to 17x compared to naive chunk-based streaming — without a meaningful hit to transcription accuracy. This article explains how that’s possible, what Nemotron 3.5 specifically brings to the table, and what it means for teams building voice-driven applications today.

The Core Problem: Streaming ASR Without Caching Is Wasteful

Before getting into Nemotron 3.5, it helps to understand what’s wrong with the simpler approach to streaming speech recognition.

Offline vs. Streaming ASR

Offline ASR works by waiting for a complete audio utterance, then running the full audio through an encoder and decoder in one pass. The model has full context — every word, pause, and phoneme — before it produces a single token. This is why offline models tend to be accurate. But it’s also why they’re useless for real-time applications: you can’t show someone a transcription of a conversation while it’s still happening if you need to wait for it to finish.

Streaming ASR processes audio in small chunks — typically 80 to 480 milliseconds at a time — and produces partial transcriptions as audio arrives. This is the approach used in voice assistants, live captioning systems, and call center analytics.

The Reprocessing Problem

The challenge with chunk-based streaming is context. A transformer-based encoder needs to “see” enough audio history to make accurate predictions. Sounds at the end of a phrase often depend on what came before — prosody, phonetic coarticulation, and disambiguation all require some backward context.

Naive streaming models handle this by including a “lookback window” of previous audio in each chunk. So when processing chunk five, the model might re-encode chunks two, three, and four along with chunk five. This ensures the model has enough context, but it means the same audio gets processed multiple times. The compute cost scales with context size, latency grows, and throughput drops.

This is the problem that cache-aware streaming solves.
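To make the reprocessing cost concrete, here is a minimal sketch that counts how many chunk-sized units of audio the encoder has to process over a stream, with and without a raw-audio lookback. The function name and the three-chunk lookback are illustrative assumptions, and the count deliberately ignores the (smaller) cost of attending over cached states.

```python
def chunks_encoded(num_chunks: int, lookback_chunks: int, cached: bool) -> int:
    """Count chunk-sized units of audio pushed through the encoder in total."""
    total = 0
    for i in range(num_chunks):
        if cached:
            total += 1  # only the newly arrived chunk is encoded
        else:
            # new chunk plus however many earlier chunks are re-encoded
            total += 1 + min(i, lookback_chunks)
    return total


naive = chunks_encoded(num_chunks=100, lookback_chunks=3, cached=False)
cached = chunks_encoded(num_chunks=100, lookback_chunks=3, cached=True)
print(naive, cached, naive / cached)
```

With a three-chunk lookback the naive loop encodes close to four times as much audio as the stream actually contains, and the ratio grows with the lookback size.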

How Cache-Aware Streaming Works

Cache-aware streaming treats the encoder’s intermediate representations as persistent state rather than disposable output.

What Gets Cached

When a conformer-based encoder processes an audio chunk, it produces two types of intermediate states that are useful to preserve:

Attention key-value (KV) cache: the key and value tensors produced by the self-attention layers. Queries are computed fresh for each new chunk and do not need to be stored. The cached tensors encode what the model “remembers” about previous audio.
Convolution cache: the intermediate activations from convolutional layers, which capture local acoustic patterns across chunk boundaries.

In a standard streaming setup, these states are discarded after each chunk. The next chunk has to recompute context from scratch (or from the raw audio lookback). In cache-aware streaming, these states are stored and passed into the next forward pass as inputs.

The Forward Pass With Caching

Here’s what that changes, step by step:

Chunk N arrives. The encoder processes it normally, producing encoder outputs and storing KV and convolution states.
Chunk N+1 arrives. Instead of re-encoding chunks N-3, N-2, N-1 as raw audio, the model retrieves the cached states from those chunks and incorporates them directly into the attention computation.
Only chunk N+1’s audio is newly encoded. Previous context is already represented in the cache — no recomputation needed.
Cache is updated with the new states from chunk N+1, ready for chunk N+2.
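The steps above can be sketched with a toy single-layer attention over scalar "frames". This is not the Nemotron or NeMo implementation: the weights, class name, and helper functions are hypothetical, there is only one layer, and each frame attends causally to itself plus a fixed number of earlier frames. It only illustrates that reading keys and values from a cache gives the same output as re-projecting the lookback window, with fewer projections.

```python
import math

# Toy scalar projection weights (hypothetical values, not from any real model).
WQ, WK, WV = 0.5, -1.2, 0.8


def attend(query, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scores = [query * k for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


class CachedLayer:
    """Single attention layer that keeps keys and values between chunks."""

    def __init__(self, left_frames):
        self.left_frames = left_frames
        self.keys, self.values = [], []
        self.projections = 0  # frames projected to keys/values so far

    def step(self, chunk):
        outputs = []
        for x in chunk:
            self.keys.append(x * WK)
            self.values.append(x * WV)
            self.projections += 1
            # Keep the current frame plus left_frames of history.
            self.keys = self.keys[-(self.left_frames + 1):]
            self.values = self.values[-(self.left_frames + 1):]
            outputs.append(attend(x * WQ, self.keys, self.values))
        return outputs


def naive_step(history, chunk, left_frames):
    """Re-project the lookback window from raw frames for every chunk."""
    lookback = history[-left_frames:] if left_frames > 0 else []
    window = lookback + chunk
    keys = [x * WK for x in window]
    values = [x * WV for x in window]
    outputs = []
    for j, x in enumerate(chunk):
        pos = len(lookback) + j
        start = max(0, pos - left_frames)
        outputs.append(attend(x * WQ, keys[start:pos + 1], values[start:pos + 1]))
    return outputs, len(window)


frames = [0.1 * i for i in range(12)]
chunks = [frames[i:i + 4] for i in range(0, 12, 4)]

layer = CachedLayer(left_frames=8)
cached_out, naive_out, naive_projections, history = [], [], 0, []
for chunk in chunks:
    cached_out += layer.step(chunk)
    out, count = naive_step(history, chunk, left_frames=8)
    naive_out += out
    naive_projections += count
    history += chunk

print(layer.projections, naive_projections)
```

Both paths produce the same outputs, but the cached layer projects each frame once while the naive path projects the lookback frames again on every chunk. In a real multi-layer encoder the savings compound, because every layer would otherwise be recomputed over the lookback.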

The result is that each chunk requires a fraction of the computation compared to the naive approach. The attention still attends over a meaningful history of context, but the work of computing that context has already been done.

Left Context and Right Context

Two parameters define how much history and future audio the model uses:

Left context: how many previous chunks the model attends to via the cache. More left context generally improves accuracy, at the cost of a larger cache and therefore more memory per stream.
Right context: how many future chunks the model waits for before producing output. This is where the latency-accuracy tradeoff lives. Zero right context means output is produced as soon as the current chunk has been processed; even one future chunk introduces wait time but often improves accuracy at chunk boundaries.

Nemotron 3.5 supports configurable left and right context, which means you can tune the model’s behavior based on your application’s requirements.
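As a rough illustration of how right context translates into delay, the sketch below estimates the algorithmic latency of a configuration: the time spent waiting for the current chunk to fill plus the time spent waiting for lookahead chunks. The function name is hypothetical, and the estimate excludes compute, network, and decoding time, so real end-to-end latency will be higher.

```python
def algorithmic_latency_ms(chunk_ms: int, right_chunks: int) -> int:
    """Worst-case wait before a frame can be emitted, ignoring compute time.

    The first frame of a chunk waits for the rest of its chunk to arrive,
    then for each lookahead chunk the model is configured to use.
    """
    if chunk_ms <= 0 or right_chunks < 0:
        raise ValueError("chunk_ms must be positive and right_chunks non-negative")
    return chunk_ms * (1 + right_chunks)


for right in (0, 1, 2):
    print(f"160 ms chunks, right context {right}: "
          f"{algorithmic_latency_ms(160, right)} ms")
```

Left context does not appear in this estimate because cached history is already available when a chunk arrives; it affects memory and per-chunk compute rather than waiting time.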

NVIDIA Nemotron 3.5: What’s Different

NVIDIA’s Nemotron 3.5 is built on the FastConformer architecture, which is NVIDIA’s optimized variant of the standard Conformer model. Conformer models combine the sequence modeling of transformers with the local feature extraction of convolutional neural networks — a combination that consistently outperforms either architecture alone for speech recognition.

FastConformer Efficiency Gains

The “fast” in FastConformer comes from a modified subsampling strategy. Standard Conformer models apply 4x subsampling to audio features, which reduces sequence length by a factor of four before the transformer layers. FastConformer pushes this to 8x subsampling, cutting sequence length in half again.

This matters for streaming because shorter sequences mean fewer attention operations per chunk, which directly reduces compute time. The architectural change alone provides meaningful throughput improvements before any caching is applied.
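The effect of subsampling on sequence length is simple arithmetic. The sketch below assumes a 10 ms feature hop (a common front-end setting, but an assumption here rather than a figure from NVIDIA) and counts query-key pairs for unrestricted self-attention; with a limited attention context the saving per chunk is smaller than the quadratic figure shown.

```python
def encoder_frames(audio_seconds: float, subsampling: int, hop_ms: int = 10) -> int:
    """Frames reaching the attention layers after subsampling."""
    feature_frames = int(audio_seconds * 1000 / hop_ms)
    return feature_frames // subsampling


def full_attention_pairs(num_frames: int) -> int:
    """Query-key pairs when every frame attends to every frame."""
    return num_frames * num_frames


frames_4x = encoder_frames(10, subsampling=4)
frames_8x = encoder_frames(10, subsampling=8)
print(frames_4x, frames_8x)
print(full_attention_pairs(frames_4x) / full_attention_pairs(frames_8x))
```

For ten seconds of audio under these assumptions, 8x subsampling leaves half as many frames as 4x and a quarter as many attention pairs.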

Cache-Aware Streaming on Top

Nemotron 3.5 layers cache-aware streaming on top of the FastConformer base, and this combination produces the 17x latency reduction figure cited in NVIDIA’s benchmarks. To be precise about what that number means:

The baseline is chunk-based streaming without caching, using a large audio lookback window to maintain context
The comparison is cache-aware streaming with the same effective context size
The 17x figure refers to an improvement in real-time factor (RTF), which is the ratio of processing time to audio duration

In practice, Nemotron 3.5 can achieve RTF values well below 1.0 on modern GPUs, meaning the model transcribes audio faster than it’s being spoken. That’s the threshold for genuinely real-time operation.

Accuracy Held Intact

One of the more notable aspects of cache-aware streaming in Nemotron 3.5 is how little accuracy degrades compared to the offline version of the same model. NVIDIA’s evaluations on standard benchmarks like LibriSpeech show word error rates (WER) that remain competitive with offline models when using reasonable left context windows (typically 70 to 140 frames of cached history).

The key insight is that the cache-aware approach doesn’t change what information the model has access to — it just changes how that information is accessed. The same history is available; it’s just stored efficiently rather than recomputed.
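For readers comparing streaming and offline accuracy themselves, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition with no text normalization (casing, punctuation), which real evaluations usually apply first; the example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))
```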

Chunk Size and Latency: The Practical Tradeoff

Chunk size is the most direct lever for controlling latency in a streaming ASR system.

Choosing Chunk Size

Smaller chunks produce output faster but may catch the model at phonetically ambiguous points. Larger chunks give the model more signal before making predictions but add delay.

Common chunk sizes for production streaming ASR:

Chunk Size	Approximate Latency	Best For
80–160 ms	Very low (~100 ms)	Voice assistants, gaming
160–320 ms	Low (~200–400 ms)	Live captioning, conferencing
320–480 ms	Moderate (~400–600 ms)	Call center analytics, transcription review

Nemotron 3.5 supports all of these configurations. The cache mechanism works regardless of chunk size — the tradeoff is just about how often new audio triggers a forward pass.
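To see how chunk size determines how often a forward pass is triggered, here is a small sketch that slices a buffer of audio samples into fixed-duration chunks. The 16 kHz sample rate and the function name are assumptions for illustration; a production pipeline would read from a live audio source rather than a list.

```python
def iter_chunks(samples, sample_rate=16000, chunk_ms=160):
    """Yield consecutive chunks of chunk_ms milliseconds; the last may be short."""
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


one_second = [0] * 16000  # one second of silence at 16 kHz
pieces = list(iter_chunks(one_second, chunk_ms=160))
print(len(pieces), len(pieces[0]), len(pieces[-1]))
```

Halving the chunk duration doubles the number of forward passes per second of audio, which is the throughput side of the latency tradeoff in the table above.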

Right Context as a Fine-Tuning Knob

Adding even one or two chunks of right context (lookahead) meaningfully improves accuracy at word boundaries. For applications where a 200–400 ms additional delay is acceptable — live meeting transcription, for example — enabling right context is usually worth it. For voice assistant interactions where perceived responsiveness matters, zero right context is often preferable even at a small accuracy cost.

Where Cache-Aware Streaming Gets Used

The latency and throughput properties of cache-aware streaming ASR make it appropriate for a specific class of applications.

Live Captioning and Accessibility

Live captions for video calls, conferences, and broadcasts require word-level output within a few hundred milliseconds to remain synchronized with speech. Cache-aware streaming enables this at lower compute cost, which matters when you’re running transcription across thousands of simultaneous streams.

Voice-Driven Interfaces

Any application where users speak commands or queries — voice assistants, dictation tools, voice search — benefits from low-latency ASR. Perceived responsiveness is tied directly to transcription latency, and cache-aware streaming helps keep that latency in a range that feels natural.

Call Center and Conversation Analytics

Real-time call transcription enables live agent assist tools — systems that surface relevant knowledge base articles or suggest responses as a conversation unfolds. These systems need transcription to be fast enough to be useful before the conversation has moved on, which typically means under 500 ms end-to-end.

Subtitle Generation at Scale

Video platforms that need to generate subtitles on upload or during live streams can use streaming ASR to begin producing output before a full video is processed. This reduces time-to-publish for captioned content and is more efficient than waiting for batch processing jobs.

Deploying Nemotron 3.5 in Practice

NVIDIA distributes Nemotron 3.5 through the NeMo framework, which provides Python APIs for both offline and streaming inference. Here’s what a basic streaming deployment involves.

Infrastructure Requirements

Cache-aware streaming ASR requires keeping model state in memory between chunk forward passes. This means:

Stateful inference — each audio stream needs its own cache, so you need per-stream state management on your inference server
GPU memory — the KV cache for multiple simultaneous streams adds memory overhead; this needs to be accounted for when sizing hardware
Low-latency audio ingestion — the audio pipeline feeding chunks to the model needs to be fast enough that the model isn’t waiting on the audio layer
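The stateful-inference requirement can be pictured as a registry that maps each audio stream to its own cache. The sketch below is a plain-Python illustration with hypothetical class names and list placeholders where a real server would hold GPU tensors; it is not how Triton or NeMo manage state internally.

```python
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Per-stream encoder state (lists stand in for real tensors)."""
    attention_cache: list = field(default_factory=list)
    conv_cache: list = field(default_factory=list)
    chunks_seen: int = 0


class StreamRegistry:
    """Keeps one independent StreamState per active audio stream."""

    def __init__(self):
        self._streams = {}

    def open(self, stream_id):
        if stream_id in self._streams:
            raise KeyError(f"stream {stream_id!r} is already open")
        self._streams[stream_id] = StreamState()
        return self._streams[stream_id]

    def get(self, stream_id):
        return self._streams[stream_id]

    def close(self, stream_id):
        # Dropping the state is what frees this stream's cache memory.
        self._streams.pop(stream_id, None)

    def __len__(self):
        return len(self._streams)


registry = StreamRegistry()
registry.open("call-1")
registry.open("call-2")
registry.get("call-1").attention_cache.append("states for chunk 0")
registry.get("call-1").chunks_seen += 1
print(len(registry), registry.get("call-2").chunks_seen)
```

Closing streams promptly matters as much as opening them: cache memory scales with the number of open streams, so leaked state reduces how many concurrent streams a GPU can serve.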

For most production deployments, NVIDIA Triton Inference Server is the recommended backend, with the NeMo streaming ASR backend handling state management.

Configuring the Cache

When loading a cache-aware streaming model from NeMo, the key parameters to set are:

chunk_size — frames of audio per forward pass
left_chunks — number of previous chunks to retain in cache
right_chunks — number of lookahead chunks (set to 0 for minimum latency)

These can be set at model load time and affect the balance between latency, accuracy, and memory usage.
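One way to keep these three settings together and validated in application code is a small configuration object. The sketch below reuses the parameter names from the list above but is a plain dataclass, not the NeMo API; the actual configuration keys and loading calls should be taken from the NeMo documentation for the model in use. The example values are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingConfig:
    """Hypothetical container for the three streaming parameters."""
    chunk_size: int        # frames of audio per forward pass
    left_chunks: int       # previous chunks retained in the cache
    right_chunks: int = 0  # lookahead chunks; 0 for minimum latency

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if self.left_chunks < 0 or self.right_chunks < 0:
            raise ValueError("context sizes cannot be negative")

    @property
    def cached_frames(self) -> int:
        """Frames of history held in the cache for one stream."""
        return self.chunk_size * self.left_chunks

    @property
    def lookahead_frames(self) -> int:
        """Frames the model waits for beyond the current chunk."""
        return self.chunk_size * self.right_chunks


low_latency = StreamingConfig(chunk_size=16, left_chunks=4)
captioning = StreamingConfig(chunk_size=16, left_chunks=8, right_chunks=1)
print(low_latency.cached_frames, captioning.cached_frames, captioning.lookahead_frames)
```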

Monitoring RTF in Production

Real-time factor is the primary operational metric for streaming ASR. If RTF approaches or exceeds 1.0, the model is falling behind real-time audio and latency will accumulate. GPU utilization, queue depth, and chunk processing time are the signals to watch.
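A rolling RTF measurement can be kept with very little code. The sketch below tracks processing time and audio duration over the most recent chunks and flags when the ratio approaches 1.0. The class name, window size, and 0.8 alert threshold are illustrative assumptions, and the timings in the example are made up rather than measured.

```python
from collections import deque


class RtfMonitor:
    """Rolling real-time factor over the most recent chunks."""

    def __init__(self, window: int = 50, alert_threshold: float = 0.8):
        self.samples = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, processing_s: float, audio_s: float) -> None:
        """Store how long one chunk took to process and how much audio it held."""
        self.samples.append((processing_s, audio_s))

    def rtf(self) -> float:
        processing = sum(p for p, _ in self.samples)
        audio = sum(a for _, a in self.samples)
        return processing / audio if audio > 0 else 0.0

    def falling_behind(self) -> bool:
        return self.rtf() >= self.alert_threshold


healthy = RtfMonitor()
for _ in range(10):
    healthy.record(processing_s=0.016, audio_s=0.160)

overloaded = RtfMonitor()
overloaded.record(processing_s=0.200, audio_s=0.160)

print(round(healthy.rtf(), 3), healthy.falling_behind())
print(round(overloaded.rtf(), 3), overloaded.falling_behind())
```

Alerting below 1.0 leaves headroom to shed load or scale out before latency starts to accumulate.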

Building Voice Transcription Workflows With MindStudio

Understanding how cache-aware streaming ASR works at the model level is one thing. Connecting it to the rest of your application stack — routing transcripts to downstream tools, triggering actions based on spoken content, or building interfaces around voice input — is where most of the integration work actually lives.

MindStudio’s visual workflow builder makes it practical to connect ASR output to downstream logic without writing a lot of glue code. With access to 200+ AI models and 1,000+ integrations with tools like Slack, Notion, HubSpot, and Google Workspace, you can build agents that act on transcribed content — not just produce it.

A few examples of what this looks like in practice:

Meeting transcript to action items — an agent that receives a call transcript, extracts tasks and decisions, and writes them to a project management tool
Voice-triggered workflow automation — a transcription pipeline where certain keywords or phrases trigger downstream actions (escalate a ticket, send a summary, update a CRM record)
Multilingual transcription and routing — an agent that detects language from transcribed text and routes the output to the appropriate team or tool

MindStudio handles the orchestration layer — connecting model outputs to business tools, managing state across steps, and providing a UI for non-technical users to configure and monitor workflows — without requiring you to build that infrastructure yourself. You can try MindStudio free at mindstudio.ai.

If you’re building voice-enabled AI agents, MindStudio’s guide to building AI agents covers the foundational workflow patterns that apply here.

Frequently Asked Questions

What is cache-aware streaming ASR?

Cache-aware streaming ASR is a method for real-time speech recognition that stores intermediate encoder states (the “cache”) between audio chunk forward passes. Instead of reprocessing previous audio context each time a new chunk arrives, the model reads from the cache. This dramatically reduces compute per chunk and lowers transcription latency, often by an order of magnitude compared to naive chunk-based streaming.

How does Nemotron 3.5 differ from standard Conformer-based ASR models?

Nemotron 3.5 is built on FastConformer, which uses 8x audio subsampling instead of the standard 4x. This halves the sequence length that transformer layers process, reducing compute per forward pass. Combined with cache-aware streaming, this architecture achieves significantly better real-time performance than standard Conformer models without a meaningful accuracy penalty.

What chunk size should I use for real-time transcription?

Chunk size depends on your latency budget. Chunks of 80–160 ms give the lowest latency (around 100 ms) and work well for voice interfaces. Chunks of 160–320 ms balance latency and accuracy for live captioning. Larger chunks (320–480 ms) are appropriate when accuracy at word boundaries matters more than immediate output, such as in call analytics. Most production deployments start at 160 ms and tune from there.

What is the difference between streaming and offline ASR accuracy?

Offline ASR has full context — the entire utterance — when making predictions. Streaming ASR only has what’s arrived so far, plus any lookahead configured. The accuracy gap depends on how much left context the model can access (via the cache) and whether right context is enabled. With reasonable configurations, Nemotron 3.5’s streaming mode achieves word error rates within a few percentage points of its offline equivalent on standard benchmarks.

What does “real-time factor” mean for ASR?

Real-time factor (RTF) is the ratio of processing time to audio duration. An RTF of 0.1 means the model processes audio ten times faster than it’s produced — highly efficient. An RTF of 1.0 means the model processes audio exactly as fast as it arrives — the minimum for real-time operation. An RTF above 1.0 means the model falls behind. Cache-aware streaming significantly lowers RTF compared to naive streaming by cutting per-chunk compute requirements.

Can cache-aware streaming ASR run on CPUs, or does it require GPUs?

Cache-aware streaming ASR can technically run on CPUs, but GPUs are strongly preferred for production workloads. The attention and convolution operations in FastConformer benefit significantly from GPU parallelism. On modern mid-range GPUs, Nemotron 3.5 achieves RTF values well below 0.1, allowing a single GPU to handle many simultaneous audio streams. CPU inference is viable for development and low-volume use cases but will struggle to maintain real-time performance at scale.

Key Takeaways

Cache-aware streaming reuses encoder states instead of recomputing context from raw audio — this is the core reason it achieves up to 17x lower latency than naive chunk-based streaming.
FastConformer’s 8x subsampling gives Nemotron 3.5 a base efficiency advantage before any caching is applied, making the combined architecture particularly well-suited for real-time workloads.
Chunk size, left context, and right context are the three tuning parameters that let you configure the latency-accuracy tradeoff for your specific application.
Stateful inference is the infrastructure requirement that distinguishes streaming ASR deployments from offline batch transcription — each audio stream needs its own persistent cache.
Accuracy loss compared to offline models is small when sufficient left context is available, making cache-aware streaming practical for production transcription, not just demos.

If you’re building applications on top of ASR — routing transcripts, triggering workflows, or connecting voice input to business tools — MindStudio offers a no-code layer for the orchestration work, so you can focus on what the transcription enables rather than how to wire it together. Check out how teams are using MindStudio to automate multi-step AI workflows without managing infrastructure from scratch.