What Is NVIDIA Nemotron 3.5 ASR? The Streaming Speech-to-Text Model for AI Agents

A New Standard for Real-Time Speech Recognition in AI Agents

Voice is one of the fastest-growing input modalities for AI systems — and one of the hardest to get right. The gap between “speech recognition that kind of works” and “speech recognition that works reliably in production” is wide. NVIDIA Nemotron 3.5 ASR is designed to close that gap.

Nemotron 3.5 ASR is a 600M-parameter streaming automatic speech recognition (ASR) model built for real-time, agent-ready transcription. It supports 40+ languages, delivers low-latency output through cache-aware streaming, and includes word boosting for domain-specific vocabulary. These features matter when you’re building AI agents that need to hear and respond accurately in the real world.

This article covers what NVIDIA Nemotron 3.5 ASR is, how it works technically, where it fits into modern AI agent architectures, and how platforms like MindStudio make it practical to deploy without infrastructure overhead.

What Is NVIDIA Nemotron 3.5 ASR?

NVIDIA Nemotron 3.5 ASR is a production-grade automatic speech recognition model released by NVIDIA as part of its broader Nemotron model family. It’s designed for real-time transcription use cases — particularly those involving AI agents, voice assistants, customer service automation, and any workflow where spoken input needs to be converted to text quickly and accurately.

The model is available through NVIDIA NIM (NVIDIA Inference Microservices), NVIDIA’s platform for deploying optimized AI models as microservices. NIM handles the serving infrastructure, so developers interact with the model through a simple API.

A few key specs at a glance:

Parameters: ~600 million
Architecture: FastConformer-based with CTC (Connectionist Temporal Classification) decoding
Languages: 40+ supported
Streaming: Yes — cache-aware streaming with configurable chunk sizes
Word boosting: Yes — runtime vocabulary augmentation for domain-specific terms
Timestamps: Word- and segment-level timestamp prediction
Long-form audio: Handles extended recordings without truncation issues

The “3.5” version represents a meaningful improvement over earlier Nemotron ASR iterations — both in multilingual coverage and in the streaming capabilities that make it viable for live, agent-facing deployments.

How Cache-Aware Streaming Works

Most speech recognition models process audio in one of two modes: offline (the full audio clip is sent at once, then transcribed) or streaming (audio is processed in real-time chunks as it arrives). Offline mode is more accurate but introduces unacceptable latency for live use cases. Simple streaming models reduce latency but often sacrifice accuracy — particularly at chunk boundaries where context is lost.

NVIDIA Nemotron 3.5 ASR uses cache-aware streaming to solve this tradeoff.

The Problem Cache-Aware Streaming Solves

When you split audio into small chunks for real-time processing, each chunk loses context about what came before it. The model can’t “see” earlier audio, so words near chunk boundaries often get garbled or misrecognized. Standard streaming models compensate with larger chunk sizes — which helps accuracy but hurts latency.

Cache-aware streaming keeps a rolling cache of processed context from prior chunks. When a new chunk arrives, the model has access to encoded representations of previous audio — not the raw audio itself, but learned internal states. This lets it maintain temporal context across chunk boundaries without reprocessing old audio.
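
To make the idea concrete, here is a minimal toy sketch, not the model's actual encoder. It assumes a single "layer" that averages each frame with a few frames of left context. The names `StreamingEncoder` and `causal_window_mean` are hypothetical. In the real model the cache holds learned intermediate activations; here the layer's recent inputs stand in for them.

```python
from collections import deque

def causal_window_mean(frames, left_context):
    """Offline reference: average each frame with up to `left_context` earlier frames."""
    out = []
    for i in range(len(frames)):
        window = frames[max(0, i - left_context): i + 1]
        out.append(sum(window) / len(window))
    return out

class StreamingEncoder:
    """Toy stand-in for one cache-aware layer (hypothetical, for illustration only)."""

    def __init__(self, left_context):
        # Rolling cache of the most recent frames seen by this layer.
        self.cache = deque(maxlen=left_context)

    def process_chunk(self, chunk):
        out = []
        for frame in chunk:
            window = list(self.cache) + [frame]
            out.append(sum(window) / len(window))
            self.cache.append(frame)  # oldest entry is dropped automatically
        return out

frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
offline = causal_window_mean(frames, left_context=3)

# Streaming with a cache: chunks of 3 frames, context carried across boundaries.
encoder = StreamingEncoder(left_context=3)
streamed = []
for start in range(0, len(frames), 3):
    streamed.extend(encoder.process_chunk(frames[start:start + 3]))

# Naive streaming: every chunk is processed in isolation, so context is lost.
no_cache = []
for start in range(0, len(frames), 3):
    no_cache.extend(causal_window_mean(frames[start:start + 3], left_context=3))

print(streamed == offline)   # cached streaming matches the offline result
print(no_cache == offline)   # isolated chunks diverge at chunk boundaries
```

In this toy, the cached run reproduces the offline output exactly, while the isolated-chunk run diverges at the first frame of every new chunk, which is the boundary problem described above.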

Why This Matters for AI Agents

AI agents that take voice input need low latency. A customer service agent, a real-time meeting transcription tool, or a voice-controlled workflow automation can’t wait 3–5 seconds for a sentence to finish processing. Cache-aware streaming in Nemotron 3.5 ASR enables:

Sub-second chunk processing at configurable chunk sizes
Consistent accuracy even at low latency settings
Stable partial transcripts that agents can begin acting on before the full utterance completes

This is the architecture difference that separates models built for production deployment from those built for benchmarks.

Word Boosting: Teaching the Model Your Vocabulary

Generic speech recognition models struggle with proper nouns, brand names, technical jargon, and industry-specific terminology. A model trained on general conversational data will often transcribe “Salesforce” as “sales force,” mishear product names, or completely botch rare medical or legal terms.

NVIDIA Nemotron 3.5 ASR includes a word boosting feature that addresses this at inference time — no fine-tuning required.

How Word Boosting Works

Word boosting is a bias mechanism applied during the decoding phase. You provide a list of words or phrases you want the model to favor, along with a boost score. During transcription, the decoder assigns higher probability to those terms when the audio is ambiguous between several options.

In practice, this looks like passing a list alongside your audio input:

boost_phrases: ["Nemotron", "FastConformer", "NVIDIA NIM", "CTC decoding"]
boost_score: 10.0

The model uses these as soft hints — it won’t force incorrect transcription, but when the audio is genuinely ambiguous, it tilts toward your boosted vocabulary.
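
The "soft hint" behavior can be sketched with a simplified stand-in. A production decoder would typically apply the bias while decoding; this example instead rescores a short list of finished candidate transcripts, which is easier to read. The function name `rescore`, the candidate texts, and all scores are hypothetical, and the score scale is arbitrary.

```python
def rescore(hypotheses, boost_phrases, boost_score):
    """Pick the best candidate after adding `boost_score` per boosted phrase found.

    `hypotheses` is a list of (text, log_score) pairs; higher scores are better.
    """
    def boosted(item):
        text, log_score = item
        hits = sum(1 for phrase in boost_phrases if phrase.lower() in text.lower())
        return log_score + boost_score * hits

    return max(hypotheses, key=boosted)[0]

# Acoustically close candidates: the boost tips the balance.
ambiguous = [("open a sales force ticket", -4.1), ("open a Salesforce ticket", -4.3)]
# One candidate is far more likely: a modest boost does not override it.
clear = [("cancel the order", -2.0), ("cancel the Salesforce", -9.0)]

print(rescore(ambiguous, [], 1.0))
print(rescore(ambiguous, ["Salesforce"], 1.0))
print(rescore(clear, ["Salesforce"], 1.0))
```

The boost size matters. In this sketch, a boost larger than the gap between two candidates would flip the clear case too, so boost scores are worth tuning against real audio rather than set as high as possible.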

Real-World Applications

Word boosting is particularly valuable in:

Healthcare: Boosting drug names, anatomical terms, procedure codes
Legal: Case identifiers, legal terminology, firm names
Customer support: Product SKUs, feature names, service tier names
Finance: Ticker symbols, fund names, regulatory terms
Enterprise software: Internal tool names, team names, technical acronyms

This removes a major deployment blocker for domain-specific voice AI. Teams no longer need to fine-tune a custom model every time they introduce new terminology — they adjust the boost list at runtime.

Multilingual Support: 40+ Languages in a Single Model

Nemotron 3.5 ASR supports transcription in 40+ languages without requiring separate model instances per language. The model can either detect the language automatically or accept an explicit language setting.

Supported languages span major world languages including English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Chinese (Mandarin), Japanese, Korean, Arabic, Hindi, and many others.

Why a Single Multilingual Model Matters

Running separate ASR models per language creates infrastructure complexity — more endpoints to manage, more compute to provision, and more latency when language identification is uncertain. A single multilingual model simplifies all of this.

For global enterprises running AI agents across regions, or for any product with an international user base, this consolidation is operationally significant: one NIM endpoint, one integration, and one set of monitoring, regardless of which of the 40+ supported languages your users speak.

Language Detection and Switching

Nemotron 3.5 ASR can perform language identification automatically if the target language isn’t specified. For use cases where speakers might switch between languages mid-session — multilingual customer support, for instance — this flexibility is essential.

The FastConformer Architecture Under the Hood

The model is built on FastConformer, NVIDIA’s optimized Conformer architecture. Understanding why this architecture matters helps explain the model’s performance characteristics.

Conformer: Combining CNN and Transformer

Standard Transformer models process entire sequences globally but can miss fine-grained local acoustic patterns. Standard CNNs are great at local patterns but struggle with long-range dependencies. The Conformer architecture (introduced by Google in 2020) combines both: self-attention for global context, depthwise separable convolutions for local acoustic features.

FastConformer’s Improvements

FastConformer reduces computational cost by introducing strided subsampling early in the network, shrinking the sequence length before the expensive attention layers. This yields:

8x subsampling of the input frame sequence before the attention blocks (the original Conformer uses 4x)
Significantly faster inference at comparable accuracy
Better suitability for streaming, where compute latency is critical
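
A quick back-of-the-envelope calculation shows why the subsampling factor matters. It assumes a 10 ms feature hop (a common choice, not stated in this article) and full self-attention, whose cost grows with the square of the sequence length. The function name `attention_frames` is made up for this example.

```python
def attention_frames(audio_seconds, frame_hop_ms, subsampling):
    """Number of frames that reach the attention blocks after subsampling."""
    input_frames = int(audio_seconds * 1000 / frame_hop_ms)
    return input_frames // subsampling

audio_seconds = 10  # a ten-second utterance

frames_4x = attention_frames(audio_seconds, frame_hop_ms=10, subsampling=4)
frames_8x = attention_frames(audio_seconds, frame_hop_ms=10, subsampling=8)

# Full self-attention compares every frame with every other frame.
pairs_4x = frames_4x ** 2
pairs_8x = frames_8x ** 2

print(frames_4x, frames_8x)   # 250 125
print(pairs_4x, pairs_8x)     # 62500 15625
print(pairs_4x / pairs_8x)    # 4.0
```

Under these assumptions, halving the sequence length cuts the number of attention comparisons by a factor of four.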

At 600M parameters, Nemotron 3.5 ASR is large enough to handle complex acoustic environments and multilingual data, but optimized enough to run efficiently through NVIDIA NIM’s serving infrastructure.

CTC Decoding

The model uses CTC (Connectionist Temporal Classification) as its output mechanism. CTC doesn’t require aligned training labels — it learns alignment implicitly. For streaming applications, CTC has a key advantage: it produces outputs incrementally as audio is processed, rather than waiting for an end-of-sequence signal.

This makes CTC well-suited for the real-time use cases Nemotron 3.5 ASR targets. Combined with the cache-aware streaming architecture, CTC decoding ensures the model can emit partial transcripts reliably as audio chunks arrive.
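
The incremental behavior is easiest to see in greedy CTC decoding, which collapses repeated labels and drops blanks. The sketch below is a generic illustration rather than NVIDIA's decoder. It assumes the per-frame label IDs have already been chosen (for example by taking the most likely label per frame), and the tiny vocabulary and the class name `StreamingCTCDecoder` are hypothetical.

```python
BLANK = 0  # assumed index of the CTC blank label

class StreamingCTCDecoder:
    """Greedy CTC collapse that remembers the last label across chunks."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.prev = BLANK

    def decode_chunk(self, frame_ids):
        out = []
        for idx in frame_ids:
            # Emit a symbol only when it is not blank and not a repeat.
            if idx != BLANK and idx != self.prev:
                out.append(self.vocab[idx])
            self.prev = idx
        return "".join(out)

vocab = {1: "c", 2: "a", 3: "t", 4: "l"}  # hypothetical character vocabulary

# Frame labels for "cat", split so a repeated label straddles the chunk boundary.
decoder = StreamingCTCDecoder(vocab)
partials = [decoder.decode_chunk([1, 1, 0, 2]), decoder.decode_chunk([2, 2, 0, 3, 3])]
print(partials)            # ['ca', 't']
print("".join(partials))   # cat

# A blank between identical labels keeps genuine double letters.
print(StreamingCTCDecoder(vocab).decode_chunk([2, 4, 0, 4]))  # all
```

Carrying `prev` across chunks is what stops a label that spans a boundary from being emitted twice.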

Where Nemotron 3.5 ASR Fits in AI Agent Pipelines

Speech recognition is a front-end component in most voice-enabled AI agent architectures. Here’s where Nemotron 3.5 ASR slots into a typical pipeline:

User speaks → ASR (Nemotron 3.5) → Text → LLM reasoning → Action

But the details matter. In agentic contexts, the ASR layer isn’t just a dumb transcription step — it affects everything downstream.

Latency Compounds Across the Pipeline

An AI agent pipeline involves multiple sequential steps: transcription, LLM inference, tool calls, response generation, and sometimes text-to-speech for output. Each step adds latency. If your ASR model is slow, everything downstream is delayed.

Nemotron 3.5 ASR’s streaming capability means the LLM can begin processing as soon as the first coherent chunk is transcribed — it doesn’t need to wait for the speaker to finish their entire utterance. In practice, this enables perceived response times that feel much closer to human conversation.
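
Here is a rough sketch of what acting on partial transcripts can look like. The generator stands in for a streaming ASR client, each "chunk" is simplified to one recognized word, and the trigger phrase, intent name, and function names are all hypothetical.

```python
def partial_transcripts(chunks):
    """Yield the growing transcript after each chunk (stand-in for streaming ASR)."""
    words = []
    for chunk in chunks:
        words.append(chunk)
        yield " ".join(words)

def first_actionable(partials, intents):
    """Return (intent, chunks_seen) as soon as any trigger phrase appears."""
    seen = 0
    for seen, text in enumerate(partials, start=1):
        for phrase, intent in intents.items():
            if phrase in text:
                return intent, seen
    return None, seen

chunks = ["please", "cancel", "the", "order", "for", "my", "account"]
intents = {"cancel the order": "cancel_order"}

intent, seen = first_actionable(partial_transcripts(chunks), intents)
print(intent, seen, len(chunks))   # cancel_order 4 7
```

In this toy run the downstream step has what it needs after four of seven chunks. A real agent would usually wait for a stable partial or confirm before taking an irreversible action, since later words can change the meaning.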

Accuracy Affects Agent Reliability

An agent that mishears “cancel the order” as “cancel the border” will take incorrect actions. In high-stakes contexts — healthcare, finance, customer operations — transcription errors cascade into real problems. Nemotron 3.5 ASR’s combination of architecture quality, word boosting, and multilingual training addresses the accuracy side of this equation directly.

Timestamps Enable Temporal Reasoning

Nemotron 3.5 ASR outputs word-level and segment-level timestamps. For agents that need to operate on specific parts of a recording — summarizing a meeting, extracting action items from a call, flagging compliance-relevant statements — timestamps make it possible to anchor agent outputs to exact moments in the audio.
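
As an illustration, suppose the transcript arrives as a list of words with start and end times in seconds. The field names `word`, `start`, and `end` are assumptions for this sketch, not a documented response schema, and `find_phrase` is a made-up helper.

```python
def find_phrase(words, phrase):
    """Return (start, end) in seconds for the first occurrence of `phrase`, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [w["word"].lower() for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    return None

# Hypothetical word-level output for a short utterance.
words = [
    {"word": "please", "start": 0.00, "end": 0.32},
    {"word": "cancel", "start": 0.40, "end": 0.81},
    {"word": "the", "start": 0.85, "end": 0.93},
    {"word": "order", "start": 0.97, "end": 1.40},
]

print(find_phrase(words, "cancel the order"))   # (0.4, 1.4)
print(find_phrase(words, "refund"))             # None
```

An agent flagging a compliance-relevant statement could store that span alongside its summary so a reviewer can jump straight to the audio.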

How to Access NVIDIA Nemotron 3.5 ASR

The model is available through NVIDIA NIM, which exposes it as a REST API endpoint. The general workflow:

Get an NVIDIA API key through the NVIDIA developer portal
Call the NIM endpoint with your audio payload (supports streaming WebSocket connections for real-time use cases)
Receive transcription results with word-level data, timestamps, and confidence scores

For local deployment, NVIDIA also supports self-hosted NIM containers, which run on NVIDIA GPU infrastructure and give teams full control over data residency and scaling.

The API follows standard patterns familiar to developers who’ve worked with OpenAI Whisper or Google Speech-to-Text — chunked audio input, JSON transcript output, configurable parameters for language and boosting.
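
On the client side, "chunked audio input" usually means slicing raw audio into fixed-duration pieces before sending them. The sketch below covers only that slicing step and does not call any NVIDIA endpoint. It assumes 16 kHz, 16-bit mono PCM and a 160 ms chunk, none of which is specified in this article, and `pcm_chunks` is a hypothetical helper.

```python
def pcm_chunks(pcm_bytes, chunk_ms, sample_rate=16000, bytes_per_sample=2):
    """Split mono PCM audio into fixed-duration chunks; the last one may be shorter."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[offset:offset + chunk_bytes]

one_second = bytes(16000 * 2)  # one second of silence at the assumed format
chunks = list(pcm_chunks(one_second, chunk_ms=160))

print(len(chunks))       # 7
print(len(chunks[0]))    # 5120
print(len(chunks[-1]))   # 1280
```

Each chunk would then be sent over whatever streaming connection your deployment exposes, with smaller chunks trading throughput for lower latency.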

Building Voice AI Workflows with MindStudio

If you want to put Nemotron 3.5 ASR — or any ASR model — to work in an actual AI agent workflow without managing infrastructure yourself, MindStudio is a practical path forward.

MindStudio is a no-code platform for building AI agents and automated workflows. It gives you access to 200+ AI models — including speech, language, and vision models — through a visual builder, without API keys or separate accounts for each model.

For voice AI use cases, this means you can:

Connect an ASR model as the input layer of a workflow
Pass the transcript to an LLM for reasoning, summarization, or action
Trigger downstream steps — send emails, update CRM records, create support tickets, post to Slack — using 1,000+ pre-built integrations
Build a full voice-enabled agent in an afternoon, not a sprint

For example, a customer support team could build a workflow that: transcribes incoming call recordings → extracts key issues and sentiment → creates a structured support ticket in Salesforce → routes to the right team — all without writing a line of code.

If you’re a developer who wants more control, MindStudio also supports custom JavaScript and Python functions, so you can drop into code where needed while still using the platform for orchestration and integrations.

You can start building for free at mindstudio.ai.

Nemotron 3.5 ASR vs. Other ASR Models

It’s worth briefly situating Nemotron 3.5 ASR relative to other common options.

Model	Streaming	Multilingual	Word Boosting	Deployment
NVIDIA Nemotron 3.5 ASR	Yes (cache-aware)	40+ languages	Yes	NIM API / self-hosted
OpenAI Whisper (large-v3)	No (offline only)	99 languages	No	Self-hosted / API
Google Speech-to-Text	Yes	125+ languages	Yes (via phrase hints)	Google Cloud
AWS Transcribe	Yes	38 languages	Yes (custom vocabulary)	AWS
Deepgram Nova-2	Yes	Limited	Yes	Deepgram API

Whisper is widely used and excellent at offline transcription, but its lack of native streaming support limits it for real-time agent applications. Google and AWS offer competitive streaming products but come with cloud vendor lock-in. Deepgram is strong for English-focused use cases.

Nemotron 3.5 ASR’s differentiator is the combination of enterprise-grade streaming, broad multilingual coverage, and the NVIDIA ecosystem — particularly for teams already running NVIDIA GPU infrastructure or building on NVIDIA’s AI platform stack.

Frequently Asked Questions

What is NVIDIA Nemotron 3.5 ASR?

NVIDIA Nemotron 3.5 ASR is a 600M parameter automatic speech recognition model built for real-time transcription. It supports 40+ languages, uses cache-aware streaming for low-latency output, and includes word boosting to handle domain-specific vocabulary. It’s available through NVIDIA NIM as a REST API or as a self-hosted container.

What is cache-aware streaming in ASR?

Cache-aware streaming is a technique where the model stores encoded internal states from previously processed audio chunks. When new audio arrives, the model uses this cache to maintain context across chunk boundaries — enabling real-time transcription with accuracy close to offline processing, without the latency penalty of waiting for complete audio before starting transcription.

How does word boosting work in Nemotron 3.5 ASR?

Word boosting lets you provide a list of words or phrases at inference time that you want the model to prioritize. During decoding, the model applies a configurable bias score toward these terms when the audio is acoustically ambiguous. This is useful for domain-specific vocabulary — product names, technical terms, proper nouns — without requiring fine-tuning.

What languages does NVIDIA Nemotron 3.5 ASR support?

The model supports 40+ languages, including English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Mandarin Chinese, Japanese, Korean, Arabic, Hindi, and others. It can detect language automatically or accept explicit language configuration per request.

Is Nemotron 3.5 ASR suitable for real-time AI agents?

Yes — it’s specifically designed for agent-facing use cases. The cache-aware streaming architecture enables low-latency partial transcription, so downstream LLMs and action systems can begin processing before a user finishes speaking. Combined with accurate multilingual transcription and runtime word boosting, it addresses the core requirements of production voice AI agents.

How does Nemotron 3.5 ASR compare to Whisper?

OpenAI Whisper is a strong offline ASR model with broad language coverage, but it doesn’t natively support streaming transcription — audio must be fully captured before transcription begins. Nemotron 3.5 ASR is purpose-built for streaming and real-time use cases. For applications where latency matters (live agents, real-time call analysis, voice-controlled workflows), Nemotron 3.5 ASR is the more practical choice.

Key Takeaways

NVIDIA Nemotron 3.5 ASR is a 600M parameter streaming ASR model built for production AI agent deployments
Cache-aware streaming enables real-time, low-latency transcription without sacrificing accuracy at chunk boundaries
Word boosting solves the domain vocabulary problem at inference time — no fine-tuning required
40+ language support in a single model simplifies infrastructure for multilingual deployments
NVIDIA NIM handles serving infrastructure, exposing the model through a standard REST API
For teams building voice-enabled agents and workflows without deep infrastructure work, platforms like MindStudio make it practical to connect ASR models to LLMs, business tools, and automated actions — start building free at mindstudio.ai
