What Is NVIDIA Nemotron 3.5 ASR? The Streaming Speech-to-Text Model for AI Agents

NVIDIA Nemotron 3.5 ASR is a 600M parameter streaming model supporting 40 languages. Learn how cache-aware streaming and word boosting make it agent-ready.

MindStudio Team

A New Standard for Real-Time Speech Recognition in AI Agents

Voice is one of the fastest-growing input modalities for AI systems — and one of the hardest to get right. The gap between “speech recognition that kind of works” and “speech recognition that works reliably in production” is wide. NVIDIA Nemotron 3.5 ASR is designed to close that gap.

This is a 600M parameter streaming automatic speech recognition (ASR) model built specifically for real-time, agent-ready transcription. It supports 40 languages, delivers low-latency output through cache-aware streaming, and includes word boosting for domain-specific vocabulary — features that matter enormously when you’re building AI agents that need to hear and respond accurately in the real world.

This article covers what NVIDIA Nemotron 3.5 ASR is, how it works technically, where it fits into modern AI agent architectures, and how platforms like MindStudio make it practical to deploy without infrastructure overhead.


What Is NVIDIA Nemotron 3.5 ASR?

NVIDIA Nemotron 3.5 ASR is a production-grade automatic speech recognition model released by NVIDIA as part of its broader Nemotron model family. It’s designed for real-time transcription use cases — particularly those involving AI agents, voice assistants, customer service automation, and any workflow where spoken input needs to be converted to text quickly and accurately.

The model is available through NVIDIA NIM (NVIDIA Inference Microservices), NVIDIA’s platform for deploying optimized AI models as microservices. NIM handles the serving infrastructure, so developers interact with the model through a simple API.

A few key specs at a glance:

  • Parameters: ~600 million
  • Architecture: FastConformer-based with CTC (Connectionist Temporal Classification) decoding
  • Languages: 40+ supported
  • Streaming: Yes — cache-aware streaming with configurable chunk sizes
  • Word boosting: Yes — runtime vocabulary augmentation for domain-specific terms
  • Timestamps: Word- and segment-level timestamp prediction
  • Long-form audio: Handles extended recordings without truncation issues

The “3.5” version represents a meaningful improvement over earlier Nemotron ASR iterations — both in multilingual coverage and in the streaming capabilities that make it viable for live, agent-facing deployments.


How Cache-Aware Streaming Works

Most speech recognition models process audio in one of two modes: offline (the full audio clip is sent at once, then transcribed) or streaming (audio is processed in real-time chunks as it arrives). Offline mode is generally more accurate because the model sees the whole utterance, but it adds too much latency for live use cases. Simple streaming models reduce latency but often sacrifice accuracy, particularly at chunk boundaries where context is lost.

NVIDIA Nemotron 3.5 ASR uses cache-aware streaming to solve this tradeoff.

The Problem Cache-Aware Streaming Solves

When you split audio into small chunks for real-time processing, each chunk loses context about what came before it. The model can’t “see” earlier audio, so words near chunk boundaries often get garbled or misrecognized. Standard streaming models compensate with larger chunk sizes — which helps accuracy but hurts latency.

Cache-aware streaming keeps a rolling cache of processed context from prior chunks. When a new chunk arrives, the model has access to encoded representations of previous audio — not the raw audio itself, but learned internal states. This lets it maintain temporal context across chunk boundaries without reprocessing old audio.
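
To make the idea concrete, here is a minimal toy sketch of caching across chunk boundaries. It is not Nemotron's implementation: a single causal filter stands in for one encoder layer, and the names KERNEL, process_chunk, and stream are made up for illustration. The point it shows is that carrying a small cache of left context between chunks reproduces the offline result exactly, while dropping the cache corrupts the frames right after each boundary.

```python
from typing import List, Tuple

# Toy stand-in for one causal encoder layer: output[t] depends on x[t-2..t].
KERNEL = [0.2, 0.3, 0.5]
CONTEXT = len(KERNEL) - 1  # number of past frames the layer needs


def causal_filter(frames: List[float], left_context: List[float]) -> List[float]:
    """Apply the causal filter to `frames`, using `left_context` as history."""
    padded = left_context + frames
    out = []
    for t in range(len(frames)):
        window = padded[t : t + len(KERNEL)]
        out.append(sum(w * x for w, x in zip(KERNEL, window)))
    return out


def process_chunk(chunk: List[float], cache: List[float]) -> Tuple[List[float], List[float]]:
    """Process one chunk and return (outputs, updated cache)."""
    outputs = causal_filter(chunk, cache)
    new_cache = (cache + chunk)[-CONTEXT:]  # keep only the most recent frames
    return outputs, new_cache


def stream(frames: List[float], chunk_size: int, use_cache: bool = True) -> List[float]:
    """Run chunk by chunk; with use_cache=False each chunk starts from silence."""
    cache = [0.0] * CONTEXT
    results: List[float] = []
    for start in range(0, len(frames), chunk_size):
        outputs, new_cache = process_chunk(frames[start : start + chunk_size], cache)
        results.extend(outputs)
        if use_cache:
            cache = new_cache
    return results


audio = [float(i % 7) for i in range(20)]          # fake feature values
offline = causal_filter(audio, [0.0] * CONTEXT)    # whole clip at once
cached = stream(audio, chunk_size=4, use_cache=True)
uncached = stream(audio, chunk_size=4, use_cache=False)

print("cached == offline:  ", cached == offline)
print("uncached == offline:", uncached == offline)
```

In a real multi-layer encoder the cached values would be each layer's activations rather than raw input frames, but the bookkeeping follows the same pattern: process the chunk, then keep just enough recent state for the next one.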

Why This Matters for AI Agents

AI agents that take voice input need low latency. A customer service agent, a real-time meeting transcription tool, or a voice-controlled workflow automation can’t wait 3–5 seconds for a sentence to finish processing. Cache-aware streaming in Nemotron 3.5 ASR enables:

  • Sub-second chunk processing at configurable chunk sizes
  • Consistent accuracy even at low latency settings
  • Stable partial transcripts that agents can begin acting on before the full utterance completes

This is the architectural difference that separates models built for production deployment from those built for benchmarks.


Word Boosting: Teaching the Model Your Vocabulary

Generic speech recognition models struggle with proper nouns, brand names, technical jargon, and industry-specific terminology. A model trained on general conversational data will often transcribe “Salesforce” as “sales force,” mishear product names, or completely botch rare medical or legal terms.

NVIDIA Nemotron 3.5 ASR includes a word boosting feature that addresses this at inference time — no fine-tuning required.

How Word Boosting Works

Word boosting is a bias mechanism applied during the decoding phase. You provide a list of words or phrases you want the model to favor, along with a boost score. During transcription, the decoder assigns higher probability to those terms when the audio is ambiguous between several options.

In practice, this looks like passing a list alongside your audio input:

boost_phrases: ["Nemotron", "FastConformer", "NVIDIA NIM", "CTC decoding"]
boost_score: 10.0

The model uses these as soft hints — it won’t force incorrect transcription, but when the audio is genuinely ambiguous, it tilts toward your boosted vocabulary.
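
The "soft hint" behavior can be sketched with a simplified rescoring step. This is an illustration under assumptions, not the model's decoder: real systems typically apply the bias inside beam search, whereas this sketch rescores a finished list of candidate transcripts. The candidate scores, the function names, and the boost value of 1.0 are all hypothetical, and the right boost scale depends on the decoder.

```python
from typing import Dict, List


def boosted_score(text: str, base_score: float,
                  boost_phrases: List[str], boost_score: float) -> float:
    """Add a fixed bonus for every boosted phrase that appears in the candidate."""
    lowered = text.lower()
    hits = sum(1 for phrase in boost_phrases if phrase.lower() in lowered)
    return base_score + boost_score * hits


def pick_transcript(candidates: Dict[str, float],
                    boost_phrases: List[str], boost_score: float) -> str:
    """Return the candidate with the highest boosted score."""
    return max(
        candidates,
        key=lambda text: boosted_score(text, candidates[text], boost_phrases, boost_score),
    )


# Hypothetical log-probability-style scores (higher is better).
ambiguous = {
    "open the sales force dashboard": -4.1,
    "open the Salesforce dashboard": -4.6,
    "open the sails horse dashboard": -9.8,
}
# Audio that clearly does not contain the boosted term.
unambiguous = {
    "cancel my order": -2.0,
    "Salesforce my order": -20.0,
}

print(pick_transcript(ambiguous, [], 0.0))
print(pick_transcript(ambiguous, ["Salesforce"], 1.0))
print(pick_transcript(unambiguous, ["Salesforce"], 1.0))
```

With no boost list the generic spelling wins. With "Salesforce" boosted, the close call tips toward the brand name, while the clearly different utterance is left alone because the bonus is far smaller than the score gap.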

Real-World Applications

Word boosting is particularly valuable in:

  • Healthcare: Boosting drug names, anatomical terms, procedure codes
  • Legal: Case identifiers, legal terminology, firm names
  • Customer support: Product SKUs, feature names, service tier names
  • Finance: Ticker symbols, fund names, regulatory terms
  • Enterprise software: Internal tool names, team names, technical acronyms

This removes a major deployment blocker for domain-specific voice AI. Teams no longer need to fine-tune a custom model every time they introduce new terminology — they adjust the boost list at runtime.


Multilingual Support: 40 Languages in a Single Model

Nemotron 3.5 ASR supports transcription in 40+ languages without requiring separate model instances per language. The model handles automatic language detection or accepts explicit language configuration.

Supported languages span major world languages including English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Chinese (Mandarin), Japanese, Korean, Arabic, Hindi, and many others.

Why a Single Multilingual Model Matters

Running separate ASR models per language creates infrastructure complexity — more endpoints to manage, more compute to provision, and more latency when language identification is uncertain. A single multilingual model simplifies all of this.

For global enterprises running AI agents across regions, or for any product with an international user base, this consolidation is operationally significant. One NIM endpoint, one integration, one set of monitoring — regardless of which of the 40 supported languages your users speak.

Language Detection and Switching

Nemotron 3.5 ASR can perform language identification automatically if the target language isn’t specified. For use cases where speakers might switch between languages mid-session — multilingual customer support, for instance — this flexibility is essential.


The FastConformer Architecture Under the Hood

The model is built on FastConformer, NVIDIA’s optimized Conformer architecture. Understanding why this architecture matters helps explain the model’s performance characteristics.

Conformer: Combining CNN and Transformer

Standard Transformer models process entire sequences globally but can miss fine-grained local acoustic patterns. Standard CNNs are great at local patterns but struggle with long-range dependencies. The Conformer architecture (introduced by Google in 2020) combines both: self-attention for global context, depthwise separable convolutions for local acoustic features.

FastConformer’s Improvements

FastConformer reduces computational cost by introducing strided subsampling early in the network, shrinking the sequence length before the expensive attention layers. This yields:

  • 8x subsampling of the input frames before the attention blocks (the original Conformer uses 4x)
  • Significantly faster inference at comparable accuracy
  • Better suitability for streaming, where compute latency is critical
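
A quick back-of-the-envelope calculation shows why the subsampling factor matters. The sketch below assumes a 10 ms feature hop, which is a common ASR front-end setting but is not stated for this model, and it counts only the pairwise comparisons that full self-attention makes, ignoring everything else in the network.

```python
FRAME_HOP_MS = 10  # assumed feature hop; not confirmed for this model


def encoder_frames(audio_seconds: float, subsampling: int) -> int:
    """Number of frames reaching the attention layers after subsampling."""
    input_frames = int(audio_seconds * 1000 / FRAME_HOP_MS)
    return input_frames // subsampling


def attention_pairs(frames: int) -> int:
    """Full self-attention compares every frame with every other frame."""
    return frames * frames


for factor in (4, 8):
    frames = encoder_frames(30, factor)
    print(f"{factor}x subsampling: {frames} frames, {attention_pairs(frames)} attention pairs")
# 4x subsampling: 750 frames, 562500 attention pairs
# 8x subsampling: 375 frames, 140625 attention pairs
```

Halving the sequence length cuts the quadratic attention cost to a quarter, which is where much of the inference speedup comes from.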

At 600M parameters, Nemotron 3.5 ASR is large enough to handle complex acoustic environments and multilingual data, but optimized enough to run efficiently through NVIDIA NIM’s serving infrastructure.

CTC Decoding

The model uses CTC (Connectionist Temporal Classification) as its output mechanism. CTC doesn’t require aligned training labels — it learns alignment implicitly. For streaming applications, CTC has a key advantage: it produces outputs incrementally as audio is processed, rather than waiting for an end-of-sequence signal.

This makes CTC well-suited for the real-time use cases Nemotron 3.5 ASR targets. Combined with the cache-aware streaming architecture, CTC decoding ensures the model can emit partial transcripts reliably as audio chunks arrive.
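
The incremental property is easy to see in greedy CTC decoding, which collapses repeated labels and then drops blanks. The sketch below is a generic illustration, not NVIDIA's decoder: it uses single characters as labels for readability (production models emit subword tokens), "_" as the blank symbol, and a hypothetical class name. The only state it carries between chunks is the last frame's label.

```python
from typing import List

BLANK = "_"  # CTC blank symbol (placeholder choice for this sketch)


class StreamingCTCCollapser:
    """Greedy CTC collapse that can be fed one chunk of frame labels at a time."""

    def __init__(self) -> None:
        self.prev = BLANK          # label of the most recent frame seen
        self.text: List[str] = []  # everything emitted so far

    def feed(self, frame_labels: List[str]) -> str:
        """Consume per-frame labels and return only the newly emitted text."""
        emitted = []
        for label in frame_labels:
            # Emit a label only when it is not blank and not a repeat.
            if label != BLANK and label != self.prev:
                emitted.append(label)
            self.prev = label
        self.text.extend(emitted)
        return "".join(emitted)


decoder = StreamingCTCCollapser()
chunks = [
    ["h", "h", "_", "e", "e"],
    ["e", "_", "l", "l"],      # starts with a repeat carried over from chunk 1
    ["l", "_", "l", "o"],      # blank separates the two real "l" characters
]
partials = [decoder.feed(chunk) for chunk in chunks]
print(partials)
print("".join(decoder.text))
```

Because `prev` persists across calls, a label that straddles a chunk boundary is emitted once, and each call returns text that can be appended to the running transcript immediately.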


Where Nemotron 3.5 ASR Fits in AI Agent Pipelines

Speech recognition is a front-end component in most voice-enabled AI agent architectures. Here’s where Nemotron 3.5 ASR slots into a typical pipeline:

User speaks → ASR (Nemotron 3.5) → Text → LLM reasoning → Action

But the details matter. In agentic contexts, the ASR layer isn’t just a dumb transcription step — it affects everything downstream.

Latency Compounds Across the Pipeline

An AI agent pipeline involves multiple sequential steps: transcription, LLM inference, tool calls, response generation, and sometimes text-to-speech for output. Each step adds latency. If your ASR model is slow, everything downstream is delayed.

Nemotron 3.5 ASR’s streaming capability means the LLM can begin processing as soon as the first coherent chunk is transcribed — it doesn’t need to wait for the speaker to finish their entire utterance. In practice, this enables perceived response times that feel much closer to human conversation.
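
Here is a minimal sketch of what acting on partial transcripts can look like. Everything in it is hypothetical: the timed partials are invented sample data, and a keyword matcher stands in for the LLM so the example stays self-contained. It shows the downstream step firing as soon as the partial transcript carries enough information, rather than at the end of the utterance.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical intents; a real agent would hand the text to an LLM instead.
INTENT_KEYWORDS = {
    "cancel_order": ("cancel", "order"),
    "track_order": ("track", "order"),
}


def detect_intent(partial: str) -> Optional[str]:
    """Return the first intent whose keywords all appear in the partial text."""
    words = set(partial.lower().split())
    for intent, required in INTENT_KEYWORDS.items():
        if all(keyword in words for keyword in required):
            return intent
    return None


def first_actionable(partials: Iterable[Tuple[float, str]]) -> Optional[Tuple[float, str]]:
    """Return (time, intent) for the earliest partial that is actionable."""
    for seconds, text in partials:
        intent = detect_intent(text)
        if intent is not None:
            return seconds, intent
    return None


# (seconds since speech started, transcript so far): made-up sample data
partials = [
    (0.4, "please"),
    (0.8, "please cancel"),
    (1.2, "please cancel the order"),
    (1.9, "please cancel the order I placed"),
    (2.6, "please cancel the order I placed yesterday"),
]

result = first_actionable(partials)
print(result)
```

In this sample the intent is available at 1.2 seconds instead of 2.6. A production agent would likely wait for the partial to stabilize, or confirm before acting on anything irreversible.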

Accuracy Affects Agent Reliability

An agent that mishears “cancel the order” as “cancel the border” will take incorrect actions. In high-stakes contexts — healthcare, finance, customer operations — transcription errors cascade into real problems. Nemotron 3.5 ASR’s combination of architecture quality, word boosting, and multilingual training addresses the accuracy side of this equation directly.

Timestamps Enable Temporal Reasoning

Nemotron 3.5 ASR outputs word-level and segment-level timestamps. For agents that need to operate on specific parts of a recording — summarizing a meeting, extracting action items from a call, flagging compliance-relevant statements — timestamps make it possible to anchor agent outputs to exact moments in the audio.
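
As a sketch of how an agent might use word-level timestamps, the snippet below locates a phrase in a transcript and returns the time span it covers. The Word structure and its field names are assumptions made for this example; the actual response schema may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """One transcribed word with start and end times in seconds (assumed shape)."""
    text: str
    start: float
    end: float


def find_phrase(words: List[Word], phrase: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the first occurrence of `phrase`, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [w.text.lower() for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i : i + len(target)] == target:
            return words[i].start, words[i + len(target) - 1].end
    return None


# Made-up sample data for illustration.
words = [
    Word("we", 0.00, 0.18),
    Word("will", 0.18, 0.40),
    Word("refund", 0.40, 0.95),
    Word("the", 0.95, 1.05),
    Word("full", 1.05, 1.40),
    Word("amount", 1.40, 1.92),
]

print(find_phrase(words, "full amount"))
print(find_phrase(words, "partial refund"))
```

A compliance or summarization agent could attach the returned span to its output so a reviewer can jump straight to that moment in the recording.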


How to Access NVIDIA Nemotron 3.5 ASR

The model is available through NVIDIA NIM, which exposes it as a REST API endpoint. The general workflow:

  1. Get an NVIDIA API key through the NVIDIA developer portal
  2. Call the NIM endpoint with your audio payload (supports streaming WebSocket connections for real-time use cases)
  3. Receive transcription results with word-level data, timestamps, and confidence scores

For local deployment, NVIDIA also supports self-hosted NIM containers, which run on NVIDIA GPU infrastructure and give teams full control over data residency and scaling.

The API follows standard patterns familiar to developers who’ve worked with OpenAI Whisper or Google Speech-to-Text — chunked audio input, JSON transcript output, configurable parameters for language and boosting.
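
On the client side, "chunked audio input" usually means slicing a raw audio stream into fixed-duration pieces before sending them. The sketch below covers only that slicing step and makes no network calls. The 16 kHz, 16-bit mono PCM format and the 160 ms chunk size are assumptions chosen for illustration, so check the service documentation for the formats and chunk sizes it actually accepts.

```python
from typing import Iterator


def pcm_chunks(pcm: bytes, chunk_ms: int,
               sample_rate: int = 16000, bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono PCM audio; the last one may be shorter."""
    bytes_per_chunk = sample_rate * bytes_per_sample * chunk_ms // 1000
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start : start + bytes_per_chunk]


one_second = bytes(16000 * 2)  # one second of silence: 16,000 samples x 2 bytes
chunks = list(pcm_chunks(one_second, chunk_ms=160))

print(len(chunks), "chunks;", len(chunks[0]), "bytes each;", len(chunks[-1]), "bytes in the last")
# 7 chunks; 5120 bytes each; 1280 bytes in the last
```

Smaller chunks lower the time to the first partial transcript but mean more round trips, so the chunk size is one of the main latency knobs to tune.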


Building Voice AI Workflows with MindStudio

If you want to put Nemotron 3.5 ASR — or any ASR model — to work in an actual AI agent workflow without managing infrastructure yourself, MindStudio is a practical path forward.

MindStudio is a no-code platform for building AI agents and automated workflows. It gives you access to 200+ AI models — including speech, language, and vision models — through a visual builder, without API keys or separate accounts for each model.

For voice AI use cases, this means you can:

  • Connect an ASR model as the input layer of a workflow
  • Pass the transcript to an LLM for reasoning, summarization, or action
  • Trigger downstream steps — send emails, update CRM records, create support tickets, post to Slack — using 1,000+ pre-built integrations
  • Build a full voice-enabled agent in an afternoon, not a sprint

For example, a customer support team could build a workflow that: transcribes incoming call recordings → extracts key issues and sentiment → creates a structured support ticket in Salesforce → routes to the right team — all without writing a line of code.

If you’re a developer who wants more control, MindStudio also supports custom JavaScript and Python functions, so you can drop into code where needed while still using the platform for orchestration and integrations.

You can start building for free at mindstudio.ai.


Nemotron 3.5 ASR vs. Other ASR Models

It’s worth briefly situating Nemotron 3.5 ASR relative to other common options.

Model                     | Streaming         | Multilingual   | Word Boosting           | Deployment
NVIDIA Nemotron 3.5 ASR   | Yes (cache-aware) | 40+ languages  | Yes                     | NIM API / self-hosted
OpenAI Whisper (large-v3) | No (offline only) | 99 languages   | No                      | Self-hosted / API
Google Speech-to-Text     | Yes               | 125+ languages | Yes (via phrase hints)  | Google Cloud
AWS Transcribe            | Yes               | 38 languages   | Yes (custom vocabulary) | AWS
Deepgram Nova-2           | Yes               | Limited        | Yes                     | Deepgram API

Whisper is widely used and excellent at offline transcription, but its lack of native streaming support limits it for real-time agent applications. Google and AWS offer competitive streaming products but come with cloud vendor lock-in. Deepgram is strong for English-focused use cases.

Nemotron 3.5 ASR’s differentiator is the combination of enterprise-grade streaming, broad multilingual coverage, and the NVIDIA ecosystem — particularly for teams already running NVIDIA GPU infrastructure or building on NVIDIA’s AI platform stack.


Frequently Asked Questions

What is NVIDIA Nemotron 3.5 ASR?

NVIDIA Nemotron 3.5 ASR is a 600M parameter automatic speech recognition model built for real-time transcription. It supports 40+ languages, uses cache-aware streaming for low-latency output, and includes word boosting to handle domain-specific vocabulary. It’s available through NVIDIA NIM as a REST API or as a self-hosted container.

What is cache-aware streaming in ASR?

Cache-aware streaming is a technique where the model stores encoded internal states from previously processed audio chunks. When new audio arrives, the model uses this cache to maintain context across chunk boundaries — enabling real-time transcription with accuracy close to offline processing, without the latency penalty of waiting for complete audio before starting transcription.

How does word boosting work in Nemotron 3.5 ASR?

Word boosting lets you provide a list of words or phrases at inference time that you want the model to prioritize. During decoding, the model applies a configurable bias score toward these terms when the audio is acoustically ambiguous. This is useful for domain-specific vocabulary — product names, technical terms, proper nouns — without requiring fine-tuning.

What languages does NVIDIA Nemotron 3.5 ASR support?

The model supports 40+ languages, including English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Mandarin Chinese, Japanese, Korean, Arabic, Hindi, and others. It can detect language automatically or accept explicit language configuration per request.

Is Nemotron 3.5 ASR suitable for real-time AI agents?

Yes — it’s specifically designed for agent-facing use cases. The cache-aware streaming architecture enables low-latency partial transcription, so downstream LLMs and action systems can begin processing before a user finishes speaking. Combined with accurate multilingual transcription and runtime word boosting, it addresses the core requirements of production voice AI agents.

How does Nemotron 3.5 ASR compare to Whisper?

OpenAI Whisper is a strong offline ASR model with broad language coverage, but it doesn’t natively support streaming transcription — audio must be fully captured before transcription begins. Nemotron 3.5 ASR is purpose-built for streaming and real-time use cases. For applications where latency matters (live agents, real-time call analysis, voice-controlled workflows), Nemotron 3.5 ASR is the more practical choice.


Key Takeaways

  • NVIDIA Nemotron 3.5 ASR is a 600M parameter streaming ASR model built for production AI agent deployments
  • Cache-aware streaming enables real-time, low-latency transcription without sacrificing accuracy at chunk boundaries
  • Word boosting solves the domain vocabulary problem at inference time — no fine-tuning required
  • 40+ language support in a single model simplifies infrastructure for multilingual deployments
  • NVIDIA NIM handles serving infrastructure, exposing the model through a standard REST API
  • For teams building voice-enabled agents and workflows without deep infrastructure work, platforms like MindStudio make it practical to connect ASR models to LLMs, business tools, and automated actions — start building free at mindstudio.ai
