Text Generation Model

Nemotron 3 Super 120B

NVIDIA's open-weight hybrid Mamba-Transformer MoE with 120B total / 12B active parameters, excelling in agentic reasoning, coding, and long-context tasks up to 1M tokens.

Publisher Nvidia
Type Text
Context Window 1,000,000 tokens
Training Data March 2026
Input $0.10/MTok
Output $0.50/MTok
Provider DeepInfra

Hybrid MoE with 1M token context window

Nemotron 3 Super 120B is an open-weight large language model released by NVIDIA in March 2026. It uses a hybrid LatentMoE architecture that combines Mamba-2, Mixture-of-Experts, and Attention layers, activating only 12 billion of its 120 billion total parameters per token. This design allows the model to handle demanding tasks while using significantly less compute than a dense model of comparable parameter count.
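The compute savings come from sparse routing: a learned gate selects a small subset of experts for each token, so most of the 120B parameters sit idle on any given forward pass. Below is a minimal, illustrative top-k MoE layer in PyTorch; the layer sizes, expert count, and top-k value are placeholders for illustration, not NVIDIA's actual LatentMoE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer (not NVIDIA's actual config)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # learned gate over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run, so per-token compute scales with
        # top_k / n_experts of the total expert parameters.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

In a production MoE the per-expert loop is replaced by batched grouped matrix multiplies, but the routing idea is the same.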

The model is built for agentic workflows, long-context reasoning, and high-throughput deployments. It supports a context window of up to 1 million tokens and achieves a RULER-100 retrieval score of 91.75 at that length. Nemotron 3 Super 120B also includes a configurable thinking mode for step-by-step reasoning, supports seven languages (English, French, German, Italian, Japanese, Spanish, and Chinese), and ships as an open-weight model suitable for both cloud API and self-hosted use.

What Nemotron 3 Super 120B supports

Long Context Window

Processes up to 1 million tokens of context in a single request, with a reported RULER-100 retrieval accuracy of 91.75 at that length.
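For example, a quick pre-flight estimate helps confirm a large input actually fits before sending it. The sketch below uses a rough ~4 characters-per-token heuristic for English text, not Nemotron's actual tokenizer, and takes its 16,384-token response budget from the parameters section of this page.

```python
# Rough pre-flight check before sending a very long document. The ~4 chars
# per token ratio is a heuristic for English text, not Nemotron's tokenizer,
# so treat the estimate as approximate.
CONTEXT_WINDOW = 1_000_000
RESPONSE_BUDGET = 16_384  # max response size from the parameters section

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(document: str, prompt: str) -> bool:
    used = estimate_tokens(document) + estimate_tokens(prompt)
    return used + RESPONSE_BUDGET <= CONTEXT_WINDOW

doc = "..." * 100_000  # stand-in for a large corpus or codebase dump
print(fits_in_context(doc, "Summarize the key findings."))
```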

Agentic Reasoning

Designed for multi-step autonomous workflows including coding agents, planning, and tool use, with benchmark results on SWE-bench, Terminal Bench, and TauBench.

Configurable Thinking Mode

Supports an optional reasoning trace mode where the model generates step-by-step thinking before producing a final answer, useful for math and logic tasks.
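A hedged sketch of toggling the mode through DeepInfra's OpenAI-compatible endpoint is below. The model id and the `reasoning` field are assumptions drawn from this page's parameter listing; check the provider's documentation for the exact names.

```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint. The model id and the
# `reasoning` flag are assumptions based on this page's parameter listing.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="YOUR_DEEPINFRA_API_KEY",
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # hypothetical model id
    messages=[{"role": "user", "content": "What is 17 * 24? Show your steps."}],
    extra_body={"reasoning": True},  # toggles the step-by-step thinking trace
)
print(resp.choices[0].message.content)
```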

Code Generation

Handles code writing, debugging, and autonomous software engineering tasks, with evaluation results on SWE-bench and Terminal Bench.

Efficient MoE Inference

Activates only 12B of 120B total parameters per token using a LatentMoE architecture, reducing compute requirements compared to dense models at the same parameter scale.
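As a back-of-envelope check, a decoder forward pass costs roughly 2 FLOPs per active parameter per token, so 12B active parameters imply about a tenth of the per-token compute of a dense 120B model (ignoring attention's context-length term):

```python
# Back-of-envelope: forward-pass FLOPs per token ~= 2 * active parameters.
# This ignores attention's context-length term, so it's a lower bound.
ACTIVE_PARAMS = 12e9
TOTAL_PARAMS = 120e9

moe_flops = 2 * ACTIVE_PARAMS    # ~2.4e10 FLOPs/token
dense_flops = 2 * TOTAL_PARAMS   # ~2.4e11 FLOPs/token for a dense 120B model
print(f"Sparse/dense compute ratio: {moe_flops / dense_flops:.0%}")  # ~10%
```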

Multilingual Support

Supports text generation in seven languages: English, French, German, Italian, Japanese, Spanish, and Chinese.

Tool Calling

Supports structured tool-use and function-calling workflows, making it suitable for RAG pipelines and multi-step agent integrations.
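A minimal function-calling sketch against DeepInfra's OpenAI-compatible API is shown below. The `search_docs` tool and the model id are hypothetical; the tools schema itself follows the standard OpenAI function-calling format.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="YOUR_DEEPINFRA_API_KEY",
)

# A single hypothetical tool the model can choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the internal knowledge base for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # hypothetical model id
    messages=[{"role": "user", "content": "Find our refund policy."}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model may also answer directly without a tool
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)
```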

Instruction Following

Trained to follow complex, multi-part instructions; evaluation includes reasoning-heavy benchmarks such as GPQA and HMMT.


Benchmark scores

Scores represent accuracy: the percentage of questions answered correctly on each test.

| Benchmark | What it tests | Score |
| --- | --- | --- |
| AIME 2025 | Problems from the 2025 American Invitational Mathematics Examination | 90.2% |
| GPQA Diamond | PhD-level science questions (biology, physics, chemistry) | 82.7% |
| SWE-bench Verified | Real GitHub issues requiring multi-file code fixes | 60.5% |

Common questions about Nemotron 3 Super 120B

What is the context window for Nemotron 3 Super 120B?

Nemotron 3 Super 120B supports a context window of up to 1 million tokens. NVIDIA reports a RULER-100 retrieval accuracy of 91.75 at the full 1M token length.

How many parameters does the model actually use during inference?

The model has 120 billion total parameters but activates only 12 billion per token during inference, thanks to its LatentMoE architecture combining Mamba-2, MoE, and Attention layers.

Is Nemotron 3 Super 120B open-weight?

Yes, Nemotron 3 Super 120B is released as an open-weight model. The model weights are available on Hugging Face, and NVIDIA updated the license after initial release to remove certain restrictive clauses that had drawn community concern.

What languages does Nemotron 3 Super 120B support?

The model supports seven languages: English, French, German, Italian, Japanese, Spanish, and Chinese.

What is the training data cutoff for this model?

The metadata on this page lists a training data date of March 2026, which coincides with the model's release month. A more specific training cutoff is not stated here; refer to the official technical report for details.

What hardware is Nemotron 3 Super 120B optimized for?

The model is designed with NVIDIA Blackwell architecture in mind. Community benchmarks have demonstrated NVFP4 inference running on a single RTX Pro 6000 Blackwell GPU.

What people think about Nemotron 3 Super 120B

Community reception on r/LocalLLaMA and r/singularity has been generally positive, with users highlighting the model's 1M token context window, fast inference due to its 12B active parameter design, and suitability for local deployment on Blackwell hardware. The hybrid SSM LatentMoE architecture and open-weight availability were frequently cited as notable attributes.

A significant early concern was the model's original license, which contained clauses users described as restrictive; this was resolved when NVIDIA updated the license shortly after release. Discussions also noted the absence of vision capabilities as a trade-off compared to some contemporaries, and users debated use cases where long context and speed matter more than multimodal support.


Parameters & options

Max Temperature 1
Max Response Size 16,384 tokens
Reasoning Disabled / Enabled (default: Disabled)
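Put together, those limits map onto a request roughly like this (reusing the client from the earlier examples; the model id and `reasoning` field remain assumptions):

```python
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # hypothetical model id
    messages=[{"role": "user", "content": "Draft a release announcement."}],
    temperature=1.0,                 # page lists a max temperature of 1
    max_tokens=16_384,               # page lists a 16,384-token response cap
    extra_body={"reasoning": False}, # reasoning defaults to disabled
)
print(resp.choices[0].message.content)
```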

Start building with Nemotron 3 Super 120B

No API keys required. Create AI-powered workflows with Nemotron 3 Super 120B in minutes — free.