NVIDIA/Mixture of Experts

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

chatcodingreasoningThinkingTool Use

31.6B

Parameters (3B active)

256K

Context length

Benchmarks

Quantizations

1.1M

HF downloads

Architecture

MoE

Released

2025-12-04

Layers

KV Heads

Head Dim

128

Family

nemotron

Model Card

View on HuggingFace

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Model Overview

Model Developer: NVIDIA Corporation

Model Dates:

September 2025 - December 2025

Data Freshness:

The post-training data has a cutoff date of November 28, 2025.
The pre-training data has a cutoff date of June 25, 2025.

Description

Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be configured through a flag in the chat template. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.

The model employs a hybrid Mixture-of-Experts (MoE) architecture, consisting of 23 Mamba-2 and MoE layers, along with 6 Attention layers. Each MoE layer includes 128 experts plus 1 shared expert, with 6 experts activated per token. The model has 3.5B active parameters and 30B parameters in total.

The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen.

This model is ready for commercial use.

What is Nemotron?

NVIDIA Nemotron™ is a family of open models with open weights, training data, and recipes, delivering leading efficiency and accuracy for building specialized AI agents.

To get started, you can use our quickstart guide below.

Quantizations & VRAM

Q4_K_M4.5 bpw

18.3 GB

VRAM required

94%

Quality

Q6_K6.5 bpw

26.2 GB

VRAM required

97%

Quality

Q8_08 bpw

32.1 GB

VRAM required

100%

Quality

FP1616 bpw

63.7 GB

VRAM required

100%

Quality

Benchmarks (6)

IFEval83.2

BBH62.5

MATH55.4

MMLU-PRO47.8

MUSR22.3

GPQA18.9

Run with Ollama

$ollama run nemotron:31b

HuggingFace Ollama Library GGUF Downloads Build Hardware

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

AMD RX 7900 XT

20 GB VRAM • 800 GB/s