Nemotron-Cascade-2-30B-A3B

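Introduction

We are excited to introduce Nemotron-Cascade-2-30B-A3B, an open 30B-parameter MoE model with 3B activated parameters that delivers strong reasoning and agentic capabilities. It is post-trained from Nemotron-3-Nano-30B-A3B-Base. Nemotron-Cascade-2-30B-A3B achieves gold-medal performance on both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). It supports both thinking and instruct (non-thinking) modes.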

Benchmark Results

Benchmark	Nemotron-3-Nano-30B-A3B	Nemotron-3-Super-120B-A12B	Qwen3.5-35B-A3B	Nemotron-Cascade-2-30B-A3B
Math
IMO 2025	-	-	-	🏅 35 pts
IMO AnswerBench	70.4‡	77.2‡	74.8‡	79.3
IMO ProofBench	-	-	-	72.9
AIME 2025	89.1	90.2	91.9‡	92.4 (98.6)†
AIME 2026	89.9‡	89.8‡	91.1‡	90.9 (95.0)†
HMMT Feb25	84.6‡	93.7	89.0	94.6
Code Reasoning
IOI 2025	-	-	348.6‡	🏅 439.3
ICPC World Finals 2025	-	-	-	🏅 10/12
LiveCodeBench v6 (2408-2505)	68.3	78.7	74.6	87.2 (88.4)†
LiveCodeBenchPro 25Q2 (Easy)	54.5‡	81.7‡	81.1‡	87.0 (89.3)†
LiveCodeBenchPro 25Q2 (Medium)	3.50‡	23.2‡	17.8‡	27.6 (36.8)†
SciCode	33.3	42.1	38.0	36.4
Knowledge & STEM
MMLU-Redux	-	-	93.3	86.3
MMLU-Pro	78.3	83.7	85.3	79.8
GPQA-Diamond	73.0	79.2	84.2	76.1
HLE (no tools)	10.6	18.3	22.4	17.7
Alignment & Instruction Following
ArenaHard v2 (Avg.)	67.7	-	65.4‡	83.5
– Hard Prompt	72.1	73.9	64.5‡	88.2
– Creative Writing	63.2	-	66.3‡	78.7
IFBench (prompt)	71.5	72.6	70.2	82.9
Scale AI Multi-Challenge	38.5	55.2	60.0	45.3
Long Context & Context Learning
AA-LCR	35.9	58.3	58.5	39.1
LongBench v2	39.6	-	59.0	40.3
NIAH@1M (RULER subset)	94.8	98.3	94.3‡	99.0
CL-Bench	12.0‡	-	15.5‡	12.2
Agentic
BFCL v4	53.8	-	67.3	52.9
𝜏²-Bench	49.0	61.2	81.2	58.9
Terminal Bench 2.0	8.5	31.0	40.5	21.1
SWE Verified (OpenHands)	38.8	60.5	69.2	50.2
Multilingual
MMLU-ProX	59.5	79.4	81.0	72.5
WMT24++ (en -> xx)	86.2	86.7	87.6‡	84.1

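* † Numbers in parentheses are tool-integrated reasoning (TIR) results.
* ‡ For baseline models, we report official numbers where they are available; otherwise, we evaluate the model ourselves using its recommended settings.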

Quick Start

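Nemotron-Cascade-2-30B-A3B follows the ChatML template and supports both thinking and instruct (non-thinking) modes. Reasoning content is enclosed within <think> and </think> tags. To activate instruct (non-thinking) mode, prepend <think></think> to the beginning of the assistant's response.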
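Nemotron-Cascade-2-30B-A3B supports a context length of up to 1M tokens.
Nemotron-Cascade-2-30B-A3B does not currently support OpenCode; it primarily supports OpenHands for agentic coding and software engineering tasks.
To reduce context length in multi-turn conversations, when a previous user turn used thinking mode, only the final summary of the model's output is added to the conversation history.
Note that there is no separate tool role for tool responses; they are placed under the user role and wrapped in <tool_response> and </tool_response>.
We recommend setting the sampling parameters to temperature = 1.0 and top_p = 0.95.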
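
The two conversation-history conventions above (keep only the final summary of a thinking-mode reply, and send tool results under the user role) can be sketched in plain Python. This is a minimal illustration for clients that manage the message list themselves, not the model's own chat-template logic: the helper names `strip_reasoning` and `tool_response_message` are hypothetical, and the newlines placed inside the `<tool_response>` wrapper are an assumption.

```python
THINK_CLOSE = "</think>"


def strip_reasoning(assistant_output: str) -> str:
    """Return the text after the last </think> tag (the final summary).

    If no closing tag is present, the output is returned unchanged
    apart from surrounding whitespace.
    """
    _, sep, tail = assistant_output.rpartition(THINK_CLOSE)
    return tail.strip() if sep else assistant_output.strip()


def tool_response_message(tool_output: str) -> dict:
    """Wrap a tool result as a user-role message.

    Assumption: one newline on each side of the payload inside the tags.
    """
    content = f"<tool_response>\n{tool_output}\n</tool_response>"
    return {"role": "user", "content": content}


# Example: store only the summary of a thinking-mode reply in the history.
raw_reply = "<think>\n1 + 1 is basic addition.\n</think>\nThe answer is 2."
history = [
    {"role": "user", "content": "calculate 1+1?"},
    {"role": "assistant", "content": strip_reasoning(raw_reply)},
]
```

The chat template shipped with the tokenizer may already drop earlier reasoning on its own; the helper is only relevant when the history is assembled by hand.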

vLLM Setup

Requires vLLM version >= 0.17.1. The following commands create an API endpoint at http://localhost:8000/v1:

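Standard: use the following command to create an API endpoint. As written, the command caps the context at 262,144 tokens (--max-model-len 262144); serving the full 1M-token context mentioned above would require a larger value.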

vllm serve nvidia/Nemotron-Cascade-2-30B-A3B --port 8000 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 262144 --reasoning-parser nemotron_v3 --mamba-ssm-cache-dtype float32 --trust_remote_code

Tool calling: use the following command to enable tool support.

vllm serve nvidia/Nemotron-Cascade-2-30B-A3B --port 8000 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 262144 --reasoning-parser nemotron_v3 --mamba-ssm-cache-dtype float32 --trust_remote_code --enable-auto-tool-choice --tool-call-parser qwen3_coder
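
Once the server is running, it can be queried through the OpenAI-compatible chat completions route. The sketch below uses only the Python standard library and applies the recommended sampling parameters (temperature 1.0, top_p 0.95). The helper names and the `max_tokens` and `timeout` defaults are assumptions chosen for illustration, not values from the model card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint created by `vllm serve`
MODEL = "nvidia/Nemotron-Cascade-2-30B-A3B"


def build_chat_request(messages, max_tokens=4096):
    """Build a chat completions payload with the recommended sampling settings.

    max_tokens=4096 is an illustrative default, not a recommendation.
    """
    return {
        "model": MODEL,
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL, timeout=600):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires the server to be running):
# payload = build_chat_request([{"role": "user", "content": "calculate 1+1?"}])
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```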

Chat Template

from transformers import AutoTokenizer

model_name = 'nvidia/Nemotron-Cascade-2-30B-A3B'
tokenizer = AutoTokenizer.from_pretrained(model_name)

'''
single-turn example
'''
messages = [
  {"role": "system", "content": "You are a helpful and harmless assistant.\n\nYou are not allowed to use any tools"},
  {"role": "user", "content": "calculate 1+1?"}
]

# thinking mode
prompt_thinking = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
# prompt_thinking = '<|im_start|>system\nYou are a helpful and harmless assistant.\n\nYou are not allowed to use any tools<|im_end|>\n<|im_start|>user\ncalculate 1+1?<|im_end|>\n<|im_start|>assistant\n'

'''
multi-turn example
'''
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant.\n\nYou are not allowed to use any tools"},
    {"role": "user", "content": "calculate 1+1?"},
    {"role": "assistant", "content": "\nTo calculate $1 + 1$:\n\n1. **Identify the operation**: This is a basic addition problem involving two integers.\n2. **Perform the addition**:  \n   $1 + 1 = 2$.\n\n**Result**: $\\boxed{2}$"},
    {"role": "user", "content": "what about 2+2"}
]

# thinking mode
prompt_thinking = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
# prompt_thinking = '<|im_start|>system\nYou are a helpful and harmless assistant.\n\nYou are not allowed to use any tools<|im_end|>\n<|im_start|>user\ncalculate 1+1?<|im_end|>\n<|im_start|>assistant\n\nTo calculate $1 + 1$:\n\n1. **Identify the operation**: This is a basic addition problem involving two integers.\n2. **Perform the addition**:  \n   $1 + 1 = 2$.\n\n**Result**: $\\boxed{2}$<|im_end|>\n<|im_start|>user\nwhat about 2+2<|im_end|>\n<|im_start|>assistant\n'
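
To make the ChatML layout in the commented prompts easier to see, here is a minimal pure-Python sketch that reproduces it. The function `render_chatml` is hypothetical and only mirrors the strings shown in the comments above; the tokenizer's own chat template is authoritative and may emit additional tokens (for example around reasoning) that this sketch does not model.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in the ChatML layout shown in the comments above.

    Each turn becomes '<|im_start|>{role}\\n{content}<|im_end|>\\n'; when
    add_generation_prompt is set, an open assistant turn is appended.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant.\n\nYou are not allowed to use any tools"},
    {"role": "user", "content": "calculate 1+1?"},
]
example_prompt = render_chatml(example_messages)
```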

Python Tool Use

from transformers import AutoTokenizer

model_name = 'nvidia/Nemotron-Cascade-2-30B-A3B'
tokenizer = AutoTokenizer.from_pretrained(model_name)

SYSTEM_PROMPT = """# Tools

You have access to the following functions:

<tools>
<function>
<name>stateful_python_code_exec</name>
<description>Call this function to execute Python code in a stateful Jupyter notebook environment. Python will respond with the output of the execution or time out after 120.0 seconds.</description>
<parameters>
<parameter>
<name>code</name>
<type>string</type>
<description>Code to execute</description>
</parameter>
<required>["code"]</required>
</parameters>
</function>
</tools>

If you choose to call a function ONLY reply in the following format with NO suffix:

<tool_call>
<function=example_function_name>
<parameter=example_parameter_1>
value_1
</parameter>
<parameter=example_parameter_2>
This is the value for the second parameter
that can span
multiple lines
</parameter>
</function>
</tool_call>

<IMPORTANT>
Reminder:
- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags
- Required parameters MUST be specified
- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after
- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls
</IMPORTANT>"""

messages = [
  {"role": "system", "content": SYSTEM_PROMPT},
  {"role": "user", "content": "Solve the following math problem. Put your answer inside \\boxed{}.\n\nIn a school with 2008 students, each student is a member of certain committees. Each committee has at most 1004 members, and every two students are in at least one common committee. Determine the smallest possible number of committees in the school."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
print(prompt)
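
The system prompt above asks the model to emit calls inside a `<tool_call>` block with `<function=...>` and `<parameter=...>` tags. When working with raw generated text rather than a server-side parser, the call can be extracted roughly as follows. This is a regex-based sketch under the assumption that the output follows the format exactly; `parse_tool_calls` is a hypothetical helper, and the vLLM `--tool-call-parser qwen3_coder` option shown earlier performs this step on the server instead.

```python
import re

# One <tool_call> block wrapping a single <function=NAME>...</function>.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\n]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# One <parameter=NAME>value</parameter>; the value may span multiple lines.
PARAM_RE = re.compile(
    r"<parameter=([^>\n]+)>\n?(.*?)\n?</parameter>",
    re.DOTALL,
)


def parse_tool_calls(text):
    """Return a list of {'name': ..., 'arguments': {...}} dicts found in text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key.strip(): value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name.strip(), "arguments": arguments})
    return calls


sample_output = (
    "Let me compute this.\n"
    "<tool_call>\n"
    "<function=stateful_python_code_exec>\n"
    "<parameter=code>\n"
    "print(1 + 1)\n"
    "print(2)\n"
    "</parameter>\n"
    "</function>\n"
    "</tool_call>"
)
parsed_calls = parse_tool_calls(sample_output)
```

All parameter values are returned as strings; converting them to the declared `<type>` is left to the caller.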

Agentic Usage

from transformers import AutoTokenizer

model_name = 'nvidia/Nemotron-Cascade-2-30B-A3B'
tokenizer = AutoTokenizer.from_pretrained(model_name)

SYSTEM_PROMPT = """You are a customer service agent that helps the user.  The policy that determines how you should respond to requests from users is described below between the <policy> and </policy> tags.

In each turn you can either:
- Send a message to the user.
- Make a tool call.
You cannot do both at the same time.

<policy>
_NEED_TO_ADD_POLICY_HERE_
</policy>

Try to be helpful and always follow the policy.

# Tools

You have access to the following functions:

<tools>
<function>
<name>_NEED_TO_ADD_FUNCTION_NAME_1_</name>
<description>_FUNCTION_DESCRIPTION_</description>
<parameters>
<parameter>
<name>_NEED_TO_ADD_PARAMETER_NAME_1_</name>
<type>_PARAMETER_TYPE_</type>
<description>_PARAMETER_DESCRIPTION_</description>
<title>_PARAMETER_TITLE_</title>
</parameter>
<parameter>
<name>_NEED_TO_ADD_PARAMETER_NAME_2_</name>
<type>_PARAMETER_TYPE_</type>
<description>_PARAMETER_DESCRIPTION_</description>
<title>_PARAMETER_TITLE_</title>
</parameter>
...... (_MORE_PARAMETERS_TO_ADD_)
</parameters>
</function>
...... (_MORE_FUNCTIONS_TO_ADD_)
</tools>
"""

messages = [
  {"role": "system", "content": SYSTEM_PROMPT},
  {"role": "user", "content": "Hello, I'm calling regarding my upcoming stay at your hotel. My guest ID is G90920 and booking ID is B11246 for a Deluxe room on June 5th. I'm traveling with three 6-month-old triplets and need to request three infant cribs for our room. It's currently 30 hours before check-in—could you please confirm if this is feasible and if there are quiet room options available for families with infants?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
print(prompt)
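
The placeholders in the `<tools>` section of the system prompt need to be filled in for each deployment. One way to do that is to render the block from a small Python description of each function, as sketched below. The helpers `render_function` and `render_tools` and the example function `get_booking` are hypothetical; the sketch emits the `<name>`, `<type>`, and `<description>` fields plus an optional `<required>` list as in the Python tool prompt, and omits the `<title>` field for brevity.

```python
import json


def render_function(name, description, parameters, required=()):
    """Render one <function> entry in the XML-like layout used by the prompts."""
    lines = [
        "<function>",
        f"<name>{name}</name>",
        f"<description>{description}</description>",
        "<parameters>",
    ]
    for param in parameters:
        lines += [
            "<parameter>",
            f"<name>{param['name']}</name>",
            f"<type>{param['type']}</type>",
            f"<description>{param['description']}</description>",
            "</parameter>",
        ]
    if required:
        # JSON list, e.g. ["booking_id"], matching the Python tool prompt.
        lines.append(f"<required>{json.dumps(list(required))}</required>")
    lines += ["</parameters>", "</function>"]
    return "\n".join(lines)


def render_tools(functions):
    """Render the full <tools> block from a list of function descriptions."""
    body = "\n".join(render_function(**spec) for spec in functions)
    return f"<tools>\n{body}\n</tools>"


# Hypothetical function description, for illustration only.
tools_block = render_tools([
    {
        "name": "get_booking",
        "description": "Look up a hotel booking by its ID.",
        "parameters": [
            {"name": "booking_id", "type": "string", "description": "Booking ID"},
        ],
        "required": ["booking_id"],
    }
])
```

The resulting string can be substituted for the `<tools>...</tools>` section of `SYSTEM_PROMPT` before calling `apply_chat_template`.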

Release Date

March 19, 2026

License

Your use of this model is governed by the NVIDIA Open Model License.

Citation

@article{Nemotron_Cascade_2,
  title={Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation},
  author={Yang, Zhuolin and Liu, Zihan and Chen, Yang and Dai, Wenliang and Wang, Boxin and Lin, Sheng-Chieh and Lee, Chankyu and Chen, Yangyi and Jiang, Dongfu and He, Jiafan and Pi, Renjie and Lam, Grace and Lee, Nayeon and Bukharin, Alexander and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  year={2026}
}