Nemotron-Labs-Diffusion-14B

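Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both autoregressive (AR) decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model at inference time. The synergy between these two modes enables a third mode, self-speculation, in which the same model performs diffusion-based parallel drafting and AR verification over a shared KV cache, achieving high acceptance length and high decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can stay efficient across deployment scenarios and concurrency levels.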
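The draft-then-verify idea behind self-speculation can be illustrated with the standard greedy acceptance rule from speculative decoding. This is a minimal sketch under stated assumptions, not the model's implementation: the function name `accept_draft` and its list-based inputs are hypothetical, and this card does not specify the exact acceptance rule the model uses.

```python
def accept_draft(draft_tokens, verifier_tokens):
    """Greedy draft-and-verify acceptance (illustrative only).

    draft_tokens:    tokens proposed in parallel by the drafter.
    verifier_tokens: verifier_tokens[i] is the token the verifier would emit
                     at position i given the prefix plus draft_tokens[:i].
                     It holds one more entry than draft_tokens.
    Returns the tokens committed in this step.
    """
    if len(verifier_tokens) != len(draft_tokens) + 1:
        raise ValueError("verifier must score one position past the draft")

    accepted = []
    for i, drafted in enumerate(draft_tokens):
        if drafted != verifier_tokens[i]:
            break  # first disagreement: stop accepting drafted tokens
        accepted.append(drafted)

    # Always commit the verifier's own token at the first unverified position,
    # so every step yields at least one token.
    accepted.append(verifier_tokens[len(accepted)])
    return accepted


# Two drafted tokens match, the third does not: three tokens are committed.
print(accept_draft([5, 6, 7, 8], [5, 6, 9, 1, 2]))
```

The number of tokens committed per step is what the card refers to as acceptance length; longer accepted runs mean fewer forward passes per generated token.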

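Key Highlights

A family of state-of-the-art 3B, 8B, and 14B dense language models (base, instruct, and vision-language variants) that support AR, diffusion, and self-speculation decoding, with a focus on decoding efficiency.
Moves generation from the memory-bound regime toward the compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation.
Self-speculation mode drafts with diffusion and verifies with AR, offering a stronger alternative to multi-token prediction (MTP) methods:
- 3x higher acceptance length and 2.2x higher speed than Qwen3-8B-Eagle3 in SGLang.
- 5.9x as many tokens per forward pass as Qwen3-8B (no MTP) at the same accuracy.
Real-device speedups across platforms:
- DGX Spark (8B, concurrency 1): 112 tok/sec with w4a16, a 2.7x speedup over AR at 41.8 tok/sec.
- GB200 (8B, concurrency 1): 850 tok/sec, a 3.3x speedup, versus 253 tok/sec for AR and 360 tok/sec for Eagle3. Custom CUDA kernels raise this further to 1015 tok/sec (4x).
A speed-of-light analysis of diffusion speedup suggests that better sampling strategies could roughly double single-user throughput beyond the current best result; this is a direction for future research.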
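The speedup multipliers quoted for the device measurements are ratios of throughput figures. The short sketch below recomputes them from the tok/sec numbers given in this card; the helper name `speedup` is hypothetical, and the card's quoted multipliers appear to be rounded or truncated, so small differences are expected.

```python
def speedup(new_tok_per_sec, baseline_tok_per_sec):
    """Return the throughput ratio new / baseline."""
    if baseline_tok_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tok_per_sec / baseline_tok_per_sec


# (throughput, baseline) pairs in tok/sec, taken from the figures in this card.
figures = {
    "DGX Spark w4a16 vs AR": (112.0, 41.8),
    "GB200 vs AR": (850.0, 253.0),
    "GB200 vs Eagle3": (850.0, 360.0),
    "GB200 custom kernels vs AR": (1015.0, 253.0),
}

for label, (new, base) in figures.items():
    print(f"{label}: {speedup(new, base):.2f}x")
```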

License/Terms of Use

Use of this model is governed by the NVIDIA Nemotron Open Model License.

Environment

transformers>=5.0.0
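A quick way to confirm that the installed package satisfies this requirement is to compare version strings before loading the model. The following is a minimal sketch using only the standard library; the helper names `parse_release`, `meets_minimum`, and `check_transformers` are hypothetical, and pre-release suffixes are ignored for simplicity (so `5.0.0rc1` is treated as `5.0.0`).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string, width=3):
    """Extract the leading numeric release, e.g. '5.0.0.dev0' -> (5, 0, 0)."""
    match = re.match(r"\d+(\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # pad '5.1' to (5, 1, 0)
    return tuple(parts)


def meets_minimum(installed, minimum="5.0.0"):
    """True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def check_transformers(minimum="5.0.0"):
    """True if transformers is installed and new enough, else False."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


print("transformers OK:", check_transformers())
```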

Chat with Our Model

from transformers import AutoModel, AutoTokenizer
import torch

repo_name = "nvidia/Nemotron-Labs-Diffusion-14B"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

history = []

user_input = input("User: ").strip()
history.append({"role": "user", "content": user_input})

prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')

# Pick one of the three modes below; each call overwrites out_ids and nfe.
## Chat in AR Mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)

## Chat in dLM Mode
out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)

## Chat in Linear Self-Speculation Mode
out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)

tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
print(f"Model: {tokenized_out}")
print(f"[Num Function Eval (NFE)={nfe}]")
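The NFE value printed above can be turned into an average number of tokens per forward pass, which is a convenient way to compare the three modes on the same prompt. This is a minimal sketch that assumes `nfe` counts model forward passes and that the returned sequence includes the prompt (as the slicing in the snippet above suggests); the helper name `tokens_per_forward` is hypothetical.

```python
def tokens_per_forward(total_len, prompt_len, nfe):
    """Average number of newly generated tokens per function evaluation.

    total_len:  length of the returned sequence, e.g. out_ids.shape[1]
    prompt_len: length of the prompt, e.g. prompt_ids.shape[1]
    nfe:        number of function evaluations reported by the generate call
    """
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    new_tokens = total_len - prompt_len
    if new_tokens < 0:
        raise ValueError("total_len must be at least prompt_len")
    return new_tokens / nfe


# Example with made-up numbers: 100 new tokens produced in 25 evaluations.
print(tokens_per_forward(total_len=140, prompt_len=40, nfe=25))  # 4.0
```

With the variables from the chat snippet, the call would be `tokens_per_forward(out_ids.shape[1], prompt_ids.shape[1], nfe)`. Note that the count may include any tokens emitted after the end-of-sequence token if the returned sequence is padded.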

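Inference with Linear Self-Speculation + LoRA-Enhanced Drafter

In linear self-speculation mode, an optional LoRA adapter can be applied to the diffusion drafter to further increase the acceptance length: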

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

repo = "nvidia/Nemotron-Labs-Diffusion-14B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Attach the linear_spec LoRA adapter.
model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval()
# Unwrap so we can call linear_spec_generate directly (it toggles LoRA internally).
base = model.model

history = [{"role": "user", "content": "Solve: What is 15% of 240?"}]
prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

out_ids, nfe = base.linear_spec_generate(
    prompt_ids, max_new_tokens=512, block_length=32,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True))
print(f"[NFE={nfe}]")
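To see whether the LoRA drafter helps on a given prompt, one option is to time the generate call with and without the adapter and compare tokens per second alongside NFE. The helper below is a hedged sketch, not part of the model's API: `measure_throughput` is a hypothetical name, `generate_fn` is any zero-argument callable that returns `(total_length, nfe)`, and `sync` is an optional callable such as `torch.cuda.synchronize` to make GPU timings meaningful.

```python
import time


def measure_throughput(generate_fn, prompt_len, sync=None):
    """Time one generation call and report tokens/sec and NFE.

    generate_fn: zero-argument callable returning (total_length, nfe)
    prompt_len:  number of prompt tokens, excluded from the token count
    sync:        optional callable run before and after timing
                 (for example torch.cuda.synchronize on a GPU)
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    total_len, nfe = generate_fn()
    if sync is not None:
        sync()
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero

    new_tokens = total_len - prompt_len
    return {
        "new_tokens": new_tokens,
        "nfe": nfe,
        "seconds": elapsed,
        "tok_per_sec": new_tokens / elapsed,
    }


# Stand-in generator so the sketch runs without a GPU or model weights.
def fake_generate():
    time.sleep(0.01)
    return 100, 20  # (total sequence length, nfe)


stats = measure_throughput(fake_generate, prompt_len=40)
print(stats["new_tokens"], stats["nfe"])
```

With the snippet above, `generate_fn` could wrap `base.linear_spec_generate(...)` and return `(out_ids.shape[1], nfe)`. A single run is noisy, so a warm-up call and several repetitions are advisable before drawing conclusions.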

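Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets the requirements of the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns here.

Citation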

@techreport{fu2026nemotronlabsdiffusion,
  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
  institution = {NVIDIA},
  year        = {2026},
  note        = {Technical report}
}