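Nemotron-Labs-Diffusion is a tri-mode language model that supports both autoregressive (AR) decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model at inference time. The synergy between the two modes gives rise to a third mode, self-speculation, in which the same model performs diffusion-based parallel drafting and AR verification over a shared KV cache, achieving high acceptance length and high decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can stay efficient across deployment scenarios and concurrency levels.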
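To make "acceptance length" concrete, here is a minimal sketch of the greedy draft-and-verify rule commonly used in speculative decoding. It is an illustration under stated assumptions, not the model's actual implementation: the function names `accepted_prefix_length` and `speculative_step` and the token ids are hypothetical, and the model card does not confirm that Nemotron-Labs-Diffusion uses exactly this acceptance rule.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that match the verifier's greedy choices."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


def speculative_step(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep the matching prefix, then take the verifier's token at the first mismatch.

    Assumption: verified[i] is the verifier's greedy token for position i given
    the prompt plus draft[:i], so `verified` has len(draft) + 1 entries.
    """
    n = accepted_prefix_length(draft, verified)
    return list(draft[:n]) + [verified[n]]


# Hypothetical token ids, chosen only for illustration.
draft = [11, 42, 7, 99, 5]        # block proposed in parallel by the drafter
verified = [11, 42, 7, 13, 5, 8]  # AR predictions for the same positions

step_tokens = speculative_step(draft, verified)
print(accepted_prefix_length(draft, verified))  # 3
print(step_tokens)                              # [11, 42, 7, 13]
```

In this toy step, one verification pass yields four tokens instead of one. A longer accepted prefix means fewer forward passes per generated token, which is what a high acceptance length buys.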
Use of this model is governed by the NVIDIA Nemotron Open Model License.
transformers>=5.0.0

from transformers import AutoModel, AutoTokenizer
import torch
repo_name = "nvidia/Nemotron-Labs-Diffusion-14B"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)
history = []
user_input = input("User: ").strip()
history.append({"role": "user", "content": user_input})
prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
## Chat in AR Mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)
## Chat in dLM Mode
out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)
## Chat in Linear Self-Speculation Mode
out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)
tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
print(f"Model: {tokenized_out}")
print(f"[Num Function Eval (NFE)={nfe}]")

In linear self-speculation mode, an optional LoRA adapter can be applied to the diffusion draft model to further increase the acceptance length:
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
repo = "nvidia/Nemotron-Labs-Diffusion-14B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)
# Attach the linear_spec LoRA adapter.
model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval()
# Unwrap so we can call linear_spec_generate directly (it toggles LoRA internally).
base = model.model
history = [{"role": "user", "content": "Solve: What is 15% of 240?"}]
prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
out_ids, nfe = base.linear_spec_generate(
prompt_ids, max_new_tokens=512, block_length=32,
eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True))
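print(f"[NFE={nfe}]")

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to support the development of a wide range of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure that the model meets the requirements of the relevant industry and use case and to address unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy subcards.
Please report model quality issues, risks, security vulnerabilities, or NVIDIA AI concerns here.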
@techreport{fu2026nemotronlabsdiffusion,
title = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
author = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
institution = {NVIDIA},
year = {2026},
note = {Technical report}
}