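Introduction

Nanbeige4.1-3B is built on Nanbeige4-3B-Base and is an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511. It is obtained through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B shows that a compact model can combine strong reasoning, preference alignment, and effective agentic behavior.

Specifically, Nanbeige4.1-3B has the following key strengths:

Strong reasoning: Nanbeige4.1-3B solves complex multi-step problems through sustained, coherent reasoning within a single forward pass, and reliably produces correct final answers on challenging tasks such as LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
Robust preference alignment: Nanbeige4.1-3B achieves strong alignment performance. It outperforms same-scale models such as Qwen3-4B-2507 and Nanbeige4-3B-2511, and on Arena-Hard-v2 and Multi-Challenge it also clearly surpasses larger models such as Qwen3-30B-A3B and Qwen3-32B.
Agentic capability: Nanbeige4.1-3B is the first general-purpose small model to natively support deep-search tasks, and it reliably sustains complex problem solving across more than 500 rounds of tool calls. It fills a long-standing gap in the small-model ecosystem: earlier small models were usually optimized either for general reasoning or for agentic scenarios, but rarely did well at both.

Technical Report: Link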

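Performance

We evaluated Nanbeige4.1-3B on a broad and diverse set of benchmarks covering general reasoning and deep-search capabilities.

General Reasoning Tasks

On general reasoning benchmarks spanning code, math, science, alignment, and tool use, Nanbeige4.1-3B not only significantly outperforms same-scale models such as Qwen3-4B, but also surpasses larger models such as Qwen3-30B-A3B-2507 and Qwen3-32B in overall performance.

Benchmark	Qwen3-4B-2507	Qwen3-8B	Qwen3-14B	Qwen3-32B	Qwen3-30B-A3B-2507	Nanbeige4-3B-2511	Nanbeige4.1-3B
Code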
Live-Code-Bench-V6	57.4	49.4	55.9	55.7	66.0	46.0	76.9
Live-Code-Bench-Pro-Easy	40.2	41.2	33.0	42.3	60.8	40.2	81.4
Live-Code-Bench-Pro-Medium	5.3	3.5	1.8	3.5	3.5	5.3	28.1
Math
AIME 2026 I	81.46	70.42	76.46	75.83	87.30	84.1	87.40
HMMT Nov	68.33	48.33	56.67	57.08	71.25	66.67	77.92
IMO-Answer-Bench	48.00	36.56	41.81	43.94	54.34	38.25	53.38
Science
GPQA	65.8	62.0	63.38	68.4	73.4	82.2	83.8
HLE (Text-only)	6.72	5.28	7.00	9.31	11.77	10.98	12.60
Alignment
Arena-Hard-v2	34.9	26.3	36.9	56.0	60.2	60.0	73.2
Multi-Challenge	41.14	36.30	36.97	38.72	49.40	41.20	52.21
Tool Use
BFCL-V4	44.87	42.20	45.14	47.90	48.6	53.8	56.50
Tau2-Bench	45.9	42.06	44.96	45.26	47.70	41.77	48.57
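
The "overall performance" claim can be checked with simple arithmetic on the table. The sketch below copies three columns from the table above and counts on how many of the 12 benchmarks Nanbeige4.1-3B scores higher than each larger model. The helper name `count_wins` is introduced here only for illustration.

```python
# Scores copied from the general reasoning table, in the order:
# (Qwen3-32B, Qwen3-30B-A3B-2507, Nanbeige4.1-3B)
SCORES = {
    "Live-Code-Bench-V6": (55.7, 66.0, 76.9),
    "Live-Code-Bench-Pro-Easy": (42.3, 60.8, 81.4),
    "Live-Code-Bench-Pro-Medium": (3.5, 3.5, 28.1),
    "AIME 2026 I": (75.83, 87.30, 87.40),
    "HMMT Nov": (57.08, 71.25, 77.92),
    "IMO-Answer-Bench": (43.94, 54.34, 53.38),
    "GPQA": (68.4, 73.4, 83.8),
    "HLE (Text-only)": (9.31, 11.77, 12.60),
    "Arena-Hard-v2": (56.0, 60.2, 73.2),
    "Multi-Challenge": (38.72, 49.40, 52.21),
    "BFCL-V4": (47.90, 48.6, 56.50),
    "Tau2-Bench": (45.26, 47.70, 48.57),
}


def count_wins(baseline_index: int) -> int:
    """Count benchmarks where Nanbeige4.1-3B (index 2) beats the given baseline column."""
    return sum(1 for row in SCORES.values() if row[2] > row[baseline_index])


print("vs Qwen3-32B:", count_wins(0), "of", len(SCORES))           # 12 of 12
print("vs Qwen3-30B-A3B-2507:", count_wins(1), "of", len(SCORES))  # 11 of 12
```

The single exception is IMO-Answer-Bench, where Qwen3-30B-A3B-2507 scores 54.34 against 53.38.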

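Deep Search Tasks

As a general-purpose small model, Nanbeige4.1-3B achieves deep-search performance comparable to specialized search agents with fewer than 10B parameters. Existing small general-purpose models typically show little to no deep-search capability, so Nanbeige4.1-3B represents a qualitative leap over them.

Deep Search and Agent Benchmarks

Model	xBench-DeepSearch-2505	xBench-DeepSearch-2510	Browse-Comp	Browse-Comp-ZH	GAIA (Text-only)	HLE	SEAL-0
Small search-specialized agents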
MiroThinker-v1.0-8B	61	–	31.1	40.2	66.4	21.5	40.4
AgentCPM-Explore-4B	70	–	25.0	29.0	63.9	19.1	40.0
Large foundation models (with tools)
GLM-4.6-357B	70	–	45.1	49.5	71.9	30.4	–
Minimax-M2-230B	72	–	44.0	48.5	75.7	31.8	–
DeepSeek-V3.2-671B	71	–	67.6	65.0	63.5	40.8	38.5
Small foundation models (with tools)
Qwen3-4B-2507	34	5	1.57	7.92	28.33	11.13	15.74
Qwen3-8B	31	2	0.79	5.15	19.53	10.24	6.34
Qwen3-14B	34	9	2.36	7.11	30.23	10.17	12.64
Qwen3-32B	39	8	3.15	7.34	30.17	9.26	8.15
Qwen3-30B-A3B-2507	25	10	1.57	4.12	31.63	14.81	9.24
Ours (with tools)
Nanbeige4-3B-2511	33	11	0.79	3.09	19.42	13.89	12.61
Nanbeige4.1-3B	75	39	19.12	31.83	69.90	22.29	41.44

Quick Start

For inference hyperparameters, we recommend the following settings:

Temperature: 0.6
Top-p: 0.95
Repeat penalty: 1.0
Max New Tokens: 131072
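
One way to apply these settings with Hugging Face `transformers` is to collect them as keyword arguments for `model.generate`. This is a minimal sketch, not an official configuration. It assumes that "repeat penalty" maps to the `repetition_penalty` argument and that sampling is enabled with `do_sample=True`, which `transformers` requires for temperature and top-p to take effect. The helper `generation_kwargs` is a name introduced here for illustration.

```python
# Recommended sampling settings expressed as keyword arguments for
# Hugging Face `model.generate`. Mapping "repeat penalty" to
# `repetition_penalty` and enabling `do_sample` are assumptions of this sketch.
RECOMMENDED_GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.0,
    "max_new_tokens": 131072,
}


def generation_kwargs(max_new_tokens=None):
    """Return a copy of the recommended settings, optionally with a smaller token budget."""
    kwargs = dict(RECOMMENDED_GENERATION_KWARGS)
    if max_new_tokens is not None:
        # Never exceed the recommended maximum of 131072 new tokens.
        kwargs["max_new_tokens"] = min(max_new_tokens, kwargs["max_new_tokens"])
    return kwargs


print(generation_kwargs(max_new_tokens=4096))
```

With a loaded model, the result could be passed as `model.generate(input_ids, **generation_kwargs(4096), eos_token_id=166101)`. A smaller token budget is useful for quick tests, since the full 131072-token limit is intended for long reasoning traces.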

For chat scenarios:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  use_fast=False,
  trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  torch_dtype='auto',
  device_map='auto',
  trust_remote_code=True
)
messages = [
  {'role': 'user', 'content': 'Which number is bigger, 9.11 or 9.8?'}
]
# Render the conversation into a prompt string with the model's chat template.
prompt = tokenizer.apply_chat_template(
  messages,
  add_generation_prompt=True,
  tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids
# Move the inputs to the model's device instead of hard-coding 'cuda'.
output_ids = model.generate(input_ids.to(model.device), eos_token_id=166101)
# Decode only the newly generated tokens, skipping the prompt.
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)

For tool use scenarios:

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  use_fast=False,
  trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
  'Nanbeige/Nanbeige4.1-3B',
  torch_dtype='auto',
  device_map='auto',
  trust_remote_code=True
)
messages = [
    {'role': 'user',  'content': 'Help me check the weather in Beijing now'}
]
# Tool schema: 'required' is a sibling of 'properties' and lists the mandatory arguments.
tools = [{'type': 'function',
  'function': {'name': 'SearchWeather',
   'description': 'Find out the current weather in a place on a certain day.',
   'parameters': {'type': 'dict',
    'properties': {'location': {'type': 'string',
      'description': 'A city in China.'}},
    'required': ['location']}}}]
# Pass the tool definitions by keyword so the chat template can render them into the prompt.
prompt = tokenizer.apply_chat_template(
  messages,
  tools=tools,
  add_generation_prompt=True,
  tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids
# Move the inputs to the model's device instead of hard-coding 'cuda'.
output_ids = model.generate(input_ids.to(model.device), max_new_tokens=512, eos_token_id=166101)
# Decode only the newly generated tokens, skipping the prompt.
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
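
After the model emits a tool call, the application has to run the tool and return the result in the next turn. The sketch below covers only that dispatch step. It assumes the call has already been parsed into a dict with `name` and `arguments` keys (the model's raw output format is not specified here), uses a stub `search_weather` in place of a real weather service, and wraps the result as a `'tool'` role message, a common chat-template convention that this model's template may or may not follow. All helper names are hypothetical.

```python
import json


def search_weather(location: str) -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"location": location, "weather": "unknown (stub)"}


# Maps tool names from the schema to local Python callables.
TOOL_REGISTRY = {"SearchWeather": search_weather}


def run_tool_call(call: dict) -> dict:
    """Execute one parsed tool call and wrap the result as a tool message."""
    name = call["name"]
    arguments = call.get("arguments", {})
    # Some runtimes deliver arguments as a JSON string rather than a dict.
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    if name not in TOOL_REGISTRY:
        result = {"error": f"unknown tool: {name}"}
    else:
        result = TOOL_REGISTRY[name](**arguments)
    return {"role": "tool", "name": name, "content": json.dumps(result, ensure_ascii=False)}


message = run_tool_call({"name": "SearchWeather", "arguments": {"location": "Beijing"}})
print(message)
```

The returned message can be appended to `messages` before calling `apply_chat_template` again, so that the model sees the tool result in the next round.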

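For deep-search scenarios:

Inference framework: miroflow-framework
Switch the tokenizer configuration to tokenizer_config_search.json
Tool configuration:

Server	Description	Tools provided
tool-python	Execution environment and file management (E2B sandbox)	create_sandbox, run_command, run_python_code, upload_file_from_local_to_sandbox, download_file_from_sandbox_to_local, download_file_from_internet_to_sandbox
search_and_scrape_webpage	Google search via the Serper API	google_search
jina_scrape_llm_summary	Web scraping with Jina plus LLM-based information extraction	scrape_and_extract_info

Summary model: Qwen3-14B-thinking
Temperature: 1.0
Note: Access to HuggingFace is explicitly disabled in these tools.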
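
The setup above says to switch the tokenizer configuration but does not say how. One possible approach, assuming a local copy of the model directory that contains both files, is to overwrite `tokenizer_config.json` with `tokenizer_config_search.json` after backing up the original. The helper name and the `.bak` backup file are hypothetical choices for this sketch.

```python
import shutil
from pathlib import Path


def use_search_tokenizer_config(model_dir: str) -> Path:
    """Activate tokenizer_config_search.json in a local model directory.

    Assumption: the switch is done by replacing tokenizer_config.json
    in a local copy of the model files.
    """
    model_path = Path(model_dir)
    search_cfg = model_path / "tokenizer_config_search.json"
    active_cfg = model_path / "tokenizer_config.json"
    backup_cfg = model_path / "tokenizer_config.json.bak"

    if not search_cfg.is_file():
        raise FileNotFoundError(f"missing {search_cfg}")
    # Keep a one-time backup of the default config so the switch is reversible.
    if active_cfg.is_file() and not backup_cfg.exists():
        shutil.copy2(active_cfg, backup_cfg)
    shutil.copy2(search_cfg, active_cfg)
    return active_cfg
```

Loading the tokenizer from that directory with `AutoTokenizer.from_pretrained(model_dir, ...)` would then pick up the search configuration. Copying the `.bak` file back restores the default chat setup.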

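Limitations

Although we place great emphasis on safety during training and strive to ensure that the model's outputs meet ethical and legal requirements, the model's size and probabilistic nature mean that it may not fully avoid generating unexpected outputs. These outputs may include harmful content such as bias or discrimination. Please do not propagate such content. We do not assume any responsibility for consequences resulting from the dissemination of inappropriate information.

Citation

If you find our model useful or want to use it in your projects, please cite it as follows: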

@misc{yang2026nanbeige413bsmallgeneralmodel,
      title={Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts}, 
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Xiyun Xu and Yang Song and Yiming Jia and Yuntao Wen and Yunzhi Xu and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2026},
      eprint={2602.13367},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13367}, 
}

Contact

If you have any questions, please open an issue or contact us at nanbeige@kanzhun.com.
