Qwen3-235B-A22B-Thinking-2507

Highlights

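Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of its reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, with the following key enhancements: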

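Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving state-of-the-art results among open-source thinking models.
Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
Enhanced 256K long-context understanding.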

Note: This version has an increased thinking length. We strongly recommend using it for highly complex reasoning tasks.

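Model Overview

Qwen3-235B-A22B-Thinking-2507 has the following features:

Type: Causal Language Models
Training Stage: Pretraining & Post-training
Number of Parameters: 235B in total and 22B activated
Number of Parameters (Non-Embedding): 234B
Number of Layers: 94
Number of Attention Heads (GQA): 64 for Q and 4 for KV
Number of Experts: 128
Number of Activated Experts: 8
Context Length: 262,144 tokens natively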
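
As a quick sanity check, the ratios implied by the figures above can be computed directly. The sketch below is plain arithmetic on the listed values; the variable names are illustrative and do not correspond to any Qwen or transformers API.

```python
# Values copied from the model overview; names are illustrative only.
TOTAL_PARAMS_B = 235    # total parameters, in billions
ACTIVE_PARAMS_B = 22    # activated parameters, in billions
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 8
Q_HEADS = 64
KV_HEADS = 4

# Share of parameters that are activated.
active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
# Share of experts that are activated.
active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS
# With GQA, several query heads share one key/value head.
q_heads_per_kv_head = Q_HEADS // KV_HEADS

print(f"activated parameters: {active_param_fraction:.1%}")
print(f"activated experts:    {active_expert_fraction:.2%}")
print(f"Q heads per KV head:  {q_heads_per_kv_head}")
```

Roughly a tenth of the parameters are activated at a time, which is why the active size (22B) is a better guide to per-token compute than the total size (235B).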

Note: This model supports only thinking mode.

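Additionally, to enforce model thinking, the default chat template automatically includes <think>. It is therefore normal for the model's output to contain only </think> without an explicit opening <think> tag.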
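
Because the opening tag is supplied by the chat template, decoded output typically carries only the closing </think> tag. A minimal sketch of splitting such text into its thinking part and its final answer follows; `split_thinking` is a hypothetical helper, and it assumes the decoded string still contains the literal closing tag.

```python
def split_thinking(decoded: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (thinking, answer) at the last closing tag.

    Assumption: if the closing tag is missing, the whole text is returned
    as the answer and the thinking part is left empty.
    """
    thinking, sep, answer = decoded.rpartition(close_tag)
    if not sep:
        # rpartition found no tag; everything ended up in `answer`.
        return "", decoded.strip()
    return thinking.strip(), answer.strip()


if __name__ == "__main__":
    sample = "Let me reason step by step...\n</think>\n\nThe answer is 4."
    thinking, answer = split_thinking(sample)
    print("thinking:", thinking)
    print("answer:", answer)
```

Splitting at the last occurrence keeps any closing tag quoted inside the reasoning itself on the thinking side.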

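For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and documentation.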

Performance

	Deepseek-R1-0528	OpenAI O4-mini	OpenAI O3	Gemini-2.5 Pro	Claude4 Opus Thinking	Qwen3-235B-A22B Thinking	Qwen3-235B-A22B-Thinking-2507
Knowledge
MMLU-Pro	85.0	81.9	85.9	85.6	-	82.8	84.4
MMLU-Redux	93.4	92.8	94.9	94.4	94.6	92.7	93.8
GPQA	81.0	81.4*	83.3*	86.4	79.6	71.1	81.1
SuperGPQA	61.7	56.4	-	62.3	-	60.7	64.9
Reasoning
AIME25	87.5	92.7*	88.9*	88.0	75.5	81.5	92.3
HMMT25	79.4	66.7	77.5	82.5	58.3	62.5	83.9
LiveBench 20241125	74.7	75.8	78.3	82.4	78.2	77.1	78.4
HLE	17.7#	18.1*	20.3	21.6	10.7	11.8#	18.2#
Coding
LiveCodeBench v6 (25.02-25.05)	68.7	71.8	58.6	72.5	48.9	55.7	74.1
CFEval	2099	1929	2043	2001	-	2056	2134
OJBench	33.6	33.3	25.4	38.9	-	25.6	32.5
Alignment
IFEval	79.1	92.4	92.1	90.8	89.7	83.4	87.8
Arena-Hard v2$	72.2	59.3	80.8	72.5	59.1	61.5	79.7
Creative Writing v3	86.3	78.8	87.7	85.9	83.8	84.6	86.1
WritingBench	83.2	78.4	85.3	83.1	79.1	80.3	88.3
Agent
BFCL-v3	63.8	67.2	72.4	67.2	61.8	70.8	71.9
TAU2-Retail	64.9	71.0	76.3	71.3	-	40.4	71.9
TAU2-Airline	60.0	59.0	70.0	60.0	-	30.0	58.0
TAU2-Telecom	33.3	42.0	60.5	37.4	-	21.9	45.6
Multilingualism
MultiIF	63.5	78.0	80.3	77.8	-	71.9	80.6
MMLU-ProX	80.6	79.0	83.3	84.7	-	80.0	81.0
INCLUDE	79.4	80.8	86.6	85.1	-	78.7	81.0
PolyMATH	46.9	48.7	49.7	52.2	-	54.7	60.1

* For OpenAI O4-mini and O3, we use medium reasoning effort, except for scores marked with *, which were generated using high reasoning effort.

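# Following the official evaluation criteria of HLE, scores marked with # are for models that are not multimodal and were evaluated only on the text-only subset.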

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

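& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768 tokens.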
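
The output-length rule in the last footnote can be written as a small lookup. This is only a sketch of that rule: `output_budget` is a hypothetical helper, and the category strings are stand-in labels for the table's reasoning and coding sections, not names used by any evaluation harness.

```python
# Output lengths taken from the footnote on highly challenging tasks.
LONG_BUDGET = 81_920
DEFAULT_BUDGET = 32_768

# Hypothetical labels: all reasoning and coding tasks get the long budget,
# and PolyMATH is singled out by name even though it sits in another section.
LONG_BUDGET_CATEGORIES = {"Reasoning", "Coding"}
LONG_BUDGET_BENCHMARKS = {"PolyMATH"}


def output_budget(category: str, benchmark: str) -> int:
    """Return the output length (in tokens) used for a given benchmark."""
    if category in LONG_BUDGET_CATEGORIES or benchmark in LONG_BUDGET_BENCHMARKS:
        return LONG_BUDGET
    return DEFAULT_BUDGET


if __name__ == "__main__":
    for category, benchmark in [
        ("Reasoning", "AIME25"),
        ("Multilingualism", "PolyMATH"),
        ("Knowledge", "MMLU-Pro"),
    ]:
        print(benchmark, output_budget(category, benchmark))
```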

Quickstart

The code for Qwen3-MoE has been integrated into the latest Hugging Face transformers, and we advise you to use the latest version of transformers.

With transformers<4.51.0, you will encounter the following error:

KeyError: 'qwen3_moe'
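
To catch this before loading the model, the installed version can be compared against the 4.51.0 threshold. The sketch below uses only the standard library; `release_tuple`, `meets_minimum`, and `check_transformers` are hypothetical helpers, and the comparison deliberately ignores pre-release suffixes such as `.dev0` or `rc1`.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.51.0"  # threshold stated in the text above


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '4.52.0.dev0' -> (4, 52, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Compare release tuples, zero-padding so that '4.51' equals '4.51.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers() -> None:
    """Print a hint if transformers is missing or older than the threshold."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
        return
    if meets_minimum(installed):
        print(f"transformers {installed} is recent enough")
    else:
        print(f"transformers {installed} is older than {MIN_TRANSFORMERS}; "
              "expect KeyError: 'qwen3_moe'")


if __name__ == "__main__":
    check_transformers()
```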

The following code snippet illustrates how to use the model to generate content from given inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)

Note: Includes Unsloth chat template fixes!
For llama.cpp, use the --jinja flag.

Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods.