QwQ-32B

Introduction

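QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ can think and reason, which lets it achieve significantly better performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model and achieves performance competitive with state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.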

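This repository contains the QwQ 32B model, which has the following features:

Type: Causal Language Model
Training Stage: Pretraining and post-training (supervised fine-tuning and reinforcement learning)
Architecture: Transformer with RoPE, SwiGLU, RMSNorm, and attention QKV bias
Number of Parameters: 32.5B
Number of Non-Embedding Parameters: 31.0B
Number of Layers: 64
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Context Length: full 131,072 tokens
- For prompts longer than 8,192 tokens, you must enable YaRN as described in this section.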
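The spec list above supports some quick arithmetic when planning a deployment. The sketch below shows how grouped-query attention (GQA) maps the 40 query heads onto the 8 KV heads, estimates the per-sequence KV-cache size, and applies the 8,192-token YaRN threshold from the note above. It assumes a head dimension of 128, an unquantized 2-byte (bf16/fp16) KV cache, and the common layout where consecutive query heads share a KV head; none of these are stated in this README, and the helper names are hypothetical.

```python
# Spec figures taken from the list above.
NUM_LAYERS = 64
NUM_Q_HEADS = 40
NUM_KV_HEADS = 8
MAX_CONTEXT = 131_072
YARN_THRESHOLD = 8_192  # prompts longer than this require YaRN

# ASSUMPTIONS (not stated in this README): head dimension 128 and a
# 2-byte (bf16/fp16) KV cache with no quantization.
HEAD_DIM = 128
BYTES_PER_VALUE = 2

QUERIES_PER_KV_HEAD = NUM_Q_HEADS // NUM_KV_HEADS  # 40 // 8 = 5


def kv_head_for(query_head: int) -> int:
    """GQA grouping (assumed consecutive layout): each block of 5 query heads shares one KV head."""
    if not 0 <= query_head < NUM_Q_HEADS:
        raise ValueError(f"query head must be in [0, {NUM_Q_HEADS})")
    return query_head // QUERIES_PER_KV_HEAD


def kv_cache_bytes(num_tokens: int, kv_heads: int = NUM_KV_HEADS) -> int:
    """Per-sequence KV cache: keys and values (x2) for every layer, KV head, and head dimension."""
    return 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens


def needs_yarn(prompt_tokens: int) -> bool:
    """True when the prompt exceeds the 8,192-token threshold from the spec note."""
    return prompt_tokens > YARN_THRESHOLD


GIB = 1024 ** 3
print([kv_head_for(q) for q in (0, 4, 5, 39)])                   # [0, 0, 1, 7]
print(kv_cache_bytes(1) // 1024, "KiB per token")                 # 256 KiB per token
print(kv_cache_bytes(MAX_CONTEXT) / GIB, "GiB at full context")   # 32.0 GiB at full context
print(kv_cache_bytes(MAX_CONTEXT, kv_heads=NUM_Q_HEADS) / GIB)    # 160.0 (if every Q head had its own KV head)
print(needs_yarn(8_192), needs_yarn(8_193))                       # False True
```

Under these assumptions, sharing each KV head across 5 query heads shrinks the cache fivefold compared with giving every query head its own keys and values, which matters at the full 131,072-token context.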

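Note: For the best experience, please review the usage guidelines before deploying QwQ models.

You can try our demo or access QwQ models via QwenChat.

For more details, please refer to our blog, GitHub, and documentation.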

Requirements

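QwQ is based on Qwen2.5, whose code has been merged into the latest Hugging Face transformers. We recommend using the latest version of transformers.

With transformers<4.37.0, you will encounter the following error: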

KeyError: 'qwen2'
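To fail fast instead of hitting this KeyError at load time, you could check the installed version first. The following is a minimal standard-library sketch; the helper names `release_tuple` and `supports_qwen2` are hypothetical. For simplicity it ignores pre-release and dev suffixes (so `4.37.0rc1` is treated as `4.37.0`); `packaging.version.Version` implements full PEP 440 ordering if you need it.

```python
import re
from importlib import metadata
from typing import Tuple

MIN_TRANSFORMERS = "4.37.0"  # older versions fail with KeyError: 'qwen2'


def release_tuple(version: str) -> Tuple[int, ...]:
    """Numeric release part of a version string, e.g. '4.49.0.dev0' -> (4, 49, 0).

    Pre-release/dev suffixes are dropped: a simplification of PEP 440 ordering.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_qwen2(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Numeric (not lexical) comparison, padded so '4.37' equals '4.37.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_qwen2(installed) else f"upgrade to >= {MIN_TRANSFORMERS}"
        print(f"transformers {installed}: {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "4.9.0" would sort after "4.37.0".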

Quickstart

The following code snippet uses apply_chat_template to show how to load the tokenizer and model and how to generate content.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the weights in their stored dtype and place them across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and append the
# assistant-turn prefix so the model continues as the assistant.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning traces can be long, so allow a large generation budget.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Drop the prompt tokens from each generated sequence.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
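The list comprehension near the end of the quickstart works because, for a decoder-only model, each row returned by generate() starts with that row's (padded) prompt. Slicing off len(prompt) tokens therefore leaves only the continuation. The pure-Python sketch below reproduces that step on toy token IDs; the IDs and the pad value 0 are made up for illustration and are not real QwQ vocabulary, and `strip_prompt_tokens` is a hypothetical helper name.

```python
from typing import List, Sequence


def strip_prompt_tokens(
    batch_input_ids: Sequence[Sequence[int]],
    batch_output_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Keep only the newly generated tokens for each sequence in a batch."""
    return [
        list(output_ids[len(input_ids):])
        for input_ids, output_ids in zip(batch_input_ids, batch_output_ids)
    ]


# Hypothetical toy batch: both prompts are padded to length 4 (0 stands in
# for a pad token), so every output row begins with its 4-token prompt.
prompts = [[0, 11, 12, 13], [21, 22, 23, 24]]
outputs = [[0, 11, 12, 13, 101, 102], [21, 22, 23, 24, 201, 202]]

print(strip_prompt_tokens(prompts, outputs))  # [[101, 102], [201, 202]]
```

Because padding makes every prompt row the same length, the same slice offset is correct for each row of a batch.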

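Usage Guidelines

To achieve optimal performance, we recommend the following settings:

Enforce thoughtful output: ensure the model starts with "\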
