Qwen3-8B-MLX-4bit

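Ascend NPU Adaptation and Verification Report

This repository contains the Qwen3-8B model in MLX 4-bit quantized format (bits=4, group_size=128), optimized for Apple Silicon (the MLX framework).

To deploy it on Huawei Ascend NPUs with vLLM-Ascend, first convert the weights from the MLX 4-bit format to the standard HuggingFace BF16 format.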

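Environment

| Component | Details |
| --- | --- |
| Operating system | aarch64 Linux (Kunpeng 920) |
| NPU devices | 2× Ascend 910B2 (Ascend910_9362, 64GB HBM) |
| CANN version | 8.5.1 |
| PyTorch | 2.9.0+cpu |
| torch_npu | 2.9.0.post1 |
| vLLM / vLLM-Ascend | 0.18.0 / 0.18.0rc1 |
| Transformers | 4.57.6 |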

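Inference Accuracy Comparison (NPU vs CPU)

Baseline: PyTorch on CPU (float32) serves as the debugging baseline. No official GPU/CPU baseline exists for the MLX build of Qwen3-8B, so this comparison checks numerical consistency between the NPU and a standard PyTorch inference stack. The CPU comparison is for debugging only and is not a final accuracy-alignment conclusion.

Test setup

| Parameter | Value |
| --- | --- |
| Decoding | Greedy (do_sample=False, temperature=0.0) |
| NPU precision | bfloat16 (torch_npu) |
| CPU precision | float32 (PyTorch) |
| Test prompts | 2 prompts, covering Q&A and knowledge scenarios |
| Generation length | 5 tokens per prompt (for exact logits-level comparison) |
| Vocabulary size | 151,936 |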

Numerical error statistics

| Metric | Prompt logits (last token) | Generated logits (10 steps: 2 prompts × 5 tokens) |
| --- | --- | --- |
| Max abs diff | 0.301065 | 0.419620 |
| Mean abs diff | 0.063890 | 0.041722 |
| RMSE | 0.074515 | 0.062187 |
| Cosine similarity | 0.999749 | 0.999823 |
| Token prediction argmax match rate | n/a | 100.0% (10/10) |
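
The metrics in this table can be reproduced from two logit vectors with a few lines of arithmetic. The following is a minimal standard-library sketch, not the contents of `accuracy_test.py`; the function name `compare_logits` and the toy vectors are assumptions made for illustration.

```python
import math


def compare_logits(npu, cpu):
    """Compare two equal-length logit vectors and return the summary metrics."""
    if len(npu) != len(cpu) or not npu:
        raise ValueError("logit vectors must be non-empty and the same length")
    diffs = [abs(a - b) for a, b in zip(npu, cpu)]
    dot = sum(a * b for a, b in zip(npu, cpu))
    norm_npu = math.sqrt(sum(a * a for a in npu))
    norm_cpu = math.sqrt(sum(b * b for b in cpu))
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "rmse": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "cosine_similarity": dot / (norm_npu * norm_cpu),
        # Greedy decoding picks the argmax, so this decides the token match.
        "argmax_match": npu.index(max(npu)) == cpu.index(max(cpu)),
    }


# Toy 4-entry "vocabulary"; real vectors would have 151,936 entries.
metrics = compare_logits([2.0, 0.5, -1.0, 0.0], [2.1, 0.4, -1.0, 0.1])
print(metrics)
```

In a real run the two vectors would come from the NPU (bfloat16) and CPU (float32) forward passes at the same decoding step, cast to a common dtype before comparison.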

Per-token comparison

| Prompt | Step | NPU token ID | CPU token ID | Max logit diff | Argmax match |
| --- | --- | --- | --- | --- | --- |
| "Hello, what is 2+3?" | 1 | 7281 (▁Also) | 7281 | 0.210984 | ✓ |
| | 2 | 11 (,) | 11 | 0.171724 | ✓ |
| | 3 | 1128 (▁what) | 1128 | 0.220845 | ✓ |
| | 4 | 374 (▁is) | 374 | 0.115605 | ✓ |
| | 5 | 220 (▁) | 220 | 0.101952 | ✓ |
| "The capital of France is" | 1 | 12095 (▁Paris) | 12095 | 0.301065 | ✓ |
| | 2 | 13 (.) | 13 | 0.191311 | ✓ |
| | 3 | 576 (▁The) | 576 | 0.419620 | ✓ |
| | 4 | 6722 (▁capital) | 6722 | 0.293782 | ✓ |
| | 5 | 315 (▁of) | 315 | 0.236840 | ✓ |

Generated text comparison

| Prompt | NPU (bf16) output | CPU (fp32) output |
| --- | --- | --- |
| "Hello, what is 2+3?" | `" Also, what is "` | `" Also, what is "` |
| "The capital of France is" | `" Paris. The capital of"` | `" Paris. The capital of"` |

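Accuracy conclusions

- Token prediction match rate: 100.0%. NPU and CPU outputs are identical under greedy decoding.
- Cosine similarity: 0.999749 (prompt logits) and 0.999823 (generated logits). The logit distributions are highly consistent.
- Mean relative error: ~0.0004%, far below the 1% acceptance threshold.
- Error source: the NPU runs in bfloat16 (7 mantissa bits) while the CPU runs in float32 (23 mantissa bits). This is the expected precision-truncation difference and does not affect generation quality.
- Final verdict: ✅ PASS. NPU inference accuracy is consistent with the standard PyTorch inference stack.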
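
To make the mantissa-width argument concrete, the sketch below rounds a float32 value to bfloat16 by keeping only the top 16 bits of its bit pattern. It is an illustration using the standard library only; the helper name `to_bfloat16` is an assumption, and NaN/infinity handling is deliberately left out.

```python
import struct


def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # Add a rounding bias, then clear the 16 low mantissa bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) & 0xFFFFFFFF) >> 16 << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 3.14159, 12.5, -7.3):
    rounded = to_bfloat16(value)
    print(value, rounded, abs(rounded - value) / abs(value))
```

With 7 stored mantissa bits, the relative rounding error of a single value is bounded by 2^-8 (about 0.4%), which is why per-logit differences on the order of 0.1 are unsurprising for logits of typical magnitude.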

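Inference Performance Benchmark

| Input length | Output length | Mean latency (s) | Throughput (tokens/s) |
| --- | --- | --- | --- |
| 32 | 32 | 1.700 | 18.8 |
| 32 | 64 | 3.358 | 19.1 |
| 64 | 32 | 1.684 | 19.0 |
| 64 | 64 | 3.384 | 18.9 |
| 128 | 32 | 1.758 | 18.2 |

Performance analysis:

- Average throughput: ~18.8 tokens/s (Qwen3-8B, single Ascend 910B2 card)
- Latency scales mainly with output length and depends only weakly on input length (prefill accounts for a small share of the time)
- Inference ran in torch_npu eager mode, without vLLM's PagedAttention optimization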
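
The throughput column follows directly from output length divided by mean latency. A short check, using only the figures from the benchmark table (the variable names are illustrative):

```python
# (input_len, output_len, mean_latency_s) copied from the benchmark table.
runs = [
    (32, 32, 1.700),
    (32, 64, 3.358),
    (64, 32, 1.684),
    (64, 64, 3.384),
    (128, 32, 1.758),
]


def throughput(output_len, latency_s):
    """Generated tokens per second for one benchmark run."""
    return output_len / latency_s


per_run = [round(throughput(out, lat), 1) for _, out, lat in runs]
mean_tps = round(sum(per_run) / len(per_run), 1)
print(per_run, mean_tps)
```

Because input length does not enter the formula, doubling the prompt from 32 to 64 tokens leaves throughput essentially unchanged, which matches the observation that prefill time is a small share of the total.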

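Model Conversion Guide

Choosing a conversion method

| Method | Description | Requirements |
| --- | --- | --- |
| A: MLX dequantization | Run the MLX dequantization script on macOS to produce BF16 weights | macOS + MLX |
| B: Download standard weights | Download the standard Qwen3-8B weights directly from ModelScope | Network access (~16GB) |

Method A: MLX dequantization (recommended)

Run on macOS (Apple Silicon):

# 1. Install dependencies
pip install mlx-lm transformers safetensors torch

# 2. Run the dequantization script (included in this repository)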
python dequantize_mlx.py \
    --input /path/to/Qwen3-8B-MLX-4bit \
    --output /path/to/Qwen3-8B-BF16
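
For intuition about what dequantization involves, here is a pure-Python sketch of group-wise affine 4-bit dequantization. It is not the contents of `dequantize_mlx.py`. It assumes an affine scheme (`w = scale * q + bias`, one scale and bias per group) and that eight 4-bit values are packed into each 32-bit word with the lowest bits first; both are assumptions to verify against the MLX documentation, and the function names are hypothetical.

```python
def unpack_4bit(word):
    """Split one 32-bit word into eight 4-bit values, lowest bits first (assumed order)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize_row(packed_words, scales, biases, group_size=128):
    """Rebuild float weights from packed 4-bit values: w = scale * q + bias per group."""
    quantized = [q for word in packed_words for q in unpack_4bit(word)]
    weights = []
    for index, q in enumerate(quantized):
        group = index // group_size  # every group_size weights share one scale and bias
        weights.append(scales[group] * q + biases[group])
    return weights


# Toy example with group_size=8 so that one 32-bit word is exactly one group.
row = dequantize_row([0x76543210], scales=[0.5], biases=[-1.0], group_size=8)
print(row)
```

In practice the conversion would rely on the framework's own routine (MLX exposes `mlx.core.dequantize`) rather than a hand-written loop; the sketch only shows why each group of 128 weights needs its own scale and bias to be restored to BF16.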

Method B: Download standard weights

Run on any machine:

pip install modelscope
python -c "
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen3-8B', cache_dir='./Qwen3-8B-BF16')
print('Download complete:', model_dir)
"

Deploying with vLLM-Ascend (after conversion)

# Install vLLM-Ascend
pip install vllm vllm-ascend

# Start the inference server (Atlas A2/A3, single card)
vllm serve /path/to/Qwen3-8B-BF16 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-8B-BF16",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
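
The same request can be issued from Python. The sketch below builds it with the standard library only; the helper name `build_chat_request` is hypothetical, and the URL and model path simply mirror the curl example. Sending the request is left commented out because it needs the server to be running.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=100):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8000", "/path/to/Qwen3-8B-BF16", "Hello!")

# Sending the request requires the vLLM server to be running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

The `model` value must match the path or name the server was started with, as in the curl example.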

About config.json

The config.json in this repository has had the MLX-specific quantization and quantization_config fields removed, so it can be used directly with the converted BF16 model.
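
If you start from a different MLX config, the same cleanup can be sketched as follows. The helper name `strip_mlx_quantization` and the toy config are assumptions for illustration; a real config.json contains many more fields.

```python
import json


def strip_mlx_quantization(config):
    """Return a copy of a config dict without the MLX-specific quantization keys."""
    return {
        key: value
        for key, value in config.items()
        if key not in ("quantization", "quantization_config")
    }


# Toy config for demonstration only.
mlx_config = {
    "model_type": "qwen3",
    "quantization": {"group_size": 128, "bits": 4},
    "quantization_config": {"group_size": 128, "bits": 4},
}
print(json.dumps(strip_mlx_quantization(mlx_config), indent=2))
```

The function returns a new dict and leaves the input untouched, so the original MLX config can still be written back unchanged if needed.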

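Verification Scripts

This repository provides the following verification scripts and results:

| Path | Description |
| --- | --- |
| `dequantize_mlx.py` | MLX 4-bit → BF16 dequantization script |
| `accuracy_test.py` | NPU vs CPU accuracy-alignment test (with logits-level numerical comparison) |
| `results/` | Verification results directory (accuracy data + performance data) |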

Qwen3 Highlights

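Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built on extensive training, Qwen3 delivers groundbreaking advances in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Seamless switching within a single model between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue), ensuring optimal performance across scenarios.
- Significantly enhanced reasoning, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, for a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models on complex agent-based tasks.
- Support for 100+ languages and dialects, with strong multilingual instruction following and translation.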

Model Overview

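Qwen3-8B has the following features:

- Type: Causal language model
- Training stage: Pretraining and post-training
- Number of parameters: 8.2B
- Number of non-embedding parameters: 6.95B
- Number of layers: 36
- Number of attention heads (GQA): 32 for Q and 8 for KV
- Context length: 32,768 tokens natively, extended to 131,072 tokens with YaRN

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.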

Quickstart

The code for Qwen3 is included in recent releases of transformers (≥4.52.4) and mlx_lm (≥0.25.2); we recommend using the latest versions of both. Older versions (for example, transformers<4.51.0) may raise the following error:

KeyError: 'qwen3'

Install or upgrade both packages:

pip install --upgrade transformers mlx_lm

The following code snippet shows how to use the model to generate content from a given input.

from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3-8B-MLX-4bit")
prompt = "Hello, please introduce yourself and tell me what you can do."

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True
    )

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,
    max_tokens=1024
)

print(response)

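Switching Between Thinking and Non-Thinking Mode

[!TIP] The enable_thinking switch is also available in APIs created by SGLang and vLLM. SGLang and vLLM users should refer to the SGLang and vLLM sections of our documentation, respectively.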

`enable_thinking=True`

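By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model uses its reasoning abilities to improve the quality of its responses. For example, when you explicitly set enable_thinking=True in tokenizer.apply_chat_template, or leave it at its default value, the model engages its thinking mode.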

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)

In this mode, the model generates thinking content wrapped in a <think>...</think> block, followed by the final response.
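
When post-processing the output, the thinking content and the final response usually need to be separated. A minimal sketch, assuming the output is a plain string that contains at most one think block; the helper name `split_thinking` is hypothetical.

```python
def split_thinking(text):
    """Split model output into (thinking, answer) around the closing </think> tag."""
    head, tag, tail = text.rpartition("</think>")
    if not tag:  # no think block present, e.g. in non-thinking mode
        return "", text.strip()
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


thinking, answer = split_thinking("<think>\n2 + 3 = 5\n</think>\n\nThe answer is 5.")
print(repr(thinking), repr(answer))
```

Splitting on the closing tag rather than matching the whole block keeps the helper usable when the opening tag is supplied by the chat template instead of appearing in the generated text.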

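[!NOTE] For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default settings in generation_config.json). Do not use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, see the Best Practices section.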

`enable_thinking=False`

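We provide a hard switch that strictly disables the model's thinking behavior, aligning its functionality with the earlier Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for efficiency.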

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)

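In this mode, the model does not generate any thinking content and does not include a <think>...</think> block.

[!NOTE] For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0. For more detailed guidance, see the Best Practices section.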
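
The two recommended sampling configurations can be kept in one place and selected by mode. This is a sketch only: the dictionary keys (`temperature`, `top_p`, `top_k`, `min_p`) are assumed names, and how they are passed to a given inference framework varies, so check that framework's documentation.

```python
# Recommended sampling settings from this document, keyed by enable_thinking.
SAMPLING_PRESETS = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def sampling_params(enable_thinking=True):
    """Return a fresh copy of the recommended sampling settings for the given mode."""
    return dict(SAMPLING_PRESETS[bool(enable_thinking)])


print(sampling_params(True))
print(sampling_params(False))
```

Returning a copy lets callers add per-request overrides without modifying the shared presets.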

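Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input

We provide a soft-switch mechanism that lets users dynamically control the model's behavior when enable_thinking=True. Specifically, you can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. In multi-turn conversations, the model follows the most recent instruction.

Here is an example of a multi-turn conversation: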

from mlx_lm import load, generate


class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-8B-MLX-4bit"):
        self.model, self.tokenizer = load(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        response = generate(
            self.model,
            self.tokenizer,
            prompt=text,
            verbose=True,
            max_tokens=32768
        )
        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response


# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many 'r's are in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many 'r's are in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")

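[!NOTE] For API compatibility, when enable_thinking=True, the model always outputs a block wrapped in <think>...</think>, regardless of whether the user uses /think or /no_think. However, the content inside this block may be empty if thinking is disabled. When enable_thinking=False, the soft switches have no effect: regardless of any /think or /no_think tags in the user input, the model generates no thinking content and does not include a <think>...</think> block.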
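
The switching rules described here (hard switch first, then the most recent soft-switch tag) can be mirrored in a small helper, for example to decide in application code whether to expect thinking content in the next reply. The model applies these rules itself; the function `effective_thinking` is a hypothetical illustration of the description, not part of any library.

```python
def effective_thinking(messages, enable_thinking=True):
    """Return True if the next reply is expected to contain thinking content."""
    if not enable_thinking:
        return False  # hard switch: soft-switch tags are ignored
    mode = True  # thinking is on by default
    for message in messages:
        if message["role"] not in ("user", "system"):
            continue
        content = message["content"]
        last_think = content.rfind("/think")
        last_no_think = content.rfind("/no_think")
        if max(last_think, last_no_think) >= 0:
            # Within one message the later tag wins; across turns the latest turn wins.
            mode = last_think > last_no_think
    return mode


history = [
    {"role": "user", "content": "How many 'r's are in strawberries?"},
    {"role": "user", "content": "Then, how many 'r's are in blueberries? /no_think"},
]
print(effective_thinking(history))
```

Appending a later turn such as "Really? /think" flips the result back to thinking mode, matching the three-turn example in this section.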

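Agentic Use

Qwen3 excels at tool calling. We recommend using Qwen-Agent to make the most of Qwen3's agentic abilities. Qwen-Agent encapsulates the tool-calling templates and tool-calling parsers internally, which greatly reduces coding complexity.

To define the available tools, you can use an MCP configuration file, use the tools integrated in Qwen-Agent, or integrate other tools yourself.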

from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    "model": "Qwen3-8B-MLX-4bit",

    # Use the endpoint provided by Alibaba Model Studio:
    # "model_type": "qwen_dashscope",
    # "api_key": os.getenv("DASHSCOPE_API_KEY"),

    # Use a custom endpoint compatible with OpenAI API:
    "model_server": "http://localhost:8000/v1",  # api_base
    "api_key": "EMPTY",

    # Other parameters:
    # "generate_cfg": {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     "thought_in_content": True,
    # },
}

# Define Tools
tools = [
    {
        "mcpServers": {  # You can specify the MCP configuration file
            "time": {
                "command": "uvx",
                "args": ["mcp-server-time", "--local-timezone=Asia/Shanghai"],
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"],
            },
        }
    },
    "code_interpreter",  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [
    {
        "role": "user",
        "content": "https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen",
    }
]

for responses in bot.run(messages=messages):
    pass

print(responses)

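Processing Long Texts

Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations whose total length (input plus output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the YaRN method.

YaRN is currently supported by several inference frameworks, for example transformers and llama.cpp for local use, and vllm and sglang for deployment. In general, there are two ways to enable YaRN in a supported framework. The one documented here is:

Modifying the model files: in the config.json file, add the rope_scaling field: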

{
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}

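[!IMPORTANT] If you encounter the following warning
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
please upgrade to transformers>=4.51.0.

[!NOTE] All notable open-source frameworks implement static YaRN, which means the scaling factor stays constant regardless of input length and may affect performance on shorter texts. We recommend adding the rope_scaling configuration only when long contexts need to be processed. We also recommend adjusting factor as needed. For example, if the typical context length for your application is 65,536 tokens, setting factor to 2.0 would be better.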
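
The relationship between the target context length and factor is a simple ratio against the native 32,768-token window. A minimal sketch; the helper name `yarn_rope_scaling` is hypothetical, and rejecting targets above 131,072 tokens is a conservative choice made here because that is the longest length this document reports as validated.

```python
NATIVE_CONTEXT = 32768      # native context length
MAX_VALIDATED = 131072      # longest context validated with YaRN


def yarn_rope_scaling(target_context):
    """Return a rope_scaling dict for target_context, or None if YaRN is unnecessary."""
    if target_context <= NATIVE_CONTEXT:
        return None  # short contexts: leave rope_scaling out of config.json
    if target_context > MAX_VALIDATED:
        raise ValueError("target context exceeds the validated 131,072-token range")
    return {
        "rope_type": "yarn",
        "factor": target_context / NATIVE_CONTEXT,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }


print(yarn_rope_scaling(65536))
print(yarn_rope_scaling(131072))
```

The returned dict has the same shape as the rope_scaling field in config.json, so it can be merged into a loaded config before writing the file back.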

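[!NOTE] The default max_position_embeddings in config.json is set to 40,960. This allocation reserves 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most short-text scenarios. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN, as it may degrade model performance.

[!TIP] The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and requires no extra configuration.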

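Best Practices

To achieve optimal performance, we recommend the following settings:

Sampling parameters:
- For thinking mode (enable_thinking=True), use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0. Do not use greedy decoding, as it can lead to performance degradation and endless repetitions.
- For non-thinking mode (enable_thinking=False), we suggest Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
- For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, a higher value may occasionally cause language mixing and a slight decrease in model performance.
Adequate output length: we recommend an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those in math and programming competitions, we suggest setting the maximum output length to 38,912 tokens. This gives the model enough room to generate detailed and comprehensive responses, which improves overall performance.
Standardized output format: when benchmarking, we recommend using prompts to standardize model outputs.
- Math problems: include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- Multiple-choice questions: add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
No thinking content in history: in multi-turn conversations, the historical model output should include only the final output part and does not need to include the thinking content. This is already implemented in the provided Jinja2 chat template. For frameworks that do not use the Jinja2 chat template directly, developers are responsible for following this practice.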
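
For frameworks that do not apply the Jinja2 chat template, the last practice can be sketched as a preprocessing step over the message history. The helper name `strip_thinking_from_history` is hypothetical, and the regular expression assumes well-formed think blocks.

```python
import re

# Matches one complete think block plus any whitespace that follows it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking_from_history(history):
    """Return a copy of the history with think blocks removed from assistant turns."""
    cleaned = []
    for message in history:
        if message["role"] == "assistant":
            content = THINK_BLOCK.sub("", message["content"]).strip()
            message = {**message, "content": content}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2+3?"},
    {"role": "assistant", "content": "<think>\n2 + 3 = 5\n</think>\n\nThe answer is 5."},
]
print(strip_thinking_from_history(history))
```

Only assistant turns are rewritten; user and system messages pass through unchanged, so soft-switch tags such as /think in user input are preserved.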

Citation

If you find our work helpful, feel free to cite us.

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}, 
}