Qwen3-4B-Thinking-2507-FP8 on vLLM-Ascend 0.18.0rc1

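1. Introduction

This document records the quick-deployment procedure and verification results for Qwen3-4B-Thinking-2507-FP8 on vLLM-Ascend 0.18.0rc1.

Key note: the model's original weights are FP8-quantized, but the current vLLM-Ascend release does not support direct FP8 inference on Ascend NPUs. This document therefore dequantizes the weights from FP8 to BF16 before deployment; the resulting model runs smoothly on a single 32 GB NPU.

Model specifications:

Item	Value
Architecture	`Qwen3ForCausalLM`
Parameters	4B (3.6B non-embedding)
Context length	262,144 tokens (256K)
Quantization format	FP8 `e4m3fn` (block_size=128)
Format after dequantization	BF16
Size after dequantization	~8.8 GB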

2. Verification environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1+gitee7ba04`

NPU: 1 logical card (910B4, 32 GB HBM)
Model path: /opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8-BF16
Service port: 8000

3. FP8→BF16 dequantization

3.1 Download the original weights

# ModelScope
python3 -c "from modelscope import snapshot_download; snapshot_download('Qwen/Qwen3-4B-Thinking-2507-FP8', local_dir='/opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8')"

# Or via a HuggingFace mirror
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-4B-Thinking-2507-FP8 --local-dir /opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8

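3.2 Run the dequantization

The original FP8 checkpoint stores each quantized tensor as a pair: weight (float8_e4m3fn) and weight_scale_inv (float16), with a block_size of [128, 128]. Every 128×128 block of the weight shares one scale value, so the dequantization formula is applied block by block: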

dequantized = weight.float() * weight_scale_inv
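
To make the block-wise broadcast in the formula concrete, here is a toy sketch in pure Python. It uses 2×2 blocks on a 4×4 matrix instead of 128×128 blocks, and the function name `dequantize_blockwise_toy` and all values are illustrative assumptions, not part of the verified script.

```python
def dequantize_blockwise_toy(weight, scale_inv, block):
    """Multiply every element by the scale of the block it falls in."""
    rows, cols = len(weight), len(weight[0])
    return [
        [weight[r][c] * scale_inv[r // block][c // block] for c in range(cols)]
        for r in range(rows)
    ]

# A 4x4 weight split into four 2x2 blocks, with one scale per block.
weight = [
    [1.0, 2.0, 1.0, 2.0],
    [3.0, 4.0, 3.0, 4.0],
    [1.0, 2.0, 1.0, 2.0],
    [3.0, 4.0, 3.0, 4.0],
]
scale_inv = [
    [0.5, 2.0],   # scales for the top-left and top-right blocks
    [1.0, 0.25],  # scales for the bottom-left and bottom-right blocks
]

result = dequantize_blockwise_toy(weight, scale_inv, block=2)
for row in result:
    print(row)
```

The real script reaches the same result with a reshape/permute so that the multiplication is a single broadcast rather than a Python loop.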

The verified dequantization script is as follows:

#!/usr/bin/env python3
import json, os, shutil
import torch
from safetensors.torch import load_file, save_file

SRC_DIR = "/opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8"
DST_DIR = "/opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8-BF16"
BLOCK_SIZE = (128, 128)

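def dequantize_blockwise(weight_fp8, scale_inv, block_size=(128, 128)):
    """Dequantize a 2-D FP8 weight that has one scale per block.

    Assumes both weight dimensions are exact multiples of the block size.
    """
    out_features, in_features = weight_fp8.shape
    out_blocks = out_features // block_size[0]
    in_blocks = in_features // block_size[1]
    # View as (out_blocks, block_rows, in_blocks, block_cols), then move both block indices to the front.
    weight_blocks = weight_fp8.reshape(out_blocks, block_size[0], in_blocks, block_size[1])
    weight_blocks = weight_blocks.permute(0, 2, 1, 3)
    # Broadcast each block's scale over its elements; the product is computed in FP32.
    dequantized = weight_blocks.float() * scale_inv.unsqueeze(-1).unsqueeze(-1)
    # Undo the block layout and restore the (out_features, in_features) shape.
    dequantized = dequantized.permute(0, 2, 1, 3).reshape(out_features, in_features)
    return dequantized.to(torch.bfloat16)

# Copy tokenizer and config files. Skip the weight file and any sub-directories
# (for example a download cache), which shutil.copy2 cannot copy.
os.makedirs(DST_DIR, exist_ok=True)
for fname in os.listdir(SRC_DIR):
    src_path = os.path.join(SRC_DIR, fname)
    if fname != "model.safetensors" and os.path.isfile(src_path):
        shutil.copy2(src_path, os.path.join(DST_DIR, fname))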

with open(os.path.join(DST_DIR, "config.json"), "r") as f:
    config = json.load(f)
if "quantization_config" in config:
    del config["quantization_config"]
config["torch_dtype"] = "bfloat16"
with open(os.path.join(DST_DIR, "config.json"), "w") as f:
    json.dump(config, f, indent=2)

state_dict = load_file(os.path.join(SRC_DIR, "model.safetensors"))
keys_to_remove = []
for key in list(state_dict.keys()):
    if key.endswith("_scale_inv"):
        continue
    if state_dict[key].dtype == torch.float8_e4m3fn:
        scale_key = f"{key}_scale_inv"
        if scale_key in state_dict:
            state_dict[key] = dequantize_blockwise(state_dict[key], state_dict[scale_key], BLOCK_SIZE)
            keys_to_remove.append(scale_key)
for key in keys_to_remove:
    del state_dict[key]

save_file(state_dict, os.path.join(DST_DIR, "model.safetensors"))
print("Dequantization complete!")

Dequantization results:

Metric	Value
Number of FP8 tensors	`252`
Format after dequantization	`bfloat16`
Original size	`5.19 GB`
Size after dequantization	`8.82 GB`
Size increase	`70%`
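
As a quick arithmetic check on the table above, the size increase can be recomputed from the two reported sizes. The helper name `size_growth_percent` is an illustrative assumption.

```python
def size_growth_percent(before_gb: float, after_gb: float) -> float:
    """Relative size increase, in percent, from before_gb to after_gb."""
    return (after_gb / before_gb - 1.0) * 100.0

growth = size_growth_percent(5.19, 8.82)
print(f"size increase: {growth:.1f}%")
```

The increase stays below 100% because only the FP8 tensors double in size; tensors that were not stored as FP8 in the original checkpoint keep their size.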

4. Starting the service

Before starting, check whether the port is already in use:

ss -lntp | grep ':8000 ' || true
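
Where `ss` is not available (for example in a minimal container image), the same check can be sketched in Python with the standard library. The function name `port_in_use` and its defaults are assumptions for illustration; it only tests whether something accepts TCP connections on the port and does not report the owning process.

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    print("port 8000 in use:", port_in_use(8000))
```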

The verified launch command:

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

vllm serve /opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8-BF16 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --served-model-name qwen3-4b-thinking-2507-fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"enable_cpu_binding":true}'

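4.1 Key parameters

Parameter	Value	Description
`--tensor-parallel-size`	1	Tensor-parallel size; a single card is enough for a 4B model
`--max-model-len`	65536	Maximum context length (the model natively supports 256K; 64K is a conservative single-card setting)
`--max-num-seqs`	32	Maximum number of concurrent requests per DP group
`--gpu-memory-utilization`	0.85	Fraction of HBM that vLLM may use
`--compilation-config`	FULL_DECODE_ONLY	Graph compilation mode
`--trust-remote-code`	-	Included in the verified command for the Qwen3 architecture
`VLLM_WORKER_MULTIPROC_METHOD=spawn`	-	Ensures child processes re-initialize the NPU environment correctly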

5. Smoke test

Basic checks:

curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b-thinking-2507-fp8",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "用一句中文说明 TCP 和 UDP 的核心区别。"}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
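
The same chat request can be issued from Python with only the standard library. This is a minimal sketch: `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, and the response is assumed to follow the OpenAI-compatible layout (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumes the service from section 4 is running locally

def build_chat_request(model, user_prompt,
                       system_prompt="You are a concise assistant.",
                       max_tokens=128):
    """Build the POST request that mirrors the curl smoke test."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(request, timeout=60.0):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

request = build_chat_request(
    "qwen3-4b-thinking-2507-fp8",
    "用一句中文说明 TCP 和 UDP 的核心区别。",
)
# print(chat(request))  # uncomment once the server is up
```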

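Results:

/v1/models returns 200 with the correct model name and max_model_len=65536
/v1/chat/completions returns 200 with a well-formed Chinese answer
Thinking (reasoning) mode works: the response contains the `<think>` reasoning process, followed by a formatted final answer.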
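
To inspect the reasoning and the final answer separately, the returned text can be split on the closing tag. This is a sketch under two assumptions: the reasoning ends with `</think>`, and the opening `<think>` may be absent from the returned text when the chat template has already injected it. The helper name `split_thinking` is hypothetical.

```python
def split_thinking(content: str, close_tag: str = "</think>"):
    """Split a completion into (reasoning, answer).

    If the closing tag is missing, the whole text is returned as the answer.
    """
    head, sep, tail = content.partition(close_tag)
    if not sep:
        return "", content.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()

reasoning, answer = split_thinking("TCP is reliable...</think>\n\nTCP 可靠，UDP 不保证送达。")
print(len(reasoning), len(answer))
```

With a small `max_tokens`, generation may stop before the closing tag is emitted; in that case the helper cannot tell reasoning from answer, and the text it returns as the answer is likely unfinished reasoning.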

6. Accuracy verification

verify_accuracy.py compares NPU inference results against a CPU baseline, with both sides running in FP32.

6.1 Running the verification

cd /opt/atomgit/models/qwen3-4b-thinking-2507-fp8
python3 verify_accuracy.py --model-path /opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8-BF16

Verification flow:

CPU baseline inference (FP32)
NPU inference (FP32)
Compute accuracy metrics for the logits and hidden states
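
The metric step of the flow above can be sketched as follows. verify_accuracy.py itself is not shown in this document, so the definitions here are assumptions; in particular `relative_error` is taken as mean absolute error divided by the mean absolute value of the reference, which may differ from the script's definition.

```python
import math

def accuracy_metrics(reference, candidate):
    """Compare two equal-length sequences of floats (e.g. flattened logits)."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    mean_abs = sum(diffs) / len(diffs)
    ref_mean_abs = sum(abs(r) for r in reference) / len(reference)
    dot = sum(r * c for r, c in zip(reference, candidate))
    norms = math.sqrt(sum(r * r for r in reference)) * math.sqrt(sum(c * c for c in candidate))
    return {
        "max_abs_error": max(diffs),
        "mean_abs_error": mean_abs,
        "relative_error": mean_abs / ref_mean_abs,  # assumed definition
        "cosine_similarity": dot / norms,
    }

cpu_logits = [1.0, 2.0, 3.0, 4.0]   # toy stand-in for the CPU baseline
npu_logits = [1.0, 2.0, 3.0, 4.5]   # toy stand-in for the NPU output
metrics = accuracy_metrics(cpu_logits, npu_logits)
print(metrics)
```

A PASS decision would then compare `relative_error` with the threshold, for example `metrics["relative_error"] < 0.01` for a 1.0% threshold.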

6.2 Results

Metric	Logits	Hidden States
max_abs_error	0.000109	0.000710
mean_abs_error	0.000014	0.000006
relative_error	0.0009%	0.0039%
cosine_similarity	1.000000	1.000000
threshold	1.0%	1.0%
Result	PASS	PASS

Conclusion: the NPU output matches the CPU baseline closely in FP32, with cosine_similarity = 1.0 and relative_error < 0.004%, well below the 1.0% threshold. Verification passed.

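7. Performance reference

Test conditions: 128 input tokens / 128 output tokens / 32 prompts / request-rate=4; figures are steady-state values from repeated runs.

7.1 Serve mode (online throughput)

Metric	Value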
`request_throughput`	`3.15 req/s`
`output_token_throughput`	`403.56 tok/s`
`total_token_throughput`	`807.11 tok/s`
`mean_ttft_ms`	`105.49 ms`
`median_ttft_ms`	`99.27 ms`
`p99_ttft_ms`	`156.83 ms`
`mean_tpot_ms`	`18.67 ms`
`median_tpot_ms`	`18.82 ms`
`p99_tpot_ms`	`20.08 ms`
`peak_concurrent_requests`	`15`
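
The throughput figures above can be cross-checked against each other: with a fixed output length, output token throughput should be close to request throughput times output length, and with equal input and output lengths the total should be about twice the output figure. A small sketch, using only the numbers reported in the table (the variable names are illustrative):

```python
OUTPUT_LEN = 128  # tokens per request, from the test conditions

request_throughput = 3.15        # req/s, reported
output_token_throughput = 403.56  # tok/s, reported
total_token_throughput = 807.11   # tok/s, reported

expected_output = request_throughput * OUTPUT_LEN
deviation = abs(output_token_throughput - expected_output) / output_token_throughput
total_ratio = total_token_throughput / output_token_throughput

print(f"expected output throughput: {expected_output:.2f} tok/s")
print(f"deviation from reported:    {deviation:.2%}")
print(f"total / output ratio:       {total_ratio:.3f}")
```

The small deviation is consistent with `request_throughput` being rounded to two decimals in the table.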

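7.2 Single-request latency reference

Metric	Value
Input length	`128 tokens`
Output length	`128 tokens`
Mean TTFT	`~105 ms`
Mean end-to-end latency	`~1.9 s`

When benchmarking, specify the tokenizer explicitly: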

--tokenizer /opt/atomgit/weights/Qwen3-4B-Thinking-2507-FP8-BF16

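8. Notes

8.1 FP8 dequantization is currently required for Ascend deployment

The current vLLM-Ascend release (0.18.0rc1) does not yet support direct FP8 inference on NPUs. Loading the FP8 weights directly fails with: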

fp8 quantization is currently not supported in npu

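The workaround used here is to dequantize FP8 to BF16 ahead of time and then start the vLLM service. The FP8→FP32 upcast is exact; the final cast to BF16 rounds the scaled value to BF16 precision, so the conversion is near-lossless rather than strictly lossless. The dequantized 4B model is only about 8.8 GB, which fits comfortably on a single 32 GB NPU.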
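
The size of the rounding introduced by the final BF16 cast can be illustrated with a small standard-library sketch. `to_bf16` is a hypothetical helper that emulates round-to-nearest-even on the upper 16 bits of a float32; the weight and scale values are chosen for illustration only.

```python
import struct

def to_bf16(value: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern. Adding this bias
    # before truncating rounds the dropped 16 bits to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]

weight = 1.125             # exactly representable in FP8 e4m3 (1 + 1/8)
scale = 1.0 + 2.0 ** -10   # exactly representable in float16
exact = weight * scale     # exact product, as computed in FP32
stored = to_bf16(exact)    # value that would be written to the BF16 checkpoint
rel_error = abs(stored - exact) / exact
print(exact, stored, rel_error)
```

For normal-range values the relative rounding error of a BF16 cast is bounded by 2^-8 (about 0.4%) per weight, which is the scale of deviation to expect from this step.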

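8.2 Single-card deployment and multi-card scaling

This document was verified on a single 910B4 (32 GB). For a larger max-model-len (e.g. 128K or 256K), consider:

Adding NPUs and setting --tensor-parallel-size > 1
Or lowering --max-num-seqs and retuning --gpu-memory-utilization (a lower value leaves less HBM for the KV cache)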
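
To see why longer contexts push past a single card, the KV-cache footprint of one full-length sequence can be estimated. The architecture numbers below (36 layers, 8 KV heads, head dimension 128, BF16 cache) are assumptions not stated in this document and should be checked against the model's config.json; `kv_cache_bytes` is an illustrative helper.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens: int,
                   num_layers: int = 36,    # assumption: check config.json
                   num_kv_heads: int = 8,   # assumption: check config.json
                   head_dim: int = 128,     # assumption: check config.json
                   dtype_bytes: int = 2) -> int:  # BF16 cache
    """Bytes needed to hold keys and values for `tokens` tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * tokens

for context in (65536, 131072, 262144):
    print(f"{context:>7} tokens -> {kv_cache_bytes(context) / GIB:.1f} GiB")
```

Under these assumptions a single 256K-token sequence needs more KV cache than one 32 GB card holds, before counting the weights, which is consistent with the recommendation to add NPUs for such lengths.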

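8.3 `VLLM_WORKER_MULTIPROC_METHOD=spawn` is required

In some Ascend container environments, vLLM's EngineCore child process can fail to initialize because it does not correctly inherit the NPU device context (aclInit error code 107001). Setting VLLM_WORKER_MULTIPROC_METHOD=spawn makes the child process re-initialize the NPU environment and avoids the problem.

8.4 Chat template and inference parameters

This model is a Thinking-only variant: the chat template automatically injects the `<think>` tag, and the Thinking length and the final-answer length are reported separately.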

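10.2 Functional verification script `accuracy_check.py`

Dependency: requests

python accuracy_check.py

Checks:

Whether /v1/models returns the correct model name
Whether basic chat works
Whether Thinking mode generates the `<think>` reasoning content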
