Qwen3-1.7B-FP16

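Model Overview

Qwen3-1.7B-FP16 is an FP16 model obtained by offline dequantization of Qwen3-1.7B-GPTQ-Int8. It is intended to run on Huawei Ascend NPUs through vLLM-Ascend.

Base model: Qwen/Qwen3-1.7B
Parameters: 1.7B (non-embedding parameters: 1.4B)
Architecture: Qwen3ForCausalLM (decoder-only, GQA)
Layers: 28
Attention heads: 16 (Q) / 8 (KV)
Hidden size: 2048
Context length: 32,768 (training) -> 4,096 (deployment limit)
Precision: float16
License: Apache 2.0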

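Adaptation Background

vLLM-Ascend does not natively support the GPTQ-Int8 quantization format, so this model uses an offline dequantization strategy: the GPTQ Int8 weights are dequantized to FP16 before loading.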

Environment Requirements

Component	Version
Python	3.11.14
PyTorch	2.9.0
torch_npu	2.9.0.post1
vLLM	0.18.0
vLLM-Ascend	0.18.0rc1
Transformers	4.57.6
Ascend NPU	Ascend910 (2 cards, 64 GB HBM per card)

Model Files

File	Size	Description
`model.safetensors`	~3.28 GB	310 tensors (196 weight tensors + 114 normalization parameters)
`config.json`	~1 KB	Model configuration (quantization_config removed)
`tokenizer.json`	~5.2 MB	Qwen3 tokenizer (BPE, vocabulary size 151,936)
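
The `config.json` change noted in the table can be reproduced with a few lines of Python. The sketch below is an assumption about how that step might be done, not the script that was actually used; the helper names `strip_quantization_config` and `rewrite_config` are hypothetical, and only the `quantization_config` key comes from the table above.

```python
import json
from pathlib import Path


def strip_quantization_config(config: dict) -> dict:
    """Return a copy of a model config without the `quantization_config` entry."""
    cleaned = dict(config)
    cleaned.pop("quantization_config", None)  # no error if the key is absent
    return cleaned


def rewrite_config(src: Path, dst: Path) -> None:
    """Read config.json from `src`, drop the quantization block, write it to `dst`."""
    config = json.loads(src.read_text(encoding="utf-8"))
    dst.write_text(
        json.dumps(strip_quantization_config(config), indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```

The original dictionary is left untouched, so the quantized checkpoint's configuration can still be inspected after the copy is written.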

Conversion Verification

Dequantization formula check

GPTQ symmetric quantization (sym=True, group_size=128):

W_fp16[out_idx, in_idx] = qweight_int8[out_idx, in_idx] × scales[in_idx // 128, out_idx]

Result: torch.allclose(reconstructed, saved, atol=1e-3, rtol=1e-3) = True

Metric	Value
Max reconstruction error (all tensors)	0.000000 (within FP16 precision)
Mean reconstruction error (all tensors)	0.000000
qweight_int8 range	[-128, 127]
scale range	[0.00016, 0.00515]
scale mean	0.00081
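
To make the indexing in the formula concrete, here is a toy version in plain Python with `group_size=2` instead of 128. The function name and the tiny inputs are illustrative only; the point is that each input column looks up its scale by group (`in_idx // group_size`) and by output row.

```python
def dequantize_symmetric(qweight_int8, scales, group_size):
    """Apply W[out, in] = q[out, in] * scales[in // group_size][out].

    qweight_int8: list of rows, shape (out_features, in_features)
    scales:       list of rows, shape (num_groups, out_features)
    """
    return [
        [q * scales[in_idx // group_size][out_idx] for in_idx, q in enumerate(row)]
        for out_idx, row in enumerate(qweight_int8)
    ]


# Toy example: 2 output rows, 4 input columns, 2 groups of 2 columns each.
q = [[10, -20, 30, -40],
     [1, 2, 3, 4]]
s = [[0.5, 0.25],    # group 0: scales for output rows 0 and 1
     [0.125, 2.0]]   # group 1: scales for output rows 0 and 1
w = dequantize_symmetric(q, s, group_size=2)
print(w)
```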

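Weight statistics check (example: model.layers.0.self_attn.q_proj)

Metric	Value
Weight shape	(2048, 2048)
Weight mean	-0.0011
Weight standard deviation	0.0856
Weight range	[-0.659, 0.653]
Operator norm	28.01

All weight values fall within a numerically plausible range for a trained Qwen3 model. This check alone does not rule out the weight-level problem discussed in the error analysis.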
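
Statistics like those in the table can be collected with a short helper. This is a sketch that assumes PyTorch is available; `weight_stats` is a hypothetical name, and the operator norm is taken here to mean the spectral norm (largest singular value), which the table does not state explicitly.

```python
import torch


def weight_stats(weight: torch.Tensor) -> dict:
    """Summarize a 2-D weight matrix; computed in float32 for numerical stability."""
    w = weight.float()
    return {
        "shape": tuple(w.shape),
        "mean": w.mean().item(),
        "std": w.std().item(),
        "min": w.min().item(),
        "max": w.max().item(),
        # Spectral norm: the largest singular value of the matrix.
        "operator_norm": torch.linalg.matrix_norm(w, ord=2).item(),
    }
```

Applied to a tensor loaded from `model.safetensors`, the returned dictionary can be compared with the same statistics from the official BF16 weights, which is a stronger check than looking at the range in isolation.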

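Accuracy Evaluation

Test method

CPU baseline: load the FP16 model with HuggingFace Transformers, run the forward pass on CPU (float32), and collect logits and hidden states.
NPU inference: start an OpenAI-compatible API server with vLLM-Ascend, send the same prompts, and collect the generated results.
Comparison metrics: cosine similarity, MSE, and top-1 token match rate.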
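
The three comparison metrics are simple to state in code. The following is a minimal sketch in plain Python, with hypothetical function names, rather than the evaluation script itself; each logits argument is assumed to be a list of floats (or a list of such lists, one per token position).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_squared_error(a, b):
    """Average squared element-wise difference."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def top1_match_rate(logits_a, logits_b):
    """Fraction of positions whose arg-max token id agrees.

    Each argument is a list of per-position logit vectors.
    """
    def argmax(v):
        return max(range(len(v)), key=v.__getitem__)

    matches = sum(argmax(x) == argmax(y) for x, y in zip(logits_a, logits_b))
    return matches / len(logits_a)
```

Cosine similarity ignores overall scale, so it is worth reading together with the MSE: two logit vectors can point in the same direction while differing in magnitude.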

Accuracy comparison results

Token-level logits comparison

Prompt	CPU Top-1 Token	NPU Top-1 Token	Match
"Hello, world!"	`<|endoftext|>` (id=151643)	`<|endoftext|>`	✅
"The capital of France is"	`<|endoftext|>` (id=151643)	`<|endoftext|>`	✅
"Q: What is 2+2?\nA:"	`<|endoftext|>` (id=151643)	`<|endoftext|>`	✅
"What is the capital of France?"	`t` (id=83)	`t` (id=83)	✅
"The best programming language is"	`t` (id=83)	`t` (id=83)	✅

CPU vs. NPU top-1 token match rate: 100% (5/5)

Logits numerical error (cosine similarity over the first 10 token positions)

Metric	Value
Cosine Similarity	0.9999
Mean Squared Error	< 1e-5

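Hidden states growth analysis

Layer  0: norm=1.45 (embedding output)
Layer  1: norm=153.72  ↑ 106×
Layer  5: norm=2399.63
Layer 10: norm=3166.80
Layer 20: norm=11301.76
Layer 27: norm=54710.95
Layer 28: norm=125.59  (final RMSNorm)

Observation: the hidden-state norm grows from 1.45 at layer 0 to 54710.95 at layer 27, an average of roughly 1.5× per layer. The growth is uneven: layer 1 alone accounts for 106×, and the average from layer 1 onward is about 1.25× per layer. The final RMSNorm brings the norm back to 125.59. This behavior is identical on CPU and NPU.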
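
The per-layer growth factor can be checked directly from the listed norms as a geometric mean. The sketch below uses only the values in the listing; the helper name is hypothetical.

```python
# Hidden-state norms copied from the listing above (layer index -> norm).
norms = {0: 1.45, 1: 153.72, 5: 2399.63, 10: 3166.80, 20: 11301.76, 27: 54710.95}


def per_layer_growth(start_layer: int, end_layer: int) -> float:
    """Geometric-mean growth factor per layer between two measured layers."""
    ratio = norms[end_layer] / norms[start_layer]
    return ratio ** (1.0 / (end_layer - start_layer))


overall = per_layer_growth(0, 27)      # includes the large jump at layer 1
after_first = per_layer_growth(1, 27)  # excludes it
print(f"layer 0 -> 27: {overall:.2f}x per layer")
print(f"layer 1 -> 27: {after_first:.2f}x per layer")
```

Comparing these factors with the same measurement on the official BF16 checkpoint would show whether the growth is specific to the dequantized weights.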

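Error analysis conclusions

CPU and NPU outputs agree: for the same prompts, the top-1 tokens produced on CPU and NPU match in every case, and the logits cosine similarity is above 0.999.
The problem lies in the model itself: CPU and NPU both produce the same non-semantic output, which indicates that the issue is not a platform difference but the model weights.
Possible causes:
- The GPTQ-to-FP16 conversion may have missed some quantization metadata (for example, normalization of the scaling factors).
- The model may need a specific inference prefix or chat template to produce meaningful output.
- Compatibility differences between the installed Transformers version (4.57.6) and the version the model was saved with (4.51.3).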
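
The first hypothesis can be probed directly. AutoGPTQ-style checkpoints commonly store 8-bit values as unsigned bytes alongside a `qzeros` tensor, and dequantization subtracts the zero point before scaling. Whether that applies to this particular checkpoint is an assumption that would need to be checked against its `qzeros` contents. The sketch below only illustrates how the two interpretations of one packed int32 differ, using a hypothetical zero point of 128 and hypothetical helper names.

```python
def unpack_int32(packed: int) -> list[int]:
    """Split one packed 32-bit word into four unsigned bytes, lowest byte first."""
    packed &= 0xFFFFFFFF  # treat a negative int32 as its two's-complement bit pattern
    return [(packed >> (8 * i)) & 0xFF for i in range(4)]


def as_signed_int8(byte: int) -> int:
    """Reinterpret an unsigned byte as a signed int8 (what a dtype view does)."""
    return byte - 256 if byte >= 128 else byte


def with_zero_point(byte: int, zero_point: int = 128) -> int:
    """Subtract a zero point from the unsigned byte (the value 128 is assumed)."""
    return byte - zero_point


packed = 0x817F8000  # bytes, lowest first: 0x00, 0x80, 0x7F, 0x81
raw = unpack_int32(packed)
signed_view = [as_signed_int8(b) for b in raw]
zero_centred = [with_zero_point(b) for b in raw]
print(raw, signed_view, zero_centred)
```

If the checkpoint does use a zero point, the two readings disagree on every byte, so dequantizing one layer both ways and comparing each result with the official BF16 weights would settle which reading is correct.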

Deployment

Starting the server

vllm serve /path/to/Qwen3-1.7B-FP16 \
  --enforce-eager \
  --max-model-len 4096 \
  --dtype float16 \
  --disable-log-stats \
  --port 18000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
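
Model loading can take a while, so it helps to wait until the server answers before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route using only the standard library. The function name, the timeout values, and the choice of route are assumptions; the port matches the command above.

```python
import time
import urllib.request


def wait_for_server(
    base_url: str,
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll `<base_url>/v1/models` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            url = f"{base_url}/v1/models"
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet: connection refused, timeout, or HTTP error
        time.sleep(interval_s)
    return False


# Example: wait_for_server("http://localhost:18000") returns True once the server is up.
```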

API call

curl http://localhost:18000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-1.7B-FP16",
    "prompt": "The capital of France is",
    "max_tokens": 20,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'

Using the chat format (recommended)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="/path/to/Qwen3-1.7B-FP16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=50,
    temperature=0.6,
)

# Print the assistant's reply.
print(response.choices[0].message.content)
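
The chat endpoint applies the model's chat template on the server. For comparison with the raw `/v1/completions` call, the following sketch builds a ChatML-style prompt by hand. It assumes the `<|im_start|>` / `<|im_end|>` markers used by Qwen chat templates and leaves out details such as thinking-mode tags, so treat it as an approximation; `tokenizer.apply_chat_template` from Transformers is the safer route in practice. The function name is hypothetical.

```python
def build_chatml_prompt(messages: list[dict]) -> str:
    """Format chat messages as a ChatML-style prompt ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation continues from here
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

Sending this string as the `prompt` of a completions request is one way to test whether the non-semantic output is caused by a missing chat template rather than by the weights.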

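Performance Evaluation

NPU inference performance

Setting	Value
Hardware	Ascend910 (single card, 64 GB HBM)
Tensor Parallel	1
Batch Size	1
Prompt Tokens	7-11
Generated Tokens	5-10
Time Per Request	~0.1 s
Generation Throughput	~50 tokens/s

Note: these figures were measured on an Ascend910. Actual performance depends on prompt length, the number of generated tokens, and the number of concurrent requests.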

Conversion Method

Conversion script

# convert_gptq_to_fp16.py
# Offline dequantization: GPTQ-Int8 -> FP16

import torch


def dequantize_gptq_int8(
    qweight: torch.Tensor,    # shape: (in_packed, out), dtype=int32
    scales: torch.Tensor,     # shape: (num_groups, out), dtype=float16
    group_size: int = 128,    # GPTQ group size
) -> torch.Tensor:
    # Each int32 packs four 8-bit values along the input dimension.
    in_packed, out_features = qweight.shape
    in_features = in_packed * 4

    # Unpack int32 -> int8 by reinterpreting the bytes as signed values.
    # Note: qzeros and g_idx are not used by this function.
    qweight_t = qweight.T.contiguous()  # (out, in_packed)
    qweight_int8 = qweight_t.view(torch.int8)  # (out, in_features)

    # Expand scales: (num_groups, out) -> (out, in_features)
    num_groups = scales.shape[0]
    scales_expanded = torch.zeros(out_features, in_features, dtype=scales.dtype)
    for g in range(num_groups):
        start = g * group_size
        end = min(start + group_size, in_features)
        scales_expanded[:, start:end] = scales[g, :].unsqueeze(1)

    # Symmetric dequantization: deq = qweight_int8 * scale
    dequantized = qweight_int8.to(scales.dtype) * scales_expanded

    return dequantized.contiguous()

Usage

# 1. Install dependencies
pip install torch safetensors transformers

# 2. Run the conversion
python3 convert_gptq_to_fp16.py \
  /path/to/Qwen3-1.7B-GPTQ-Int8 \
  /path/to/Qwen3-1.7B-FP16

# 3. Verify (optional): check every tensor for NaN/Inf and report the total size
python3 -c "
import safetensors.torch, torch
st = safetensors.torch.load_file('/path/to/Qwen3-1.7B-FP16/model.safetensors')
for k, v in st.items():
    if torch.isnan(v).any() or torch.isinf(v).any():
        print(f'WARNING: {k} has NaN/Inf')
print(f'Loaded {len(st)} tensors, total {sum(t.numel()*t.element_size() for t in st.values())/1024/1024:.1f} MB')
"

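Notes

Model quality: the current FP16 version produces non-semantic output on both CPU and NPU, accompanied by rapidly growing hidden-state norms across layers. Using the official BF16 release, or loading the original quantized model through a GPTQ backend (optimum + gptqmodel), is recommended instead.
Maximum context length: the vLLM-Ascend deployment is limited to 4,096 tokens, below the model's native 32,768 tokens. For longer contexts, raise the --max-model-len value.
Device memory: the FP16 weights occupy about 3.3 GB of device memory; vLLM's KV cache and intermediate activations add to that.
Precision: the model uses float16. In the tests above, float16 inference on the Ascend NPU showed no measurable deviation from the float32 CPU baseline (cosine similarity 0.9999), but float16's limited numeric range may be a constraint for training.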
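
For the device-memory note above, the KV-cache share can be estimated from the architecture figures in the overview. This sketch assumes a head dimension of 128 (hidden size 2048 divided by 16 query heads) and FP16 cache entries; actual usage under vLLM also depends on block allocation and on `--gpu-memory-utilization`. The function name is hypothetical.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Keys and values (factor 2) for every layer and every KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_cache_bytes_per_token(layers=28, kv_heads=8, head_dim=128, bytes_per_value=2)
per_sequence = per_token * 4096  # one sequence at the deployed --max-model-len

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / 1024**2:.0f} MiB per 4096-token sequence")
```

With 8 KV heads instead of 16, grouped-query attention halves this figure relative to caching one key/value pair per query head.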

Citation

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}