VibeThinker-1.5B on vLLM-Ascend 0.18.0rc1 (NPU)

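1. Introduction

This document records the quick deployment and validation results for WeiboAI/VibeThinker-1.5B on vLLM-Ascend 0.18.0rc1.

VibeThinker-1.5B is a 1.5B-parameter reasoning model released by Weibo AI. It is based on the Qwen2ForCausalLM architecture, with 28 Transformer layers, hidden_size=1536, BF16 precision, and weights of about 3.3 GB. The model reasons with chain-of-thought (CoT) and is suited to tasks such as mathematical reasoning.

According to its configuration, VibeThinker-1.5B follows the Qwen2ForCausalLM inference path and is compatible with the deployment procedure for Qwen2-1.5B, so an existing setup can be migrated and validated quickly.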

2. Validation environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`modelscope`	`1.35.3`

NPU: 2 logical cards (Ascend910_9362)
Model path: /opt/atomgit/models/VibeThinker-1.5B
Service port: 8000

3. Starting the service

Before starting, check whether the port is already in use:

ss -lntp | grep ':8000 ' || true

Launch commands that passed validation:

export ASCEND_RT_VISIBLE_DEVICES=0
export VLLM_USE_MODELSCOPE=false
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# Option 1: enforce-eager mode (recommended for first-time validation)
vllm serve /opt/atomgit/models/VibeThinker-1.5B \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 1 \
  --tensor-parallel-size 1 \
  --served-model-name VibeThinker-1.5B \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager

# Option 2: PIECEWISE compilation mode (better performance)
vllm serve /opt/atomgit/models/VibeThinker-1.5B \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 1 \
  --tensor-parallel-size 1 \
  --served-model-name VibeThinker-1.5B \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --trust-remote-code \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85

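Key parameters:

Parameter	Value	Description
`--tensor-parallel-size`	1	A 1.5B model runs on a single card
`--max-model-len`	8192	Maximum context length; chain-of-thought output needs enough room
`--gpu-memory-utilization`	0.85	Fraction of NPU memory (85%) the vLLM instance may use for weights, activations, and KV cache
`--enforce-eager`	-	Disables graph compilation; recommended for first-time validation
`--dtype`	bfloat16	The model's default precision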
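
After either launch command is issued, the server needs some time to load the model before it accepts requests. The following is a minimal sketch of a readiness check that polls the /v1/models endpoint used later in the smoke test. The function name wait_until_ready and its timeout defaults are illustrative assumptions, not part of vLLM or of this repository.

```python
import json
import time
import urllib.error
import urllib.request

# Ignore proxy environment variables: the target is normally a local address.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_ready(base_url: str, timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/v1/models until it answers 200 with a JSON body.

    Returns True once the server responds, False if timeout_s elapses first.
    The name and the default values are hypothetical, chosen for illustration.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with _OPENER.open(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    json.load(resp)  # raises ValueError on a malformed body
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # not listening yet, timed out, or returned a partial body
        time.sleep(interval_s)
    return False


# Example (requires a running server):
#   wait_until_ready("http://127.0.0.1:8000")
```

Polling the HTTP endpoint rather than only checking the port avoids sending requests while the process is listening but the model is still loading.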

4. Model download

# Download the model weights from ModelScope
modelscope download --model WeiboAI/VibeThinker-1.5B \
  --local_dir /opt/atomgit/models/VibeThinker-1.5B

5. Inference usage

5.1 Single request

python inference.py --prompt "Compute 2+2. Put your answer in \\boxed{}."

5.2 Batch inference

python inference.py --input prompts.txt --output results.json

5.3 Interactive mode

python inference.py --interactive

5.4 Multi-turn conversation

python inference.py --interactive --multi-turn

5.5 Streaming output

python inference.py --prompt "Explain quantum entanglement." --stream

5.6 Custom API address

python inference.py --api-base http://192.168.1.100:8000/v1 --prompt "Hello"
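
The source of inference.py is not reproduced in this document. As a rough illustration of what the --stream and --api-base options imply, the sketch below builds a chat request for the OpenAI-compatible endpoint and reads the server-sent-event stream with the standard library only. The function names build_chat_payload, iter_stream_text, and stream_chat are assumptions made for this example and may not match the actual script.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_payload(prompt: str, model: str = "VibeThinker-1.5B",
                       max_tokens: int = 2048, stream: bool = False) -> dict:
    """Build the JSON body for POST {api_base}/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": stream,
    }


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


def stream_chat(api_base: str, prompt: str) -> Iterator[str]:
    """Send a streaming request; api_base is e.g. http://127.0.0.1:8000/v1."""
    req = urllib.request.Request(
        f"{api_base}/chat/completions",
        data=json.dumps(build_chat_payload(prompt, stream=True)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_stream_text(line.decode("utf-8") for line in resp)


# Example (requires a running server):
#   for piece in stream_chat("http://127.0.0.1:8000/v1", "Hello"):
#       print(piece, end="", flush=True)
```

Keeping the SSE parsing separate from the network call makes the parser easy to check against recorded output without a live server.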

6. Smoke test

Basic checks:

curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "VibeThinker-1.5B",
    "messages": [
      {"role": "user", "content": "Compute 2+2. Put your answer in \\boxed{}."}
    ],
    "temperature": 0,
    "max_tokens": 1024
  }'

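Results:

/v1/models returns 200
/v1/chat/completions returns 200, and the model output contains \boxed{4}
The model enters chain-of-thought reasoning automatically (the output begins with a <think> tag)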
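
The check on the \boxed{4} answer can be automated. The sketch below takes the parsed JSON of a /v1/chat/completions response and looks at the last \boxed{...} expression in the message content. The helper names last_boxed and smoke_ok are hypothetical, and the code assumes the reasoning and the final answer arrive together in the content field, as in the curl output described above.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces, which is enough for short answers.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None."""
    matches = BOXED_RE.findall(text)
    return matches[-1].strip() if matches else None


def smoke_ok(response: dict, expected: str = "4") -> bool:
    """Check a parsed chat-completion response against the expected answer."""
    content = response["choices"][0]["message"]["content"] or ""
    return last_boxed(content) == expected
```

Taking the last match rather than the first reduces the chance of picking up an intermediate value that the model boxes during its reasoning.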

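7. Performance reference

Test conditions: single Ascend910_9362 card, enforce-eager mode, max_model_len=8192.

Metric	Value
Model load time	~1.08 s (safetensors)
Model weight memory	2.97 GB
Available KV cache	48.93 GiB
KV cache capacity	1,832,192 tokens
Single-request latency (128+100 tokens)	~783 tokens / ~57 s
Maximum concurrency (4096 tokens/request)	~448x

Note: these figures were measured in enforce-eager mode; PIECEWISE compilation mode can deliver better performance.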
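
The KV cache figures in the table can be cross-checked from the model configuration. The sketch below is a back-of-the-envelope calculation, assuming the head dimension equals hidden_size / num_attention_heads (1536 / 12 = 128) and that the cache is stored in BF16 (2 bytes per element). Under those assumptions each token needs 28,672 bytes, which reproduces the reported token capacity to within about 0.01% and gives roughly 447 concurrent 4096-token requests, in line with the ~448x reported.

```python
# Values taken from the model configuration listed in this document.
NUM_LAYERS = 28
NUM_KV_HEADS = 2
HEAD_DIM = 1536 // 12   # assumption: hidden_size / num_attention_heads
BYTES_PER_ELEM = 2      # assumption: KV cache stored in BF16


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_token_capacity(cache_gib: float) -> int:
    """Number of tokens that fit into cache_gib GiB of KV cache."""
    return int(cache_gib * 1024**3) // kv_bytes_per_token()


def max_concurrency(total_tokens: int, tokens_per_request: int) -> float:
    """How many requests of a given length fit in the cache at once."""
    return total_tokens / tokens_per_request


print(kv_bytes_per_token())                 # bytes per token
print(kv_token_capacity(48.93))             # estimated token capacity
print(max_concurrency(1_832_192, 4096))     # requests at 4096 tokens each
print(max_concurrency(1_832_192, 8192))     # requests at max_model_len=8192
```

The last line shows the effect of the context length: at the configured max_model_len of 8192, the same cache holds about half as many full-length requests.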

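8. Accuracy evaluation

8.1 Method

Accuracy is evaluated on the GPQA (Graduate-Level Google-Proof Q&A) dataset.

Download the dataset: modelscope download --dataset modelscope/gpqa
Run the eval_gpqa.py script, which sends one request per question with 32 concurrent workers
The model reasons with chain-of-thought, with max_tokens=4096 per question
Extract the A/B/C/D answer choice from the model output
Compare against the correct answers to compute accuracy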

python eval_gpqa.py \
  --dataset /tmp/GPQA/gpqa_main.csv \
  --model VibeThinker-1.5B \
  --output ./gpqa_results.json \
  --workers 32
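
The source of eval_gpqa.py is not shown here. The extraction and scoring steps described above could look roughly like the sketch below, which is an assumption about the approach rather than the script's actual logic. The names extract_choice and accuracy and the specific patterns are hypothetical.

```python
import re
from typing import Optional, Sequence

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
# Tried in order; both patterns are illustrative guesses at common answer forms.
ANSWER_PATTERNS = [
    re.compile(r"\\boxed\{\s*\(?([ABCD])\)?\s*\}"),
    re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([ABCD])\)?\b"),
]


def extract_choice(output: str) -> Optional[str]:
    """Return the chosen letter A/B/C/D, or None if no answer is found."""
    text = THINK_RE.sub("", output)  # drop the completed reasoning block
    if "<think>" in text:
        return None  # reasoning never closed: output was likely truncated
    for pattern in ANSWER_PATTERNS:
        found = pattern.findall(text)
        if found:
            return found[-1]
    # Fallback: last standalone capital letter. This can misfire on words
    # such as the article "A", so it is only used when nothing else matches.
    standalone = re.findall(r"\b([ABCD])\b", text)
    return standalone[-1] if standalone else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of questions answered correctly; None counts as wrong."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Treating an unclosed reasoning block as "no answer" keeps responses that hit the max_tokens limit from being scored on a letter that merely appears in the unfinished reasoning.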

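8.2 Results

Metric	Value
Dataset	`GPQA`
Total questions	`448`
Correct	`112`
Accuracy	`25.0%`
GPU/CPU baseline	`25.8%`

8.3 NPU vs GPU/CPU comparison

Metric	NPU	GPU/CPU	Difference
Accuracy	25.0%	25.8%	0.8 percentage points

Conclusion: the NPU and GPU/CPU accuracies differ by 0.8 percentage points, which is below the 1-point threshold, so the accuracy check passes.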
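
The pass criterion can be written as a small check. This sketch recomputes the NPU accuracy from the raw counts and compares it with the baseline in percentage points; the function names are illustrative, and the 1.0-point tolerance is the threshold stated in the conclusion.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def within_tolerance(npu_pct: float, baseline_pct: float,
                     tol_pp: float = 1.0) -> bool:
    """True if the absolute gap is below tol_pp percentage points."""
    return abs(npu_pct - baseline_pct) < tol_pp


npu = accuracy_pct(112, 448)          # counts from the results table
print(npu, within_tolerance(npu, 25.8))
```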

8.4 Breakdown by subject

Subject	Correct/Total	Accuracy
Biology	21/78	26.9%
Chemistry	43/183	23.5%
Physics	48/187	25.7%
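
As a consistency check, the per-subject rows can be aggregated and compared with the overall result. The sketch below uses only the counts from the table; the variable and function names are illustrative.

```python
# (correct, total) per subject, copied from the table above.
BY_SUBJECT = {
    "Biology": (21, 78),
    "Chemistry": (43, 183),
    "Physics": (48, 187),
}


def subject_accuracy(results: dict) -> dict:
    """Per-subject accuracy in percent, rounded to one decimal place."""
    return {name: round(100.0 * c / t, 1) for name, (c, t) in results.items()}


def totals(results: dict) -> tuple:
    """Sum of correct answers and of questions across all subjects."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct, total


print(subject_accuracy(BY_SUBJECT))
print(totals(BY_SUBJECT))
```

The subject totals add up to the 112 correct answers out of 448 questions reported in the overall results.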

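8.5 Notes

GPQA consists of graduate-level multiple-choice questions; the four options are randomly shuffled
The 448 questions were evaluated with 32 concurrent workers, taking about 25 minutes in total
The model uses chain-of-thought reasoning, so each request consumes many tokens (roughly 800-1500 tokens per question on average)
Answer extraction is based on matching the letters A/B/C/D
The 1.5B model's performance on graduate-level GPQA questions is limited by model capacity; both the NPU and baseline scores are close to the 25% expected from random guessing among four options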

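9. Architecture compatibility analysis

9.1 Model configuration

Feature	Value	Ascend compatibility
Architecture	`Qwen2ForCausalLM`	✅ Known to be supported
Attention mechanism	Full Attention	✅ Supported on Ascend
Precision	BF16	✅ Supported on Ascend
Layers	28	✅ Supported
hidden_size	1536	✅ Supported
num_attention_heads	12	✅ Supported
num_key_value_heads	2 (GQA)	✅ Supported
vocab_size	151936	✅ Supported
Chain-of-thought reasoning	CoT (thinking)	✅ Supported
Compilation mode	PIECEWISE (ACL Graph)	✅ Supported

9.2 Support status

VibeThinker-1.5B uses the standard Qwen2ForCausalLM architecture, which vLLM-Ascend fully supports, and it requires no incompatible operators.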

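10. Notes

Chain-of-thought reasoning: VibeThinker-1.5B uses chain-of-thought reasoning by default, and its output includes the reasoning process inside <think> tags. Set a generous max_tokens (2048 or more is recommended) to obtain a complete answer.
Choosing max_model_len: the model configuration sets max_position_embeddings=131072, but on a single card 8192 or 4096 is recommended for this 1.5B model so that the KV cache can hold more concurrent sequences.
Memory usage: the model weights occupy only 2.97 GB, but the KV cache needs a large amount of memory. A single 64 GB card can hold a KV cache of more than 1.8 million tokens.
Model download: use modelscope download --model WeiboAI/VibeThinker-1.5B; the download is about 3.3 GB.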
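
Because the reasoning and the final answer arrive in the same output, a client usually needs to separate them and detect answers cut off by max_tokens. The following is a minimal sketch, assuming the reasoning is delimited by <think> and </think> tags; the function name split_reasoning is hypothetical.

```python
from typing import Tuple

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"  # assumed tag format


def split_reasoning(text: str) -> Tuple[str, str, bool]:
    """Split model output into (reasoning, answer, complete).

    complete is False when the reasoning block was opened but never closed,
    which typically means generation stopped at the max_tokens limit.
    """
    if CLOSE_TAG in text:
        reasoning, _, answer = text.partition(CLOSE_TAG)
        return reasoning.replace(OPEN_TAG, "", 1).strip(), answer.strip(), True
    if OPEN_TAG in text:
        return text.replace(OPEN_TAG, "", 1).strip(), "", False
    return "", text.strip(), True  # no reasoning block at all
```

When complete is False, retrying with a larger max_tokens is the natural remedy. The finish_reason field of the chat-completion response, which is "length" for truncated generations, provides the same signal from the server side.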

Validation date: 2026-05-16
Validation tool version: vllm-ascend v0.18.0rc1
Validated by: Ascend NPU adaptation validation