Baichuan-13B-Base on vLLM-Ascend 0.18.0rc1

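1. Introduction

This document records the quick-deployment procedure and validation results for Baichuan-13B-Base on vLLM-Ascend 0.18.0rc1. Baichuan-13B-Base is a 13-billion-parameter base model released by Baichuan Intelligent Technology (百川智能). It uses ALiBi positional encoding and has 40 Transformer layers with hidden_size=5120.

vLLM ships a built-in baichuan.py model implementation, and BaichuanForCausalLM automatically detects the 13B variant and takes the ALiBi path, so the model runs on Ascend NPUs with no changes to vLLM code.

Note: the original tokenization_baichuan.py is incompatible with transformers >= 4.40 (an initialization-order problem involving the vocab_size property). The sp_model initialization must be moved before super().__init__().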

2. Validation Environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1`

NPU: 2 x Ascend910_9362 (61.3 GB HBM per card)
Model path: /opt/atomgit/Baichuan-13B-Base-weights
Service port: 8000

3. Starting the Service

Before launching, you can check whether the port is already in use:

lsof -i :8000 || true
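
The same check can be scripted. The following is a minimal Python sketch, not part of the validated procedure: the helper name port_in_use is hypothetical, and unlike lsof it only detects a listener that accepts TCP connections on the given host.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it does not list the owning
    process the way lsof does.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use(8000) before launch should return False on a clean machine.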

Launch command that has been verified to work:

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1

vllm serve /opt/atomgit/Baichuan-13B-Base-weights \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name baichuan-13b \
  --max-num-seqs 32 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"enable_cpu_binding":true}'
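
Model loading takes a while, so it can help to wait until the server answers before sending requests. The sketch below polls /v1/models (the endpoint used in the smoke test of this document) until it returns 200. The function names, the 600-second deadline, and the 5-second interval are illustrative assumptions, not values taken from the validated run.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return None


def wait_until_ready(url, deadline_s=600.0, interval_s=5.0,
                     probe=http_status, sleep=time.sleep,
                     clock=time.monotonic):
    """Poll url until probe returns 200 or deadline_s seconds have passed.

    probe, sleep and clock are injectable so the loop can be tested
    without a running server.
    """
    start = clock()
    while clock() - start < deadline_s:
        if probe(url) == 200:
            return True
        sleep(interval_s)
    return False
```

With the service from this section, the call would be wait_until_ready("http://127.0.0.1:8000/v1/models").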

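Tokenizer patch

The original tokenization_baichuan.py is incompatible with transformers >= 4.40. Edit tokenization_baichuan.py in the model directory and move the sp_model initialization before super().__init__():

# Before the change (raises AttributeError)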
self.sp_model_kwargs = ...
bos_token = AddedToken(...)
...
super().__init__(...)
self.sp_model = spm.SentencePieceProcessor(...)
self.sp_model.Load(vocab_file)

# After the change (works)
self.sp_model_kwargs = ...
self.vocab_file = vocab_file
self.sp_model = spm.SentencePieceProcessor(...)
self.sp_model.Load(vocab_file)
bos_token = AddedToken(...)
...
super().__init__(...)
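
Why the order matters can be reproduced without transformers or sentencepiece. The sketch below is a simplified model of the failure, not the real library code: BaseTokenizer and FakeSentencePiece are hypothetical stand-ins, and the only behaviour borrowed from this document is that the base constructor calls get_vocab(), which reads vocab_size.

```python
class BaseTokenizer:
    """Hypothetical stand-in for the transformers base class."""

    def __init__(self):
        # Models the behaviour described for transformers >= 4.40:
        # the base constructor queries the vocabulary.
        self.initial_vocab = self.get_vocab()

    def get_vocab(self):
        return {f"tok{i}": i for i in range(self.vocab_size)}


class FakeSentencePiece:
    """Hypothetical stand-in for spm.SentencePieceProcessor."""

    def get_piece_size(self):
        return 4


class BrokenTokenizer(BaseTokenizer):
    def __init__(self):
        super().__init__()                   # reads vocab_size too early
        self.sp_model = FakeSentencePiece()  # never reached

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()


class PatchedTokenizer(BaseTokenizer):
    def __init__(self):
        self.sp_model = FakeSentencePiece()  # initialised first
        super().__init__()

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()


def constructs_cleanly(cls):
    """Return True if cls() succeeds, False if it raises AttributeError."""
    try:
        cls()
        return True
    except AttributeError:
        return False
```

Here constructs_cleanly(BrokenTokenizer) is False and constructs_cleanly(PatchedTokenizer) is True, which mirrors the before and after ordering shown above.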

4. Smoke Test

Basic checks:

curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baichuan-13b",
    "prompt": "The capital of China is",
    "max_tokens": 32,
    "temperature": 0
  }'

Results:

/v1/models returns 200
/v1/completions returns 200; generated text: Beijing.
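
The completions check can also be driven from Python, which makes it easier to assert on the generated text. This is a minimal sketch using only the standard library. The helper names are hypothetical, and extract_text assumes the OpenAI-style response layout in which the text sits under choices[0].text.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # host and port from the launch command


def build_payload(prompt, model="baichuan-13b", max_tokens=32):
    """Build the same JSON body as the curl smoke test."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def extract_text(response):
    """Pull the generated text out of a completions response dict."""
    return response["choices"][0]["text"]


def complete(prompt, timeout=60.0):
    """POST the prompt to /v1/completions and return the generated text."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(json.load(resp))
```

With the service running, "Beijing" in complete("The capital of China is") would be the scripted form of the check above.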

5. Performance Reference

Test conditions: 200 input tokens / 200 output tokens / concurrency=1 / num_prompts=200.

Metric	Value
`Successful requests`	`200`
`Benchmark duration`	`203.53 s`
`Request throughput`	`0.98 req/s`
`Output token throughput`	`196.53 tok/s`
`Peak output token throughput`	`473.00 tok/s`
`Total token throughput`	`393.07 tok/s`
`Mean TTFT`	`74.35 ms`
`Median TTFT`	`75.04 ms`
`P99 TTFT`	`93.27 ms`
`Mean TPOT`	`17.74 ms`
`Median TPOT`	`17.63 ms`
`P99 TPOT`	`19.79 ms`
`Mean ITL`	`17.74 ms`
`Median ITL`	`17.19 ms`
`P99 ITL`	`35.73 ms`
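
The throughput rows follow from the request count, the token counts, and the benchmark duration. The sketch below recomputes them under the assumption that every request had exactly 200 input and 200 output tokens. The reported figures come from the benchmark tool's own token counts, so small differences in the last digit are expected.

```python
NUM_PROMPTS = 200
INPUT_TOKENS = 200    # assumed exact per request
OUTPUT_TOKENS = 200   # assumed exact per request
DURATION_S = 203.53   # reported benchmark duration


def throughput(num_prompts, input_tokens, output_tokens, duration_s):
    """Derive request and token throughput from totals and duration."""
    total_out = num_prompts * output_tokens
    total_in = num_prompts * input_tokens
    return {
        "request_per_s": num_prompts / duration_s,
        "output_tok_per_s": total_out / duration_s,
        "total_tok_per_s": (total_in + total_out) / duration_s,
    }


derived = throughput(NUM_PROMPTS, INPUT_TOKENS, OUTPUT_TOKENS, DURATION_S)
for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

The derived values land close to the reported 0.98 req/s, 196.53 tok/s, and 393.07 tok/s.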

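6. Accuracy Evaluation

Accuracy on GPQA Diamond was measured with an in-house evaluation script.

Metric	Value
Dataset	`GPQA Diamond`
Samples	`198`
Base accuracy (%)	`9`
Correct answers	`17 / 198`
Evaluation method	vLLM completions API, temperature=0, answer options randomly shuffled
Evaluation script	`eval_accuracy_gpqa_2round.py`

Note: Baichuan-13B-Base is a base model (not instruction-tuned), and GPQA Diamond is a difficult set of graduate-level reasoning questions, so low accuracy on this dataset is expected for a base model. For a comparison against GPU/CPU baselines, see the baseline results below.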
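
The evaluation script itself is not reproduced here. As a rough illustration of the "options randomly shuffled" setup and of how the accuracy figure relates to the correct count, here is a minimal sketch. The function names, the four-letter labelling, and the seeding are assumptions, not the behaviour of eval_accuracy_gpqa_2round.py.

```python
import random

LETTERS = "ABCD"


def shuffle_options(correct, distractors, rng):
    """Shuffle one correct answer with three distractors.

    Returns the shuffled option list and the letter that now labels
    the correct answer.
    """
    options = [correct, *distractors]
    rng.shuffle(options)
    answer_letter = LETTERS[options.index(correct)]
    return options, answer_letter


def accuracy_percent(num_correct, num_total):
    """Accuracy as a percentage."""
    return 100.0 * num_correct / num_total


rng = random.Random(1024)  # fixed seed so the shuffle is repeatable
options, letter = shuffle_options("right", ["w1", "w2", "w3"], rng)
print(options, letter)
print(f"{accuracy_percent(17, 198):.2f}%")
```

For the counts in the table, 17 correct out of 198 is about 8.59%, which rounds to the reported 9.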

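7. Notes

Tokenizer compatibility: in the original tokenization_baichuan.py, __init__ initializes sp_model after calling super().__init__(). In transformers >= 4.40, super().__init__() calls get_vocab(), which reads vocab_size; at that point sp_model does not exist yet, so an AttributeError is raised. The fix is to move the sp_model initialization before super().__init__().
ALiBi attention: Baichuan-13B uses ALiBi positional encoding rather than RoPE. vLLM supports it natively, so no extra configuration is needed.
Weight format: the model ships in PyTorch .bin format rather than SafeTensors. Loading is slower but works correctly.
Base model: Baichuan-13B-Base is a base model without instruction tuning, so its dialogue and question-answering abilities are limited. Use the Completion API rather than the Chat API.
ACLGraph: FULL_DECODE_ONLY mode is used together with the CPU-binding optimization.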
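
For readers unfamiliar with ALiBi: instead of rotating queries and keys as RoPE does, it adds a per-head linear penalty to the attention scores, proportional to the distance between query and key. The sketch below follows the commonly published slope recipe for ALiBi. It is illustrative only and is not vLLM's or vLLM-Ascend's implementation, and the head count of 40 is an assumption that is not stated in this document.

```python
import math


def alibi_slopes(n_heads):
    """Per-head ALiBi slopes, following the commonly published recipe."""

    def power_of_two_slopes(n):
        # Geometric sequence starting at 2^(-8/n) with the same ratio.
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start * start ** i for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)
    # Otherwise take the nearest lower power of two and fill the rest
    # with every other slope from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = alibi_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(slope, seq_len):
    """Causal bias matrix for one head: -slope * (query_pos - key_pos)."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(40)  # 40 heads is an assumption for illustration
print(len(slopes), slopes[0])
```

Because the bias depends only on relative distance, no position embedding table or rotary cache is needed.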