Deploying Qwen3.6-27B on vLLM-Ascend 0.18.0rc1

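1. Introduction

This document records the quick deployment and verification results for Qwen3.6-27B on vLLM-Ascend 0.18.0rc1. The overall procedure follows the official Qwen3.5-27B tutorial. For Qwen3.6-27B, two additional points were verified:

The MTP method changes to qwen3_next_mtp
The optional request-level setting preserve_thinking

According to the model configuration, Qwen3.6-27B still uses the Qwen3_5ForConditionalGeneration inference path, so it is largely compatible with the Qwen3.5-27B deployment and can be migrated and verified quickly.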

2. Verified Environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1+gitee7ba04`

NPU: 2 logical devices
Model path: /mnt/weight/Qwen3.6-27B
Service port: 8000

3. Starting the Service

Before starting, check whether the port is already in use:

ss -lntp | grep ':8000 ' || true

Start command verified to work:

export ASCEND_RT_VISIBLE_DEVICES=14,15
export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1

vllm serve /mnt/weight/Qwen3.6-27B \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 1 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3.6-27b \
  --max-num-seqs 32 \
  --max-model-len 133000 \
  --max-num-batched-tokens 8096 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2,"disable_padded_drafter_batch":false}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93,96]}' \
  --additional-config '{"enable_cpu_binding":true}'

4. Smoke Test

Basic checks:

curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "用一句中文说明 TCP 和 UDP 的核心区别。"}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'

Results:

/v1/models returns 200
/v1/chat/completions returns 200
The reasoning field is returned correctly
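
Model loading and graph capture take time, so the checks above succeed only once the server has finished starting. A minimal sketch of a readiness poll, assuming curl is available; the function name wait_for_ready and its attempt and delay defaults are hypothetical and not part of vLLM:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vLLM): poll /v1/models until it answers
# with HTTP 200, or give up after a fixed number of attempts.
wait_for_ready() {
  local base_url="$1"
  local attempts="${2:-60}"   # assumed default: 60 tries
  local delay="${3:-5}"       # assumed default: 5 seconds between tries
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "${base_url}/v1/models" 2>/dev/null) || code=000
    if [ "$code" = "200" ]; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}
```

For example, `wait_for_ready http://127.0.0.1:8000 && curl -sf http://127.0.0.1:8000/v1/models` runs the first smoke check only after the endpoint responds.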

preserve_thinking was also verified to take effect as an optional setting. Example:

{
  "extra_body": {
    "chat_template_kwargs": {
      "preserve_thinking": true
    }
  }
}

Observed behavior:

prompt_tokens=71 with preserve_thinking=false
prompt_tokens=100 with preserve_thinking=true
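
Note that extra_body is the OpenAI Python client's parameter for extra request fields; when the HTTP API is called directly, chat_template_kwargs sits at the top level of the JSON body. Below is a hedged sketch of a payload builder for comparing prompt_tokens with the flag on and off. The function name build_payload, the sample conversation, and carrying the earlier reasoning as <think> tags inside the assistant message are all assumptions; how prior reasoning is represented depends on the model's chat template.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a chat request body with preserve_thinking on or off.
# Assumption: the earlier assistant turn carries its reasoning in <think> tags.
build_payload() {
  local preserve="$1"   # must be the JSON literal true or false
  cat <<EOF
{
  "model": "qwen3.6-27b",
  "messages": [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": "<think>Add the two numbers.</think>5"},
    {"role": "user", "content": "Now multiply that by 4."}
  ],
  "max_tokens": 128,
  "chat_template_kwargs": {"preserve_thinking": ${preserve}}
}
EOF
}
```

Each variant can be sent with `build_payload true | curl -s http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d @-`, after which usage.prompt_tokens in the two responses can be compared.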

5. Performance Reference

Test conditions: 8k input / 1k output / concurrency=8, two consecutive runs; the figures below are from the second run.

Metric	Value
`duration`	`121.858 s`
`request_throughput`	`0.263 req/s`
`output_throughput`	`262.601 tok/s`
`total_token_throughput`	`2363.408 tok/s`
`mean_ttft_ms`	`3036.530 ms`
`median_ttft_ms`	`1563.622 ms`
`p99_ttft_ms`	`11261.911 ms`
`mean_tpot_ms`	`26.849 ms`
`median_tpot_ms`	`26.519 ms`
`p99_tpot_ms`	`35.933 ms`
`spec_decode_acceptance_rate`	`93.297%`
`spec_decode_acceptance_length`	`2.866`
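
The figures in the table can be cross-checked against each other. With num_speculative_tokens=2, an acceptance length of 1 + 2 × 0.93297 ≈ 2.866 matches the reported value, and throughput × duration gives the token totals. A small sketch using awk follows; the function names are hypothetical, and the relation "acceptance length = 1 + k × acceptance rate" is an assumption that happens to agree with the reported numbers, not something the source states.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sanity-checking the reported benchmark figures.

# Expected tokens emitted per decode step: 1 verified token plus k drafts,
# each accepted at the given rate (assumed relation, see the text above).
acceptance_length() {
  LC_ALL=C awk -v k="$1" -v rate="$2" 'BEGIN { printf "%.3f\n", 1 + k * rate }'
}

# Total tokens processed = throughput (tok/s) * duration (s), rounded.
total_tokens() {
  LC_ALL=C awk -v tput="$1" -v dur="$2" 'BEGIN { printf "%.0f\n", tput * dur }'
}

acceptance_length 2 0.93297      # prints 2.866
total_tokens 262.601 121.858     # prints 32000 (output tokens)
total_tokens 2363.408 121.858    # prints 288000 (input + output tokens)
```

32000 output tokens and 288000 total tokens are consistent with 32 requests of about 8000 input and 1000 output tokens each, although the request count is inferred rather than stated.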

For benchmarking, specify the tokenizer explicitly:

--tokenizer /mnt/weight/Qwen3.6-27B
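
The source does not name the benchmark tool. The metric names in the table match the result keys reported by vllm bench serve, so the sketch below assumes that tool. The prompt count of 32 and the lengths 8000 and 1000 are inferred from the reported throughput and duration rather than stated in the source, and the function name run_bench is hypothetical. Passing --tokenizer presumably matters because --model carries the served name qwen3.6-27b, which is not a path a tokenizer can be loaded from.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `vllm bench serve`. All values are assumptions
# inferred from the reported results, not the author's actual command.
run_bench() {
  vllm bench serve \
    --backend vllm \
    --host 127.0.0.1 \
    --port 8000 \
    --model qwen3.6-27b \
    --tokenizer /mnt/weight/Qwen3.6-27B \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --num-prompts 32 \
    --max-concurrency 8 \
    --ignore-eos
}
```

Calling run_bench twice in a row and keeping the second result mirrors the "two consecutive runs" procedure described in the test conditions.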

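6. Accuracy Evaluation

Accuracy was evaluated on AIME26 with EvalScope over 3 rounds.

Metric	Value
Dataset	`AIME26`
Evaluation tool	`EvalScope`
Rounds	`3`
Samples per round	`30`
Average score	`93.3`
Best round	`round3`
Best score	`96.7`
Reference score	`94.1`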

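7. Notes

qwen3_next_mtp is the most likely source of problems in the current environment.

If you only change Qwen3.5-27B's qwen3_5_mtp to qwen3_next_mtp but keep the default automatic graph-bucket configuration, the service may fail during ACL Graph capture. The observed failure signature is:

Key error: KeyError: 90
Location: /vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py
Final error: RuntimeError: Engine core initialization failed

The cause is not the weights or the transformers version. With ACL Graph + MTP + num_speculative_tokens > 1, the default graph buckets may not cover the batch shapes that actually occur during decoding.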
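
To tell this failure apart from other startup errors, the three markers listed above can be searched for in the server log. A minimal sketch, assuming the server output was redirected to a file; the function name and the log path in the usage note are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the log shows all three markers of the
# graph-capture failure described above.
has_graph_capture_failure() {
  local log_file="$1"
  grep -Eq 'KeyError: [0-9]+' "$log_file" &&
    grep -q 'attention_v1\.py' "$log_file" &&
    grep -q 'Engine core initialization failed' "$log_file"
}
```

For example, `has_graph_capture_failure ./vllm_server.log && echo "set cudagraph_capture_sizes explicitly"` prints the hint only when all three markers are present.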

The workaround that works in the current environment is to specify the capture sizes explicitly:

cudagraph_capture_sizes=[3,6,9,12,...,96]
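
The list above consists of the multiples of 3 up to 96, which equals (num_speculative_tokens + 1) × max-num-seqs = 3 × 32 for the start command in this document. Assuming that relationship is what lets the buckets cover every decode batch (the source does not state the rule explicitly), the list can be generated instead of typed by hand. The function name gen_capture_sizes is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a JSON list of the multiples of
# (num_speculative_tokens + 1), one per possible decode batch size.
# Assumption: this is the rule behind the verified list; the source does not say so.
gen_capture_sizes() {
  local num_spec_tokens="$1"
  local max_num_seqs="$2"
  local step=$(( num_spec_tokens + 1 ))
  local sizes=""
  local i=1
  while [ "$i" -le "$max_num_seqs" ]; do
    # Append with a comma separator except before the first element.
    sizes="${sizes:+${sizes},}$(( i * step ))"
    i=$(( i + 1 ))
  done
  printf '[%s]\n' "$sizes"
}

# Reproduces the 32-element list used in the start command.
gen_capture_sizes 2 32
```

The output can be pasted into the cudagraph_capture_sizes field of --compilation-config. If num_speculative_tokens or --max-num-seqs changes, the list should be regenerated and the service start re-verified.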

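In addition, preserve_thinking is a request-level parameter and requires no change to the service start command. For multi-turn agent scenarios, verify it first and enable it as needed; for pure-throughput scenarios, keep it at its default (off).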