SmolLM2-135M-Instruct on vLLM-Ascend 0.18.0rc1

1. Introduction

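This document records the quick deployment and verification results for HuggingFaceTB/SmolLM2-135M-Instruct on vLLM-Ascend 0.18.0rc1. The overall deployment procedure matches the official Qwen3.5-27B / Qwen3.6-27B tutorials. SmolLM2-135M-Instruct is a small model with the standard LlamaForCausalLM architecture (135M parameters, bf16 weights), so the adaptation work focuses on verifying two points:

vLLM-Ascend handles the HF transformers LlamaForCausalLM architecture natively, with zero changes to the model code
A single NPU chip can host the model (HBM usage is far below the 64 GB of one card), which makes it suitable as a quick smoke-test and load-test baseline for the vllm-ascend pipeline

According to the model config, SmolLM2-135M-Instruct follows the classic Llama recipe: 30 layers / hidden_size=576 / GQA (9 heads / 3 KV heads) / RoPE θ=100000 / tied embeddings, so vllm-ascend deployment experience from the Qwen2 / Qwen3 series carries over directly.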
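The config values above are enough to estimate the KV cache footprint by hand. The following is a minimal sketch, assuming head_dim = hidden_size / num_heads (the usual Llama convention) and 2 bytes per element for fp16; `LlamaShape` and the function names are illustrative and not part of vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    """Shape values quoted in the text above (not read from config.json)."""
    num_layers: int = 30
    hidden_size: int = 576
    num_heads: int = 9
    num_kv_heads: int = 3


def head_dim(s: LlamaShape) -> int:
    # Assumption: each attention head gets an equal slice of the hidden size.
    return s.hidden_size // s.num_heads


def kv_bytes_per_token(s: LlamaShape, dtype_bytes: int = 2) -> int:
    # K and V (factor 2) are cached in every layer for every KV head.
    return 2 * s.num_layers * s.num_kv_heads * head_dim(s) * dtype_bytes


shape = LlamaShape()
print(head_dim(shape))                           # 64
print(shape.num_heads // shape.num_kv_heads)     # 3 query heads share one KV head
print(kv_bytes_per_token(shape))                 # 23040 bytes per token
print(kv_bytes_per_token(shape) * 8192 / 2**20)  # 180.0 MiB for one 8192-token sequence
```

At roughly 180 MiB per full-length sequence, the KV cache stays small next to a 64 GB card, which is consistent with the single-chip claim above.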

2. Verification Environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0`
`transformers`	`4.57.6`
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`CANN`	`8.5.1`

NPU: 1 logical card (chip 0, physical Ascend910 phy-id 10, SOC ascend910_9391)
Model path: /tmp/models/SmolLM2-135M-Instruct
Service port: 8000
Container: Huawei Cloud ModelArts pod (aarch64, HOME=/opt/atomgit)
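When reproducing this setup, it can help to compare the installed Python packages against the table above before launching anything. This is a rough sketch, assuming the pip distribution names match the labels in the table; CANN is left out because it is not a Python distribution. `EXPECTED`, `installed_version`, and `version_mismatches` are hypothetical names.

```python
from importlib import metadata
from typing import Callable, Dict

# Expected versions, copied from the table above.
EXPECTED = {
    "vllm-ascend": "0.18.0rc1",
    "vllm": "0.18.0",
    "transformers": "4.57.6",
    "torch": "2.9.0+cpu",
    "torch-npu": "2.9.0.post1+gitee7ba04",
}


def installed_version(dist_name: str) -> str:
    """Return the installed version, or 'missing' if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "missing"


def version_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, str]:
    """Map each mismatching package to a 'found != expected' message."""
    out = {}
    for name, want in expected.items():
        got = lookup(name)
        if got != want:
            out[name] = f"{got} != {want}"
    return out


# Print one line per package that differs from the verified environment.
for name, msg in version_mismatches(EXPECTED).items():
    print(f"{name}: {msg}")
```

The `lookup` parameter exists only so the comparison can be exercised without the packages installed.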

3. Starting the Service

Before starting, you can check whether the port is already in use:

ss -lntp | grep ':8000 ' || true

Launch command verified to work:

export PYTHONPATH=/usr/local/Ascend/cann-8.5.1/python/site-packages:${PYTHONPATH}
export ASCEND_RT_VISIBLE_DEVICES=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1

vllm serve /tmp/models/SmolLM2-135M-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --served-model-name smollm2-135m \
  --max-num-seqs 32 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.30 \
  --no-enable-prefix-caching \
  --dtype float16

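Parameter notes:

--tensor-parallel-size 1: a 135M model fits comfortably on one chip, which avoids HCCL overhead
--max-model-len 8192: matches max_position_embeddings in config.json
--gpu-memory-utilization 0.30: the model plus KV cache need very little memory, so there is no need to reserve the whole card
--dtype float16: the original weights are bf16; fp16 inference produces exactly the same argmax as the CPU fp32 baseline (precision is sufficient, see the note at the end of §4)
--reasoning-parser / --speculative_config are not used: SmolLM2 is not a reasoning model and has no trained MTP head
Custom ACL Graph capture buckets are not enabled: the vllm-ascend defaults are sufficient for a small short-context model, and the KeyError: 90 path mentioned in the Qwen3.6-27B document was not hit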
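The alignment between --max-model-len and config.json can be checked before launch. Below is a minimal sketch that reads the two standard Hugging Face config keys `architectures` and `max_position_embeddings`; the helper names `load_config` and `precheck_config` are hypothetical and not part of vLLM.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_config(model_dir: str) -> Dict[str, Any]:
    """Read config.json from a local model directory."""
    return json.loads((Path(model_dir) / "config.json").read_text(encoding="utf-8"))


def precheck_config(cfg: Dict[str, Any], max_model_len: int) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    archs = cfg.get("architectures", [])
    if "LlamaForCausalLM" not in archs:
        problems.append(f"unexpected architectures: {archs}")
    limit = cfg.get("max_position_embeddings")
    if limit is not None and max_model_len > limit:
        problems.append(
            f"--max-model-len {max_model_len} exceeds max_position_embeddings {limit}"
        )
    return problems
```

A typical call would be `precheck_config(load_config("/tmp/models/SmolLM2-135M-Instruct"), 8192)`, which should return an empty list for the setup described here.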

4. Smoke Test

Basic checks:

curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smollm2-135m",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "用一句中文说明 TCP 和 UDP 的核心区别。"}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'

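Results:

/v1/models returns 200 with model ID smollm2-135m and max_model_len=8192
/v1/chat/completions returns 200 with coherent Chinese output (given the 135M capacity, answers to complex questions tend to go in circles; this is a limit of the model itself, not an adaptation bug)
The usage field is populated correctly (prompt_tokens=45, completion_tokens=128), and finish_reason="length" because generation hit max_tokens
The reasoning field is null (expected, since SmolLM2 is not a reasoning model)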
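The same smoke check can be scripted so that the fields in the checklist are extracted automatically. This is a sketch using only the standard library and the OpenAI-compatible response layout (`choices[0].finish_reason`, `usage.*`); `chat_payload`, `post_chat`, and `summarize` are illustrative names, and `post_chat` needs the server from §3 to be running.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://127.0.0.1:8000"  # host and port from the launch command


def chat_payload(prompt: str, max_tokens: int = 128) -> Dict[str, Any]:
    """Build the same request body as the curl example."""
    return {
        "model": "smollm2-135m",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def post_chat(payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST to /v1/chat/completions and decode the JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def summarize(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Pull out the fields the checklist above looks at."""
    choice = resp["choices"][0]
    return {
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": resp["usage"]["prompt_tokens"],
        "completion_tokens": resp["usage"]["completion_tokens"],
        "has_text": bool(choice["message"].get("content")),
    }
```

With the server up, `summarize(post_chat(chat_payload("...")))` gives a compact dictionary that can be compared against the values listed above.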

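Note: the smollm2_adapt/ directory in this repository also keeps the artifacts of the direct HF transformers inference path (run_npu.py / npu_compat.py / verify_precision.py). It complements the vLLM path described here: the former suits research iteration and precision alignment, the latter suits serving. In the CPU fp32 vs NPU fp16 logits comparison, the argmax matches and the top-10 sets overlap completely (max_abs_diff=0.039).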
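The three consistency metrics mentioned above (argmax match, top-k overlap, maximum absolute difference) can be computed as follows. This is an illustrative pure-Python sketch over plain lists of floats, not the repository's verify_precision.py; `top_k_indices` and `compare_logits` are hypothetical names.

```python
from typing import Dict, Sequence, Set


def top_k_indices(logits: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def compare_logits(
    ref: Sequence[float], test: Sequence[float], k: int = 10
) -> Dict[str, object]:
    """Compare a reference logit vector (e.g. CPU fp32) with a test one (e.g. NPU fp16)."""
    if len(ref) != len(test):
        raise ValueError("logit vectors must have the same length")
    ref_argmax = max(range(len(ref)), key=lambda i: ref[i])
    test_argmax = max(range(len(test)), key=lambda i: test[i])
    return {
        "argmax_match": ref_argmax == test_argmax,
        "topk_overlap": len(top_k_indices(ref, k) & top_k_indices(test, k)),
        "max_abs_diff": max(abs(a - b) for a, b in zip(ref, test)),
    }
```

A result with `argmax_match` true and `topk_overlap` equal to `k` corresponds to the "argmax matches, top-10 fully overlap" outcome reported in the note.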

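5. Performance Reference

Test conditions: 512 input / 256 output / concurrency=8 / num_prompts=32 (sized for a 135M model; the template's 8k/1k long-context configuration was not used). The benchmark was run twice in a row, and the figures below are from the second run.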

Metric	Value
`duration`	`14.78 s`
`request_throughput`	`2.17 req/s`
`output_throughput`	`554.28 tok/s`
`total_token_throughput`	`1662.84 tok/s`
`peak_output_throughput`	`616.00 tok/s`
`mean_ttft_ms`	`93.63 ms`
`median_ttft_ms`	`88.50 ms`
`p99_ttft_ms`	`114.11 ms`
`mean_tpot_ms`	`14.12 ms`
`median_tpot_ms`	`14.28 ms`
`p99_tpot_ms`	`14.90 ms`
`mean_itl_ms`	`14.06 ms`
`p99_itl_ms`	`16.03 ms`
`successful_requests`	`32 / 32`
`failed_requests`	`0`
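The figures in the table are mutually consistent, which can be confirmed with a little arithmetic. The sketch below assumes every request generated exactly 256 output tokens and that the 32 requests ran as four back-to-back waves of 8; both are simplifications of how the benchmark actually schedules work.

```python
# Figures from the table and the test conditions above.
NUM_PROMPTS, CONCURRENCY = 32, 8
INPUT_LEN, OUTPUT_LEN = 512, 256
DURATION_S = 14.78
MEAN_TTFT_MS, MEAN_TPOT_MS = 93.63, 14.12


def output_throughput() -> float:
    """Generated tokens per second over the whole run."""
    return NUM_PROMPTS * OUTPUT_LEN / DURATION_S


def total_throughput() -> float:
    """Prompt plus generated tokens per second over the whole run."""
    return NUM_PROMPTS * (INPUT_LEN + OUTPUT_LEN) / DURATION_S


def estimated_duration_s() -> float:
    """Rough duration: waves of CONCURRENCY requests, each taking TTFT + TPOT * (n - 1)."""
    per_request_ms = MEAN_TTFT_MS + MEAN_TPOT_MS * (OUTPUT_LEN - 1)
    waves = NUM_PROMPTS / CONCURRENCY
    return waves * per_request_ms / 1000


print(round(output_throughput(), 1))     # close to the reported 554.28 tok/s
print(round(total_throughput(), 1))      # close to the reported 1662.84 tok/s
print(round(estimated_duration_s(), 2))  # close to the reported 14.78 s
```

The small residual differences come from the duration being rounded to two decimals in the table.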

Benchmark command:

export PYTHONPATH=/usr/local/Ascend/cann-8.5.1/python/site-packages:${PYTHONPATH}
vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:8000 \
  --endpoint /v1/chat/completions \
  --model smollm2-135m \
  --tokenizer /tmp/models/SmolLM2-135M-Instruct \
  --dataset-name random \
  --num-prompts 32 \
  --random-input-len 512 \
  --random-output-len 256 \
  --max-concurrency 8 \
  --seed 2048

Note: speculative decoding is not enabled (SmolLM2 has no MTP head), so there are no spec_decode_acceptance_rate / acceptance_length metrics.

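6. Accuracy Evaluation

No accuracy evaluation was run. SmolLM2-135M-Instruct is very small (135M parameters, roughly 0.2% of the scale of a large model), and its scores on mainstream benchmarks such as AIME / GSM8K / MMLU are close to the random baseline, so evaluating them says little. This round covers only three checks (functionality, performance, and inference consistency) as a smoke-test baseline for the vllm-ascend deployment pipeline. If accuracy data is really needed, consider zero/few-shot tasks that are friendlier to small models, such as HellaSwag / ARC-easy / WinoGrande, or use lm-eval-harness to run the same subsets reported in the SmolLM2 paper and compare scores.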

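7. Notes

Starting SmolLM2 with vllm-ascend 0.18.0rc1 in the current Huawei Cloud ModelArts pod hit three environment issues unrelated to the model (7.1 to 7.3), handled as described below; 7.4 lists benign warnings: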

7.1 `ModuleNotFoundError: No module named 'acl'`

This is the most likely one to hit. /etc/profile.d/* injects the CANN lib / bin / OPP paths, but nothing adds the CANN pyACL package to PYTHONPATH, so vllm_ascend.device_allocator.camem fails as soon as it imports acl.rt.

File "/vllm-workspace/vllm-ascend/vllm_ascend/device_allocator/camem.py", line 26
    from acl.rt import memcpy
ModuleNotFoundError: No module named 'acl'

Before starting vllm, this is required:

export PYTHONPATH=/usr/local/Ascend/cann-8.5.1/python/site-packages:${PYTHONPATH}

Verify: python3 -c "import acl; print(acl.__file__)" should print /usr/local/Ascend/cann-8.5.1/python/site-packages/acl.so.
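The check can also be done without triggering the import itself, which is convenient in a launcher script. A minimal sketch using `importlib.util.find_spec`; `CANN_SITE` is the path from the export above, and `module_available` / `pythonpath_with` are hypothetical helper names.

```python
import importlib.util
import os
import sys

CANN_SITE = "/usr/local/Ascend/cann-8.5.1/python/site-packages"  # path from the export above


def module_available(name: str) -> bool:
    """True if a top-level module can be located on sys.path, without importing it."""
    return importlib.util.find_spec(name) is not None


def pythonpath_with(entry: str, current: str = "") -> str:
    """Prepend `entry` to a PYTHONPATH string unless it is already listed."""
    parts = [p for p in current.split(os.pathsep) if p]
    if entry not in parts:
        parts.insert(0, entry)
    return os.pathsep.join(parts)


if not module_available("acl"):
    fixed = pythonpath_with(CANN_SITE, os.environ.get("PYTHONPATH", ""))
    print(f"acl not importable; try: export PYTHONPATH={fixed}", file=sys.stderr)
```

Because PYTHONPATH is read at interpreter startup, the suggested export has to be applied in the shell before vllm is launched rather than from inside the running process.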

7.2 `Invalid device ID 0` / chip ID renumbering

Outside the container the default is ASCEND_VISIBLE_DEVICES=11,10 (phy-id 10/11), but inside the container vllm / torch_npu see chip IDs 0/1 (renumbered). Copying the template's ASCEND_RT_VISIBLE_DEVICES=14,15 as-is fails:

RuntimeError: ... NPU function error: aclInit, error code is 107001
[Error]: Invalid device ID.
input error deviceId:0 is err:0x7010003[FUNC:SetDefaultDeviceId]

In practice, use ASCEND_RT_VISIBLE_DEVICES=0 for a single card (it maps to phy-id 10) and ASCEND_RT_VISIBLE_DEVICES=0,1 for two cards with TP=2.

Troubleshooting: run npu-smi info to see the actual chip numbering.
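The renumbering can be pictured as a mapping from container-local IDs to physical IDs. The sketch below assumes that logical IDs are assigned in ascending phy-id order; this matches the one data point observed here (chip 0 is phy-id 10 when 11,10 is exposed) but is otherwise an unverified assumption, so npu-smi info remains the authority. All function names are hypothetical.

```python
from typing import Dict, List


def parse_ids(value: str) -> List[int]:
    """Parse a comma-separated device list such as '11,10'."""
    return [int(x) for x in value.split(",") if x.strip()]


def logical_to_physical(visible_physical: str) -> Dict[int, int]:
    # Assumption: logical IDs follow ascending phy-id order inside the container.
    return dict(enumerate(sorted(parse_ids(visible_physical))))


def rt_visible_devices(tensor_parallel_size: int, visible_physical: str) -> str:
    """Value for ASCEND_RT_VISIBLE_DEVICES, expressed in container-local IDs."""
    mapping = logical_to_physical(visible_physical)
    if tensor_parallel_size > len(mapping):
        raise ValueError("not enough visible devices for this tensor-parallel size")
    return ",".join(str(i) for i in range(tensor_parallel_size))


print(logical_to_physical("11,10"))    # {0: 10, 1: 11}
print(rt_visible_devices(1, "11,10"))  # 0
print(rt_visible_devices(2, "11,10"))  # 0,1
```

The point of the helper is that the value passed to ASCEND_RT_VISIBLE_DEVICES always starts from 0 inside the container, whatever the physical IDs are.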

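7.3 File sync: stdin does not pass from local to the container

The SSH entry point in this environment is actually pyssh.py (a paramiko-based server implemented inside the container, not a standard sshd), and it does not forward stdin to the remote command. As a result rsync / scp / tar cf - … | ssh … 'tar xf -' all fail. The only way to push code into the container is to pass base64 data as a command argument: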

for f in *.py *.sh; do
  B64=$(base64 -w0 "$f")
  ssh -p 6122 atomgit@8.162.0.229 "echo '$B64' | base64 -d > /opt/atomgit/dst/$f"
done

The reverse direction (container → local) works because stdout passes through: ssh -p 6122 atomgit@8.162.0.229 'cd <dir> && tar cf - .' | tar xf - -C <local>.
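The base64-via-argument push can also be driven from Python, which makes quoting of the remote path explicit. This is a sketch under the same assumptions as the shell loop (same host, port, and a remote `base64` binary); `SSH_TARGET` and `push_command` are illustrative names.

```python
import base64
import shlex
from typing import List

SSH_TARGET = ["ssh", "-p", "6122", "atomgit@8.162.0.229"]  # endpoint from the loop above


def push_command(data: bytes, remote_path: str) -> List[str]:
    """Build an argv that recreates `data` at `remote_path` without using stdin."""
    b64 = base64.b64encode(data).decode("ascii")
    # The base64 alphabet (A-Z, a-z, 0-9, +, /, =) is safe inside single quotes;
    # the remote path is quoted separately because it may contain spaces.
    remote = f"echo '{b64}' | base64 -d > {shlex.quote(remote_path)}"
    return SSH_TARGET + [remote]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Since the whole file travels on the command line, large files will probably run into argument-length limits and would need to be split into chunks.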

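7.4 Benign warnings during startup / inference

None of these affect functionality:

Permission mismatch: ... libop_plugin_atb.so does not match (CANN was installed by root, and the atomgit user has read-only access)
can not create directory: /home/atomgit/ascend/log (torch_npu hard-codes $HOME/ascend/log, but the container's HOME is /opt/atomgit; export ASCEND_PROCESS_LOG_PATH=/opt/atomgit/ascend/log silences it)
pad_token == eos_token warning (default HF behavior, with no practical effect on generation)
DeprecationWarning: builtin type swigvarlink has no __module__ attribute (a leftover of the pyACL swig bindings, harmless)

All of them can be ignored.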
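When scanning a long startup log, it can be useful to filter out the known-benign lines so that anything new stands out. The sketch below is only illustrative: the regular expressions are paraphrased from the four warnings above, and the exact wording in real logs (especially for the pad_token message) is an assumption.

```python
import re
from typing import Iterable, List, Sequence

# Patterns paraphrased from the four warnings listed above.
BENIGN_PATTERNS = [
    re.compile(r"Permission mismatch: .*libop_plugin_atb\.so"),
    re.compile(r"can not create directory: .*/ascend/log"),
    re.compile(r"pad_token.*eos_token", re.IGNORECASE),
    re.compile(r"DeprecationWarning: builtin type swigvarlink"),
]


def is_benign(line: str) -> bool:
    """True if the line matches one of the known-benign patterns."""
    return any(p.search(line) for p in BENIGN_PATTERNS)


def unexpected_lines(
    lines: Iterable[str], keywords: Sequence[str] = ("error", "warning")
) -> List[str]:
    """Keep lines that mention a keyword but match no known-benign pattern."""
    return [
        ln for ln in lines
        if any(k in ln.lower() for k in keywords) and not is_benign(ln)
    ]
```

Feeding the server log through `unexpected_lines` leaves only messages that are not on the list above, such as the aclInit error from 7.2.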
