SmolLM-1.7B on Ascend NPU

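1. Introduction

This document records the deployment and validation results for HuggingFaceTB/SmolLM-1.7B (the pretrained base model, 1.7B parameters) in an Ascend NPU environment.

SmolLM-1.7B is a lightweight language model released by HuggingFace. It uses the Llama architecture with 24 Transformer layers, a hidden size of 2048, and a vocabulary of 49,152 tokens. On Ascend NPU the model runs inference through PyTorch + torch_npu. The LlamaForCausalLM architecture is fully supported, so the model runs without additional adaptation.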

2. Validation Environment

Component	Version
`Python`	`3.11.14`
`PyTorch`	`2.9.0+cpu`
`torch_npu`	`2.9.0`
`transformers`	`4.57.6`
`CANN`	`8.5.1`
`accelerate`	`1.13.0`
`SOC`	`ascend910_9391`

NPU: Ascend 910B2 × 4 cards
Inference device: npu:0 (single card)
Model path: /home/openmind/volume/models/HuggingFaceTB/SmolLM-1.7B

3. Model Configuration

{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2048,
  "intermediate_size": 8192,
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "num_key_value_heads": 32,
  "max_position_embeddings": 2048,
  "rope_theta": 10000.0,
  "vocab_size": 49152,
  "torch_dtype": "float32",
  "tie_word_embeddings": false
}
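
The configuration values above are enough to estimate the parameter count by hand. The sketch below does that arithmetic for a Llama-style decoder. It assumes no bias terms in the attention and MLP projections (an assumption, not something the configuration excerpt states), and the function name `llama_param_count` is illustrative. It reports the count both with and without a separate output projection, so the result can be compared against the nominal 1.7B.

```python
# Rough parameter count for a Llama-style decoder from the config values above.
# Assumption: attention and MLP projections carry no bias terms.

def llama_param_count(hidden, intermediate, layers, vocab, heads, kv_heads,
                      tied_embeddings):
    head_dim = hidden // heads
    embed = vocab * hidden                        # token embedding matrix
    q_and_o = 2 * hidden * hidden                 # query and output projections
    k_and_v = 2 * hidden * (kv_heads * head_dim)  # key and value projections
    mlp = 3 * hidden * intermediate               # gate, up, and down projections
    norms = 2 * hidden                            # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return embed + layers * per_layer + final_norm + lm_head


config = dict(hidden=2048, intermediate=8192, layers=24,
              vocab=49152, heads=32, kv_heads=32)
untied = llama_param_count(**config, tied_embeddings=False)
tied = llama_param_count(**config, tied_embeddings=True)
print(f"untied: {untied / 1e9:.2f}B, tied: {tied / 1e9:.2f}B")
```

The embedding matrix alone accounts for about 0.1B parameters, so whether the output projection shares it is visible in the total.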

4. Model Download

Download the model weights with snapshot_download from modelscope:

pip install modelscope

python3 -c "
from modelscope import snapshot_download
snapshot_download(
    'HuggingFaceTB/SmolLM-1.7B',
    cache_dir='/home/openmind/volume/models'
)
"

The model is about 3.4 GB and consists of 2 safetensors shards.
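
After the download it can be useful to confirm the shard count and total size against the figures above. The following is a minimal sketch using only the standard library. The helper name `summarize_shards` is hypothetical, and it assumes the shards sit directly in the model directory with a `.safetensors` extension.

```python
from pathlib import Path


def summarize_shards(model_dir):
    """Return (shard_count, total_bytes) for *.safetensors files in model_dir."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    return len(shards), sum(p.stat().st_size for p in shards)


# Usage (the path is whatever directory the download produced):
# count, total = summarize_shards("/path/to/downloaded/model")
# print(f"{count} shard(s), {total / 1e9:.2f} GB")
```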

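5. Inference Validation (inference.py)

The inference script inference.py supports NPU and CPU devices and provides three run modes: single prompt, interactive, and a default demo.

Quick start

# Default demo (4 example prompts)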
python3 inference.py --device npu:0

# Specify a prompt
python3 inference.py --device npu:0 --prompt "The capital of France is"

# Interactive mode
python3 inference.py --device npu:0 --interactive

Full parameters

python3 inference.py \
  --model_path /home/openmind/volume/models/HuggingFaceTB/SmolLM-1.7B \
  --device npu:0 \
  --max_new_tokens 128
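
The script itself is not reproduced in this document. As a rough illustration of how the flags shown above could map onto the three run modes, here is a sketch built on argparse. The default values, the mutual exclusion of `--prompt` and `--interactive`, and the helper names `build_parser` and `select_mode` are all assumptions made for the example, not confirmed behavior of inference.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="SmolLM-1.7B inference (sketch)")
    # Defaults below are assumed for illustration only.
    parser.add_argument(
        "--model_path",
        default="/home/openmind/volume/models/HuggingFaceTB/SmolLM-1.7B",
    )
    parser.add_argument("--device", default="npu:0", help="e.g. npu:0 or cpu")
    parser.add_argument("--max_new_tokens", type=int, default=128)
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--prompt", help="run a single prompt and exit")
    mode.add_argument("--interactive", action="store_true",
                      help="read prompts in a loop")
    return parser


def select_mode(args):
    """Pick the run mode: interactive, single prompt, or the default demo."""
    if args.interactive:
        return "interactive"
    if args.prompt is not None:
        return "single"
    return "demo"
```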

Inference results

Prompt	Output
`The capital of France is`	`Paris.`
`2 + 2 =`	`4`
`The largest planet in our solar system is`	`Jupiter.`
`Water boils at`	`100°C (212°F).`

Validation result: the model loaded onto npu:0 successfully, inference output is normal, and the generated text is coherent.
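
For readers who want to reproduce a single generation without the script, the sketch below shows one common way to load a causal LM with transformers and run it on an NPU device. It is not the code of inference.py. The helper names `uses_npu` and `generate_once` are hypothetical, greedy decoding (`do_sample=False`) is an assumed choice, and the heavy imports are kept inside the function so the module can be imported on a machine without torch_npu.

```python
def uses_npu(device):
    """True when the device string names an Ascend NPU, e.g. 'npu:0'."""
    return device.startswith("npu")


def generate_once(model_path, prompt, device="npu:0", max_new_tokens=128):
    import torch
    if uses_npu(device):
        import torch_npu  # noqa: F401  (importing registers the 'npu' device)
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token is None:
        # The model ships without a pad token, so reuse the EOS token.
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float32
    ).to(device).eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding (assumed)
            pad_token_id=tokenizer.pad_token_id,
        )
    # Drop the prompt tokens and decode only the continuation.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because this is a base model, the return value is a continuation of the prompt rather than a chat-style answer.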

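6. Accuracy Evaluation

Evaluation configuration

Parameter	Value
Dataset	GSM8K
Evaluation method	5-shot
Total samples	1,319
Max tokens	256
Temperature	1.0 (greedy decoding, so the temperature has no effect)
Torch dtype	`float32`
Evaluation device	`npu:0`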

Evaluation command

python3 eval_gsm8k.py npu:0
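
The scoring logic of eval_gsm8k.py is not shown here. A common approach for GSM8K, sketched below, is to take the number after `####` in the reference answer and compare it with the last number in the model's generation. The function names, the regular expression, and the `Question:` stop marker (used to cut off text where a base model starts writing a new question) are assumptions for illustration.

```python
import re

# Integers or decimals, with optional thousands separators and sign.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def gold_answer(reference):
    """GSM8K reference answers end with '#### <number>'."""
    return reference.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation, stop="Question:"):
    """Return the last number before the stop marker, or None if none exists."""
    answer_part = generation.split(stop)[0]
    matches = _NUMBER.findall(answer_part)
    return matches[-1].replace(",", "") if matches else None


def is_correct(generation, reference):
    pred = predicted_answer(generation)
    if pred is None:
        return False
    try:
        return float(pred) == float(gold_answer(reference))
    except ValueError:
        return False
```

Comparing as floats treats `18` and `18.0` as the same answer. A stricter or looser matching rule would change the reported accuracy, so results are only comparable between runs that use the same rule.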

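Evaluation results

Metric	NPU result	CPU baseline	Difference
GSM8K (5-shot) Accuracy	2.81%	3.70%	-0.89 pp

Conclusion: GSM8K 5-shot accuracy on the NPU is 2.81%, compared with a CPU baseline of 3.70%. The difference is -0.89 percentage points (about 12 of the 1,319 questions), which this validation treats as within an acceptable range. Validation passed.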
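
The figures in the table can be checked with a few lines of arithmetic. The snippet below uses the correct-answer count from the evaluation log and the CPU baseline from the table. The variable names are illustrative.

```python
correct, total = 37, 1319        # NPU run, from the evaluation log
npu_acc = 100 * correct / total  # accuracy in percent
cpu_acc = 3.70                   # CPU baseline from the table, in percent

diff_pp = npu_acc - cpu_acc      # difference in percentage points
questions = diff_pp / 100 * total  # same difference expressed as a question count

print(f"NPU accuracy: {npu_acc:.2f}%")
print(f"Difference: {diff_pp:+.2f} pp (about {questions:+.0f} questions)")
```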

Evaluation log

=== GSM8K 5-shot evaluation on npu:0 ===
Loading tokenizer and model...
Model loaded to npu:0
Loading GSM8K test dataset...
Total test samples: 1319
[100/1319] Acc: 2.00% | Elapsed: 854.5s | ETA: 10435s
[200/1319] Acc: 3.00% | Elapsed: 1708.7s | ETA: 9559s
[300/1319] Acc: 2.67% | Elapsed: 2562.3s | ETA: 8791s
[400/1319] Acc: 3.00% | Elapsed: 3415.9s | ETA: 7870s
[500/1319] Acc: 2.80% | Elapsed: 4272.8s | ETA: 7017s
[600/1319] Acc: 3.00% | Elapsed: 5124.0s | ETA: 6156s
[700/1319] Acc: 2.86% | Elapsed: 5977.9s | ETA: 5305s
[800/1319] Acc: 2.88% | Elapsed: 6832.2s | ETA: 4460s
[900/1319] Acc: 2.89% | Elapsed: 7685.2s | ETA: 3622s
[1000/1319] Acc: 3.00% | Elapsed: 8536.5s | ETA: 2774s
[1100/1319] Acc: 2.91% | Elapsed: 9389.8s | ETA: 1929s
[1200/1319] Acc: 2.92% | Elapsed: 10245.2s | ETA: 1087s
==================================================
Device: npu:0
Total: 1319, Correct: 37
GSM8K 5-shot Accuracy: 2.81%
Time: 11059.1s
==================================================
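
The progress lines in the log follow a regular pattern, so per-sample timing can be extracted from them. The following sketch parses lines of the form shown above. The regular expression and function names are assumptions based only on this log excerpt.

```python
import re

PROGRESS = re.compile(r"\[(\d+)/(\d+)\] Acc: ([\d.]+)% \| Elapsed: ([\d.]+)s")


def parse_progress(log_text):
    """Return (done, total, accuracy_percent, elapsed_seconds) per progress line."""
    return [
        (int(done), int(total), float(acc), float(elapsed))
        for done, total, acc, elapsed in PROGRESS.findall(log_text)
    ]


def seconds_per_sample(log_text):
    """Average seconds per sample at the last progress line."""
    done, _total, _acc, elapsed = parse_progress(log_text)[-1]
    return elapsed / done
```

Applied to the log above, this gives roughly 8.5 seconds per sample, consistent with the total run time of 11059.1 s for 1,319 samples.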

7. Performance Reference

Test conditions: 128 input tokens / 128 output tokens / 20 requests, single NPU card, float32.

Metric	Value
`duration`	`84.40 s`
`request_throughput`	`0.237 req/s`
`output_throughput`	`30.33 tok/s`
`total_token_throughput`	`57.34 tok/s`
`mean_latency_ms`	`4220.16 ms`
`median_latency_ms`	`4168.68 ms`
`p99_latency_ms`	`4797.48 ms`
`mean_tpot_ms`	`32.97 ms/token`
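
Several of the metrics above follow directly from the test conditions. The sketch below recomputes them. The formulas are assumptions about how benchmark.py defines each metric (the script is not shown), chosen because they reproduce the reported values. In particular, `mean_tpot_ms` is taken here as mean latency divided by the number of output tokens, whereas some benchmarks exclude the first token.

```python
def summarize(duration_s, num_requests, output_tokens_per_request, mean_latency_ms):
    total_output = num_requests * output_tokens_per_request
    return {
        "request_throughput": num_requests / duration_s,   # req/s
        "output_throughput": total_output / duration_s,    # tok/s
        "mean_tpot_ms": mean_latency_ms / output_tokens_per_request,
    }


metrics = summarize(
    duration_s=84.40,
    num_requests=20,
    output_tokens_per_request=128,
    mean_latency_ms=4220.16,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```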

Benchmark command

python3 benchmark.py \
  --device npu:0 \
  --input_tokens 128 \
  --output_tokens 128 \
  --num_requests 20 \
  --output_file logs/benchmark_npu.json

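8. Notes

Data type: This validation uses torch.float32, which preserves accuracy on the NPU but makes inference slow. If performance matters, try torch.float16 or bfloat16, and re-validate accuracy afterwards.
Model path: After a ModelScope download the directory name may be SmolLM-1___7B (three underscores). Make sure the path you pass matches the actual directory.
Dataset access: The GSM8K dataset is not available on ModelScope. Download test.jsonl from HuggingFace or via the GitHub raw link.
Tokenizer configuration: The model has no pad_token by default. Set tokenizer.pad_token = tokenizer.eos_token manually.
Base model: This validation uses the pretrained base model (not the Instruct version), so the output is text continuation. For conversational use, HuggingFaceTB/SmolLM-1.7B-Instruct is recommended.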
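
The directory-name issue from the model path note can be handled with a small lookup that tries both spellings. This is a sketch under the assumption that the only difference is `.` being replaced by three underscores. The helper name `resolve_model_dir` is hypothetical.

```python
from pathlib import Path


def resolve_model_dir(root, model_id="HuggingFaceTB/SmolLM-1.7B"):
    """Return the first existing directory for model_id under root, or None."""
    candidates = [
        Path(root) / model_id,                      # name as written in the model ID
        Path(root) / model_id.replace(".", "___"),  # '.' replaced by three underscores
    ]
    for path in candidates:
        if path.is_dir():
            return path
    return None


# Usage:
# model_dir = resolve_model_dir("/home/openmind/volume/models")
```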