Flan-T5 on Ascend NPU (torch_npu)

1. Introduction

This document records the quick deployment and validation results of google/flan-t5-base on a Huawei Ascend 800I A2 server.

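Flan-T5 is an encoder-decoder text generation model built on the T5 architecture. This setup runs it on the Ascend NPU through native transformers + torch_npu inference. Validation covers three core dimensions: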

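Functionality: the full encoder/decoder/generate inference path runs on the NPU
Accuracy: CPU vs NPU deviation < 1% (encoder hidden_states, decoder logits, generate token IDs)
Performance: single-card latency and throughput baselines for single-prompt and batch inference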

2. Validation Environment

Component	Version
`transformers`	`4.57.6`
`torch`	`2.9.0`
`torch-npu`	`2.9.0.post1`
`CANN`	`8.5.1`

Hardware: 800I A2 (2 x Ascend910, 64 GB HBM per card)
Model path: /opt/atomgit/flan-t5-base (config files are present; the weights must be added)
Inference device: npu:0

3. Quick Start

3.1 Environment Setup

source /usr/local/Ascend/ascend-toolkit/set_env.sh
export ASCEND_RT_VISIBLE_DEVICES=0
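
Once the environment is sourced, a short Python check can confirm that PyTorch actually sees the NPU before any model is loaded. The sketch below is an illustration rather than part of the original scripts: `pick_device` is a hypothetical helper, and it assumes that importing `torch_npu` is what makes `torch.npu` available. It falls back to `cpu` when the Ascend stack cannot be imported.

```python
def pick_device(index: int = 0) -> str:
    """Return "npu:<index>" when torch_npu reports a usable NPU, else "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the torch.npu backend)
    except (ImportError, OSError):
        # torch or the Ascend stack is not installed or not loadable here.
        return "cpu"
    return f"npu:{index}" if torch.npu.is_available() else "cpu"
```

Calling `print(pick_device())` after sourcing `set_env.sh` should report `npu:0` on a correctly configured host. A `cpu` result suggests that either `torch_npu` or the CANN environment is not set up.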

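3.2 Preparing the Weights

Download the full google/flan-t5-base weights to a local directory (if the target host has restricted network access, run this on a machine that can reach HuggingFace and copy the directory over):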

huggingface-cli download google/flan-t5-base --local-dir ./flan-t5-base
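
Because the model path in the validation environment initially holds only config files, it can help to check the directory before running inference. The following is a minimal sketch under stated assumptions: `check_model_dir` is a hypothetical helper, and the file names reflect the usual Hugging Face checkpoint layout rather than anything specified by the original scripts.

```python
from pathlib import Path

# Assumed file names (typical Hugging Face checkpoint layout).
CONFIG_FILE = "config.json"
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found in model_dir; an empty list means it looks complete."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / CONFIG_FILE).is_file():
        problems.append(f"missing {CONFIG_FILE}")
    # One weight file in either format is enough to load the model.
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing weights (expected one of: " + ", ".join(WEIGHT_FILES) + ")")
    return problems
```

For example, `check_model_dir("./flan-t5-base")` returning a non-empty list indicates that the download is incomplete.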

3.3 Inference Validation

cd Flan-T5

# Single-prompt inference
python3 inference.py /path/to/flan-t5-base \
  --prompt "translate English to German: The house is wonderful."

# Batch inference
python3 inference.py /path/to/flan-t5-base --batch

# Test mode (checks NPU compatibility without weights)
python3 inference.py /path/to/flan-t5-base --test-mode --batch
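
The internals of inference.py are not reproduced in this document. As a rough sketch of what native transformers + torch_npu inference usually looks like for a seq2seq model, the code below may serve as orientation. The function name `generate_texts` and its parameters are assumptions, and `max_new_tokens=64` only mirrors the `max_len=64` setting in the performance table, which may map to a different argument in the real script.

```python
def generate_texts(model_dir, prompts, device="npu:0", max_new_tokens=64):
    """Greedy-decode a batch of prompts with a local seq2seq checkpoint."""
    # Imports live inside the function so this module can be imported
    # on machines that lack torch or the Ascend stack.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    if device.startswith("npu"):
        import torch_npu  # noqa: F401  (registers the "npu" device type)

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()

    # Pad to the longest prompt so the batch forms a single tensor.
    inputs = tokenizer(list(prompts), return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

A call such as `generate_texts("/path/to/flan-t5-base", ["translate English to German: The house is wonderful."])` would return a list with one decoded string per prompt. Passing `device="cpu"` gives the CPU baseline with the same code path.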

3.4 Sample Inference Output

Inference output on the NPU with the real weights:

Single-prompt inference

$ python3 inference.py /path/to/flan-t5-base \
    --prompt "translate English to German: The house is wonderful."

Input: translate English to German: The house is wonderful.
Output: Das Haus ist wunderbar.
Elapsed: ~1000 ms

Batch inference

$ python3 inference.py /path/to/flan-t5-base --batch

========== Batch inference examples ==========

Input: translate English to German: The house is wonderful.
Output: Das Haus ist wunderbar.

Input: summarize: The Amazon rainforest is the largest tropical rainforest in the world.
Output: The Amazon rainforest is the largest tropical rainforest in the world.

Input: question: What is the capital of France? context: France is a country in Europe.
Output: Paris

Total elapsed: ~1040 ms

4. Accuracy Validation

Run benchmark.py to compare CPU and NPU accuracy:

python3 benchmark.py /path/to/flan-t5-base

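Validation dimensions and results:

Dimension	Method	Result
Encoder hidden_states	Compare encoder output tensors	mean_rel_diff = 0.00129%
Decoder logits	Compare first-step decoder logits	mean_rel_diff = 0.00155%
Generate token IDs	Compare greedy generate sequences	match_rate = 100%

Verdict: accuracy validation passed (both mean relative differences are below 1%, and the generated token IDs match exactly).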
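
The exact formulas used by benchmark.py are not stated here. One plausible reading of the two metrics can be sketched in plain Python as follows. It assumes that `mean_rel_diff` is the mean absolute difference normalised by the mean absolute reference value, and that `match_rate` is the fraction of positions where the token IDs agree. Both definitions are assumptions for illustration, and the real script may differ.

```python
def mean_rel_diff(reference, candidate):
    """Mean absolute difference, normalised by the mean absolute reference value."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    abs_diff = sum(abs(r - c) for r, c in zip(reference, candidate))
    abs_ref = sum(abs(r) for r in reference)
    if abs_ref == 0:
        raise ValueError("reference has zero magnitude")
    # The 1/N factors of the two means cancel, so plain sums suffice.
    return abs_diff / abs_ref


def match_rate(reference_ids, candidate_ids):
    """Fraction of positions where two token-ID sequences agree."""
    length = max(len(reference_ids), len(candidate_ids))
    if length == 0:
        return 1.0
    # Positions missing from the shorter sequence count as mismatches.
    matches = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return matches / length
```

With flattened CPU and NPU tensors as inputs, a check such as `mean_rel_diff(cpu_values, npu_values) < 0.01` would correspond to the 1% threshold used above.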

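5. Performance Reference

Test conditions: a single Ascend910 card on the 800I A2. CPU and NPU figures are both averaged after warmup (10 runs for single-prompt inference, 5 runs for batch inference).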

Scenario	CPU latency	CPU throughput	NPU latency	NPU throughput	Speedup
Single-prompt (greedy, max_len=64)	~8453 ms	7.5 tok/s	~1001 ms	64.7 tok/s	8.4x
Single-prompt (beam=4, max_len=64)	~9066 ms	-	~1135 ms	-	8.0x
Batch (bs=4, greedy)	~9058 ms	28.3 tok/s	~1006 ms	254.5 tok/s	9.0x
Batch (bs=8, greedy)	~10598 ms	48.3 tok/s	~1009 ms	507.6 tok/s	10.5x

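Note: the table lists latency and throughput for both the CPU baseline and the NPU, so the speedup that the Ascend NPU brings to Flan-T5 inference can be read off directly. The throughput speedup is computed as NPU throughput / CPU throughput, and the single-prompt latency speedup as CPU latency / NPU latency.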
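
As a worked check of the speedup column, the two ratios can be recomputed from the table values. The sketch assumes that the single-prompt rows use the latency ratio and the batch rows use the throughput ratio, which is how the note reads. The helper names are illustrative.

```python
def latency_speedup(cpu_ms, npu_ms):
    """CPU latency divided by NPU latency."""
    return cpu_ms / npu_ms


def throughput_speedup(cpu_tok_s, npu_tok_s):
    """NPU throughput divided by CPU throughput."""
    return npu_tok_s / cpu_tok_s


# Values copied from the performance table.
speedups = {
    "single greedy": latency_speedup(8453, 1001),
    "single beam=4": latency_speedup(9066, 1135),
    "batch bs=4": throughput_speedup(28.3, 254.5),
    "batch bs=8": throughput_speedup(48.3, 507.6),
}

for scenario, ratio in speedups.items():
    print(f"{scenario}: {ratio:.1f}x")
```

Rounded to one decimal place, these ratios reproduce the 8.4x, 8.0x, 9.0x and 10.5x entries in the table.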

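6. Evaluation Materials

6.1 Source Code

Flan-T5/inference.py: NPU inference script
Flan-T5/benchmark.py: accuracy/performance benchmark script

6.2 Run Logs

Flan-T5/logs/benchmark_run.log: benchmark run log
Flan-T5/logs/benchmark_result.json: structured benchmark results

6.3 Self-Verification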

cd Flan-T5

# 1. Functional check
python3 inference.py /path/to/flan-t5-base --prompt "Hello world"

# 2. Accuracy check
python3 benchmark.py /path/to/flan-t5-base

# 3. View the results
cat logs/benchmark_result.json

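7. Notes

Encoder-decoder architecture: Flan-T5 is an encoder-decoder model, not decoder-only. vLLM is not used here (vLLM 0.18.0 has limited encoder-decoder support); inference goes through native transformers.
Graph-compilation warmup: the first inference triggers CANN operator compilation and takes about 2-3 s. Run 3 warmup iterations before collecting performance data.
Obtaining weights: network access in the current environment is restricted (HuggingFace is unreachable and hf-mirror is extremely slow), so download the weights offline and upload them.
Memory: the base weights are about 1 GB and single-card inference uses about 3 GB of HBM, which is well within the capacity of the 800I A2.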
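
The warmup advice above can be turned into a small timing harness. This is a minimal sketch, not the measurement code from benchmark.py: `measure_latency_ms` and its `sync` parameter are hypothetical names. On an NPU, one would presumably pass `torch.npu.synchronize` as `sync` so that asynchronous device work has finished before the clock is read.

```python
import time


def measure_latency_ms(fn, warmup=3, runs=10, sync=None):
    """Run fn() `warmup` times untimed, then return the mean latency of `runs` timed calls in ms."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    for _ in range(warmup):
        fn()  # absorbs one-off costs such as operator compilation
    if sync is not None:
        sync()  # drain queued device work before timing starts
    elapsed_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for device work before reading the clock
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return sum(elapsed_ms) / len(elapsed_ms)
```

The defaults of 3 warmup iterations and 10 timed runs follow the single-prompt test conditions described in the performance section; batch measurements there use 5 runs.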