Flan-T5 on Ascend NPU

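1. Introduction

This document records the adaptation and validation results for Flan-T5 (Fine-tuned Language Net - Text-to-Text Transfer Transformer) on the Huawei Ascend NPU (Ascend 910).

Flan-T5 is a text-to-text generation model from Google. It builds on the T5 architecture and is instruction-tuned on the FLAN (Fine-tuned LAnguage Net) datasets, so it can handle many NLP tasks (translation, summarization, question answering, classification, and others).

The adaptation work achieved the following goals:

NPU inference: the full Flan-T5 encoder-decoder generation path runs correctly on the Ascend NPU
Accuracy parity: the maximum relative error between NPU and CPU output logits is 0.00011%, and the generated tokens are identical
Performance: the forward pass is about 20x faster and generation about 6.4x faster than CPU inference (see Section 5)
One-command inference: a single inference.py entry script is provided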

2. Validation environment

Component	Version
`transformers`	`4.57.6`
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`torchvision`	`>=0.11.1`
`numpy`	`>=1.21`

NPU: 2 logical devices (Ascend 910)
Model path: /opt/atomgit/flan-t5-npu/flan-t5-base
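
To compare a local setup against the versions listed above, the installed versions can be read from package metadata. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones from the table.

```python
from importlib import metadata

# Distribution names as listed in the environment table.
PACKAGES = ["transformers", "torch", "torch-npu", "torchvision", "numpy"]


def installed_versions(packages):
    """Return {name: version string, or None when the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

A missing package is reported rather than raised, so the check can run before anything is installed.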

3. Quick start

3.1 Environment setup

cd flan-t5-npu
pip install transformers torch torch-npu sentencepiece

3.2 Downloading the weights

Option 1: China mirror (recommended)

export HF_ENDPOINT=https://hf-mirror.com
python3 -c "
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-base')
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-base', legacy=False)
model.save_pretrained('./flan-t5-base')
tokenizer.save_pretrained('./flan-t5-base')
"

Option 2: ModelScope

pip install modelscope
python3 -c "
from modelscope import snapshot_download
snapshot_download('AI-ModelScope/flan-t5-base', local_dir='./flan-t5-base')
"
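
Either option should leave a ./flan-t5-base directory that `from_pretrained` can load. The sketch below is a quick sanity check of that directory. It assumes a loadable directory contains `config.json` plus at least one `*.safetensors` or `*.bin` weight file; the exact file set depends on the transformers or ModelScope version, and the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return a list of problems found in model_dir (an empty list means it looks loadable)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json not found")
    # Assumption: weights are stored as safetensors or as a PyTorch .bin file.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight file found")
    return problems


print(missing_model_files("./flan-t5-base") or "model directory looks complete")
```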

3.3 NPU inference

# NPU inference
python inference.py \
  --model_path ./flan-t5-base \
  --prompt "Translate English to German: The house is wonderful." \
  --device npu

# CPU inference (for comparison)
python inference.py \
  --model_path ./flan-t5-base \
  --prompt "Translate English to German: The house is wonderful." \
  --device cpu
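
The source of inference.py is not reproduced in this document. As an illustration only, the three flags used in the commands above could be parsed as follows. The function name `build_parser`, the help strings, the default device, and the `cuda` choice (taken from the device list in Section 6.3) are assumptions.

```python
import argparse


def build_parser():
    """Parser for the three flags that appear in the commands above."""
    parser = argparse.ArgumentParser(description="Flan-T5 inference (illustrative sketch)")
    parser.add_argument("--model_path", required=True, help="local directory holding the weights")
    parser.add_argument("--prompt", required=True, help="input text for the model")
    # Assumption: the real script may accept other values or a different default.
    parser.add_argument("--device", choices=["npu", "cuda", "cpu"], default="npu")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["--model_path", "./flan-t5-base", "--prompt", "Translate English to German: Hello.", "--device", "cpu"]
)
print(args.model_path, args.device)
```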

3.4 Using the native transformers API

import torch
import torch_npu
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = torch.device("npu:0")
model = T5ForConditionalGeneration.from_pretrained("./flan-t5-base").to(device)
tokenizer = T5Tokenizer.from_pretrained("./flan-t5-base", legacy=False)

input_text = "Translate English to German: The house is wonderful."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids, max_length=50, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

4. Accuracy evaluation

NPU vs. CPU consistency was checked on a randomly initialized T5-base configuration (batch=2, seq_len=32):

Metric	CPU	NPU	Difference
Logits max relative error	—	—	`0.00011%`
Generation token match	—	—	100%

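Conclusion: the maximum relative error between NPU and CPU logits is only 0.00011% and the generated output is identical, which meets the requirement of less than 1% error.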
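
This document does not spell out how the two metrics are computed. The sketch below shows one plausible definition, under the assumption that the relative error is the largest absolute difference divided by the largest absolute reference value, and that token match is the fraction of agreeing positions. Both function names are hypothetical. The functions take flat sequences of numbers; with PyTorch tensors, `tensor.flatten().tolist()` would produce a suitable input.

```python
def max_relative_error(reference, candidate):
    """Largest |candidate - reference| divided by the largest |reference| value."""
    reference = list(reference)
    candidate = list(candidate)
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same number of elements")
    scale = max(abs(r) for r in reference)
    if scale == 0.0:
        raise ValueError("reference is all zeros; relative error is undefined")
    return max(abs(c - r) for r, c in zip(reference, candidate)) / scale


def token_match_rate(reference_ids, candidate_ids):
    """Fraction of positions where the token ids agree; extra tokens count as mismatches."""
    reference_ids = list(reference_ids)
    candidate_ids = list(candidate_ids)
    longest = max(len(reference_ids), len(candidate_ids))
    if longest == 0:
        return 1.0
    matches = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return matches / longest


cpu_logits = [1.0, -2.0, 4.0]
npu_logits = [1.0, -2.0, 4.004]
print(f"max relative error: {max_relative_error(cpu_logits, npu_logits) * 100:.4f}%")
print(f"token match: {token_match_rate([1, 2, 3, 4], [1, 2, 3, 4]) * 100:.0f}%")
```

Normalizing by the largest reference value avoids dividing by near-zero logits, which would otherwise dominate an element-wise relative error.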

5. Performance reference

The figures below were measured on the Flan-T5 base architecture (d_model=768, 12 layers, 12 heads) with batch_size=2 and sequence_length=32.

5.1 Before optimization (CPU baseline)

Stage	Time
Encoder-Decoder Forward	`0.528 s`
Generate (max_length=20)	`2.404 s`

5.2 After optimization (NPU)

Stage	Time	Speedup
Encoder-Decoder Forward	`0.026 s`	19.96x
Generate (max_length=20)	`0.376 s`	6.40x

5.3 Performance summary

Metric	CPU (baseline)	NPU (optimized)	Speedup
Forward latency	0.528 s	0.026 s	19.96x
Generate latency	2.404 s	0.376 s	6.40x
Forward throughput	3.79 it/s	75.6 it/s	19.96x
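
The speedup and throughput columns can be rederived from the latencies. The sketch below does so from the rounded timings in the table, so its results differ slightly from the reported 19.96x and 6.40x, which were presumably computed from unrounded measurements. It also assumes that throughput is batch_size divided by latency, which reproduces the reported CPU figure of 3.79 and suggests that "it/s" counts samples rather than batches.

```python
BATCH_SIZE = 2  # from the benchmark setup in this section

cpu = {"forward": 0.528, "generate": 2.404}  # seconds, as reported
npu = {"forward": 0.026, "generate": 0.376}  # seconds, as reported (rounded)


def speedup(baseline_s, optimized_s):
    """How many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def throughput(batch_size, latency_s):
    """Samples processed per second for one batch of the given size."""
    return batch_size / latency_s


for stage in cpu:
    print(f"{stage}: {speedup(cpu[stage], npu[stage]):.2f}x")
print(f"CPU forward throughput: {throughput(BATCH_SIZE, cpu['forward']):.2f} samples/s")
```

With the rounded inputs this prints 20.31x for the forward pass, 6.39x for generation, and 3.79 samples/s for the CPU forward pass.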

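6. Adaptation changes

Flan-T5 is a standard Transformers model, so the NPU port is minimal and requires no changes to the model source. The only adaptation points are device selection and synchronization:

6.1 Moving the model and inputs to the NPU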

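import torch
import torch_npu  # importing torch_npu registers the "npu" device with PyTorch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = torch.device("npu:0")
model = T5ForConditionalGeneration.from_pretrained("./flan-t5-base").to(device)
tokenizer = T5Tokenizer.from_pretrained("./flan-t5-base", legacy=False)
text = "Translate English to German: The house is wonderful."
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)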

6.2 Synchronized timing on the NPU

with torch.no_grad():
    outputs = model.generate(input_ids, max_length=50)
torch.npu.synchronize()  # wait for the NPU to finish before reading the timer
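
NPU kernels run asynchronously, so a timer read without synchronization may measure only the time needed to enqueue the work. The helper below sketches a complete measurement with warmup. The name `timed` and the default `warmup` and `repeats` values are assumptions, and the synchronization function is passed in so the same helper also works on CPU.

```python
import time


def timed(fn, sync=None, warmup=1, repeats=5):
    """Return the mean wall-clock seconds per call of fn().

    sync, if given, is called before starting and before stopping the timer so
    that queued asynchronous device work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain warmup work before the timer starts
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to finish
    return (time.perf_counter() - start) / repeats


# CPU-only demonstration; no synchronization is needed here.
print(f"{timed(lambda: sum(range(10_000))):.6f} s per call")
```

On the NPU, the call would look like `timed(lambda: model.generate(input_ids, max_length=50), sync=torch.npu.synchronize)`.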

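6.3 Inference script (`inference.py`)

The script wraps the full inference flow:

Automatic device detection (NPU / CUDA / CPU)
Loading from a local path or from a randomly initialized configuration
Greedy and beam search generation
Built-in warmup and accurate timing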
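
The device detection step could look like the sketch below. It is not the code of inference.py, which this document does not reproduce. The function names and the NPU > CUDA > CPU priority are assumptions, and the selection logic is kept in a pure function so it can be checked without any accelerator present.

```python
def pick_device(npu_available, cuda_available):
    """Choose a device string with the assumed priority NPU > CUDA > CPU."""
    if npu_available:
        return "npu:0"
    if cuda_available:
        return "cuda:0"
    return "cpu"


def detect_device():
    """Probe the installed backends and return a device string."""
    try:
        import torch
    except ImportError:
        return "cpu"
    try:
        import torch_npu  # noqa: F401  (registers torch.npu when installed)
    except ImportError:
        pass
    npu = getattr(torch, "npu", None)
    npu_ok = bool(npu is not None and npu.is_available())
    return pick_device(npu_ok, torch.cuda.is_available())


print(pick_device(npu_available=False, cuda_available=False))
```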

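7. Notes

Weight download: the first run has to download the Flan-T5 weights from HuggingFace or ModelScope (about 990MB for the base version). On restricted networks, set HF_ENDPOINT=https://hf-mirror.com to use the China mirror.
Tokenizer compatibility: T5Tokenizer uses SentencePiece by default. In recent transformers versions, passing legacy=False avoids add_tokens-related warnings.
NPU compile mode: for inference with fixed shapes, enabling JIT compilation is recommended: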
```
import torch_npu
torch.npu.set_compile_mode(jit_compile=True)
```
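Memory footprint: Flan-T5 base takes about 2GB of HBM on the NPU. For long-sequence generation, max_length and num_beams can be used to limit memory use.
Relative position encoding warning: the current NPU environment may print a "Cannot create tensor with interal format" warning. It is a torch_npu message about its internal tensor format and does not affect output correctness.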
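
The download size and part of the memory footprint follow from the parameter count. The sketch below assumes about 248 million parameters for Flan-T5 base, an approximate figure that is not stated in this document. At 4 bytes per float32 parameter this gives roughly the 990MB quoted above; the rest of the 2GB HBM figure would then presumably come from activations, the decoding cache, and runtime workspace.

```python
PARAMS = 248_000_000  # assumption: approximate Flan-T5 base parameter count


def weight_bytes(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in bytes."""
    return num_params * bytes_per_param


for dtype, size in {"float32": 4, "float16": 2}.items():
    print(f"{dtype}: {weight_bytes(PARAMS, size) / 1e6:.0f} MB")
```

Under these assumptions the weights take 992 MB in float32 and 496 MB in float16.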

8. Citation

@article{chung2022scaling,
  title={Scaling instruction-finetuned language models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}

@article{raffel2020exploring,
  title={Exploring the limits of transfer learning with a unified text-to-text transformer},
  author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
  journal={The Journal of Machine Learning Research},
  year={2020}
}
