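This document records the adaptation and validation results for Flan-T5 (Fine-tuned Language Net - Text-to-Text Transfer Transformer) on the Huawei Ascend NPU (Ascend 910).

Flan-T5 is a text-to-text generation model from Google. It builds on the T5 architecture and is trained on the FLAN (Fine-tuned LAnguage Net) instruction-tuning datasets, so it can perform a range of NLP tasks (translation, summarization, question answering, classification, and others).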
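
This adaptation work achieved the following goals:

- Accuracy: a maximum relative logits error of 0.00011% between NPU and CPU, with identical generated tokens
- An `inference.py` entry script

Related links:

- https://hf-mirror.com/google/flan-t5-base

| Component | Version |
|---|---|
| transformers | 4.57.6 |
| torch | 2.9.0+cpu |
| torch-npu | 2.9.0.post1+gitee7ba04 |
| torchvision | >=0.11.1 |
| numpy | >=1.21 |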
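
To compare a local environment against the versions listed above, a small check along the following lines may help. It is only a sketch: the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the component names in the table.

```python
from importlib import metadata

# Distribution names taken from the component table (assumed to match PyPI names).
PACKAGES = ["transformers", "torch", "torch-npu", "torchvision", "numpy"]


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:12s} {version or 'not installed'}")
```

A package that is missing is reported as `not installed` rather than raising, so the check can run before the install step.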

- NPU: 2 logical cards (Ascend 910)
- `/opt/atomgit/flan-t5-npu/flan-t5-base`

```bash
cd flan-t5-npu
pip install transformers torch torch-npu sentencepiece
```

Option 1: mirror in mainland China (recommended)

```bash
export HF_ENDPOINT=https://hf-mirror.com
python3 -c "
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-base')
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-base', legacy=False)
model.save_pretrained('./flan-t5-base')
tokenizer.save_pretrained('./flan-t5-base')
"
```

Option 2: ModelScope

```bash
pip install modelscope
python3 -c "
from modelscope import snapshot_download
snapshot_download('AI-ModelScope/flan-t5-base', local_dir='./flan-t5-base')
"
```

Run inference from the command line:

```bash
# NPU inference
python inference.py \
    --model_path ./flan-t5-base \
    --prompt "Translate English to German: The house is wonderful." \
    --device npu

# CPU inference (for comparison)
python inference.py \
    --model_path ./flan-t5-base \
    --prompt "Translate English to German: The house is wonderful." \
    --device cpu
```

Or call the model from Python:

```python
import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = torch.device("npu:0")
model = T5ForConditionalGeneration.from_pretrained("./flan-t5-base").to(device)
tokenizer = T5Tokenizer.from_pretrained("./flan-t5-base", legacy=False)

input_text = "Translate English to German: The house is wonderful."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=50, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

NPU vs. CPU consistency was validated on a randomly initialized T5-base configuration (batch=2, seq_len=32):

| Metric | CPU | NPU | Difference |
|---|---|---|---|
| Logits max relative error | — | — | 0.00011% |
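| Generation token match | — | — | 100% |

Conclusion: the maximum relative error between NPU and CPU logits is only 0.00011% and the generated output is identical, which meets the requirement of an accuracy error below 1%.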
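
The two metrics above can be illustrated with a minimal sketch. The source does not state how the relative error was computed, so the element-wise definition below (absolute difference divided by the magnitude of the CPU reference value) is an assumption, and both function names are hypothetical. Plain Python lists stand in for flattened logits, which could be obtained from a tensor with `logits.flatten().tolist()`.

```python
def max_relative_error(reference, candidate, eps=1e-12):
    """Largest |ref - cand| / max(|ref|, eps) over paired values."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    return max(
        abs(r - c) / max(abs(r), eps) for r, c in zip(reference, candidate)
    )


def token_match_rate(reference_ids, candidate_ids):
    """Fraction of positions with identical token ids.

    Positions missing from the shorter sequence count as mismatches.
    """
    longest = max(len(reference_ids), len(candidate_ids))
    if longest == 0:
        return 1.0
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches / longest


if __name__ == "__main__":
    cpu_logits = [1.0, -2.0, 4.0]       # illustrative values, not measured data
    npu_logits = [1.0, -2.0, 4.000004]  # illustrative values, not measured data
    err = max_relative_error(cpu_logits, npu_logits)
    print(f"max relative error: {err * 100:.5f}%")
    print(f"token match: {token_match_rate([0, 11, 5, 1], [0, 11, 5, 1]):.0%}")
```

The `eps` floor only guards against division by zero for reference values that are exactly zero; a different normalization (for example by the largest reference magnitude) would give different numbers.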
The performance figures below are based on the Flan-T5 base architecture (d_model=768, 12 layers, 12 heads) with batch_size=2 and sequence_length=32.

CPU baseline:

| Stage | Latency |
|---|---|
| Encoder-Decoder Forward | 0.528 s |
| Generate (max_length=20) | 2.404 s |

NPU:

| Stage | Latency | Speedup |
|---|---|---|
| Encoder-Decoder Forward | 0.026 s | 19.96x |
| Generate (max_length=20) | 0.376 s | 6.40x |

CPU vs. NPU comparison:

| Metric | CPU (baseline) | NPU (optimized) | Speedup |
|---|---|---|---|
| Forward latency | 0.528 s | 0.026 s | 19.96x |
| Generate latency | 2.404 s | 0.376 s | 6.40x |
| Forward throughput | 3.79 it/s | 75.6 it/s | 19.96x |
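
As a cross-check of the table arithmetic, the sketch below recomputes speedup and throughput from the listed latencies. Reading the throughput column as batch size divided by latency is an inference from the numbers, not something the source states, and the helper names are hypothetical.

```python
BATCH_SIZE = 2            # from the benchmark setup
CPU_FORWARD_S = 0.528     # table value
NPU_FORWARD_S = 0.026     # table value, shown to three decimals
REPORTED_SPEEDUP = 19.96  # table value


def speedup(baseline_s, optimized_s):
    """Ratio of baseline latency to optimized latency."""
    return baseline_s / optimized_s


def throughput(batch_size, latency_s):
    """Items processed per second for one forward pass."""
    return batch_size / latency_s


# Speedup recomputed from the rounded latencies.
recomputed = speedup(CPU_FORWARD_S, NPU_FORWARD_S)

# NPU latency implied by the reported speedup.
implied_npu_s = CPU_FORWARD_S / REPORTED_SPEEDUP

if __name__ == "__main__":
    print(f"recomputed speedup : {recomputed:.2f}x")
    print(f"implied NPU latency: {implied_npu_s:.4f} s")
    print(f"CPU throughput     : {throughput(BATCH_SIZE, CPU_FORWARD_S):.2f}")
    print(f"NPU throughput     : {throughput(BATCH_SIZE, implied_npu_s):.1f}")
```

The recomputed speedup from the rounded 0.026 s comes out slightly above 19.96x, while the latency implied by 19.96x reproduces the 75.6 throughput figure. This suggests the reported ratios were derived from unrounded timings.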
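
As a standard Transformers-architecture model, Flan-T5 runs on the NPU without any changes to the model source code. The only adaptation points are device selection and synchronization:

```python
import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Device selection: move the model and its inputs to the NPU.
device = torch.device("npu:0")
model = T5ForConditionalGeneration.from_pretrained("./flan-t5-base").to(device)
tokenizer = T5Tokenizer.from_pretrained("./flan-t5-base", legacy=False)
text = "Translate English to German: The house is wonderful."
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

# Inference: disable autograd, then synchronize before reading the clock.
with torch.no_grad():
    outputs = model.generate(input_ids, max_length=50)
torch.npu.synchronize()  # make sure NPU work has finished before timing
```

The entry script (`inference.py`) wraps the complete inference flow.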
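
Since the snippet above synchronizes before timing, a reusable timing helper might look like the sketch below. It is not taken from `inference.py`; the name `timed` and its parameters are hypothetical. The synchronization function is passed in as a callback so the same helper works on CPU (no-op) and on NPU (for example `torch.npu.synchronize`).

```python
import time


def timed(fn, sync=lambda: None, warmup=1, repeats=5):
    """Return the mean wall-clock seconds per call of fn().

    sync() is called right before the clock starts and right before it
    stops, so queued device work is included in the measurement.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # CPU-only demonstration with a trivial workload.
    mean_s = timed(lambda: sum(range(100_000)), repeats=3)
    print(f"mean latency: {mean_s:.6f} s")
```

On the NPU, the call would be along the lines of `timed(lambda: model.generate(input_ids, max_length=20), sync=torch.npu.synchronize)`, assuming `model` and `input_ids` are set up as in the snippet above.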
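
Weight download: the first run needs to download the Flan-T5 weights from HuggingFace or ModelScope (the base version is about 990 MB). On restricted networks, set `HF_ENDPOINT=https://hf-mirror.com` to use the mirror in mainland China.

Tokenizer compatibility: `T5Tokenizer` uses SentencePiece by default. In newer versions of transformers, passing `legacy=False` avoids the `add_tokens`-related warnings.

NPU graph compilation: for inference with fixed shapes, enabling graph compilation is recommended:

```python
import torch
import torch_npu  # registers torch.npu

torch.npu.set_compile_mode(jit_compile=True)
```

Memory footprint: Flan-T5 base occupies about 2 GB of HBM on the NPU. For long-sequence generation, memory use can be limited through `max_length` and `num_beams`.

Relative position encoding warning: the current NPU environment may emit a `Cannot create tensor with interal format` warning. It is an internal-format notice from torch_npu and does not affect output correctness.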
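
The weight size quoted in the notes above can be sanity-checked with simple arithmetic. The parameter count used here (roughly 248 million for the base model) is an approximate figure assumed for illustration and does not come from this document; the helper name is hypothetical.

```python
# Approximate parameter count for the base model (assumption, not from the source).
APPROX_PARAMS = 248_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_size_mb(num_params, bytes_per_param):
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


if __name__ == "__main__":
    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype}: {weight_size_mb(APPROX_PARAMS, nbytes):.0f} MB")
```

Under this assumption the float32 figure lands close to the 990 MB download size. The roughly 2 GB runtime footprint is larger than the raw weights, which is plausible once activations, generation caches, and runtime workspace are counted, though the source does not break the number down.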

```bibtex
@article{chung2022scaling,
  title={Scaling instruction-finetuned language models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}

@article{raffel2020exploring,
  title={Exploring the limits of transfer learning with a unified text-to-text transformer},
  author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
  journal={The Journal of Machine Learning Research},
  year={2020}
}
```