google/flan-t5-small on Ascend NPU

1. Introduction

This document records the migration, accuracy evaluation, and performance validation of the instruction-tuned google/flan-t5-small model on Ascend NPU (Ascend 910B3).

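Flan-T5 is Google's instruction-tuned version of T5; Flan stands for Finetuned LAnguage Net. Unlike the original T5, Flan-T5 is fine-tuned on 1,800+ NLP tasks expressed in a unified instruction format, so a single model can handle summarization, translation, question answering, classification, and other tasks, selected by the prompt. This is the small variant (about 80M parameters), the lightest in the Flan-T5 family.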

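Evaluation method: compare the encoder last_hidden_state (the same approach as for other T5-family models; autoregressive decoding is excluded from the accuracy comparison).

2. Validation Environment

Component	Version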
`torch`	`2.8.0`
`torch_npu`	`2.8.0.post4`
`transformers`	`5.8.1`
`CANN`	`8.5.1`

NPU: 8 × Ascend 910B3
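
To confirm that a local environment matches the versions in the table above, a small check along the following lines may help. This is a minimal sketch: the helper names `installed_version` and `check_environment` are hypothetical, the pip distribution names are assumed to match the component names in the table, and CANN is skipped because it is not installed through pip.

```python
from importlib import metadata

# Versions copied from the table above. CANN is not a pip package,
# so it is not checked here (assumption: verify it separately).
EXPECTED = {
    "torch": "2.8.0",
    "torch_npu": "2.8.0.post4",
    "transformers": "5.8.1",
}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_environment(expected=EXPECTED):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Ignore a local build suffix such as "+cpu" when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[package] = (wanted, found, matches)
    return report

if __name__ == "__main__":
    for package, (wanted, found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: expected {wanted}, found {found} [{status}]")
```

A mismatch does not necessarily mean the model will fail to run; it only means the environment differs from the one validated here.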

3. Deployment and Usage

3.1 Environment Setup

conda create -n google_flan-t5-small python=3.11 -y
conda activate google_flan-t5-small

pip install torch==2.8.0 torch_npu==2.8.0.post4 \
    -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers numpy \
    -i https://pypi.tuna.tsinghua.edu.cn/simple

3.2 Running the Inference Script

python inference.py --device npu

Programmatic interface:

from inference import T5Summarizer
model = T5Summarizer(model_path="./google_flan-t5-small", device="npu")
# Flan-T5 switches between tasks through the prompt
result = model.summarize(["summarize: Long text..."])  # summarization
result = model.summarize(["translate English to French: Hello"])  # translation

4. Smoke Test

python inference.py --device npu

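Expected output: text generated from the default prompt, with no runtime errors.

5. Performance Reference

Metric	Value
NPU encoder speedup	1.0× (comparable to CPU for this small model at small batch sizes)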
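
The document does not state how the speedup figure was measured. One plausible way to obtain such a number is sketched below, assuming speedup is defined as CPU latency divided by NPU latency. The helper names `mean_latency` and `speedup` are hypothetical and are not part of inference.py.

```python
import time

def mean_latency(fn, warmup=3, iters=10):
    """Average wall-clock seconds per call of fn(), measured after warm-up calls.

    For an accelerator, fn should block until the device has finished
    (assumption: for example by calling torch.npu.synchronize() at the end),
    otherwise only the launch time is measured.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

def speedup(baseline_seconds, candidate_seconds):
    """Assumed definition: baseline (CPU) latency over candidate (NPU) latency."""
    return baseline_seconds / candidate_seconds
```

With this definition, a value of 1.0× means the NPU and CPU encoder passes take about the same time per call. The warm-up calls keep one-off initialization cost out of the average.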

6. Accuracy Evaluation

6.1 Method

Flatten the encoder last_hidden_state outputs and compute their cosine similarity.
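
The flattened cosine similarity can be sketched in plain Python as follows. This is an illustrative, dependency-free version rather than the evaluation script itself. The function names are hypothetical, and the definition of the error rate as (1 - cosine) × 100% is an assumption, since the document does not define it.

```python
import math

def flatten(values):
    """Recursively flatten nested lists or tuples of numbers into a flat list."""
    flat = []
    for item in values:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(float(item))
    return flat

def flat_cosine_similarity(a, b):
    """Cosine similarity between two nested sequences after flattening."""
    x, y = flatten(a), flatten(b)
    if len(x) != len(y):
        raise ValueError("inputs must have the same number of elements")
    dot = math.fsum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(math.fsum(p * p for p in x))
    norm_y = math.sqrt(math.fsum(q * q for q in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

def error_rate_percent(cosine):
    """Assumed definition of the error rate: (1 - cosine) * 100."""
    return (1.0 - cosine) * 100.0
```

A PyTorch tensor can be passed in after converting it with `tensor.cpu().flatten().tolist()`. For large hidden states a vectorized NumPy or PyTorch implementation would be faster; the arithmetic is the same.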

6.2 Results

Metric	Value
Mean cosine similarity	`1.000000`
Accuracy error rate	`0.0000%`

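Conclusion: the accuracy error rate is 0.0000% and the encoder outputs agree to the reported precision, so the evaluation passes.

7. Migration Notes

7.1 Model Architecture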

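Encoder: T5 encoder (8 layers, hidden size 512, relative position bias)
Decoder: T5 decoder (8 layers, hidden size 512, autoregressive)
Parameters: about 80M (T5 v1.1 small architecture)
Special capability: Flan instruction tuning on 1,800+ NLP tasks, with the task selected by the prompt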

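7.2 Key Adaptation Points

Load the model with T5ForConditionalGeneration.from_pretrained() (Flan-T5 and T5 share the same model class)
Move it to the NPU with model.to("npu:0")
Extract the encoder output with model.encoder(**inputs).last_hidden_state
The adaptation code is identical to that used for other T5 summarization models

7.3 Key Code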

import torch
import torch_npu  # importing torch_npu registers the "npu" device with PyTorch
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the checkpoint and move the weights to the first NPU
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-small"
).to("npu:0")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

text = "translate English to German: How are you?"
inputs = tokenizer(text, return_tensors="pt")
# Keep only the tensors the encoder accepts and move them to the NPU
enc_inputs = {k: v.to("npu:0") for k, v in inputs.items()
              if k in ["input_ids", "attention_mask"]}

# Run the encoder only; gradients are not needed for inference
with torch.no_grad():
    encoder_output = model.encoder(**enc_inputs).last_hidden_state

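8. Notes

Instruction-tuned multitasking: Flan-T5 switches tasks through the prompt format. Common formats: "summarize: {text}" (summarization), "translate English to French: {text}" (translation), "question: {q} context: {c}" (question answering).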
small vs. base vs. large: small (80M) → base (250M) → large (780M) → xl (3B) → xxl (11B). The small variant is fast but has limited accuracy on complex tasks.
Encoder-only evaluation: as with other Seq2Seq models, the accuracy evaluation covers only the encoder.
First NPU inference: warm-up for this 16-layer model (8 encoder + 8 decoder layers) takes about 4-6 seconds.
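Zero-shot capability: Flan instruction tuning gives the model zero-shot ability, so with a suitable prompt it can often handle task formats it has not seen before. This is the main advantage over the original T5.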
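
The prompt formats listed in the first note can be wrapped in small helpers so that task switching stays consistent across call sites. The sketch below is illustrative only; the function names are hypothetical and are not part of inference.py.

```python
def summarize_prompt(text):
    """Build a summarization prompt in the "summarize: {text}" format."""
    return f"summarize: {text}"

def translate_prompt(text, source="English", target="French"):
    """Build a translation prompt, e.g. "translate English to French: {text}"."""
    return f"translate {source} to {target}: {text}"

def qa_prompt(question, context):
    """Build a question-answering prompt: "question: {q} context: {c}"."""
    return f"question: {question} context: {context}"

if __name__ == "__main__":
    print(summarize_prompt("Long text..."))
    print(translate_prompt("Hello"))
    print(qa_prompt("Who tuned the model?", "Flan-T5 was released by Google."))
```

The resulting strings can be passed in a list to the summarize method shown in section 3.2, which selects the task from the prompt prefix.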