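Flan-T5 is a sequence-to-sequence (Seq2Seq) model that Google built on the T5 (Text-to-Text Transfer Transformer) architecture by applying Flan (Fine-tuned Language Net) instruction fine-tuning. It was fine-tuned on more than 1,000 diverse instruction tasks, which gives it strong zero-shot and few-shot generalization.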
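
| Feature | Description |
|---|---|
| Architecture | Encoder-decoder (Transformer) |
| Attention mechanism | Relative position encoding (relative position bias) |
| Normalization | Layer normalization (pre-LN) |
| Activation function | Gated GELU (feed-forward network) |
| Position encoding | Relative position buckets (relative attention buckets) |
| Task format | Unified text-to-text framework |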
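
The relative position buckets listed in the table map a signed token offset to a small bucket index: nearby offsets each get their own bucket, while distant offsets share logarithmically spaced buckets. The following is a minimal scalar sketch of that mapping, assuming the commonly used T5 defaults of 32 buckets and a maximum distance of 128 (neither value is stated in this guide). The function name is illustrative, not a library API.

```python
import math


def relative_position_bucket(rel, bidirectional=True, num_buckets=32, max_distance=128):
    """Map a relative offset (key position minus query position) to a bucket id."""
    bucket = 0
    if bidirectional:
        # Encoder case: half of the buckets serve positive offsets.
        num_buckets //= 2
        if rel > 0:
            bucket += num_buckets
        rel = abs(rel)
    else:
        # Decoder self-attention case: only non-positive offsets are visible.
        rel = -min(rel, 0)
    max_exact = num_buckets // 2
    if rel < max_exact:
        # Small distances get one bucket each.
        return bucket + rel
    # Larger distances are binned on a log scale and capped at the last bucket.
    scaled = math.log(rel / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(scaled * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


for offset in (0, 1, -1, -8, -200, 200):
    print(offset, relative_position_bucket(offset))
```

With these assumed defaults, every offset beyond the maximum distance falls into the last bucket of its direction, which is why the bias table stays small regardless of sequence length.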

| Model | d_model | Encoder layers | Decoder layers | Attention heads | d_ff | Parameters |
|---|---|---|---|---|---|---|
| flan-t5-small | 512 | 8 | 8 | 6 | 1,024 | 77M |
| flan-t5-base | 768 | 12 | 12 | 12 | 2,048 | 248M |
| flan-t5-large | 1,024 | 24 | 24 | 16 | 2,816 | 783M |
| flan-t5-xl | 2,048 | 24 | 24 | 32 | 5,120 | 2.85B |
| flan-t5-xxl | 4,096 | 24 | 24 | 64 | 10,240 | 11.3B |
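
The parameter counts in the table can be approximated from the other columns. The sketch below is a rough estimate under several assumptions that this guide does not state: a padded vocabulary of 32,128 entries, a per-head dimension of 64, 32 relative position buckets, untied input and output embeddings, a gated feed-forward block with three weight matrices, and no bias terms. The helper name is hypothetical.

```python
def t5_param_count(d_model, n_enc, n_dec, n_heads, d_ff,
                   vocab=32128, d_kv=64, buckets=32):
    """Estimate the parameter count of a gated-FFN T5 variant (assumptions above)."""
    inner = n_heads * d_kv
    attn = 4 * d_model * inner        # q, k, v, o projections, no bias
    ffn = 3 * d_model * d_ff          # gated FFN: two input projections + one output
    norm = d_model                    # one scale vector per normalization layer
    enc_layer = attn + ffn + 2 * norm
    dec_layer = 2 * attn + ffn + 3 * norm   # self-attention + cross-attention
    rel_bias = buckets * n_heads      # one relative bias table per stack
    encoder = n_enc * enc_layer + rel_bias + norm
    decoder = n_dec * dec_layer + rel_bias + norm
    embeddings = 2 * vocab * d_model  # input embedding + untied output head
    return embeddings + encoder + decoder


base = t5_param_count(768, 12, 12, 12, 2048)
print(f"flan-t5-base: {base / 1e6:.1f}M")  # flan-t5-base: 247.6M
```

Under these assumptions the estimate for the base row agrees with the 247.6M figure reported in the validation log, which is a useful sanity check when adapting the guide to another model size.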
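
Note: validation so far has been performed only on flan-t5-base. The other model sizes share the same architecture and are expected to work with this guide, but they have not been tested.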

| Component | Version | Notes |
|---|---|---|
| Driver | 25.5.2 | Query the version with npu-smi |
| CANN toolkit | 8.5.1 | Ascend compute architecture |
| Python | 3.11.14 | Recommended version |
| PyTorch | 2.9.0+cpu | Officially adapted version |
| torch_npu | 2.9.0.post1 | Ascend plugin for PyTorch |
| Transformers | 4.57.6 | Hugging Face library |
| ModelScope | 1.35.3 | Model download channel for mainland China |
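
A quick way to compare an environment against the Python package rows of this table is to read the installed versions from package metadata. This is a minimal sketch using only the standard library; the `EXPECTED` dictionary simply copies versions from the table, and the function names are illustrative.

```python
from importlib import metadata

# Versions copied from the table above (Python packages only).
EXPECTED = {
    "torch": "2.9.0",
    "torch_npu": "2.9.0.post1",
    "transformers": "4.57.6",
    "modelscope": "1.35.3",
}


def installed_versions(packages):
    """Return {package: installed version, or None when it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected=EXPECTED):
    """Return (package, expected, installed, status) rows."""
    rows = []
    for name, actual in installed_versions(expected).items():
        want = expected[name]
        if actual is None:
            status = "missing"
        elif actual.split("+")[0] == want:  # ignore local tags such as "+cpu"
            status = "ok"
        else:
            status = "differs"
        rows.append((name, want, actual, status))
    return rows


for row in report():
    print(row)
```

A "differs" status is not necessarily a problem; it only flags that the environment is not the one this guide was validated on.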

| Item | Minimum | Recommended |
|---|---|---|
| NPU | 1x Ascend 310P | 1x Ascend 910 or 910B |
| CPU | ARM64 / x86_64 | ARM64 (Kunpeng 920) |
| RAM | 16 GB | 32 GB+ |
| HBM | 8 GB | 32 GB+ (910 series) |
| Disk | 50 GB | 100 GB+ |
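
| Feature | Status | Notes |
|---|---|---|
| Model loading | ✅ Verified | T5ForConditionalGeneration |
| FP32 inference | ✅ Verified | No precision loss |
| FP16 inference | ✅ Verified | Recommended; noticeably faster |
| Text generation (greedy) | ✅ Verified | model.generate() |
| Text generation (beam search) | ✅ Verified | num_beams=4 |
| Training (fine-tuning) | ✅ Verified | Forward + backward + optimizer |
| Multi-card training | ⚠️ Not yet verified | Supported in principle via torch.distributed |
| Quantized inference | ⚠️ Not yet verified | torch.ao.quantization |
| vLLM deployment | ❌ Not supported | vLLM currently supports only decoder-only models |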

Architecture note: Flan-T5 is an encoder-decoder model, so the vLLM inference framework does not apply. Use PyTorch + torch_npu directly for inference and training.
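
Since the recommended path is plain PyTorch + torch_npu, scripts often need to decide at startup whether an NPU is usable. The helper below is a small sketch of one way to do that, falling back to the CPU when either package is missing. The function name is hypothetical, and the fallback policy is an assumption rather than part of this project.

```python
import importlib.util


def pick_device(index: int = 0) -> str:
    """Return 'npu:<index>' when torch_npu reports a usable NPU, else 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    import torch
    import torch_npu  # noqa: F401  (importing it makes torch.npu available)
    return f"npu:{index}" if torch.npu.is_available() else "cpu"


print(pick_device())
```

The returned string can be passed to `.to(...)` in place of a hard-coded device name.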

    # Check NPU status
    npu-smi info

    # Example of the expected output
    +------------------------------------------------------------------------------------------------+
    | npu-smi 25.5.2 Version: 25.5.2 |
    +---------------------------+---------------+----------------------------------------------------+
    | NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
    | Chip Phy-ID | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
    +===========================+===============+====================================================+
    | 3 Ascend910 | OK | 163.2 47 0 / 0 |
    | 0 6 | 0000:0A:00.0 | 0 0 / 0 3107 / 65536 |
    +------------------------------------------------------------------------------------------------+

    # Adjust to your actual installation path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    # Key environment variables
    export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
    export LD_LIBRARY_PATH=${ASCEND_HOME_PATH}/lib64:${LD_LIBRARY_PATH}
    export PATH=${ASCEND_HOME_PATH}/bin:${PATH}

    # Create a virtual environment (recommended)
    python -m venv flan-t5-npu-env
    source flan-t5-npu-env/bin/activate

    # Install PyTorch and torch_npu (skip if already installed)
    pip install torch==2.9.0
    pip install torch_npu==2.9.0.post1

    # Install Transformers and companion libraries
    # (the specifier is quoted so the shell does not treat ">" as a redirect)
    pip install "transformers>=4.23.1"
    pip install sentencepiece accelerate

    # Install ModelScope, the download tool for mainland China
    pip install modelscope

    # Verify the installation
    python -c "
    import torch
    import torch_npu
    print(f'PyTorch: {torch.__version__}')
    print(f'torch_npu: {torch_npu.__version__}')
    print(f'NPU available: {torch.npu.is_available()}')
    print(f'Device count: {torch.npu.device_count()}')
    "

    # Download the model with ModelScope
    python -c "
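    from modelscope import snapshot_download
    model_dir = snapshot_download('google/flan-t5-base', cache_dir='./models')
    print(f'Downloaded to: {model_dir}')
    "

Supported model IDs: google/flan-t5-small, google/flan-t5-base, google/flan-t5-large, google/flan-t5-xl, google/flan-t5-xxl

    # AtomGit mirror (if available)
    git clone https://atomgit.com/google/flan-t5.git

    # Or download from Hugging Face and save a local copy.
    # The tokenizer is saved too, and the path matches MODEL_PATH in the
    # inference and training examples, so local_files_only loading works.
    python -c "
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    name, out = 'google/flan-t5-base', './models/google/flan-t5-base'
    T5ForConditionalGeneration.from_pretrained(name).save_pretrained(out)
    T5Tokenizer.from_pretrained(name).save_pretrained(out)
    "

After the download completes, the directory looks like this: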
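
    flan-t5-base/
    ├── config.json               # Model configuration
    ├── pytorch_model.bin         # PyTorch weights (~990 MB)
    ├── model.safetensors         # Weights in SafeTensors format
    ├── tokenizer_config.json     # Tokenizer configuration
    ├── spiece.model              # SentencePiece model
    ├── tokenizer.json            # Fast tokenizer
    └── generation_config.json    # Generation configuration

No conversion is required: the downloaded weights are in native PyTorch format and can be loaded directly on Ascend NPUs.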
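
Before loading with `local_files_only=True`, it can help to confirm that the download is complete. The sketch below checks a model directory against the listing above. Which files count as required is an assumption made for illustration (configuration, SentencePiece model, tokenizer configuration, and at least one weights file), and the function name is hypothetical.

```python
from pathlib import Path

# Assumed minimum for loading the slow tokenizer and the model.
REQUIRED = ("config.json", "spiece.model", "tokenizer_config.json")
# Either weights format is treated as sufficient.
WEIGHTS = ("model.safetensors", "pytorch_model.bin")


def check_model_dir(model_dir):
    """Return a list of problems found in model_dir (empty when it looks complete)."""
    root = Path(model_dir)
    problems = [f"missing {name}" for name in REQUIRED if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHTS):
        problems.append("missing weights (" + " or ".join(WEIGHTS) + ")")
    return problems


for issue in check_model_dir("./models/google/flan-t5-base"):
    print(issue)
```

An empty result only means the expected file names are present; it does not verify checksums or file sizes.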

    import torch
    import torch_npu
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Configuration
    MODEL_PATH = "./models/google/flan-t5-base"
    DEVICE = "npu:0"

    # Load the model and tokenizer
    tokenizer = T5Tokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
    model = T5ForConditionalGeneration.from_pretrained(
        MODEL_PATH,
        local_files_only=True,
        torch_dtype=torch.float16,  # FP16 is recommended for speed
    ).to(DEVICE)
    model.eval()

    # Inference example
    text = "translate English to German: The house is wonderful."
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=4,
            early_stopping=True,
        )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(result)  # Das Haus ist wunderbar.

    # Batch inference
    batch_texts = [
        "translate English to German: Hello world.",
        "summarize: The quick brown fox jumps over the lazy dog.",
        "question: What is the capital of France? context: France is in Europe.",
    ]
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=50)
    results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for r in results:
        print(r)

    # scripts/inference_server.py
    from fastapi import FastAPI
    from pydantic import BaseModel
    import torch
    import torch_npu
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    app = FastAPI()
    MODEL_PATH = "./models/google/flan-t5-base"
    DEVICE = "npu:0"

    # Load once at startup so every request reuses the same model
    tokenizer = T5Tokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
    model = T5ForConditionalGeneration.from_pretrained(
        MODEL_PATH, local_files_only=True, torch_dtype=torch.float16
    ).to(DEVICE).eval()

    class InferRequest(BaseModel):
        text: str
        max_new_tokens: int = 50
        num_beams: int = 4

    @app.post("/generate")
    def generate(req: InferRequest):
        inputs = tokenizer(req.text, return_tensors="pt", max_length=512, truncation=True)
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=req.max_new_tokens,
                num_beams=req.num_beams,
                early_stopping=True,
            )
        return {"result": tokenizer.decode(outputs[0], skip_special_tokens=True)}

    # Launch (from the scripts/ directory): uvicorn inference_server:app --host 0.0.0.0 --port 8000

| Task | Prefix | Example |
|---|---|---|
| Translation | translate English to German: | translate English to German: Hello |
| Summarization | summarize: | summarize: Long article text... |
| Question answering | question: ... context: ... | question: What? context: The cat... |
| NLI | mnli premise: ... hypothesis: ... | MNLI format |
| Sentiment | sentiment: | sentiment: This movie is great! |
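
Because every task is expressed as a text prefix, prompt construction can be centralized in one small helper. The sketch below turns the prefixes from the table into templates; the task keys and the function name are illustrative choices, not identifiers defined by Flan-T5 or Transformers.

```python
# Templates derived from the prefix table; the dictionary keys are hypothetical.
PREFIXES = {
    "translate_en_de": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "qa": "question: {question} context: {context}",
    "nli": "mnli premise: {premise} hypothesis: {hypothesis}",
    "sentiment": "sentiment: {text}",
}


def build_prompt(task, **fields):
    """Fill the template for `task`; raise ValueError for an unknown task."""
    try:
        template = PREFIXES[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PREFIXES)}") from None
    return template.format(**fields)


print(build_prompt("translate_en_de", text="Hello"))
print(build_prompt("qa", question="What?", context="The cat..."))
```

The resulting string is what gets passed to the tokenizer, so the model itself needs no task-specific code.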

The datasets library is used for data loading; install it with pip install datasets.

    import torch
    import torch_npu
    from torch.utils.data import DataLoader, Dataset
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    from torch.optim import AdamW

    DEVICE = "npu:0"
    MODEL_PATH = "./models/google/flan-t5-base"

    # Load the tokenizer, model (FP32 for training), and optimizer
    tokenizer = T5Tokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH, local_files_only=True)
    model = model.to(DEVICE)
    optimizer = AdamW(model.parameters(), lr=5e-5)

    # Custom dataset of (source text, target text) pairs
    class MyDataset(Dataset):
        def __init__(self, data, tokenizer):
            self.data = data
            self.tokenizer = tokenizer

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            src, tgt = self.data[idx]
            inputs = self.tokenizer(src, max_length=128, truncation=True, padding="max_length", return_tensors="pt")
            labels = self.tokenizer(tgt, max_length=64, truncation=True, padding="max_length", return_tensors="pt")
            return {
                "input_ids": inputs["input_ids"].squeeze(0),
                "attention_mask": inputs["attention_mask"].squeeze(0),
                "labels": labels["input_ids"].squeeze(0),
            }

    # Placeholder training pairs for illustration; replace with your own data
    train_pairs = [
        ("translate English to German: Hello world.", "Hallo Welt."),
    ]
    train_dataset = MyDataset(train_pairs, tokenizer)
    dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

    # Training loop
    model.train()
    for epoch in range(3):
        for batch in dataloader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            labels = batch.pop("labels")
            labels[labels == tokenizer.pad_token_id] = -100  # exclude padding from the loss
            optimizer.zero_grad()
            outputs = model(**batch, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch+1} | Loss: {loss.item():.4f}")  # loss of the last batch

    # Save
    model.save_pretrained("./flan-t5-finetuned-npu")
    tokenizer.save_pretrained("./flan-t5-finetuned-npu")

    # Alternative: the Transformers Seq2SeqTrainer
    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
    # train_dataset is assumed to be a Dataset that yields input_ids,
    # attention_mask, and labels (for example, a MyDataset instance).
    args = Seq2SeqTrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
        logging_steps=10,
        save_steps=500,
        fp16=True,  # enable FP16 mixed precision
        report_to="none",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

| Item | Configuration |
|---|---|
| Hardware | Atlas 800 A2 (2x Ascend 910) |
| NPU driver | 25.5.2 |
| CANN | 8.5.1 |
| PyTorch | 2.9.0+cpu |
| torch_npu | 2.9.0.post1 |
| Transformers | 4.23.1 |
| Model | google/flan-t5-base (247.6M parameters) |
| Precision | FP16 (inference) / FP32 (training) |

| Batch size | Average latency (ms) | Throughput (samples/s) | Device memory (MB) |
|---|---|---|---|
| 1 | 85.99 | 11.63 | ~3,200 |
| 2 | 86.92 | 23.01 | ~3,400 |
| 4 | 86.68 | 46.15 | ~3,600 |
| 8 | 86.29 | 92.71 | ~4,000 |

Test conditions: max_new_tokens=50, num_beams=4, seq_len=128
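
Latency and throughput figures of this kind are typically produced by a warmup phase followed by timed iterations. The harness below is a minimal sketch of that pattern, not the project's benchmark script; the function name, the default iteration counts, and the CPU stand-in workload are all assumptions. When the timed callable launches NPU work, it should finish with a device synchronization (assumed here to be `torch.npu.synchronize()`) so that the clock measures completed work.

```python
import time


def benchmark(fn, batch_size, warmup=3, iters=10):
    """Return (average latency in ms, throughput in samples/s) for fn()."""
    for _ in range(warmup):
        fn()  # warmup runs are not timed
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    per_call = (time.perf_counter() - start) / iters
    latency_ms = per_call * 1000.0
    throughput = batch_size / per_call
    return latency_ms, throughput


# Stand-in workload so the sketch runs anywhere; replace with a generate() call.
latency_ms, throughput = benchmark(lambda: sum(range(100_000)), batch_size=4)
print(f"{latency_ms:.2f} ms, {throughput:.2f} samples/s")
```

Throughput here is simply batch size divided by per-call latency, which is the same relationship the table rows follow.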

| Batch size | Step time (ms) | Throughput (samples/s) | Device memory (MB) |
|---|---|---|---|
| 1 | 101.26 | 9.88 | ~4,500 |
| 2 | 101.70 | 19.67 | ~5,200 |
| 4 | 103.59 | 38.61 | ~6,800 |

Test conditions: seq_len=128, target_len=64, 3 epochs, FP32
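
The throughput column follows from the step time: samples per second equals batch size divided by the step time in seconds. The short check below recomputes it from the training rows above; the helper name is illustrative.

```python
def samples_per_second(batch_size, step_ms):
    """Convert a per-step time in milliseconds to samples per second."""
    return batch_size * 1000.0 / step_ms


# (batch size, step time in ms, reported throughput) copied from the table
training_rows = [(1, 101.26, 9.88), (2, 101.70, 19.67), (4, 103.59, 38.61)]

for batch, step_ms, reported in training_rows:
    derived = samples_per_second(batch, step_ms)
    print(f"batch={batch}: derived={derived:.2f}, reported={reported}")
```

The step time is almost flat from batch size 1 to 4, so throughput scales nearly linearly. This suggests that, at these small batch sizes, fixed per-step overhead rather than compute dominates.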
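
| Hardware | Inference latency (batch size 1) | Training throughput (batch size 4) |
|---|---|---|
| Ascend 910 | 86.0 ms | 38.6 samples/s |
| NVIDIA A100 40GB | ~70 ms | ~45 samples/s |
| NVIDIA V100 32GB | ~110 ms | ~28 samples/s |

The GPU figures are theoretical reference values; actual results may differ depending on the driver, CUDA, and framework versions.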

    ============================================================
    Flan-T5 NPU Inference Test
    ============================================================
    [1] NPU Available: True
    [1] NPU Count: 2
    [1] Using device: npu:0
    [2] Loading tokenizer...
    [2] Tokenizer vocab size: 32100
    [3] Loading model to NPU...
    [3] Model loaded in 1.56s
    [3] Model device: npu:0
    [3] Total parameters: 247.6M
    [4] Running inference tests...
    Test 1: translate English to German: The house is wonderful....
    Output: Das Haus ist wunderbar.
    Time: 0.541s
    Test 2: summarize: The quick brown fox jumps over the lazy dog. This...
    Output: The quick brown fox jumps over the lazy dog.
    Time: 0.281s
    Test 3: question: What is the capital of France? context: France is ...
    Output: Paris
    Time: 0.099s
    [5] Warmup + throughput test (10 iterations)...
    [5] Avg latency: 87.88 ms
    [5] Throughput: 11.38 infer/s
    ============================================================
    Inference test PASSED
    ============================================================

    ============================================================
    Flan-T5 NPU Training Test
    ============================================================
    [1] Device: npu:0
    [2] Loading model and tokenizer...
    [2] Model on npu:0
    [3] Preparing dataset...
    [3] Dataset size: 32, Batches: 8
    [4] Running 3 training epochs...
    Epoch 1 Batch 1: loss=38.7053
    Epoch 1: avg_loss=29.6944, time=1.86s
    Epoch 2 Batch 1: loss=22.1720
    Epoch 2: avg_loss=18.9105, time=0.97s
    Epoch 3 Batch 1: loss=13.8752
    Epoch 3: avg_loss=10.6407, time=0.97s
    [5] Training summary:
    Epoch 1: loss=29.6944, time=1.86s
    Epoch 2: loss=18.9105, time=0.97s
    Epoch 3: loss=10.6407, time=0.97s
    [5] Average epoch time: 1.27s
    [5] Samples/sec: 25.24
    [6] Checkpoint saved: /opt/atomgit/flan-t5-npu-checkpoint.pt
    ============================================================
    Training test PASSED
    ============================================================

| Test item | Result | Notes |
|---|---|---|
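| Model loading | ✅ Passed | Loaded in 1.56 s |
| FP16 inference | ✅ Passed | Translation, summarization, and QA outputs all correct |
| Beam search generation | ✅ Passed | num_beams=4 works |
| Forward pass | ✅ Passed | Loss computed correctly |
| Backward pass | ✅ Passed | Gradients propagate normally |
| Optimizer update | ✅ Passed | AdamW parameter updates work |
| Checkpoint saving | ✅ Passed | .pt file saves and loads correctly |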
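
Q: Why is vLLM not supported?

A: vLLM currently supports only decoder-only architectures natively (for example GPT, LLaMA, and Qwen). Flan-T5 is an encoder-decoder model (loosely, a BERT-style encoder paired with a GPT-style decoder), and its KV-cache management and continuous batching differ substantially from the decoder-only case, so vLLM is not compatible for now. Use the native PyTorch + torch_npu path instead.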
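
Q: What does the relative_position_if_large warning at runtime mean?

A: It is an informational message from torch_npu's internal format conversion and does not affect functionality or accuracy. The full message is:

    UserWarning: Cannot create tensor with interal format while allow_internel_format=False,
    tensor will be created with base format.

Setting export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True tunes the memory allocation strategy.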
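
If the message clutters the logs, it can be filtered with the standard `warnings` module, assuming it is delivered as a Python `UserWarning` as the quoted text indicates. This is a sketch rather than an official torch_npu recommendation, and the function name is hypothetical. The filter matches the start of the message literally, so the spelling in the pattern is copied from the message as quoted.

```python
import warnings

# Beginning of the message exactly as quoted above (spelling kept verbatim).
NPU_FORMAT_WARNING = "Cannot create tensor with interal format"


def silence_internal_format_warning() -> None:
    """Ignore only the internal-format UserWarning; other warnings still show."""
    warnings.filterwarnings("ignore", message=NPU_FORMAT_WARNING, category=UserWarning)


# Call this before loading the model.
silence_internal_format_warning()
```

Filtering by message keeps unrelated `UserWarning`s visible, which is safer than silencing the whole category.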
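
Q: How do I run multi-card training?

A: Launch with torch.distributed.launch or torchrun:

    torchrun --nproc_per_node=2 --master_port=29500 train.py

and initialize the process group in the code:

    import os
    import torch_npu
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch_npu.npu.set_device(local_rank)
    dist.init_process_group(backend="hccl")  # Ascend uses the HCCL communication library

Q: What can I do about out-of-memory errors?

A:

- Reduce batch_size
- Reduce max_length / max_new_tokens
- Use gradient accumulation (gradient_accumulation_steps)
- Use torch_dtype=torch.float16 for inference
- Enable enable_model_cpu_offload() (single-card inference fallback only)

Q: Model downloads are slow. What should I do?

A: In mainland China, ModelScope is strongly recommended:

    from modelscope import snapshot_download
    snapshot_download('google/flan-t5-base', cache_dir='./models')

Download speeds typically reach 10-50 MB/s, and no proxy is needed.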
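
Q: Is flan-t5-xl supported?

A: Yes, in principle (only flan-t5-base has been validated). The XL model has the same architecture as base and differs only in layer count and dimensions. Make sure a single card has at least 16 GB of HBM, or use model parallelism / DeepSpeed ZeRO-3.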
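
A back-of-the-envelope estimate helps decide whether a model size fits on one card. The sketch below counts only weight and optimizer memory, assuming 2 bytes per parameter for FP16 inference and 16 bytes per parameter for full FP32 fine-tuning with AdamW (weights, gradients, and two moment buffers). Activations, the KV cache, and framework overhead are ignored, so real usage is higher. Function names are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_gib(n_params, dtype="fp16"):
    """Memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def adamw_fp32_training_gib(n_params):
    """Weights + gradients + two AdamW moment buffers, all FP32 (16 bytes/param)."""
    return n_params * 16 / 2**30


xl_params = 2.85e9  # flan-t5-xl, from the model table
print(f"FP16 weights:        {weight_gib(xl_params, 'fp16'):.1f} GiB")
print(f"FP32 weights:        {weight_gib(xl_params, 'fp32'):.1f} GiB")
print(f"FP32 AdamW training: {adamw_fp32_training_gib(xl_params):.1f} GiB")
```

By this estimate, FP16 inference for XL fits within 16 GB, while full FP32 fine-tuning does not fit on a single card of that size, which is consistent with the suggestion to use model parallelism or ZeRO-3.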

Q: How do I monitor NPU usage?

A:

    # Check NPU status
    npu-smi info
    # Live monitoring
    watch -n 1 npu-smi info

Project layout:

    Flan-T5/
    ├── README.md                 # This file
    ├── scripts/
    │   ├── inference_npu.py      # Inference validation script
    │   ├── training_npu.py       # Training validation script
    │   └── benchmark_npu.py      # Performance benchmark
    ├── docs/
    │   └── (reserved for documentation)
    └── models/                   # Model weights (after download)
        └── google/
            └── flan-t5-base/
                ├── config.json
                ├── pytorch_model.bin
                └── ...

The Flan-T5 model weights are licensed under Apache License 2.0:

    Copyright 2023 Google LLC
    Licensed under the Apache License, Version 2.0

The adaptation scripts, test code, and documentation in this project (Ascend-SACT/Flan-T5) are released under Apache License 2.0.

    Copyright 2026 Ascend-SACT Contributors
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0

If you run into problems during adaptation, feedback is welcome.

Powered by Ascend NPU