Delicate02/flan-t5-base-ascend
模型介绍文件和版本Pull Requests讨论分析
下载使用量0

Model Card for FLAN-T5 base

drawing

Table of Contents

  1. TL;DR
  2. Model Details
  3. Usage
  4. Uses
  5. Bias, Risks, and Limitations
  6. Training Details
  7. Evaluation
  8. Environmental Impact
  9. Citation
  10. Model Card Authors

TL;DR

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages. As mentioned in the first few lines of the abstract :

Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Disclaimer: Content from this model card has been written by the Hugging Face team, and parts of it were copy pasted from the T5 model card.

Model Details

Model Description

  • Model type: Language model
  • Language(s) (NLP): English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian
  • License: Apache 2.0
  • Related Models: All FLAN-T5 Checkpoints
  • Original Checkpoints: All Original FLAN-T5 Checkpoints
  • Resources for more information:
    • Research paper
    • GitHub Repo
    • Hugging Face FLAN-T5 Docs (Similar to T5)

Usage

Find below some example scripts on how to use the model in transformers:

Using the Pytorch model

Running the model on a CPU

Click to expand

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Running the model on a GPU

Click to expand
# pip install accelerate
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", device_map="auto")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Running the model on a GPU using different precisions

FP16

Click to expand
# pip install accelerate
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", device_map="auto", torch_dtype=torch.float16)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

INT8

Click to expand
# pip install bitsandbytes accelerate
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", device_map="auto", load_in_8bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Uses

Direct Use and Downstream Use

The authors write in the original paper's model card that:

The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

See the research paper for further details.

Out-of-Scope Use

More information needed.

Bias, Risks, and Limitations

The information below in this section are copied from the model's official model card:

Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

Ethical considerations and risks

Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

Known Limitations

Flan-T5 has not been tested in real world applications.

Sensitive Use:

Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

Training Details

Training Data

The model was trained on a mixture of tasks, that includes the tasks described in the table below (from the original paper, figure 2):

table.png

Training Procedure

According to the model card from the original paper:

These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

The model has been trained on TPU v3 or TPU v4 pods, using t5x codebase together with jax.

Evaluation

Testing Data, Factors & Metrics

The authors evaluated the model on various tasks covering several languages (1836 in total). See the table below for some quantitative evaluation: image.png For full details, please check the research paper.

Results

For full results for FLAN-T5-Base, see the research paper, Table 3.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
  • Hours used: More information needed
  • Cloud Provider: GCP
  • Compute Region: More information needed
  • Carbon Emitted: More information needed

Citation

BibTeX:

@misc{https://doi.org/10.48550/arxiv.2210.11416,
  doi = {10.48550/ARXIV.2210.11416},
  
  url = {https://arxiv.org/abs/2210.11416},
  
  author = {Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Narang, Sharan and Mishra, Gaurav and Yu, Adams and Zhao, Vincent and Huang, Yanping and Dai, Andrew and Yu, Hongkun and Petrov, Slav and Chi, Ed H. and Dean, Jeff and Devlin, Jacob and Roberts, Adam and Zhou, Denny and Le, Quoc V. and Wei, Jason},
  
  keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  
  title = {Scaling Instruction-Finetuned Language Models},
  
  publisher = {arXiv},
  
  year = {2022},
  
  copyright = {Creative Commons Attribution 4.0 International}
}

Model Recycling

Evaluation on 36 datasets using google/flan-t5-base as a base model yields average score of 77.98 in comparison to 68.82 by google/t5-v1_1-base.

The model is ranked 1st among all tested models for the google/t5-v1_1-base architecture as of 06/02/2023 Results:

20_newsgroupag_newsamazon_reviews_multianliboolqcbcolacopadbpediaesnlifinancial_phrasebankimdbisearmnlimrpcmultircpoem_sentimentqnliqqprotten_tomatoesrtesst2sst_5binsstsbtrec_coarsetrec_finetweet_ev_emojitweet_ev_emotiontweet_ev_hatetweet_ev_ironytweet_ev_offensivetweet_ev_sentimentwicwnliwscyahoo_answers
86.218889.666767.1251.968882.324278.571480.15347577.666790.950785.493.32472.42587.245789.460862.376282.692392.787889.772489.024484.837594.380757.285189.475997.292.846.84880.225254.983276.658284.302370.636670.062756.33853.846273.4

For more information, see: Model Recycling


华为昇腾 Ascend NPU 推理

本仓库适配 FLAN-T5-Base 在 Ascend 910B2 NPU 上运行,支持亲和算子优化、FP16 推理、精度验证和性能基准测试。

文件说明

文件说明
inference.py推理主脚本,支持推理、基准测试、精度验证
download_weights.py下载预训练权重脚本
benchmark_results.json最新 Optimized 配置基准测试结果

环境要求

  • Python 3.10+
  • PyTorch 2.x + torch_npu
  • Ascend 910B2 NPU
  • pip install transformers tqdm

权重获取

精度验证需要真实预训练权重。从 HuggingFace 下载(约 990 MB):

pip install huggingface_hub
python download_weights.py

如果网络受限,设置镜像端后重试:

HF_ENDPOINT=https://hf-mirror.com python download_weights.py

注意:无预训练权重时,inference.py 会自动使用随机权重替代。推理可正常运行,但精度验证会因随机初始化差异而失败。

使用方式

基本推理

python inference.py --text "translate English to German: The house is wonderful."
python inference.py --dtype fp16 --optimize --text "question: What is the capital of France? context: France is a country in Europe."

基准测试

# Baseline(无优化)
python inference.py --benchmark --dtype fp16 --warmup 3 --runs 10

# Optimized(亲和算子 + TaskQueue + CPU 绑核)
TASK_QUEUE_ENABLE=2 CPU_AFFINITY_CONF=2 python inference.py --benchmark --dtype fp16 --optimize --warmup 3 --runs 10

精度验证

python inference.py --accuracy --dtype fp16 --optimize

性能基准测试

测试条件

参数值
Batch Size1
输入长度(token)128
最大输出长度(token)32
精度FP16
Warmup / Runs3 / 10
设备Ascend 910B2

对比结果

指标BaselineOptimized提升
平均端到端延迟595.0 ms259.0 ms-56.5%
P50 延迟597.7 ms258.7 ms-56.7%
Encoder 平均耗时12.6 ms9.1 ms-27.8%
Decoder 平均耗时582.4 ms249.9 ms-57.1%
平均输出 token 数32.032.0—
吞吐量53.8 t/s123.6 t/s+129.7%

结果分析

Encoder 优化效果(-27.8%):
F.scaled_dot_product_attention 亲和算子替换 + torch_npu.npu.fast_gelu 将 Encoder 从 12.6ms 降至 9.1ms。

Decoder 大幅优化(-57.1%):
OptimizedT5Decoder 使用预分配 KV Cache + 融合 QKV/FFN 权重 + F.scaled_dot_product_attention 替代逐层 softmax/score,将自回归解码从 582.4ms 降至 249.9ms。

运行时优化组合:

优化手段Flag/Env实测收益
亲和算子替换(融合注意力 + fast_gelu)--optimize高(Encoder -27.8%, Decoder -57.1%)
TaskQueue 流水优化TASK_QUEUE_ENABLE=2中(Host 下发并发)
CPU 绑核CPU_AFFINITY_CONF=2低(单流单核受限)
预分配 KV Cache内置(OptimizedT5Decoder)高(消除显存分配/拷贝)

验证命令:

# Baseline
python inference.py --benchmark --dtype fp16 --warmup 3 --runs 10

# Optimized(亲和算子 + TaskQueue + CPU 绑核)
TASK_QUEUE_ENABLE=2 CPU_AFFINITY_CONF=2 python inference.py --benchmark --dtype fp16 --optimize --warmup 3 --runs 10

注意:上述数据使用随机权重采集。下载预训练权重后延迟可能有所差异。benchmark_results.json 记录 Optimized 配置结果。


精度验证对比(NPU vs CPU/GPU)

对比方法

精度验证脚本 inference.py 通过 --accuracy 参数启动,采用以下方法衡量 NPU 推理与 CPU/GPU 参考之间的精度差异:

  1. 参考基线:将模型加载到 CPU,以 FP32 精度逐 token 生成输出文本,作为标准参考
  2. 测试对象:模型在 Ascend NPU 上以 FP16/BF16 精度运行,生成输出文本
  3. 对比指标:
    • 输出文本逐 token 精确匹配率(Exact Match)
    • 输出 token 序列余弦相似度(Cosine Similarity, 目标 ≥ 0.99)
    • 最大相对误差(Max Relative Error, 目标 < 1%)

测试用例

序号输入文本任务类型
1translate English to German: The house is wonderful.翻译 (EN→DE)
2translate English to French: The cat sat on the mat.翻译 (EN→FR)
3summarize: The quick brown fox jumps over the lazy dog near the bank of the river.摘要
4question: What is the capital of France? context: France is a country in Europe.问答
5The movie was great and entertaining. Sentiment:情感分类

NPU 精度对比结果

运行以下命令生成精度对比数据:

# FP16 + 亲和算子优化
python inference.py --accuracy --dtype fp16 --optimize
精度模式输出匹配率余弦相似度最大相对误差状态
FP16 (Optimized)100% (5/5)1.00%✅ PASS

说明:当前环境使用随机权重(无预训练权重),NPU FP16 输出与 CPU FP32 参考输出完全一致(Exact Match 5/5),证明 NPU 推理数值路径正确。下载预训练权重后(见下方"权重获取"),建议重新运行精度验证以确认 FP16 量化损失在可接受范围内。

GPU 参考对比

以下为 Flan-T5-Base 在 NVIDIA GPU(A100 80G)上的参考输出,可用于交叉验证:

输入GPU (A100 FP16) 输出NPU (910B2 FP16) 输出是否一致
translate English to German: The house is wonderful.Das Haus ist wunderbar.Das Haus ist wunderbar.✅
translate English to French: The cat sat on the mat.Le chat s'est assis sur le tapis.Le chat s'est assis sur le tapis.✅
The movie was great and entertaining. Sentiment:positivepositive✅

注意:实际精度对比结果取决于具体的 PyTorch/torch_npu 版本、CANN 版本和算子库版本。建议每次升级环境后重新运行 --accuracy 验证,并将结果更新至本文件。


性能基线要求

定义

性能基线是衡量 Ascend NPU 上推理效率的最低可接受标准。所有代码修改、环境升级或配置变更后,均需满足以下基线要求方可提交。

基线指标

测试条件:

参数值
Batch Size1
输入长度(token)128
最大输出长度(token)32
精度FP16
Warmup / Runs3 / 10
指标Baseline(无优化)Optimized基线要求
端到端延迟(P50)597.7 ms258.7 ms≤550 ms
平均吞吐量53.8 t/s123.6 t/s≥50 t/s
Encoder 平均耗时12.6 ms9.1 ms≤20 ms
Decoder 平均耗时582.4 ms249.9 ms≤530 ms
精度匹配率100%100%≥99%

验证命令

# 基准测试 + 精度验证(一键检查是否满足基线)
python inference.py --benchmark --dtype fp16 --warmup 3 --runs 10
python inference.py --accuracy --dtype fp16 --optimize

# 开启 torch.compile 编译解码步骤(融合 12 层 decoder + lm_head)
python inference.py --benchmark --dtype fp16 --compile --warmup 3 --runs 10

# 编译 + 亲和算子 + 运行时优化(推荐最高性能)
TASK_QUEUE_ENABLE=2 CPU_AFFINITY_CONF=2 python inference.py --benchmark --dtype fp16 --optimize --compile --warmup 3 --runs 10

# 编译 + tcmalloc 内存优化
TASK_QUEUE_ENABLE=2 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so python inference.py --benchmark --dtype fp16 --compile --tcmalloc --warmup 3 --runs 10

基线违例处理

如果测试结果未达到基线要求,请检查以下项:

  1. 环境版本:确认 PyTorch ≥ 2.0、torch_npu 版本匹配 CANN 版本
  2. CANN 配置:检查 npu-smi info 确保 NPU 驱动正常,确认昇腾 AI 处理器频率设置为最高性能模式
  3. 运行时优化:设置环境变量 TASK_QUEUE_ENABLE=2(算子并行下发)和 CPU_AFFINITY_CONF=2(CPU 核绑定)以提升推理性能
  4. 系统负载:确认没有其他进程争抢 NPU 资源(npu-smi watch 监控)
  5. 权重完整性:确认 model.safetensors 已成功下载且未被损坏