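GPA (General Purpose Audio) is a unified autoregressive Transformer model built on the Qwen3 architecture. It combines three core speech tasks, automatic speech recognition (ASR), text-to-speech (TTS), and voice conversion (VC), in a single model with only 0.31B parameters. This repository provides the version of the model adapted for the Huawei Ascend 910 NPU.
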
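
| Property | Value |
|---|---|
| Parameters | 0.31B |
| Architecture | Qwen3ForCausalLM |
| Layers | 28 (full attention) |
| Hidden size | 512 |
| Intermediate size | 3072 |
| Attention heads | 16 (8 KV heads, GQA) |
| Vocabulary size | 180,445 |
| Max position embeddings | 32,768 |
| Inference precision | float32 / bfloat16 |
| Supported languages | en, zh |
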
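
The 0.31B figure can be sanity-checked against the other rows of the table. The sketch below is a rough count under assumptions the table does not state: a head dimension of 128, tied input and output embeddings, no bias terms, and per-head RMSNorm on queries and keys, as is usual for Qwen3-style decoders. The function name `count_params` is hypothetical.

```python
# Rough parameter count for a Qwen3-style decoder using the table's values.
# ASSUMPTIONS (not stated in the table): head_dim = 128, tied input/output
# embeddings, no bias terms, and per-head RMSNorm on queries and keys.

def count_params(vocab=180_445, hidden=512, inter=3072, layers=28,
                 heads=16, kv_heads=8, head_dim=128, tied=True):
    embed = vocab * hidden                      # token embedding matrix
    attn = hidden * heads * head_dim            # q_proj
    attn += 2 * hidden * kv_heads * head_dim    # k_proj + v_proj (GQA)
    attn += heads * head_dim * hidden           # o_proj
    mlp = 3 * hidden * inter                    # gate, up, and down projections
    norms = 2 * hidden + 2 * head_dim           # two layer norms + q/k norms
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied:
        total += vocab * hidden                 # separate lm_head
    return total

total = count_params()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# 312,625,152 parameters (~0.31B)
```

Under these assumptions the embedding matrix alone accounts for roughly 92M of the parameters, which is why the large vocabulary matters for a model of this size.
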

| Component | Version / Model |
|---|---|
| NPU | Ascend 910 (2 cards) |
| CANN | 8.5.1 |
| torch | 2.9.0 |
| torch_npu | 2.9.0.post1 |
| transformers | 4.57.6 |
| Python | 3.11.14 |

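
Before loading the model in this environment, a script may want to confirm that the NPU backend is actually usable. The following is a minimal sketch, not code from `inference.py`: the helper name `pick_device_name` and the fall-back-to-CPU policy are assumptions made for illustration.

```python
# Pick "npu:0" when torch_npu is importable and reports a device, else "cpu".
# ASSUMPTION: falling back to CPU is acceptable; inference.py may behave differently.

def pick_device_name(index: int = 0) -> str:
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the "npu" backend)
        if torch.npu.is_available():
            return f"npu:{index}"
    except ImportError:
        # torch or torch_npu is not installed in this environment.
        pass
    return "cpu"

print(pick_device_name())
```

The returned string can be passed to `torch.device(...)` in the same way as the literal `"npu:0"`.
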
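- The model weights and computation are moved to the NPU through the `model.to(torch.device("npu:0"))` interface of the `torch_npu` library.
- The model is loaded with `AutoModelForCausalLM.from_pretrained()`, and `trust_remote_code=True` must be set.
- The model directory ships the `modeling_qwen3.py` and `configuration_qwen3.py` files.

The logits produced on the NPU (float32) are compared against a CPU (float32) baseline on 8 diverse prompts. The comparison uses 4 complementary metrics: relative L2 error, computed as `||NPU - CPU||_2 / ||CPU||_2 * 100%`, cosine similarity, top-1 accuracy, and top-5 overlap.

| Metric | Value | Threshold | Status |
|---|---|---|---|
| Relative L2 error | 0.0068% | < 1.0% | PASS |
| Cosine similarity | 0.99998 | > 0.99 | PASS |
| Top-1 accuracy | 100.0% (8/8) | > 99% | PASS |
| Top-5 overlap | 5.0/5 | > 4/5 | PASS |
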
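
The four metrics can be sketched with NumPy as follows. This is an illustration rather than the evaluation code itself: the function name `compare_logits` is hypothetical, the inputs are synthetic stand-ins for real logits, and it assumes one logit vector per prompt (for example, the last token position), which the text above does not specify.

```python
import numpy as np

def compare_logits(npu: np.ndarray, cpu: np.ndarray, k: int = 5) -> dict:
    """Compare two logit vectors for the same token position."""
    npu = np.asarray(npu, dtype=np.float64).ravel()
    cpu = np.asarray(cpu, dtype=np.float64).ravel()
    # Relative L2 error in percent: ||NPU - CPU||_2 / ||CPU||_2 * 100
    rel_l2 = np.linalg.norm(npu - cpu) / np.linalg.norm(cpu) * 100.0
    # Cosine similarity between the two vectors
    cos_sim = npu @ cpu / (np.linalg.norm(npu) * np.linalg.norm(cpu))
    # Top-1: do both devices predict the same next token?
    top1_match = np.argmax(npu) == np.argmax(cpu)
    # Top-k overlap: how many of the k highest-scoring tokens are shared?
    topk_npu = set(np.argsort(npu)[-k:].tolist())
    topk_cpu = set(np.argsort(cpu)[-k:].tolist())
    return {
        "rel_l2_pct": float(rel_l2),
        "cos_sim": float(cos_sim),
        "top1_match": bool(top1_match),
        "topk_overlap": len(topk_npu & topk_cpu),
    }

# Synthetic stand-ins: a "CPU" vector and a slightly perturbed "NPU" vector.
rng = np.random.default_rng(0)
cpu_logits = rng.normal(size=1000)
npu_logits = cpu_logits + rng.normal(scale=1e-5, size=1000)
print(compare_logits(npu_logits, cpu_logits))
```

The metrics are complementary because the first two measure overall numerical drift while the last two measure whether that drift changes the token ranking that decoding depends on.
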

| # | Prompt (truncated) | rel_L2 (%) | cos_sim | top1 | top5 |
|---|---|---|---|---|---|
| 0 | Hello, how are you doing today? | 0.0052 | 1.000095 | OK | 5/5 |
| 1 | The capital of France is Paris. | 0.0009 | 0.999924 | OK | 5/5 |
| 2 | Once upon a time in a land far away, | 0.0059 | 0.999999 | OK | 5/5 |
| 3 | Artificial intelligence is the future of | 0.0334 | 0.999993 | OK | 5/5 |
| 4 | The speed of light in vacuum is approx | 0.0007 | 0.999927 | OK | 5/5 |
| 5 | Machine learning models can be used for | 0.0064 | 0.999933 | OK | 5/5 |
| 6 | In recent years, deep learning has | 0.0012 | 1.000045 | OK | 5/5 |
| 7 | The most important thing in science is | 0.0005 | 0.999941 | OK | 5/5 |
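
Conclusion: PASS. NPU inference is numerically consistent with the CPU baseline. All 8 prompts achieve an exact top-1 token match and full top-5 overlap, and the mean relative L2 error of 0.0068% is far below the 1% threshold.
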
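
As a quick arithmetic check, the summary figure can be recomputed from the per-prompt values. The sketch below copies the `rel_L2 (%)` column from the per-prompt table and assumes the reported 0.0068% is the arithmetic mean of those rows, which the numbers are consistent with.

```python
# Per-prompt relative L2 errors (%) copied from the per-prompt table.
rel_l2 = [0.0052, 0.0009, 0.0059, 0.0334, 0.0007, 0.0064, 0.0012, 0.0005]

REL_L2_THRESHOLD = 1.0  # percent, from the summary table

mean_rel_l2 = sum(rel_l2) / len(rel_l2)   # ASSUMPTION: summary value is the mean
worst_rel_l2 = max(rel_l2)                # strictest reading: worst single prompt

print(f"mean = {mean_rel_l2:.6f}%, worst = {worst_rel_l2:.4f}%")
print("PASS" if worst_rel_l2 < REL_L2_THRESHOLD else "FAIL")
```

Even the worst prompt (0.0334%) is well under the threshold, so the conclusion does not depend on whether the mean or the maximum is used.
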

| Metric | Value |
|---|---|
| Throughput | 26.1 tok/s |
| Mean latency (128 tokens) | 4.904 s |
| Median latency | 4.874 s |
| P99 latency | 5.287 s |
| Min latency | 4.638 s |
| Max latency | 5.314 s |
| NPU memory (allocated) | 618.1 MB |
| NPU memory (reserved) | 680.0 MB |
| Model load time | 1.9 s |

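
The latency rows can be derived from a list of per-run timings, and the throughput is consistent with 128 generated tokens divided by the mean latency. The sketch below is an assumption about how such a summary could be computed, not the benchmark code itself: `summarize` is a hypothetical helper, the sample timings are invented, and the P99 uses the nearest-rank definition, which can differ from an interpolating percentile on small samples.

```python
import statistics

def summarize(latencies_s, new_tokens=128):
    """Summarize per-run generation latencies (in seconds)."""
    ordered = sorted(latencies_s)
    mean = statistics.mean(ordered)
    # Nearest-rank P99: smallest sample with at least 99% of samples at or below it.
    rank = max(1, -(-99 * len(ordered) // 100))  # ceil(0.99 * n)
    return {
        "throughput_tok_s": new_tokens / mean,
        "mean_s": mean,
        "median_s": statistics.median(ordered),
        "p99_s": ordered[rank - 1],
        "min_s": ordered[0],
        "max_s": ordered[-1],
    }

# Hypothetical timings for illustration only (not the measured runs).
print(summarize([4.7, 4.9, 5.1, 4.8, 5.0]))

# Consistency check on the table: 128 tokens / 4.904 s mean latency.
print(round(128 / 4.904, 1))  # 26.1
```
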

```bash
# Install dependencies
pip install modelscope transformers torch torch_npu
```

```bash
# Download the model from ModelScope
modelscope download --model AutoArk/GPA --local_dir ./GPA_model
```

```python
import torch
import torch_npu  # registers the "npu" device type with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "./GPA_model"
device = torch.device("npu:0")

# trust_remote_code=True is required because the model ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)

# Tokenize the prompt, generate up to 128 new tokens, and decode the result.
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

```bash
# Text generation
python inference.py --task text --prompt "Hello!" --max_new_tokens 128

# Accuracy validation (NPU vs CPU)
python inference.py --task accuracy

# Performance benchmark
python inference.py --task benchmark --num_iters 10

# Full evaluation (accuracy + benchmark + demo)
python inference.py --task full --output eval_results.json
```

```
.
├── inference.py              # NPU inference script (text/accuracy/benchmark/memory/full)
├── README.md                 # This documentation
├── results_accuracy.json     # Accuracy evaluation results (JSON)
├── results_benchmark.json    # Performance benchmark results (JSON)
├── results_text.json         # Text generation results (JSON)
└── GPA_model/                # Model files (downloaded from ModelScope)
    ├── config.json
    ├── model.safetensors
    ├── modeling_qwen3.py
    ├── configuration_qwen3.py
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── vocab.json
    ├── merges.txt
    ├── generation_config.json
    ├── chat_template.jinja
    ├── special_tokens_map.json
    ├── added_tokens.json
    ├── BiCodec/
    │   ├── config.json
    │   ├── model.safetensors
    │   └── wav2vec2-large-xlsr-53/
    └── glm-4-voice-tokenizer/
        ├── config.json
        ├── model.safetensors
        └── preprocessor_config.json
```

```bibtex
@misc{gpa2026,
  title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformer},
  author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
  year={2026},
  howpublished={\url{https://github.com/AutoArk/GPA}},
}
```

Apache License 2.0

Adapted for Ascend NPU on May 16, 2026.