GPA-0.3B-preview (Ascend NPU Adaptation)

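Model Overview

GPA (General Purpose Audio) is a unified autoregressive Transformer model built on the Qwen3 architecture. It combines three core speech tasks, automatic speech recognition (ASR), text-to-speech (TTS), and voice conversion (VC), in a single model with only 0.31B parameters. This repository provides a version of the model adapted for the Huawei Ascend 910 NPU.

Original model: AutoArk/GPA
Architecture: Qwen3ForCausalLM (28 layers, hidden size 512, 16 attention heads, GQA)
Parameters: 0.31B
Supported tasks: ASR (speech-to-text), TTS (text-to-speech), VC (voice conversion)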

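Model Specifications

Property	Value
Parameters	0.31B
Architecture	Qwen3ForCausalLM
Layers	28 (full attention)
Hidden size	512
Intermediate size	3072
Attention heads	16 (8 KV heads, GQA)
Vocabulary size	180,445
Max position embeddings	32,768
Inference precision	float32 / bfloat16
Supported languages	en, zh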
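
As a rough consistency check on the table above, the parameter count can be estimated from the listed dimensions. This is a sketch only: the per-head dimension of 128, the tied input/output embeddings, the absence of bias terms, and the per-head Q/K norms are assumptions based on the usual Qwen3 layout, not values stated in this README. Under those assumptions the estimate lands close to the stated 0.31B.

```python
# Rough parameter estimate for a Qwen3-style decoder from the table values.
# ASSUMPTIONS (not stated in this README): head_dim = 128, tied input/output
# embeddings, no bias terms, RMSNorm weights on Q and K per layer.

HIDDEN = 512
LAYERS = 28
INTERMEDIATE = 3072
Q_HEADS = 16
KV_HEADS = 8
VOCAB = 180_445
HEAD_DIM = 128  # assumption, see note above


def estimate_params() -> int:
    embed = VOCAB * HIDDEN                        # token embeddings (assumed tied with lm_head)
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM    # K and V projections (GQA)
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    qk_norm = 2 * HEAD_DIM                        # Q-norm and K-norm weights
    mlp = 3 * HIDDEN * INTERMEDIATE               # gate, up, and down projections
    layer_norms = 2 * HIDDEN                      # pre-attention and pre-MLP norms
    per_layer = q_proj + kv_proj + o_proj + qk_norm + mlp + layer_norms
    return embed + LAYERS * per_layer + HIDDEN    # + final norm


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

If the real config.json uses a different head dimension or untied embeddings, the estimate changes accordingly, so treat the match as a plausibility check rather than a confirmation of the configuration.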

NPU Adaptation

Hardware and Software Environment

Component	Version/Model
NPU	Ascend 910 (2 cards)
CANN	8.5.1
torch	2.9.0
torch_npu	2.9.0.post1
transformers	4.57.6
Python	3.11.14
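
Because torch and torch_npu releases are tightly coupled, it can help to compare an installed environment against the versions in the table before loading the model. The helper below is a minimal sketch that uses only the standard library; the function name check_versions is hypothetical, and stripping a local version suffix such as "+cpu" is a convenience choice, not something this repository prescribes.

```python
from importlib import metadata

# Versions copied from the environment table above.
EXPECTED = {
    "torch": "2.9.0",
    "torch_npu": "2.9.0.post1",
    "transformers": "4.57.6",
}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            # Drop a local build suffix such as "+cpu" before comparing.
            have = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        report[name] = (have, want, have == want)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:<14} installed={have} expected={want} {status}")
```

A mismatch does not necessarily mean the model will fail to run; it only signals that the environment differs from the one used for the results reported here.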

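Adaptation Details

Device migration: importing torch_npu registers the npu device type with PyTorch; model.to(torch.device("npu:0")) then moves the model weights and computation to the NPU
Precision: float32 is used for accuracy validation; bfloat16 is used to optimize inference throughput
HuggingFace compatibility: the model loads through AutoModelForCausalLM.from_pretrained() with trust_remote_code=True
Custom modeling code: GPA uses the custom modeling_qwen3.py and configuration_qwen3.py files shipped with the original checkpoint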
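
The device and precision choices above can be wrapped in two small helpers so that the same script also runs on a machine without an NPU. This is a sketch under assumptions: the helper names pick_device and pick_dtype_name are hypothetical, and the CPU fallback is a convenience that the original adaptation does not describe.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu:0" if torch_npu imports and an NPU is visible, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the "npu" device type)
        return "npu:0" if torch.npu.is_available() else "cpu"
    except Exception:
        # For example, torch_npu is installed but the CANN runtime is missing.
        return "cpu"


def pick_dtype_name(purpose: str) -> str:
    """Map a purpose to the precision named in the list above."""
    table = {"accuracy": "float32", "throughput": "bfloat16"}
    if purpose not in table:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return table[purpose]


print(pick_device(), pick_dtype_name("throughput"))
```

With torch installed, the results plug into the usual calls, for example torch.device(pick_device()) and getattr(torch, pick_dtype_name("accuracy")).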

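Accuracy Evaluation

Evaluation Method

Logits produced on the NPU (float32) are compared against a CPU (float32) baseline on 8 diverse prompts. The comparison uses 4 complementary metrics: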

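Relative L2 error: ||NPU - CPU||_2 / ||CPU||_2 * 100%
Cosine similarity: directional agreement between the output vectors
Top-1 accuracy: whether the NPU and CPU select the same next token
Top-5 overlap: how many of the 5 highest-probability tokens are shared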
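
The four metrics can be sketched in plain Python as follows. This is an illustration, not the code in inference.py: it assumes each input is a flat list of logits for a single position (for example the last token), and the function names are hypothetical.

```python
import math


def rel_l2_percent(npu, cpu):
    """||NPU - CPU||_2 / ||CPU||_2 * 100."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(npu, cpu)))
    den = math.sqrt(sum(b * b for b in cpu))
    return num / den * 100.0


def cosine_similarity(npu, cpu):
    """Cosine of the angle between the two logit vectors."""
    dot = sum(a * b for a, b in zip(npu, cpu))
    norm_npu = math.sqrt(sum(a * a for a in npu))
    norm_cpu = math.sqrt(sum(b * b for b in cpu))
    return dot / (norm_npu * norm_cpu)


def top_k_ids(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def top1_match(npu, cpu):
    """True if both devices pick the same next token."""
    return top_k_ids(npu, 1) == top_k_ids(cpu, 1)


def top5_overlap(npu, cpu, k=5):
    """Number of token ids shared by the two top-k sets (order ignored)."""
    return len(set(top_k_ids(npu, k)) & set(top_k_ids(cpu, k)))


# Toy example: the "NPU" logits are the "CPU" logits plus a small perturbation.
cpu_logits = [2.0, 1.0, 0.5, -1.0, 3.0, 0.0, -2.5]
npu_logits = [x + 1e-4 for x in cpu_logits]
print(rel_l2_percent(npu_logits, cpu_logits) < 1.0,
      top1_match(npu_logits, cpu_logits),
      top5_overlap(npu_logits, cpu_logits))
```

Relative L2 error and cosine similarity measure the numerical drift of the whole vector, while the two top-k checks test whether that drift is large enough to change what the model would actually generate.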

Results Summary

Metric	Value	Threshold	Status
Relative L2 error	0.0068%	< 1.0%	Pass
Cosine similarity	0.99998	> 0.99	Pass
Top-1 accuracy	100.0% (8/8)	> 99%	Pass
Top-5 overlap	5.0/5	> 4/5	Pass

Per-Prompt Details

#	Prompt (truncated)	rel_L2 (%)	cos_sim	top1	top5
0	Hello, how are you doing today?	0.0052	1.000095	OK	5/5
1	The capital of France is Paris.	0.0009	0.999924	OK	5/5
2	Once upon a time in a land far away,	0.0059	0.999999	OK	5/5
3	Artificial intelligence is the future of	0.0334	0.999993	OK	5/5
4	The speed of light in vacuum is approx	0.0007	0.999927	OK	5/5
5	Machine learning models can be used for	0.0064	0.999933	OK	5/5
6	In recent years, deep learning has	0.0012	1.000045	OK	5/5
7	The most important thing in science is	0.0005	0.999941	OK	5/5
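
The summary figures appear to be arithmetic means of the per-prompt rows; the README does not state the aggregation explicitly, so that reading is an assumption. The short check below reproduces both summary values from the eight rows.

```python
from statistics import mean

# Per-prompt values copied from the table above.
rel_l2 = [0.0052, 0.0009, 0.0059, 0.0334, 0.0007, 0.0064, 0.0012, 0.0005]
cos_sim = [1.000095, 0.999924, 0.999999, 0.999993,
           0.999927, 0.999933, 1.000045, 0.999941]

mean_rel_l2 = mean(rel_l2)
mean_cos = mean(cos_sim)
worst_rel_l2 = max(rel_l2)

print(f"mean rel_L2  = {mean_rel_l2:.4f}%")
print(f"mean cos_sim = {mean_cos:.5f}")
print(f"worst rel_L2 = {worst_rel_l2:.4f}%")
```

Even the worst single prompt stays well under the 1% threshold. The cos_sim values slightly above 1 are most likely floating-point rounding in the float32 computation, since a true cosine cannot exceed 1.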

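Conclusion: pass. NPU inference is numerically equivalent to the CPU baseline. All 8 prompts achieve an exact top-1 token match and full top-5 overlap, and the mean relative L2 error of 0.0068% is far below the 1% threshold.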

Performance Benchmark

Test Configuration

Device: Ascend 910 NPU (single card)
Precision: bfloat16
Max new tokens: 128 per run
Iterations: 10 (after 2 warm-up runs)
Decoding: greedy (do_sample=False, temperature=1.0)
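
A timing loop matching this configuration (2 untimed warm-up runs, then 10 timed runs) might look like the sketch below. It is not the code in inference.py: the names benchmark and summarize are hypothetical, and computing throughput as new tokens divided by mean latency is an assumption.

```python
import statistics
import time


def benchmark(run_once, warmup=2, iters=10, sync=None):
    """Call run_once() `warmup` times untimed, then time `iters` calls (seconds)."""
    def call():
        run_once()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock

    for _ in range(warmup):
        call()
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies, new_tokens=128):
    """Aggregate latencies; throughput = new_tokens / mean latency (assumption)."""
    mean_s = statistics.mean(latencies)
    return {
        "mean_s": mean_s,
        "median_s": statistics.median(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "tok_per_s": new_tokens / mean_s,
    }


# Stand-in workload so the sketch runs without a model or an NPU.
stats = summarize(benchmark(lambda: sum(range(10_000))))
print(sorted(stats))
```

On an NPU, run_once would wrap the model.generate(...) call and sync would be torch.npu.synchronize, so that each measurement covers the completed device work rather than only the kernel launches.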

Results

Metric	Value
Throughput	26.1 tok/s
Mean latency (128 tokens)	4.904 s
Median latency	4.874 s
P99 latency	5.287 s
Min latency	4.638 s
Max latency	5.314 s
NPU memory (allocated)	618.1 MB
NPU memory (reserved)	680.0 MB
Model load time	1.9 s
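
The allocated-memory figure can be sanity-checked against the size of the weights alone. The sketch below assumes the reported "MB" are mebibytes (2^20 bytes), which the README does not state, and uses the rounded 0.31B parameter count, so the result is approximate.

```python
PARAMS = 0.31e9          # parameter count as stated in this README (rounded)
BYTES_PER_PARAM = 2      # bfloat16 stores each value in 2 bytes
ALLOCATED_MIB = 618.1    # "NPU memory (allocated)" from the table; MiB is assumed

weights_mib = PARAMS * BYTES_PER_PARAM / 2**20
other_mib = ALLOCATED_MIB - weights_mib

print(f"weights ~ {weights_mib:.0f} MiB, remainder ~ {other_mib:.0f} MiB")
```

Under these assumptions the bfloat16 weights account for most of the allocated memory, with the remainder presumably going to activations, the KV cache, and other runtime buffers.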

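Analysis

The model loads in under 2 seconds and occupies less than 700 MB of NPU memory
Latency is stable: the minimum and maximum across the 10 runs differ by only 0.68 seconds
P99 latency (5.29 s) is only 8% above the mean (4.90 s), indicating consistent performance
A throughput of 26.1 tok/s corresponds to roughly 1,570 tokens per minute on a single NPU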
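
The derived figures in this analysis follow directly from the results table. The arithmetic below uses only the reported numbers; treating throughput as 128 tokens divided by mean latency is an assumption that happens to reproduce the reported 26.1 tok/s.

```python
NEW_TOKENS = 128
MEAN_S, P99_S, MIN_S, MAX_S = 4.904, 5.287, 4.638, 5.314  # from the results table

throughput = NEW_TOKENS / MEAN_S              # tokens per second
spread_s = MAX_S - MIN_S                      # gap between slowest and fastest run
p99_over_mean = (P99_S / MEAN_S - 1) * 100    # percent above the mean
tokens_per_min = throughput * 60

print(f"throughput     = {throughput:.1f} tok/s")
print(f"min-max spread = {spread_s:.2f} s")
print(f"P99 over mean  = {p99_over_mean:.1f}%")
print(f"per minute     = {tokens_per_min:.0f} tokens")
```

With only 10 timed runs, the P99 value sits very close to the maximum, so it is better read as an indication of the worst observed run than as a tail-latency estimate.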

Quick Start

1. Install dependencies

pip install modelscope transformers torch torch_npu

2. Download the model

modelscope download --model AutoArk/GPA --local_dir ./GPA_model

3. Run inference

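import torch
import torch_npu  # importing torch_npu registers the "npu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "./GPA_model"
device = torch.device("npu:0")

# Load the LLM backbone in bfloat16; trust_remote_code is required because the
# checkpoint ships its own modeling_qwen3.py and configuration_qwen3.py.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)

# Tokenize the prompt, move the tensors to the NPU, and generate up to 128 new tokens.
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))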

Command-Line Inference

# Text generation
python inference.py --task text --prompt "Hello!" --max_new_tokens 128

# Accuracy validation (NPU vs CPU)
python inference.py --task accuracy

# Performance benchmark
python inference.py --task benchmark --num_iters 10

# Full evaluation (accuracy + benchmark + demo)
python inference.py --task full --output eval_results.json
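
For readers writing their own wrapper, the flags used in the commands above could be parsed roughly as follows. This is a hypothetical sketch, not the parser in inference.py: the flag names and task values are taken from this README, while every default value and the help text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="NPU inference and evaluation (sketch)")
    parser.add_argument(
        "--task",
        # "memory" appears only in the file-structure comment for inference.py.
        choices=["text", "accuracy", "benchmark", "memory", "full"],
        default="text",  # assumed default
    )
    parser.add_argument("--prompt", default="Hello!")               # assumed default
    parser.add_argument("--max_new_tokens", type=int, default=128)  # assumed default
    parser.add_argument("--num_iters", type=int, default=10)        # assumed default
    parser.add_argument("--output", default=None, help="optional JSON results path")
    return parser


args = build_parser().parse_args(["--task", "benchmark", "--num_iters", "10"])
print(args.task, args.num_iters)
```

Run python inference.py --help to see the options the real script accepts.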

File Structure

.
├── inference.py              # NPU inference script (text/accuracy/benchmark/memory/full)
├── README.md                 # This documentation
├── results_accuracy.json     # Accuracy evaluation results (JSON)
├── results_benchmark.json    # Performance benchmark results (JSON)
├── results_text.json         # Text generation results (JSON)
└── GPA_model/               # Model files (downloaded from ModelScope)
    ├── config.json
    ├── model.safetensors
    ├── modeling_qwen3.py
    ├── configuration_qwen3.py
    ├── tokenizer.json
    ├── tokenizer_config.json
    ├── vocab.json
    ├── merges.txt
    ├── generation_config.json
    ├── chat_template.jinja
    ├── special_tokens_map.json
    ├── added_tokens.json
    ├── BiCodec/
    │   ├── config.json
    │   ├── model.safetensors
    │   └── wav2vec2-large-xlsr-53/
    └── glm-4-voice-tokenizer/
        ├── config.json
        ├── model.safetensors
        └── preprocessor_config.json

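Limitations

Text generation only: this adaptation covers the LLM backbone (Qwen3ForCausalLM). The full audio pipeline (ASR/TTS/VC), which includes BiCodec and the speech tokenizer, requires the original GPA inference scripts
Single NPU: benchmarks ran on a single Ascend 910 card; multi-NPU inference has not been tested
bfloat16 inference: production inference uses bfloat16 for best throughput; float32 is used for accuracy validation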

Citation

@misc{gpa2026,
  title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformer},
  author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
  year={2026},
  howpublished={\url{https://github.com/AutoArk/GPA}},
}

License

Apache License 2.0

Adapted for Ascend NPU on May 16, 2026