Granite-Speech-4.1-2B Ascend NPU Deployment Guide

Project Overview

Granite-Speech-4.1-2B is a 2.1B-parameter ASR (automatic speech recognition) model. This project provides a deployment setup for running it on Huawei Ascend NPUs.

Features

Inference acceleration on Ascend NPU
CPU vs. NPU accuracy comparison test
Speech-to-text recognition

Requirements

Hardware: Huawei Ascend 910 series NPU
CANN: 8.0.RC1 or later
PyTorch: 2.0+ with torch_npu
Docker: container named test-modelagent
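
The Python-side requirements above can be checked before running anything else. The following is a minimal sketch and not part of the project's scripts: the helper name check_python_packages is hypothetical, and it only reports whether the packages can be found. It does not verify the CANN version or the NPU driver.

```python
import importlib.util

# Packages this guide relies on; "transformers" is needed by the usage example.
REQUIRED_PACKAGES = ("torch", "torch_npu", "transformers")


def check_python_packages(names=REQUIRED_PACKAGES):
    """Return {package: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_python_packages().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Running it inside the container gives a quick indication of which package, if any, still needs to be installed.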

Directory Structure

/data/ysws/agentsp/granite-speech-4.1-2b-ascend/
├── inference.py          # accuracy test script
├── log.txt               # test log
├── README.md             # this document
├── test_audio_1s.pt      # test audio sample (1 s)
└── test_audio_3s.pt      # test audio sample (3 s)

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

3. Prepare the model files

Place the model files in the /data/ysws/agentsp/granite-speech-4.1-2b/ directory:

config.json - model configuration
tokenizer.json - tokenizer
processor_config.json - processor configuration
model.safetensors - model weights (3 shards)
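
Before running the test, it can help to confirm that the files listed above are in place. The sketch below is an illustration under assumptions: the helper find_missing_model_files is hypothetical, and because this guide does not give the individual shard file names, the weights are matched by the .safetensors extension only.

```python
from pathlib import Path

# File names taken from the list above; weights are matched by extension
# because the exact shard file names are not given in this guide.
REQUIRED_FILES = ("config.json", "tokenizer.json", "processor_config.json")


def find_missing_model_files(model_dir):
    """Return a list of what is missing from model_dir (empty if complete)."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


if __name__ == "__main__":
    print(find_missing_model_files("/data/ysws/agentsp/granite-speech-4.1-2b"))
```

An empty list means every expected file was found. The check does not validate file contents or count the shards.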

4. Run the accuracy test

cd /data/ysws/agentsp/granite-speech-4.1-2b-ascend/
python3 inference.py

Test Validation

Accuracy Test Results

Metric	Measured	Threshold	Status
Max Error (mean)	1.19e-07	< 1e-5	PASS
Max Error (std)	1.86e-09	< 1e-5	PASS

Note: the threshold for sum is adaptive and scales with the magnitude of the tensor: max(1e-2, 1e-4 * |sum|).
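
The threshold rules above can be written out as a short sketch. This is not the code from inference.py; the function names are hypothetical, and using the CPU result as the reference that sets the scale of the adaptive threshold is an assumption.

```python
FIXED_THRESHOLD = 1e-5  # threshold for the mean and std errors in the table


def adaptive_sum_threshold(tensor_sum):
    """Threshold for the sum error: max(1e-2, 1e-4 * |sum|)."""
    return max(1e-2, 1e-4 * abs(tensor_sum))


def sum_matches(cpu_sum, npu_sum):
    """True if the NPU sum is within the adaptive threshold of the CPU sum."""
    # Assumption: the CPU result is the reference that sets the scale.
    return abs(cpu_sum - npu_sum) < adaptive_sum_threshold(cpu_sum)


if __name__ == "__main__":
    print(adaptive_sum_threshold(5.0))    # small sums fall back to the 1e-2 floor
    print(adaptive_sum_threshold(1.0e6))  # large sums scale the threshold up
    print(1.19e-07 < FIXED_THRESHOLD)     # mean error from the table
```

The floor of 1e-2 keeps the check meaningful for tensors whose sum is close to zero, while the relative term avoids false failures on large tensors where rounding error accumulates.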

Performance Data

Operation	Time
Model loading	9.79 s
CPU reference computation (20 tensors)	7.37 s
NPU inference (20 tensors)	0.16 s
Speech inference (3 s audio)	1.30 s
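
Two derived figures follow directly from these timings: the NPU is about 46x faster than the CPU on the 20-tensor comparison, and the 3 s clip is transcribed at a real-time factor of about 0.43. The sketch below only reproduces that arithmetic. The helper names are hypothetical, and the guide does not state whether the NPU timing includes device synchronization, so the speedup should be read as indicative.

```python
# Timings copied from the table above (seconds).
CPU_REFERENCE_S = 7.37
NPU_INFERENCE_S = 0.16
SPEECH_INFERENCE_S = 1.30
AUDIO_DURATION_S = 3.0


def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s


def real_time_factor(processing_s, audio_s):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_s / audio_s


if __name__ == "__main__":
    print(f"NPU speedup over CPU: {speedup(CPU_REFERENCE_S, NPU_INFERENCE_S):.1f}x")
    print(f"Real-time factor: {real_time_factor(SPEECH_INFERENCE_S, AUDIO_DURATION_S):.2f}")
```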

Test Log

The complete test log is saved in log.txt.

Usage Example

Running Inference

import torch
import torch_npu  # registers the "npu" device type with PyTorch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_path = "/data/ysws/agentsp/granite-speech-4.1-2b"
device = torch.device("npu:0")

# Load the weights in bfloat16 and move the model to the NPU
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to(device).eval()

processor = AutoProcessor.from_pretrained(model_path)

# Placeholder input: 3 s of random noise at 16 kHz; replace with real audio
speech = torch.randn(16000 * 3).numpy()
inputs = processor(text="transcribe", audio=speech, return_tensors="pt").to(device)

with torch.no_grad():
    # Pass every processor output (token IDs and audio features), not only input_ids
    output = model.generate(**inputs, max_new_tokens=256)
    transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
    print(transcription)

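Processor Call Notes

GraniteSpeechProcessor is called as follows:

inputs = processor(
    text="transcribe",  # text prompt (required)
    audio=speech,      # audio data
    return_tensors="pt"
)

Note: text is the first parameter of the processor call and is required; it can be passed positionally or, as here, by keyword. The audio sampling rate is read from the processor configuration (16000 Hz), so it does not need to be passed separately.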

Model Structure

Main components of the model:

Component	Description
language_model	Transformer decoder
encoder	Speech encoder
embed_tokens	Token embedding layer
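
To see how these components appear in a loaded model, the top-level submodules can be listed with the standard torch.nn.Module methods named_children() and parameters(). The sketch below is illustrative: summarize_submodules is a hypothetical helper, and a component nested inside another module (embed_tokens may be one) would not show up at the top level, so named_modules() would be needed to find it.

```python
def summarize_submodules(model):
    """Map each top-level submodule name to (class name, parameter count)."""
    summary = {}
    for name, child in model.named_children():
        n_params = sum(p.numel() for p in child.parameters())
        summary[name] = (type(child).__name__, n_params)
    return summary


# Usage, assuming `model` was loaded as in the usage example of this guide:
#
#   for name, (cls, n_params) in summarize_submodules(model).items():
#       print(f"{name}: {cls}, {n_params / 1e6:.1f}M parameters")
```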

FAQ

Q: What if the accuracy test fails?

A: Check that the NPU driver is installed correctly and that the CANN environment script has been sourced (step 2).
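
A quick diagnostic along these lines can narrow down where the setup is broken. It is a sketch under assumptions: diagnose_npu is a hypothetical helper, and it assumes that set_env.sh exports the ASCEND_TOOLKIT_HOME variable, which should be confirmed against the installed CANN version.

```python
import os


def diagnose_npu():
    """Collect basic facts about the NPU setup without raising on failure."""
    report = {
        # Assumption: set_env.sh exports ASCEND_TOOLKIT_HOME.
        "cann_env_sourced": "ASCEND_TOOLKIT_HOME" in os.environ,
        "torch_npu_importable": False,
        "npu_available": False,
    }
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers torch.npu)
    except (ImportError, OSError):
        return report
    report["torch_npu_importable"] = True
    report["npu_available"] = bool(torch.npu.is_available())
    return report


if __name__ == "__main__":
    for key, value in diagnose_npu().items():
        print(f"{key}: {value}")
```

If cann_env_sourced is False, source the environment script again in the current shell. If torch_npu imports but npu_available is False, the driver or device mapping into the container is the more likely cause.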

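Q: How should audio of different lengths be handled?

A: No special handling is needed for different audio lengths; the model uses dynamic convolution kernels to process variable-length input.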

Q: Which audio formats are supported?

A: WAV audio of any length, sampled at 16 kHz.
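
As an illustration of preparing such a file, the sketch below reads a WAV with Python's standard wave module and rejects anything that is not sampled at 16 kHz. The helper load_wav_16k is hypothetical and handles 16-bit PCM only; resampling and other sample widths are left out.

```python
import sys
import wave
from array import array

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in this guide


def load_wav_16k(path):
    """Read a 16-bit PCM WAV file and return mono samples as floats in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            raise ValueError(f"expected 16000 Hz, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        channels = wav.getnchannels()
        pcm = array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    # Keep only the first channel if the file is multi-channel.
    return [sample / 32768.0 for sample in pcm[::channels]]
```

The returned list can then be converted to the array type the processor expects (for example with numpy.asarray(samples, dtype="float32")) and passed as the audio argument.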

License

This project follows the original license of Granite-Speech-4.1-2B.