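Granite-Speech-4.1-2B-Plus Ascend NPU Deployment Guide

Overview

Granite-Speech-4.1-2B-Plus is an enhanced ASR (automatic speech recognition) model with a deeper encoder and projector. This project provides a deployment recipe for running it on Huawei Ascend NPUs.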

Features

Inference acceleration on Ascend NPU
CPU vs NPU accuracy comparison test (weight statistics)
Speech-to-text recognition
Enhanced audio encoder configuration

Requirements

Hardware: Huawei Ascend 910 series NPU
CANN: 8.0.RC1 or later
PyTorch: 2.0+ with torch_npu
Docker: container named test-modelagent
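
Before running anything heavier, it can help to confirm that the Python packages listed above are importable inside the container. The following is a minimal sketch using only the standard library; the function name check_python_packages is hypothetical, and the check only tells you whether a package can be found, not whether its version or the NPU driver is correct.

```python
import importlib.util


def check_python_packages(packages=("torch", "torch_npu", "transformers")):
    """Return {package: True/False} for whether each top-level package is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # Print a short report; a MISSING entry means the package is not installed
    # in the current Python environment.
    for name, found in check_python_packages().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```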

Directory Structure

/data/ysws/agentsp/granite-speech-4.1-2b-plus-ascend/
├── inference.py          # accuracy test script
├── log.txt               # test log
├── README.md             # this document
├── test_audio_1s.pt      # test audio sample (1 second)
└── test_audio_3s.pt      # test audio sample (3 seconds)

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

3. Prepare the model files

Place the model files in the /data/ysws/agentsp/granite-speech-4.1-2b-plus/ directory:

config.json - model configuration
tokenizer.json - tokenizer
processor_config.json - processor configuration
model-*.safetensors - model weights (3 shards)
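
The file list above can be checked automatically before loading the model. This is a sketch under the assumption that the directory must contain exactly the files listed here; check_model_dir and the constants are hypothetical names, not part of inference.py.

```python
from pathlib import Path

# File names and shard count are taken from the list above.
REQUIRED_FILES = ("config.json", "tokenizer.json", "processor_config.json")
EXPECTED_SHARDS = 3


def check_model_dir(model_dir):
    """Return a list of problems found; an empty list means the layout looks complete."""
    root = Path(model_dir)
    problems = [f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()]
    shards = sorted(root.glob("model-*.safetensors"))
    if len(shards) != EXPECTED_SHARDS:
        problems.append(f"expected {EXPECTED_SHARDS} weight shards, found {len(shards)}")
    return problems
```

For example, check_model_dir("/data/ysws/agentsp/granite-speech-4.1-2b-plus") returns an empty list when all files are in place.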

4. Run inference and the accuracy test

cd /data/ysws/agentsp/granite-speech-4.1-2b-plus-ascend/
python3 inference.py

Verification

Accuracy Test Results

Metric	Measured	Threshold	Status
Max Error (mean)	1.19e-07	< 1e-5	PASS
Max Error (std)	1.86e-09	< 1e-5	PASS

Note: the sum threshold is adaptive and scales with the tensor's magnitude: max(1e-2, 1e-4 * |sum|).
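
The thresholds above can be expressed as a small comparison routine. This is a sketch, not the logic of inference.py: it assumes each tensor is summarized as a dict with mean, std, and sum entries (a hypothetical layout), and that the adaptive sum threshold is computed from the CPU reference sum.

```python
def sum_threshold(reference_sum):
    """Adaptive tolerance for sums: max(1e-2, 1e-4 * |sum|)."""
    return max(1e-2, 1e-4 * abs(reference_sum))


def compare_stats(cpu_stats, npu_stats, tol=1e-5):
    """Compare per-tensor statistics computed on CPU and NPU.

    Both arguments are equal-length lists of dicts with 'mean', 'std', 'sum' keys.
    """
    pairs = list(zip(cpu_stats, npu_stats))
    max_mean_err = max(abs(c["mean"] - n["mean"]) for c, n in pairs)
    max_std_err = max(abs(c["std"] - n["std"]) for c, n in pairs)
    # The sum check uses a per-tensor threshold derived from the CPU value.
    sum_ok = all(abs(c["sum"] - n["sum"]) < sum_threshold(c["sum"]) for c, n in pairs)
    return {
        "max_mean_err": max_mean_err,
        "max_std_err": max_std_err,
        "passed": max_mean_err < tol and max_std_err < tol and sum_ok,
    }
```

A fixed absolute tolerance works for mean and std because they stay small regardless of tensor size, while a sum grows with the number of elements, which is why its tolerance is relative with a floor of 1e-2.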

Performance

Operation	Time
Model loading	55.95s
CPU reference computation (20 tensors)	4.57s
NPU inference (20 tensors)	0.15s
Speech inference (3 s audio)	19.71s
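
Two derived figures make the table easier to read: the CPU-to-NPU ratio for the 20-tensor pass and the real-time factor of the speech run. The short calculation below uses only the numbers from the table; note that the ratio applies to the weight-statistics pass, not to end-to-end transcription, and the source does not say whether the 19.71 s figure includes one-time warm-up.

```python
# Timings copied from the performance table (seconds).
cpu_reference_s = 4.57
npu_s = 0.15
speech_inference_s = 19.71
audio_duration_s = 3.0

# Ratio of CPU to NPU time for the 20-tensor statistics pass.
speedup = cpu_reference_s / npu_s

# Real-time factor: processing time divided by audio duration (lower is better).
real_time_factor = speech_inference_s / audio_duration_s

print(f"NPU speedup (20-tensor pass): {speedup:.1f}x")
print(f"Real-time factor (3 s clip): {real_time_factor:.2f}")
```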

Test Log

The full test log is saved in log.txt.

Usage Example

Running Inference

import numpy as np
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_path = "/data/ysws/agentsp/granite-speech-4.1-2b-plus"
device = torch.device("npu:0")

# Load the weights in bfloat16, move the model to the NPU, and switch to eval mode.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to(device).eval()

processor = AutoProcessor.from_pretrained(model_path)

# Placeholder input: 3 seconds of random noise at 16 kHz. Replace with real audio.
speech = np.random.randn(16000 * 3).astype(np.float32)
inputs = processor(text="transcribe", audio=speech, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

with torch.no_grad():
    # Note: only input_ids are forwarded here; the audio features in `inputs`
    # are not passed to generate().
    output = model.generate(input_ids=input_ids, max_new_tokens=256)
    transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
    print(transcription)

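Processor Call Notes

GraniteSpeechProcessor is called as follows:

inputs = processor(
    text="transcribe",  # text prompt (required)
    audio=speech,       # audio data
    return_tensors="pt"
)

Note: text is the processor's first parameter and is required; it can be passed positionally or, as above, by keyword. The audio sampling rate (16000 Hz) is read from the processor configuration, so it does not need to be passed separately.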
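
Because the sampling rate is fixed by the processor configuration, real audio should already be at 16000 Hz when it is handed to the processor. The sketch below reads a 16-bit mono PCM WAV file with the standard library and rejects files at any other rate; load_wav_mono16 is a hypothetical helper, and it does no resampling.

```python
import array
import sys
import wave


def load_wav_mono16(path, expected_rate=16000):
    """Read a 16-bit mono PCM WAV file and return its samples as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        if wav.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz, got {wav.getframerate()} Hz")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Scale integer samples to floating point.
    return [s / 32768.0 for s in samples]
```

The returned list can be converted with np.asarray(samples, dtype=np.float32) and passed as the audio argument in place of the random placeholder.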

Model Architecture

Main components:

Component	Description
language_model	Transformer decoder (GraniteForCausalLM)
encoder	Speech encoder (GraniteSpeechPlusEncoder, 16 layers)
projector	Q-Former projector

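Known Issues and Workarounds

Issue: model_type granite_speech_plus not recognized

Problem: transformers 4.56.0 does not support the granite_speech_plus model type.

Workaround: the script temporarily edits config.json, changing granite_speech_plus to granite_speech so that the model can be loaded, and restores the original configuration once the test finishes.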
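
A temporary edit of this kind is safest when the restore step runs even if loading fails. The following sketch shows one way to do that with a context manager; it is an illustration of the idea, not the code in inference.py, and patched_model_type is a hypothetical name. It only touches the top-level model_type field.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def patched_model_type(model_dir, old="granite_speech_plus", new="granite_speech"):
    """Temporarily rewrite model_type in config.json, restoring the original file on exit."""
    config_path = Path(model_dir) / "config.json"
    original_text = config_path.read_text(encoding="utf-8")
    config = json.loads(original_text)
    changed = config.get("model_type") == old
    if changed:
        config["model_type"] = new
        config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    try:
        yield config_path
    finally:
        # Restore the exact original bytes, even if an exception was raised.
        if changed:
            config_path.write_text(original_text, encoding="utf-8")
```

Model loading would then go inside a with patched_model_type(model_path): block, so the original config.json is back in place as soon as the block exits.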

Issue: mistral_common import error

Problem: the ReasoningEffort class does not exist in newer versions of mistral_common.

Workaround: the script automatically patches mistral_common.protocol.instruct.request by adding an empty ReasoningEffort class.
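
The patch described above amounts to attaching a placeholder class to a module before the failing import runs. A minimal sketch of that idea follows; the helper names are hypothetical and the actual script may do this differently. The placeholder only satisfies the import and carries no behavior.

```python
import importlib


def ensure_placeholder_class(module, name):
    """Add an empty class called `name` to `module` if it is missing; return the class."""
    if not hasattr(module, name):
        setattr(module, name, type(name, (), {}))
    return getattr(module, name)


def patch_mistral_common():
    """Apply the placeholder if mistral_common is installed; return True if the module was found."""
    try:
        request_module = importlib.import_module("mistral_common.protocol.instruct.request")
    except ImportError:
        return False  # nothing to patch in this environment
    ensure_placeholder_class(request_module, "ReasoningEffort")
    return True
```

For the patch to have any effect, patch_mistral_common() would need to run before the import that raises the error.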

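FAQ

Q: The accuracy test fails?

A: Check that the NPU driver is installed correctly and that the CANN environment script has been sourced.

Q: Why does model loading take so long?

A: The model consists of 3 safetensors shards totaling about 4 GB. The first load takes about 1 minute.

Q: How are audio clips of different lengths handled?

A: No special handling is needed: the model processes variable-length input using dynamic convolution kernels.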

License

This project follows the original license of Granite-Speech-4.1-2B-Plus.