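Granite-Speech-4.1-2B-Plus Ascend NPU Deployment Guide

Project overview

Granite-Speech-4.1-2B-Plus is an enhanced ASR (automatic speech recognition) model with a deeper encoder and projector. This project provides a deployment recipe for running it on Huawei Ascend NPUs.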

Features

Inference acceleration on Ascend NPU
CPU vs. NPU precision comparison test (based on weight statistics)
Speech-to-text recognition
Enhanced audio encoder configuration

Requirements

Hardware: Huawei Ascend 910 series NPU
CANN: 8.0.RC1 or later
PyTorch: 2.0+ with torch_npu

File structure

granite-speech-4.1-2b-plus-ascend/
├── inference.py          # Inference script
├── test.log              # Test log
└── README.md             # This document

Deployment steps

1. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh
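
Before starting Python it can help to confirm that the script above actually took effect in the current shell. The following is a minimal sketch, not part of inference.py; the variable names in EXPECTED_VARS are an assumption about what set_env.sh exports and should be checked against your own CANN installation.

```python
import os
from typing import Mapping

# Assumed variable names; check your own set_env.sh for the exact list.
EXPECTED_VARS = ("ASCEND_TOOLKIT_HOME", "ASCEND_OPP_PATH")


def missing_ascend_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the expected CANN variables that are unset or empty."""
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_ascend_vars()
    if missing:
        print("Not set:", ", ".join(missing), "- was set_env.sh sourced?")
    else:
        print("CANN environment variables look set.")
```

The function takes the environment as a parameter so that it can be checked against a plain dict as well as os.environ.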

2. Prepare the model files

The model files are located in /opt/atomgit/mxy/granite-speech-4.1-2b-plus/:

config.json - model configuration
tokenizer.json - tokenizer
processor_config.json - processor configuration
model-*.safetensors - model weights (3 shards)
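
The file list above can be verified before a load is attempted, which is cheaper than waiting for a failed load. This is a hypothetical helper (check_model_dir is not part of inference.py) that only checks for the file names and shard count given in this guide.

```python
from pathlib import Path

REQUIRED_FILES = ("config.json", "tokenizer.json", "processor_config.json")
EXPECTED_SHARDS = 3  # this guide lists 3 weight shards


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(model_dir)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    shards = sorted(root.glob("model-*.safetensors"))
    if len(shards) != EXPECTED_SHARDS:
        problems.append(
            f"expected {EXPECTED_SHARDS} weight shards, found {len(shards)}")
    return problems


if __name__ == "__main__":
    for problem in check_model_dir("/opt/atomgit/mxy/granite-speech-4.1-2b-plus"):
        print(problem)
```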

3. Run the precision test

cd granite-speech-4.1-2b-plus-ascend/
python3 inference.py

Test verification

Precision test results

Metric	Measured	Threshold	Status
Max Error (mean)	1.19e-07	< 1e-5	PASS
Max Error (std)	1.86e-09	< 1e-5	PASS

Note: the threshold for sum is adaptive and scales with the tensor's magnitude: max(1e-2, 1e-4 * |sum|).
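
The threshold rule in the note can be written out as a small function. This is a sketch under assumptions, not the code in inference.py: the dictionary keys, the stats_match name, and the use of the CPU sum as the reference for the adaptive threshold are all illustrative choices.

```python
def sum_threshold(total: float) -> float:
    """Adaptive tolerance for the sum statistic: max(1e-2, 1e-4 * |sum|)."""
    return max(1e-2, 1e-4 * abs(total))


def stats_match(cpu: dict, npu: dict, fixed_tol: float = 1e-5) -> bool:
    """Compare per-tensor statistics computed on CPU and on NPU.

    mean and std use the fixed tolerance from the table; sum uses the
    adaptive tolerance, taking the CPU value as the reference (assumption).
    """
    if abs(cpu["mean"] - npu["mean"]) >= fixed_tol:
        return False
    if abs(cpu["std"] - npu["std"]) >= fixed_tol:
        return False
    return abs(cpu["sum"] - npu["sum"]) < sum_threshold(cpu["sum"])
```

The floor of 1e-2 keeps the check meaningful for tensors whose sum is close to zero, while the relative term keeps large tensors from failing on accumulated rounding error alone.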

Performance data

Operation	Time
Model loading	55.95s
CPU reference computation (20 tensors)	4.57s
NPU inference (20 tensors)	0.15s
Speech inference (3s audio)	19.71s

Test log

2026-05-19 09:17:53,103 - INFO - Granite-Speech-4.1-2B-Plus ASR Ascend NPU Inference
2026-05-19 09:17:53,103 - INFO - Model path: /opt/atomgit/mxy/granite-speech-4.1-2b-plus
2026-05-19 09:18:50,371 - INFO - Model loaded on device: npu:0
2026-05-19 09:19:09,007 - INFO - Inference time: 18.601s
2026-05-19 09:19:13,486 - INFO -   CPU computation time: 4.3122s
2026-05-19 09:19:13,486 - INFO -   NPU inference time: 0.1537s
2026-05-19 09:19:13,487 - INFO - PRECISION TEST PASSED

Usage example

Running inference

import numpy as np
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_path = "/opt/atomgit/mxy/granite-speech-4.1-2b-plus"
device = torch.device("npu:0")

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to(device).eval()

processor = AutoProcessor.from_pretrained(model_path)

# Placeholder input: 3 seconds of random noise at 16 kHz. Replace with real audio.
speech = np.random.randn(16000 * 3).astype(np.float32)
inputs = processor(text="transcribe", audio=speech, return_tensors="pt")
# Move all tensors to the NPU; floating-point features are cast to the model dtype.
inputs = inputs.to(device, dtype=torch.bfloat16)

with torch.no_grad():
    # Forward every processor output, not only input_ids, so the audio features reach the model.
    output = model.generate(**inputs, max_new_tokens=256)
    transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
    print(transcription)

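Processor call

GraniteSpeechProcessor is called as follows:

inputs = processor(
    text="transcribe",  # Text prompt (required)
    audio=speech,       # Audio data
    return_tensors="pt"
)

Note: text is the first parameter and is required; it can be passed positionally or by keyword, as above. The audio sampling rate is read from the processor configuration (16000 Hz) and does not need to be passed separately.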
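
Because the sampling rate is not passed at call time, the audio itself has to be at 16000 Hz already. The sketch below checks this for a WAV file using only the standard library; read_wav_mono16 is a hypothetical helper, and it assumes 16-bit mono PCM input rather than covering every WAV variant.

```python
import sys
import wave
from array import array

EXPECTED_RATE = 16000  # sampling rate stated in the note above


def read_wav_mono16(path: str) -> list[float]:
    """Read a 16-bit mono PCM WAV file and return samples scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError(
                f"expected {EXPECTED_RATE} Hz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())
    pcm = array("h")
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    return [sample / 32768.0 for sample in pcm]
```

The result can be converted with np.asarray(samples, dtype=np.float32) and passed as the audio argument. Files at other rates need resampling first, which this sketch deliberately leaves out.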

Model architecture

Main model components:

Component	Description
language_model	Transformer decoder (GraniteForCausalLM)
encoder	Speech encoder (GraniteSpeechPlusEncoder, 16 layers)
projector	Q-Former projector

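Known issues and workarounds

Issue: model_type granite_speech_plus not recognized

Problem: transformers 4.56.0 does not support the granite_speech_plus model type.

Workaround: the script temporarily edits config.json, changing granite_speech_plus to granite_speech so that the model can be loaded, and restores the original configuration once the test finishes.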
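
One way to make such a temporary edit safe is a context manager that restores the file even when loading fails. This is an illustrative sketch of the idea, not the code in inference.py; temporary_model_type is a hypothetical name.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_model_type(model_dir: str, old: str = "granite_speech_plus",
                         new: str = "granite_speech"):
    """Rewrite model_type in config.json for the duration of the with-block."""
    config_path = Path(model_dir) / "config.json"
    original_text = config_path.read_text(encoding="utf-8")
    config = json.loads(original_text)
    patched = config.get("model_type") == old
    if patched:
        config["model_type"] = new
        config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    try:
        yield patched
    finally:
        if patched:
            # Put the original text back, even if the with-block raised.
            config_path.write_text(original_text, encoding="utf-8")


# Intended use (model loading itself is outside this sketch):
# with temporary_model_type(model_path):
#     model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, ...)
```

Keeping the original text in memory and writing it back, rather than reversing the substitution, avoids leaving a reformatted config.json behind.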

Issue: mistral_common import error

Problem: the ReasoningEffort class does not exist in newer versions of mistral_common.

Workaround: the script automatically patches mistral_common.protocol.instruct.request by adding an empty ReasoningEffort class.
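
A patch of this kind might look like the following. It is a sketch of the general technique (adding a missing attribute to an already imported module), not the script's actual code; both function names are hypothetical, and the patch would have to run before transformers is imported to have any effect.

```python
import importlib
import types


def ensure_reasoning_effort(module: types.ModuleType) -> bool:
    """Add a placeholder ReasoningEffort class to module if it lacks one.

    Returns True when a placeholder was added.
    """
    if hasattr(module, "ReasoningEffort"):
        return False

    class ReasoningEffort:  # empty stand-in, only so that imports succeed
        pass

    module.ReasoningEffort = ReasoningEffort
    return True


def patch_mistral_common() -> bool:
    """Apply the placeholder to mistral_common, if that package is installed."""
    try:
        module = importlib.import_module(
            "mistral_common.protocol.instruct.request")
    except ImportError:
        return False
    return ensure_reasoning_effort(module)
```

An empty class only satisfies the import; any code path that actually uses ReasoningEffort values would still fail, so this is suitable only when that feature is not exercised.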

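FAQ

Q: The precision test fails?

A: Check that the NPU driver is installed correctly and that the CANN environment script has been sourced.

Q: Model loading takes a long time?

A: The model consists of 3 safetensors shards totaling about 4 GB. The first load takes about 1 minute.

Q: How are audio clips of different lengths handled?

A: Audio length does not affect inference; the model uses dynamic convolution kernels to handle variable-length input.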

License

This project follows the original license of Granite-Speech-4.1-2B-Plus.