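Granite-Speech-4.1-2B Ascend NPU Deployment Guide

Overview

Granite-Speech-4.1-2B is a 2.1B-parameter automatic speech recognition (ASR) model. This project provides a deployment recipe for running it on Huawei Ascend NPUs.

Features

Inference acceleration on Ascend NPU
CPU vs. NPU precision comparison test
Speech-to-text recognition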

Environment

Item	Version / Value
Device	Ascend 910B

File Structure

granite-speech-4.1-2b-ascend/
├── inference.py          # Inference script
├── test.log              # Test log
└── README.md             # This document

Deployment Steps

1. Set environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh
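
Once the script has been sourced, a short check from Python can confirm that the environment is visible before the model is loaded. The following is a minimal sketch, not part of this project: the helper name check_ascend_env is hypothetical, and ASCEND_TOOLKIT_HOME is assumed to be one of the variables that set_env.sh exports.

```python
import importlib.util
import os


def check_ascend_env(environ=None):
    """Return a dict describing whether the Ascend runtime looks usable."""
    environ = os.environ if environ is None else environ
    return {
        # Assumed to be exported by set_env.sh; adjust if your toolkit differs.
        "toolkit_home_set": bool(environ.get("ASCEND_TOOLKIT_HOME")),
        # find_spec only looks the package up; it does not import it.
        "torch_npu_installed": importlib.util.find_spec("torch_npu") is not None,
    }


for name, ok in check_ascend_env().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If either entry is reported as missing, re-source the script in the same shell that launches Python.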

2. Prepare the model files

The model files are located in /opt/atomgit/mxy/granite-speech-4.1-2b/:

config.json - model configuration
tokenizer.json - tokenizer
processor_config.json - processor configuration
model.safetensors - model weights
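
A missing file usually surfaces only as a loading error, so it can help to verify the directory first. This is a small sketch under the assumption that the four files listed above are all that is required; the helper name missing_model_files is hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = (
    "config.json",
    "tokenizer.json",
    "processor_config.json",
    "model.safetensors",
)


def missing_model_files(model_dir):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling missing_model_files("/opt/atomgit/mxy/granite-speech-4.1-2b") returns an empty list when the directory is complete.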

3. Run the precision test

cd granite-speech-4.1-2b-ascend/
python3 inference.py

Test Verification

Precision Test Results

Metric	Measured	Threshold	Status
Max Error (mean)	1.19e-07	< 1e-5	✅ Pass
Max Error (std)	1.86e-09	< 1e-5	✅ Pass
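
The exact computation lives in inference.py and is not reproduced here. One plausible reading of the two metrics is that the mean and standard deviation of each tensor are computed on the CPU and on the NPU, and the largest absolute difference across tensors is reported. The sketch below implements that reading only; the function name max_stat_errors and the random sample tensors are assumptions for illustration.

```python
import torch


def max_stat_errors(tensors, device):
    """Compare per-tensor mean/std computed on the CPU and on `device`.

    Returns (max |mean_cpu - mean_dev|, max |std_cpu - std_dev|).
    """
    max_mean_err = 0.0
    max_std_err = 0.0
    for t in tensors:
        ref = t.detach().to("cpu", torch.float32)
        dev = ref.to(device)
        mean_err = abs(ref.mean().item() - dev.mean().item())
        std_err = abs(ref.std().item() - dev.std().item())
        max_mean_err = max(max_mean_err, mean_err)
        max_std_err = max(max_std_err, std_err)
    return max_mean_err, max_std_err


THRESHOLD = 1e-5  # threshold from the table above

# Falls back to the CPU so the sketch also runs on machines without an NPU.
device = "npu:0" if hasattr(torch, "npu") and torch.npu.is_available() else "cpu"
sample = [torch.randn(64, 64) for _ in range(20)]  # stand-in for state_dict tensors
mean_err, std_err = max_stat_errors(sample, device)
print("PASS" if max(mean_err, std_err) < THRESHOLD else "FAIL")
```

On an Ascend machine, torch_npu must be imported first so that torch.npu exists; without it the sketch compares the CPU against itself.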

Performance Data

Operation	Time
Model loading	9.79 s
CPU reference computation (20 tensors)	7.37 s
NPU inference (20 tensors)	0.16 s
Speech inference (3 s audio)	1.30 s
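
Two derived figures follow from the table: the 20-tensor comparison ran about 46 times faster on the NPU than on the CPU, and 1.30 s for 3 s of audio corresponds to a real-time factor of about 0.43. The sketch below shows one way such timings could be collected and the ratios computed; the helper names are hypothetical and this is not the timing code used by inference.py.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(cpu_seconds, npu_seconds):
    """How many times faster the NPU run was than the CPU run."""
    return cpu_seconds / npu_seconds


def real_time_factor(inference_seconds, audio_seconds):
    """Values below 1.0 mean the audio is processed faster than real time."""
    return inference_seconds / audio_seconds


# Figures from the table above.
print(f"speedup: {speedup(7.37, 0.16):.1f}x")     # speedup: 46.1x
print(f"RTF: {real_time_factor(1.30, 3.0):.2f}")  # RTF: 0.43
```

Because device execution is asynchronous, a timed NPU block should end with a synchronization call (torch.npu.synchronize() in torch_npu) before the timer stops. The speedup applies only to the 20-tensor comparison and is not an end-to-end inference speedup.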

Test Log

2026-05-19 08:47:05,469 - INFO - ============================================================
2026-05-19 08:47:05,469 - INFO - Granite-Speech-4.1-2B ASR Ascend NPU Inference
2026-05-19 08:47:05,469 - INFO - ============================================================
2026-05-19 08:47:05,469 - INFO - Model path: /opt/atomgit/mxy/granite-speech-4.1-2b
2026-05-19 08:47:05,469 - INFO - Device: npu:0
2026-05-19 08:47:09,764 - INFO - Loading model from: /opt/atomgit/mxy/granite-speech-4.1-2b
2026-05-19 08:47:10,441 - INFO - Processor loaded: GraniteSpeechProcessor
2026-05-19 08:47:14,914 - INFO - Model loaded on device: npu:0
2026-05-19 08:47:14,914 - INFO - Model type: GraniteSpeechForConditionalGeneration
2026-05-19 08:47:14,914 - INFO - Running inference...
2026-05-19 08:47:16,111 - INFO - Inference completed in 1.160s
2026-05-19 08:47:23,712 - INFO - PRECISION TEST PASSED
2026-05-19 08:47:23,712 - INFO - ============================================================
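
For automated checks, the two lines that matter in this log are the inference time and the final verdict. The following sketch extracts both, assuming the message formats shown above stay unchanged; the function name summarize_log is hypothetical.

```python
import re

INFERENCE_RE = re.compile(r"Inference completed in ([0-9.]+)s")


def summarize_log(lines):
    """Extract the inference time and the pass/fail verdict from log lines."""
    summary = {"inference_seconds": None, "precision_passed": False}
    for line in lines:
        match = INFERENCE_RE.search(line)
        if match:
            summary["inference_seconds"] = float(match.group(1))
        if "PRECISION TEST PASSED" in line:
            summary["precision_passed"] = True
    return summary


sample = [
    "2026-05-19 08:47:16,111 - INFO - Inference completed in 1.160s",
    "2026-05-19 08:47:23,712 - INFO - PRECISION TEST PASSED",
]
print(summarize_log(sample))
# {'inference_seconds': 1.16, 'precision_passed': True}
```

An open file object can be passed directly, since the function only iterates over lines.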

Usage Example

Running Inference

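import torch
import torch_npu  # noqa: F401 - registers the "npu" device with PyTorch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_path = "/opt/atomgit/mxy/granite-speech-4.1-2b"
device = torch.device("npu:0")

# Load the weights in bfloat16 and move the model to the NPU.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to(device).eval()

processor = AutoProcessor.from_pretrained(model_path)

# Placeholder input: 3 s of random noise at 16 kHz. Replace with real audio.
speech = torch.randn(16000 * 3).numpy()
# Assumption: as in earlier Granite Speech releases, the prompt carries an
# <|audio|> placeholder that the processor expands to the audio features.
inputs = processor(text="<|audio|>transcribe", audio=speech, return_tensors="pt")
inputs = inputs.to(device)

with torch.no_grad():
    # Pass every processor output so the audio features reach the model,
    # not only input_ids.
    output = model.generate(**inputs, max_new_tokens=256)
    transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
    print(transcription)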

Model Structure

Component	Description
language_model	Transformer decoder
encoder	Speech encoder
embed_tokens	Token embedding layer
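
To see how the weights are distributed over these components, the keys of the model's state_dict can be grouped by their top-level module name. The sketch below works on plain key strings; the example keys are hypothetical and serve only to illustrate the grouping, since the real names come from model.state_dict().

```python
from collections import Counter


def count_by_component(keys):
    """Count state_dict entries per top-level module name."""
    return Counter(key.split(".", 1)[0] for key in keys)


# Hypothetical key names for illustration only.
example_keys = [
    "encoder.layers.0.attn.to_q.weight",
    "encoder.layers.0.attn.to_kv.weight",
    "language_model.model.embed_tokens.weight",
    "language_model.model.layers.0.self_attn.q_proj.weight",
]
print(dict(count_by_component(example_keys)))
# {'encoder': 2, 'language_model': 2}
```

With a loaded model, count_by_component(model.state_dict().keys()) gives the same breakdown for the real weights.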

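Notes

The model runs inference on the NPU for acceleration.
The precision test compares state_dict tensors computed on the CPU against the same tensors on the NPU.
WAV audio of any length sampled at 16 kHz is supported.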
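
Since the sample rate matters, it can be worth checking a file before passing it to the model. This is a minimal sketch using only the standard-library wave module, which reads uncompressed PCM WAV files; the helper names inspect_wav and is_supported are hypothetical.

```python
import wave

EXPECTED_RATE = 16000  # 16 kHz, per the note above


def inspect_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_supported(path):
    """True when the file is sampled at the expected 16 kHz rate."""
    rate, _, _ = inspect_wav(path)
    return rate == EXPECTED_RATE
```

Files recorded at another sample rate would need to be resampled to 16 kHz first.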
