m
mxy-yy/wav2vec2-base-960h-npu
模型介绍文件和版本Pull Requests讨论分析
下载使用量0

wav2vec2-base-960h-npu Ascend NPU 部署指南

项目简介

wav2vec2-base-960h 是 Facebook 提出的自监督语音识别模型 Wav2Vec2 的基线模型,在 960 小时的 Librispeech 数据集上进行微调。该模型将原始音频波形映射到潜在表示,然后通过 CTC 解码器进行语音识别。

特性

  • 支持 Ascend NPU 推理加速
  • CPU 与 NPU 精度对比测试(误差 < 1%)
  • 原始音频输入,无需特征提取
  • 16kHz 采样率支持
  • CTC 解码(32 token 词汇表)

环境信息

项目版本/内容
设备Ascend 910B

文件结构

wav2vec2-base-960h-ascend/
├── inference.py          # 推理测试脚本
├── test.log              # 测试日志
├── README.md             # 本文档

部署步骤

1. 设置环境变量

source /usr/local/Ascend/ascend-toolkit/set_env.sh

2. 准备模型文件

模型文件位于 /opt/atomgit/mxy/wav2vec2-base-960h/ 目录下:

  • model.safetensors - 模型权重 (约 378MB)
  • config.json - 模型配置
  • vocab.json - 词汇表

3. 安装依赖

pip install transformers torch_npu

4. 执行推理

cd wav2vec2-base-960h-ascend/
python3 inference.py --mode inference

Usage

Method 1: Normal Inference Mode

cd wav2vec2-base-960h-ascend/
python3 inference.py --mode inference --device npu:0

方式二:精度测试模式 (CPU vs NPU)

cd wav2vec2-base-960h-ascend/
python3 inference.py --mode precision_test

命令行参数说明

参数说明默认值
--mode测试模式: inference 或 precision_testinference
--device运行设备: npu:0, cuda:0, cpu, autoauto

测试验证

精度测试结果

指标实测值阈值状态
Logits 相对误差0.8027%< 1.00%✅ PASS
综合评估正常范围内-✅ PASS

性能数据

操作耗时
NPU 推理时间 (1s audio)~6.7s
CPU 推理时间 (1s audio, bfloat16)~9.7s

测试日志

============================================================
Wav2Vec2-Base-960h NPU Inference Test
============================================================
Model: /opt/atomgit/mxy/wav2vec2-base-960h
Output: /opt/atomgit/mxy/wav2vec2-base-960h-ascend
Device: auto
Using device: npu:0
Created test audio: /opt/atomgit/mxy/wav2vec2-base-960h-ascend/test_audio/test_1.wav
Created test audio:/opt/atomgit/mxy/wav2vec2-base-960h-ascend/test_audio/test_2.wav
============================================================
Loading Wav2Vec2-Base-960h model...
Model directory: /opt/atomgit/mxy/wav2vec2-base-960h
============================================================
Model type: Wav2Vec2ForCTC
Vocab size: 32
Hidden size: 768
Num hidden layers: 12
Sampling rate: 16000
============================================================
Processing: test_1.wav - Speech sample 1
Audio length: 16000 samples (1.00s)
Input shape: torch.Size([1, 16000])
Inference time: 6.749s
Logits shape: torch.Size([1, 49, 32])
Transcription:
============================================================
Processing: test_2.wav - Speech sample 2
Audio length: 16000 samples (1.00s)
Input shape: torch.Size([1, 16000])
Inference time: 0.017s
Logits shape: torch.Size([1, 49, 32])
Transcription:
============================================================
Inference Summary
============================================================
Total samples processed: 2
Total inference time: 6.766s
Average time per sample: 3.383s
============================================================
Test Complete!
============================================================

Python API 使用示例

基本语音识别

import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

MODEL_DIR = "/opt/atomgit/mxy/wav2vec2-base-960h"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_DIR)
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(MODEL_DIR)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_DIR)
model = model.to("npu:0")
model.eval()

audio = np.sin(2 * np.pi * 200 * np.linspace(0, 1, 16000)).astype(np.float32)

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to("npu:0") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]

print(f"Transcription: {transcription}")

模型结构

组件说明
feature_extractorCNN 特征提取器(7层,512通道)
encoderTransformer 编码器(12层,768 hidden)
masked_spec_embed掩码嵌入(预训练)
lm_headCTC 解码头(32 tokens)

推理参数配置

参数值
hidden_size768
num_hidden_layers12
num_attention_heads12
vocab_size32
采样率16000 Hz

注意事项

  1. 模型使用 NPU 进行推理加速
  2. CPU vs NPU 精度误差 < 1%,满足要求
  3. 首次推理会有算子编译开销
  4. 合成音频转录结果为空是正常现象