Ascend-SACT/speechscorer

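Introduction

speechscorer is a simple unsupervised tool for scoring spoken language. Given a speech utterance (audio), it assigns a score that reflects the intelligibility or the fluency/grammaticality of the speech.

 Key feature: it supports several pretrained models trained with different objectives, including HuBERT, Whisper, and WavLM.

 Scoring principle: the score is based on the model's internal "hesitation" (entropy) when predicting the input speech. How the entropy is computed depends on the model and its training objective.

This guide uses the Whisper model.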
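To make the scoring principle concrete, the sketch below computes the entropy of a softmax distribution at each decoding step and averages it over an utterance. It is a minimal illustration in plain Python, not the code speechscorer runs; the function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def step_entropy(logits):
    """Shannon entropy (in nats) of the distribution at one decoding step."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def mean_entropy(steps):
    """Average per-step entropy over all decoding steps of one utterance."""
    return sum(step_entropy(step) for step in steps) / len(steps)

# Toy logits over a 4-token vocabulary, 3 decoding steps each (made-up values).
confident = [[10.0, 0.0, 0.0, 0.0]] * 3   # one token clearly preferred
hesitant = [[1.0, 1.0, 1.0, 1.0]] * 3     # all tokens equally likely

print(mean_entropy(confident) < mean_entropy(hesitant))  # True
```

A model that hesitates spreads probability over many tokens, which raises the entropy; a confident model concentrates it on one token.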

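I. Preparing the Runtime Environment

Version compatibility

| Component | Version | Environment setup guide |
| --- | --- | --- |
| CANN | 8.2.rc1 | |
| Python | 3.10.12 | |
| torch | 2.5.1+cpu | |
| torch_npu | 2.5.1 | |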
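As a quick sanity check, the installed Python packages can be compared against the versions listed above. This is a sketch that relies only on the standard library; the `EXPECTED` dictionary and the helper names are hypothetical, and CANN is not checked here.

```python
import sys
from importlib import metadata

# Package versions copied from the compatibility list above.
EXPECTED = {"torch": "2.5.1+cpu", "torch_npu": "2.5.1"}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def report(expected=EXPECTED):
    """Map each package name to (installed, expected, matches)."""
    rows = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        rows[name] = (found, wanted, found == wanted)
    return rows

print("Python", sys.version.split()[0])
for name, (found, wanted, ok) in report().items():
    print(f"{name}: installed={found} expected={wanted} match={ok}")
```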

Hardware

| Device model | NPU configuration |
| --- | --- |
| Atlas 800I A2 | 910B, 1 card |

II. Cloning the speechscorer Code

$ git clone https://github.com/yaya-sy/speechscorer.git


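This automatically creates a speechscorer directory in the current directory. Its contents are shown in the following screenshot:

[Screenshot: files in the speechscorer directory]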

III. Downloading the Whisper Model

Create a model directory inside the speechscorer directory and download the Whisper model into it with modelscope:

$ mkdir speechscorer/model

$ modelscope download --model openai-mirror/whisper-base --local_dir ./speechscorer/model


The model directory then contains the following files:

[Screenshot: files in the model directory]
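Before running inference it can help to confirm that the download landed where the later commands expect it (`-m ./model`). The sketch below is only a convenience check: the file names in `REQUIRED_FILES` are an assumption about a Hugging Face-format Whisper checkpoint, and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names for a Hugging Face-format Whisper checkpoint.
REQUIRED_FILES = ("config.json", "preprocessor_config.json")

def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]

# Run from the cloned speechscorer directory.
missing = missing_model_files("./model")
print("missing files:", missing or "none")
```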

IV. Preparing the Test Set

Create a dataset directory inside the speechscorer directory and prepare the test set:

$ cd speechscorer

$ mkdir dataset

$ cd dataset

Prepare WAV speech files and place them in the dataset directory. For example, the VCC2018 test set can be downloaded with the following commands:

$ wget https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_submitted_systems_converted_speech.tar.gz

$ tar zxvf vcc2018_submitted_systems_converted_speech.tar.gz
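The inference script targets 16 kHz, mono, 16-bit PCM WAV input and converts anything else with FFmpeg. The sketch below previews which WAV files would need conversion using only the standard-library `wave` module, as a stand-in for the ffprobe check in main.py. The helper names are hypothetical, and like main.py it only looks at files directly inside the directory, not in subdirectories.

```python
import wave
from pathlib import Path

# Target format: 16 kHz, mono, 16-bit samples (2 bytes per sample).
TARGET = {"sample_rate": 16000, "channels": 1, "sample_width": 2}

def wav_properties(path):
    """Read sample rate, channel count and sample width from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width": wav.getsampwidth(),
        }

def needs_conversion(path):
    """True if the file does not already match the target format."""
    return wav_properties(path) != TARGET

def scan(dataset_dir):
    """Return (matching, to_convert) file-name lists for top-level *.wav files."""
    matching, to_convert = [], []
    for path in sorted(Path(dataset_dir).glob("*.wav")):
        (to_convert if needs_conversion(path) else matching).append(path.name)
    return matching, to_convert

# Usage, from the cloned speechscorer directory:
#   matching, to_convert = scan("./dataset")
```

Note that `wave` only reads uncompressed PCM files and raises `wave.Error` on other encodings.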

V. Editing the Inference Scripts

1. The main.py script

Edit speechscorer/main.py with vi:

import os
os.environ["NUMEXPR_MAX_THREADS"] = "16"
os.environ["NUMEXPR_NUM_THREADS"] = "16"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

import logging as py_logging
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()
hf_logging.disable_default_handler()

import subprocess
import tempfile
import json
from typing import Dict, Tuple, List, Type, Optional, Union
import glob
import importlib
import shutil
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", message="The following generation flags are not valid")

from .base_scorer import BaseScorer
from .mlm.wavlm_mlm_scorer import WavLMScorer
from .clm.whisper_clm_scorer import WhisperConditionalLanguageModelScorer
from .clm.wavlm_clm_scorer import WavLMConditionalLanguageModelScorer
from .clm.hubert_clm_scorer import HuBERTConditionalLanguageModelScorer
from .data_loader import DataLoader

from argparse import ArgumentParser
from typing import Iterable
import logging
from pathlib import Path
from tqdm import tqdm
import pandas as pd

logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

SCORER_MAP: Dict[str, Tuple[str, str]] = {
    "hubert-mlm": ("HuBERT-MLM", "speechscorer.mlm.hubert_mlm_scorer.HubertMLMScorer"),
    "wavlm-mlm": ("WavLM-MLM", "speechscorer.mlm.wavlm_mlm_scorer.WavLMScorer"),
    "whisper-clm": ("Whisper-CLM", "speechscorer.clm.whisper_clm_scorer.WhisperConditionalLanguageModelScorer"),
    "wavlm-clm": ("WavLM-CLM", "speechscorer.clm.wavlm_clm_scorer.WavLMConditionalLanguageModelScorer"),
    "hubert-clm": ("HuBERT-CLM", "speechscorer.clm.hubert_clm_scorer.HuBERTConditionalLanguageModelScorer")
}

WHISPER_AUDIO_CONFIG = {
    "sample_rate": 16000,
    "channels": 1,
    "codec": "pcm_s16le",
    "format": "wav"
}

PREVIEW_CONFIG = {
    "max_preview_rows": 10
}

PERFORMANCE_CONFIG = {
    "disable_tqdm_bar": False
}

LOGGER = logging.getLogger(__name__)

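def check_ffmpeg_installed() -> bool:
    """Return True if both ffmpeg and ffprobe are available on PATH."""
    if shutil.which("ffmpeg") is None or shutil.which("ffprobe") is None:
        LOGGER.error("❌ FFmpeg/ffprobe not found! Install FFmpeg first (sudo apt install ffmpeg, or download it from the official site)")
        return False
    return True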

def is_audio_whisper_compatible(audio_path: Path) -> Tuple[bool, Dict[str, Union[int, str]]]:
    """Check with ffprobe whether a file is already 16 kHz mono 16-bit PCM WAV."""
    cmd = [
        "ffprobe",
        "-v", "quiet",
        "-print_format", "json",
        "-show_streams",
        "-show_format",  # the container name is reported in the "format" section
        "-select_streams", "a:0",
        str(audio_path)
    ]
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        meta = json.loads(result.stdout)
        if not meta.get("streams"):
            return False, {"error": "no audio stream"}
        audio_info = meta["streams"][0]
        sample_rate = int(audio_info.get("sample_rate", 0))
        channels = int(audio_info.get("channels", 0))
        codec_name = audio_info.get("codec_name", "").lower()
        # format_name belongs to the container, not the stream; reading it from the
        # stream entry always yields "" and would force every file to be converted.
        format_name = meta.get("format", {}).get("format_name", "").lower()
        is_compatible = (
            sample_rate == WHISPER_AUDIO_CONFIG["sample_rate"] and
            channels == WHISPER_AUDIO_CONFIG["channels"] and
            codec_name in ["pcm_s16le", "pcm_s16be", "pcm_u16le", "pcm_u16be"] and
            format_name == WHISPER_AUDIO_CONFIG["format"]
        )
        return is_compatible, {
            "sample_rate": sample_rate,
            "channels": channels,
            "codec": codec_name,
            "format": format_name
        }
    except (subprocess.CalledProcessError, json.JSONDecodeError, KeyError, ValueError) as e:
        LOGGER.warning(f"⚠️ Failed to parse audio format: {audio_path.name} (error: {str(e)})")
        return False, {"error": str(e)}

def convert_audio(input_path: Path, output_path: Optional[Path] = None) -> Path:
    if output_path is None:
        temp_dir = tempfile.gettempdir()
        output_path = Path(temp_dir) / f"whisper_converted_{input_path.stem}.wav"
    cmd = [
        "ffmpeg",
        "-i", str(input_path),
        "-ar", str(WHISPER_AUDIO_CONFIG["sample_rate"]),
        "-ac", str(WHISPER_AUDIO_CONFIG["channels"]),
        "-acodec", WHISPER_AUDIO_CONFIG["codec"],
        "-f", WHISPER_AUDIO_CONFIG["format"],
        "-y",
        "-loglevel", "quiet",
        str(output_path)
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        return output_path
    except subprocess.CalledProcessError as e:
        LOGGER.warning(f"⚠️ Audio conversion failed: {input_path.name}")
        raise RuntimeError(f"Audio conversion failed: {input_path.name}") from e

def run_scorer(scorer: BaseScorer,
               dataloader: DataLoader,
               padding="max_length",
               max_length: int=None,
               batch_size: int=4
               ) -> Iterable[tuple]:
    total = dataloader.sample_size
    bar = tqdm(total=total, desc="Batch inference progress", disable=PERFORMANCE_CONFIG["disable_tqdm_bar"])
    for x, utterance_ids in dataloader(padding=padding, max_length=max_length, batch_size=batch_size):
        x = x.to(scorer.device, non_blocking=True)
        results = scorer.scores(x=x)
        assert len(results) == len(utterance_ids), "Mismatch between utterances and their metrics."
        for utterance_id, result in zip(utterance_ids, results):
            result["utterance_id"] = utterance_id
            yield result
        bar.update(x.shape[0])
    bar.close()

def init_scorer(scorer_name: str,
                model_checkpoint,
                use_gpu: bool=False,
                use_npu: bool=False
                ) -> BaseScorer:
    scorer_info = SCORER_MAP.get(scorer_name)
    if not scorer_info:
        raise ValueError(f"Unsupported scorer: {scorer_name}")
    scorer_cn, scorer_path = scorer_info
    try:
        module_name, class_name = scorer_path.rsplit(".", 1)
        scorer_module = importlib.import_module(module_name)
        scorer_class = getattr(scorer_module, class_name)
    except ImportError as e:
        if scorer_name == "hubert-mlm":
            LOGGER.warning(f"⚠️ fairseq is not installed! HubertMLMScorer will be disabled. Install fairseq to use this scorer.")
        raise ImportError(f"Failed to import scorer {scorer_name}: {str(e)}") from e
    scorer = scorer_class(model_checkpoint, use_gpu, use_npu)
    LOGGER.warning(f"✅ Initialized scorer: {scorer_cn} (device: {scorer.device})")
    return scorer

def get_args():
    parser = ArgumentParser()
    parser.add_argument("-a", "--audio",
                        type=str,
                        nargs="+",
                        help="Three input modes: 1. single file: -a audio.wav  2. multiple files: -a audio1.wav audio2.wav  3. folder: -a ./audio_folder",
                        required=True)
    parser.add_argument("-m", "--model_checkpoint",
                        type=str,
                        default="openai/whisper-base.en",
                        required=False)
    parser.add_argument("-p", "--processor_checkpoint",
                        type=str,
                        required=False)
    parser.add_argument("-s", "--scorer",
                        type=str,
                        choices=SCORER_MAP.keys(),
                        default="whisper-clm",
                        required=False)
    parser.add_argument("-b", "--batch_size",
                        type=int,
                        default=16,
                        required=False)
    parser.add_argument("-d", "--padding",
                        type=str,
                        default="max_length",
                        required=False)
    parser.add_argument("-l", "--max_length",
                        type=int,
                        required=False,
                        default=None)
    parser.add_argument('--use-gpu',
                        action="store_true",
                        default=False)
    parser.add_argument('--use-npu',
                        action="store_true",
                        default=False)
    parser.add_argument("-o", "--output",
                        type=str,
                        default="batch_results.csv",
                        required=False)
    parser.add_argument("--keep-converted",
                        action="store_true",
                        default=False)
    parser.add_argument("--disable-tqdm",
                        action="store_true",
                        default=False)
    return parser.parse_args()

def print_smart_preview(df: pd.DataFrame):
    """Print every row for small results, otherwise only the first max_preview_rows rows."""
    total_rows = len(df)
    print("\n========== Inference result preview ==========")
    if total_rows == 0:
        print("⚠️ No inference results")
    elif total_rows <= PREVIEW_CONFIG["max_preview_rows"]:
        print(f"📊 {total_rows} results (all shown):")
        print(df.to_string(index=False))
    else:
        print(f"📊 {total_rows} results (showing the first {PREVIEW_CONFIG['max_preview_rows']}):")
        print(df.head(PREVIEW_CONFIG["max_preview_rows"]).to_string(index=False))
    print("==============================================\n")

def main():
    args = get_args()
    output_path = Path(args.output)
    output_path.parent.mkdir(exist_ok=True, parents=True)
    PERFORMANCE_CONFIG["disable_tqdm_bar"] = args.disable_tqdm
    if not check_ffmpeg_installed():
        exit(1)
    scorer = init_scorer(args.scorer, args.model_checkpoint, args.use_gpu, args.use_npu)
    processor_checkpoint = args.model_checkpoint if args.processor_checkpoint is None else args.processor_checkpoint
    input_audios: List[Path] = []
    for audio_path_str in args.audio:
        audio_path = Path(audio_path_str)
        if audio_path.is_file():
            input_audios.append(audio_path)
        elif audio_path.is_dir():
            audio_extensions = ["*.wav", "*.mp3", "*.flac", "*.m4a", "*.ogg", "*.aac", "*.wma"]
            for ext in audio_extensions:
                folder_files = glob.glob(str(audio_path / ext), recursive=False)
                input_audios.extend([Path(f) for f in folder_files])
        else:
            LOGGER.warning(f"⚠️ Invalid path (ignored): {audio_path.absolute()}")
    input_audios = list(set(input_audios))
    if not input_audios:
        raise ValueError("No valid audio files found!")
    print(f"📥 Audio files to process: {len(input_audios)}")
    
    final_audios: List[Path] = []
    temp_files: List[Path] = []
    compatible_count = 0
    converted_count = 0
    failed_count = 0
    print(f"🔍 Validating audio formats and converting where needed (target: {WHISPER_AUDIO_CONFIG['sample_rate']} Hz, mono, PCM16)...")
    
    for audio_path in tqdm(input_audios, desc="Audio processing progress", disable=args.disable_tqdm):
        try:
            is_compatible, _ = is_audio_whisper_compatible(audio_path)
            if is_compatible:
                final_audios.append(audio_path)
                compatible_count += 1
            else:
                if args.keep_converted:
                    output_dir = audio_path.parent / "converted_audios"
                    output_dir.mkdir(exist_ok=True)
                    output_audio_path = output_dir / f"{audio_path.stem}_converted.wav"
                    converted_path = convert_audio(audio_path, output_audio_path)
                else:
                    converted_path = convert_audio(audio_path)
                    temp_files.append(converted_path)
                final_audios.append(converted_path)
                converted_count += 1
        except RuntimeError:
            failed_count += 1
            continue
    
    if not final_audios:
        raise RuntimeError("No usable audio files for inference!")
    print(f"✅ Audio processing done: already compatible {compatible_count} | converted {converted_count} | failed {failed_count}")
    
    print(f"🚀 Starting batch inference (device: {scorer.device}, batch size: {args.batch_size})...")
    try:
        dataloader = DataLoader(final_audios, processor_checkpoint)
        results = run_scorer(scorer,
                             dataloader,
                             padding=args.padding,
                             max_length=args.max_length,
                             batch_size=args.batch_size)
        df = pd.DataFrame(results)
        
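        # Map each scored file's stem back to the original file name. The map is derived
        # from the naming rules instead of list positions, because final_audios omits
        # files whose conversion failed and so does not line up with input_audios.
        utterance_map = {orig.stem: orig.name for orig in input_audios}
        for orig in input_audios:
            # Temporary conversions are named whisper_converted_<stem>.wav;
            # kept conversions (--keep-converted) are named <stem>_converted.wav.
            utterance_map.setdefault(f"whisper_converted_{orig.stem}", orig.name)
            utterance_map.setdefault(f"{orig.stem}_converted", orig.name)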
        
        def map_utterance_id(uid):
            uid_stem = Path(uid).stem
            return utterance_map.get(uid_stem, uid)
        
        df["utterance_id"] = df["utterance_id"].apply(map_utterance_id)
        df.to_csv(output_path, index=None, encoding="utf-8", chunksize=10000)
        print(f"🎉 Inference finished! Results saved to: {output_path.absolute()}")
        print_smart_preview(df)
    finally:
        if not args.keep_converted and temp_files:
            for temp_file in temp_files:
                if temp_file.exists():
                    temp_file.unlink()
            print("🗑️  Temporary files cleaned up")

if __name__ == "__main__":
    main()
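The script resolves a scorer from `SCORER_MAP` by splitting a dotted path into module and class name and importing the module on demand. The sketch below isolates that pattern so it can be tried without the speechscorer package. `DEMO_MAP` and `load_class` are hypothetical, and standard-library classes stand in for the scorer classes.

```python
import importlib

def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Hypothetical registry: key -> (display name, dotted path), as in SCORER_MAP.
DEMO_MAP = {
    "ordered": ("OrderedDict", "collections.OrderedDict"),
    "fraction": ("Fraction", "fractions.Fraction"),
}

label, path = DEMO_MAP["fraction"]
cls = load_class(path)
print(label, cls(1, 2) + cls(1, 3))  # Fraction 5/6
```

Importing lazily means a scorer whose dependency is missing (fairseq for hubert-mlm, for example) only fails when that scorer is actually requested.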

2. The base_clm_scorer.py script

Edit speechscorer/clm/base_clm_scorer.py with vi:

from ..base_scorer import BaseScorer
from logging import Logger
from typing import Union
from pathlib import Path
import torch
import logging

class BaseConditionalLanguageModelScorer(BaseScorer):
    """
    Base scorer for conditional language models.
    The logits are the scores over the vocabulary
    at each step of the predicted transcription,
    so the entropy computed on those logits can be
    seen as the hesitation of the model when transcribing
    the input speech.
    """
    def __init__(self, model_checkpoint: Union[Path, str], use_gpu: bool = False, use_npu: bool = False):
        super().__init__(model_checkpoint, use_gpu)
        self.use_npu = use_npu
        self.logger = logging.getLogger(__name__)
        self.logger.info("=== NPU initialization check: switch status ===")
        self.logger.info(f"use_npu enabled: {self.use_npu}")
        self.logger.info(f"use_gpu enabled: {use_gpu}")

    def load_model(self,
                   model_class,
                   logger: Logger) -> None:
        logger.info("Loading the model...")
        self.model = model_class.from_pretrained(self.model_checkpoint).eval()
        
        logger.info("\n=== NPU device check details ===")
        # Condition 1: the use_npu switch is on
        cond1 = self.use_npu
        logger.info(f"Condition 1 - use_npu switch on: {cond1} (must be True)")
        # Condition 2: PyTorch exposes the NPU extension (hasattr(torch, 'npu'))
        cond2 = hasattr(torch, 'npu')
        logger.info(f"Condition 2 - PyTorch has the NPU extension: {cond2} (must be True; otherwise the NPU build of PyTorch is not installed)")
        # Condition 3: the NPU hardware/driver is usable (torch.npu.is_available())
        cond3 = torch.npu.is_available() if cond2 else False
        logger.info(f"Condition 3 - NPU hardware available: {cond3} (must be True; otherwise there is a driver or hardware problem)")

        if cond1 and cond2 and cond3:
            self.device = torch.device('npu')
            logger.info("\n✅ All three NPU conditions are met; using the NPU device!")
        elif self.use_gpu and torch.cuda.is_available():
            self.device = torch.device('cuda')
            logger.info("\n⚠️ NPU conditions not met; falling back to the GPU device")
        else:
            self.device = torch.device('cpu')
            logger.info("\n❌ Neither NPU nor GPU is available; using the CPU device")

        logger.info(f"Using device {self.device}")
        self.model.to(self.device)
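The device selection in `load_model` follows a fixed fallback order: NPU, then GPU, then CPU. The sketch below restates that order as a pure function so the decision can be checked without torch or any accelerator. `pick_device` and its boolean parameters are hypothetical names that stand in for the three NPU conditions and the GPU check.

```python
def pick_device(use_npu, has_npu_extension, npu_available, use_gpu, cuda_available):
    """Mirror the NPU -> GPU -> CPU fallback order as a pure function."""
    # NPU needs all three conditions: switch on, extension present, hardware usable.
    if use_npu and has_npu_extension and npu_available:
        return "npu"
    # GPU is only used when requested and CUDA is actually available.
    if use_gpu and cuda_available:
        return "cuda"
    return "cpu"

# --use-npu was passed but the NPU extension is not loaded, and no GPU was requested.
print(pick_device(True, False, False, False, True))  # cpu
```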

3. The whisper_clm_scorer.py script

Edit speechscorer/clm/whisper_clm_scorer.py with vi:

from .base_clm_scorer import BaseConditionalLanguageModelScorer
from typing import Union, Optional, Dict
from pathlib import Path
import logging

from transformers import WhisperForConditionalGeneration  
from torch import Tensor
import torch

LOGGER = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)

class WhisperConditionalLanguageModelScorer(BaseConditionalLanguageModelScorer):
    """
    This class implements a Whisper conditional language model\
    based scorer.
    """
    def __init__(self, model_checkpoint: Union[Path, str], use_gpu: bool = False, use_npu: bool = False):
        LOGGER.info("\n=== Whisper scorer initialization ===")
        LOGGER.info(f"Received use_npu: {use_npu}")
        LOGGER.info(f"Received use_gpu: {use_gpu}")
        super().__init__(model_checkpoint, use_gpu, use_npu)
        self.load_model(model_class=WhisperForConditionalGeneration, logger=LOGGER)
        self.generate_kwargs = {
            "language": "en",
            "task": "transcribe",
            "return_dict_in_generate": True,
            "output_scores": True,
            "num_beams": 1,
            "do_sample": False,
            "max_length": 80,
            "pad_token_id": self.model.config.pad_token_id,
            "eos_token_id": self.model.config.eos_token_id,
            "forced_decoder_ids": None,
        }
    
    def forward_model(self,
                      x: Tensor,
                      **kwargs
                      ) -> Dict[str, Tensor]:
        with torch.no_grad():
            x = x.to(self.device)
            
            # Normalize the input layout to what Whisper expects: [batch_size, feature_dim, seq_len].
            # Handled layouts: [batch_size, seq_len], [batch_size, seq_len, feature_dim],
            # and [batch_size, 80, seq_len] (the standard Whisper log-Mel input).
            if len(x.shape) == 2:
                # [batch_size, seq_len] -> [batch_size, 1, seq_len]. Whisper's encoder expects
                # 80 Mel bins, so inputs should normally come from the Whisper feature extractor.
                x = x.unsqueeze(1)
            elif len(x.shape) == 3 and x.shape[1] != 80:
                # [batch_size, seq_len, feature_dim] -> [batch_size, feature_dim, seq_len]
                x = x.permute(0, 2, 1)
            
            # Build the mask after normalization so seq_len is read from the final layout.
            batch_size = x.shape[0]  # dimension 0 is the batch size
            seq_len = x.shape[-1]    # the last dimension is now the sequence length
            attention_mask = torch.ones((batch_size, seq_len), device=self.device)
           
            outputs = self.model.generate(
                input_features=x,            
                attention_mask=attention_mask, 
                **self.generate_kwargs        
            )
            
            logits = outputs.scores
            logits = torch.stack(logits, 1)
            logits[logits == float("-Inf")] = 1e-12

        return logits
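With `output_scores` enabled, generation returns one score matrix per decoding step, and `torch.stack(logits, 1)` rearranges them into a per-utterance layout of [batch, steps, vocab]. The sketch below reproduces only that rearrangement with nested Python lists, so it runs without torch; `stack_steps` and the toy numbers are hypothetical.

```python
def stack_steps(scores):
    """Turn steps x batch x vocab nested lists into batch x steps x vocab."""
    return [list(per_utterance) for per_utterance in zip(*scores)]

# Three decoding steps, a batch of two utterances, a vocabulary of two tokens.
scores = [
    [[0.1, 0.9], [0.4, 0.6]],  # step 0: one row per utterance
    [[0.8, 0.2], [0.5, 0.5]],  # step 1
    [[0.3, 0.7], [0.2, 0.8]],  # step 2
]
stacked = stack_steps(scores)

# Dimensions are now batch, steps, vocab.
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 2 3 2
```

After stacking, all steps of one utterance sit together, which is the layout needed to compute a per-utterance entropy.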

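4. Input and output parameters

A. Input parameters

| Short | Full name | Description | Accepted values / format | Default | Examples |
| --- | --- | --- | --- | --- | --- |
| -a | --audio | Required. Input audio (single file, multiple files, or a folder) | File path / folder path | - | -a ./dataset, -a audio1.wav audio2.mp3 |
| -m | --model_checkpoint | Path to the model weights (local or remote) | Local path / Hugging Face ID | openai/whisper-base.en | -m ./local_whisper, -m openai/whisper-small |
| -p | --processor_checkpoint | Path to the audio processor (defaults to the model path) | Local path / Hugging Face ID | Same as -m | -p ./custom_processor |
| -s | --scorer | Scorer type | hubert-mlm / wavlm-mlm / whisper-clm / wavlm-clm / hubert-clm | whisper-clm | -s whisper-clm |
| -b | --batch_size | Batch size for inference (affects speed and memory use) | Positive integer | 16 | -b 32 (recommended for NPU), -b 64 (GPU with large memory) |
| -d | --padding | Padding strategy for audio sequences | Name of a padding strategy | max_length | -d max_length |
| -l | --max_length | Maximum audio sequence length (optional; makes input lengths uniform) | Positive integer | None (chosen automatically) | -l 3000 |
| - | --use-gpu | Enable GPU acceleration (boolean flag, takes no value) | - | False | --use-gpu |
| - | --use-npu | Enable Ascend NPU acceleration (boolean flag, takes no value) | - | False | --use-npu |
| -o | --output | Path where the results are saved (CSV format) | File path | batch_results.csv | -o ./results/score.csv |
| - | --keep-converted | Keep the audio files produced by format conversion | - | False | --keep-converted |
| - | --disable-tqdm | Disable progress bars (reduces overhead for large-scale runs) | - | False | --disable-tqdm |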
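The sketch below reproduces a small subset of these options with `argparse` to show how a multi-value `-a` argument and a boolean flag are parsed. It is a reduced illustration, not the full parser from main.py; `build_parser` is a hypothetical helper.

```python
from argparse import ArgumentParser

def build_parser():
    """A reduced parser with four of the options listed above."""
    parser = ArgumentParser()
    # nargs="+" collects one or more paths into a list.
    parser.add_argument("-a", "--audio", type=str, nargs="+", required=True)
    parser.add_argument("-b", "--batch_size", type=int, default=16)
    # store_true flags take no value; passing the flag sets it to True.
    parser.add_argument("--use-npu", action="store_true", default=False)
    parser.add_argument("-o", "--output", type=str, default="batch_results.csv")
    return parser

args = build_parser().parse_args(["-a", "audio1.wav", "audio2.mp3", "--use-npu", "-b", "32"])
print(args.audio, args.batch_size, args.use_npu, args.output)
```

Note that argparse exposes `--use-npu` as the attribute `args.use_npu`, with the hyphen replaced by an underscore.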

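B. Output parameters

| utterance_id | perplexity | entropy |
| --- | --- | --- |
| Audio file name | How "perplexed" the model is by the data: the lower the perplexity, the more accurately the model predicts the data and the better it fits or understands it; the higher the perplexity, the less familiar the data is to the model and the less certain its predictions. | The uncertainty of the data itself, or the amount of information it carries: the higher the entropy, the more random the data and the more information it carries; the lower the entropy, the more regular the data and the less information it carries. |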
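For reference, the textbook definition of perplexity is the exponential of the mean negative log-probability assigned to the predicted tokens. The sketch below implements that definition on made-up probabilities; it is an assumption that speechscorer follows the same formula, and `perplexity` here is a hypothetical helper rather than the project's function.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Every token predicted with probability 0.5 gives a perplexity of 2.
print(round(perplexity([0.5, 0.5, 0.5]), 6))  # 2.0

# Confident predictions give a lower perplexity than uncertain ones.
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.3, 0.2, 0.4]))  # True
```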

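VI. Running the Inference Script

The commands below use paths relative to the cloned speechscorer directory (./dataset, ./model), so run them from that directory.

1. Inference on a single audio file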

$ python -m speechscorer.main -a ./dataset/input.wav -s whisper-clm --use-npu -b 16 -o ./batch_results.csv -m ./model


2. Batch inference on multiple audio files

$ python -m speechscorer.main -a ./dataset -s whisper-clm --use-npu -b 16 -o ./batch_results.csv -m ./model
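When the same run has to be repeated for several folders or output files, the command line can be assembled in Python. The sketch below builds the argument list for the command shown above; `build_command` is a hypothetical helper, and its defaults simply copy the flags used in this guide.

```python
import shlex

def build_command(audio, model="./model", output="./batch_results.csv",
                  scorer="whisper-clm", batch_size=16, use_npu=True):
    """Assemble the argument list for python -m speechscorer.main."""
    cmd = ["python", "-m", "speechscorer.main", "-a", *audio,
           "-s", scorer, "-b", str(batch_size), "-o", output, "-m", model]
    if use_npu:
        cmd.append("--use-npu")
    return cmd

cmd = build_command(["./dataset"])
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.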


VII. Checking the Results

[Screenshot: inference results]
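The results file can also be inspected programmatically. The sketch below reads the CSV with the standard library and sorts the utterances by entropy. It assumes the column names utterance_id, perplexity, and entropy described in the output parameters; `load_scores` and `rank_by_entropy` are hypothetical helpers.

```python
import csv

def load_scores(csv_path):
    """Read the results CSV and convert the score columns to floats."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        row["perplexity"] = float(row["perplexity"])
        row["entropy"] = float(row["entropy"])
    return rows

def rank_by_entropy(rows):
    """Lowest entropy first, i.e. the utterances the model hesitated on least."""
    return sorted(rows, key=lambda row: row["entropy"])

# Usage, after an inference run has written ./batch_results.csv:
#   for row in rank_by_entropy(load_scores("./batch_results.csv")):
#       print(row["utterance_id"], row["entropy"], row["perplexity"])
```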