
Fine-tuned whisper-medium model for French speech recognition

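This model is a fine-tuned version of openai/whisper-medium, trained on a composite dataset of more than 2200 hours of French speech audio. The training and validation data come from the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, Fleurs, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that your speech input is sampled at 16 kHz. This model does not predict casing or punctuation.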
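Since the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper names (`make_silent_wav`, `wav_sample_rate`, `needs_resampling`) are hypothetical and are not part of this model or of `transformers`, and the sketch only handles uncompressed WAV data.

```python
import io
import wave

EXPECTED_RATE = 16_000  # Hz, the input rate stated in the model description


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Build an in-memory mono 16-bit PCM WAV of silence (demo input only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_out:
        wav_out.setnchannels(1)
        wav_out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sampling rate (Hz) from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_in:
        return wav_in.getframerate()


def needs_resampling(wav_bytes: bytes, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True when the audio is not already at the expected rate."""
    return wav_sample_rate(wav_bytes) != expected_rate


if __name__ == "__main__":
    for rate in (16_000, 48_000):
        print(rate, "needs resampling:", needs_resampling(make_silent_wav(rate)))
```

If a file does need resampling, a dedicated resampler such as the `torchaudio.transforms.Resample` call shown in the low-level API example is the safer choice.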

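Performance

Below are the word error rates (WER) of the pre-trained models on Common Voice 9.0, Multilingual LibriSpeech, Voxpopuli, and Fleurs. These results are taken from the original Whisper paper.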

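| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
| --- | --- | --- | --- | --- |
| openai/whisper-small | 22.7 | 16.2 | 15.7 | 15.0 |
| openai/whisper-medium | 16.0 | 8.9 | 12.2 | 8.7 |
| openai/whisper-large | 14.7 | 8.9 | 11.0 | 7.7 |
| openai/whisper-large-v2 | 13.9 | 7.3 | 11.4 | 8.3 |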
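For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation script behind the table; `word_error_rate` is a hypothetical name. It returns a fraction, whereas the table appears to list percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion of a reference word
                    current_row[j - 1] + 1,  # insertion of a hypothesis word
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "le chat dort sur le lit"
    hypothesis = "le chien dort sur lit"
    # One substitution (chat -> chien) and one deletion (le) over 6 words.
    print(f"WER = {100 * word_error_rate(reference, hypothesis):.1f}%")
```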

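Below are the WERs of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, and Fleurs. Note that these evaluation datasets were filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. Results are reported as WER (greedy search) / WER (beam search with beam width 5).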

| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
| --- | --- | --- | --- | --- |
| bofenghuang/whisper-small-cv11-french | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| bofenghuang/whisper-medium-cv11-french | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| bofenghuang/whisper-medium-french | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| bofenghuang/whisper-large-v2-cv11-french | 8.05 / 7.67 | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| bofenghuang/whisper-large-v2-french | 8.15 / 7.83 | 4.20 / 4.03 | 9.10 / 8.66 | 5.22 / 4.98 |
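The preprocessing described above (French alphabet characters only, apostrophes kept, other punctuation removed) could look roughly like the sketch below. The exact character set and rules used for the reported scores are not given here, so `FRENCH_LETTERS`, the lowercasing step, and the choice to turn hyphens into spaces are all assumptions, and `normalize_french` is a hypothetical helper.

```python
import re
import unicodedata

# Assumed character set: the 26 basic letters plus common French
# accented letters and ligatures. The original filter may differ.
FRENCH_LETTERS = "abcdefghijklmnopqrstuvwxyzàâäæçéèêëîïôœùûüÿ"
_DISALLOWED = re.compile(f"[^{FRENCH_LETTERS}' ]+")


def normalize_french(text: str) -> str:
    """Lowercase, keep French letters and apostrophes, drop everything else."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace("\u2019", "'")  # typographic apostrophe -> straight apostrophe
    text = _DISALLOWED.sub(" ", text)  # punctuation, digits, hyphens become spaces
    return " ".join(text.split())  # collapse repeated whitespace


if __name__ == "__main__":
    print(normalize_french("Bonjour, l\u2019été arrive-t-il ?"))
```

Applying the same normalisation to both references and model output before scoring keeps the comparison consistent, since the model itself emits no casing or punctuation.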

Usage

Inference with 🤗 Pipeline

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-french", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
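The commented-out call above switches from greedy decoding to beam search with 5 beams. As a toy illustration of why the two can give different results, the sketch below runs both strategies over a hand-written next-token table. `TOY_MODEL` and `beam_search` are hypothetical and unrelated to Whisper's actual decoder or to the `transformers` generation code.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix decoded so far.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"A": 0.4, "B": 0.3, "C": 0.3},
    ("B",): {"A": 0.9, "B": 0.05, "C": 0.05},
    ("C",): {"A": 0.4, "B": 0.3, "C": 0.3},
}


def beam_search(model, steps, num_beams):
    """Return the best (tokens, log_probability) pair after `steps` tokens.

    With num_beams=1 this reduces to greedy decoding.
    """
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, probability in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(probability)))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


if __name__ == "__main__":
    for width in (1, 2):
        tokens, log_probability = beam_search(TOY_MODEL, steps=2, num_beams=width)
        print(f"num_beams={width}: {tokens} p={math.exp(log_probability):.2f}")
```

In this toy table, greedy decoding commits to the locally best first token and ends on a less probable sequence, while a wider beam keeps the alternative alive long enough to find the better one. Beam search costs more computation per step.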

Inference with 🤗 low-level APIs

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-french", language="french", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary