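This model is a fine-tuned version of openai/whisper-medium, trained on a composite dataset of more than 2,200 hours of French speech audio. The training and validation data come from the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, Fleurs, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that the speech input is sampled at 16 kHz. The model does not predict casing or punctuation.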
Below are the word error rates (WER) of the pre-trained models on Common Voice 9.0, Multilingual LibriSpeech, Voxpopuli, and Fleurs. These results are taken from the original paper.
| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
|---|---|---|---|---|
| openai/whisper-small | 22.7 | 16.2 | 15.7 | 15.0 |
| openai/whisper-medium | 16.0 | 8.9 | 12.2 | 8.7 |
| openai/whisper-large | 14.7 | 8.9 | 11.0 | 7.7 |
| openai/whisper-large-v2 | 13.9 | 7.3 | 11.4 | 8.3 |
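Below are the word error rates (WER) of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, and Fleurs. Note that these evaluation datasets have been filtered and preprocessed to keep only French alphabet characters and to remove punctuation other than apostrophes. Results in the table are reported as WER (greedy search) / WER (beam search with a beam width of 5).
| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |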
|---|---|---|---|---|
| bofenghuang/whisper-small-cv11-french | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| bofenghuang/whisper-medium-cv11-french | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| bofenghuang/whisper-medium-french | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| bofenghuang/whisper-large-v2-cv11-french | 8.05 / 7.67 | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| bofenghuang/whisper-large-v2-french | 8.15 / 7.83 | 4.20 / 4.03 | 9.10 / 8.66 | 5.22 / 4.98 |
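For readers unfamiliar with the metric in the tables above, here is a minimal sketch of how a word error rate can be computed for a single reference/hypothesis pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not necessarily the evaluation script used to produce the reported numbers. The tables report percentages, so the returned fraction would be multiplied by 100.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER of `hypothesis` against `reference` as a fraction.

    Assumption: both strings are already normalised and whitespace-tokenised.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("le chat dort", "le chien dort"))
```

A corpus-level score is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.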
Inference with 🤗 Pipeline
import torch
from datasets import load_dataset
from transformers import pipeline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-french", device=device)
# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")
# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]
# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"] # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"] # beam search
# Normalise predicted sentences if necessary
Inference with 🤗 low-level APIs
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-french", language="french", task="transcribe")
# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")
# 16_000
model_sample_rate = processor.feature_extractor.sampling_rate
# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]
# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)
# Get feat
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)
# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225) # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5) # beam search
# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Normalise predicted sentences if necessary
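The last comment leaves normalisation to the reader. Below is a minimal sketch of one possible normaliser, assuming it should mirror the evaluation preprocessing described earlier: lowercase the text, keep only French letters and apostrophes, and drop all other punctuation. The function name `normalize_french`, the exact character set, and the choice to replace removed characters (including hyphens) with spaces are assumptions, not the author's actual script.

```python
import re
import unicodedata

# Assumption: anything that is not a lowercase French letter, an apostrophe,
# or a space is treated as noise and replaced by a space.
_DISALLOWED_CHARS = re.compile(r"[^a-zàâäçéèêëîïôöùûüÿœæ' ]")


def normalize_french(text: str) -> str:
    """Lowercase `text` and keep only French letters, apostrophes and spaces."""
    # Compose accented characters so that "e" + combining accent becomes "é".
    text = unicodedata.normalize("NFC", text).lower()
    # Map the typographic apostrophe to the plain ASCII one.
    text = text.replace("’", "'")
    text = _DISALLOWED_CHARS.sub(" ", text)
    # Collapse the runs of whitespace left behind by removed characters.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_french("Bonjour, l’été est là !"))
```

Applying the same function to both the reference transcripts and the model output keeps the comparison consistent when computing WER.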