Whisper Large Chinese (Mandarin)

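This model is a fine-tuned version of openai/whisper-large-v2 for Chinese (Mandarin), trained on the train and validation splits of Common Voice 11. Not all of the validation split was used for training: I held out 1k samples from it for evaluation during fine-tuning.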
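The hold-out described above can be sketched roughly as follows. This is only an illustration under assumptions: the source does not say how the 1k samples were chosen, so the random selection, the seed, the function name `split_validation`, and the example validation size of 5000 are all hypothetical.

```python
import random


def split_validation(num_validation: int, num_eval: int = 1000, seed: int = 0):
    """Split validation indices into a held-out evaluation part and the rest.

    Assumption: the held-out samples are drawn at random; the source only
    says that 1k validation samples were set aside for evaluation.
    """
    if num_eval > num_validation:
        raise ValueError("num_eval cannot exceed the validation set size")
    indices = list(range(num_validation))
    random.Random(seed).shuffle(indices)
    eval_indices = sorted(indices[:num_eval])          # used only for evaluation
    extra_train_indices = sorted(indices[num_eval:])   # added to the training data
    return eval_indices, extra_train_indices


# Hypothetical validation size, for illustration only.
eval_idx, extra_train_idx = split_validation(num_validation=5000)
print(len(eval_idx), len(extra_train_idx))
```

The two index lists are disjoint, so no evaluation sample leaks into the training data.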

Usage


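from transformers import pipeline

# Load the fine-tuned checkpoint as a speech-recognition pipeline.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-zh-cv11",
)

# Force the decoder to transcribe (not translate) in Chinese.
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(
        language="zh",
        task="transcribe",
    )
)

# The pipeline returns a dict; the transcript is under the "text" key.
transcription = transcriber("path/to/my_audio.wav")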

Evaluation

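I evaluated the model on the test splits of two datasets: Common Voice 11 (the same dataset used for fine-tuning) and Fleurs (a dataset not seen during fine-tuning). Because Whisper can transcribe casing and punctuation, I evaluated the model in two scenarios: one using the raw text and one using normalized text (lowercased, with punctuation removed). For Fleurs, I also evaluated a scenario that excludes numeric transcriptions, because numeric values are written differently in that dataset than in the fine-tuning dataset (Common Voice), and this difference is expected to hurt the model's performance on such transcriptions in Fleurs.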
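To make the scenarios above concrete, here is a minimal, dependency-free sketch of normalization, CER, WER, and a numeric-sample filter. It is not the evaluation script behind the reported numbers. The helper names are hypothetical, the normalizer removes characters in Unicode punctuation categories, WER splits on whitespace, and "numeric" is assumed to mean "the reference contains a digit character"; the source does not specify any of these details.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase the text and drop characters in Unicode punctuation categories."""
    return "".join(
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units) -> float:
    """Edit distance divided by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: every character is one unit."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: units are whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def is_numeric_sample(reference: str) -> bool:
    """Assumption: a sample is 'numeric' if its reference contains a digit."""
    return any(ch.isdigit() for ch in reference)


reference = "你好，世界。"
hypothesis = "你好视界"
print(cer(reference, hypothesis))                        # raw text
print(cer(normalize(reference), normalize(hypothesis)))  # normalized text
print(wer(normalize(reference), normalize(hypothesis)))  # whitespace-based WER
```

These functions return fractions, so multiply by 100 to compare with percentage-style figures. A whitespace-split WER counts an unsegmented Chinese sentence as a single word, so one wrong character makes the whole sentence an error. That may explain why WER figures for Chinese sit far above CER figures, though this is an interpretation rather than something the source states.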

Common Voice 11

Model	Character error rate (CER)	Word error rate (WER)
jonatasgrosman/whisper-large-zh-cv11	9.31	55.94
jonatasgrosman/whisper-large-zh-cv11 + text normalization	9.55	55.02
openai/whisper-large-v2	33.33	101.80
openai/whisper-large-v2 + text normalization	29.90	95.91

Fleurs

Model	Character error rate (CER)	Word error rate (WER)
jonatasgrosman/whisper-large-zh-cv11	15.00	93.45
jonatasgrosman/whisper-large-zh-cv11 + text normalization	11.76	70.63
jonatasgrosman/whisper-large-zh-cv11 + non-numeric samples only	10.95	87.91
jonatasgrosman/whisper-large-zh-cv11 + text normalization + non-numeric samples only	7.83	62.12
openai/whisper-large-v2	23.49	101.28
openai/whisper-large-v2 + text normalization	17.58	83.22
openai/whisper-large-v2 + non-numeric samples only	21.03	101.95
openai/whisper-large-v2 + text normalization + non-numeric samples only	15.22	79.28
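As a reading aid, the raw-text CER figures from the two tables can be turned into relative improvements over the base model. The helper `relative_reduction` is a hypothetical name introduced for this sketch; the numbers are copied from the tables.

```python
def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative error reduction in percent: (baseline - finetuned) / baseline."""
    return (baseline - finetuned) / baseline * 100


# (openai/whisper-large-v2, jonatasgrosman/whisper-large-zh-cv11), raw text CER
raw_cer = {
    "Common Voice 11": (33.33, 9.31),
    "Fleurs": (23.49, 15.00),
}

for dataset, (baseline, finetuned) in raw_cer.items():
    reduction = relative_reduction(baseline, finetuned)
    print(f"{dataset}: {reduction:.1f}% relative CER reduction")
```

The reduction is larger on Common Voice 11 (about 72%) than on Fleurs (about 36%), which fits Fleurs being a dataset the model did not see during fine-tuning.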
