Whisper

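Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labelled data, it generalizes well to many datasets and domains without fine-tuning.

Whisper was proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" (https://arxiv.org/abs/2212.04356) by Alec Radford et al. from OpenAI. The original code repository is available at https://github.com/openai/whisper.

Compared with the Whisper large model, the large-v2 model was trained for 2.5x more epochs with added regularization, which improves performance.

Disclaimer: parts of this model card were written by the Hugging Face team, and parts were copied and pasted from the original model card.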

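Model details

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680,000 hours of labelled speech data annotated using large-scale weak supervision.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.

Whisper checkpoints come in five configurations of varying model size. The smallest four are available as either English-only or multilingual models. The largest checkpoints are multilingual only. The original model card includes a table that summarizes the checkpoints and links to the models on the Hub.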

Usage

Using with openMind

Environment variables

# source environment variable
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export OPENMIND_FRAMEWORK=pt

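Installing the openMind Library with pip

The openMind Library can be installed with pip. Choose the command that matches your environment.

Note that torch_npu depends on torch. On aarch64, torch can be installed directly with pip, whereas on x86 the CPU build has to be downloaded from a specific URL, so the install commands differ between the two environments. Both commands are listed below.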

# aarch64
pip install openmind[all]
# x86
pip install openmind[all] --extra-index-url https://download.pytorch.org/whl/cpu
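Once the packages are installed, it can help to confirm which accelerator PyTorch actually sees before loading a large checkpoint. The following is a minimal sketch, not part of the openMind Library: the helper name pick_device is hypothetical, and it assumes that importing torch_npu registers the torch.npu namespace, as the Ascend PyTorch adapter does.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu:0", "cuda:0", or "cpu" depending on what is usable."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch itself is missing, so no accelerator can be used
    import torch

    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (importing registers the NPU backend)

            if torch.npu.is_available():
                return "npu:0"
        except ImportError:
            # torch_npu is installed but its runtime libraries could not be loaded,
            # for example because set_env.sh has not been sourced.
            pass
    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


print(pick_device())
```

If this prints "cpu" on an Ascend machine, the environment variables above are a likely place to look first.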

Inference

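import numpy as np
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Assumption: model_dir resolves to the checkpoint, e.g. a local directory it was downloaded to.
model_dir = "HangZhou_Ascend/whisper-large-v2"
device = "npu:0"

processor = WhisperProcessor.from_pretrained(model_dir)
# Load the weights in float16; loading them as float32 might cause an out-of-memory error.
model = WhisperForConditionalGeneration.from_pretrained(model_dir, torch_dtype=torch.float16).to(device)
model = model.eval()

# Placeholder input: one second of silence at 16 kHz. Replace it with real audio samples.
audio = np.zeros(16_000, dtype=np.float32)
input_features = processor(audio, sampling_rate=16_000, return_tensors="pt").input_features
input_features = input_features.to(device, dtype=torch.float16)

predicted_ids = model.generate(input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))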

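The WhisperProcessor is used to:

Pre-process the audio inputs (converting them to the log-Mel spectrograms the model expects)
Post-process the model outputs (converting them from tokens to text)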
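To make the pre-processing step concrete, the sketch below works out the size of the log-Mel input for large-v2. The constants (16 kHz sampling rate, 30-second window, hop of 160 samples, 80 Mel bins) come from the Whisper paper, not from this card, and the helper names are hypothetical. It only pads or trims raw samples and derives the feature shape; it does not compute the spectrogram itself.

```python
SAMPLE_RATE = 16_000   # Hz, assumed from the Whisper paper
CHUNK_SECONDS = 30     # every input is padded or trimmed to this duration
HOP_LENGTH = 160       # samples between consecutive frames (10 ms)
N_MELS = 80            # Mel bins used by large-v2

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def pad_or_trim(samples):
    """Zero-pad or cut a list of samples to exactly one 30-second window."""
    if len(samples) >= N_SAMPLES:
        return list(samples[:N_SAMPLES])
    return list(samples) + [0.0] * (N_SAMPLES - len(samples))


def log_mel_shape():
    """Shape of the feature matrix for one window: (Mel bins, frames)."""
    return (N_MELS, N_SAMPLES // HOP_LENGTH)


window = pad_or_trim([0.1] * 1_000)
print(len(window), log_mel_shape())
```

Because every window is padded to the same length, the encoder always receives a feature matrix of the same size regardless of how short the clip is.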

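The model is told which task to perform (transcription or translation) by passing the appropriate "context tokens". These context tokens are a sequence of tokens given to the decoder at the start of the decoding process, in the following order:

The transcription always starts with the <|startoftranscript|> token
The second token is the language token (e.g. <|en|> for English)
The third token is the "task token". It takes one of two values: <|transcribe|> for speech recognition or <|translate|> for speech translation
In addition, a <|notimestamps|> token is added if the model should not include timestamp prediction

A typical sequence of context tokens might therefore look as follows: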

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>

This tells the model to decode in English, perform the task of speech recognition, and not predict timestamps.
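The ordering rules above can be sketched as a small helper that assembles the context tokens. The function name build_context_tokens is hypothetical and only builds token strings; it does not look up token IDs or call the tokenizer.

```python
def build_context_tokens(language, task, timestamps=False):
    """Assemble the decoder context tokens in the order Whisper expects."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Only added when the model should not predict timestamps.
        tokens.append("<|notimestamps|>")
    return tokens


print(" ".join(build_context_tokens("en", "transcribe")))
```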

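These tokens can be either forced or unforced. If they are forced, the model is made to predict a given token at each forced position, which lets the user control the output language and task of the Whisper model. If they are unforced, the Whisper model predicts the output language and task itself.

The context tokens can be set accordingly: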

# get_decoder_prompt_ids is called on a WhisperProcessor instance, not on the class
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

This forces the model to predict in English under the task of speech recognition.
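The difference between forced and unforced decoding can be illustrated with a toy greedy decoder. This is a conceptual sketch only, not how Transformers implements forcing: the scores are made up, tokens are plain strings instead of IDs, and the function name greedy_decode is hypothetical.

```python
def greedy_decode(step_scores, forced=None):
    """Pick one token per position; forced positions override the model's choice.

    step_scores: list of {token: score} dicts, one per decoding position.
    forced: optional {position: token} mapping.
    """
    forced = forced or {}
    output = []
    for position, scores in enumerate(step_scores):
        if position in forced:
            output.append(forced[position])                 # forced: ignore the scores
        else:
            output.append(max(scores, key=scores.get))      # unforced: take the best token
    return output


# Made-up scores in which the model slightly prefers French at position 1.
scores = [
    {"<|startoftranscript|>": 1.0},
    {"<|en|>": 0.4, "<|fr|>": 0.6},
    {"<|transcribe|>": 0.7, "<|translate|>": 0.3},
]

print(greedy_decode(scores))
print(greedy_decode(scores, forced={1: "<|en|>", 2: "<|transcribe|>"}))
```

In the unforced call the language token follows the model's own scores; in the forced call the language and task are fixed by the caller.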

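Transcription

English to English

In this example, the context tokens are "unforced", meaning the model automatically predicts the output language (English) and task (transcribe).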

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model.config.forced_decoder_ids = None

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# [' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']

Setting skip_special_tokens=True removes the context tokens from the start of the transcription.
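As a rough illustration of what skipping special tokens does to the decoded string, the sketch below strips anything of the form <|...|> with a regular expression. This is a string-level approximation under the assumption that special tokens always have that form; the tokenizer itself filters by token ID, and the helper name strip_special_tokens is hypothetical.

```python
import re

# Matches Whisper-style special tokens such as <|en|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove every <|...|> marker from a decoded transcription."""
    return SPECIAL_TOKEN.sub("", text)


raw = (
    "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
    " Mr. Quilter is the apostle of the middle classes and we are glad"
    " to welcome his gospel.<|endoftext|>"
)
print(repr(strip_special_tokens(raw)))
```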

French to French

The following example demonstrates French to French transcription by setting the decoder IDs appropriately.

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

# load streaming dataset and read first audio sample
ds = load_dataset("common_voice", "fr", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids)
# ['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# [' Un vrai travail intéressant va enfin être mené sur ce sujet.']

Translation

Setting the task to "translate" forces the Whisper model to perform speech translation.

French to English

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

# load streaming dataset and read first audio sample
ds = load_dataset("common_voice", "fr", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# [' A very interesting work, we will finally be given on this subject.']

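Long-form transcription

The Whisper model is inherently designed to work on audio samples of up to 30 seconds. With a chunking algorithm, however, it can transcribe audio samples of arbitrary length. This is possible through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can run batched inference. It can also be extended to predict sequence-level timestamps by passing return_timestamps=True: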

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

prediction = pipe(sample.copy(), batch_size=8)["text"]
# " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
# [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
#  'timestamp': (0.0, 5.44)}]
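The idea behind chunking can be sketched as splitting a long recording into overlapping 30-second windows. This is a simplified illustration, not the pipeline's actual implementation: the function name chunk_bounds and the 5-second overlap are assumptions chosen for the example, and the step of merging the overlapping transcriptions is left out.

```python
def chunk_bounds(total_s, chunk_s=30, overlap_s=5):
    """Return (start, end) second offsets of overlapping windows covering the audio."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    if total_s <= 0:
        return []
    step = chunk_s - overlap_s  # how far each new window advances
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reaches the end of the audio
        start += step
    return bounds


print(chunk_bounds(70))
```

The overlap gives neighbouring windows some shared context, so words cut at a window boundary can still be recovered when the per-window transcriptions are stitched together.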

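Fine-tuning

The pre-trained Whisper model generalizes well to different datasets and domains. Its predictive capabilities can nevertheless be improved further for specific languages and tasks through fine-tuning. The fine-tuning blog post referenced in the original model card provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.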

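Evaluated use

The primary intended users of these models are AI researchers studying the robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only "intended" uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and on speech translation to English. They show strong ASR results in about 10 languages. They may exhibit additional capabilities, particularly if fine-tuned on tasks such as voice activity detection, speaker classification, or speaker diarization, but they have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them.

In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent, or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains such as decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech. Using them for classification is not only unevaluated but also inappropriate, particularly to infer human attributes.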

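Training data

The models were trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (438,000 hours) is English-language audio with matched English transcripts, roughly 18% (126,000 hours) is non-English audio with English transcripts, and the final 17% (117,000 hours) is non-English audio with the corresponding non-English transcripts. The non-English data covers 98 different languages.

As discussed in the accompanying paper, transcription performance in a given language is directly correlated with the amount of training data used in that language.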

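Performance and limitations

Studies show that, compared with many existing ASR systems, the models are more robust to accents, background noise, and technical language, and can translate from multiple languages into English zero-shot. Their accuracy on speech recognition and translation is near the state of the art.

However, because the models are trained in a weakly supervised manner on large-scale noisy data, their predictions may include text that was not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in the audio with trying to transcribe the audio itself.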

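The models perform unevenly across languages. We observe lower accuracy on low-resource and/or low-discoverability languages, and on languages for which we have less training data. The models also perform differently on different accents and dialects of a given language, which may include a higher word error rate across speakers of different genders, races, ages, or other demographic criteria. The full evaluation results are presented in the paper accompanying this release.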
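Word error rate (WER) is the usual way to quantify such accuracy gaps: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal implementation under the assumption of whitespace tokenization and no text normalization; published evaluations typically normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("we are glad to welcome his gospel",
                      "we are glad to welcome this gospel"))
```

Computing this per language, accent, or speaker group on a held-out set is one way to check whether the disparities described above affect a particular deployment.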

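In addition, the sequence-to-sequence architecture makes the models prone to generating repetitive text. Beam search and temperature scheduling mitigate this to some degree but do not eliminate it. Further analysis of these limitations is provided in the paper. Repetition and hallucination are likely to be worse on low-resource and/or low-discoverability languages.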
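One simple heuristic for spotting repetitive output is to check how well the text compresses: a looping transcript compresses far better than ordinary speech. The sketch below follows that idea using zlib. The 2.4 threshold is the default used by the reference openai/whisper implementation for its temperature fallback and is an assumption here, not something stated in this card; the helper names are hypothetical.

```python
import zlib


def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length; higher means more repetition."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text, threshold=2.4):
    """Flag text whose compression ratio exceeds the threshold."""
    return compression_ratio(text) > threshold


normal = "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
looping = "thank you " * 50

print(looks_repetitive(normal), looks_repetitive(looping))
```

A flagged segment can then be re-decoded with different settings, for example a higher sampling temperature, or sent for manual review.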

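Broader implications

We anticipate that the transcription capabilities of Whisper models may be used to improve accessibility tools. Although Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.

Releasing Whisper also raises potential dual-use concerns. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, since its speed and accuracy allow affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capability to recognize specific individuals out of the box, which in turn raises safety concerns related both to dual use and to disparate performance. In practice, we expect that the cost of transcription is not the limiting factor in scaling up surveillance projects.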

BibTeX entry and citation info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}