Whisper

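Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision (https://huggingface.co/papers/2212.04356) by Alec Radford et al. from OpenAI. Trained on more than 5 million hours of labeled data, Whisper generalizes well to many datasets and domains in a zero-shot setting.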

Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:

The spectrogram input uses 128 Mel frequency bins instead of 80
A new language token was added for Cantonese
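
To make the first difference concrete, the sketch below computes the shape of the log-Mel input for one 30-second window. It assumes Whisper's usual front-end settings of 16 kHz audio and a hop length of 160 samples, which this card does not state; the helper name log_mel_shape is made up for illustration.

```python
def log_mel_shape(num_mel_bins, window_s=30, sampling_rate=16_000, hop_length=160):
    """Return (mel bins, frames) for one fixed-length input window.

    sampling_rate and hop_length are assumed values, not taken from this card.
    """
    num_samples = window_s * sampling_rate  # 480,000 samples for 30 s at 16 kHz
    num_frames = num_samples // hop_length  # one frame every 10 ms
    return (num_mel_bins, num_frames)


print(log_mel_shape(80))   # large and large-v2 style input: (80, 3000)
print(log_mel_shape(128))  # large-v3 style input: (128, 3000)
```

Only the first dimension changes, so audio prepared with an 80-bin feature extractor would not match what large-v3 expects.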

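Whisper large-v3 was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.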

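The large-v3 model shows improved performance across a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2. For more details on the available checkpoints, see the Model details section.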
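
As a reading aid for the error-reduction figure, here is a minimal sketch of how a word error rate (WER) and a relative reduction between two systems could be computed. The transcripts and the WER values are invented for illustration and are not results from this card; real evaluations usually normalize the text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(old_wer, new_wer):
    """Fraction of the old system's errors that the new system removes."""
    return (old_wer - new_wer) / old_wer


# Hypothetical example: one dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Hypothetical WERs of 10.0% and 8.5% correspond to a 15% relative reduction.
print(round(relative_error_reduction(0.100, 0.085), 2))
```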

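Disclaimer: The content of this model card was written partly by the 🤗 Hugging Face team and partly copied and pasted from the original model card.

Usage

Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time: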

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, pass the path to the file when calling the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by passing them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)

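Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example demonstrates how to enable these heuristics: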

generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
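
The idea behind temperature fallback can be sketched in a few lines: decode at the lowest temperature first, and retry at the next temperature whenever the output looks too repetitive or too uncertain. This is a simplified illustration, not the Transformers implementation. The decode callable is hypothetical, the compression ratio is computed on UTF-8 text rather than in token space, and the no-speech check is left out.

```python
import zlib


def compression_ratio(text):
    """Raw size over zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=1.35,
    logprob_threshold=-1.0,
):
    """Return (temperature, text) for the first attempt that passes both checks.

    `decode` is a hypothetical callable: temperature -> (text, avg_logprob).
    If every attempt fails, the last attempt is returned.
    """
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        result = (temperature, text)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break
    return result


def fake_decode(temperature):
    # Stand-in for a real decoder: greedy decoding gets stuck in a loop here.
    if temperature == 0.0:
        return "la " * 40, -0.3
    return "Mr. Quilter is the apostle of the middle classes.", -0.4


print(decode_with_fallback(fake_decode))
```

With the stand-in decoder, the looping greedy output is rejected by the compression check and the attempt at temperature 0.2 is kept.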

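Whisper predicts the language of the source audio automatically. If the source language is known in advance, it can be passed as an argument to the pipeline: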

result = pipe(sample, generate_kwargs={"language": "english"})

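By default, Whisper performs speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":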

result = pipe(sample, generate_kwargs={"task": "translate"})
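
Under the hood, the language and task choices are communicated to the decoder as special tokens at the start of the output sequence. The sketch below assembles such a prefix as a plain string, assuming the special-token layout described in the Whisper paper. The helper decoder_prefix and the two-entry language map are illustrative only; in practice the tokenizer builds this prefix for you.

```python
# Illustrative subset only; the real tokenizer covers many more languages.
LANGUAGE_CODES = {"english": "en", "french": "fr"}


def decoder_prefix(language, task="transcribe", timestamps=False):
    """Build the special-token prefix that steers language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = [
        "<|startoftranscript|>",
        f"<|{LANGUAGE_CODES[language]}|>",
        f"<|{task}|>",
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


print(decoder_prefix("french", task="translate"))
# <|startoftranscript|><|fr|><|translate|><|notimestamps|>
```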

Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe(sample, return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])
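
Each entry in result["chunks"] is a dictionary with a "timestamp" pair (start and end, in seconds) and a "text" field. A small helper like the one below can turn that list into readable lines. The helper name and the sample data are made up for illustration, and the None check is a defensive assumption for a final segment whose end time is missing.

```python
def format_chunks(chunks):
    """Render pipeline-style chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Defensive: treat a missing end time as unknown rather than failing.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hypothetical data in the same shape as result["chunks"].
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " General Kenobi."},
]

for line in format_chunks(example_chunks):
    print(line)
```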

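The above arguments can be used in isolation or in combination. For example, to perform speech translation where the source audio is in French, and to return sentence-level timestamps, the following can be used: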

result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

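Additional speed and memory improvements

You can apply additional speed and memory improvements to Whisper to further speed up inference and reduce VRAM requirements.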

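Chunked long-form

Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:

Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries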
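
The splitting step of the chunked approach can be sketched as follows. This is a simplified illustration rather than the library's implementation: the 5-second overlap is an assumed value, and the stitching of transcriptions at the boundaries is not shown.

```python
def chunk_windows(total_s, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds for overlapping chunks."""
    if overlap_s >= chunk_length_s:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    total_s = float(total_s)
    if total_s <= 0:
        return []
    step = chunk_length_s - overlap_s  # how far each new chunk advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


print(chunk_windows(70))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Because the windows are independent of each other, they can be batched and transcribed in parallel, which is where the speed advantage of the chunked algorithm comes from.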

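The sequential long-form algorithm should be used in either of the following scenarios:

Transcription accuracy is the most important factor, and speed is less of a consideration
You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being about 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:

Transcription speed is the most important factor
You are transcribing a single long audio file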

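By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the argument batch_size: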

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Torch compile

The Whisper forward pass is compatible with torch.compile for 4.5x speed-ups.

Note: torch.compile is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])

Flash Attention 2

We recommend using Flash Attention 2 if your GPU supports it and you are not using torch.compile. To do so, first install Flash Attention:

pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" to from_pretrained:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")

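Torch scaled dot-product attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater. To check whether you have a compatible PyTorch version, run the following Python code snippet: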

from transformers.utils import is_torch_sdpa_available

print(is_torch_sdpa_available())

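If the above returns True, you have a valid version of PyTorch installed and SDPA is activated by default. If it returns False, you need to upgrade your PyTorch version according to the official instructions.

SDPA can also be set explicitly by specifying attn_implementation="sdpa":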

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")

For more information about how to use SDPA, refer to the Transformers SDPA documentation.
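
The recommendations in the last three sections can be folded into one small selection helper, sketched below. The function name and its using_torch_compile flag are hypothetical. It only checks whether the packages are importable and does not verify that the GPU actually supports Flash Attention 2, so treat it as a rough heuristic.

```python
import importlib.util


def pick_attn_implementation(using_torch_compile=False):
    """Pick a value for attn_implementation based on what is installed."""
    # Flash Attention 2 is skipped when torch.compile is in use (see the note above).
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_flash_attn and not using_torch_compile:
        return "flash_attention_2"
    try:
        import torch

        has_sdpa = hasattr(torch.nn.functional, "scaled_dot_product_attention")
    except ImportError:
        has_sdpa = False
    # "eager" is the plain attention implementation used as a last resort.
    return "sdpa" if has_sdpa else "eager"


print(pick_attn_implementation(using_torch_compile=True))
```

The returned string can then be passed as attn_implementation to from_pretrained.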

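Model details

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. There are two flavors of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.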

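Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only and multilingual versions. The largest checkpoints are multilingual only. All of the pre-trained checkpoints listed below are available on the Hugging Face Hub. The following table summarizes them:

Size	Parameters	English-only	Multilingual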
tiny	39 M	✓	✓
base	74 M	✓	✓
small	244 M	✓	✓
medium	769 M	✓	✓
large	1550 M	x	✓
large-v2	1550 M	x	✓
large-v3	1550 M	x	✓
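
The parameter counts above give a quick lower bound on the memory needed just to hold the weights. The sketch below does that arithmetic, assuming 4 bytes per parameter in float32 and 2 bytes in float16. Activations, the decoder cache, and framework overhead are not included, so real usage will be higher.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1550}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(size, dtype="float16"):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return PARAMS_MILLIONS[size] * 1e6 * BYTES_PER_PARAM[dtype] / 1e9


for name in PARAMS_MILLIONS:
    fp16 = weight_memory_gb(name, "float16")
    fp32 = weight_memory_gb(name, "float32")
    print(f"{name:>8}: {fp16:.2f} GB (float16), {fp32:.2f} GB (float32)")
```

For large-v3 this works out to roughly 3.1 GB in float16 and 6.2 GB in float32, which is one reason the examples in this card load the model in float16 on GPU.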

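Fine-tuning

The pre-trained Whisper model demonstrates a strong ability to generalize to different datasets and domains. However, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labeled data.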

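Intended use

The primary intended users of these models are AI researchers studying the robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only "intended" uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and speech translation to English. They show strong ASR results in about 10 languages. They may exhibit additional capabilities, particularly if fine-tuned on certain tasks such as voice activity detection, speaker classification, or speaker diarization, but they have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them.

In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent, or using these models for any kind of subjective classification. We recommend against use in high-risk domains such as decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; using them for classification is not only unevaluated but also inappropriate, particularly for inferring human attributes.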

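Training data

The large-v3 checkpoint was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2.

As discussed in the accompanying paper, transcription performance in a given language is directly correlated with the amount of training data in that language.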

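Performance and limitations

Our studies show that, compared with many existing ASR systems, the models exhibit improved robustness to accents, background noise, and technical language, as well as zero-shot translation from multiple languages into English, and that their accuracy on speech recognition and translation is near the state of the art.

However, because the models are trained in a weakly supervised manner on large-scale noisy data, their predictions may include text that is not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in the audio with trying to transcribe the audio itself.

Our models perform unevenly across languages, and accuracy is lower on low-resource and/or low-discoverability languages, or on languages for which we have less training data. The models also perform differently across accents and dialects of a given language, which may include a higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in the paper accompanying this release.

In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive text. This can be mitigated to some degree by beam search and temperature scheduling, but not completely eliminated. The paper provides further analysis of these limitations. Both the repetition and the hallucinations are likely to be worse on low-resource and/or low-discoverability languages.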

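Broader implications

We anticipate that the transcription capabilities of Whisper models may be used to improve accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.

There are also potential dual-use concerns that come with releasing Whisper. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capability to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and to disparate performance. In practice, we expect that the cost of transcription is not the limiting factor in scaling up surveillance projects.

BibTeX entry and citation info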

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}