Cohere Transcribe

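Cohere Transcribe is an open-source release of a 2B-parameter dedicated audio-in, text-out automatic speech recognition (ASR) model. The model supports 14 languages.

Developed by: Cohere and Cohere Labs. Point of contact: Cohere Labs.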

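Name	cohere-transcribe-03-2026
Architecture	Conformer-based encoder-decoder
Input	Audio waveform → log-Mel spectrogram. Audio is automatically resampled to 16 kHz during preprocessing if necessary. Similarly, multi-channel (stereo) inputs are averaged to produce a single-channel signal.
Output	Transcribed text
Model size	2B parameters
Model	A large Conformer encoder extracts acoustic representations, followed by a lightweight Transformer decoder for token generation
Training objective	Supervised cross-entropy on output tokens; trained from scratch
Languages	Trained on 14 languages. European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish. APAC: Chinese (Mandarin), Japanese, Korean, Vietnamese. MENA: Arabic
License	Apache 2.0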
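
The Input row describes two preprocessing steps: downmixing to mono by averaging channels and resampling to 16 kHz. The following is a minimal sketch of those two steps using only NumPy. It is not the model's actual preprocessing code: the helper names are hypothetical, the (channels, samples) layout is an assumption, and the linear-interpolation resampler is a crude stand-in with no anti-aliasing filter.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate stated in the model card


def to_mono(waveform: np.ndarray) -> np.ndarray:
    """Average a (channels, samples) array down to one channel (layout is assumed)."""
    if waveform.ndim == 1:
        return waveform
    return waveform.mean(axis=0)


def resample_linear(waveform: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Crude linear-interpolation resampler, for illustration only."""
    if orig_sr == target_sr:
        return waveform
    n_out = int(round(len(waveform) * target_sr / orig_sr))
    # Positions of the output samples, expressed in input-sample units.
    positions = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(positions, np.arange(len(waveform)), waveform)


# One second of stereo audio at 44.1 kHz whose channels cancel out when averaged.
stereo = np.stack([np.ones(44_100), -np.ones(44_100)])
mono = to_mono(stereo)
audio_16k = resample_linear(mono, orig_sr=44_100)
print(mono.shape, audio_16k.shape)
```

In practice the processor performs these steps itself, so the sketch is only meant to make the table entry concrete.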

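✨Try the Cohere Transcribe demo✨

Usage

Cohere Transcribe is natively supported in transformers. This is the recommended way to use the model for offline inference. For online inference, see the vLLM integration example below.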

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
pip install datasets  # only needed for long-form and non-English examples

Testing was carried out with torch==2.10.0, but the model is expected to work with other versions.

Quick start 🤗

Transcribe any audio file in a few lines:

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio
from huggingface_hub import hf_hub_download

processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")

audio_file = hf_hub_download(
    repo_id="CohereLabs/cohere-transcribe-03-2026",
    filename="demo/voxpopuli_test_en_demo.wav",
)
audio = load_audio(audio_file, sampling_rate=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)

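Long-form transcription

When audio is longer than the feature extractor's max_audio_clip_s, the feature extractor automatically splits the waveform into chunks. The processor then reassembles the per-chunk transcriptions using the returned audio_chunk_index.

The following example transcribes a 55-minute earnings call: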

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from datasets import load_dataset
import time

processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")

ds = load_dataset("distil-whisper/earnings22", "full", split="test", streaming=True)
sample = next(iter(ds))

audio_array = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
duration_s = len(audio_array) / sr
print(f"Audio duration: {duration_s / 60:.1f} minutes")

inputs = processor(audio=audio_array, sampling_rate=sr, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en")[0]
elapsed = time.time() - start
rtfx = duration_s / elapsed
print(f"Transcribed in {elapsed:.1f}s — RTFx: {rtfx:.1f}")
print(f"Transcription ({len(text.split())} words):")
print(text[:500] + "...")
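
To make the split-and-reassemble idea concrete, here is a minimal conceptual sketch. It is not the library's implementation: the helper names, the 30-second clip length, and the (sample_id, position) layout of the chunk index are all assumptions made for illustration, and the real feature extractor may choose split points differently.

```python
import numpy as np


def split_into_chunks(waveform, sampling_rate, max_clip_s):
    """Split a 1-D waveform into consecutive chunks of at most max_clip_s seconds."""
    chunk_len = int(max_clip_s * sampling_rate)
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]


def reassemble(chunk_texts, chunk_index):
    """Join per-chunk transcripts into one string per original audio.

    chunk_index[i] is an assumed (sample_id, position) pair for chunk_texts[i].
    """
    grouped = {}
    for text, (sample_id, position) in zip(chunk_texts, chunk_index):
        grouped.setdefault(sample_id, []).append((position, text))
    return [
        " ".join(text for _, text in sorted(parts))
        for _, parts in sorted(grouped.items())
    ]


sr = 16_000
audio = np.zeros(sr * 75)  # 75 s of audio
chunks = split_into_chunks(audio, sr, max_clip_s=30.0)  # 30 s is a hypothetical value
print([len(c) / sr for c in chunks])  # chunk durations in seconds

# Two audios in one batch: audio 0 produced three chunks, audio 1 produced one.
index = [(0, 0), (0, 1), (0, 2), (1, 0)]
texts = ["good morning", "and welcome", "to the call", "hello"]
print(reassemble(texts, index))
```

Keeping an explicit index alongside the chunks is what lets a single batched generate call serve several audios of different lengths.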

Punctuation control

Pass punctuation=False to obtain lowercase output without punctuation.

inputs_pnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=True)
inputs_nopnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=False)

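Punctuation is enabled by default.

Batched inference

Multiple audio files can be processed in a single call. When the batch mixes short-form and long-form audio, the processor handles chunking and reassembly automatically.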

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")

audio_short = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
    sampling_rate=16000,
)
audio_long = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3",
    sampling_rate=16000,
)

inputs = processor([audio_short, audio_long], sampling_rate=16000, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en"
)
print(text)

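Non-English transcription

Specify the language code to transcribe any of the 14 supported languages. The following example transcribes Japanese audio from the FLEURS dataset: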

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto")

ds = load_dataset("google/fleurs", "ja_jp", split="test", streaming=True)
ds_iter = iter(ds)
samples = [next(ds_iter) for _ in range(3)]

for sample in samples:
    audio = sample["audio"]["array"]
    sr = sample["audio"]["sampling_rate"]

    inputs = processor(audio, sampling_rate=sr, return_tensors="pt", language="ja")
    inputs.to(model.device, dtype=model.dtype)

    outputs = model.generate(**inputs, max_new_tokens=256)
    text = processor.decode(outputs, skip_special_tokens=True)
    print(f"REF: {sample['transcription']}\nHYP: {text}\n")
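
Because the language must be specified up front, it can help to validate it before calling the processor. The sketch below maps the 14 language names from this card to codes. Only "en" and "ja" appear in the examples above; the remaining codes are assumed to follow ISO 639-1 and are not confirmed by this card. The helper name is hypothetical.

```python
# Language names come from the model card. Codes other than "en" and "ja"
# are ASSUMED to be ISO 639-1 codes and should be checked against the model.
SUPPORTED_LANGUAGES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Spanish": "es",
    "Portuguese": "pt",
    "Greek": "el",
    "Dutch": "nl",
    "Polish": "pl",
    "Chinese (Mandarin)": "zh",
    "Japanese": "ja",
    "Korean": "ko",
    "Vietnamese": "vi",
    "Arabic": "ar",
}


def language_code(name: str) -> str:
    """Return the code for a supported language name, or raise ValueError."""
    try:
        return SUPPORTED_LANGUAGES[name]
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {name!r}; expected one of: {supported}") from None


print(language_code("Japanese"))
```

The returned string would then be passed as the language argument shown in the examples above.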

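vLLM integration

For production serving, we recommend running via vLLM following the instructions below.

Run cohere-transcribe-03-2026 via vLLM

First install vLLM (refer to the vLLM installation instructions):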

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
uv pip install "vllm[audio]"
uv pip install librosa

Start the vLLM server

vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

Send a request

curl -v -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F "file=@$(realpath "${AUDIO_PATH}")" \
  -F "model=CohereLabs/cohere-transcribe-03-2026"
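
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command: a multipart/form-data POST with a file field and a model field. The helper names are hypothetical, and reading the transcript from a "text" key in the JSON response is an assumption based on the endpoint being OpenAI-compatible.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path, base_url="http://localhost:8000", api_key=None):
    """POST an audio file to the transcription endpoint and return the text."""
    with open(audio_path, "rb") as f:
        body, content_type = build_multipart(
            {"model": "CohereLabs/cohere-transcribe-03-2026"},
            "file",
            os.path.basename(audio_path),
            f.read(),
        )
    headers = {"Content-Type": content_type}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions", data=body, headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        # ASSUMPTION: the response is JSON with the transcript under "text".
        return json.loads(response.read())["text"]
```

With the server running, a call such as transcribe("sample.wav", api_key=os.environ.get("VLLM_API_KEY")) would return the transcript, where sample.wav is a placeholder path.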

Results

English ASR leaderboard (as of March 26, 2026)

Model	Average WER	AMI	Earnings 22	Gigaspeech	LS clean	LS other	SPGISpeech	Tedlium	Voxpopuli
Cohere Transcribe	5.42	8.15	10.84	9.33	1.25	2.37	3.08	2.49	5.87
Zoom Scribe v1	5.47	10.03	9.53	9.61	1.63	2.81	1.59	3.22	5.37
IBM Granite 4.0 1B Speech	5.52	8.44	8.48	10.14	1.42	2.85	3.89	3.10	5.84
NVIDIA Canary Qwen 2.5B	5.63	10.19	10.45	9.43	1.61	3.10	1.90	2.71	5.66
Qwen3-ASR-1.7B	5.76	10.56	10.25	8.74	1.63	3.40	2.84	2.28	6.35
ElevenLabs Scribe v2	5.83	11.86	9.43	9.11	1.54	2.83	2.68	2.37	6.80
Kyutai STT 2.6B	6.40	12.17	10.99	9.81	1.70	4.32	2.03	3.35	6.79
OpenAI Whisper Large v3	7.44	15.95	11.29	10.02	2.01	3.91	2.94	3.86	9.54
Voxtral Mini 4B Realtime 2602	7.68	17.07	11.84	10.38	2.08	5.52	2.42	3.79	8.34

Link to the live leaderboard: Open ASR Leaderboard.
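
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal implementation of that definition, together with the character-level variant (CER) used for Chinese, Japanese, and Korean later in this card. It assumes a non-empty reference and skips the text normalization that leaderboards typically apply before scoring, so it will not reproduce the table values exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes the reference contains at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces; assumes a non-empty reference."""
    ref_chars = list(reference.replace(" ", ""))
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# Two substituted characters out of five.
print(cer("こんにちは", "こんばんは"))

# The Average WER column is consistent with an unweighted mean of the eight
# per-dataset scores, checked here for the Cohere Transcribe row.
cohere_row = [8.15, 10.84, 9.33, 1.25, 2.37, 3.08, 2.49, 5.87]
print(round(sum(cohere_row) / len(cohere_row), 2))
```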

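Human preference results

We observe similarly strong performance in human evaluations. Trained annotators assess transcription quality on real-world audio for accuracy, coherence, and usability. The agreement between automated metrics and human judgments suggests that the model's improvements translate beyond controlled benchmarks to practical transcription settings.

Figure: Human preference evaluation of model transcripts. In head-to-head comparisons, annotators were asked to express a preference between transcripts, judging primarily whether the original meaning was preserved, while also considering whether hallucinations were avoided, named entities were correctly identified, and the transcript was verbatim with appropriate formatting. A score of 50% or higher indicates that Cohere Transcribe was preferred on average in the comparison.

Per-language WER

Figure: Per-language error rate, averaged over the FLEURS, Common Voice 17.0, MLS, and Wenet test sets (where relevant for a given language). CER is reported for Chinese, Japanese, and Korean; WER is reported for all other languages.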

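Resources

For more details and results:

The technical blog post contains WERs and other quality metrics.
The announcement blog post provides more information about the model.
WER and RTFx figures for English, EU-language, and long-form transcription are available on the Open ASR Leaderboard.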

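Strengths and limitations

Cohere Transcribe is a high-performing, dedicated ASR model designed for efficient speech transcription.

Strengths

Cohere Transcribe demonstrates best-in-class transcription accuracy in 14 languages. As a dedicated speech recognition model, it is also efficient, with a real-time factor (RTF) three times faster than other dedicated ASR models in the same size range. The model was trained from scratch, and from the outset we deliberately focused on maximizing transcription accuracy while keeping production readiness front of mind.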

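Limitations

Single language: The model performs best on audio in a single, pre-specified language among its 14 supported languages. It has no explicit automatic language detection, and its performance on code-switched audio is inconsistent.
Timestamps/speaker diarization: The model offers neither feature.
Silence: Like most attention-based encoder-decoder (AED) speech models, Cohere Transcribe tends to attempt a transcription even for non-speech sounds. To prevent low-volume floor noise from being transcribed as hallucinated text, we recommend placing a noise gate or a voice activity detection (VAD) model in front of the model.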
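
As an illustration of the noise-gate recommendation, the sketch below zeroes out frames whose RMS level falls below a threshold. It is a minimal energy-based gate, not a trained VAD: the function names, the 30 ms frame size, and the -40 dB threshold are hypothetical values chosen for the example and would need tuning on real audio.

```python
import numpy as np


def frame_rms_db(waveform, frame_len):
    """RMS level of each non-overlapping frame, in dB relative to full scale."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20 * np.log10(np.maximum(rms, 1e-10))


def noise_gate(waveform, sampling_rate=16_000, frame_ms=30, threshold_db=-40.0):
    """Zero out frames quieter than threshold_db; a trailing partial frame is left as-is."""
    frame_len = int(sampling_rate * frame_ms / 1000)
    levels = frame_rms_db(waveform, frame_len)
    gated = waveform.copy()
    for i, level in enumerate(levels):
        if level < threshold_db:
            gated[i * frame_len:(i + 1) * frame_len] = 0.0
    return gated, levels >= threshold_db


sr = 16_000
t = np.arange(sr) / sr
hum = 0.001 * np.sin(2 * np.pi * 50 * t)         # quiet tone standing in for floor noise
speech_like = 0.5 * np.sin(2 * np.pi * 220 * t)  # loud tone standing in for speech
mixed = np.concatenate([hum, speech_like])

gated, active = noise_gate(mixed, sr)
print(active[:3], active[-3:])
```

The boolean mask can also be used to skip fully silent inputs before they reach the model, rather than only muting them.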

Ecosystem support 🚀

Cohere Transcribe is supported on the following libraries/platforms:

transformers (see Quick start above).
vLLM (see vLLM integration above).
mlx-audio, for Apple Silicon.
Rust implementation: cohere_transcribe_rs
Browser demo ✨demo✨ (via transformers.js and WebGPU)
Chrome extension: cohere_transcribe_extension
Whisper Memos (iOS app).
Whisperian (Android app).

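If you have added support for the model somewhere not listed above, please open an issue/PR!

If you find issues when using these libraries, please open an issue with the respective library.

Model card contact

For errors or other questions about the content of this model card, contact labs@cohere.com or open an issue.

Terms of use: By releasing the weights of this high-performing 2-billion-parameter model to researchers around the world, we hope to make community-based research efforts more accessible. The model is governed by the Apache 2.0 license.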