CrisperWhisper

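Introduction

CrisperWhisper is a fine-tuned version of OpenAI's whisper-large-v3 designed for high-quality verbatim speech recognition. It supports transcription in German (de) and English (en), with more accurate timestamps and word-level alignment.

The CrisperWhisper model is supported on vLLM-Ascend through the native WhisperForConditionalGeneration architecture. This document describes how to deploy and validate it on Ascend NPUs.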

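Supported Features

| Feature | Status |
|---|---|
| ACLGraph (PIECEWISE) | Supported |
| Transcription | Supported |
| Translation | Supported |
| Language detection | Supported |
| Segment timestamps | Supported |
| Tensor parallelism | Supported |
| Multi-NPU (TP) | Supported |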

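Environment Preparation

Model Weights

nyrahealth/CrisperWhisper (FP16/BF16): download the model weights

You can download the model from ModelScope or through a Hugging Face mirror. To use the mirror, set the endpoint first: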

export HF_ENDPOINT=https://hf-mirror.com

Installation

Use the official vllm-ascend Docker image, or install with pip:

pip install vllm-ascend

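Make sure the vllm[audio] extra dependencies are installed, since they are needed for audio feature extraction. The command below installs them from the current directory, so run it from the root of a vLLM source checkout: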

VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
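
One quick way to confirm that the audio extras are usable is to try importing them. The sketch below assumes the extras provide the `librosa` and `soundfile` modules (an assumption about the `vllm[audio]` extra; check it against your vLLM version). `check_py_modules` is a hypothetical helper name, not part of vLLM.

```shell
# check_py_modules is a hypothetical helper: it reports which Python modules
# fail to import in the given interpreter and returns non-zero if any do.
# Usage: check_py_modules PYTHON_BIN MODULE [MODULE...]
check_py_modules() {
  local py="$1" missing=0 mod
  shift
  for mod in "$@"; do
    if "$py" -c "import ${mod}" >/dev/null 2>&1; then
      echo "ok: ${mod}"
    else
      echo "missing: ${mod}"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed module names for the audio extras; adjust to your vLLM version.
# check_py_modules python3 librosa soundfile
```

Running the commented-out line inside the serving environment prints one `ok:` or `missing:` line per module.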

Deployment

Single NPU

export HF_ENDPOINT=https://hf-mirror.com
export MODEL_PATH="nyrahealth/CrisperWhisper"

vllm serve "${MODEL_PATH}" \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name crisper-whisper \
  --dtype bfloat16 \
  --max-model-len 448 \
  --max-num-seqs 5 \
  --block-size 128 \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}'

:::{note}
Encoder-decoder models on Ascend currently require the PIECEWISE ACLGraph mode, which the vllm-ascend platform enforces automatically. Whisper does not support FULL_DECODE_ONLY.
:::

Multi-NPU (Tensor Parallelism)

export HF_ENDPOINT=https://hf-mirror.com
export MODEL_PATH="nyrahealth/CrisperWhisper"
export TP_SIZE=2

vllm serve "${MODEL_PATH}" \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name crisper-whisper \
  --dtype bfloat16 \
  --max-model-len 448 \
  --max-num-seqs 5 \
  --block-size 128 \
  --trust-remote-code \
  --tensor-parallel-size ${TP_SIZE} \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}'
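
Tensor parallelism splits the attention heads across devices, so the head count generally needs to be divisible by the tensor-parallel size. A minimal pre-flight sketch follows. It assumes that the 20 attention heads of the whisper-large-v3 configuration carry over to CrisperWhisper (verify this against the model's config.json). `tp_size_ok` is a hypothetical helper.

```shell
# tp_size_ok is a hypothetical helper: it succeeds when the attention heads
# divide evenly across the tensor-parallel ranks.
tp_size_ok() {
  local heads="$1" tp="$2"
  [ "$tp" -gt 0 ] && [ $((heads % tp)) -eq 0 ]
}

NUM_HEADS=20              # assumed from whisper-large-v3; check config.json
TP_SIZE="${TP_SIZE:-2}"   # falls back to 2 when TP_SIZE is not exported

if tp_size_ok "$NUM_HEADS" "$TP_SIZE"; then
  echo "TP_SIZE=${TP_SIZE} divides ${NUM_HEADS} heads"
else
  echo "TP_SIZE=${TP_SIZE} does not divide ${NUM_HEADS} heads" >&2
fi
```

Under this assumption, sizes such as 2 or 4 pass the check, while 8 does not.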

Functional Verification

Readiness Check

curl -sf http://127.0.0.1:8000/v1/models
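
Model loading and graph capture can take a while, so a single probe right after launch may fail even though the server is healthy. The following sketch retries a probe command until it succeeds. `wait_for_ready` is a hypothetical helper, and the retry budget in the usage line is an arbitrary choice, not a vLLM default.

```shell
# wait_for_ready is a hypothetical helper: it reruns a probe command until it
# succeeds or the attempt budget is exhausted.
# Usage: wait_for_ready MAX_ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
wait_for_ready() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not ready after ${max_attempts} attempt(s)" >&2
  return 1
}

# Example: probe the models endpoint up to 60 times, 5 seconds apart.
# wait_for_ready 60 5 curl -sf http://127.0.0.1:8000/v1/models
```

Passing the probe as arguments keeps the helper independent of curl, so the same loop can wrap any other health check.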

Transcription (English)

curl http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file="@sample_en.wav" \
  -F model="crisper-whisper" \
  -F language="en" \
  -F response_format="verbose_json"

Transcription (German)

curl http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file="@sample_de.wav" \
  -F model="crisper-whisper" \
  -F language="de" \
  -F response_format="verbose_json"

Translation (to English)

curl http://localhost:8000/v1/audio/translations \
  -H "Content-Type: multipart/form-data" \
  -F file="@sample_de.wav" \
  -F model="crisper-whisper" \
  -F response_format="verbose_json"

Chat Completion (with Audio)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "crisper-whisper",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Transcribe this audio."},
          {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.wav"}}
        ]
      }
    ],
    "max_completion_tokens": 200,
    "temperature": 0.0
  }'

Fallback Commands

If inference fails, try the following in order:

Disable ACLGraph (eager mode):
```
--enforce-eager
```
Reduce the batch size:
```
--max-num-seqs 1
```
Lower the memory utilization:
```
--gpu-memory-utilization 0.7
```
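
The fallback order above can be sketched as a small helper that prints the extra flags for each step. Stacking each step on top of the previous ones is an assumption made here for illustration (the steps can also be tried one at a time), and `fallback_flags` is a hypothetical name.

```shell
# fallback_flags is a hypothetical helper: it prints the extra `vllm serve`
# flags for fallback level N, stacking each step on the previous ones
# (stacking is an assumption, not something the steps require).
fallback_flags() {
  case "$1" in
    0) echo "" ;;
    1) echo "--enforce-eager" ;;
    2) echo "--enforce-eager --max-num-seqs 1" ;;
    3) echo "--enforce-eager --max-num-seqs 1 --gpu-memory-utilization 0.7" ;;
    *) echo "unknown fallback level: $1" >&2; return 1 ;;
  esac
}

# Print the command line for each level instead of launching a server.
for level in 0 1 2 3; do
  # Unquoted on purpose so the flags split into separate words.
  echo vllm serve nyrahealth/CrisperWhisper $(fallback_flags "$level")
done
```

If the base command already sets `--max-num-seqs`, replace that value rather than passing the flag twice.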

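Accuracy and Performance Notes

CrisperWhisper is built on whisper-large-v3 (1.55 billion parameters).
The encoder uses a Conv1d + Transformer architecture, and the decoder uses causal self-attention plus cross-attention.
On Ascend NPUs, all attention types (encoder self-attention, decoder self-attention, and encoder-decoder cross-attention) are supported by the attention_v1.py backend in vllm-ascend.
For verbatim transcription benchmarks, refer to the CrisperWhisper paper.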