MOSS-Transcribe-preview-2B

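MOSS-Transcribe-preview-2B is an English speech-to-text model that combines a Qwen3-1.7B-base language model backbone with the Qwen3-Omni-MoE audio encoder. A gated MLP adapter projects audio features into the language model's embedding space. The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the training splits of the Open ASR Leaderboard.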

The model has about 2.4 billion parameters and is released as a single bfloat16 safetensors shard of about 4.84 GB.
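As a rough sanity check on these two figures: bfloat16 stores each parameter in 2 bytes, so the shard size implies a parameter count. The sketch below assumes that "GB" means decimal gigabytes (10^9 bytes) and ignores the small safetensors header; neither detail is stated in this card.

```python
# Back-of-the-envelope check: bfloat16 uses 2 bytes per parameter.
BYTES_PER_BF16 = 2
shard_bytes = 4.84e9  # assumption: "GB" is decimal (10**9 bytes)

implied_params = shard_bytes / BYTES_PER_BF16
print(f"{implied_params / 1e9:.2f}B parameters")  # prints "2.42B parameters"
```

Under those assumptions the result agrees with the stated size of about 2.4 billion parameters.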

Model Details

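Developed by: OpenMOSS Team
Model type: Automatic Speech Recognition (ASR) / speech-to-text model
Language: English
License: Apache-2.0
Library: Transformers
Backbone: Qwen3-1.7B-base, 28 layers, hidden size 2048
Audio encoder: Qwen3-Omni-MoE audio encoder
Adapter: gated MLP adapter, hidden size 8192
Parameters: about 2.4 billion
Checkpoint format: bfloat16 safetensors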
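To make the adapter entry concrete, here is a minimal pure-Python sketch of a gated MLP. It is not the model's actual implementation. The SiLU activation, the absence of biases, and the AUDIO_DIM input size are assumptions for illustration; only the adapter hidden size (8192) and the backbone hidden size (2048) come from this card, and the toy sizes below stand in for them so the example runs instantly.

```python
import math
import random

def silu(x: float) -> float:
    """SiLU activation (assumed; the card does not name the activation)."""
    return x / (1.0 + math.exp(-x))

def matvec(weight, vec):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def gated_mlp(x, w_gate, w_up, w_down):
    """Compute down(silu(gate(x)) * up(x)), a gated MLP without biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def random_matrix(rows, cols, rng):
    """Small random weights, used here only to exercise the shapes."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes. In the model the adapter hidden size is 8192 and the output size
# is 2048 (the backbone hidden size); AUDIO_DIM is a hypothetical placeholder.
AUDIO_DIM, ADAPTER_HIDDEN, LM_HIDDEN = 6, 16, 4
rng = random.Random(0)
w_gate = random_matrix(ADAPTER_HIDDEN, AUDIO_DIM, rng)
w_up = random_matrix(ADAPTER_HIDDEN, AUDIO_DIM, rng)
w_down = random_matrix(LM_HIDDEN, ADAPTER_HIDDEN, rng)

audio_frame = [rng.uniform(-1.0, 1.0) for _ in range(AUDIO_DIM)]
projected = gated_mlp(audio_frame, w_gate, w_up, w_down)
print(len(projected))  # one vector of size LM_HIDDEN per audio frame
```

The gate branch scales the up-projected features elementwise before the down projection maps them to the language model's embedding width.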

Intended Use

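This model is intended for English automatic speech recognition, including transcribing English speech audio for research and evaluation purposes.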

Evaluation

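The model is evaluated on the Open ASR Leaderboard test sets. Predictions use greedy decoding (num_beams=1, max_new_tokens=512), a single dataset-agnostic chat template, and the leaderboard's standard normalized scoring (English normalizer plus word-level edit distance with compound-word merging). TED-LIUM is currently not part of the leaderboard run, so it is not included.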
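For readers unfamiliar with the metric, the sketch below shows a simplified word error rate (WER) based on word-level edit distance. It is not the leaderboard's scorer: the English normalizer and compound-word merging are omitted, and plain whitespace splitting is used in their place. The function names and sample sentences are illustrative.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]

def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words, as a percentage."""
    errors = sum(
        edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, hypotheses)
    )
    ref_word_count = sum(len(ref.split()) for ref in references)
    return 100.0 * errors / ref_word_count

refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello there world"]
print(f"{corpus_wer(refs, hyps):.2f}")  # prints "25.00"
```

Here one deletion and one insertion over eight reference words give 2 / 8 = 25%.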

Dataset	WER (%)
AMI	8.37
Earnings22	7.84
GigaSpeech	6.78
LibriSpeech test.clean	1.21
LibriSpeech test.other	2.84
SPGISpeech	1.63
VoxPopuli	5.39
Average	4.87
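The reported average is consistent with an unweighted mean over the seven datasets in the table. The card does not state this explicitly, so treat the check below as an assumption about how the figure was computed.

```python
# WER values copied from the table above.
wer_by_dataset = {
    "AMI": 8.37,
    "Earnings22": 7.84,
    "GigaSpeech": 6.78,
    "LibriSpeech test.clean": 1.21,
    "LibriSpeech test.other": 2.84,
    "SPGISpeech": 1.63,
    "VoxPopuli": 5.39,
}

# Unweighted (macro) mean: every dataset counts equally, regardless of size.
average = sum(wer_by_dataset.values()) / len(wer_by_dataset)
print(f"{average:.2f}")  # prints "4.87"
```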

Inference

import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "OpenMOSS-Team/MOSS-Transcribe-preview-2B"
DEVICE = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(
    REPO, dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

MossProcessor = get_class_from_dynamic_module("processing_Moss.MossProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Moss.MelConfig", REPO)

mel_cfg = MelConfig(
    mel_sr=16000,
    mel_dim=128,
    mel_n_fft=400,
    mel_hop_length=160,
)
processor = MossProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

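# Load the audio as mono and resample it to the 16 kHz rate the front end expects.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Cast the audio tensor to the model dtype (bfloat16) so it matches the weights.
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)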

with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Drop the prompt tokens so that only the newly generated tokens are decoded.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)

Audio Front End

Sample rate: 16 kHz
Features: Whisper log-mel filterbank
Mel bins: 128
FFT size: 400
Hop length: 160
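These settings translate into window and frame timings as follows. The sketch is plain arithmetic on the values listed above. The frame-count helper assumes one frame per hop and is hypothetical: the exact count depends on the padding convention of the processor, and any downsampling inside the audio encoder is not modeled.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between consecutive frames
N_MELS = 128          # mel bins per frame

window_ms = 1000 * N_FFT / SAMPLE_RATE         # 25.0 ms window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # 10.0 ms hop
frames_per_second = SAMPLE_RATE // HOP_LENGTH  # 100 frames per second

def approx_mel_frames(num_samples: int) -> int:
    """Approximate frame count, assuming one frame per hop (padding may differ)."""
    return num_samples // HOP_LENGTH

ten_seconds = 10 * SAMPLE_RATE
print(window_ms, hop_ms, frames_per_second)      # prints "25.0 10.0 100"
print((N_MELS, approx_mel_frames(ten_seconds)))  # prints "(128, 1000)"
```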

Training

The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the training splits of the Open ASR Leaderboard.

Limitations

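This model is designed for English speech recognition. Its performance may degrade on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, domain-specific terminology, or audio conditions that differ significantly from the training and evaluation data. Outputs should be reviewed by a human before use in high-stakes settings.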

Citation

@misc{moss_transcribe_2025,
  title        = {{MOSS-Transcribe-preview-2B}},
  author       = {{OpenMOSS Team}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/OpenMOSS-Team/MOSS-Transcribe-preview-2B}}
}

License

This model is released under the Apache-2.0 license.