Granite-Speech-4.1-2B

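Model Summary: Granite Speech 4.1 2B is a compact and efficient speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST), supporting English, French, German, Spanish, Portuguese, and Japanese.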

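The model was trained on 174,000 hours of audio drawn from public corpora for ASR and AST, plus synthetic datasets tailored to support Japanese ASR, keyword-biased ASR, and speech translation. Granite Speech 4.1 2B was trained by modality-aligning an intermediate checkpoint of granite-4.0-1b-base to speech, using publicly available open-source corpora that contain audio inputs and text targets. Compared with its predecessor, granite-4.0-1b-speech, this model has the same parameter count (the new naming convention reflects the actual size rather than the base LLM size) and offers additional capabilities and improvements: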

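Higher multilingual ASR transcription accuracy, thanks to a novel dual-head CTC encoder (with both grapheme and BPE outputs) and frame importance sampling that focuses on the informative parts of the audio
Punctuation and capitalization for ASR and AST in all languages (including German noun capitalization), enabled by a simple prompt change
Stronger keyword list biasing for improved recognition of names, acronyms, and technical terms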

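Two additional model variants explore different capabilities and inference optimizations:

granite-speech-4.1-2b-plus adds speaker-attributed ASR and word-level timestamps
granite-speech-4.1-2b-nar introduces a novel non-autoregressive architecture for higher throughput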

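Evaluation:

We evaluated granite-speech-4.1-2b on standard benchmarks against other speech-language models with fewer than 8 billion parameters, as well as dedicated ASR and AST systems. The evaluation covers multiple public benchmarks, with particular emphasis on English ASR, and also includes multilingual ASR and AST for both X-En and En-X translation.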
[Figures: granite-speech-4.1-2b WER comparison (two panels) and BLEU comparison (two panels)]
Performance on the Open ASR Leaderboard (as of April 2026): [figure: rtfx_wer]

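We evaluated the model's keyword list biasing (KWB) capability by comparing performance with and without a keyword list applied at inference time. We report the F1 score of transcribed keywords on ASR tasks, excluding common words from the evaluation. [figure: kwb-f1.v2]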
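A keyword F1 of this kind can be sketched as follows. This is only an illustration, not the evaluation code behind the reported numbers: the function name `keyword_f1`, the whitespace tokenization, the restriction to single-word keywords, and the count-based matching are all assumptions.

```python
from collections import Counter


def keyword_f1(reference: str, hypothesis: str, keywords: set) -> float:
    """F1 over keyword occurrences (hypothetical scoring, single-word keywords only)."""
    kws = {k.lower() for k in keywords}
    # Count how often each keyword occurs in the reference and in the hypothesis.
    ref_counts = Counter(w for w in reference.lower().split() if w in kws)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in kws)
    # Occurrences present in both texts are true positives.
    tp = sum((ref_counts & hyp_counts).values())
    fp = sum(hyp_counts.values()) - tp
    fn = sum(ref_counts.values()) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


ref = "granite guardian runs on blue vela"
hyp = "granite garden runs on blue vela"
# Common words such as "runs" and "on" are left out of the keyword list.
score = keyword_f1(ref, hyp, {"granite", "guardian", "vela"})
print(f"keyword F1 = {score:.2f}")
```

Here two of the three keywords are recovered with no false insertions, so precision is 1, recall is 2/3, and the F1 is 0.8. Running the same scoring on outputs produced with and without a `Keywords:` list in the prompt would give the kind of comparison described above.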

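We also evaluated the model's punctuation and capitalization capabilities on a variety of corpora. We report the metrics defined in LibriSpeech-PC. PER (punctuation error rate) measures insertion, deletion, and substitution errors on punctuation marks (periods, commas, and question marks). Cap-F1 (capitalization F1) measures how accurately the model capitalizes the relevant words in its output. Note that our Cap-F1 is computed over Levenshtein-aligned matched word pairs rather than exactly matched sentences, so it can be evaluated even in the presence of ASR errors.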

Test set	PER (↓)	Cap-F1 (↑)
LScln	25.70	89.71
LSoth	22.27	91.26
VoxPopuli	24.86	95.35
Earnings-22	22.87	95.19
CV-EN	9.13	96.75
CV-DE	3.66	99.50†
CV-ES	11.61	95.68
CV-FR	11.00	97.25
CV-PT	7.86	98.51

† We report a Cap-F1 of 99.5 for German, where nouns must be capitalized.
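The idea of scoring capitalization only on aligned, matching word pairs can be sketched as below. This is a rough illustration under stated assumptions, not the scoring script used for the table: it swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in for a true Levenshtein alignment, and the function name `cap_f1` and the punctuation stripping are hypothetical choices.

```python
from difflib import SequenceMatcher


def cap_f1(reference: str, hypothesis: str) -> float:
    """Capitalization F1 over aligned word pairs that match case-insensitively."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # Align on lowercased words with punctuation stripped, so that ASR errors
    # only remove pairs from the comparison instead of invalidating the sentence.
    matcher = SequenceMatcher(
        a=[w.strip(".,?").lower() for w in ref_words],
        b=[w.strip(".,?").lower() for w in hyp_words],
        autojunk=False,
    )
    tp = fp = fn = 0
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            ref_cap = ref_words[block.a + k][0].isupper()
            hyp_cap = hyp_words[block.b + k][0].isupper()
            tp += ref_cap and hyp_cap
            fp += hyp_cap and not ref_cap
            fn += ref_cap and not hyp_cap
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# "sitz" is an ASR error, so that pair is skipped; "die" misses its capital.
score = cap_f1("Die Katze sitzt in Berlin.", "die Katze sitz in Berlin")
print(f"Cap-F1 = {score:.2f}")
```

In this example four word pairs align, two capitals are reproduced and one is missed, which gives precision 1, recall 2/3, and a Cap-F1 of 0.8 despite the recognition error.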

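Release Date: April 29, 2026

License: Apache 2.0

Supported Languages: English, French, German, Spanish, Portuguese, Japanese

Intended Use: The model is intended for enterprise applications that process speech input. In particular, it is well suited to speech-to-text in English, French, German, Spanish, Portuguese, and Japanese, and to speech translation between these languages and English, as well as from English to Italian and from English to Mandarin.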

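Usage:

Granite Speech models are natively supported in transformers>=4.52.1. Below is a simple example of how to use the granite-speech-4.1-2b model.

Usage with `transformers`

First, make sure to install a recent version of transformers:

pip install transformers torchaudio soundfile

Then run the code: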

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-4.1-2b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)

# Load audio
audio_path = hf_hub_download(repo_id=model_name, filename="multilingual_sample.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16kHz

# Create text prompt
user_prompt = "<|audio|>transcribe the speech with proper punctuation and capitalization."
chat = [
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Run the processor + model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(
    **model_inputs, max_new_tokens=200, do_sample=False, num_beams=1
)

# Transformers includes the input IDs in the response
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = model_outputs[0, num_input_tokens:].unsqueeze(0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0]}")

Usage with `vLLM`

First, make sure vLLM is installed:

pip install vllm

Code for offline mode:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

model_id = "ibm-granite/granite-speech-4.1-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

model = LLM(
    model=model_id,
    max_model_len=2048, # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

Code for online mode:

"""
Launch the vLLM server with the following command:

vllm serve ibm-granite/granite-speech-4.1-2b \
    --api-key token-abc123 \
    --max-model-len 2048
"""

import base64

import requests
from openai import OpenAI

from vllm.assets.audio import AudioAsset

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "ibm-granite/granite-speech-4.1-2b"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url

# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
    """Encode an audio retrieved from a remote url to base64 format."""
    with requests.get(audio_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return result

audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)

question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.2,
    max_tokens=64,
    model=model_name,
)


print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
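The online example fetches its audio from a URL. For audio that is already in memory or on disk, the same message payload can be built from raw bytes, as in the sketch below. The helper name `build_audio_messages` and its `mime` parameter are hypothetical and not part of vLLM or the OpenAI client; only the payload shape is taken from the example above.

```python
import base64


def build_audio_messages(audio_bytes: bytes, question: str, mime: str = "audio/ogg") -> list:
    """Build chat messages that carry base64 audio as a data URL (hypothetical helper)."""
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:{mime};base64,{audio_b64}"},
            },
        ],
    }]


# Two dummy bytes stand in for real audio content here.
messages = build_audio_messages(
    b"\x00\x01", "can you transcribe the speech into a written format?"
)
print(messages[0]["content"][1]["audio_url"]["url"])
```

The returned list can be passed as `messages=` to `client.chat.completions.create(...)` in place of the inline literal, with the bytes read from a local file opened in binary mode.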

Recommended prompts by task:

Task	Prompt	Notes
ASR (raw transcript)	`can you transcribe the speech into a written format?`	Multilingual prompts are supported, e.g. `Pouvez‑vous reconnaître le contenu de la parole ?`
ASR (with punctuation)	`transcribe the speech with proper punctuation and capitalization.`	Non-English ASR requires the English prompt
ASR (with keyword biasing)	`transcribe the speech to text. Keywords: <kw1>, <kw2>, ...`	Non-English ASR requires the English prompt
AST (raw text)	`translate the speech to <language>.`	`<language>` = English, French, German, Spanish, Japanese, Italian, Mandarin
AST (with punctuation)	`translate the speech to <language> with proper punctuation and capitalization.`	English prompts only
AST (with keyword biasing)	`translate the speech to <language>. Keywords: <kw1>, <kw2>, ...`	English prompts only
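The prompt table can be wrapped in a small helper so that task variants are selected programmatically. This is a minimal sketch: the function `build_prompt` and its parameters are hypothetical, and it rejects combining punctuation with keywords only because the table lists them as separate prompts, not because the model is known to refuse the combination.

```python
from typing import Optional, Sequence

AUDIO_TOKEN = "<|audio|>"
AST_LANGUAGES = {"English", "French", "German", "Spanish", "Japanese", "Italian", "Mandarin"}


def build_prompt(task: str, language: Optional[str] = None, punctuate: bool = False,
                 keywords: Optional[Sequence[str]] = None) -> str:
    """Return the user prompt text for a task from the table (hypothetical helper)."""
    if punctuate and keywords:
        raise ValueError("the table lists punctuation and keyword prompts separately")
    if task == "asr":
        if punctuate:
            text = "transcribe the speech with proper punctuation and capitalization."
        elif keywords:
            text = "transcribe the speech to text."
        else:
            text = "can you transcribe the speech into a written format?"
    elif task == "ast":
        if language not in AST_LANGUAGES:
            raise ValueError(f"unsupported target language: {language!r}")
        text = f"translate the speech to {language}"
        text += " with proper punctuation and capitalization." if punctuate else "."
    else:
        raise ValueError(f"unknown task: {task!r}")
    if keywords:
        text += " Keywords: " + ", ".join(keywords)
    # The audio placeholder token precedes the instruction, as in the usage examples.
    return AUDIO_TOKEN + text


print(build_prompt("asr", keywords=["Granite", "Blue Vela"]))
print(build_prompt("ast", language="German", punctuate=True))
```

The returned string can be used as the `content` of the user message before calling `tokenizer.apply_chat_template`, as in the `transformers` example.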

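Model Architecture:

The architecture of granite-speech-4.1-2b consists of the following components:

(1) Speech encoder: 16 Conformer blocks trained with Connectionist Temporal Classification (CTC), with two classification heads (characters and BPE units), on a subset of the training data containing only the ASR corpora (see the configuration below). The character vocabulary consists of the first 256 ASCII entries for the European languages plus a set of 92 phonetic Katakana characters for Japanese, while the BPE units come from the granite 4.0 tokenizer. In addition, our CTC encoder uses block attention over 4-second audio chunks and self-conditioned CTC from the middle layer. The middle layer also provides non-blank probabilities, which are used for frame-level posterior-weighted pooling, with a window size of 4, for BPE classification.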

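Configuration parameter	Value
Input dimension	160 (80 logmels x 2)
Number of layers	16
Hidden dimension	1024
Number of attention heads	8
Attention head size	128
Convolution kernel size	15
Output dimension (characters)	348
Output dimension (BPE)	100353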
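The encoder configuration is internally consistent, which the short sketch below checks. The `EncoderConfig` class and its field names are hypothetical and chosen for readability; they are not the attribute names of the released model configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Values copied from the configuration table (field names are illustrative)."""
    input_dim: int = 160
    num_layers: int = 16
    hidden_dim: int = 1024
    num_heads: int = 8
    head_dim: int = 128
    conv_kernel_size: int = 15
    char_output_dim: int = 348
    bpe_output_dim: int = 100353


cfg = EncoderConfig()

# 8 attention heads of size 128 span the full 1024-dimensional hidden state.
heads_cover_hidden = cfg.num_heads * cfg.head_dim == cfg.hidden_dim
# Two stacked 80-dimensional logmel frames give the 160-dimensional input.
stacked_input_matches = cfg.input_dim == 80 * 2
# 256 ASCII entries plus 92 Katakana characters give the character output size.
char_vocab_matches = cfg.char_output_dim == 256 + 92

print(heads_cover_hidden, stacked_input_matches, char_vocab_matches)
```

All three checks hold for the values in the table.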

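(2) Speech projector and temporal downsampler (speech-text modality adapter): We use a 2-layer window query transformer (q-former) that operates on blocks of 15 1024-dimensional acoustic embeddings output by the last Conformer block of the speech encoder. These embeddings are downsampled by a factor of 5 using 3 trainable queries per block and per layer. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector), giving the LLM acoustic embeddings at a rate of 10 Hz. The projector and the LLM LoRA adapters were trained jointly on all the corpora listed under Training Data.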
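The downsampling arithmetic can be worked through to estimate how many acoustic embeddings the LLM receives for a clip of a given length. The sketch below assumes a 100 Hz input frame rate, which is implied by the stated 10 Hz output and total factor of 10 but not spelled out in the text, and it assumes the last partial block is padded to a full block. The function name is hypothetical.

```python
import math

INPUT_FRAME_RATE_HZ = 100    # assumption: 10 Hz output x total downsampling of 10
ENCODER_DOWNSAMPLING = 2     # encoder contributes 2x
BLOCK_SIZE = 15              # acoustic embeddings per q-former block
QUERIES_PER_BLOCK = 3        # trainable queries per block, i.e. 15 / 3 = 5x


def llm_audio_embeddings(duration_s: float) -> int:
    """Estimate the number of acoustic embeddings passed to the LLM."""
    encoder_frames = math.ceil(duration_s * INPUT_FRAME_RATE_HZ / ENCODER_DOWNSAMPLING)
    # Assumption: a trailing partial block is padded rather than dropped.
    num_blocks = math.ceil(encoder_frames / BLOCK_SIZE)
    return num_blocks * QUERIES_PER_BLOCK


projector_factor = BLOCK_SIZE // QUERIES_PER_BLOCK
total_factor = ENCODER_DOWNSAMPLING * projector_factor
print(total_factor, llm_audio_embeddings(30.0))
```

For a 30-second clip this gives 1500 encoder frames, 100 blocks, and 300 embeddings, which matches 30 s at 10 Hz.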

(3) Large language model: an intermediate checkpoint of granite-4.0-1b-base with a 128k context length (https://huggingface.co/ibm-granite/granite-4.0-1b-base), fine-tuned on all the corpora listed under **Training Data**.

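Training Data:

Overall, our training data consists of two main sources: (1) publicly available datasets and (2) synthetic data created from publicly available datasets, specifically targeting Japanese ASR, keyword-list-prompted ASR, and speech translation. The training datasets are detailed in the table below: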

Name	Task	Hours	Source
CommonVoice-17 English, German, Spanish, French, Portuguese, Japanese	ASR	5700	https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
MLS English, German, Spanish, French, Portuguese	ASR	48000	https://huggingface.co/datasets/facebook/multilingual_librispeech
Librispeech English	ASR	1000	https://huggingface.co/datasets/openslr/librispeech_asr
Librispeech-PC English	ASR	1000	https://huggingface.co/datasets/yoom618/librispeech_pc
LibriHeavy Large English	ASR	46000	https://huggingface.co/datasets/anyspeech/libri-heavy
VoxPopuli English, German, French, Spanish	ASR	1100	https://huggingface.co/datasets/facebook/voxpopuli
VoxPopuli Granary English	ASR	24000	https://huggingface.co/datasets/nvidia/Granary
AMI English	ASR	100	https://huggingface.co/datasets/edinburghcstr/ami
YODAS English	ASR	10000	https://huggingface.co/datasets/espnet/yodas
YODAS Japanese	ASR	1400	https://huggingface.co/datasets/espnet/yodas
Earnings-22 English	ASR	105	https://huggingface.co/datasets/esb/datasets
Switchboard English	ASR	260	https://catalog.ldc.upenn.edu/LDC97S62
CallHome English	ASR	18	https://catalog.ldc.upenn.edu/LDC97T14
Fisher English	ASR	2000	https://catalog.ldc.upenn.edu/LDC2004S13
Voicemail part I English	ASR	40	https://catalog.ldc.upenn.edu/LDC98S77
Voicemail part II English	ASR	40	https://catalog.ldc.upenn.edu/LDC2002S35
ReazonSpeech	ASR	3000	https://huggingface.co/datasets/reazon-research/reazonspeech
Fineweb-2 TTS Japanese	ASR	9600	https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 and Kokoro-82M TTS
CommonVoice-17 German, Spanish, French, Portuguese->English	AST	3000	Translations with Granite-3 and Phi-4
CommonVoice-17 English->German, Spanish, French, Italian, Japanese, Portuguese, Chinese	AST	18000	Translations with Phi-4 and MADLAD
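As a quick cross-check, the per-corpus hours in the table can be summed by task and compared with the roughly 174,000 hours quoted in the model summary. The corpus labels in the sketch are shortened for readability; the hour values are copied from the table.

```python
from collections import defaultdict

# (corpus, task, hours) copied from the training data table.
TRAINING_CORPORA = [
    ("CommonVoice-17 (six languages)", "ASR", 5700),
    ("MLS", "ASR", 48000),
    ("Librispeech", "ASR", 1000),
    ("Librispeech-PC", "ASR", 1000),
    ("LibriHeavy Large", "ASR", 46000),
    ("VoxPopuli", "ASR", 1100),
    ("VoxPopuli Granary", "ASR", 24000),
    ("AMI", "ASR", 100),
    ("YODAS English", "ASR", 10000),
    ("YODAS Japanese", "ASR", 1400),
    ("Earnings-22", "ASR", 105),
    ("Switchboard", "ASR", 260),
    ("CallHome", "ASR", 18),
    ("Fisher", "ASR", 2000),
    ("Voicemail part I", "ASR", 40),
    ("Voicemail part II", "ASR", 40),
    ("ReazonSpeech", "ASR", 3000),
    ("Fineweb-2 TTS Japanese", "ASR", 9600),
    ("CommonVoice-17 X->English", "AST", 3000),
    ("CommonVoice-17 English->X", "AST", 18000),
]

hours_by_task = defaultdict(int)
for _name, task, hours in TRAINING_CORPORA:
    hours_by_task[task] += hours

total_hours = sum(hours_by_task.values())
print(dict(hours_by_task), total_hours)
```

The table sums to 153,363 hours of ASR data and 21,000 hours of AST data, 174,363 hours in total, consistent with the 174,000-hour figure.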

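Infrastructure: We trained Granite Speech on Blue Vela, IBM's supercomputing cluster, which is equipped with NVIDIA H100 GPUs. The cluster provides a scalable and efficient infrastructure for training our models across thousands of GPUs. Training of this particular model was completed in 30 days on 8 H100 GPUs (26 days for the encoder and 4 days for the projector).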

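Ethical Considerations and Limitations:

The use of large speech and language models can raise certain risks and ethical considerations. Although our alignment process includes safety considerations, the model may in some cases produce inaccurate, biased, offensive, or otherwise unwanted responses to user prompts. In addition, it remains uncertain whether smaller models are more prone to hallucination in generation scenarios because of their reduced size, which would limit their ability to produce coherent and contextually accurate responses. This is currently an active area of research, and we anticipate more rigorous exploration, understanding, and mitigation in this area.

IBM recommends using this model for automatic speech recognition and translation tasks. The model's design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply ignores it and performs transcription, which is its default fallback mode. This minimizes the risk of adversarial inputs, unlike integrated models that interpret the audio directly and may be more exposed to such attacks. Note that more general speech tasks may carry a higher inherent risk of triggering unwanted outputs.

To enhance safety, we recommend using granite-speech-4.1-2b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across the key dimensions outlined in the IBM AI Risk Atlas.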

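Resources

📄 Read our paper:
🔧 Notebooks: speculative decoding, fine-tuning on custom data
⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources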

Citation

@misc{granite-speech-4.1-2b,
  title={Granite 4.1 Speech},
  author={IBM Granite Speech Team},
  year={2026},
  url={https://huggingface.co/ibm-granite/granite-speech-4.1-2b}
}
