zehnmindai Gemma 4 (E4B): Uzbek Speech-to-Text (LoRA)

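A LoRA adapter for Gemma 4 (E4B, instruction-tuned), fine-tuned by zehnmindai for Uzbek (uz) automatic speech recognition. Given a clip of Uzbek speech, the model produces a clean text transcript.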

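Developed by: zehnmindai
Model type: LoRA adapter on a multimodal (audio + text) instruction model
Base model: unsloth/gemma-4-e4b-it-unsloth-bnb-4bit (a pre-quantized 4-bit version of unsloth/gemma-4-E4B-it)
Task: automatic speech recognition (audio → Uzbek text)
Language: Uzbek (uz)
License: Apache-2.0 (adapter weights). The base Gemma 4 model remains subject to the Google Gemma Terms of Use.
Training framework: Unsloth + TRL SFTTrainer
Trained 2x faster with Unsloth.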

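Quick start

The quickest way to run inference is through Unsloth, which loads the base model in 4-bit precision and attaches this LoRA adapter in a single call.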

import librosa
from unsloth import FastModel
from transformers import TextStreamer

# 1. Load base + adapter in 4-bit
model, processor = FastModel.from_pretrained(
    model_name = "zehnmindai/gemma_4_uzbek_stt_lora",
    max_seq_length = 8192,
    load_in_4bit = True,
)
FastModel.for_inference(model)

# 2. Load your Uzbek audio (any sample rate; librosa resamples to 16 kHz)
audio_array, _ = librosa.load("your_uzbek_audio.wav", sr = 16000)

# 3. Build the chat template the model was trained on
messages = [
    {
        "role": "system",
        "content": [{"type": "text",
                     "text": "You are an assistant that transcribes speech accurately."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_array},
            {"type": "text",  "text": "Please transcribe this audio."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt = True,
    tokenize = True,
    return_dict = True,
    return_tensors = "pt",
).to("cuda")

# 4. Generate
_ = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample = False,
    streamer = TextStreamer(processor, skip_prompt = True),
)

Expected output for clean speech (example): assalomu alaykum. mening ismim Kamoliddin.

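Intended use

In scope

Transcribing short to medium-length Uzbek speech clips (conversational and read speech).
Building Uzbek voice interfaces, dictation tools, subtitle-generation pipelines, and data-annotation aids.
Research on multimodal Gemma variants for low-resource languages.

Out of scope / not recommended

Text-to-speech or voice cloning (this model performs automatic speech recognition only).
Languages other than Uzbek; the adapter was trained on Uzbek audio only.
Long-form transcription that exceeds the 8,192-token context window without chunking.
Safety-critical, medical, legal, or biometric-identification applications.
Transcribing audio under conditions that differ sharply from the training distribution (heavy code-switching, extreme noise, very low-bitrate telephony) without downstream evaluation.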
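
The chunking mentioned for long recordings can be sketched as follows. This is a minimal illustration, not part of the released adapter: the helper name `chunk_audio`, the 30-second window, and the 1-second overlap are all assumptions chosen for the example, and the right window length for staying inside the 8,192-token context should be checked against your own audio.

```python
from typing import Iterator, Sequence, Tuple


def chunk_audio(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_seconds: float = 30.0,   # hypothetical window length
    overlap_seconds: float = 1.0,  # hypothetical overlap between windows
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start_sample, chunk) pairs that cover `samples` with overlap.

    Works on any sliceable sequence, including the NumPy array that
    librosa.load returns in the quick-start snippet.
    """
    chunk_len = int(chunk_seconds * sample_rate)
    hop = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or hop <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")

    start = 0
    while start < len(samples):
        yield start, samples[start:start + chunk_len]
        if start + chunk_len >= len(samples):
            break  # the current chunk already reaches the end of the audio
        start += hop


if __name__ == "__main__":
    # 65 seconds of silence at 16 kHz as a stand-in for real audio.
    fake_audio = [0.0] * (16000 * 65)
    for start, chunk in chunk_audio(fake_audio):
        print(f"start={start / 16000:.1f}s length={len(chunk) / 16000:.1f}s")
```

Each chunk can then be passed through the quick-start pipeline in place of `audio_array`, and the per-chunk transcripts joined afterwards. Merging text across the overlapping region is left out here and would need its own handling.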

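Training data

The adapter was fine-tuned on a private Uzbek speech corpus curated by zehnmindai, containing approximately 650,429 audio-transcript pairs that span different speakers, recording conditions, and domains. The dataset is not publicly available. Every sample is formatted with the Gemma 4 multimodal chat template:

System: the transcription instruction.
User: {audio} + "Please transcribe this audio."
Assistant: the reference transcript.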
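
As an illustration of the three-turn layout above, the sketch below builds one training conversation as a plain Python structure. The system and user strings are taken from the quick-start snippet; the function name `build_training_messages` and the exact shape of the assistant turn are assumptions, since the training script itself is not published.

```python
from typing import Any, Dict, List

SYSTEM_PROMPT = "You are an assistant that transcribes speech accurately."
USER_PROMPT = "Please transcribe this audio."


def build_training_messages(audio: Any, transcript: str) -> List[Dict[str, Any]]:
    """Return a system/user/assistant conversation for one audio-transcript pair.

    `audio` is whatever the processor accepts for an audio entry, for example
    the 16 kHz array used in the quick-start snippet. The assistant turn layout
    is an assumption that mirrors the other two turns.
    """
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio},
                {"type": "text", "text": USER_PROMPT},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": transcript}]},
    ]


if __name__ == "__main__":
    example = build_training_messages([0.0, 0.0, 0.0], "assalomu alaykum.")
    for turn in example:
        print(turn["role"], [part["type"] for part in turn["content"]])
```

Keeping the inference prompt identical to these turns (minus the assistant reply) is what the quick-start snippet does.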

Training procedure

LoRA configuration

r = 8, lora_alpha = 16, lora_dropout = 0
bias = "none", use_rslora = False
Target modules (language tower): q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Target modules (audio tower): post, linear_start, linear_end, embedding_projection, ffw_layer_1, ffw_layer_2, output_proj
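
To make the effect of these values concrete, the sketch below restates the configuration as a plain dictionary and computes the factor by which the low-rank update is scaled. The dictionary is only a hypothetical mirror of the listed settings, not the author's training script. The scaling rule used is the standard LoRA one (lora_alpha / r), with the rank-stabilized variant (lora_alpha / sqrt(r)) shown for comparison.

```python
import math

# Hypothetical restatement of the settings listed above; key names follow
# the parameter names given in this model card.
lora_config = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "bias": "none",
    "use_rslora": False,
    "target_modules": [
        # language tower
        "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",
        # audio tower
        "post", "linear_start", "linear_end", "embedding_projection",
        "ffw_layer_1", "ffw_layer_2", "output_proj",
    ],
}


def lora_scaling(r: int, lora_alpha: float, use_rslora: bool) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA


if __name__ == "__main__":
    scaling = lora_scaling(
        lora_config["r"], lora_config["lora_alpha"], lora_config["use_rslora"]
    )
    print(f"update scaling: {scaling}")
    print(f"target module names: {len(lora_config['target_modules'])}")
```

With r = 8 and lora_alpha = 16 the update is scaled by 2, so the adapter's contribution is doubled relative to the raw product of its two low-rank matrices.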

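Supervised fine-tuning

Setting	Value
Trainer	TRL `SFTTrainer` (via Unsloth)
Epochs	1
Total steps	81,304
Per-device batch size	8
Gradient accumulation	1
Learning rate	5e-5
Scheduler	cosine, warmup ratio 0.03
Optimizer	`adamw_8bit`
Max sequence length	8,192
Precision	bf16
Seed	3407
Base quantization	4-bit (bitsandbytes, via Unsloth)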
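
The step count in the table follows from the dataset size and batch settings, and the scheduler row fixes the shape of the learning-rate curve. The sketch below works through both under stated assumptions: a single GPU, the last partial batch is kept, warmup steps are rounded up, and the cosine decays to zero at the final step. The function `lr_at` is a hypothetical helper written for this example, not a call into TRL or Transformers.

```python
import math

# Values from the training data section and the table above.
num_samples = 650_429
per_device_batch = 8
grad_accum = 1
epochs = 1
base_lr = 5e-5
warmup_ratio = 0.03
num_gpus = 1  # assumption: training ran on the single GPU listed in this card

# Optimizer steps for the run (assumes the final partial batch is not dropped).
steps_per_epoch = math.ceil(num_samples / (per_device_batch * grad_accum * num_gpus))
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are the ratio times total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("total steps:", total_steps)
    print("warmup steps:", warmup_steps)
    for step in (0, warmup_steps, total_steps // 2, total_steps):
        print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```

Under these assumptions the arithmetic reproduces the 81,304 total steps reported in the table, which suggests the run covered the full corpus once with an effective batch size of 8.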

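Hardware and resource footprint

GPU: NVIDIA GeForce RTX 5090 Laptop GPU (24 GB)
Peak VRAM: about 9.4 GB during training (4-bit base model + LoRA)
Published artifacts: the LoRA adapter only; base model weights are not redistributed. Loading requires network access to unsloth/gemma-4-E4B-it.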

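Evaluation

No formal WER/CER benchmark results are published with this release. Before deployment, users are strongly encouraged to evaluate on held-out data from their target domain (for example, the Uzbek splits of Common Voice or FLEURS) as well as on internally recorded audio.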
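
For readers setting up such an evaluation, the sketch below computes word error rate (WER) and character error rate (CER) with a plain Levenshtein distance. It is a dependency-free illustration with hypothetical function names, and it applies no text normalization; casing and punctuation handling should be decided per project before comparing numbers.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "assalomu alaykum mening ismim Kamoliddin"
    hyp = "assalomu alaykum mening ismi Kamoliddin"
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

Averaging per-utterance scores overweights short clips, so corpus-level figures are usually computed by summing edit distances and reference lengths across the whole test set before dividing.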

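Limitations and bias

Dialect and accent coverage is limited by the training corpus; less common dialects or heavy accents may reduce transcription accuracy.
Noisy audio, far-field microphones, telephony codecs, and frequent code-switching (Uzbek ↔ Russian/English) may degrade quality.
The model inherits the general limitations and biases of Gemma 4; it is not calibrated for factual accuracy and should not be treated as a source of truth beyond the literal transcription task.
Because it is an instruction-tuned LLM with an ASR head, out-of-distribution prompts may produce non-transcription behavior. Keep the chat template shown above.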

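License

The LoRA adapter weights in this repository are released under the Apache License 2.0.

The adapter is not usable without the base model, and the base model (unsloth/gemma-4-e4b-it-unsloth-bnb-4bit, derived from Google's Gemma 4) remains subject to the Gemma Terms of Use, including the Gemma Prohibited Use Policy. By using this adapter together with the base model, you agree to both licenses.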

Citation

If you use this adapter, please cite Gemma and Unsloth, and credit zehnmindai.

@misc{zehnmindai_gemma4_uzbek_stt_lora,
  author       = {zehnmindai},
  title        = {Gemma 4 (E4B) Uzbek Speech-to-Text LoRA},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/zehnmindai/gemma_4_uzbek_stt_lora}},
  note         = {LoRA adapter over unsloth/gemma-4-E4B-it}
}

@misc{gemma_2024,
  author = {Google},
  title  = {Gemma},
  year   = {2024},
  url    = {https://ai.google.dev/gemma}
}

@software{unsloth,
  author = {Daniel Han and Michael Han and Unsloth team},
  title  = {Unsloth},
  year   = {2023},
  url    = {https://github.com/unslothai/unsloth}
}

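Acknowledgements

Google, for releasing the Gemma 4 base models.
Unsloth, for the fast 4-bit fine-tuning framework that made this training run possible on a single laptop GPU.
Hugging Face TRL, for the SFTTrainer.
zehnmindai, for dataset curation, model training, and release.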

Framework versions

PEFT 0.18.1