Ming-omni-tts-0.5B NPU

Ascend NPU adaptation of Ming-omni-tts-0.5B, with inference ported and accuracy validated on the Ascend 910 chip.

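Model Overview

Ming-omni-tts is a high-performance unified audio generation model open-sourced by inclusionAI. It synthesizes speech, ambient sound, and music within a single channel. Built on the BailingMM architecture, it uses a custom 12.5Hz continuous tokenizer and a Patch-by-Patch compression strategy to keep inference efficient while preserving audio naturalness.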

Attribute	Value
Parameters	1.48B
Backbone	Qwen2 (24 layers)
Hidden size	896
Attention heads	14 (KV: 2)
Audio tokenizer	AudioVAE (12.5Hz)
DiT head	8 DiT blocks
Native precision	bfloat16
Task type	text-to-speech
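
As a quick sanity check on the table above, the per-head dimension and the grouped-query ratio follow from the listed values. The sketch below is arithmetic only. The variable names and the `latent_frames` helper are illustrative, and it assumes the usual Qwen2 convention that the head dimension equals the hidden size divided by the number of attention heads, which this README does not state explicitly.

```python
import math

# Values copied from the attribute table.
hidden_size = 896
num_attention_heads = 14
num_kv_heads = 2
latent_rate_hz = 12.5

# Assumption: head_dim = hidden_size / num_attention_heads (standard Qwen2 layout).
head_dim = hidden_size // num_attention_heads

# Number of query heads that share each KV head under grouped-query attention.
queries_per_kv = num_attention_heads // num_kv_heads


def latent_frames(seconds: float) -> int:
    """Hypothetical helper: 12.5Hz latent frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * latent_rate_hz)


print(head_dim, queries_per_kv, latent_frames(10.0))
```

Under these assumptions, ten seconds of audio corresponds to 125 latent frames, which is why the low frame rate keeps the autoregressive sequence short.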

Requirements

Hardware

Item	Specification
NPU model	Ascend 910
Minimum device memory	16 GB (bf16) / 32 GB (fp32)
Driver version	Ascend HDK 23.0+
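
For context on the memory figures, the weights alone occupy much less than the stated minimum. The rest of the budget goes to activations, caches, and runtime workspace. The sketch below estimates the weight footprint only, using the 1.48B parameter count from this README. The function name is illustrative, and the estimate ignores any non-parameter buffers.

```python
PARAM_COUNT = 1.48e9  # parameter count reported in this README


def weight_footprint_gib(bytes_per_param: int, params: float = PARAM_COUNT) -> float:
    """Weights-only memory estimate in GiB (excludes activations and workspace)."""
    return params * bytes_per_param / 2**30


bf16_gib = weight_footprint_gib(2)  # bfloat16 stores 2 bytes per parameter
fp32_gib = weight_footprint_gib(4)  # float32 stores 4 bytes per parameter
print(f"bf16 weights: {bf16_gib:.2f} GiB, fp32 weights: {fp32_gib:.2f} GiB")
```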

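Software Dependencies

# Base framework
pip install torch==2.9.0
pip install torch_npu==2.9.0.post1

# Inference dependencies (quote the specifier so the shell does not treat ">" as a redirect)
pip install "transformers>=4.45.0"
pip install safetensors

# Optional: end-to-end TTS inference
pip install modelscope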

Quick Start

1. Download the model

# Download from ModelScope
pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir ./Ming-omni-tts-0.5B

# Clone the NPU adaptation code
git clone https://gitcode.com/your-org/Ming-omni-tts-0.5B-NPU.git
cd Ming-omni-tts-0.5B-NPU

2. Run inference

import torch
import torch_npu
from configuration_bailingmm import BailingMMConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# Load the model from the current directory (config.json and weights must be present)
config = BailingMMConfig.from_pretrained(".")
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    ".", config=config, torch_dtype=torch.bfloat16,
)

# Move to the NPU
model = model.to("npu:0")
model.eval()

# Inference on random token IDs (a smoke test, not real text input)
input_ids = torch.randint(0, 151936, (1, 32)).to("npu:0")
with torch.no_grad():
    output = model(input_ids=input_ids)
print(f"Output shape: {output.shape}")

3. Run accuracy validation

python inference.py --mode benchmark --device-id 0 --seq-len 32
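
To illustrate how the three flags in the command above might be consumed, here is a minimal `argparse` sketch. It is not the actual `inference.py`. The defaults, the help strings, and the `device_string` helper are assumptions made for illustration, and the real script may accept other modes or options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="NPU inference and accuracy validation")
    # Only "benchmark" appears in this README; other modes are not assumed here.
    parser.add_argument("--mode", default="benchmark", help="run mode, e.g. benchmark")
    parser.add_argument("--device-id", type=int, default=0, help="NPU device index")
    parser.add_argument("--seq-len", type=int, default=32, help="input sequence length")
    return parser


def device_string(device_id: int) -> str:
    """Build the device string used with torch_npu, e.g. 'npu:0'."""
    return f"npu:{device_id}"


args = build_parser().parse_args(
    ["--mode", "benchmark", "--device-id", "0", "--seq-len", "32"]
)
print(args.mode, device_string(args.device_id), args.seq_len)
```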

Model Architecture

Ming-omni-tts-0.5B (BailingMM)
├── model (Qwen2ForCausalLM)        # Text backbone, 24 layers
├── audio                           # Audio VAE
│   ├── encoder (AudioVAEEncoder)   # Mel → 64-dim latent space
│   │   ├── encoder (Qwen2Model)    #   24-layer encoder
│   │   └── aggregator (Qwen2Model) #   4-layer aggregator + CLS token
│   └── decoder (AudioVAEDecoder)   # Latent space → waveform
│       ├── decoder (Qwen2Model)    #   24-layer decoder
│       └── head (ISTFTHead)        #   ISTFT waveform synthesis (n_fft=3528)
├── linear_proj_audio (DiTModel)    # DiT audio refinement, 8 layers
├── spk_head (Linear)               # Speaker embedding projection
├── stop_head (Linear)              # Stop prediction
└── flowloss (FlowLoss)             # CFM training module (kept only for weight loading)

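NPU Adaptation Notes

Adaptation Scope

Component	FP32 accuracy	BF16 inference	Notes
LLM Backbone (Qwen2)	✅	✅	Text encoding, 24-layer autoregressive Transformer
Audio Encoder	✅	✅	Mel-spectrogram encoding, 24+4 Qwen2 layers
DiT Head	✅	✅	8 DiT blocks, latent-space refinement
Audio Decoder	✅	✅	24 Qwen2 layers + ISTFT
CFM Model	✅	-	Training only; not needed at inference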

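Weight Mapping

The module layout was reverse-engineered entirely from the safetensors checkpoint, and all 1.48B parameters load correctly:

No missing weights
No unexpected (unmatched) weights
The DiT MLP uses a ModuleDict so that parameter names match the checkpoint keys ff.0 / ff.2
ISTFT Head: n_fft=3528, hop_length=220, freq_bins=1765
Internal LayerNorm layers use bias=False to match the original model format
The redundant standalone aggregator_norm / decoder_norm were removed (the norm built into Qwen2Model is used instead)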
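
The "no missing, no unexpected" claim above amounts to the model's parameter names and the checkpoint's tensor names being the same set. The sketch below shows that comparison on plain lists of names, plus the one-sided STFT bin count that relates `n_fft` to `freq_bins`. Both function names are illustrative and are not taken from the repository.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key lists, both sorted.

    missing:    names the model defines but the checkpoint lacks
    unexpected: names the checkpoint has but the model does not define
    """
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return sorted(model_set - ckpt_set), sorted(ckpt_set - model_set)


def istft_freq_bins(n_fft: int) -> int:
    """Number of one-sided frequency bins for a real-valued STFT."""
    return n_fft // 2 + 1


# Toy example: naming the MLP layers ff.0 / ff.2 makes the two sets identical.
missing, unexpected = diff_state_dict_keys(
    ["ff.0.weight", "ff.2.weight"], ["ff.0.weight", "ff.2.weight"]
)
print(missing, unexpected, istft_freq_bins(3528))
```

With the real model, the two inputs would presumably be `model.state_dict().keys()` and the tensor names read from `model.safetensors` (for example through `safetensors.safe_open`). Both lists coming back empty is the condition reported above.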

Accuracy Validation

FP32 outputs on Ascend 910 compared against CPU inference results:

Component	Mean Rel Diff (FP32)	Max Abs Diff (FP32)	Verdict
LLM Backbone	0.0403%	0.003502	✅ PASS
Audio Encoder	0.0253%	2.71e-06	✅ PASS
DiT Head	0.0005%	0.000114	✅ PASS

Pass criterion: FP32 mean relative difference < 1%
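
The two metrics in the table can be sketched as follows. This README does not give the exact formula used by `inference.py`, so the definition here (element-wise relative error against the CPU reference, averaged, with a small `eps` to avoid division by zero) is an assumption. The function names are illustrative, and plain Python sequences stand in for flattened tensors.

```python
from typing import Sequence


def mean_rel_diff(test: Sequence[float], ref: Sequence[float], eps: float = 1e-8) -> float:
    """Assumed definition: mean of |test - ref| / (|ref| + eps) over all elements."""
    return sum(abs(t - r) / (abs(r) + eps) for t, r in zip(test, ref)) / len(ref)


def max_abs_diff(test: Sequence[float], ref: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(t - r) for t, r in zip(test, ref))


def passes(test: Sequence[float], ref: Sequence[float], tol: float = 0.01) -> bool:
    """Apply the criterion stated above: mean relative difference below 1%."""
    return mean_rel_diff(test, ref) < tol


cpu_out = [1.0, 2.0, 4.0]      # stand-in for the CPU reference output
npu_out = [1.001, 2.0, 3.996]  # stand-in for the NPU output
print(mean_rel_diff(npu_out, cpu_out), max_abs_diff(npu_out, cpu_out), passes(npu_out, cpu_out))
```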

Performance

Inference latency on Ascend 910 versus CPU (Intel Xeon):

Component	CPU (fp32)	NPU (bf16)	Speedup
LLM Backbone	1.022s	0.023s	43.6x
Audio Encoder	1.629s	0.030s	54.8x
DiT Head	0.115s	0.004s	30.7x

Test conditions:

NPU: Ascend 910, bf16, seq_len=32
CPU: Intel Xeon, fp32
Test date: 2025-05-15
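
Latency figures like those above depend on how the timing is done, because accelerator kernels run asynchronously. The sketch below shows one common pattern: warm-up runs, a device synchronization before and after the timed loop, and an average over several iterations. It is not the benchmark code from `inference.py`. The function names, warm-up count, and iteration count are assumptions.

```python
import time
from typing import Callable, Optional


def time_call(
    fn: Callable[[], object],
    *,
    warmup: int = 3,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Average wall-clock seconds per call of `fn`.

    `sync` should block until queued device work has finished; pass None on CPU.
    """
    for _ in range(warmup):  # warm-up runs absorb one-time compilation and caching
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for asynchronous kernels before stopping the clock
    return (time.perf_counter() - start) / iters


def speedup(cpu_seconds: float, npu_seconds: float) -> float:
    """Ratio of CPU latency to NPU latency."""
    return cpu_seconds / npu_seconds


calls = []
avg = time_call(lambda: calls.append(1), warmup=2, iters=5)
print(f"{avg:.6f}s per call over {len(calls)} total calls")
```

On an NPU, `sync` would typically be `torch.npu.synchronize`. Recomputing a ratio from the rounded table values gives a slightly different number than the speedup column, presumably because the log divides unrounded timings.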

Test Log

See npu_benchmark.log for the full test log.

======================================================================
  SUMMARY - FP32 Accuracy (NPU vs CPU)
======================================================================
  LLM_Backbone         | FP32 mean rel diff: 0.00040307 (0.0403%) | BF16 speedup: 43.64x | PASS
  AudioEncoder         | FP32 mean rel diff: 0.00025325 (0.0253%) | BF16 speedup: 54.78x | PASS
  DiT_Head             | FP32 mean rel diff: 0.00000472 (0.0005%) | BF16 speedup: 30.74x | PASS

Overall FP32 accuracy: ALL PASSED
Accuracy tolerance: FP32 mean relative difference < 1%
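
For automated checks, the summary lines above can be parsed into structured records. The sketch below is a hypothetical helper that is not part of the repository. It assumes every component line follows the exact layout shown in the log, and it treats any verdict other than `PASS` as a failure, although only `PASS` appears in this README.

```python
import re
from typing import Dict

# Matches lines such as:
#   LLM_Backbone | FP32 mean rel diff: 0.00040307 (0.0403%) | BF16 speedup: 43.64x | PASS
SUMMARY_LINE = re.compile(
    r"^\s*(?P<name>\w+)\s*\|\s*FP32 mean rel diff:\s*(?P<rel>[\d.eE+-]+)"
    r"\s*\(\S+%\)\s*\|\s*BF16 speedup:\s*(?P<speedup>[\d.]+)x"
    r"\s*\|\s*(?P<verdict>\w+)\s*$"
)


def parse_summary(text: str) -> Dict[str, dict]:
    """Map each component name to its mean relative diff, speedup, and verdict."""
    rows = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:  # header, separator, and footer lines simply do not match
            rows[match["name"]] = {
                "mean_rel_diff": float(match["rel"]),
                "bf16_speedup": float(match["speedup"]),
                "passed": match["verdict"] == "PASS",
            }
    return rows


sample = """\
  SUMMARY - FP32 Accuracy (NPU vs CPU)
  LLM_Backbone         | FP32 mean rel diff: 0.00040307 (0.0403%) | BF16 speedup: 43.64x | PASS
  DiT_Head             | FP32 mean rel diff: 0.00000472 (0.0005%) | BF16 speedup: 30.74x | PASS
"""
rows = parse_summary(sample)
print(rows)
```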

File List

Ming-omni-tts-0.5B-NPU/
├── inference.py                    # NPU inference and accuracy validation script
├── modeling_bailingmm.py           # Model architecture implementation
├── configuration_bailingmm.py      # Model configuration class
├── config.json                     # Model configuration parameters
├── model.safetensors               # Model weights (symlink)
├── tokenizer.json                  # Tokenizer
├── vocab.json                      # Vocabulary
├── merges.txt                      # BPE merge rules
├── special_tokens_map.json         # Special token map
├── tokenizer_config.json           # Tokenizer configuration
├── chat_template.jinja             # Chat template
├── npu_evaluation_report.json      # Accuracy evaluation report
├── npu_benchmark.log               # Run log
└── README.md                       # This document

Acknowledgements

Original model: inclusionAI/Ming-omni-tts-0.5B
Ascend NPU adaptation: Ascend 910 + torch_npu
Reference project: Ascend-SACT

License

Apache 2.0