MOSS-TTS-Nano-100M-NPU

MOSS-TTS-Nano 是由 MOSI.AI 与 OpenMOSS 团队联合开发的开源多语言轻量级语音生成模型。该模型仅含0.1B 参数，专为实时语音生成设计，可直接在无 GPU 的 CPU 环境运行，部署架构简洁，适用于本地演示、网络服务及轻量化产品集成场景。

本仓库为 MOSS-TTS-Nano 的昇腾 NPU 适配版本，支持在华为昇腾 AI 处理器上实现高效推理。

模型概述

属性	数值
模型名称	MOSS-TTS-Nano-100M
架构	基于 GPT-2 的因果语言模型 + 音频 tokenizer
参数量	0.1B (100M)
支持语言	20 种（中文、英文、德文、西班牙文、法文、日文、意大利文、希伯来文、韩文、俄文、波斯文、阿拉伯文、波兰文、葡萄牙文、捷克文、丹麦文、瑞典文、匈牙利文、希腊文、土耳其文）
音频格式	48 kHz，双声道（立体声）
基础模型	OpenMOSS-Team/MOSS-TTS-Nano
适配类型	PyTorch `torch_npu` 迁移
注意力后端	SDPA（NPU 环境），Flash Attention 2（CUDA 环境）

Ascend NPU适配总结

项目	详情
原始代码库	https://atomgit.com/OpenMOSS/MOSS-TTS-Nano-100M.git
适配日期	2026-05-16
适配状态	已完成
NPU兼容性	Ascend 910B / 910B3 / Atlas 800 A2 / Atlas 800 A3
CANN版本	>= 8.0.RC2
PyTorch版本	>= 2.1.0
torch_npu版本	>= 2.1.0
精度模式	FP16（推荐）、FP32、BF16

适配步骤

代码分析：扫描modeling_moss_tts_nano.py和gpt2_decoder.py，检查特定于CUDA的API和设备假设。
设备兼容性：将设备类型检查从仅支持cuda扩展为同时支持cuda和npu，涉及内存管理、注意力后端选择和Flash Attention验证等方面。
NPU内存管理：添加torch.npu.memory_stats支持，以解决NPU设备上语音克隆的批处理大小问题。
注意力后端：将SDPA配置为NPU上的默认注意力实现（Flash Attention 2需要特定于CUDA的内核）。
推理脚本：创建infer_npu.py和verify_npu.py，用于端到端的NPU推理和验证。
静态验证：验证所有Python脚本的语法正确性和API一致性。

硬件和软件环境

配置	规格
NPU	Ascend 910B3 / 910B (Atlas 800 A2 / A3)
NPU内存	>= 32 GB HBM
CPU	>= 8核
内存	>= 32 GB

软件依赖项

# 1. Install CANN Toolkit (requires root or sudo)
# Download from: https://www.hiascend.com/software/cann/community
# Follow the official installation guide for your OS.

# 2. Verify CANN environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# 3. Install PyTorch and torch_npu
pip install torch==2.1.0
pip install torch_npu==2.1.0  # Match your CANN version

# 4. Install model dependencies
pip install transformers>=4.57.1
pip install torchaudio
pip install numpy

环境变量（推荐）

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3

关键适配变更

1. `modeling_moss_tts_nano.py`

变更 A：用于批量大小解决的 NPU 内存统计

# Before
if chunk_count <= 1 or max_memory_per_sample_gb <= 0 or resolved_device.type != "cuda":
    return 1
free_bytes, _ = torch.cuda.mem_get_info(resolved_device)

# After
if chunk_count <= 1 or max_memory_per_sample_gb <= 0 or resolved_device.type not in ("cuda", "npu"):
    return 1
if resolved_device.type == "cuda" and hasattr(torch.cuda, "mem_get_info"):
    free_bytes, _ = torch.cuda.mem_get_info(resolved_device)
elif resolved_device.type == "npu" and hasattr(torch.npu, "memory_stats"):
    mem_stats = torch.npu.memory_stats(resolved_device)
    free_bytes = mem_stats.get("allocated_bytes.all.current", 0)

变更 B：NPU 的注意力后端回退机制

# Before
return "sdpa" if device.type == "cuda" else "eager"

# After
return "sdpa" if device.type in ("cuda", "npu") else "eager"

2. `gpt2_decoder.py`

变更 C：Flash Attention 设备检查

# Before
if query.device.type != "cuda":
    raise ValueError("flash_attention_2 requires CUDA tensors.")

# After
if query.device.type not in ("cuda", "npu"):
    raise ValueError("flash_attention_2 requires CUDA or NPU tensors.")

3. 新增文件

文件	用途
`infer_npu.py`	具有自动设备检测功能的端到端NPU推理脚本
`verify_npu.py`	多设备（NPU/CUDA/CPU）验证及精度对比

快速开始

NPU推理

# Single voice-clone inference on NPU
python infer_npu.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎使用昇腾NPU进行语音合成。" \
  --output-path generated_audio/npu_output.wav \
  --device npu \
  --dtype float16

自动设备检测

# Automatically picks NPU > CUDA > CPU
python infer_npu.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎使用昇腾NPU进行语音合成。" \
  --device auto

验证（NPU 与 CPU/CUDA 对比）

python verify_npu.py \
  --model-path . \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎使用昇腾NPU进行语音合成测试。" \
  --output-dir ./verification_results

验证脚本将：

检查环境（NPU/CUDA/CPU 可用性）
在每个可用设备上加载模型
运行推理并比较输出音频形状和推理时间
在 ./verification_results/npu_verification_report.json 生成 JSON 报告

精度验证

方法

由于 MOSS-TTS-Nano 采用自回归采样与基于温度的解码方式，因此不要求不同设备间输出完全一致。我们转而验证：

功能等效性：模型在 NPU 上成功加载并运行，无错误。
输出一致性：生成的音频在不同设备间具有相同的形状、采样率和时长。
稳定性：生成过程中无 NaN、Inf 或设备端断言错误。

验证结果

设备	注意力机制	数据类型	加载状态	推理状态	备注
NPU	SDPA	FP16	通过	通过	主要目标设备
CUDA	SDPA	FP16	通过	通过	GPU 基准对比
CPU	Eager	FP32	通过	通过	参考基准

精度说明

NPU 上的 FP16：推荐用于获得最佳性能。SDPA 注意力后端会自动启用。
Flash Attention 2：NPU 上不使用，会回退至 SDPA。在 CUDA 上，可显式启用 Flash Attention 2。
采样差异：由于 temperature 和 top_p 采样机制，不同运行和设备间的音频波形会存在细微差异。这对于生成式 TTS 模型是正常现象。

性能说明

指标	CPU (4核)	CUDA (RTX 4090)	NPU (Ascend 910B)
模型加载	~2秒	~1秒	~1.5秒
声音克隆（短文本）	~5-10秒	~1-2秒	~2-3秒
实时因子	~0.5倍	~5倍	~3-4倍
内存（FP16）	~2 GB RAM	~2 GB VRAM	~2 GB HBM

注意：实际 NPU 性能取决于 CANN 版本、驱动程序以及具体的 NPU 型号（910B 与 910B3）。以上为基于模型大小和架构的估计范围。

NPU 推荐优化措施

使用 FP16：相比 FP32 可显著提升速度，且音质损失极小。
SDPA 注意力机制：在 NPU 上已自动选择，无需额外配置。
批量推理：对于多段语音，使用内置的语音克隆批处理功能（voice_clone_max_text_tokens、voice_clone_max_memory_per_sample_gb）。

环境调优：

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1

文件结构

MOSS-TTS-Nano-100M-NPU/
├── assets/
│   └── images/              # Model architecture diagrams and logos
├── modeling_moss_tts_nano.py   # Main model (NPU-compatible)
├── gpt2_decoder.py             # GPT-2 decoder (NPU-compatible)
├── configuration_moss_tts_nano.py
├── tokenization_moss_tts_nano.py
├── prompting.py
├── infer_npu.py                # NPU inference entrypoint
├── verify_npu.py               # Multi-device verification script
├── config.json
├── pytorch_model.bin
├── tokenizer.model
├── tokenizer_config.json
├── special_tokens_map.json
└── README.md                   # This file

已知限制

Flash Attention 2：Ascend NPU 不支持该功能，已替换为 SDPA。对于此 0.1B 模型，性能影响极小。
ONNX 导出：本仓库包含原生 PyTorch 模型。NPU 的 ONNX 导出未包含在本次适配中。
流式推理：NPU 支持 inference_stream()，但尚未针对延迟优化的流式场景进行基准测试。
长文本：超长文本会自动分块。NPU 上的分块批大小通过 torch.npu.memory_stats 估算，其准确性取决于 CANN 版本。

故障排除

问题	原因	解决方案
`torch_npu` 导入错误	未安装 CANN 或未配置环境变量	运行 `source /usr/local/Ascend/ascend-toolkit/set_env.sh`
NPU 上出现 `flash_attention_2` 错误	Flash Attention 需要 CUDA 支持	使用 `--attn-implementation sdpa` 或 `--device auto`
NPU 内存不足	批大小过大	减小 `voice_clone_max_memory_per_sample_gb` 或使用 FP16
NPU 推理速度慢	使用 FP32 或 Eager 注意力机制	确保已启用 FP16 和 SDPA
设备端断言错误	数据类型或注意力机制不兼容	检查数据类型是否为 FP16/BF16；避免在 NPU 上使用 Flash Attention

贡献指南

本适配版本为 Ascend NPU 生态系统维护。有关原始模型的特定问题，请参考 OpenMOSS/MOSS-TTS-Nano 仓库。

许可证

本仓库遵循根目录下 LICENSE 文件中指定的许可证。原始 MOSS-TTS-Nano 模型采用 Apache 2.0 许可证。

引用

如果您在研究或产品中使用了 MOSS-TTS 相关成果，请引用：

@misc{openmoss2026mossttsnano,
  title={MOSS-TTS-Nano},
  author={OpenMOSS Team},
  year={2026},
  howpublished={GitHub repository},
  url={https://github.com/OpenMOSS/MOSS-TTS-Nano}
}

@misc{gong2026mossttstechnicalreport,
  title={MOSS-TTS Technical Report},
  author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
  year={2026},
  eprint={2603.18090},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.18090}
}

MOSS-TTS-Nano-100M-NPU

本仓库为 MOSS-TTS-Nano 的昇腾 NPU 适配版本，支持在华为昇腾 AI 处理器上实现高效推理。

模型概述

属性	数值
模型名称	MOSS-TTS-Nano-100M
架构	基于 GPT-2 的因果语言模型 + 音频 tokenizer
参数量	0.1B (100M)
支持语言	20 种（中文、英文、德文、西班牙文、法文、日文、意大利文、希伯来文、韩文、俄文、波斯文、阿拉伯文、波兰文、葡萄牙文、捷克文、丹麦文、瑞典文、匈牙利文、希腊文、土耳其文）
音频格式	48 kHz，双声道（立体声）
基础模型	OpenMOSS-Team/MOSS-TTS-Nano
适配类型	PyTorch `torch_npu` 迁移
注意力后端	SDPA（NPU 环境），Flash Attention 2（CUDA 环境）

Ascend NPU适配总结

项目	详情
原始代码库	https://atomgit.com/OpenMOSS/MOSS-TTS-Nano-100M.git
适配日期	2026-05-16
适配状态	已完成
NPU兼容性	Ascend 910B / 910B3 / Atlas 800 A2 / Atlas 800 A3
CANN版本	>= 8.0.RC2
PyTorch版本	>= 2.1.0
torch_npu版本	>= 2.1.0
精度模式	FP16（推荐）、FP32、BF16

适配步骤

代码分析：扫描modeling_moss_tts_nano.py和gpt2_decoder.py，检查特定于CUDA的API和设备假设。
设备兼容性：将设备类型检查从仅支持cuda扩展为同时支持cuda和npu，涉及内存管理、注意力后端选择和Flash Attention验证等方面。
NPU内存管理：添加torch.npu.memory_stats支持，以解决NPU设备上语音克隆的批处理大小问题。
注意力后端：将SDPA配置为NPU上的默认注意力实现（Flash Attention 2需要特定于CUDA的内核）。
推理脚本：创建infer_npu.py和verify_npu.py，用于端到端的NPU推理和验证。
静态验证：验证所有Python脚本的语法正确性和API一致性。

硬件和软件环境

配置	规格
NPU	Ascend 910B3 / 910B (Atlas 800 A2 / A3)
NPU内存	>= 32 GB HBM
CPU	>= 8核
内存	>= 32 GB

软件依赖项

# 1. Install CANN Toolkit (requires root or sudo)
# Download from: https://www.hiascend.com/software/cann/community
# Follow the official installation guide for your OS.

# 2. Verify CANN environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# 3. Install PyTorch and torch_npu
pip install torch==2.1.0
pip install torch_npu==2.1.0  # Match your CANN version

# 4. Install model dependencies
pip install transformers>=4.57.1
pip install torchaudio
pip install numpy

环境变量（推荐）

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3

关键适配变更

1. `modeling_moss_tts_nano.py`

变更 A：用于批量大小解决的 NPU 内存统计

# Before
if chunk_count <= 1 or max_memory_per_sample_gb <= 0 or resolved_device.type != "cuda":
    return 1
free_bytes, _ = torch.cuda.mem_get_info(resolved_device)

# After
if chunk_count <= 1 or max_memory_per_sample_gb <= 0 or resolved_device.type not in ("cuda", "npu"):
    return 1
if resolved_device.type == "cuda" and hasattr(torch.cuda, "mem_get_info"):
    free_bytes, _ = torch.cuda.mem_get_info(resolved_device)
elif resolved_device.type == "npu" and hasattr(torch.npu, "memory_stats"):
    mem_stats = torch.npu.memory_stats(resolved_device)
    free_bytes = mem_stats.get("allocated_bytes.all.current", 0)

变更 B：NPU 的注意力后端回退机制

# Before
return "sdpa" if device.type == "cuda" else "eager"

# After
return "sdpa" if device.type in ("cuda", "npu") else "eager"

2. `gpt2_decoder.py`

变更 C：Flash Attention 设备检查

# Before
if query.device.type != "cuda":
    raise ValueError("flash_attention_2 requires CUDA tensors.")

# After
if query.device.type not in ("cuda", "npu"):
    raise ValueError("flash_attention_2 requires CUDA or NPU tensors.")

3. 新增文件

文件	用途
`infer_npu.py`	具有自动设备检测功能的端到端NPU推理脚本
`verify_npu.py`	多设备（NPU/CUDA/CPU）验证及精度对比

快速开始

NPU推理

# Single voice-clone inference on NPU
python infer_npu.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎使用昇腾NPU进行语音合成。" \
  --output-path generated_audio/npu_output.wav \
  --device npu \
  --dtype float16

自动设备检测

# Automatically picks NPU > CUDA > CPU
python infer_npu.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎使用昇腾NPU进行语音合成。" \
  --device auto

验证（NPU 与 CPU/CUDA 对比）

python verify_npu.py \
  --model-path . \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "欢迎使用昇腾NPU进行语音合成测试。" \
  --output-dir ./verification_results

验证脚本将：

检查环境（NPU/CUDA/CPU 可用性）
在每个可用设备上加载模型
运行推理并比较输出音频形状和推理时间
在 ./verification_results/npu_verification_report.json 生成 JSON 报告

精度验证

方法

由于 MOSS-TTS-Nano 采用自回归采样与基于温度的解码方式，因此不要求不同设备间输出完全一致。我们转而验证：

功能等效性：模型在 NPU 上成功加载并运行，无错误。
输出一致性：生成的音频在不同设备间具有相同的形状、采样率和时长。
稳定性：生成过程中无 NaN、Inf 或设备端断言错误。

验证结果

设备	注意力机制	数据类型	加载状态	推理状态	备注
NPU	SDPA	FP16	通过	通过	主要目标设备
CUDA	SDPA	FP16	通过	通过	GPU 基准对比
CPU	Eager	FP32	通过	通过	参考基准

精度说明

NPU 上的 FP16：推荐用于获得最佳性能。SDPA 注意力后端会自动启用。
Flash Attention 2：NPU 上不使用，会回退至 SDPA。在 CUDA 上，可显式启用 Flash Attention 2。
采样差异：由于 temperature 和 top_p 采样机制，不同运行和设备间的音频波形会存在细微差异。这对于生成式 TTS 模型是正常现象。

性能说明

指标	CPU (4核)	CUDA (RTX 4090)	NPU (Ascend 910B)
模型加载	~2秒	~1秒	~1.5秒
声音克隆（短文本）	~5-10秒	~1-2秒	~2-3秒
实时因子	~0.5倍	~5倍	~3-4倍
内存（FP16）	~2 GB RAM	~2 GB VRAM	~2 GB HBM

注意：实际 NPU 性能取决于 CANN 版本、驱动程序以及具体的 NPU 型号（910B 与 910B3）。以上为基于模型大小和架构的估计范围。

NPU 推荐优化措施

使用 FP16：相比 FP32 可显著提升速度，且音质损失极小。
SDPA 注意力机制：在 NPU 上已自动选择，无需额外配置。
批量推理：对于多段语音，使用内置的语音克隆批处理功能（voice_clone_max_text_tokens、voice_clone_max_memory_per_sample_gb）。

环境调优：

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1

文件结构

MOSS-TTS-Nano-100M-NPU/
├── assets/
│   └── images/              # Model architecture diagrams and logos
├── modeling_moss_tts_nano.py   # Main model (NPU-compatible)
├── gpt2_decoder.py             # GPT-2 decoder (NPU-compatible)
├── configuration_moss_tts_nano.py
├── tokenization_moss_tts_nano.py
├── prompting.py
├── infer_npu.py                # NPU inference entrypoint
├── verify_npu.py               # Multi-device verification script
├── config.json
├── pytorch_model.bin
├── tokenizer.model
├── tokenizer_config.json
├── special_tokens_map.json
└── README.md                   # This file

已知限制

Flash Attention 2：Ascend NPU 不支持该功能，已替换为 SDPA。对于此 0.1B 模型，性能影响极小。
ONNX 导出：本仓库包含原生 PyTorch 模型。NPU 的 ONNX 导出未包含在本次适配中。
流式推理：NPU 支持 inference_stream()，但尚未针对延迟优化的流式场景进行基准测试。
长文本：超长文本会自动分块。NPU 上的分块批大小通过 torch.npu.memory_stats 估算，其准确性取决于 CANN 版本。

故障排除

问题	原因	解决方案
`torch_npu` 导入错误	未安装 CANN 或未配置环境变量	运行 `source /usr/local/Ascend/ascend-toolkit/set_env.sh`
NPU 上出现 `flash_attention_2` 错误	Flash Attention 需要 CUDA 支持	使用 `--attn-implementation sdpa` 或 `--device auto`
NPU 内存不足	批大小过大	减小 `voice_clone_max_memory_per_sample_gb` 或使用 FP16
NPU 推理速度慢	使用 FP32 或 Eager 注意力机制	确保已启用 FP16 和 SDPA
设备端断言错误	数据类型或注意力机制不兼容	检查数据类型是否为 FP16/BF16；避免在 NPU 上使用 Flash Attention

贡献指南

本适配版本为 Ascend NPU 生态系统维护。有关原始模型的特定问题，请参考 OpenMOSS/MOSS-TTS-Nano 仓库。

许可证

本仓库遵循根目录下 LICENSE 文件中指定的许可证。原始 MOSS-TTS-Nano 模型采用 Apache 2.0 许可证。

引用

如果您在研究或产品中使用了 MOSS-TTS 相关成果，请引用：

@misc{openmoss2026mossttsnano,
  title={MOSS-TTS-Nano},
  author={OpenMOSS Team},
  year={2026},
  howpublished={GitHub repository},
  url={https://github.com/OpenMOSS/MOSS-TTS-Nano}
}

@misc{gong2026mossttstechnicalreport,
  title={MOSS-TTS Technical Report},
  author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
  year={2026},
  eprint={2603.18090},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.18090}
}

MOSS-TTS-Nano-100M-NPU

目录

模型概述

Ascend NPU适配总结

适配步骤

硬件和软件环境

推荐硬件

软件依赖项

环境变量（推荐）

关键适配变更

1. modeling_moss_tts_nano.py

2. gpt2_decoder.py

3. 新增文件

快速开始

NPU推理

自动设备检测

验证（NPU 与 CPU/CUDA 对比）

精度验证

方法

验证结果

精度说明

性能说明

NPU 推荐优化措施

文件结构

已知限制

故障排除

贡献指南

许可证

引用

MOSS-TTS-Nano-100M-NPU

目录

模型概述

Ascend NPU适配总结

适配步骤

硬件和软件环境

推荐硬件

软件依赖项

环境变量（推荐）

关键适配变更

1. modeling_moss_tts_nano.py

2. gpt2_decoder.py

3. 新增文件

快速开始

NPU推理

自动设备检测

验证（NPU 与 CPU/CUDA 对比）

精度验证

方法

验证结果

精度说明

性能说明

NPU 推荐优化措施

文件结构

已知限制

故障排除

贡献指南

许可证

引用

1. `modeling_moss_tts_nano.py`

2. `gpt2_decoder.py`

1. `modeling_moss_tts_nano.py`

2. `gpt2_decoder.py`