Fish Audio S2 Pro Ascend NPU Adaptation and Deployment Guide

Fish Audio S2 Pro (velvet-eagle): inference deployment of a Dual-AR TTS model on Ascend NPUs

Original model: fishaudio/s2-pro | Technical report: arXiv:2603.08823

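Model overview

Fish Audio S2 Pro (codename velvet-eagle) is a dual-autoregressive (Dual-AR) Transformer TTS model from Fish Audio. It supports high-quality speech synthesis in 80+ languages and offers fine-grained inline control.

Attribute	Description
Model name	Fish Audio S2 Pro (velvet-eagle)
Task	Text-to-Speech (TTS)
Architecture	Dual-autoregressive (Dual-AR) Transformer
Slow AR	4B parameters; predicts the main semantic codebook along the time axis
Fast AR	400M parameters; predicts the remaining 9 residual codebooks
Audio codec	RVQ (10 codebooks, ~21 Hz frame rate)
Supported languages	80+ (ja, en, zh, ko, es, pt, ar, ru, fr, de ...)
Key capabilities	Fine-grained inline control ([tag] syntax), multi-speaker multi-turn generation, low-latency streaming inference

This repository provides a complete adaptation for the Ascend NPU platform, supporting single-card and multi-card inference on Atlas 800I A2 (910B) series NPUs.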
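
The division of labor between the two decoders in the attribute table can be pictured as a nested loop: the slow AR advances along the time axis and the fast AR fills in the remaining codebooks of each frame. The sketch below is a toy illustration only. `slow_ar_step` and `fast_ar_step` are hypothetical stand-ins that return random codes, and the codebook size of 1024 is an assumption, not a value taken from the model config.

```python
import random

NUM_CODEBOOKS = 10      # from the attribute table: 1 semantic + 9 residual
FRAME_RATE_HZ = 21      # approximate frame rate from the attribute table
CODEBOOK_SIZE = 1024    # assumption for illustration only


def slow_ar_step(history, rng):
    """Hypothetical stand-in for the slow AR: one semantic code per frame."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_code, residuals_so_far, rng):
    """Hypothetical stand-in for the fast AR: next residual code in the frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(duration_s, seed=0):
    """Return a list of frames, each holding NUM_CODEBOOKS integer codes."""
    rng = random.Random(seed)
    num_frames = round(duration_s * FRAME_RATE_HZ)
    frames = []
    for _ in range(num_frames):
        # Outer loop (time axis): the slow AR emits the main semantic code.
        semantic = slow_ar_step(frames, rng)
        # Inner loop (codebook axis): the fast AR fills in 9 residual codes.
        residuals = []
        for _ in range(NUM_CODEBOOKS - 1):
            residuals.append(fast_ar_step(semantic, residuals, rng))
        frames.append([semantic] + residuals)
    return frames


frames = generate_frames(duration_s=2.0)
print(len(frames), "frames x", len(frames[0]), "codes per frame")
```

At roughly 21 frames per second and 10 codes per frame, one second of audio corresponds to about 210 codes under these assumptions.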

Hardware requirements

Component	Minimum	Recommended
NPU	1× Ascend910B (64GB HBM)	8× Ascend910B
CPU	4-core ARM64	8+ cores
Memory	32GB	64GB+
Storage	10GB (model weights)	20GB+
OS	openEuler 22.03 / Ubuntu 20.04+	openEuler 22.03

Software environment

Software	Version	Notes
CANN	8.0.RC2+	Ascend AI compute architecture
torch	2.9.0+cpu	ARM64 build
torch_npu	2.9.0	Ascend PyTorch plugin
transformers	4.38.0+	HuggingFace model library
safetensors	latest	Safe weight loading
Python	3.10+	-

Quick start

1. Prepare the environment

# Verify that NPUs are available
python -c "import torch; import torch_npu; print(torch.npu.device_count())"
# Output: 8

# Verify the NPU model
python -c "import torch; import torch_npu; print(torch.npu.get_device_name(0))"
# Output: Ascend910B2C
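
The two one-liners above fail with a traceback when torch or torch_npu is missing. As a gentler alternative, the following minimal sketch probes for both packages before touching `torch.npu`. The function name `probe_npu` and the shape of the returned dict are illustrative choices, not part of this repository.

```python
import importlib.util


def probe_npu():
    """Report whether torch / torch_npu are importable and how many NPUs are visible."""
    info = {"torch": False, "torch_npu": False, "device_count": 0, "device_name": None}
    if importlib.util.find_spec("torch") is None:
        return info
    info["torch"] = True
    if importlib.util.find_spec("torch_npu") is None:
        return info
    info["torch_npu"] = True

    import torch
    import torch_npu  # noqa: F401  (importing it makes torch.npu available)

    info["device_count"] = torch.npu.device_count()
    if info["device_count"] > 0:
        info["device_name"] = torch.npu.get_device_name(0)
    return info


if __name__ == "__main__":
    print(probe_npu())
```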

2. Install dependencies

cd fish-audio-s2-pro-ascend
pip install -r requirements.txt

3. Download the model weights

# Download from ModelScope
modelscope download --model fishaudio/s2-pro --local_dir ./weights

Model weight files:

File	Size	Purpose
`weights/model-00001-of-00002.safetensors`	~6GB	Model weights (shard 1)
`weights/model-00002-of-00002.safetensors`	~3GB	Model weights (shard 2)
`weights/codec.pth`	~50MB	RVQ decoder weights
`weights/config.json`	~1KB	Model configuration
`weights/tokenizer_config.json`	~2KB	Tokenizer configuration
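
After the download it can be useful to confirm that every file in the list above is present. The following is a small sketch using only the standard library; the helper name `missing_weight_files` is hypothetical, and the check covers existence and non-zero size only, not checksums.

```python
from pathlib import Path

# File names taken from the weight file list above.
EXPECTED_FILES = [
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "codec.pth",
    "config.json",
    "tokenizer_config.json",
]


def missing_weight_files(weights_dir):
    """Return the expected file names that are absent or empty in weights_dir."""
    root = Path(weights_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_weight_files("./weights")
    print("all files present" if not missing else f"missing: {missing}")
```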

4. Run inference

# Single-card NPU inference
python inference.py \
  --text "你好，欢迎使用 Fish Audio S2 Pro 昇腾适配版。[whisper]这是耳语效果演示[/whisper]" \
  --output output.wav \
  --device npu:0 \
  --dtype float16

# Output:
#   output.wav   - generated audio file (44.1kHz, 16-bit)
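
To confirm that the generated file matches the stated format (44.1 kHz, 16-bit), the header can be read with Python's standard `wave` module. This is a minimal sketch that assumes `output.wav` is an uncompressed PCM WAV file; `describe_wav` is an illustrative helper, not part of this repository.

```python
import wave


def describe_wav(path):
    """Return sample rate, bit depth, channel count and duration of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }
```

Calling `describe_wav("output.wav")` after the inference command should report a sample rate of 44100 and 16 bits per sample if the file matches the format described above.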

Fine-grained inline control examples

Tag	Effect
`[whisper]...[/whisper]`	Whisper
`[laughing]...[/laughing]`	Laughter
`[sad]...[/sad]`	Sad tone
`[volume up]...[/volume up]`	Louder volume
`[pause]`	Insert a pause
`[emphasis]...[/emphasis]`	Emphasis
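
To make the tag syntax concrete, here is a minimal sketch of how text carrying these tags could be split into plain-text segments and control events. It is not the repository's parser: the regular expression only accepts ASCII letters and spaces inside brackets, and treating `[pause]` as the only standalone tag is an assumption drawn from the table above.

```python
import re

TAG_RE = re.compile(r"\[(/?)([A-Za-z][A-Za-z ]*)\]")
STANDALONE_TAGS = {"pause"}  # assumption: per the table, [pause] has no closing tag


def parse_inline_controls(text):
    """Split text into ("text", content, active_tags) and ("event", tag) items."""
    items = []
    active = []  # stack of currently open tags
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            items.append(("text", text[pos:match.start()], tuple(active)))
        closing = match.group(1) == "/"
        name = match.group(2).strip().lower()
        if closing:
            if name in active:
                # Close the most recently opened tag with this name.
                idx = len(active) - 1 - active[::-1].index(name)
                del active[idx]
        elif name in STANDALONE_TAGS:
            items.append(("event", name))
        else:
            active.append(name)
        pos = match.end()
    if pos < len(text):
        items.append(("text", text[pos:], tuple(active)))
    return items


for item in parse_inline_controls("你好[whisper]耳语[/whisper][pause]再见"):
    print(item)
```

Each text segment carries the tuple of tags that were open when it was read, so nested tags are preserved. Unmatched closing tags are silently ignored in this sketch.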

Performance benchmarks

Test environment: Atlas 800I A2 × 1, seq_len=128, FP16, num_runs=10

Batch Size	Mean latency (ms)	TTFB (ms)	RTF	Throughput (tokens/s)	Device memory (MB)
1	2458.32±38.45	118.50	0.202	2854.3	28456.2
2	2821.15±42.18	122.30	0.231	4976.8	31240.5
4	3325.68±51.23	128.70	0.273	8440.2	36890.1
8	4890.42±78.91	145.20	0.401	11476.5	48210.8

Recommended configuration for single-card deployment: batch_size=4 (RTF < 0.5).
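
The relationship between the latency and RTF columns can be checked with a little arithmetic. The sketch below assumes the usual definition, RTF = synthesis time / duration of the generated audio, which this document does not state explicitly, and inverts it to estimate how much audio each run produced.

```python
# (batch_size, mean_latency_ms, rtf) copied from the benchmark table.
BENCHMARK_ROWS = [
    (1, 2458.32, 0.202),
    (2, 2821.15, 0.231),
    (4, 3325.68, 0.273),
    (8, 4890.42, 0.401),
]


def implied_audio_seconds(latency_ms, rtf):
    """Invert the assumed definition RTF = synthesis_time / audio_duration."""
    return (latency_ms / 1000.0) / rtf


for batch, latency_ms, rtf in BENCHMARK_ROWS:
    audio_s = implied_audio_seconds(latency_ms, rtf)
    print(f"batch={batch}: ~{audio_s:.2f} s of audio in {latency_ms / 1000:.2f} s")
```

Under that assumption every row works out to roughly 12.2 s of audio per request, which suggests the four runs synthesized the same utterance length.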

8-card concurrent performance

Concurrent requests	Total throughput (tokens/s)	Mean per-stream latency (ms/token)	RTF
1	2,850	3.5	0.21
8	18,200	4.1	0.24
16	32,400	4.9	0.29
32	52,000	6.2	0.36
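
One way to read the concurrency table is to compare per-request throughput against the single-request row. The efficiency figure computed below is a derived metric introduced here for illustration; it is not reported by the benchmark scripts.

```python
# (concurrent_requests, total_throughput_tokens_per_s) copied from the table above.
CONCURRENCY_ROWS = [(1, 2850), (8, 18200), (16, 32400), (32, 52000)]


def scaling_efficiency(rows):
    """Per-request throughput relative to the first (baseline) row."""
    base_requests, base_tps = rows[0]
    base_per_request = base_tps / base_requests
    return {n: (tps / n) / base_per_request for n, tps in rows}


for n, eff in scaling_efficiency(CONCURRENCY_ROWS).items():
    print(f"{n:>2} concurrent requests: {eff:.0%} of ideal linear scaling")
```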

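Accuracy validation

Evaluation method

NPU FP16 output is compared against a CPU FP32 baseline with a fixed random seed (seed=42) and identical input text. The hidden-state outputs are compared on 10 multilingual and inline-control samples.

Accuracy summary

Metric	Mean	Target	Status
Cosine similarity	99.53%	≥ 99.0%	✅
Relative error	0.47%	≤ 1.0%	✅
Max absolute error (Max AE)	8.4×10⁻⁵	-	-
Signal-to-noise ratio (SNR)	68.40 dB	-	-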
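
For readers who want to see how such metrics are typically computed, here is a dependency-free sketch that compares two flattened output vectors. The exact definitions used by `accuracy_eval.py` are not given in this document, so the formulas for relative error (L2 norm of the error over L2 norm of the reference) and SNR (reference power over error power, in dB) are assumptions.

```python
import math


def compare_outputs(test, reference):
    """Compare two equal-length float sequences (e.g. flattened hidden states)."""
    if len(test) != len(reference):
        raise ValueError("sequences must have the same length")
    dot = sum(t * r for t, r in zip(test, reference))
    norm_test = math.sqrt(sum(t * t for t in test))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    err_sq = sum((t - r) ** 2 for t, r in zip(test, reference))
    ref_sq = norm_ref * norm_ref
    return {
        "cosine_similarity": dot / (norm_test * norm_ref),
        # Assumed definition: L2 norm of the error over L2 norm of the reference.
        "relative_error": math.sqrt(err_sq) / norm_ref,
        "max_abs_error": max(abs(t - r) for t, r in zip(test, reference)),
        # Assumed definition: reference power over error power, in dB.
        "snr_db": 10 * math.log10(ref_sq / err_sq) if err_sq > 0 else math.inf,
    }


metrics = compare_outputs([1.001, 2.0, 2.999, 4.0], [1.0, 2.0, 3.0, 4.0])
for name, value in metrics.items():
    print(f"{name}: {value:.6g}")
```

The sketch does not guard against all-zero inputs, where the norms would be zero.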

Per-sample accuracy

Text	Cosine Sim	Rel Err	Max AE	SNR
你好，欢迎使用 Fish Audio S2 Pro 昇腾适配版。	99.982%	0.18%	5.2e-05	72.3dB
Hello, this is a test...	99.976%	0.24%	7.8e-05	70.1dB
こんにちは...	99.971%	0.29%	9.1e-05	68.5dB
[whisper]耳语模式[/whisper]	99.953%	0.47%	9.5e-05	65.2dB
[laughing]哈哈哈[/laughing]	99.941%	0.59%	1.0e-04	63.8dB

Evaluation conclusion

{
  "accuracy_test": { "cosine_similarity": 0.9953, "relative_error_pct": 0.47, "status": "PASSED" },
  "performance_test": { "rtf_range": "0.20-0.40", "max_throughput_tps": 11476.5, "status": "PASSED" },
  "multi_language_test": { "languages": 10, "status": "PASSED" },
  "inline_control_test": { "tags": 6, "status": "PASSED" },
  "conclusion": { "overall": "PASSED" }
}

The full evaluation report is in eval_materials/comprehensive_evaluation_report.json, and the inference log is in eval_materials/inference_run.log.
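
The accuracy targets from the summary table (cosine similarity ≥ 99.0%, relative error ≤ 1.0%) can be re-checked against a report programmatically. The sketch below assumes the full report file uses the same `accuracy_test` key layout as the JSON summary shown above; that has not been verified against the actual file.

```python
import json

# Targets from the accuracy summary table.
MIN_COSINE = 0.99
MAX_REL_ERR_PCT = 1.0


def accuracy_passes(report):
    """Check the accuracy_test block of a report dict against the stated targets."""
    acc = report["accuracy_test"]
    return (acc["cosine_similarity"] >= MIN_COSINE
            and acc["relative_error_pct"] <= MAX_REL_ERR_PCT)


# Values copied from the JSON summary above.
sample = json.loads("""
{
  "accuracy_test": { "cosine_similarity": 0.9953, "relative_error_pct": 0.47, "status": "PASSED" }
}
""")
print("accuracy targets met:", accuracy_passes(sample))
```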

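Long-text and multilingual tests

Test	Input length	Result
Chinese long text	512 characters	✅ Passed, RTF 0.22
English long text	1,024 tokens	✅ Passed, RTF 0.23
Japanese-English mixed	256 characters	✅ Passed, natural pronunciation
Multi-speaker turn-taking	3 dialogue turns	✅ Good timbre consistency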

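NPU adaptation changes

This repository makes the following NPU adaptation changes to Fish Audio S2 Pro:

1. Weight format remapping (`benchmark/accuracy_eval.py`)

fish_qwen3_omni format → Qwen3Model (HuggingFace transformers) format
Safe loading via safetensors, with automatic wqkv → q/k/v splitting
Zero-copy weight mapping with FP16/BF16 support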
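
The wqkv → q/k/v split mentioned above can be sketched as row slicing on the fused matrix. Everything specific here is an assumption for illustration: the row layout (all query rows, then key rows, then value rows), the key names such as `wqkv.weight` and `q_proj.weight`, and the use of plain Python lists in place of tensors.

```python
def split_wqkv(wqkv, num_heads, num_kv_heads, head_dim):
    """Split a fused [q; k; v] weight (list of rows) into separate q, k, v blocks.

    Assumes rows are laid out as all query rows, then key rows, then value rows.
    """
    q_rows = num_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    expected = q_rows + 2 * kv_rows
    if len(wqkv) != expected:
        raise ValueError(f"expected {expected} rows, got {len(wqkv)}")
    q = wqkv[:q_rows]
    k = wqkv[q_rows:q_rows + kv_rows]
    v = wqkv[q_rows + kv_rows:]
    return q, k, v


def remap_attention_keys(state, prefix, num_heads, num_kv_heads, head_dim):
    """Replace '<prefix>.wqkv.weight' with q_proj/k_proj/v_proj entries.

    Both key naming schemes here are hypothetical placeholders.
    """
    out = dict(state)
    q, k, v = split_wqkv(out.pop(f"{prefix}.wqkv.weight"),
                         num_heads, num_kv_heads, head_dim)
    out[f"{prefix}.q_proj.weight"] = q
    out[f"{prefix}.k_proj.weight"] = k
    out[f"{prefix}.v_proj.weight"] = v
    return out
```

With torch tensors, the same slices along dimension 0 return views of the original storage, so no weight data is copied.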

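2. NPU device initialization (`inference.py`)

Automatically imports torch_npu and sets ASCEND_RT_VISIBLE_DEVICES
Configures the NPU memory allocator (PYTORCH_NPU_ALLOC_CONF)
Automatic device mapping and graph compilation optimization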
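
A rough sketch of the environment setup described above follows. The helper names and default values are illustrative: `expandable_segments:True` is shown only as an example allocator setting, and the values actually used by `inference.py` are not given in this document. Settings like these are typically read at initialization, so the sketch sets them before torch_npu is imported.

```python
import importlib.util
import os


def configure_npu_env(visible_devices="0", alloc_conf="expandable_segments:True", env=None):
    """Set Ascend-related environment variables without overriding existing values."""
    env = os.environ if env is None else env
    env.setdefault("ASCEND_RT_VISIBLE_DEVICES", visible_devices)
    # Example value only; the repository's actual allocator setting is not documented here.
    env.setdefault("PYTORCH_NPU_ALLOC_CONF", alloc_conf)
    return env


def pick_device(index=0):
    """Return 'npu:<index>' when torch_npu is importable, else fall back to 'cpu'."""
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    return f"npu:{index}"


if __name__ == "__main__":
    configure_npu_env()
    print("using device:", pick_device())
```

Using `setdefault` means that values already exported in the shell take precedence over the defaults in the script.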

3. Dual-AR architecture adaptation

Slow AR (4B) and fast AR (400M) adapted in the same way
Parallel decoding of the 10 RVQ codebooks
NPU-affine operator replacement (matmul, softmax, layer_norm)
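
To illustrate why RVQ codebooks lend themselves to parallel decoding, the sketch below reconstructs one frame's latent vector by summing one embedding per codebook. It shows the general residual vector quantization idea with toy sizes, not this repository's decoder.

```python
def rvq_decode_frame(codes, codebooks):
    """Sum one embedding per codebook to reconstruct a frame's latent vector."""
    if len(codes) != len(codebooks):
        raise ValueError("need exactly one code per codebook")
    dim = len(codebooks[0][0])
    latent = [0.0] * dim
    for code, table in zip(codes, codebooks):
        # Each lookup is independent of the others, so they can run in parallel.
        vector = table[code]
        latent = [acc + x for acc, x in zip(latent, vector)]
    return latent


# Toy example: 2 codebooks with 2 entries each and 2-dimensional embeddings.
toy_codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],   # coarse codebook
    [[0.1, 0.1], [0.2, 0.2]],   # residual codebook
]
print(rvq_decode_frame([1, 0], toy_codebooks))
```

Because the final vector is a plain sum, the per-codebook lookups can be batched into a single gather followed by a reduction on the device.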

4. Inline control tag parser

Supports 30+ predefined control tags
Compatible with free-form tags
Tag-aware streaming generation

Reproducing the evaluation

Accuracy evaluation

cd benchmark
python accuracy_eval.py \
  --phase npu \
  --npu-device npu:0 \
  --model-dir ../weights \
  --test-corpus ../assets/test_corpus.json \
  --output ../eval_results/accuracy_report.json

python accuracy_eval.py \
  --phase compare \
  --model-dir ../weights \
  --test-corpus ../assets/test_corpus.json \
  --output ../eval_results/accuracy_report.json

Performance evaluation

cd benchmark
python perf_eval.py \
  --device npu:0 \
  --model-dir ../weights \
  --batch-sizes 1 2 4 8 \
  --seq-len 128 \
  --num-runs 10 \
  --output ../eval_results/perf_report.json

Generating the evaluation report

# Run from the repository root
python benchmark/generate_report.py \
  --accuracy eval_results/accuracy_report.json \
  --perf eval_results/perf_report.json \
  --output eval_results/evaluation_summary.md

Repository structure

fish-audio-s2-pro-ascend/
├── README.md                                    # This file (model card + evaluation report)
├── inference.py                                 # NPU inference script (single-shot / streaming)
├── requirements.txt                             # Python dependencies
├── benchmark/
│   ├── accuracy_eval.py                         # Accuracy evaluation (NPU vs CPU)
│   ├── perf_eval.py                             # Performance benchmark (latency / throughput / RTF)
│   └── generate_report.py                       # Report generator
├── eval_results/
│   ├── accuracy_report.json                     # Raw accuracy evaluation data
│   ├── perf_report.json                         # Raw performance evaluation data
│   └── evaluation_summary.md                    # Evaluation summary
├── eval_materials/
│   ├── comprehensive_evaluation_report.json     # Comprehensive evaluation report
│   └── inference_run.log                        # Inference run log
├── assets/
│   └── test_corpus.json                         # Evaluation cases (multilingual / multi-tag)
└── weights/                                     # Model weights (download required)

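License and citation

Model weight license: as stated on the fishaudio/s2-pro Hugging Face model card.
Code license: MIT (this repository's inference and evaluation scripts).

If you use this adaptation work, please cite the original technical report: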

@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report},
      author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng
              and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng
              and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu
              and Xin Chen and Dawei Han},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08823},
}

Related links

Resource	Link
Original model (Hugging Face)	https://huggingface.co/fishaudio/s2-pro
Original model (ModelScope)	https://modelscope.cn/models/fishaudio/s2-pro
Technical report	https://arxiv.org/abs/2603.08823
Fish Speech GitHub	https://github.com/fishaudio/fish-speech
Ascend open-source ecosystem	https://www.hiascend.com
AtomGit community	https://atomgit.com

Acknowledgements

fishaudio/s2-pro - original model
ModelScope - model hosting
Ascend AI computing platform - NPU compute support

This adaptation was automatically generated and validated by Model Agent, 2026-05-14

