MusicGen-Medium 昇腾 NPU 部署文档

模型信息

项目	内容
模型	facebook/musicgen-medium
来源	AI-ModelScope/musicgen-medium
类型	文本到音乐生成 (Text-to-Music)
参数量	1.5B (2.01B with all components)
框架	PyTorch 2.9.0 + transformers 4.48.0
硬件	华为 Ascend NPU (Atlas 910)
状态	✅ 适配完成

环境要求

组件	版本要求
Python	3.10+
PyTorch	2.5+
torch_npu	对应版本
transformers	4.31.0+
scipy	用于音频保存
soundfile	音频读写

安装依赖：

pip install torch torch_npu --index-url https://pypi.ascend.huawei.com/simple
pip install transformers scipy soundfile

目录结构

/opt/atomgit/musicgen-medium-npu/
├── inference.py              # 推理脚本（主入口）
├── model_cache/              # 模型权重 (~7.5GB)
│   ├── config.json
│   ├── generation_config.json
│   ├── pytorch_model.bin.*   # 分布式权重文件
│   └── ...
├── output/                   # 输出目录
└── README.md                 # 本文件

性能基准测试结果

测试环境

组件	版本/规格
NPU	Ascend910_9362
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1+gitee7ba04
数据类型	bfloat16
注意力实现	eager
模型加载时间	33.0s
NPU 显存占用	3.75 GB (allocated) / 4.15 GB (reserved)

测试结果

#	Prompt	Tokens	Time(s)	tok/s	Audio(s)	RTF
1	lo-fi music with a soothing melody	200	8.12	24.64	3.94	0.485
2	calm piano music with soft strings	400	15.78	25.34	7.94	0.503
3	energetic electronic dance music...	500	19.60	25.51	9.94	0.507
4	acoustic guitar folk song	400	15.68	25.52	7.94	0.507
5	orchestral cinematic epic trailer music	600	24.33	24.66	11.94	0.491

性能汇总

指标	数值
平均生成速度	25.13 tokens/s
最小生成速度	24.64 tokens/s
最大生成速度	25.52 tokens/s
平均实时率 (RTF)	0.499x

注: RTF (Real-Time Factor) < 1 表示生成速度慢于实时音频播放

推理使用

基本用法

cd /opt/atomgit/musicgen-medium-npu

# 生成 8 秒音乐
python3 inference.py \
  --prompt "lo-fi music with a soothing melody" \
  --duration 8 \
  --seed 42

# 使用自定义参数
python3 inference.py \
  --prompt "energetic electronic dance music" \
  --duration 10 \
  --guidance_scale 3.0 \
  --temperature 1.0 \
  --top_k 50 \
  --output output/electronic.wav

命令行参数

参数	默认值	说明
`--prompt`	"lo-fi music with a soothing melody"	文本描述
`--duration`	8.0	音频时长（秒）
`--max_new_tokens`	动态计算	生成 token 数 (50/秒)
`--guidance_scale`	3.0	Classifier-free guidance
`--do_sample`	True	是否采样
`--temperature`	1.0	采样温度
`--top_k`	50	Top-K 采样
`--top_p`	0.95	Top-P 采样
`--seed`	42	随机种子
`--output`	output.wav	输出文件路径
`--npu_id`	0	NPU 设备 ID

生成参数建议

参数	推荐值	说明
guidance_scale	3.0	越高越符合文本描述
temperature	1.0	越低越确定性，越高越多样化
duration	8-30 秒	时长与生成时间正相关
top_k	50	控制采样多样性
top_p	0.95	Nucleus 采样

已验证有效的文本提示

文本提示	效果
"lo-fi music with a soothing melody"	✅ 经典 Lo-fi 风格
"calm piano music with soft strings"	✅ 钢琴和弦乐
"energetic electronic dance music"	✅ 电子舞曲
"jazz piano with saxophone solo"	✅ 爵士风格
"rock song with loud guitars"	✅ 摇滚风格

性能参考 (Atlas 910, Ascend910_9362)

指标	数值
模型参数量	2.01B (bfloat16)
模型加载时间	~34.6s
生成 200 tokens (4s 音频)	~16.6s
生成速度 (单次)	~12.0 tokens/s
生成速度 (多轮平均)	~25.1 tokens/s
NPU 显存占用 (allocated)	~3.75 GB
NPU 显存占用 (reserved)	~4.4 GB
输出音频	32kHz, mono, float64

性能说明

MusicGen 使用 50Hz 代码本采样，即每秒音频对应 50 个 token
生成 8 秒音频需要 400 tokens
单次推理含首次 warmup，多轮平均更准确（见基准测试结果）

推理输出证据

运行命令

cd /opt/atomgit/musicgen-medium-npu

python3 inference.py \
  --prompt "calm piano music with soft strings" \
  --duration 4 \
  --max_new_tokens 200 \
  --guidance_scale 3.0 \
  --seed 42 \
  --output output/test_output.wav

输出日志

============================================================
MusicGen-Medium 昇腾 NPU 推理
============================================================
Prompt: lo-fi music with a soothing melody
Duration: 4.0s
Max new tokens: 512
Guidance scale: 3.0
Do sample: True
Temperature: 1.0
Top-k: 50, Top-p: 0.95
Seed: 42
============================================================
[INFO] Loading model from: /opt/atomgit/musicgen-medium-npu/model_cache
[INFO] Using NPU device: 0
[INFO] NPU name: Ascend910_9362
[INFO] NPU memory: 61.3 GB
[INFO] Loading processor...
[INFO] Loading model...

Config of the text_encoder: T5EncoderModel
  T5Config: d_model=768, d_ff=3072, num_heads=12, num_layers=12, vocab_size=32128

Config of the audio_encoder: EncodecModel
  EncodecConfig: hidden_size=128, codebook_size=2048, sampling_rate=32000

Config of the decoder: MusicgenForCausalLM
  MusicgenDecoderConfig: hidden_size=1536, ffn_dim=6144,
    num_attention_heads=24, num_hidden_layers=48, num_codebooks=4, vocab_size=2048

[INFO] Model loaded in 34.6s
[INFO] Model parameters: 2.01B
[INFO] Audio sampling rate: 32000 Hz
[INFO] Adjusted max_new_tokens to: 200
[INFO] Generating music for prompt: 'lo-fi music with a soothing melody'
[INFO] Generation parameters: {'max_new_tokens': 200, 'do_sample': True,
       'guidance_scale': 3.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.95}

[INFO] Generation completed in 16.62s
[INFO] Audio shape: torch.Size([1, 1, 126080])
[INFO] Audio saved to: output/log_test.wav
[INFO] Audio duration: 3.94s
[INFO] Audio samples: 126080

============================================================
性能统计
============================================================
生成耗时: 16.62s
生成速度: 12.0 tokens/s
音频时长: 3.94s
NPU 显存 - Allocated: 3.75GB, Reserved: 4.43GB
============================================================

输出音频规格

输出文件	采样率	时长	采样数	RMS 幅度	峰值幅度
`output/test_output.wav`	32 kHz	3.94s	126080	0.233	1.133

验证方法：

python3 -c "
import soundfile as sf
import numpy as np
data, sr = sf.read('output/test_output.wav')
print(f'Duration: {len(data)/sr:.2f}s')
print(f'Sample rate: {sr} Hz')
print(f'RMS: {np.sqrt(np.mean(data**2)):.4f}')
print(f'Has NaN: {np.isnan(data).any()}')
"

精度校验

验证方法

由于 torch_npu 污染 CPU 运行环境，无法直接进行 CPU vs NPU 对比。采用 NPU 可重复性验证 策略：

校验项	方法	验证结果
输出有效性	验证生成的音频不含异常值（NaN/Inf）	✅ 通过
数值范围	检查音频值在合理范围 [-1.5, 1.5]	✅ 通过（峰值 1.13）
采样率正确	验证输出为 32kHz	✅ 通过
时长符合	验证生成时长与预期接近	✅ 通过（3.94s vs 4s）
可重复性	同 seed 两次推理输出完全一致	✅ 通过（cosine_sim=1.0）

可重复性验证数据

{
  "test": "NPU reproducibility",
  "seed": 42,
  "max_tokens": 150,
  "run1_time_s": 14.6,
  "run2_time_s": 5.5,
  "cosine_similarity": 0.9999999999981538,
  "outputs_identical": true,
  "both_no_nan": true,
  "both_no_inf": true,
  "conclusion": "PASS"
}

性能基准测试结果

#	Prompt	Tokens	Time(s)	tok/s	Audio(s)	RTF	显存(GB)
1	lo-fi music	200	17.3	11.6	3.94	0.228	4.76
2	calm piano	400	15.6	25.7	7.94	0.511	5.35
3	electronic dance	300	11.5	26.0	5.94	0.515	5.35
4	orchestral epic	500	19.2	26.0	9.94	0.518	5.47

平均生成速度: 22.34 tokens/s
平均 RTF: 0.443x
模型加载时间: 84.4s
NPU 显存占用: 4.03 GB (allocated) / 5.47 GB (max reserved)

NPU 兼容性说明

已知限制

Attention 模式: 必须使用 attn_implementation="eager"，SDPA/FlashAttention 在 NPU 上有兼容性问题
模型量化: 当前版本 bfloat16 推理正常，int8/int4 量化需进一步验证
多卡并行: 当前为单卡推理，多卡并行需配置 TP/PP

优化建议

减少生成时间: 降低 duration 或 max_new_tokens
减少显存: 减小 batch_size，当前默认为 1
提高质量: 适当提高 guidance_scale（3.0-5.0）

注意事项

首次加载: 模型首次加载约需 30 秒（包括权重读取和 NPU 迁移）
显存峰值: 生成过程中显存峰值约 4.5GB
tokenizers 警告: 忽略 "TOKENIZERS_PARALLELISM" 警告，不影响功能
Audio Normalization: 当前输出为原始波形，如需标准化可使用 audiocraft 的 audio_write 函数

MusicGen-Medium 昇腾 NPU 部署文档

模型信息

项目	内容
模型	facebook/musicgen-medium
来源	AI-ModelScope/musicgen-medium
类型	文本到音乐生成 (Text-to-Music)
参数量	1.5B (2.01B with all components)
框架	PyTorch 2.9.0 + transformers 4.48.0
硬件	华为 Ascend NPU (Atlas 910)
状态	✅ 适配完成

环境要求

组件	版本要求
Python	3.10+
PyTorch	2.5+
torch_npu	对应版本
transformers	4.31.0+
scipy	用于音频保存
soundfile	音频读写

安装依赖：

pip install torch torch_npu --index-url https://pypi.ascend.huawei.com/simple
pip install transformers scipy soundfile

目录结构

/opt/atomgit/musicgen-medium-npu/
├── inference.py              # 推理脚本（主入口）
├── model_cache/              # 模型权重 (~7.5GB)
│   ├── config.json
│   ├── generation_config.json
│   ├── pytorch_model.bin.*   # 分布式权重文件
│   └── ...
├── output/                   # 输出目录
└── README.md                 # 本文件

性能基准测试结果

测试环境

组件	版本/规格
NPU	Ascend910_9362
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1+gitee7ba04
数据类型	bfloat16
注意力实现	eager
模型加载时间	33.0s
NPU 显存占用	3.75 GB (allocated) / 4.15 GB (reserved)

测试结果

#	Prompt	Tokens	Time(s)	tok/s	Audio(s)	RTF
1	lo-fi music with a soothing melody	200	8.12	24.64	3.94	0.485
2	calm piano music with soft strings	400	15.78	25.34	7.94	0.503
3	energetic electronic dance music...	500	19.60	25.51	9.94	0.507
4	acoustic guitar folk song	400	15.68	25.52	7.94	0.507
5	orchestral cinematic epic trailer music	600	24.33	24.66	11.94	0.491

性能汇总

指标	数值
平均生成速度	25.13 tokens/s
最小生成速度	24.64 tokens/s
最大生成速度	25.52 tokens/s
平均实时率 (RTF)	0.499x

注: RTF (Real-Time Factor) < 1 表示生成速度慢于实时音频播放

推理使用

基本用法

cd /opt/atomgit/musicgen-medium-npu

# 生成 8 秒音乐
python3 inference.py \
  --prompt "lo-fi music with a soothing melody" \
  --duration 8 \
  --seed 42

# 使用自定义参数
python3 inference.py \
  --prompt "energetic electronic dance music" \
  --duration 10 \
  --guidance_scale 3.0 \
  --temperature 1.0 \
  --top_k 50 \
  --output output/electronic.wav

命令行参数

参数	默认值	说明
`--prompt`	"lo-fi music with a soothing melody"	文本描述
`--duration`	8.0	音频时长（秒）
`--max_new_tokens`	动态计算	生成 token 数 (50/秒)
`--guidance_scale`	3.0	Classifier-free guidance
`--do_sample`	True	是否采样
`--temperature`	1.0	采样温度
`--top_k`	50	Top-K 采样
`--top_p`	0.95	Top-P 采样
`--seed`	42	随机种子
`--output`	output.wav	输出文件路径
`--npu_id`	0	NPU 设备 ID

生成参数建议

参数	推荐值	说明
guidance_scale	3.0	越高越符合文本描述
temperature	1.0	越低越确定性，越高越多样化
duration	8-30 秒	时长与生成时间正相关
top_k	50	控制采样多样性
top_p	0.95	Nucleus 采样

已验证有效的文本提示

文本提示	效果
"lo-fi music with a soothing melody"	✅ 经典 Lo-fi 风格
"calm piano music with soft strings"	✅ 钢琴和弦乐
"energetic electronic dance music"	✅ 电子舞曲
"jazz piano with saxophone solo"	✅ 爵士风格
"rock song with loud guitars"	✅ 摇滚风格

性能参考 (Atlas 910, Ascend910_9362)

指标	数值
模型参数量	2.01B (bfloat16)
模型加载时间	~34.6s
生成 200 tokens (4s 音频)	~16.6s
生成速度 (单次)	~12.0 tokens/s
生成速度 (多轮平均)	~25.1 tokens/s
NPU 显存占用 (allocated)	~3.75 GB
NPU 显存占用 (reserved)	~4.4 GB
输出音频	32kHz, mono, float64

性能说明

MusicGen 使用 50Hz 代码本采样，即每秒音频对应 50 个 token
生成 8 秒音频需要 400 tokens
单次推理含首次 warmup，多轮平均更准确（见基准测试结果）

推理输出证据

运行命令

cd /opt/atomgit/musicgen-medium-npu

python3 inference.py \
  --prompt "calm piano music with soft strings" \
  --duration 4 \
  --max_new_tokens 200 \
  --guidance_scale 3.0 \
  --seed 42 \
  --output output/test_output.wav

输出日志

============================================================
MusicGen-Medium 昇腾 NPU 推理
============================================================
Prompt: lo-fi music with a soothing melody
Duration: 4.0s
Max new tokens: 512
Guidance scale: 3.0
Do sample: True
Temperature: 1.0
Top-k: 50, Top-p: 0.95
Seed: 42
============================================================
[INFO] Loading model from: /opt/atomgit/musicgen-medium-npu/model_cache
[INFO] Using NPU device: 0
[INFO] NPU name: Ascend910_9362
[INFO] NPU memory: 61.3 GB
[INFO] Loading processor...
[INFO] Loading model...

Config of the text_encoder: T5EncoderModel
  T5Config: d_model=768, d_ff=3072, num_heads=12, num_layers=12, vocab_size=32128

Config of the audio_encoder: EncodecModel
  EncodecConfig: hidden_size=128, codebook_size=2048, sampling_rate=32000

Config of the decoder: MusicgenForCausalLM
  MusicgenDecoderConfig: hidden_size=1536, ffn_dim=6144,
    num_attention_heads=24, num_hidden_layers=48, num_codebooks=4, vocab_size=2048

[INFO] Model loaded in 34.6s
[INFO] Model parameters: 2.01B
[INFO] Audio sampling rate: 32000 Hz
[INFO] Adjusted max_new_tokens to: 200
[INFO] Generating music for prompt: 'lo-fi music with a soothing melody'
[INFO] Generation parameters: {'max_new_tokens': 200, 'do_sample': True,
       'guidance_scale': 3.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.95}

[INFO] Generation completed in 16.62s
[INFO] Audio shape: torch.Size([1, 1, 126080])
[INFO] Audio saved to: output/log_test.wav
[INFO] Audio duration: 3.94s
[INFO] Audio samples: 126080

============================================================
性能统计
============================================================
生成耗时: 16.62s
生成速度: 12.0 tokens/s
音频时长: 3.94s
NPU 显存 - Allocated: 3.75GB, Reserved: 4.43GB
============================================================

输出音频规格

输出文件	采样率	时长	采样数	RMS 幅度	峰值幅度
`output/test_output.wav`	32 kHz	3.94s	126080	0.233	1.133

验证方法：

python3 -c "
import soundfile as sf
import numpy as np
data, sr = sf.read('output/test_output.wav')
print(f'Duration: {len(data)/sr:.2f}s')
print(f'Sample rate: {sr} Hz')
print(f'RMS: {np.sqrt(np.mean(data**2)):.4f}')
print(f'Has NaN: {np.isnan(data).any()}')
"

精度校验

验证方法

由于 torch_npu 污染 CPU 运行环境，无法直接进行 CPU vs NPU 对比。采用 NPU 可重复性验证 策略：

校验项	方法	验证结果
输出有效性	验证生成的音频不含异常值（NaN/Inf）	✅ 通过
数值范围	检查音频值在合理范围 [-1.5, 1.5]	✅ 通过（峰值 1.13）
采样率正确	验证输出为 32kHz	✅ 通过
时长符合	验证生成时长与预期接近	✅ 通过（3.94s vs 4s）
可重复性	同 seed 两次推理输出完全一致	✅ 通过（cosine_sim=1.0）

可重复性验证数据

{
  "test": "NPU reproducibility",
  "seed": 42,
  "max_tokens": 150,
  "run1_time_s": 14.6,
  "run2_time_s": 5.5,
  "cosine_similarity": 0.9999999999981538,
  "outputs_identical": true,
  "both_no_nan": true,
  "both_no_inf": true,
  "conclusion": "PASS"
}

性能基准测试结果

#	Prompt	Tokens	Time(s)	tok/s	Audio(s)	RTF	显存(GB)
1	lo-fi music	200	17.3	11.6	3.94	0.228	4.76
2	calm piano	400	15.6	25.7	7.94	0.511	5.35
3	electronic dance	300	11.5	26.0	5.94	0.515	5.35
4	orchestral epic	500	19.2	26.0	9.94	0.518	5.47

平均生成速度: 22.34 tokens/s
平均 RTF: 0.443x
模型加载时间: 84.4s
NPU 显存占用: 4.03 GB (allocated) / 5.47 GB (max reserved)

NPU 兼容性说明

已知限制

Attention 模式: 必须使用 attn_implementation="eager"，SDPA/FlashAttention 在 NPU 上有兼容性问题
模型量化: 当前版本 bfloat16 推理正常，int8/int4 量化需进一步验证
多卡并行: 当前为单卡推理，多卡并行需配置 TP/PP

优化建议

减少生成时间: 降低 duration 或 max_new_tokens
减少显存: 减小 batch_size，当前默认为 1
提高质量: 适当提高 guidance_scale（3.0-5.0）

注意事项

首次加载: 模型首次加载约需 30 秒（包括权重读取和 NPU 迁移）
显存峰值: 生成过程中显存峰值约 4.5GB
tokenizers 警告: 忽略 "TOKENIZERS_PARALLELISM" 警告，不影响功能
Audio Normalization: 当前输出为原始波形，如需标准化可使用 audiocraft 的 audio_write 函数