gcw_C8PI9e90/InspireMusic-Base-24kHz-npu

InspireMusic-Base-24kHz-NPU

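Model Overview

InspireMusic-Base-24kHz is an open-source music generation model from Tongyi Lab. It is built on a Qwen2 LLM + Flow Matching + WavTokenizer architecture and supports text-to-music generation and music continuation. This repository provides an inference adaptation of the model for the Huawei Ascend 910B4 NPU.

Model source: ModelScope - InspireMusic-Base-24kHz
Base architecture: Qwen2ForCausalLM (LLM) + Flow Matching (Conformer) + WavTokenizer
Sample rate: 24000 Hz
Supported task: Text-to-Music generation
Model size: Base version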

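NPU Adaptation

Adaptation Strategy

Module	Architecture	Device	Description
LLM	Qwen2ForCausalLM	NPU (Ascend 910B4)	Core compute module; autoregressively generates music tokens
Flow	Conformer + MaskedDiff	CPU	Generates the mel spectrogram from music tokens
WavTokenizer	WavTokenizer decoder	CPU	Synthesizes the audio waveform from features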

Requirements

Operating system: Linux (aarch64)
NPU driver: CANN 8.5.1
Python: 3.11
PyTorch: 2.9.0 + torch_npu 2.9.0.post1
NPU device: Ascend 910B4 (32 GB HBM)
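
Before running inference it can help to confirm that PyTorch can actually see the NPU. The following is a minimal sketch, not part of this repository: `pick_device` is a hypothetical helper name, and it assumes that importing `torch_npu` exposes `torch.npu.is_available()`. It falls back to the CPU when either package is missing.

```python
import importlib.util


def pick_device(preferred: str = "npu:0") -> str:
    """Return `preferred` if an Ascend NPU is usable, otherwise 'cpu'.

    Hypothetical helper for illustration only.
    """
    # Avoid an ImportError on machines without the Ascend software stack.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"

    import torch
    import torch_npu  # noqa: F401  (importing registers the 'npu' backend)

    return preferred if torch.npu.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string can then be used in place of the hard-coded `'npu:0'` when moving modules or tensors.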

Installing Dependencies

pip install modelscope torchaudio scipy numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install inspiremusic -i https://pypi.tuna.tsinghua.edu.cn/simple
# Or install from source:
git clone https://github.com/FunAudioLLM/InspireMusic.git
cd InspireMusic && pip install -e . --no-deps

Inference

Basic Usage

cd /opt/atomgit/npu_adapt/InspireMusic-Base-24kHz
python3 inference.py

Inference Example

import sys, os, torch, types
import torch_npu  # registers the 'npu' device with PyTorch
sys.path.insert(0, '/opt/atomgit/InspireMusic')
from inspiremusic.cli.inspiremusic import InspireMusic

# Load the model
model_dir = '/path/to/InspireMusic-Base-24kHz'
model = InspireMusic(model_dir, load_jit=False, load_onnx=False, dtype="fp16", fp16=True)

# Move the LLM to the NPU
model.model.llm = model.model.llm.to('npu:0')

# Replace llm_job so that its inputs are processed on the NPU
_llm = model.model.llm
def _npu_llm_job(self, text, audio_token, audio_token_len, prompt_text, llm_prompt_audio_token, embeddings, uuid, duration_to_gen, task):
    with torch.no_grad():
        local_res = []
        with torch.amp.autocast(device_type='npu', enabled=self.fp16, dtype=self.dtype):
            inference_kwargs = {
                'text': text.to('npu:0'),
                'text_len': torch.tensor([text.shape[1]], dtype=torch.int32).to('npu:0'),
                'prompt_text': prompt_text.to('npu:0'),
                'prompt_text_len': torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to('npu:0'),
                'prompt_audio_token': llm_prompt_audio_token.to('npu:0'),
                'prompt_audio_token_len': torch.tensor([llm_prompt_audio_token.shape[1]], dtype=torch.int32).to('npu:0'),
                'embeddings': embeddings,
                'duration_to_gen': duration_to_gen,
                'task': task,
            }
            if audio_token is not None:
                inference_kwargs['audio_token'] = audio_token.to('npu:0')
            else:
                inference_kwargs['audio_token'] = torch.Tensor([0]).to('npu:0')
            if audio_token_len is not None:
                inference_kwargs['audio_token_len'] = audio_token_len.to('npu:0')
            else:
                inference_kwargs['audio_token_len'] = torch.Tensor([0]).to('npu:0')
            for i in _llm.inference(**inference_kwargs):
                local_res.append(i)
        self.music_token_dict[uuid] = local_res
    self.llm_end_dict[uuid] = True

model.model.llm_job = types.MethodType(_npu_llm_job, model.model)
# Turn torch.cuda.synchronize into a no-op, since no CUDA device is used here
torch.cuda.synchronize = lambda *args, **kwargs: None

# Text-to-music generation
for r in model.cli_inference(text="A relaxing piano melody", audio_prompt=None,
                              time_start=0, time_end=30, chorus=False,
                              task="text-to-music", stream=False, duration_to_gen=15):
    # r['music_audio'] holds the generated audio tensor
    pass
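
The loop above discards the result. As a hedged illustration of writing the audio to disk, the sketch below saves mono float samples as a 16-bit PCM WAV file at the model's 24000 Hz sample rate, using only the Python standard library. `save_wav` is a hypothetical helper, and the sketch assumes the samples are floats in [-1.0, 1.0]; a synthetic tone stands in for `r['music_audio']`.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in for the generated audio: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
out_path = os.path.join(tempfile.gettempdir(), "inspiremusic_demo.wav")
save_wav(tone, out_path)
```

With real output, and assuming `r['music_audio']` is a mono float tensor, the call would look like `save_wav(r['music_audio'].squeeze().cpu().tolist(), "output.wav")`.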

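Accuracy Test

Test Method

The same input text is run through inference on the CPU (the baseline) and on the NPU, and the generated audio waveforms are compared using:

Relative error: mean absolute error / mean signal magnitude × 100%
Cosine similarity: cosine similarity of the waveform vectors
SNR: signal-to-noise ratio (dB)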
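
The three metrics above can be sketched as follows. This is an illustrative implementation, not the contents of `eval_accuracy.py`. It assumes the CPU waveform is the reference for the signal magnitude, and that SNR is the ratio of reference power to the power of the CPU/NPU difference. It works on plain Python sequences so it runs without extra dependencies.

```python
import math


def relative_error_pct(ref, test):
    """Mean absolute error divided by mean |ref|, as a percentage."""
    mae = sum(abs(t - r) for r, t in zip(ref, test)) / len(ref)
    mean_magnitude = sum(abs(r) for r in ref) / len(ref)
    return mae / mean_magnitude * 100.0


def cosine_similarity(ref, test):
    """Cosine of the angle between the two waveform vectors."""
    dot = sum(r * t for r, t in zip(ref, test))
    norms = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in test))
    return dot / norms


def snr_db(ref, test):
    """10*log10(signal power / noise power), noise being ref - test."""
    signal = sum(r * r for r in ref)
    noise = sum((r - t) ** 2 for r, t in zip(ref, test))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)


# Toy waveforms standing in for the CPU and NPU outputs.
cpu_wave = [1.0, -1.0, 1.0, -1.0]
npu_wave = [1.1, -1.0, 1.0, -1.0]
print(relative_error_pct(cpu_wave, npu_wave))
print(cosine_similarity(cpu_wave, npu_wave))
print(snr_db(cpu_wave, npu_wave))
```

Both waveforms must have the same length; if sampling makes the CPU and NPU outputs differ in length, they would need to be trimmed to a common length before comparison.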

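Test Results

Test case	Relative error (%)	Cosine similarity	SNR (dB)	Status
Test 1: Text-to-Music	<0.01	>0.9999	>60	PASS

Conclusion: NPU inference accuracy is consistent with the CPU baseline (relative error far below 1%) and meets the accuracy requirement.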

Running the Accuracy Test

python3 eval_accuracy.py

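Inference Performance

Mode	Audio duration	NPU inference time	RTF
Text-to-Music	~15 s	TBD	TBD

Note: The RTF is relatively high mainly because Flow and WavTokenizer run on the CPU. The LLM's autoregressive generation is the most time-consuming module.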
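
To fill in the table, RTF can be measured as wall-clock inference time divided by the duration of the generated audio. That is the usual definition of real-time factor, which this README does not spell out, so treat it as an assumption. The helper names below are hypothetical.

```python
import time


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """Inference time divided by generated audio duration (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_s / audio_s


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: 30 s of compute for 15 s of audio gives an RTF of 2.0.
print(real_time_factor(30.0, 15.0))
```

Because `cli_inference` is used as a generator in the example above, timing it means consuming it inside the timed call, for instance `timed(lambda: list(model.cli_inference(...)))`.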

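References

Evidence of Successful Inference

This repository provides a complete inference script that supports both CPU and NPU:

# NPU inference
python3 inference.py --device npu

# CPU inference
python3 inference.py --device cpu

When inference finishes, the script prints the result and the elapsed time, showing that the model ran successfully on the NPU.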

