Ascend-SACT/InspireMusic

Introduction

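InspireMusic is a unified framework for music, song, and audio generation that combines an audio tokenizer, an autoregressive transformer, and a super-resolution flow-matching model. The toolkit aims to help everyday users create their own soundscapes through music, song, and audio creation, and to improve collaboration in research. It provides training and inference code for AI-based generative models that can produce high-quality music, and it supports controllable generation of music, songs, and audio from text and audio prompts. The toolkit currently supports music generation; song generation and audio generation are planned for the future.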

1. Prepare the runtime environment

Table 1. Hardware devices

Device model     NPU configuration
Atlas 800I A2    8*64G
Atlas 800T A2    8*64G

Table 2. Software version compatibility

Component       Version    Environment setup guide
cann            8.5.1      -
Python          3.11.14    -
torch           2.6.0      -
torch_npu       2.6.0      -
torchaudio      2.6.0      -
transformers    4.48.3     -
ruamel.yaml     0.18.10    -
soundfile       0.13.1     -

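1.1 Obtain and install the vLLM Ascend image

1.1.1 Download the software package

Option 1: Deploy via the vllm-ascend image

Open the download link and select version v0.18.0rc1.

1. Run the following command to pull the image: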

docker pull quay.io/ascend/vllm-ascend:v0.18.0rc1

2. Run the following command to check that the image was downloaded successfully:

docker images | grep v0.18.0rc1

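2. Download the weights

InspireMusic music generation model weights and configuration files

Model                                            Weights
InspireMusic music generation model, Base        Download link
InspireMusic music generation model, 1.5B-Long   Download link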

3. Running guide

3.1 Single-node, single-card deployment

3.1.1 Sample command to start the container

docker run -itd -u 0  --ipc=host  --privileged \
--name inspiremusic-new \
--net=host \
--shm-size=256g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-v /home/:/home/ \
-v /opt/data/:/opt/data/ \
-p 8002:8002 \
-it 40417662a1ff bash

Here 40417662a1ff is an image ID from the author's environment; substitute the ID or tag of the image pulled in section 1.1.1, for example quay.io/ascend/vllm-ascend:v0.18.0rc1.

3.1.2 Enter the container

docker exec -it -u root  inspiremusic-new bash

3.1.3 Install dependencies

Download the inference code:

git clone --recursive https://github.com/FunAudioLLM/InspireMusic.git
cd InspireMusic
git submodule update --recursive

Create a dependency file named req.txt with the following packages:

setuptools
conformer==0.3.2
diffusers==0.27.2
gdown==5.1.0
gradio==5.5.0
grpcio==1.57.0
grpcio-tools==1.57.0
hydra-core==1.3.2
HyperPyYAML==1.2.2
inflect==7.3.1
librosa==0.10.2
lightning==2.2.4
matplotlib==3.7.5
modelscope==1.15.0
networkx==3.1
omegaconf==2.3.0
onnx==1.17.0
protobuf==4.25
pydantic==2.7.0
rich==13.7.1
tensorboard==2.14.0
uvicorn==0.30.0
wget==3.2
fastapi
fastapi-cli==0.0.4
#WeTextProcessing==1.0.3
accelerate
huggingface-hub==0.25.2
julius
onnxruntime==1.16.0
transformers==4.48.3
torch==2.6.0
torch-npu==2.6.0
torchaudio==2.6.0
ruamel.yaml==0.18.10
soundfile==0.13.1
Cython

Update the dependencies by removing the versions preinstalled in the image and installing from req.txt:

pip uninstall -y vllm-ascend
pip uninstall -y vllm
pip uninstall -y torch
pip uninstall -y torch_npu
pip uninstall -y torchaudio
pip uninstall -y torchvision
pip uninstall -y transformers
pip install -r req.txt
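
After `pip install -r req.txt` finishes, it can help to confirm that the pinned packages resolved to the expected versions. The following is a minimal sketch and not part of InspireMusic: the helper name `find_version_mismatches` and the selection of packages are assumptions, and the expected versions are copied from req.txt. It relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Versions copied from req.txt above; extend the dict as needed.
EXPECTED = {
    "torch": "2.6.0",
    "torch-npu": "2.6.0",
    "torchaudio": "2.6.0",
    "ruamel.yaml": "0.18.10",
    "soundfile": "0.13.1",
}


def find_version_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every package that differs.

    `found` is None when the package is not installed. A local build suffix
    such as "+cpu" is ignored when comparing.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Prints one line per mismatch; prints nothing when everything matches.
    for name, (wanted, found) in find_version_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in normal use the default is enough.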

Edit setup.py and empty the dependency list:

    "install": [
    ],

Install InspireMusic and matcha-tts:

python setup.py install
cd third_party/Matcha-TTS/
echo "" > requirements.txt
python setup.py install
cd ../../

3.1.4 Create the weights directory

cd /workspace/InspireMusic
mkdir pretrained_models
mv InspireMusic pretrained_models/InspireMusic-Base
mv InspireMusic-1.5B-Long pretrained_models
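
Before moving on, the resulting layout can be checked with a few lines of Python. This is a sketch under the assumption that the two directory names produced by the `mv` commands above are the ones the inference code expects; `missing_model_dirs` is a hypothetical helper, not an InspireMusic function.

```python
from pathlib import Path

# Directory names taken from the mv commands above.
EXPECTED_MODEL_DIRS = ("InspireMusic-Base", "InspireMusic-1.5B-Long")


def missing_model_dirs(repo_root, names=EXPECTED_MODEL_DIRS):
    """Return the expected model directories that are absent or empty."""
    base = Path(repo_root) / "pretrained_models"
    missing = []
    for name in names:
        path = base / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both weight directories are in place.
    print(missing_model_dirs("/workspace/InspireMusic"))
```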

3.1.5 Modify the framework code

Edit /workspace/InspireMusic/inspiremusic/cli/inference.py and /workspace/InspireMusic/inspiremusic/bin/inference.py, adding the following imports:

import torch
from torch_npu.contrib import transfer_to_npu
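
Importing `transfer_to_npu` is what lets the CUDA-oriented code run on Ascend without further changes: once imported, torch_npu redirects `torch.cuda` calls and `cuda` device strings to the NPU. Before running inference it can be useful to confirm that an NPU is actually visible. The sketch below is one way to do that; the function name `npu_available` is hypothetical, and the guard lets the snippet run (and return False) on machines without torch_npu.

```python
def npu_available():
    """Return True if torch_npu imports cleanly and reports an NPU."""
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the torch.npu namespace)
    except Exception:
        # torch or torch_npu is missing, or the Ascend driver failed to load.
        return False
    return bool(torch.npu.is_available())


if __name__ == "__main__":
    print("NPU available:", npu_available())
```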

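Modify /workspace/InspireMusic/inspiremusic/transformer/qwen_encoder.py as follows (the attn_implementation="flash_attention_2" argument is dropped from the Qwen2ForCausalLM.from_pretrained calls):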

# Copyright (c) 2024 Alibaba Inc
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch.nn as nn
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from inspiremusic.utils.mask import make_pad_mask
from inspiremusic.utils.hinter import hint_once

class QwenEncoder(nn.Module):
    def __init__(
            self,
            input_size: int,
            dtype: str = "fp16",
            pretrain_path: str = "Qwen/Qwen2.0-0.5B",
            trainable: bool = False,
            do_fusion_emb: bool = False,
            fusion_drop_rate: float = 0.0,
    ):
        super(QwenEncoder, self).__init__()
        self.input_size = input_size
        self.trainable = trainable

        if dtype == "fp16":
            self.dtype = torch.float16
        elif dtype == "bf16":
            self.dtype = torch.bfloat16
        else:
            self.dtype = torch.float32

        self.model = AutoModelForCausalLM.from_pretrained(pretrain_path, device_map="auto", attn_implementation="flash_attention_2", torch_dtype=self.dtype)
        self._output_size = self.model.config.hidden_size
        self.do_fusion_emb = do_fusion_emb
        self.hidden_norm = torch.nn.LayerNorm(self._output_size)
        self.fusion_dropout = nn.Dropout(fusion_drop_rate)
        if do_fusion_emb:
            self.fusion_layer = torch.nn.Linear(self._output_size * 2, self._output_size)
            self.emb_norm = torch.nn.LayerNorm(self._output_size)
            self.fusion_norm = torch.nn.LayerNorm(self._output_size)
            from inspiremusic.transformer.activation import Swish
            self.fusion_act = Swish(self)

        if not self.trainable:
            self.model.eval()

    def output_size(self) -> int:
        return self._output_size

    def forward(
            self,
            input_ids: torch.Tensor,
            ilens: torch.Tensor,
    ):
        device = input_ids.device
        input_ids = torch.clamp(input_ids, min=0, max=None)
        input_masks = (~make_pad_mask(ilens)).to(device).long()
        if not self.trainable:
            with torch.no_grad():
                model_outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=input_masks,
                    output_hidden_states=True
                )
        else:
            model_outputs = self.model(
                input_ids=input_ids,
                attention_mask=input_masks,
                output_hidden_states=True
            )
        outs = model_outputs.hidden_states[-1]
        outs = self.hidden_norm(outs)
        if self.do_fusion_emb:
            hint_once("fuse embedding and LM outputs", "fuse_emb")
            outs = self.fusion_dropout(self.fusion_act(outs))
            emb = model_outputs.hidden_states[0]
            emb = self.fusion_dropout(self.fusion_act(self.emb_norm(emb)))
            outs = self.fusion_layer(
                torch.cat([outs, emb], dim=-1)
            )
            outs = self.fusion_act(self.fusion_norm(outs))

        return outs, ilens


class QwenEmbeddingEncoder(nn.Module):
    def __init__(
            self,
            input_size: int,
            dtype: str = "fp16",
            pretrain_path: str = "Qwen/Qwen2.0-0.5B",
    ):
        super(QwenEmbeddingEncoder, self).__init__()
        self.input_size = input_size
        if dtype == "fp16":
            self.dtype = torch.float16
        elif dtype == "bf16":
            self.dtype = torch.bfloat16
        else:
            self.dtype = torch.float32
        from transformers import Qwen2ForCausalLM
        #self.model = Qwen2ForCausalLM.from_pretrained(pretrain_path, device_map="auto", attn_implementation="flash_attention_2", torch_dtype=self.dtype)
        self.model = Qwen2ForCausalLM.from_pretrained(pretrain_path, device_map="auto", torch_dtype=self.dtype)
        self._output_size = self.model.config.hidden_size

    def output_size(self) -> int:
        return self._output_size

    def forward(
            self,
            input_embeds: torch.Tensor,
            ilens: torch.Tensor,
    ):
        input_masks = (~make_pad_mask(ilens)).to(input_embeds.device).long()

        outs = self.model(
            inputs_embeds=input_embeds,
            attention_mask=input_masks,
            output_hidden_states=True,
            return_dict=True,
        )

        return outs.hidden_states[-1], input_masks

    def forward_one_step(self, xs, masks, cache=None):

        outs = self.model(
            inputs_embeds=xs,
            attention_mask=masks,
            output_hidden_states=True,
            return_dict=True,
            use_cache=True,
            past_key_values=cache,
        )
        xs = outs.hidden_states[-1]
        new_cache = outs.past_key_values

        return xs, masks, new_cache


class QwenInputOnlyEncoder(nn.Module):
    def __init__(
            self,
            input_size: int,
            dtype: str = "fp16",
            pretrain_path: str = "Qwen/Qwen2.0-0.5B",
    ):
        super(QwenInputOnlyEncoder, self).__init__()
        self.input_size = input_size
        if dtype == "fp16":
            self.dtype = torch.float16
        elif dtype == "bf16":
            self.dtype = torch.bfloat16
        else:
            self.dtype = torch.float32
        from transformers import Qwen2ForCausalLM
        #model = Qwen2ForCausalLM.from_pretrained(pretrain_path, device_map="auto", attn_implementation="flash_attention_2", torch_dtype=self.dtype)
        model = Qwen2ForCausalLM.from_pretrained(pretrain_path, device_map="auto", torch_dtype=self.dtype)
        self.embed = model.model.embed_tokens
        for p in self.embed.parameters():
            p.requires_grad = False
            # set text embedding to non-trainable

        # self.post_embed = model.model.rotary_emb
        self._output_size = model.config.hidden_size

    def output_size(self) -> int:
        return self._output_size

    def forward(
            self,
            input_ids: torch.Tensor,
            ilens: torch.Tensor,
    ):
        input_masks = (~make_pad_mask(ilens)).to(input_ids.device).long()

        outs = self.embed(input_ids)

        return outs, input_masks
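
Each encoder's forward pass turns the sequence lengths `ilens` into an attention mask with `(~make_pad_mask(ilens)).long()`. The sketch below re-creates that step in plain Python for illustration only. It assumes `make_pad_mask` follows the common WeNet-style convention of marking padded positions as True; the function names here are stand-ins, not the InspireMusic implementations.

```python
def pad_mask(lengths):
    """Mark padded positions as True; rows are padded to max(lengths)."""
    max_len = max(lengths)
    return [[pos >= n for pos in range(max_len)] for n in lengths]


def attention_mask(lengths):
    """Invert the pad mask and cast to int: 1 = real token, 0 = padding."""
    return [[int(not padded) for padded in row] for row in pad_mask(lengths)]


if __name__ == "__main__":
    # Two sequences of lengths 3 and 1, padded to length 3.
    print(attention_mask([3, 1]))  # [[1, 1, 1], [1, 0, 0]]
```

The inverted mask is what the Hugging Face model receives as `attention_mask`, so padded positions do not contribute to attention.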

3.1.6 Run the inference script

Music generation:

cd examples/music_generation
export ASCEND_RT_VISIBLE_DEVICES=0
# customize the configuration with a one-line command like the following
python -m inspiremusic.cli.inference --task text-to-music -m "InspireMusic-1.5B-Long" -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." -c intro -s 0.0 -e 30.0 -r "exp/inspiremusic" -o output -f wav 

# without flow matching, use one-line command to get a quick try
python -m inspiremusic.cli.inference --task text-to-music -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." --fast True
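
When generating many clips, the same command line can be assembled from Python. The following is a sketch only: `build_inference_cmd` and its parameter names are hypothetical, and it uses just the flags that appear in the commands above (`--task`, `-m`, `-g`, `-t`, `--fast`, plus any extra flags passed through verbatim). It builds the argument list without running it.

```python
import shlex


def build_inference_cmd(text, model=None, gpu=0, fast=False, extra=()):
    """Assemble the argv list for `python -m inspiremusic.cli.inference`."""
    cmd = ["python", "-m", "inspiremusic.cli.inference",
           "--task", "text-to-music", "-g", str(gpu), "-t", text]
    if model is not None:
        cmd += ["-m", model]
    if fast:
        cmd += ["--fast", "True"]
    cmd += list(extra)  # e.g. ["-c", "intro", "-s", "0.0", "-e", "30.0"]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_cmd(
        "Soothing instrumental jazz with a touch of Bossa Nova.",
        model="InspireMusic-1.5B-Long",
        extra=["-c", "intro", "-s", "0.0", "-e", "30.0", "-f", "wav"],
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain commas, quotes, or other punctuation.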

Music continuation task, which requires an audio prompt file audio_prompt.wav:

cd examples/music_generation
# with flow matching
python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav
# without flow matching
python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav --fast True

3.1.7 The generated *.wav audio files are written to the exp/inspiremusic directory
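
A quick way to confirm that generation produced usable audio is to list the files and their durations. The sketch below uses the standard-library `wave` module, which reads only PCM-encoded WAV files; whether InspireMusic writes PCM is an assumption here, and `soundfile.info` (soundfile is already in req.txt) is an alternative if `wave` rejects a file. The helper name `wav_durations` is hypothetical.

```python
import wave
from pathlib import Path


def wav_durations(directory):
    """Return {file name: duration in seconds} for PCM *.wav files."""
    durations = {}
    for path in sorted(Path(directory).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            durations[path.name] = wav.getnframes() / wav.getframerate()
    return durations


if __name__ == "__main__":
    # Prints nothing if the directory is missing or holds no .wav files.
    for name, seconds in wav_durations("exp/inspiremusic").items():
        print(f"{name}: {seconds:.1f} s")
```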