Adapting ACE-Step Captioner to the Ascend NPU

1. Overview

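This project adapts and validates ACE-Step Captioner on the Ascend NPU (Ascend 910B2). ACE-Step Captioner is the model that ACE-Step v1.5 uses to annotate training data. It is built on the Qwen2.5-Omni-7B architecture (Qwen2_5OmniForConditionalGeneration) and is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.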

Key features:

MTP inference framework: qwen3_next_mtp
Thinker-Talker architecture
Model size: about 10.73B parameters (~22.4 GB)

Sources of the original model weights:

ModelScope: https://modelscope.cn/models/ACE-Step/acestep-captioner
AtomGit (GitCode): https://ai.gitcode.com/hf_mirrors/ACE-Step/acestep-captioner
HuggingFace: https://huggingface.co/ACE-Step/acestep-captioner

2. Environment

Component	Version
`Ascend NPU`	`910B2` (4 cards, 64 GB HBM per card)
`torch_npu`	`2.8.0`
`torch`	`2.8.0+cpu`
`transformers`	`4.57.3`
`accelerate`	`1.10.1`
`modelscope`	`1.33.0`
`Python`	`3.11.13`

NPU: Ascend 910B2 (aarch64)
Weight path: /mnt/weight/ACE-Step/acestep-captioner
HBM: 65536 MB per card

3. Inference Usage

3.1 Environment Setup

pip install torch_npu transformers accelerate modelscope qwen-omni-utils soundfile librosa safetensors

3.2 Download the Model Weights

from modelscope import snapshot_download
model_dir = snapshot_download('ACE-Step/acestep-captioner', cache_dir='./model_cache')

3.3 Run Inference

python inference.py \
    --model_path ./model_cache/ACE-Step/acestep-captioner \
    --audio_path ./test_audio.wav \
    --device npu:0
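
The command above implies a small command-line interface. As a rough sketch, this is one way `inference.py` could parse those three flags with `argparse`. The flag names come from the command above; the `parse_args` helper name, the default device, and the help strings are assumptions for illustration and may not match the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags used in the command above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="ACE-Step Captioner inference on Ascend NPU"
    )
    parser.add_argument("--model_path", required=True,
                        help="Directory containing the model weights")
    parser.add_argument("--audio_path", required=True,
                        help="Audio file to caption")
    # Assumed default: the first NPU card.
    parser.add_argument("--device", default="npu:0",
                        help="Target device, e.g. npu:0 or cpu")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to exercise without a shell.
args = parse_args([
    "--model_path", "./model_cache/ACE-Step/acestep-captioner",
    "--audio_path", "./test_audio.wav",
])
print(args.model_path, args.audio_path, args.device)
```

Calling `parse_args()` with no argument reads `sys.argv`, which is what a script entry point would normally do.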

3.4 Core Code

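import os

# Set the allocator option before torch_npu is imported so that it is
# already in place when the NPU memory allocator initializes.
os.environ["TORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

import torch
import torch_npu  # registers the NPU backend with PyTorch
from torch_npu.contrib import transfer_to_npu  # maps CUDA-style calls to the NPU

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info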

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "ACE-Step/acestep-captioner",
    torch_dtype=torch.bfloat16,
    device_map="npu:0",
)
processor = Qwen2_5OmniProcessor.from_pretrained("ACE-Step/acestep-captioner")

conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "*Task* Describe this audio in detail"},
        {"type": "audio", "audio": "path/to/audio.wav"},
    ]},
]

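# Render the chat template to a prompt string and load the referenced audio.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=False)
# Move the inputs to the model device and cast floating-point tensors to the model dtype.
inputs = inputs.to(model.device).to(model.dtype)

# return_audio=False requests text output only.
text_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False, max_new_tokens=512)
# The decoded string contains the prompt followed by the generated caption.
result = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(result)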
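
The decoded result contains the prompt as well as the caption (the run log in Section 5.1 shows the `system`, `user`, and `assistant` turns). As a minimal sketch, the caption could be separated from the prompt with plain string handling. The helper name `extract_caption` is hypothetical, and the code assumes the decoded layout seen in that log, with each role name on its own line.

```python
def extract_caption(decoded: str) -> str:
    """Return the text that follows the last 'assistant' role line.

    Assumes the layout seen in the run log: the role names
    ('system', 'user', 'assistant') each occupy a line of their own.
    """
    lines = decoded.splitlines()
    # Scan backwards so that the last assistant turn wins.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "assistant":
            return "\n".join(lines[i + 1:]).strip()
    # No role marker found: return the input unchanged rather than guess.
    return decoded.strip()


sample = (
    "system\nYou are a helpful assistant.\nuser\n"
    "*Task* Describe this audio in detail\nassistant\n"
    "A single, sustained note."
)
print(extract_caption(sample))
```

With the code above, `extract_caption(result[0])` would be applied to the first decoded string in the batch.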

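4. Accuracy Validation

4.1 NPU vs. CPU Accuracy Comparison

Inference was run on the NPU (bf16) and on the CPU (fp32) with the same test audio (a 5-second synthesized sine wave at 16 kHz). The results are compared below: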

Metric	NPU (bf16)	CPU (fp32)	Difference
Inference time	10.83s	715.30s	NPU is about 66x faster
Generated tokens	48	48	Same
Token match rate	-	-	97.92% (47/48 tokens)
Character similarity	-	-	97.43%
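
For reproducibility, a test clip of the kind described above (5 seconds, 16 kHz, sine wave) can be generated with the Python standard library alone. This is only a sketch: the source does not state the tone frequency, amplitude, or bit depth, so the 440 Hz tone, half-scale amplitude, and 16-bit mono PCM format below are assumptions.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000   # from the text: 16 kHz
DURATION_S = 5         # from the text: 5 seconds
FREQ_HZ = 440.0        # assumption: the frequency is not stated in the source
AMPLITUDE = 0.5        # assumption: half of full scale


def write_sine_wav(path: str) -> int:
    """Write a mono 16-bit PCM sine wave and return the number of samples."""
    n_samples = SAMPLE_RATE * DURATION_S
    frames = bytearray()
    for n in range(n_samples):
        value = AMPLITUDE * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE)
        frames += struct.pack("<h", int(value * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return n_samples


print(write_sine_wav("test_audio.wav"))
```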

Output comparison

NPU output:

A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive character and a very long, abrupt decay that ends in silence.

CPU output:

A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive quality and a very long, abrupt decay that ends in silence.

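Difference analysis: Only one token differs ("character" vs. "quality"). A divergence of this kind is consistent with the precision gap between bf16 and fp32 arithmetic and does not change the meaning. The NPU output is semantically equivalent to the CPU output.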

4.2 Error Quantification

Metric	Value
Token match rate	97.92% (47/48)
Character-level similarity	97.43%
Semantic consistency	100% (the descriptions convey the same content)

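Conclusion: The NPU (bf16) and CPU (fp32) outputs are semantically equivalent. Floating-point precision differences account for a single differing token ("character" vs. "quality"), which corresponds to a token-level mismatch of 2.08% (1/48) and a character-level difference of 2.57% on this one sample.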
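
The two quantitative metrics can be computed along the following lines. This is a sketch rather than the contents of `precision_test.py`: the source does not say which character-level metric was used, so `difflib.SequenceMatcher` is an assumption and is not guaranteed to reproduce the 97.43% figure. The token IDs are placeholders chosen only to mirror the 47/48 case.

```python
from difflib import SequenceMatcher


def token_match_rate(a, b):
    """Fraction of positions at which two equal-length token sequences agree."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def char_similarity(s, t):
    """difflib similarity ratio in [0, 1]; one possible character-level metric."""
    return SequenceMatcher(None, s, t).ratio()


# Placeholder token IDs: 48 tokens with exactly one mismatch.
npu_ids = list(range(48))
cpu_ids = list(range(48))
cpu_ids[30] = -1

print(f"{token_match_rate(npu_ids, cpu_ids):.2%}")  # 97.92%
```

Position-wise matching is only meaningful when both runs emit the same number of tokens, as they do here. Sequences of different lengths would need an alignment-based metric instead.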

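4.3 Direct Accuracy Comparison with GPU

Note: At the time of writing, the ACE-Step Captioner authors have not published GPU accuracy baselines or quantitative comparisons on a standard test set. The model authors claim that its accuracy exceeds that of Gemini Pro 2.5 (source: https://arxiv.org/abs/2602.00744) but provide no token-level accuracy comparison. To obtain GPU comparison data, run the same inference script on an NVIDIA GPU (for example an A100 or 4090) and compare the output with the NPU results in this document.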

5. Inference Output Evidence

5.1 Run Log

NPU available: 4 x Ascend910B2
Loading model from: /home/openmind/acestep-captioner/model_cache/ACE-Step/acestep-captioner
Target device: npu:0
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.26it/s]
Model loaded: 10.73B parameters
Model dtype: torch.bfloat16

Input audio: /home/openmind/acestep-captioner/test_audio.wav
Running inference...
Inference completed in 13.38s (48 tokens, 3.6 tok/s)

============================================================
Caption Result:
============================================================
system
You are a helpful assistant.
user
*Task* Describe this audio in detail
assistant
A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive character and a very long, abrupt decay that ends in silence.
============================================================

5.2 Performance Metrics

Metric	Value
Input duration	`5s`
Request throughput	`0.075 req/s`
Output throughput	`3.6 tok/s`
Total token throughput	`3.6 tok/s`
Mean time to first token (ms)	`3036.530 ms`
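
The throughput figures follow directly from the latency and token count reported in this document. The arithmetic is sketched below, assuming a single request at a time (batch size 1); the variable names are illustrative.

```python
# Figures taken from the run log and the comparison table in this document.
NPU_LATENCY_S = 13.38    # end-to-end time reported in the run log
NEW_TOKENS = 48          # generated tokens
NPU_COMPARE_S = 10.83    # NPU time in the accuracy comparison
CPU_COMPARE_S = 715.30   # CPU time in the accuracy comparison

output_tok_per_s = NEW_TOKENS / NPU_LATENCY_S
req_per_s = 1 / NPU_LATENCY_S            # one request at a time (assumed)
speedup = CPU_COMPARE_S / NPU_COMPARE_S

print(f"{output_tok_per_s:.1f} tok/s")   # 3.6 tok/s
print(f"{req_per_s:.3f} req/s")          # 0.075 req/s
print(f"{speedup:.0f}x")                 # 66x
```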

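6. File Descriptions

File	Purpose
`inference.py`	Inference script adapted for the NPU
`precision_test.py`	Accuracy comparison script
`requirements.txt`	Dependency list
`README.md`	This document
`precision_results.json`	Detailed NPU vs. CPU (fp32) accuracy comparison results
`caption_result.txt`	NPU inference output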

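7. NPU Adaptation Notes

Device mapping: use device_map="npu:0", or transfer_to_npu for automatic mapping
Data type: use torch.bfloat16 (the 910B2 supports bf16)
Attention: use the default eager attention rather than flash_attention_2
Environment variable: set TORCH_NPU_ALLOC_CONF=expandable_segments:True
Audio processing: handled by Qwen2_5OmniProcessor and qwen_omni_utils
Generation parameters: set use_audio_in_video=False and return_audio=False to generate text only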
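
The environment-variable and device-mapping points can be combined into a small start-up helper. The sketch below is hypothetical (the name `configure_npu_env` and the CPU fallback are not part of the project). It uses only the standard library, so it checks whether the `torch_npu` package is installed, not whether an NPU card is actually usable.

```python
import importlib.util
import os


def configure_npu_env() -> str:
    """Set the allocator option and choose a device string (hypothetical helper)."""
    # Intended to run before torch_npu is imported, so the setting is
    # visible when the allocator starts. setdefault keeps any value the
    # user has already exported.
    os.environ.setdefault("TORCH_NPU_ALLOC_CONF", "expandable_segments:True")
    # find_spec only tells us the package is installed, nothing more.
    has_torch_npu = importlib.util.find_spec("torch_npu") is not None
    return "npu:0" if has_torch_npu else "cpu"


device = configure_npu_env()
print(device, os.environ["TORCH_NPU_ALLOC_CONF"])
```

The returned string could then be passed as `device_map` when loading the model.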

8. References

ACE-Step Captioner on ModelScope: https://modelscope.cn/models/ACE-Step/acestep-captioner
ACE-Step Captioner on HuggingFace: https://huggingface.co/ACE-Step/acestep-captioner
Qwen2.5-Omni-7B: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
Technical report: https://arxiv.org/abs/2602.00744
torch_npu documentation: https://gitee.com/ascend/pytorch
vLLM-Ascend deployment guide: https://docs.vllm.ai/projects/ascend/zh-cn/v0.18.0/tutorials/models/Qwen3.5-27B.html