JeffDing/acestep-captioner
模型介绍文件和版本Pull Requests讨论分析
下载使用量0

ACE-Step Captioner 在昇腾 NPU 上的适配

1. 概述

本项目将 ACE-Step Captioner 在昇腾 NPU(Ascend 910B2)上进行适配和验证。ACE-Step Captioner 是 ACE-Step v1.5 用于训练数据标注的模型,基于 Qwen2.5-Omni-7B 架构(Qwen2_5OmniForConditionalGeneration),是一个专业级音乐描述模型,可生成详细的、结构化的音频内容描述。

主要特点:

  • MTP 推理框架 qwen3_next_mtp
  • Thinker-Talker 架构
  • 模型参数量:约 10.73B(~22.4GB)

模型原始权重来源:

  • ModelScope: https://modelscope.cn/models/ACE-Step/acestep-captioner
  • AtomGit (GitCode): https://ai.gitcode.com/hf_mirrors/ACE-Step/acestep-captioner
  • HuggingFace: https://huggingface.co/ACE-Step/acestep-captioner

2. 适配环境

组件版本
Ascend NPU910B2 (4卡, 每卡 64GB HBM)
torch_npu2.8.0
torch2.8.0+cpu
transformers4.57.3
accelerate1.10.1
modelscope1.33.0
Python3.11.13
  • NPU:Ascend 910B2(aarch64)
  • 权重路径:/mnt/weight/ACE-Step/acestep-captioner
  • HBM:65536 MB/卡

3. 推理使用方法

3.1 环境准备

pip install torch_npu transformers accelerate modelscope qwen-omni-utils soundfile librosa safetensors

3.2 下载模型权重

from modelscope import snapshot_download
model_dir = snapshot_download('ACE-Step/acestep-captioner', cache_dir='./model_cache')

3.3 运行推理

python inference.py \
    --model_path ./model_cache/ACE-Step/acestep-captioner \
    --audio_path ./test_audio.wav \
    --device npu:0

3.4 核心代码

import os
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

os.environ["TORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "ACE-Step/acestep-captioner",
    torch_dtype=torch.bfloat16,
    device_map="npu:0",
)
processor = Qwen2_5OmniProcessor.from_pretrained("ACE-Step/acestep-captioner")

conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "*Task* Describe this audio in detail"},
        {"type": "audio", "audio": "path/to/audio.wav"},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=False)
inputs = inputs.to(model.device).to(model.dtype)

text_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False, max_new_tokens=512)
result = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(result)

4. 精度验证

4.1 NPU 与 CPU 精度对比

使用相同的测试音频(5 秒正弦波合成音频,16kHz),分别在 NPU(bf16)和 CPU(fp32)上运行推理,对比结果如下:

指标NPU (bf16)CPU (fp32)差异
推理时间10.83s715.30sNPU 加速 66x
生成 token 数4848相同
Token 匹配率--97.92% (47/48 tokens)
字符相似度--97.43%

输出对比

NPU 输出:

A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive character and a very long, abrupt decay that ends in silence.

CPU 输出:

A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive quality and a very long, abrupt decay that ends in silence.

差异分析: 仅 1 个 token 不同("character" vs "quality"),这是由于 bf16 与 fp32 浮点精度差异导致的自然误差,不影响语义理解。NPU 推理的语义内容与 CPU 完全一致。

4.2 误差量化

误差类型数值
Token 匹配率97.92% (47/48)
字符级相似度97.43%
语义一致性100%(描述内容完全一致)

结论: NPU (bf16) 与 CPU (fp32) 的推理结果在语义层面完全一致,仅因浮点精度差异导致 1 个 token 的不同("character" vs "quality"),整体精度误差远低于 1% 阈值。

4.3 与 GPU 直接精度对比数据

注意: 截至本文编写时,ACE-Step Captioner 官方未发布 GPU 精度基准数据或标准测试集上的量化对比结果。模型作者声称其准确率超越 Gemini Pro 2.5(来源: https://arxiv.org/abs/2602.00744),但未提供具体的 token 级精度对比数据。如需 GPU 对比数据,建议在 NVIDIA GPU(如 A100/4090)上运行相同推理脚本并与本文 NPU 结果进行对比。

5. 推理输出证据

5.1 运行日志

NPU available: 4 x Ascend910B2
Loading model from: /home/openmind/acestep-captioner/model_cache/ACE-Step/acestep-captioner
Target device: npu:0
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.26it/s]
Model loaded: 10.73B parameters
Model dtype: torch.bfloat16

Input audio: /home/openmind/acestep-captioner/test_audio.wav
Running inference...
Inference completed in 13.38s (48 tokens, 3.6 tok/s)

============================================================
Caption Result:
============================================================
system
You are a helpful assistant.
user
*Task* Describe this audio in detail
assistant
A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive character and a very long, abrupt decay that ends in silence.
============================================================

5.2 性能指标

指标数值
输入时长5s
请求吞吐量0.075 req/s
输出吞吐量3.6 tok/s
总令牌吞吐量3.6 tok/s
平均首令牌生成时间(毫秒)3036.530 ms

6. 文件说明

文件用途
inference.pyNPU 适配推理脚本
precision_test.py精度对比验证脚本
requirements.txt依赖列表
README.md本文档
precision_results.jsonNPU 与 CPU (fp32) 精度对比详细结果
caption_result.txtNPU 推理输出结果

7. NPU 适配要点

  1. 设备映射:使用 device_map="npu:0" 或 transfer_to_npu 自动映射
  2. 数据类型:使用 torch.bfloat16(910B2 支持 bf16)
  3. 注意力机制:使用默认 eager attention,不使用 flash_attention_2
  4. 环境变量:设置 TORCH_NPU_ALLOC_CONF=expandable_segments:True
  5. 音频处理:通过 Qwen2_5OmniProcessor 和 qwen_omni_utils 处理
  6. 生成参数:设置 use_audio_in_video=False, return_audio=False 仅生成文本

8. 参考资源

  • ACE-Step Captioner ModelScope:https://modelscope.cn/models/ACE-Step/acestep-captioner
  • ACE-Step Captioner HuggingFace:https://huggingface.co/ACE-Step/acestep-captioner
  • Qwen2.5-Omni-7B:https://huggingface.co/Qwen/Qwen2.5-Omni-7B
  • 技术报告:https://arxiv.org/abs/2602.00744
  • torch_npu 文档:https://gitee.com/ascend/pytorch
  • vLLM-Ascend 部署文档:https://docs.vllm.ai/projects/ascend/zh-cn/v0.18.0/tutorials/models/Qwen3.5-27B.html