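This project adapts and validates ACE-Step Captioner on the Ascend NPU (Ascend 910B2). ACE-Step Captioner is the model that ACE-Step v1.5 uses to annotate its training data. It is built on the Qwen2.5-Omni-7B architecture (`Qwen2_5OmniForConditionalGeneration`) and is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.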
Source of the original model weights: the ModelScope repository `ACE-Step/acestep-captioner`.
| Component | Version |
|---|---|
| Ascend NPU | 910B2 (4 cards, 64 GB HBM per card) |
| torch_npu | 2.8.0 |
| torch | 2.8.0+cpu |
| transformers | 4.57.3 |
| accelerate | 1.10.1 |
| modelscope | 1.33.0 |
| Python | 3.11.13 |
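To confirm that a local environment matches the versions in the table, a small check along the following lines may help. This is a minimal sketch: `EXPECTED` and `check_versions` are hypothetical names introduced here, and the distribution names are assumed to match the package names in the table.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the environment table; the keys are assumed to be
# the installed distribution names.
EXPECTED = {
    "torch_npu": "2.8.0",
    "torch": "2.8.0+cpu",
    "transformers": "4.57.3",
    "accelerate": "1.10.1",
    "modelscope": "1.33.0",
}


def check_versions(expected=EXPECTED):
    """Return {package: (wanted, installed_or_None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```

A mismatch does not necessarily mean inference will fail; the table only records the versions used for this validation.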
- Hardware: Ascend 910B2 (aarch64)
- Weight path: /mnt/weight/ACE-Step/acestep-captioner
- Device memory: 65536 MB per card

Install the dependencies:

```bash
pip install torch_npu transformers accelerate modelscope qwen-omni-utils soundfile librosa safetensors
```

Download the model weights:

```python
from modelscope import snapshot_download

model_dir = snapshot_download('ACE-Step/acestep-captioner', cache_dir='./model_cache')
```

Run inference from the command line:

```bash
python inference.py \
    --model_path ./model_cache/ACE-Step/acestep-captioner \
    --audio_path ./test_audio.wav \
    --device npu:0
```

Python API usage:

```python
import os

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu  # redirects CUDA-style calls to the NPU
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Allocator setting used for this adaptation.
os.environ["TORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

# Load the model in bf16 on the first NPU.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "ACE-Step/acestep-captioner",
    torch_dtype=torch.bfloat16,
    device_map="npu:0",
)
processor = Qwen2_5OmniProcessor.from_pretrained("ACE-Step/acestep-captioner")

# One user turn: the task prompt plus the audio file to describe.
conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "*Task* Describe this audio in detail"},
        {"type": "audio", "audio": "path/to/audio.wav"},
    ]},
]

# Build the prompt text and collect the multimodal inputs.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=False)
inputs = inputs.to(model.device).to(model.dtype)

# Generate text only (no audio output).
text_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False, max_new_tokens=512)
result = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(result)
```

The same test audio (a 5-second synthesized sine-wave clip at 16 kHz) was run through inference on the NPU (bf16) and on the CPU (fp32). The results compare as follows:
| Metric | NPU (bf16) | CPU (fp32) | Difference |
|---|---|---|---|
| Inference time | 10.83 s | 715.30 s | NPU is about 66x faster |
| Generated tokens | 48 | 48 | Same |
| Token match rate | - | - | 97.92% (47/48 tokens) |
| Character similarity | - | - | 97.43% |
NPU output:

```
A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive character and a very long, abrupt decay that ends in silence.
```

CPU output:

```
A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive quality and a very long, abrupt decay that ends in silence.
```

Difference analysis: only one token differs ("character" vs "quality"). This is the kind of small deviation expected from the precision gap between bf16 and fp32, and it does not change the meaning. The semantic content of the NPU output matches the CPU output.
| Metric | Value |
|---|---|
| Token match rate | 97.92% (47/48) |
| Character-level similarity | 97.43% |
| Semantic consistency | 100% (the described content is identical) |
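Conclusion: the NPU (bf16) and CPU (fp32) results are semantically identical. Floating-point precision differences change a single token ("character" vs "quality"), which corresponds to a token-level mismatch of 2.08% (1/48) and a character-level difference of 2.57%.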
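The two similarity metrics can be illustrated with a short sketch. The document does not state how `precision_test.py` computes them, so the definitions below are assumptions: a position-wise comparison of token IDs for the match rate, and `difflib.SequenceMatcher` for character similarity. The function names and the toy token IDs are hypothetical.

```python
from difflib import SequenceMatcher


def token_match_rate(tokens_a, tokens_b):
    """Fraction of positions where two token sequences agree.

    Positions beyond the shorter sequence count as mismatches.
    """
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0
    same = sum(a == b for a, b in zip(tokens_a, tokens_b))
    return same / longest


def char_similarity(text_a, text_b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


if __name__ == "__main__":
    # Toy example: 48 tokens with exactly one differing position.
    npu_ids = list(range(48))
    cpu_ids = list(range(48))
    cpu_ids[40] = 999
    print(f"token match rate: {token_match_rate(npu_ids, cpu_ids):.2%}")
```

With one differing position out of 48, the match rate is 47/48, which is the 97.92% reported above. The character similarity depends on the exact algorithm, so this sketch is not guaranteed to reproduce the 97.43% figure.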
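Note: at the time of writing, the ACE-Step Captioner authors have not published GPU precision baselines or quantitative comparisons on a standard test set. The authors claim that the model's accuracy exceeds that of Gemini Pro 2.5 (source: https://arxiv.org/abs/2602.00744), but they do not provide token-level precision comparisons. If GPU comparison data is needed, run the same inference script on an NVIDIA GPU (for example an A100 or 4090) and compare the output with the NPU results in this document.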
NPU inference log:

```
NPU available: 4 x Ascend910B2
Loading model from: /home/openmind/acestep-captioner/model_cache/ACE-Step/acestep-captioner
Target device: npu:0
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.26it/s]
Model loaded: 10.73B parameters
Model dtype: torch.bfloat16
Input audio: /home/openmind/acestep-captioner/test_audio.wav
Running inference...
Inference completed in 13.38s (48 tokens, 3.6 tok/s)
============================================================
Caption Result:
============================================================
system
You are a helpful assistant.
user
*Task* Describe this audio in detail
assistant
A single, sustained, and heavily distorted sawtooth synthesizer note creates a harsh, buzzing texture. The sound is static and monolithic, with a gritty, abrasive character and a very long, abrupt decay that ends in silence.
============================================================
```

| Metric | Value |
|---|---|
| Input duration | 5 s |
| Request throughput | 0.075 req/s |
| Output throughput | 3.6 tok/s |
| Total token throughput | 3.6 tok/s |
| Mean time to first token | 3036.530 ms |
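The throughput figures follow directly from the logged run (48 tokens in 13.38 s), and the 66x speedup follows from the 10.83 s and 715.30 s timings in the comparison table. A small sketch of that arithmetic, with `throughput` as a hypothetical helper name:

```python
def throughput(num_tokens, seconds, num_requests=1):
    """Derive request and token throughput from one timed run."""
    return {
        "req_per_s": num_requests / seconds,
        "tok_per_s": num_tokens / seconds,
    }


# Figures from the inference log: 48 tokens generated in 13.38 s.
stats = throughput(num_tokens=48, seconds=13.38)

# Timings from the NPU vs CPU comparison table.
speedup = 715.30 / 10.83

if __name__ == "__main__":
    print(f"request throughput: {stats['req_per_s']:.3f} req/s")
    print(f"output throughput:  {stats['tok_per_s']:.1f} tok/s")
    print(f"NPU speedup:        {speedup:.0f}x")
```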
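| File | Purpose |
|---|---|
| inference.py | NPU-adapted inference script |
| precision_test.py | Precision comparison and validation script |
| requirements.txt | Dependency list |
| README.md | This document |
| precision_results.json | Detailed NPU vs CPU (fp32) precision comparison results |
| caption_result.txt | NPU inference output |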
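- Device placement: `device_map="npu:0"`, or automatic mapping through `transfer_to_npu`
- Data type: `torch.bfloat16` (the 910B2 supports bf16)
- Memory allocation: `TORCH_NPU_ALLOC_CONF=expandable_segments:True`
- Input processing: handled by `Qwen2_5OmniProcessor` and `qwen_omni_utils`
- Text-only generation: `use_audio_in_video=False, return_audio=False`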
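The device and dtype choices above could be wrapped in a small selection helper so the same script can fall back to the CPU (fp32) configuration used for the precision comparison. This is a sketch under stated assumptions: `pick_runtime` is a hypothetical helper, it only checks whether the `torch_npu` package is importable (not whether a device is actually present), and it reuses the environment variable name exactly as given in this document.

```python
import importlib.util
import os


def pick_runtime(prefer_npu=True):
    """Choose device and dtype names for inference.

    Returns plain strings so the helper can run without torch installed;
    the caller maps "bfloat16" / "float32" to torch dtypes.
    """
    has_npu_package = importlib.util.find_spec("torch_npu") is not None
    if prefer_npu and has_npu_package:
        # Allocator setting from the adaptation notes; kept if already set.
        os.environ.setdefault("TORCH_NPU_ALLOC_CONF", "expandable_segments:True")
        return {"device": "npu:0", "dtype": "bfloat16"}
    # CPU fallback mirrors the fp32 reference run.
    return {"device": "cpu", "dtype": "float32"}


if __name__ == "__main__":
    print(pick_runtime())
```

The returned values map onto the `device_map` and `torch_dtype` arguments of `from_pretrained` in the usage example.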