Covo-Audio-Chat on Ascend NPU (Ascend Adaptation)

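1. Introduction

This document records the adaptation and verification results for Covo-Audio-Chat on Huawei Ascend NPU (Ascend 910).

Covo-Audio-Chat is an 8.4B-parameter end-to-end audio language model developed by Tencent. It processes continuous audio input directly and generates text and audio output. The architecture consists of the following components: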

LLM backbone: Qwen2-7B (28 layers, 3584 hidden, 28 attention heads, 4 KV heads)
Audio Encoder: Whisper-large-v3 (1280 d_model, 32 encoder layers)
Audio Adapter: 4-layer Conv1d downsampling (downsample=8)
Audio Decoder (token2wav): BigVGAN-based vocoder

Original paper: Covo-Audio Technical Report (arXiv:2602.09823)

2. Verification environment

Component	Version
Hardware	2x Ascend910_9362 (61.3 GB device memory each)
`torch`	2.9.0+cpu
`torch_npu`	2.9.0.post1+gitee7ba04
`transformers`	4.57.6
`torchaudio`	2.9.0
CANN	8.5.1
Python	3.11.14

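3. Adaptation steps

3.1 Environment setup

# Install dependencies (version specifiers are quoted so the shell does not treat >= and <= as redirections)
pip install "transformers>=4.57.0" "torchaudio<=2.8.0" soundfile numpy einops librosa==0.11.0 json5 torchdiffeq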

3.2 Get the model weights

Download the model weights from GitCode or HuggingFace:

# GitCode
git clone https://gitcode.com/tencent_hunyuan/Covo-Audio-Chat.git

# or HuggingFace
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download tencent/Covo-Audio-Chat --local-dir ./Covo-Audio-Chat

3.3 Get the inference code

git clone https://github.com/Tencent/Covo-Audio.git
cd Covo-Audio

3.4 Code adaptation (CUDA → NPU)

Main changes:

Device mapping: replace cuda:0 / cuda:1 with npu:0
Environment variable: CUDA_VISIBLE_DEVICES → ASCEND_RT_VISIBLE_DEVICES
torchaudio loading: add soundfile as a fallback for audio loading
torch.autocast decorator: device_type="cuda" → device_type="cpu" (for NPU compatibility)
SDKernel configuration: wrap it in try-except for NPU devices
torch.stft: make sure the input is float32 (the NPU does not support the cos operation on int64)
complex64 abs(): the NPU does not support abs() on complex64; use .real**2 + .imag**2 instead of .abs()**2
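The first two changes in the list above amount to a string and environment-variable remap. A minimal sketch of that remap follows, assuming device names are passed around as plain strings; the helper names `to_npu_device` and `remap_visible_devices` are hypothetical and are not part of the Covo-Audio repository.

```python
import re


def to_npu_device(device: str, npu_index: int = 0) -> str:
    """Map 'cuda' or 'cuda:N' to 'npu:<npu_index>'; leave other strings unchanged.

    Hypothetical helper: every CUDA index collapses onto one NPU, mirroring
    the 'cuda:0 / cuda:1 -> npu:0' replacement described above.
    """
    if re.fullmatch(r"cuda(:\d+)?", device):
        return f"npu:{npu_index}"
    return device


def remap_visible_devices(env: dict) -> dict:
    """Return a copy of env with CUDA_VISIBLE_DEVICES moved to ASCEND_RT_VISIBLE_DEVICES.

    An ASCEND_RT_VISIBLE_DEVICES value that is already set is kept as-is.
    """
    out = dict(env)
    if "CUDA_VISIBLE_DEVICES" in out:
        cuda_value = out.pop("CUDA_VISIBLE_DEVICES")
        out.setdefault("ASCEND_RT_VISIBLE_DEVICES", cuda_value)
    return out


if __name__ == "__main__":
    print(to_npu_device("cuda:1"))
    print(remap_visible_devices({"CUDA_VISIBLE_DEVICES": "0"}))
```

Collapsing both CUDA indices onto a single NPU matches the single-card run shown in this document; a multi-card setup would need a real index mapping instead.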

Run inference:

ASCEND_RT_VISIBLE_DEVICES=0 python run_inference.py \
    --model_dir ./Covo-Audio-Chat \
    --mode a2t \
    --device npu:0 \
    --max_new_tokens 512

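4. Inference verification

4.1 NPU inference results

The model loads and runs inference on the NPU successfully, and both dialogue rounds produce normal output: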

Using device: npu:0
Start loading the Covo-Audio model...
Model initialization time: 229.87 seconds
  Loading model-00001-of-00004.safetensors...
  Loading model-00002-of-00004.safetensors...
  Loading model-00003-of-00004.safetensors...
  Loading model-00004-of-00004.safetensors...
Weight loading time: 48.50 seconds

=== Round 0 ===
Generation time (round 0): 2.93 seconds
[Round 0] Decoded text:  alguo 100% confio em vocês, e sei que vocês também confiam em mim.
Acredito que juntamente com vocês, vamos construir um futuro melhor para todos nós.

Translate to English

I fully trust you all, and I know that you also trust me.
I believe that together with you, we will build a better future for all of us.

=== Round 1 ===
Generation time (round 1): 1.18 seconds
[Round 1] Decoded text: I fully trust you all, and I know that you also trust me.
I believe that together with you, we will build a better future for all of us.

Inference completed successfully!

Performance data (NPU, bf16):

Stage	Time
Model initialization	229.87 s
Weight loading	48.50 s
Round 0 generation	2.93 s
Round 1 generation	1.18 s

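5. Accuracy comparison

5.1 Mel spectrogram accuracy (CPU vs NPU)

The log-Mel spectrogram of the same audio input (003000298.wav) was computed on CPU and on NPU, and the two results were compared:

Metric	Value
Maximum absolute error	1.812e-05
Mean absolute error	1.285e-07
Mean relative error	1.224e-06
Cosine similarity	0.9999998 (≈1.0)
Fraction of elements within bf16 tolerance	100.00%

Conclusion: The CPU and NPU Mel spectrograms differ by at most 1.812e-05, well inside the bf16 tolerance used here (< 0.01), so the two are numerically equivalent for inference purposes.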
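The metrics in the table above can be computed along the following lines. This is a sketch only: the document does not give the exact formulas, so the `eps` guard in the relative error and the per-element use of the 0.01 tolerance are assumptions, and the function name `compare_arrays` is hypothetical. Synthetic data stands in for the real CPU and NPU spectrograms.

```python
import numpy as np


def compare_arrays(ref, test, tol=1e-2, eps=1e-12):
    """Return error metrics between two arrays of the same shape.

    tol: absolute tolerance per element (assumed to be the 0.01 bf16 bound).
    eps: small constant that keeps the relative error finite near zero.
    """
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    abs_err = np.abs(ref - test)
    rel_err = abs_err / (np.abs(ref) + eps)
    cosine = np.dot(ref, test) / (np.linalg.norm(ref) * np.linalg.norm(test))
    return {
        "max_abs_err": float(abs_err.max()),
        "mean_abs_err": float(abs_err.mean()),
        "mean_rel_err": float(rel_err.mean()),
        "cosine": float(cosine),
        "within_tol": float(np.mean(abs_err <= tol)),
    }


if __name__ == "__main__":
    # Synthetic stand-ins for the CPU and NPU log-Mel spectrograms.
    cpu_mel = np.linspace(-8.0, -1.0, 128).reshape(8, 16)
    npu_mel = cpu_mel + 1e-5
    for name, value in compare_arrays(cpu_mel, npu_mel).items():
        print(f"{name}: {value:.6g}")
```

Casting both inputs to float64 before subtracting keeps the comparison itself from adding rounding error on top of the difference being measured.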

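5.2 End-to-end inference reproducibility (NPU)

On the same NPU device, inference was run twice in a row with the same input (003000298.wav) and the same parameters (temperature=0, repetition_penalty=1.05):

Metric	Result
Exact token match rate	100.0% (64/64 tokens)
Text identical	True

Text generated by the two NPU runs:

Run 0: alguo 100% confio em vocês, e sei que vocês também confiam em mim...
Run 1: alguo 100% confio em vocês, e sei que vocês também confiam em mim...

Conclusion: NPU inference was fully reproducible in this test: the same input with temperature=0 produced identical token sequences in both runs.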
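A token match rate like the one reported above can be computed with a position-by-position comparison. The sketch below is one plausible definition, not necessarily the one used for the table: it divides by the longer sequence so that a length mismatch counts as a miss. The function name `token_match_rate` and the token IDs are hypothetical.

```python
def token_match_rate(run_a, run_b):
    """Fraction of positions where two token sequences agree.

    The denominator is the longer length, so extra or missing tokens
    lower the rate. Two empty sequences count as a perfect match.
    """
    longest = max(len(run_a), len(run_b))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / longest


if __name__ == "__main__":
    # Made-up token IDs standing in for two 64-token generations.
    run0 = list(range(64))
    run1 = list(range(64))
    rate = token_match_rate(run0, run1)
    print(f"match rate: {rate:.1%} ({int(rate * 64)}/64 tokens)")
```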

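5.3 Accuracy comparison with GPU

Explicit note: At the time of writing, no GPU accuracy baseline for Covo-Audio-Chat could be found online (the model was released in February 2026 and is still new), so a direct accuracy comparison against GPU cannot be provided.

According to the benchmark results in the model's official README, Covo-Audio-Chat achieves state-of-the-art or competitive results on a range of audio understanding tasks; see the paper (arXiv:2602.09823) for details.

5.4 CPU inference accuracy comparison

Note: Inference with the 8.4B-parameter model on CPU is very slow (a single dialogue round is estimated to take tens of minutes). Because of memory limits (229 GB of RAM has to hold ~16 GB of model weights plus the KV cache), only the Mel-spectrogram-level comparison (see Section 5.1) was completed on CPU. The full end-to-end CPU inference run was killed by the system for running out of memory.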

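5.5 Operator compatibility analysis

Operator type	Ascend compatibility	Notes
Conv1d / ConvTranspose1d	✅ Fully compatible	Standard PyTorch operator
Linear (nn.Linear)	✅ Fully compatible	Standard PyTorch operator
torch.stft	✅ Compatible (float32)	Input must be float32
F.scaled_dot_product_attention	✅ Compatible	Supported by torch_npu
F.masked_scatter	✅ Compatible	Supported by torch_npu
WhisperEncoder	✅ Compatible	Natively supported by transformers
Qwen2ForCausalLM	✅ Compatible	Natively supported by transformers
BigVGAN (token2wav)	✅ Compatible	Implemented with standard operators only
complex64 abs()	⚠️ Needs adaptation	Not supported on NPU; use real²+imag² instead

The model uses no custom Triton kernels or CUDA-specific operators. All computation is implemented with standard PyTorch, so NPU compatibility is good.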
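The complex64 workaround in the table rests on the identity |z|² = Re(z)² + Im(z)², which needs no abs() on complex input. The sketch below illustrates it with NumPy complex64 arrays as a stand-in so that it runs without an NPU; torch complex tensors expose the same `.real` and `.imag` attributes, so the same expression is expected to apply there. The function name `power_spectrum` is hypothetical.

```python
import numpy as np


def power_spectrum(spec):
    """Squared magnitude of a complex array, without calling abs() on it.

    Only .real and .imag are used, so the function is duck-typed over any
    array type that exposes those two attributes.
    """
    return spec.real ** 2 + spec.imag ** 2


if __name__ == "__main__":
    spec = np.array([3 + 4j, 1 - 1j, 0 + 2j], dtype=np.complex64)
    power = power_spectrum(spec)
    # Cross-check against the abs()-based form on CPU, where it is supported.
    print(power, np.allclose(power, np.abs(spec) ** 2))
```

Besides avoiding the unsupported operator, this form skips the square root that abs() computes and that squaring would immediately undo.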

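6. Notes

Audio loading: in the current environment torchaudio.load requires torchcodec to be installed; the code adds soundfile as a fallback
STFT dtype: torch.stft requires float32 input on NPU; int64 input triggers an aclnnInplaceCos error
torch.autocast: the @torch.autocast(device_type="cuda") decorator in token2wav must be changed to device_type="cpu" for NPU compatibility
Device memory: the model takes about 16 GB (bf16); a single card with 61.3 GB is enough to run inference
token2wav decoder: generating audio output (a2ta mode) needs about 1.8 GB of additional device memory
complex64: the NPU does not support torch.abs() on complex64; use .real**2 + .imag**2 instead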
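The memory figures in the notes above can be sanity-checked with simple arithmetic: bf16 stores each parameter in 2 bytes. The sketch below covers weights only and ignores the KV cache, activations, and runtime overhead, so it gives a lower bound rather than a measured footprint. Whether the 8.4B parameter count already includes the token2wav decoder is not stated in this document, so the 1.8 GB is added separately as an assumption.

```python
BYTES_PER_BF16 = 2  # bf16 uses 16 bits per value


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_BF16):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_card(required_gb, card_gb):
    """True if the estimated requirement does not exceed the card's memory."""
    return required_gb <= card_gb


if __name__ == "__main__":
    weights_gb = weight_memory_gb(8.4e9)   # 8.4B parameters in bf16
    total_gb = weights_gb + 1.8            # plus token2wav for a2ta mode
    print(f"weights: {weights_gb:.1f} GB, with token2wav: {total_gb:.1f} GB")
    print("fits on a 61.3 GB card:", fits_on_card(total_gb, 61.3))
```

The estimate of about 16.8 GB for the weights is consistent with the "about 16 GB" figure quoted above and leaves ample headroom on a 61.3 GB card.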

7. Related links

Original model weights (HuggingFace): https://huggingface.co/tencent/Covo-Audio-Chat
Original model weights (GitCode): https://gitcode.com/tencent_hunyuan/Covo-Audio-Chat
Inference code (GitHub): https://github.com/Tencent/Covo-Audio
Paper: https://arxiv.org/abs/2602.09823

8. Citation

@misc{wang2026covoaudiotechnicalreport,
      title={Covo-Audio Technical Report},
      author={Wenfu Wang and Chenxing Li and Liqiang Zhang and others},
      year={2026},
      eprint={2602.09823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
}