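Qwen3-ASR-1.7B NPU Adaptation

This repository adapts Qwen3-ASR-1.7B to Ascend NPUs and validates the result.

Model Overview

Qwen3-ASR-1.7B is a high-performance speech recognition model open-sourced by the Qwen team. It supports 52 languages and dialects and reaches state-of-the-art (SOTA) results among open-source ASR models. This adaptation lets the model run efficiently on Ascend NPUs with only minimal code changes.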

Environment Setup

Prerequisites

Python >= 3.10
CANN >= 8.0
torch >= 2.1.0
torch-npu
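
The minimum versions listed above can be checked programmatically before loading the model. The following is a minimal sketch, not part of this repository: the helper names `parse_version` and `meets_minimum` are hypothetical, and version strings with suffixes such as `2.1.0+cpu` or `8.0.RC1` are assumed formats. Pre-release ordering is deliberately ignored, so `8.0.RC1` is treated as `8.0`.

```python
import platform
import re

# Minimums copied from the prerequisites list above.
MINIMUMS = {"python": "3.10", "cann": "8.0", "torch": "2.1.0"}


def parse_version(text):
    """Return the leading integer components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "RC1"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that "3.10" compares like "3.10.0"
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print("Python OK:", meets_minimum(platform.python_version(), MINIMUMS["python"]))
```

Comparing integer tuples rather than raw strings matters here: as strings, "3.9" sorts after "3.10", which would wrongly accept Python 3.9.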

Install Dependencies

# Basic inference (Transformers backend)
pip install -U qwen-asr

# To use the vLLM backend, also install the vllm extra (quoted so shells such as zsh do not expand the brackets)
pip install -U "qwen-asr[vllm]"

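Note: the runtime image used for validation ships with vllm and vllm-ascend preinstalled, so the qwen-asr[vllm] install command above was not run separately during validation. If your environment does not have these two packages preinstalled, run that command to enable the vLLM backend.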

Download Weights

Model weight sources:

Option 1: Download from AtomGit (recommended)

python3 -m atomgit download hf_mirrors/Qwen/Qwen3-ASR-1.7B -d /opt/atomgit/weight/Qwen3-ASR-1.7B
python3 -m atomgit download hf_mirrors/Qwen/Qwen3-ForcedAligner-0.6B -d /opt/atomgit/weight/Qwen3-ForcedAligner-0.6B

Option 2: Download from Hugging Face

huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./Qwen3-ASR-1.7B
huggingface-cli download Qwen/Qwen3-ForcedAligner-0.6B --local-dir ./Qwen3-ForcedAligner-0.6B

Inference

1. Quick Inference (Transformers backend)

Compared with the original GPU script, the NPU adaptation needs only two changes:

Add import torch_npu
Change device_map="cuda:0" to device_map="npu:0"
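
Because the two changes are this small, one script can serve both GPU and NPU hosts by picking the device string at runtime. The sketch below is an illustration under assumptions, not code from this repository: `select_device_map` is a hypothetical helper, and it only checks whether the `torch_npu` package is installed, not whether an NPU is actually present.

```python
import importlib.util


def select_device_map(index=0, npu_available=None):
    """Return "npu:<index>" when torch_npu is importable, else "cuda:<index>".

    Pass npu_available explicitly to override the automatic check.
    """
    if npu_available is None:
        # find_spec returns None when the top-level package is not installed.
        npu_available = importlib.util.find_spec("torch_npu") is not None
    prefix = "npu" if npu_available else "cuda"
    return f"{prefix}:{index}"


if __name__ == "__main__":
    print(select_device_map())
```

The returned string can be passed as `device_map`. On an NPU host the script still has to run `import torch_npu` before the model is loaded.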

import torch
import torch_npu  # importing torch_npu registers the NPU backend with PyTorch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "/opt/atomgit/weight/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="npu:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # None lets the model detect the language automatically
)

print(results[0].language)
print(results[0].text)

Example output:

English
Hmm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.

2. Batch Inference + Timestamps (Transformers backend)

import torch
import torch_npu
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "/opt/atomgit/weight/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="npu:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="/opt/atomgit/weight/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="npu:0",
    ),
)

results = model.transcribe(
    audio=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    ],
    language=["Chinese", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0] if r.time_stamps else None)

Example output:

Chinese 甚至出现交易几乎停滞的情况。 ForcedAlignItem(text='甚', start_time=0.4, end_time=0.72)
English Hmm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people. ForcedAlignItem(text='Hmm', start_time=0.48, end_time=0.88)

3. Standalone ForcedAligner

import torch
import torch_npu
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "/opt/atomgit/weight/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="npu:0",
)

results = model.align(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    text="甚至出现交易几乎停滞的情况。",
    language="Chinese",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)

Example output:

ForcedAlignResult(items=[ForcedAlignItem(text='甚', start_time=0.4, end_time=0.72), ForcedAlignItem(text='至', start_time=0.72, end_time=0.96), ForcedAlignItem(text='出', start_time=0.96, end_time=1.12), ForcedAlignItem(text='现', start_time=1.12, end_time=1.52), ForcedAlignItem(text='交', start_time=1.52, end_time=1.76), ForcedAlignItem(text='易', start_time=1.76, end_time=2.0), ForcedAlignItem(text='几', start_time=2.0, end_time=2.24), ForcedAlignItem(text='乎', start_time=2.24, end_time=2.48), ForcedAlignItem(text='停', start_time=2.48, end_time=2.72), ForcedAlignItem(text='滞', start_time=2.72, end_time=2.88), ForcedAlignItem(text='的', start_time=2.88, end_time=3.04), ForcedAlignItem(text='情', start_time=3.04, end_time=3.36), ForcedAlignItem(text='况', start_time=3.36, end_time=3.68)])
甚 0.4 0.72
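
Per-character alignment items like those printed above are often merged into subtitle cues. The following is a minimal sketch of that post-processing step. `AlignItem` is a hypothetical stand-in that only mirrors the three fields visible in the example output (`text`, `start_time`, `end_time`), and `srt_timestamp` and `to_srt_cue` are illustrative helpers, not part of qwen-asr.

```python
from dataclasses import dataclass


@dataclass
class AlignItem:
    """Hypothetical stand-in with the fields shown in the printed ForcedAlignItem."""
    text: str
    start_time: float
    end_time: float


def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt_cue(items, index=1, joiner=""):
    """Merge consecutive items into one cue spanning first start to last end."""
    if not items:
        raise ValueError("at least one alignment item is required")
    text = joiner.join(item.text for item in items)
    start = srt_timestamp(items[0].start_time)
    end = srt_timestamp(items[-1].end_time)
    return f"{index}\n{start} --> {end}\n{text}\n"


if __name__ == "__main__":
    # First two items from the example output above.
    sample = [AlignItem("甚", 0.4, 0.72), AlignItem("至", 0.72, 0.96)]
    print(to_srt_cue(sample))
```

For Chinese the items are joined without a separator. For space-delimited languages, `joiner=" "` would be the natural choice.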

4. vLLM Backend

Because of API differences between qwen-asr and the current vllm-ascend version, the vLLM backend requires a compatibility patch injected through PYTHONPATH.

export PYTHONPATH=/opt/atomgit/Qwen3-ASR-1.7B/patch_site:$PYTHONPATH
python3 inference_vllm.py

import torch
import torch_npu
from qwen_asr import Qwen3ASRModel

# Keep the entry-point guard: vLLM may start worker processes that re-import this module.
if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="/opt/atomgit/weight/Qwen3-ASR-1.7B",
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="/opt/atomgit/weight/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="npu:0",
        ),
    )

    results = model.transcribe(
        audio=[
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
        ],
        language=["Chinese", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0] if r.time_stamps else None)

Example output:

Chinese 甚至出现交易几乎停滞的情况。 ForcedAlignItem(text='甚', start_time=0.4, end_time=0.72)
English Mhm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people. ForcedAlignItem(text='Mhm', start_time=0.4, end_time=0.88)
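
The PYTHONPATH-based patch injection used for the vLLM backend relies on standard Python startup behavior: unless started with `-S`, the interpreter's `site` module tries to import a module named `sitecustomize` from `sys.path`. The sketch below demonstrates the mechanism with a toy patch in a temporary directory. It does not reproduce the contents of `patch_site/sitecustomize.py`, and `run_with_patch_dir`, `demo`, and the `DEMO_PATCH_APPLIED` marker are hypothetical names.

```python
import os
import subprocess
import sys
import tempfile


def run_with_patch_dir(patch_dir, code):
    """Run `code` in a fresh interpreter with patch_dir prepended to PYTHONPATH."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = patch_dir if not existing else patch_dir + os.pathsep + existing
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo():
    """Write a toy sitecustomize.py and show that a child interpreter picks it up."""
    with tempfile.TemporaryDirectory() as patch_dir:
        with open(os.path.join(patch_dir, "sitecustomize.py"), "w") as fh:
            # Toy patch: sets a marker before any user code runs.
            fh.write("import builtins\nbuiltins.DEMO_PATCH_APPLIED = True\n")
        return run_with_patch_dir(
            patch_dir,
            "import builtins; print(getattr(builtins, 'DEMO_PATCH_APPLIED', False))",
        )


if __name__ == "__main__":
    print(demo())
```

Since the module is imported at interpreter startup, before the main script runs, a patch delivered this way needs no edits to the inference script itself.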

Performance Benchmark

python3 benchmark.py

Benchmark Results

===== Benchmark Results =====
Average latency: 1.935s
Min latency:     1.840s
Max latency:     2.055s
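
The contents of `benchmark.py` are not shown here. As a rough illustration of how average, minimum, and maximum latency figures like these can be collected, the sketch below times a callable over several runs after a warm-up phase. The function names, run counts, and warm-up count are assumptions, and the report layout simply imitates the output above.

```python
import time


def benchmark(fn, runs=10, warmup=2):
    """Call fn `warmup` times untimed, then return the latency of each timed run."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as compilation or caching
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def format_report(latencies):
    """Render latencies (in seconds) in the same layout as the results above."""
    average = sum(latencies) / len(latencies)
    return "\n".join([
        "===== Benchmark Results =====",
        f"Average latency: {average:.3f}s",
        f"Min latency:     {min(latencies):.3f}s",
        f"Max latency:     {max(latencies):.3f}s",
    ])


if __name__ == "__main__":
    # Placeholder workload; in practice fn would wrap a transcription call.
    print(format_report(benchmark(lambda: sum(range(100_000)), runs=5)))
```

Accelerator kernels may execute asynchronously, so the timed callable should return only once the final result is available on the host.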

Accuracy Validation

python3 accuracy.py

Validation Results

===== CPU (float32) baseline =====
Language: English
Text: Hmm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well. But he did very well when he started writing for other people.

===== NPU (bfloat16) =====
Language: English
Text: Hmm. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.

===== Comparison =====
CPU text length: 185
NPU text length: 185
Sequence similarity: 99.72%
Error rate: 0.28%

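The character-level error rate of the NPU output against the CPU float32 baseline is 0.28%, well below the 1% threshold. The only difference is a punctuation and capitalization change (". But" vs ", but"), which is semantically equivalent for the ASR task.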
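
One common way to compute a character-level similarity between two transcripts is `difflib.SequenceMatcher` from the Python standard library. The sketch below is an illustration only: the metric and any text normalization used by `accuracy.py` are not shown in this README, so figures from this sketch need not reproduce the numbers reported above. `similarity` and `strip_case_and_punct` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def strip_case_and_punct(text):
    """Lowercase the text and drop everything except letters, digits, and whitespace."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())


def similarity(reference, hypothesis, normalize=None):
    """Return the SequenceMatcher ratio in [0, 1], optionally after normalization."""
    if normalize is not None:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    return SequenceMatcher(None, reference, hypothesis, autojunk=False).ratio()


if __name__ == "__main__":
    cpu = "didn't do overly well. But he did very well"
    npu = "didn't do overly well, but he did very well"
    print(f"raw:        {similarity(cpu, npu):.4f}")
    print(f"normalized: {similarity(cpu, npu, strip_case_and_punct):.4f}")
```

The choice of normalization changes the reported figure, so it is worth stating alongside any similarity number.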

File Overview

File	Description
`inference.py`	Quick NPU inference (Transformers backend)
`inference_batch_timestamps.py`	Batch inference + timestamps (Transformers backend)
`inference_forced_aligner.py`	Standalone ForcedAligner usage
`inference_vllm.py`	vLLM backend inference
`benchmark.py`	NPU performance benchmark
`accuracy.py`	Accuracy comparison against the CPU baseline
`patch_site/sitecustomize.py`	vllm-ascend compatibility patch
`output/`	Run logs

Citation

@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}
