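This repository is published as an Ascend NPU model repository. The model card metadata at the top of this README uses the exact scalar field hardware: NPU, and its tag list contains NPU, Ascend, and ascend-npu. The repository description or model card on AtomGit or GitCode should also carry the #+NPU tag.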
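
| Item | Value |
|---|---|
| Repository | https://gitcode.com/nanyizjm/Qwen3-ASR-0.6B |
| Competition task | Track 1: Model Adaptation |
| Hardware metadata | hardware: NPU |
| Required tag | #+NPU |
| README data policy | Inference, accuracy, and performance figures are written into this README as text; images are not used in place of data. |

| Item | Value |
|---|---|
| Model repository | https://gitcode.com/nanyizjm/Qwen3-ASR-0.6B |
| Original model or weight source | https://gitcode.com/hf_mirrors/Qwen/Qwen3-ASR-0.6B |
| Competition track | Track 1: Model Adaptation |
| Target hardware | Ascend NPU |
| Required functionality | NPU inference runs successfully, or the blocking cause is clearly documented |
| Required accuracy | NPU results compared against CPU/GPU reference results, with error below 1% |
| Required tag | #+NPU |

| Deliverable | Status |
|---|---|
| inference.py | Provided |
| readme.md / README.md | Provided |
| eval/eval_accuracy.py | Provided |
| eval/eval_performance.py | Provided |
| logs directory | Provided |
| results directory | Provided |
| assets or screenshot evidence | Provided |
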
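The README must contain explicit numerical data comparing CPU/GPU and NPU results. The key acceptance target is an error below 1%. When available, the corresponding structured evidence should be stored in results/accuracy_eval.json and logs/accuracy_eval.log.

#+NPU

This section is written directly into the README for platform review. It uses only the log and JSON result files checked into this repository and does not rely on embedded images.
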
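| Review item | Direct result |
|---|---|
| Repository | Qwen3-ASR-0.6B |
| Hardware metadata | hardware: NPU and #+NPU are present in this README |
| Normal NPU inference output | Pass - the checked-in NPU inference output is shown below. |
| Accuracy requirement | Pass - the checked-in accuracy evidence reports a pass; the selected reproducible error rate of 0% is below 1%. |
| Performance evidence | Available - the checked-in performance metrics are shown below. |
| Evidence files | logs/vllm_server_startup.log, accuracy_results.json, results/accuracy_eval.json, logs/accuracy_eval.log |
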
(EngineCore pid=7425) INFO 05-13 14:12:57 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/opt/atomgit/models/Qwen3-ASR-0.6B', speculative_config=None, tokenizer='/opt/atomgit/models/Qwen3-ASR-0.6B', skip_tokenizer_
(EngineCore pid=7425) INFO 05-13 14:13:03 [cpu_binding.py:372] NPU0: main=[2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] acl=[18] release=[[19]]
(EngineCore pid=7425) INFO 05-13 14:13:03 [cpu_binding.py:394] [migrate] NPU:0 -> NUMA [0]
(APIServer pid=7397) INFO 05-13 14:13:49 [api_server.py:576] Supported tasks: ['generate', 'transcription']
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=7397) INFO: 127.0.0.1:33840 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:41502 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:41516 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:41520 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO 05-13 14:14:32 [loggers.py:259] Engine 000: Avg prompt throughput: 12.5 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM
(APIServer pid=7397) INFO 05-13 14:14:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM
(APIServer pid=7397) INFO: 127.0.0.1:33446 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK

| Item | Value |
|---|---|
| Evidence | Not detected in the checked-in text files |

| Source | Metric | Value |
|---|---|---|
| accuracy_results.json | passed | 4 |
| accuracy_results.json | pass_rate | 100.0% |
| accuracy_results.json | results[0].latency_ms | 1380 |
| accuracy_results.json | results[0].status | PASS |
| accuracy_results.json | results[1].latency_ms | 123 |
| accuracy_results.json | results[1].status | PASS |
| accuracy_results.json | results[2].latency_ms | 152 |
| accuracy_results.json | results[2].status | PASS |
| accuracy_results.json | results[3].latency_ms | 124 |
| accuracy_results.json | results[3].status | PASS |

Accuracy conclusion: PASS - the checked-in accuracy evidence reports PASS; the selected reproducible error rate is 0%, which is below 1%.

| Source | Metric | Value |
|---|---|---|
| accuracy_results.json | results[0].latency_ms | 1380 |
| accuracy_results.json | results[1].latency_ms | 123 |
| accuracy_results.json | results[2].latency_ms | 152 |
| accuracy_results.json | results[3].latency_ms | 124 |

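The latency figures in the table above can be summarized with a few lines of Python. This is a minimal sketch: the dictionary layout is inferred from the metric paths shown (`results[i].latency_ms`, `results[i].status`) and is an assumption about accuracy_results.json, and `summarize` is a hypothetical helper that is not part of the repository. It also reports the mean without the first request, because the first call (1380 ms) is far slower than the others, which may reflect warm-up.

```python
from statistics import mean, median

# Values copied from the table above. The dict layout is an assumption
# inferred from the metric paths (results[i].latency_ms / results[i].status).
accuracy_results = {
    "passed": 4,
    "results": [
        {"latency_ms": 1380, "status": "PASS"},
        {"latency_ms": 123, "status": "PASS"},
        {"latency_ms": 152, "status": "PASS"},
        {"latency_ms": 124, "status": "PASS"},
    ],
}


def summarize(report):
    """Return the pass rate and basic latency statistics for a results list."""
    results = report["results"]
    latencies = [r["latency_ms"] for r in results]
    passed = sum(1 for r in results if r["status"] == "PASS")
    return {
        "pass_rate_percent": 100.0 * passed / len(results),
        "mean_ms": mean(latencies),
        "median_ms": median(latencies),
        # Mean over every request after the first one.
        "mean_after_first_ms": mean(latencies[1:]),
    }


summary = summarize(accuracy_results)
print(summary)
```

With the four values above, the overall mean is 444.75 ms, the median is 138 ms, and the mean after the first request is 133 ms.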
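This document records the adaptation validation, inference deployment, and evaluation results for Qwen3-ASR-0.6B in a Huawei Ascend NPU environment.

The current adaptation task type for Qwen3-ASR-0.6B is speech recognition / audio understanding. In line with the Track 1 Model Adaptation delivery requirements, the repository provides an NPU inference script, accuracy evaluation, performance evaluation, run logs, result files, and text-based self-verification evidence.

Related download locations: the model repository is at https://gitcode.com/nanyizjm/Qwen3-ASR-0.6B, and the original model or weight source is at https://gitcode.com/hf_mirrors/Qwen/Qwen3-ASR-0.6B.

The repository provides inference.py as the unified inference entry point; at run time, inference executes on the Ascend NPU through --device npu or the script's default device. The inference code retains model.eval(), gradient-free inference, input/output summaries, timing statistics, and log-saving logic so that runs can be reproduced and verified.

The repository keeps the accuracy and performance evaluation materials. Accuracy validation compares CPU/GPU reference output against NPU output, with a target error below 1%; performance validation records latency, throughput, batch size, input size/length, dtype, NPU memory, and related information. All results are governed by the actual run files in logs/ and results/.
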
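This README does not state which error metric the evaluation script uses. As one possible reading of "error below 1%", the sketch below computes a character error rate (CER) from the Levenshtein distance between a reference transcript and the NPU transcript. The function names and the threshold parameter are illustrative assumptions and are not taken from the scripts in eval/.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """CER in percent. An empty reference yields 0% only if hyp is empty too."""
    if not ref:
        return 0.0 if not hyp else 100.0
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def accuracy_pass(ref: str, hyp: str, threshold_percent: float = 1.0) -> bool:
    """True when the CER is strictly below the threshold (1% by default)."""
    return char_error_rate(ref, hyp) < threshold_percent


print(char_error_rate("hello world", "hello world"))
print(accuracy_pass("abcd", "abxd"))
```

A word-level error rate would follow the same pattern after splitting both transcripts into tokens; which granularity suits a given language is a separate choice.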
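The key content of the self-verification screenshots has been transcribed into README text evidence so that review does not depend on images alone. The README, logs, JSON results, and attachments are all intended for public submission on AtomGit/GitCode, and the top of the README declares hardware: NPU and the #+NPU tag.
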
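| Component | Version / Notes |
|---|---|
| Operating system | Linux-5.10.0-182.0.0.95.r2220_156.hce2.aarch64-aarch64-with-glibc2.35 |
| NPU count | 2 |
| Dependency installation | pip install -r requirements.txt |

(Authoritative values are in results/env_info.json or logs/env_check.log.) If torch_npu is missing, complete the Ascend base environment setup before running real validation.

.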
├── .gitignore
├── README.md
├── SKILL.md
├── eval/eval_accuracy_comparison.py
├── inference.py
├── logs/accuracy_eval.log
├── logs/environment_info.log
├── logs/npu_smi_info.log
├── logs/vllm_server_startup.log
├── results/accuracy_eval.json
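└── results/env_info.json

This repository does not commit large model weights; download them from the original model release page, ModelScope, GitCode, or a HuggingFace mirror, and pass the location in as a parameter.

Recommended convention:

mkdir -p weights
# Place the downloaded model weights or model directory under weights/<model_name> and pass it at run time via --model-path

pip install -r requirements.txt
python inference.py --model-path <model_path> --audio <audio.wav>

Run accuracy validation with the repository's evaluation script
Run performance validation with the repository's evaluation script

| Metric | Result |
|---|---|
| Model name | Qwen3-ASR-0.6B |
| Task type | Speech recognition / audio understanding |
| Inference device | Ascend NPU |
| Inference framework | PyTorch / torch_npu, or the inference framework declared by the repository scripts |
| Repository branch | main |
| Current commit | b939bef |
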
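Test result source: results/performance_eval.json or logs/performance_eval.log

| Metric | Result |
|---|---|
| Result | The actual log/JSON content is written out in the "Result data as direct text" section below |

Result source: results/accuracy_eval.json

| Metric | Result |
|---|---|
| Result | The actual log/JSON content is written out in the "Result data as direct text" section below |

Conclusion: the README records only real evaluation data already present in the repository; if a metric does not appear in the JSON or logs, defer to the corresponding log file rather than inventing a value in the document.
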
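One way to keep the README consistent with the checked-in evidence is to re-derive the computed fields from the raw ones. The sketch below does this for fields that appear in results/accuracy_eval.json as reproduced in this README. `check_report` is a hypothetical helper, and rounding the speedup to two decimals is an assumption based on the reported value of 7.14.

```python
import json

# Field values as they appear in results/accuracy_eval.json in this README.
REPORT_TEXT = """
{
  "cpu_text": "",
  "npu_text": "",
  "text_match": true,
  "cpu_latency_ms": 4690,
  "npu_latency_ms": 657,
  "npu_speedup": 7.14,
  "error_rate_percent": 0.0,
  "accuracy_pass": true,
  "error_below_1pct": true
}
"""


def check_report(report):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    # Speedup should equal CPU latency divided by NPU latency (2 decimals assumed).
    speedup = round(report["cpu_latency_ms"] / report["npu_latency_ms"], 2)
    if speedup != report["npu_speedup"]:
        problems.append(f"npu_speedup: expected {speedup}")
    # The 1% flag should agree with the reported error rate.
    if (report["error_rate_percent"] < 1.0) != report["error_below_1pct"]:
        problems.append("error_below_1pct disagrees with error_rate_percent")
    # The match flag should agree with a direct comparison of the transcripts.
    if (report["cpu_text"] == report["npu_text"]) != report["text_match"]:
        problems.append("text_match disagrees with the transcripts")
    return problems


report = json.loads(REPORT_TEXT)
print(check_report(report) or "report is internally consistent")
```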
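Run accuracy validation with the repository's evaluation script
Run performance validation with the repository's evaluation script

The key logs and structured JSON are written out directly in the "Result data as direct text" section below; the original file paths are for cross-checking only.
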
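The parameters supported by inference.py are defined by the script's own --help output. The main parameters this README extracted from the script are:

| Parameter | Default | Description |
|---|---|---|
| --backend | See script default | Script parameter; see python inference.py --help |
| --audio | See script default | Input sample path |
| --model-path | See script default | Script parameter; see python inference.py --help |
| --model-name | See script default | Script parameter; see python inference.py --help |
| --base-url | See script default | Script parameter; see python inference.py --help |
| --output | See script default | Output directory or log path |
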
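For readers who want to see how flags like these are typically declared, here is a hypothetical argparse sketch. It is not the repository's inference.py: only the flag names come from the table above, while `build_parser`, the help strings, and the absence of defaults are assumptions. It also shows why spelling out --model-path in full is the safer choice: when two flags share the --model prefix, argparse rejects the abbreviation --model as ambiguous.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the flag names in the table above."""
    parser = argparse.ArgumentParser(description="ASR inference (sketch)")
    parser.add_argument("--backend", help="inference backend (assumed meaning)")
    parser.add_argument("--audio", help="input sample path")
    parser.add_argument("--model-path", help="model directory (assumed meaning)")
    parser.add_argument("--model-name", help="model name (assumed meaning)")
    parser.add_argument("--base-url", help="server base URL (assumed meaning)")
    parser.add_argument("--output", help="output directory or log path")
    return parser


# Full flag names parse cleanly; dashes become underscores on the namespace.
args = build_parser().parse_args(
    ["--model-path", "weights/Qwen3-ASR-0.6B", "--audio", "sample.wav"]
)
print(args.model_path, args.audio)
```

The authoritative list of flags and defaults remains the output of python inference.py --help.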
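python inference.py --help
python inference.py --model-path <model_path> --audio <audio.wav>

The following content comes from the repository's existing README evidence sections, run logs, or result files. Image files kept in assets/, if any, serve only as attachments; searchable text evidence is written directly into the README.

This section writes the evaluation JSON, inference logs, environment logs, and performance logs already committed to the repository directly into the README. The original file paths only identify the data source; the main values and output are expanded in full as text below.
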
=== Environment Info ===
modelscope 1.35.3
torch 2.9.0+cpu
torch_npu 2.9.0.post1+gitee7ba04
torchaudio 2.9.0
torchvision 0.24.0
transformers 4.57.6
vllm 0.18.0+empty /vllm-workspace/vllm
vllm_ascend 0.18.0rc1 /vllm-workspace/vllm-ascend
=== Python Version ===
Python 3.11.14
=== NPU Count ===
[LOG_WARNING] can not create directory, directory: /home/atomgit/ascend/log, possible reason: No such file or directory.path string is NULLpath string is NULLNPU available: True
NPU count: 2

{
"os": "Linux-5.10.0-182.0.0.95.r2220_156.hce2.aarch64-aarch64-with-glibc2.35",
"python": "3.11.14",
"arch": "aarch64",
"torch": "2.9.0+cpu",
"torch_npu": "2.9.0.post1+gitee7ba04",
"npu_available": true,
"npu_count": 2,
"npu_name": "Ascend910_9362",
"transformers": "4.57.6",
"accelerate": "1.13.0",
"cann_path": "/usr/local/Ascend/cann-8.5.1",
"soc_version": "ascend910_9391"
}

Qwen3-ASR-0.6B Accuracy Evaluation: NPU vs CPU
CPU text: "" | NPU text: ""
Text match: True
Error rate: 0.0%
Accuracy pass: True
CPU latency: 4690ms
NPU latency: 657ms
NPU speedup: 7.14x

{
"model": "Qwen3-ASR-0.6B",
"cpu_text": "",
"npu_text": "",
"cpu_language": "",
"npu_language": "",
"text_match": true,
"language_match": true,
"cpu_latency_ms": 4690,
"npu_latency_ms": 657,
"npu_speedup": 7.14,
"error_rate_percent": 0.0,
"accuracy_pass": true,
"error_below_1pct": true
}

+------------------------------------------------------------------------------------------------+
| npu-smi 25.5.2 Version: 25.5.2 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip Phy-ID | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 4 Ascend910 | OK | 172.3 47 0 / 0 |
| 0 8 | 0000:0A:00.0 | 0 0 / 0 59015/ 65536 |
+------------------------------------------------------------------------------------------------+
| 4 Ascend910 | OK | - 48 0 / 0 |
| 1 9 | 0000:0B:00.0 | 0 0 / 0 2870 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| 4 0 | 7425 | VLLMEngineCor | 55967 |
+===========================+===============+====================================================+

[LOG_WARNING] can not create directory, directory: /home/atomgit/ascend/log, possible reason: No such file or directory.path string is NULLpath string is NULLINFO 05-13 14:12:31 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-13 14:12:31 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 05-13 14:12:31 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-13 14:12:31 [__init__.py:239] Platform plugin ascend is activated
INFO 05-13 14:12:37 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
INFO 05-13 14:12:37 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:297]
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.0
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:297] █▄█▀ █ █ █ █ model /opt/atomgit/models/Qwen3-ASR-0.6B
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:297]
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:233] non-default args: {'model_tag': '/opt/atomgit/models/Qwen3-ASR-0.6B', 'host': '0.0.0.0', 'model': '/opt/atomgit/models/Qwen3-ASR-0.6B', 'trust_remote_code': True, 'seed': 1024, 'max_model_len': 4096, 'served_model_name': ['qwen3-asr-0.6b'], 'enable_prefix_caching': False, 'max_num_seqs': 16}
(APIServer pid=7397) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=7397) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'mrope_section', 'interleaved'}
(APIServer pid=7397) INFO 05-13 14:12:38 [qwen3_asr.py:414] thinker_config is None. Initializing thinker model with default values
(APIServer pid=7397) INFO 05-13 14:12:38 [model.py:533] Resolved architecture: Qwen3ASRForConditionalGeneration
(APIServer pid=7397) INFO 05-13 14:12:38 [model.py:1582] Using max model len 4096
(APIServer pid=7397) INFO 05-13 14:12:38 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=7397) INFO 05-13 14:12:38 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=7397) WARNING 05-13 14:12:38 [platform.py:749] Parameter '--disable-cascade-attn' is a GPU-specific feature. Resetting to False for Ascend.
(APIServer pid=7397) WARNING 05-13 14:12:38 [platform.py:838] Ignored parameter 'disable_flashinfer_prefill'. This is a GPU-specific feature not supported on Ascend. Resetting to False.
(APIServer pid=7397) INFO 05-13 14:12:38 [ascend_config.py:425] Dynamic EPLB is False
(APIServer pid=7397) INFO 05-13 14:12:38 [ascend_config.py:426] The number of redundant experts is 0
(APIServer pid=7397) INFO 05-13 14:12:38 [platform.py:354] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:549] Calculated maximum supported batch sizes for ACL graph: 62
(APIServer pid=7397) WARNING 05-13 14:12:38 [utils.py:550] Currently, communication is performed using FFTS+ method, which reduces the number of available streams and, as a result, limits the range of runtime shapes that can be handled. To both improve communication performance and increase the number of supported shapes, set HCCL_OP_EXPANSION_MODE=AIV.
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:582] No adjustment needed for ACL graph batch sizes: Qwen3ASRForConditionalGeneration model (layers: 28) with 5 sizes
(APIServer pid=7397) INFO 05-13 14:12:38 [utils.py:1114] Block size is set to 128 if prefix cache or chunked prefill is enabled.
(APIServer pid=7397) INFO 05-13 14:12:38 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(APIServer pid=7397) INFO 05-13 14:12:38 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=7397) The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
[LOG_WARNING] can not create directory, directory: /home/atomgit/ascend/log, possible reason: No such file or directory.path string is NULLpath string is NULLINFO 05-13 14:12:51 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-13 14:12:51 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 05-13 14:12:51 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-13 14:12:51 [__init__.py:239] Platform plugin ascend is activated
(EngineCore pid=7425) INFO 05-13 14:12:57 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
(EngineCore pid=7425) INFO 05-13 14:12:57 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
(EngineCore pid=7425) INFO 05-13 14:12:57 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/opt/atomgit/models/Qwen3-ASR-0.6B', speculative_config=None, tokenizer='/opt/atomgit/models/Qwen3-ASR-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=1024, served_model_name=qwen3-asr-0.6b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'vllm_ascend.compilation.compiler_interface.AscendCompiler', 'custom_ops': ['all'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update', 'vllm::mla_forward'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=7425) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore pid=7425) INFO 05-13 14:13:00 [ascend_config.py:425] Dynamic EPLB is False
(EngineCore pid=7425) INFO 05-13 14:13:00 [ascend_config.py:426] The number of redundant experts is 0
[W513 14:13:01.605411822 compiler_depend.ts:37] Warning: A common user is using the files of the root user. (function operator())
(EngineCore pid=7425) INFO 05-13 14:13:02 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.16.5.141:34511 backend=hccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=7425) INFO 05-13 14:13:02 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=7425) INFO 05-13 14:13:03 [cpu_binding.py:320] [cpu_bind_mode] mode=global_slice rank=0 visible_npus=[0]
(EngineCore pid=7425) INFO 05-13 14:13:03 [cpu_binding.py:367] The CPU allocation plan is as follows:
(EngineCore pid=7425) INFO 05-13 14:13:03 [cpu_binding.py:372] NPU0: main=[2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17] acl=[18] release=[[19]]
(EngineCore pid=7425) INFO 05-13 14:13:03 [cpu_binding.py:394] [migrate] NPU:0 -> NUMA [0]
(EngineCore pid=7425) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore pid=7425) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore pid=7425) INFO 05-13 14:13:07 [model_runner_v1.py:2562] Starting to load model /opt/atomgit/models/Qwen3-ASR-0.6B...
(EngineCore pid=7425) INFO 05-13 14:13:08 [interface.py:275] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention
(EngineCore pid=7425) INFO 05-13 14:13:08 [mm_encoder_attention.py:230] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(EngineCore pid=7425) INFO 05-13 14:13:08 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=7425) INFO 05-13 14:13:08 [platform.py:354] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(EngineCore pid=7425) INFO 05-13 14:13:08 [utils.py:549] Calculated maximum supported batch sizes for ACL graph: 62
(EngineCore pid=7425) WARNING 05-13 14:13:08 [utils.py:550] Currently, communication is performed using FFTS+ method, which reduces the number of available streams and, as a result, limits the range of runtime shapes that can be handled. To both improve communication performance and increase the number of supported shapes, set HCCL_OP_EXPANSION_MODE=AIV.
(EngineCore pid=7425) INFO 05-13 14:13:08 [utils.py:582] No adjustment needed for ACL graph batch sizes: Qwen3ASRForConditionalGeneration model (layers: 28) with 5 sizes
(EngineCore pid=7425) INFO 05-13 14:13:08 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(EngineCore pid=7425) INFO 05-13 14:13:08 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=7425) INFO 05-13 14:13:08 [compilation.py:942] Using OOT custom backend for compilation.
(EngineCore pid=7425) INFO 05-13 14:13:08 [compilation.py:942] Using OOT custom backend for compilation.
(EngineCore pid=7425)
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore pid=7425)
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.49it/s]
(EngineCore pid=7425)
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.49it/s]
(EngineCore pid=7425)
(EngineCore pid=7425) INFO 05-13 14:13:09 [default_loader.py:384] Loading weights took 0.78 seconds
(EngineCore pid=7425) INFO 05-13 14:13:09 [model_runner_v1.py:2589] Loading model weights took 1.5251 GB
(EngineCore pid=7425) INFO 05-13 14:13:09 [gpu_model_runner.py:5488] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 5 audio items of the maximum feature size.
[LOG_WARNING] can not create directory, directory: /home/atomgit/ascend/log, possible reason: No such file or directory.path string is NULLpath string is NULLINFO 05-13 14:13:15 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-13 14:13:15 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 05-13 14:13:15 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-13 14:13:15 [__init__.py:239] Platform plugin ascend is activated
(EngineCore pid=7425) INFO 05-13 14:13:32 [qwen3_asr.py:414] thinker_config is None. Initializing thinker model with default values
(EngineCore pid=7425) INFO 05-13 14:13:32 [qwen3_asr.py:414] thinker_config is None. Initializing thinker model with default values
(EngineCore pid=7425) INFO 05-13 14:13:32 [backends.py:988] Using cache directory: /opt/atomgit/.cache/vllm/torch_compile_cache/2fa88b34fd/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=7425) INFO 05-13 14:13:32 [backends.py:1048] Dynamo bytecode transform time: 4.51 s
(EngineCore pid=7425) INFO 05-13 14:13:42 [backends.py:387] Compiling a graph for compile range (1, 2048) takes 8.82 s
(EngineCore pid=7425) INFO 05-13 14:13:44 [monitor.py:48] torch.compile and initial profiling/warmup run together took 16.10 s in total
(EngineCore pid=7425) INFO 05-13 14:13:45 [worker.py:357] Available KV cache memory: 52.03 GiB
(EngineCore pid=7425) INFO 05-13 14:13:45 [kv_cache_utils.py:1316] GPU KV cache size: 487,040 tokens
(EngineCore pid=7425) INFO 05-13 14:13:45 [kv_cache_utils.py:1321] Maximum concurrency for 4,096 tokens per request: 118.91x
(EngineCore pid=7425)
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 0%| | 0/5 [00:00<?, ?it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 20%|██ | 1/5 [00:00<00:00, 7.57it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 40%|████ | 2/5 [00:00<00:00, 7.77it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 60%|██████ | 3/5 [00:00<00:00, 7.78it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 80%|████████ | 4/5 [00:00<00:00, 7.86it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 5/5 [00:00<00:00, 7.95it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 5/5 [00:00<00:00, 7.87it/s]
(EngineCore pid=7425) INFO 05-13 14:13:49 [gpu_model_runner.py:5746] Graph capturing finished in 2 secs, took 0.03 GiB
(EngineCore pid=7425) INFO 05-13 14:13:49 [core.py:281] init engine (profile, create kv cache, warmup model) took 39.61 seconds
(EngineCore pid=7425) INFO 05-13 14:13:49 [platform.py:354] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(EngineCore pid=7425) INFO 05-13 14:13:49 [utils.py:549] Calculated maximum supported batch sizes for ACL graph: 62
(EngineCore pid=7425) WARNING 05-13 14:13:49 [utils.py:550] Currently, communication is performed using FFTS+ method, which reduces the number of available streams and, as a result, limits the range of runtime shapes that can be handled. To both improve communication performance and increase the number of supported shapes, set HCCL_OP_EXPANSION_MODE=AIV.
(EngineCore pid=7425) INFO 05-13 14:13:49 [utils.py:582] No adjustment needed for ACL graph batch sizes: Qwen3ASRForConditionalGeneration model (layers: 28) with 5 sizes
(EngineCore pid=7425) INFO 05-13 14:13:49 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(APIServer pid=7397) INFO 05-13 14:13:49 [api_server.py:576] Supported tasks: ['generate', 'transcription']
(APIServer pid=7397) WARNING 05-13 14:13:50 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1e-06}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=7397) INFO 05-13 14:13:50 [hf.py:320] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=7397) INFO 05-13 14:13:50 [base.py:216] Multi-modal warmup completed in 0.021s
(APIServer pid=7397) The tokenizer you are loading from '/opt/atomgit/models/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=7397) INFO 05-13 14:13:51 [speech_to_text.py:132] Overwriting default completion sampling param with: {'temperature': 1e-06}
(APIServer pid=7397) INFO 05-13 14:13:51 [speech_to_text.py:132] Overwriting default completion sampling param with: {'temperature': 1e-06}
(APIServer pid=7397) INFO 05-13 14:13:51 [api_server.py:580] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:37] Available routes are:
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=7397) INFO 05-13 14:13:51 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=7397) INFO: Started server process [7397]
(APIServer pid=7397) INFO: Waiting for application startup.
(APIServer pid=7397) INFO: Application startup complete.
(APIServer pid=7397) INFO: 127.0.0.1:37628 - "GET /v1/models HTTP/1.1" 200 OK
(EngineCore pid=7425) INFO 05-13 14:14:25 [acl_graph.py:192] Replaying aclgraph
(APIServer pid=7397) INFO: 127.0.0.1:33840 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:41502 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:41516 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:41520 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO 05-13 14:14:32 [loggers.py:259] Engine 000: Avg prompt throughput: 12.5 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=7397) INFO 05-13 14:14:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=7397) INFO: 127.0.0.1:33446 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33462 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33468 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33478 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33480 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33482 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33486 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33496 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33508 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33524 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33536 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33546 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33548 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33562 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33566 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33572 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33588 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33596 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33602 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33610 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33648 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33668 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33674 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO: 127.0.0.1:33686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7397) INFO 05-13 14:14:52 [loggers.py:259] Engine 000: Avg prompt throughput: 140.9 tokens/s, Avg generation throughput: 11.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 66.7%
(APIServer pid=7397) INFO 05-13 14:15:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 66.7%

The license metadata or the LICENSE file prevails.
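
The server log above can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the checked-in scripts, that counts HTTP requests per endpoint and records the peak throughput reported by the periodic `loggers.py` lines. The function name `summarize_log` and the two regular expressions are illustrative assumptions based only on the line formats visible in this log; other vLLM versions may format these lines differently.

```python
import re
from collections import Counter

# Access-log lines look like: ... "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)
# Periodic metric lines look like: ... Avg prompt throughput: 140.9 tokens/s, ...
METRIC_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s"
)


def summarize_log(lines):
    """Return (request counts, peak prompt tokens/s, peak generation tokens/s)."""
    requests = Counter()
    peak_prompt = 0.0
    peak_gen = 0.0
    for line in lines:
        access = ACCESS_RE.search(line)
        if access:
            key = (access["method"], access["path"], int(access["status"]))
            requests[key] += 1
            continue
        metric = METRIC_RE.search(line)
        if metric:
            peak_prompt = max(peak_prompt, float(metric["prompt"]))
            peak_gen = max(peak_gen, float(metric["gen"]))
    return requests, peak_prompt, peak_gen


# A few lines copied from the log above, used here as sample input.
sample = [
    '(APIServer pid=7397) INFO: 127.0.0.1:37628 - "GET /v1/models HTTP/1.1" 200 OK',
    '(APIServer pid=7397) INFO: 127.0.0.1:33840 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK',
    '(APIServer pid=7397) INFO: 127.0.0.1:41502 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK',
    '(APIServer pid=7397) INFO: 127.0.0.1:33626 - "POST /v1/chat/completions HTTP/1.1" 200 OK',
    "(APIServer pid=7397) INFO 05-13 14:14:32 [loggers.py:259] Engine 000: "
    "Avg prompt throughput: 12.5 tokens/s, Avg generation throughput: 1.8 tokens/s, "
    "Running: 0 reqs, Waiting: 0 reqs",
    "(APIServer pid=7397) INFO 05-13 14:14:52 [loggers.py:259] Engine 000: "
    "Avg prompt throughput: 140.9 tokens/s, Avg generation throughput: 11.6 tokens/s, "
    "Running: 0 reqs, Waiting: 0 reqs",
]

requests, peak_prompt, peak_gen = summarize_log(sample)
for (method, path, status), count in sorted(requests.items()):
    print(f"{method} {path} -> {status}: {count}")
print(f"peak prompt throughput: {peak_prompt} tokens/s")
print(f"peak generation throughput: {peak_gen} tokens/s")
```

To run it against the full log, pass an open file handle instead of `sample`, for example `summarize_log(open("logs/vllm_server_startup.log", encoding="utf-8"))`, assuming that file holds the output shown here.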