Qwen3-VL-Embedding-8B Deployment Guide for Ascend

This guide applies only to deploying the Qwen3-VL-Embedding-8B model on Ascend A2 hardware with the vllm-ascend framework.

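Introduction

The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built on the recently open-sourced Qwen3-VL foundation model. The suite is designed for multimodal information retrieval and cross-modal understanding, and accepts text, images, screenshots, and video, as well as inputs that mix these modalities. This guide describes how to run the model with vLLM Ascend.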

Environment Preparation

Environment Information

Component	Version
Driver and firmware	25.5.0.b070
CANN	8.5.0
Python	3.11.14
torch	2.9.0
torch_npu	2.9.0
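
To compare a running environment against the table above, a short script along these lines may help. It is only a sketch: it assumes the pip distribution names are `torch` and `torch-npu`, and it does not cover the driver, firmware, or CANN versions, which are not Python packages.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torch-npu")):
    """Return {component: version}, with None for packages that are not installed.

    The distribution names in `packages` are assumptions; adjust them if your
    environment installs these components under different names.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one tab-separated row per component, mirroring the table layout.
for component, version in collect_versions().items():
    print(f"{component}\t{version or 'not installed'}")
```

Running it inside the container prints the Python version plus whichever of the listed packages are installed.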

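Running Docker

Prepare the vllm-ascend Docker image

vllm-ascend provides Docker images for deployment. You can pull a prebuilt image directly from the ascend/vllm-ascend image repository and then run it with bash. Taking the latest vllm-ascend image at the time of writing as an example, pull it with the following command: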

docker pull quay.io/ascend/vllm-ascend:v0.14.0rc1

After the pull completes, run docker images to confirm that the image was pulled successfully.

Start the container

export IMAGE=quay.io/ascend/vllm-ascend:v0.14.0rc1
docker run --rm \
    --name vllm-ascend-env \
    --shm-size=1g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

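Note that because the command above includes --rm, the container is removed automatically when you exit it. To keep the container for longer-term use, drop the --rm flag.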
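
When a host exposes fewer than eight NPUs, or when only some should be passed through, the long command above can be generated instead of edited by hand. The following is a minimal sketch: `build_docker_command` and its `keep` parameter are hypothetical helpers introduced here for illustration, while the device nodes and mounts are copied from the command above.

```python
import shlex

# Host paths mounted at the same location inside the container (from the command above).
MOUNTS = [
    "/usr/local/dcmi",
    "/usr/local/Ascend/driver/tools/hccn_tool",
    "/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info",
    "/root/.cache",
]

# Management device nodes that are always passed through.
COMMON_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]


def build_docker_command(image, npu_ids, name="vllm-ascend-env", keep=False):
    """Build the `docker run` argument list; keep=True omits --rm."""
    cmd = ["docker", "run"]
    if not keep:
        cmd.append("--rm")
    cmd += ["--name", name, "--shm-size=1g", "--net=host"]
    for npu_id in npu_ids:
        cmd += ["--device", f"/dev/davinci{npu_id}"]
    for device in COMMON_DEVICES:
        cmd += ["--device", device]
    for path in MOUNTS:
        cmd += ["-v", f"{path}:{path}"]
    cmd += ["-it", image, "bash"]
    return cmd


# Print a copy-pasteable command for all eight NPUs.
print(shlex.join(build_docker_command("quay.io/ascend/vllm-ascend:v0.14.0rc1", range(8))))
```

Passing `keep=True` corresponds to removing --rm as described in the note above.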

Model Weights

There are many ways to download the model weights; this guide shows only the huggingface_hub approach.

First install the required packages with pip:

pip install huggingface-hub tqdm

Then run the following on the command line:

HF_ENDPOINT=https://hf-mirror.com python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='Qwen/Qwen3-VL-Embedding-8B',
    local_dir='/home/models/Qwen3-VL-Embedding-8B',
    resume_download=True,
    local_dir_use_symlinks=False
)
"

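Notes:
1. local_dir is the download directory for the model; set it to suit your environment.
2. If the weights will be used for online inference, record the directory where they are saved (in this example, '/home/models/Qwen3-VL-Embedding-8B') and pass it as an argument when starting the service.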
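
Before moving on to deployment, it can be useful to confirm that the download directory looks complete. The sketch below assumes the usual Hugging Face layout (a `config.json` plus one or more `*.safetensors` files); that layout is an assumption, not something this guide specifies, and `check_model_dir` is a hypothetical helper.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found in the model directory (empty if none).

    Assumes a Hugging Face style layout: config.json and *.safetensors files.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


# Check the example download location used in this guide.
issues = check_model_dir("/home/models/Qwen3-VL-Embedding-8B")
print("OK" if not issues else "\n".join(issues))
```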

Deployment

Offline Inference

import torch
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


if __name__=="__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
                runner="pooling",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())

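Note that model="Qwen/Qwen3-VL-Embedding-8B" above should be replaced with the local path where you downloaded the weights. If the script runs successfully, output like the following appears: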

(EngineCore_DP0 pid=344) (Worker pid=351) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|█████████████████████| 4/4 [00:00<00:00, 16.09it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.9283185005187988, 0.3254002034664154], [0.4081040918827057, 0.74125736951828]]
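
In the score matrix printed above, each row corresponds to a query and each column to a document, because the script multiplies embeddings[:2] (the queries) by embeddings[2:] (the documents). A minimal sketch of turning such a matrix into a per-query ranking follows. It reuses the printed values; treating the dot product as a cosine similarity additionally assumes the embeddings are normalized, which this guide does not state.

```python
# Score matrix copied from the sample output: rows are queries, columns are documents.
scores = [
    [0.9283185005187988, 0.3254002034664154],
    [0.4081040918827057, 0.74125736951828],
]


def rank_documents(score_row):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)


for query_index, row in enumerate(scores):
    order = rank_documents(row)
    print(f"query {query_index}: best document = {order[0]}, ranking = {order}")
```

With these values, each query ranks its own matching document first: the capital question pairs with the Beijing passage and the gravity question with the gravity passage.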

Online Inference

Start the service

vllm serve /home/models/Qwen3-VL-Embedding-8B  --runner pooling   --port 9008   --served-model-name Qwen3-VL-Embedding-8B

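Here /home/models/Qwen3-VL-Embedding-8B is the local path where the model is stored.

Send a request

Once the server is up, open a new terminal window, enter the same container, and query the model with a request: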

curl http://127.0.0.1:9008/v1/embeddings -H "Content-Type: application/json" -d '{
  "input": [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
}'

If a result is returned, the call succeeded, and the server logs the following:

(APIServer pid=1611) INFO:     127.0.0.1:42740 - "POST /v1/embeddings HTTP/1.1" 200 OK
(APIServer pid=1611) INFO 02-04 07:50:30 [loggers.py:257] Engine 000: Avg prompt throughput: 3.7 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
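
The same request can be issued from Python using only the standard library. This is a sketch under two assumptions: the service listens on 127.0.0.1:9008 as in the example above, and the response follows the OpenAI-style embeddings schema, where each entry of `data` carries an `index` and an `embedding`. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(texts, base_url="http://127.0.0.1:9008"):
    """Build a POST request equivalent to the curl command above."""
    payload = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_embeddings(response_body):
    """Return the embedding vectors ordered by each item's `index` field.

    Assumes an OpenAI-style response: {"data": [{"index": 0, "embedding": [...]}, ...]}.
    """
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, base_url="http://127.0.0.1:9008"):
    """Send the texts to the running service and return their embeddings."""
    with urllib.request.urlopen(build_request(texts, base_url)) as response:
        return extract_embeddings(json.load(response))
```

With the service running, `embed(["The capital of China is Beijing."])` returns a list containing one vector.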

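Performance Testing

This section uses the vLLM benchmark tool to measure the performance of Qwen3-VL-Embedding-8B. With the service from the example above still running, run the following command: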

vllm bench serve   --model Qwen/Qwen3-VL-Embedding-8B   --backend openai-embeddings   --dataset-name random   --endpoint /v1/embeddings   --random-input-len 200   --port 9008   --served-model-name Qwen3-VL-Embedding-8B   --save-result   --result-dir ./

Replace Qwen/Qwen3-VL-Embedding-8B with the path to your downloaded model.

Within a few minutes, the performance results are available. In this tutorial, they were:

100%|██████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:17<00:00, 58.53it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  17.09
Total input tokens:                      200000
Request throughput (req/s):              58.53
Total token throughput (tok/s):          11705.95
----------------End-to-end Latency----------------
Mean E2EL (ms):                          9493.82
Median E2EL (ms):                        9514.06
P99 E2EL (ms):                           16801.62
==================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
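
The throughput figures in the summary follow from the request count, token count, and duration, so they can be cross-checked. The sketch below parses the "key: value" lines of the summary text and recomputes both rates; `parse_summary` is a hypothetical helper, and the parsing assumes the plain-text layout shown above.

```python
SUMMARY = """\
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  17.09
Total input tokens:                      200000
Request throughput (req/s):              58.53
Total token throughput (tok/s):          11705.95
"""


def parse_summary(text):
    """Parse 'key: value' lines into a dict of floats, skipping other lines."""
    metrics = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if not separator:
            continue  # banner or divider line without a colon
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric
    return metrics


metrics = parse_summary(SUMMARY)
duration = metrics["Benchmark duration (s)"]
request_rate = metrics["Successful requests"] / duration
token_rate = metrics["Total input tokens"] / duration
print(f"derived request throughput: {request_rate:.2f} req/s")
print(f"derived token throughput:   {token_rate:.2f} tok/s")
```

The derived values differ slightly from the reported ones because the printed duration is rounded to two decimals.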