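This guide applies only to deploying the Qwen3-VL-Embedding-8B model on A2 hardware with the vllm-ascend framework.
The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest members of the Qwen family, built on the recently open-sourced Qwen3-VL foundation model. The suite is designed for multimodal information retrieval and cross-modal understanding. It accepts text, images, screenshots, and video, as well as inputs that mix these modalities. This guide describes how to run the model with vLLM Ascend.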
| Component | Version |
|---|---|
| Driver and firmware | 25.5.0.b070 |
| CANN | 8.5.0 |
| Python | 3.11.14 |
| torch | 2.9.0 |
| torch_npu | 2.9.0 |
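To compare an environment against the table above, a small script can list the Python-level versions. This is a minimal sketch, not part of vllm-ascend: the distribution names in `PACKAGES` are assumptions about how the packages are published on pip, and the driver, firmware, and CANN versions are not pip packages, so they are not covered here.

```python
import platform
from importlib import metadata

# Assumed pip distribution names; adjust if your environment names them differently.
PACKAGES = ["torch", "torch-npu", "vllm", "vllm-ascend"]


def installed_versions(packages=PACKAGES):
    """Return {name: version or None} for the interpreter and each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def report(versions):
    """Format the version mapping as one aligned line per component."""
    lines = []
    for name, version in versions.items():
        shown = version if version is not None else "not installed"
        lines.append(f"{name:<12} {shown}")
    return "\n".join(lines)


print(report(installed_versions()))
```

Run it inside the container, so that it reports the packages shipped in the image rather than those on the host.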
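vllm-ascend provides Docker images for deployment. You can pull a prebuilt image directly from the ascend/vllm-ascend image repository and then start it with a bash shell. Using the v0.14.0rc1 image (the latest vllm-ascend image at the time of writing) as an example, pull it with the following command: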
docker pull quay.io/ascend/vllm-ascend:v0.14.0rc1
After the pull completes, run docker images to check that the image was pulled successfully. Then start a container from the image:
export IMAGE=quay.io/ascend/vllm-ascend:v0.14.0rc1
docker run --rm \
--name vllm-ascend-env \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
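-it $IMAGE bash

Note: because the command above passes --rm, the container is removed automatically when you exit it. To keep the container for long-term use, remove the --rm flag.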
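If you only want to expose some of the NPUs to the container, the same command can be assembled programmatically. The helper below is a hypothetical sketch (the function `docker_run_args` and its `npu_ids` and `keep` parameters are not part of any tool). It reuses the device nodes and mounts from the command above and only varies the `/dev/davinciN` entries and the `--rm` flag.

```python
import shlex

# Device nodes and mounts copied from the docker run command in this guide.
COMMON_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
MOUNTS = [
    "/usr/local/dcmi",
    "/usr/local/Ascend/driver/tools/hccn_tool",
    "/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info",
    "/root/.cache",
]


def docker_run_args(image, npu_ids, name="vllm-ascend-env", keep=False):
    """Build the docker run argument list for the given NPU ids (hypothetical helper)."""
    args = ["docker", "run"]
    if not keep:
        args.append("--rm")  # container is deleted on exit unless keep=True
    args += ["--name", name, "--shm-size=1g", "--net=host"]
    for npu_id in npu_ids:
        args += ["--device", f"/dev/davinci{npu_id}"]
    for device in COMMON_DEVICES:
        args += ["--device", device]
    for path in MOUNTS:
        args += ["-v", f"{path}:{path}"]
    args += ["-it", image, "bash"]
    return args


# Print a shell-ready command that exposes only NPU 0 and NPU 1.
print(shlex.join(docker_run_args("quay.io/ascend/vllm-ascend:v0.14.0rc1", [0, 1])))
```

Passing `keep=True` drops `--rm`, which matches the note above about keeping the container for long-term use.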
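There are many ways to download the model weights. The following shows only how to download the model with the huggingface_hub tool.
First, install the required packages with pip:
pip install huggingface-hub tqdm

Then run the following on the command line: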
HF_ENDPOINT=https://hf-mirror.com python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='Qwen/Qwen3-VL-Embedding-8B',
local_dir='/home/models/Qwen3-VL-Embedding-8B',
resume_download=True,
local_dir_use_symlinks=False
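)
"

Notes:
1. local_dir is the directory the model is downloaded to; set it to suit your environment.
2. If the weights will be used for online inference, record the directory where they are saved (for example '/home/models/Qwen3-VL-Embedding-8B' in this example). It is passed as an argument when starting online inference.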
The following script runs offline inference: it embeds two queries and two documents and prints their similarity scores.

import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
                runner="pooling",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    # Rows correspond to queries, columns to documents
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())

Note: replace the path in model="Qwen/Qwen3-VL-Embedding-8B" above with the path you downloaded the model to. If the script runs successfully, you will see output like the following:
(EngineCore_DP0 pid=344) (Worker pid=351) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|█████████████████████| 4/4 [00:00<00:00, 16.09it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.9283185005187988, 0.3254002034664154], [0.4081040918827057, 0.74125736951828]]

For online inference, start the server with:

vllm serve /home/models/Qwen3-VL-Embedding-8B --runner pooling --port 9008 --served-model-name Qwen3-VL-Embedding-8B

Here /home/models/Qwen3-VL-Embedding-8B is the local path where the model is stored.
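Before pointing the server at a local path, it can help to confirm that the directory looks like a complete download. The check below is a minimal sketch under two assumptions: that the repository contains a `config.json`, and that the weights are stored as `*.safetensors` files. The function name `check_model_dir` is hypothetical, and the demo runs against a temporary directory rather than a real model.

```python
import tempfile
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Assumption: weights are shipped in safetensors format.
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


# Demo on a temporary directory so the sketch runs without a real download.
with tempfile.TemporaryDirectory() as tmp:
    print(check_model_dir(tmp))  # an empty directory fails both checks
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    print(check_model_dir(tmp))  # both files present, so no problems are reported
```

For a real deployment, call `check_model_dir('/home/models/Qwen3-VL-Embedding-8B')` with whatever directory you chose as `local_dir`.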
Once the server is running, open a new terminal window, enter the same container, and query the model by sending input prompts:
curl http://127.0.0.1:9008/v1/embeddings -H "Content-Type: application/json" -d '{
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
}'

If a result is returned, the call succeeded, and the server prints output like the following:
(APIServer pid=1611) INFO: 127.0.0.1:42740 - "POST /v1/embeddings HTTP/1.1" 200 OK
(APIServer pid=1611) INFO 02-04 07:50:30 [loggers.py:257] Engine 000: Avg prompt throughput: 3.7 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
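The same request can be issued from Python with only the standard library. This is a sketch rather than an official client: it assumes the server returns an OpenAI-style embeddings response (a `data` list whose items carry `index` and `embedding`), and the helper names `build_request`, `parse_embeddings`, `cosine`, and `embed` are hypothetical. The demo at the bottom uses a hand-written fake response so it runs without a server.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:9008/v1/embeddings"  # port taken from the vllm serve command


def build_request(texts, url=DEFAULT_URL):
    """Build the POST request that the curl example sends."""
    body = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def parse_embeddings(response_json):
    """Extract vectors in input order (assumes an OpenAI-style response body)."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def embed(texts, url=DEFAULT_URL):
    """Send the texts to a running server and return their embeddings."""
    with urllib.request.urlopen(build_request(texts, url)) as resp:
        return parse_embeddings(json.load(resp))


# Offline demo with a fake response, so no server is needed to try the helpers.
fake = {"data": [{"index": 1, "embedding": [0.0, 1.0]},
                 {"index": 0, "embedding": [1.0, 0.0]}]}
vectors = parse_embeddings(fake)
print(cosine(vectors[0], vectors[1]))  # orthogonal vectors give 0.0
```

With the server running, `embed(["The capital of China is Beijing."])` would return one vector per input, and `cosine` can then score any query against any document.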
Run a performance test for Qwen3-VL-Embedding-8B with the vLLM benchmark tool. The example below uses the serve subcommand against the server started above:

vllm bench serve --model Qwen/Qwen3-VL-Embedding-8B --backend openai-embeddings --dataset-name random --endpoint /v1/embeddings --random-input 200 --port 9008 --served-model-name Qwen3-VL-Embedding-8B --save-result --result-dir ./

Replace Qwen/Qwen3-VL-Embedding-8B with the path you downloaded the model to.
After a few minutes you will get the performance results. In this tutorial, the results are:
100%|██████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:17<00:00, 58.53it/s]
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 17.09
Total input tokens: 200000
Request throughput (req/s): 58.53
Total token throughput (tok/s): 11705.95
----------------End-to-end Latency----------------
Mean E2EL (ms): 9493.82
Median E2EL (ms): 9514.06
P99 E2EL (ms): 16801.62
==================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
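The headline numbers in this report can be cross-checked with simple arithmetic. The sketch below recomputes the throughput figures from the request count, token count, and duration printed above. The last line applies Little's law to estimate how many requests were in flight on average, which is only a rough estimate and assumes the latency measurements cover roughly the same window as the benchmark duration.

```python
# Values copied from the benchmark output above.
successful_requests = 1000
duration_s = 17.09
total_input_tokens = 200000
mean_e2el_ms = 9493.82

# Throughput is work divided by wall-clock duration.
request_throughput = successful_requests / duration_s
token_throughput = total_input_tokens / duration_s

# Each request carries the same number of random input tokens.
tokens_per_request = total_input_tokens / successful_requests

# Little's law: average number in flight = arrival rate * mean time in system.
mean_in_flight = request_throughput * (mean_e2el_ms / 1000)

print(f"request throughput : {request_throughput:.2f} req/s")
print(f"token throughput   : {token_throughput:.2f} tok/s")
print(f"tokens per request : {tokens_per_request:.0f}")
print(f"mean in flight     : {mean_in_flight:.0f} requests")
```

The recomputed throughputs differ slightly from the reported 58.53 req/s and 11705.95 tok/s because the printed duration is rounded to two decimals. An estimate of several hundred requests in flight would also explain why the mean end-to-end latency is a large fraction of the total run time.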