| Environment | Configuration |
|---|---|
| Hardware | 910B2 (64 GB) |
| Driver version | 23.0.5.1 |
| CANN version | 8.3.RC2 |
| Inference framework | vllm-ascend |
| Inference image | quay.io/ascend/vllm-ascend:v0.11.0rc2 |
| Deployment | Single node |
Use the official vllm-ascend image:
docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2
Download the model with modelscope:
pip install modelscope
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct --local_dir ./Qwen2.5-VL-3B-Instruct
# Qwen2.5-VL requires a recent transformers release (transformers >= 4.49.0); install from a mirror in mainland China
pip uninstall -y transformers accelerate
pip install transformers==4.49.0 accelerate==0.27.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# Qwen vl utils
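pip install qwen-vl-utils==0.0.8
Single-image inference runs on a single card. Write the inference script in Python:
vim one_pic_inference.py
Only the model root path 'xxx' needs to be changed; it appears in the two from_pretrained calls. The script is as follows: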
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xxx/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("xxx/Qwen2.5-VL-3B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("npu")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Run the following command:
python one_pic_inference.py
The result is as follows:
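['The image depicts a serene beach scene with a person and a dog. The person is sitting on the sand, facing the ocean, and appears to be interacting with the dog. The dog is also sitting on the sand, facing the person, as if extending a paw or making a friendly gesture. The person is wearing a plaid shirt and has long hair. The background shows the ocean with gentle waves under a clear sky, and the soft light suggests it may be early morning or evening. The overall atmosphere of the image is peaceful and pleasant.']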
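The trimming step in the script relies on `model.generate` returning, for each sequence, the prompt tokens followed by the newly generated ones, so slicing off the first `len(in_ids)` entries leaves only the answer. A minimal sketch with made-up token IDs (plain Python lists stand in for tensors; the values and the helper name `trim_prompt` are illustrative, not real tokenizer output or library API):

```python
# Hypothetical token IDs; plain lists stand in for the tensors used in the script.
input_ids = [[101, 7, 8, 9], [101, 4, 5, 6]]
generated_ids = [[101, 7, 8, 9, 42, 43], [101, 4, 5, 6, 50, 51]]


def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


trimmed = trim_prompt(input_ids, generated_ids)
print(trimmed)  # [[42, 43], [50, 51]]
```

Only the trimmed IDs are passed to `batch_decode`, which is why the printed result contains the description but not the prompt.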
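With multiple images, a single card fails to run the script and hits an out-of-memory error:
RuntimeError: NPU out of memory. Tried to allocate 91.78 GiB (NPU 0; 60.96 GiB total capacity; 8.48 GiB already allocated; 8.48 GiB current active; 51.95 GiB free; 8.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
To address this we need more resources, at least two NPU cards. In this experiment I used 2 cards. The main changes to the script are:
1. Comment out or delete device_map="auto".
The model is then loaded onto one NPU only, with no copies on other cards, and memory demand drops substantially.
2. Qwen-VL's image encoder converts each image into visual tokens. The larger the image, the more tokens it produces and the more memory it uses.
Keeping each image within 768 px per side significantly reduces the number of visual tokens and the memory footprint.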
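To make the effect of the size cap concrete, here is a rough sketch of how the visual token count scales with image size. It assumes one visual token per 28x28 pixel block, which is consistent with the `256*28*28` pixel budget in the commented `min_pixels` line of the single-image script, and it rounds each side to the nearest multiple of 28. The helper names `fit_within` and `estimate_visual_tokens` and the 4032x3024 input size are hypothetical, and the real preprocessing also applies `min_pixels`/`max_pixels` limits, so treat the numbers as estimates only.

```python
def fit_within(width, height, max_side=768):
    """Scale (width, height) down so the longer side is at most max_side,
    preserving the aspect ratio. Images that already fit are unchanged."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_visual_tokens(width, height, factor=28):
    """Approximate token count: one token per factor x factor pixel block,
    after rounding each side to the nearest multiple of factor."""
    cols = max(1, round(width / factor))
    rows = max(1, round(height / factor))
    return cols * rows


original = (4032, 3024)  # hypothetical size of an unscaled photo
resized = fit_within(*original)
print(resized, estimate_visual_tokens(*original), estimate_visual_tokens(*resized))
# (768, 576) 15552 567
```

Under these assumptions, capping the longer side at 768 px cuts the estimate from roughly 15,000 tokens to a few hundred per image. Note that the script below calls `img.resize((768, 768))`, which forces a square output; a size computed as in `fit_within` could be passed to `resize` instead if the aspect ratio matters.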
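Write the inference script in Python:
vim multi_pics_inference.py
We provide three images of cars (car1.jpg, car2.jpeg, and car3.jpg in the script below).
The reference inference script is as follows.
I saved the images and the model under the same root path, so only the 'xxx' paths need to be changed: the model path in the two from_pretrained calls and the three entries in image_paths.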
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image
# ==============================
# 1. Load the model (onto a single NPU)
# ==============================
# Replaces the original model-loading code:
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xxx/Qwen2.5-VL-3B-Instruct",
    torch_dtype="auto",
    # device_map="auto"  <- delete or comment out this line
)
# Explicitly use a single NPU (for example, NPU 0)
device = "npu:0"
model = model.to(device)
# ==============================
# 2. Load the processor (supports multiple images)
# ==============================
processor = AutoProcessor.from_pretrained(
    "xxx/Qwen2.5-VL-3B-Instruct"
)
# ==============================
# 3. Prepare 3 local image paths (replace with your own)
# ==============================
image_paths = [
    "xxx/car1.jpg",  # replace with the real path
    "xxx/car2.jpeg",
    "xxx/car3.jpg",
]
# Load the images as PIL.Image objects
images = []
for path in image_paths:
    img = Image.open(path).convert("RGB")
    # Reduce the resolution (this is the key step)
    img = img.resize((768, 768), Image.Resampling.LANCZOS)
    images.append(img)
# ==============================
# 4. Build messages (3 images + a text question)
# ==============================
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img,  # pass the PIL.Image object directly
            }
            for img in images
        ] + [
            {"type": "text", "text": "Describe all three images in detail."}
        ],
    }
]
# ==============================
# 5. Apply the chat template and extract the image inputs
# ==============================
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
# ==============================
# 6. Build the inputs and move them to the NPU
# ==============================
inputs = processor(
    text=[text],
    images=image_inputs,  # a list of several images is supported
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same NPU as the model
inputs = inputs.to(device)
# ==============================
# 7. Run generation
# ==============================
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens from each output sequence
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode the output
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print("✅ Inference result:")
print(output_text[0])
Run the following command:
python multi_pics_inference.py
The inference result is:
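Certainly! Here is a detailed description of each image:
Image 1:
Image 2:
Image 3:
The descriptions above give a comprehensive overview of each car's design, color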
Qwen2.5-VL Technical Report: https://arxiv.org/abs/2502.13923