Environment Setup

Item	Configuration
Hardware	910B2 (64 GB)
Driver version	23.0.5.1
CANN version	8.3.RC2
Inference framework	vllm-ascend
Inference image	quay.io/ascend/vllm-ascend:v0.11.0rc2
Deployment mode	Single node
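
Before running any of the scripts, it can help to confirm that PyTorch actually sees the NPUs inside the container. The following is a minimal sketch, assuming the image ships torch together with the torch_npu plugin (an assumption; the table does not list them). The helper name count_visible_npus is hypothetical.

```python
def count_visible_npus() -> int:
    """Return the number of NPUs PyTorch can see, or 0 if the plugin is missing."""
    try:
        import torch
        import torch_npu  # noqa: F401  (Ascend plugin; provides the torch.npu namespace)
    except ImportError:
        # torch or torch_npu is not installed in this environment.
        return 0
    if not torch.npu.is_available():
        return 0
    return torch.npu.device_count()


if __name__ == "__main__":
    print(f"Visible NPUs: {count_visible_npus()}")
```

A result of 0 inside the container usually means the devices were not passed through or the plugin is absent; on a working setup the count should match the number of cards you intend to use.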

Image Preparation

Use the official vllm-ascend image:

docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2

Download the Weights

pip install modelscope
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct --local_dir ./Qwen2.5-VL-3B-Instruct

Update Dependencies

# Qwen2.5-VL requires a recent transformers release (transformers >= 4.49.0); the install below uses a mirror in mainland China

pip uninstall -y transformers accelerate

pip install transformers==4.49.0 accelerate==0.27.0 -i https://pypi.tuna.tsinghua.edu.cn/simple 

# Qwen vl utils

pip install qwen-vl-utils==0.0.8
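
After reinstalling, it is easy to end up with a different transformers version than expected (for example, if another package pulls in its own). The sketch below checks the installed version against the 4.49.0 minimum stated above, using only the standard library. The helper names are hypothetical, and the comparison is deliberately simple: it ignores pre-release suffixes such as "rc1" or "dev0".

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.49.0' (or '4.49.0.dev0') into (4, 49, 0) for comparison."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(found, minimum: str) -> bool:
    """True if a version was found and it is at least the minimum."""
    return found is not None and version_tuple(found) >= version_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("transformers")
    status = "OK" if meets_minimum(found, "4.49.0") else "too old or missing"
    print("transformers", found, status)
```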

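Single-Image Inference

A single card is sufficient. Write the inference script in Python:

vim one_pic_inference.py

The only change the script needs is the model root path 'xxx', which appears in the two from_pretrained calls (model and processor). The script is as follows: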

from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s); replace 'xxx' with your model root path.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xxx/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("xxx/Qwen2.5-VL-3B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range
# of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

# One user turn containing an image (by URL) and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("npu")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Run the following command:

python one_pic_inference.py

The result is as follows:

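['The image depicts a serene beach scene with a person and a dog. The person is sitting on the sand, facing the ocean, and appears to be interacting with the dog. The dog is also sitting on the sand, facing the person, and seems to be extending a paw or making a friendly gesture. The person is wearing a plaid shirt and has long hair. The background shows the ocean with gentle waves, and the sky is clear, with soft light suggesting it may be early morning or late afternoon. The overall atmosphere of the image is peaceful and joyful.']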
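
The commented-out min_pixels and max_pixels lines in the script express a visual-token budget in pixels. The sketch below makes that conversion explicit. It assumes, as the script's own comments imply, that one visual token corresponds to a 28x28 pixel patch; the helper name pixel_budget is hypothetical.

```python
# Assumption (taken from the script's comments, e.g. 256*28*28):
# one visual token covers a 28x28 pixel patch.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The two values can then be passed as min_pixels and max_pixels to AutoProcessor.from_pretrained, as in the commented-out line of the script.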

Multi-Image Inference

A single card cannot run this as-is; it fails with an out-of-memory error:

RuntimeError: NPU out of memory. Tried to allocate 91.78 GiB (NPU 0; 60.96 GiB total capacity; 8.48 GiB already allocated; 8.48 GiB current active; 51.95 GiB free; 8.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

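To address this, more resources are needed: at least two NPU cards. I used 2 cards in this experiment. The main changes to the script are:

1. Comment out or delete device_map="auto".

The model is then loaded onto a single NPU only, with no duplicate copies, which greatly reduces the memory requirement.

2. The Qwen-VL image encoder converts each image into visual tokens. The larger the image, the more tokens it produces and the more device memory it uses.

Keeping each image's resolution within 768 px significantly reduces the number of visual tokens and the memory footprint.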
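
To get a feel for how resolution drives the token count, here is a rough estimator. It is only a sketch: it assumes one visual token per 28x28 pixel patch and the default 4-16384 token range mentioned in the single-image script's comments, and it approximates the processor's real resizing logic with simple rounding. The function name estimate_visual_tokens is hypothetical.

```python
PATCH = 28  # assumed pixels per side covered by one visual token


def estimate_visual_tokens(width: int, height: int,
                           min_tokens: int = 4, max_tokens: int = 16384) -> int:
    """Rough visual-token count for one image of the given size."""
    cols = max(1, round(width / PATCH))
    rows = max(1, round(height / PATCH))
    # Clamp to the token range; the real processor rescales the image instead.
    return max(min_tokens, min(cols * rows, max_tokens))


for size in [(768, 768), (1920, 1080), (4032, 3024)]:
    print(size, estimate_visual_tokens(*size))
```

Under this estimate a 768x768 image comes out at 27 x 27 = 729 tokens, while a 4032x3024 photo is roughly twenty times that, which is why downscaling before inference matters so much when several images share one prompt.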

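Write the inference script in Python:

vim multi_pics_inference.py

We provide three images: car1.jpg, car2.jpeg, and car3.jpg.

The reference inference script is shown below.

I keep the images and the model under the same root path, so only the 'xxx' paths need changing: the model path, the processor path, and the three image paths.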

from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# ==============================
# 1. Load the model (onto a single NPU)
# ==============================
# Replaces the original model-loading code:
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xxx/Qwen2.5-VL-3B-Instruct",
    torch_dtype="auto",
    # device_map="auto"  <- delete or comment out this line
)

# Explicitly use a single NPU (for example, NPU 0)
device = "npu:0"
model = model.to(device)

# ==============================
# 2. Load the processor (supports multiple images)
# ==============================
processor = AutoProcessor.from_pretrained(
    "xxx/Qwen2.5-VL-3B-Instruct"
)

# ==============================
# 3. Prepare the 3 local image paths (replace with your own)
# ==============================
image_paths = [
    "xxx/car1.jpg",  # replace with the real path
    "xxx/car2.jpeg",
    "xxx/car3.jpg"
]

# Load the images as PIL.Image objects
images = []
for path in image_paths:
    img = Image.open(path).convert("RGB")
    # Reduce the resolution (the key step!); note this forces a 768x768 square
    img = img.resize((768, 768), Image.Resampling.LANCZOS)
    images.append(img)


# ==============================
# 4. Build messages (3 images + a text question)
# ==============================
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img,  # pass the PIL.Image object directly
            }
            for img in images
        ] + [
            {"type": "text", "text": "Describe all three images in detail."}
        ],
    }
]

# ==============================
# 5. Apply the chat template + extract the image inputs
# ==============================
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

# ==============================
# 6. Build the inputs and move them to the NPU
# ==============================
inputs = processor(
    text=[text],
    images=image_inputs,   # accepts multiple images (a list)
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move to the same NPU as the model
inputs = inputs.to(device)

# ==============================
# 7. Run generation
# ==============================
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Decode the output
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print("✅ Inference result:")
print(output_text[0])

Run the following command:

python multi_pics_inference.py

The inference result is:

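Certainly! Here is a detailed description of each image:

Image 1:
- Model: The car in the image is a Ferrari 812 Superfast.
- Color: The body is painted a vivid red.
- Design: The car has the sleek, aerodynamic design typical of high-performance sports cars, with sharp lines and a low-slung body.
- Exterior details: The front grille is large and prominent, with the Ferrari emblem in the center. The headlights are sharp and angular, giving the vehicle an aggressive look. The side mirrors blend into the body, and the wheels have a multi-spoke design.
- Background: The background is a scenic landscape with trees and a clear sky, suggesting the car is parked outdoors during the day.
Image 2:
- Model: The car in the image is a Dodge Viper.
- Color: The body is painted bright yellow.
- Design: The car has a distinctive, aggressive design with a low, wide body. The front end has a large air intake with black trim, and a black stripe runs down the center of the hood.
- Exterior details: The headlights are large and angular, and the vehicle has a low, sporty stance with a wide track. The wheels are black with a multi-spoke design, and the overall styling conveys speed and performance.
- Background: The road in the background is blurred, suggesting the car is moving at high speed. The scene appears to be a racetrack or a high-speed road.
Image 3:
- Model: The car in the image is a Bentley Continental GT.
- Color: The body is painted white.
- Design: The car has a classic luxury design, with a large, prominent front grille in a mesh pattern. The headlights are sleek and angular, and the overall styling conveys elegance and refinement.
- Exterior details: The body is long and low, with a smooth, polished finish. The wheels are black with a multi-spoke design, and the overall styling conveys luxury and a premium feel.
- Background: The background is a modern building with a dark facade, suggesting the car is parked in an urban or commercial area.

The descriptions above give a comprehensive overview of each car's design, color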
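
The script resizes every image to a fixed 768x768 square, which distorts photos that are not square. If that matters for your images, one alternative is to shrink the longer side to 768 px and scale the other side proportionally. The sketch below computes such a target size; fit_within is a hypothetical helper, not part of PIL or qwen_vl_utils, and this is a variation on the script rather than what the run above used.

```python
def fit_within(width: int, height: int, max_side: int = 768) -> tuple:
    """Scale (width, height) down so that the longer side is at most max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024))  # (768, 576)
print(fit_within(640, 480))    # (640, 480)

# In the inference script, PIL's img.size is (width, height), so the resize
# line could become:
#   img = img.resize(fit_within(*img.size), Image.Resampling.LANCZOS)
```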

References

Qwen2.5-VL Technical Report, https://arxiv.org/abs/2502.13923
