InternVL2_5-1B-MPO

模型名称	视觉部分	语言部分	HF 链接
InternVL2_5-1B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-0.5B-Instruct	🤗 链接
InternVL2_5-2B-MPO	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 链接
InternVL2_5-4B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 链接
InternVL2_5-8B-MPO	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 链接
InternVL2_5-26B-MPO	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 链接
InternVL2_5-38B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 链接
InternVL2_5-78B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 链接

模型架构

如下图所示，InternVL2.5-MPO 沿用了 InternVL 2.5 及其前代版本 InternVL 1.5 和 2.0 的模型架构，遵循“ViT-MLP-LLM”范式。在这一新版本中，我们将新的增量预训练 InternViT 与多种预训练 LLM（包括 InternLM 2.5 和 Qwen 2.5）通过随机初始化的 MLP 投影器进行集成。

image/png

与之前的版本一样，我们应用了像素重排操作，将视觉 tokens 的数量减少到原始数量的四分之一。此外，我们采用了与 InternVL 1.5 类似的动态分辨率策略，将图像分割成 448×448 像素的图块。从 InternVL 2.0 开始的关键区别在于，我们额外引入了对多图像和视频数据的支持。

关键设计

多模态偏好数据集

MMPR 是一个大规模、高质量的多模态推理偏好数据集。该数据集包含约 300 万条样本。

image/jpeg

为构建此数据集，我们提出了一个高效的数据构建流程。具体而言，我们将多模态数据分为具有明确标准答案的样本和无明确标准答案的样本。

对于具有明确标准答案的样本： 模型被提示首先提供推理过程，然后以 Final Answer: *** 类似的格式给出最终答案。与标准答案匹配的响应构成正集 $\mathcal{Y}_p$，不匹配的响应构成负集 $\mathcal{Y}_n$。此外，未能提供明确最终答案的响应也被合并到 $\mathcal{Y}_n$ 中。给定这些标记为正或负的响应，我们通过从 $\mathcal{Y}_p$ 中选择一个优选响应 $y_c$ 和从 $\mathcal{Y}_n$ 中选择一个负响应 $y_r$ 来构建偏好对。
对于无明确标准答案的样本： 我们提出了一种简单而有效的方法：Dropout 下一个 Token 预测（Dropout NTP）。具体来说，我们使用 InternVL2-8B 生成的响应作为优选答案。给定优选答案，我们将其截断一半，然后提示 InternVL2-8B 在不访问图像输入的情况下完成截断答案的剩余部分。这种生成的补全内容作为配对样本的拒绝答案。值得注意的是，虽然 InternVL2-8B 生成的响应可能并不完美，但在没有图像输入的情况下生成的补全内容会比有图像输入时产生更多的幻觉。因此，优选响应和拒绝响应之间的偏序关系是成立的。

数据构建流程已开源，更多细节请参见我们的文档。

混合偏好优化

MPO 的核心思想在于，一个有效的偏好优化过程应使模型能够学习响应对之间的相对偏好、单个响应的绝对质量以及生成首选响应的过程。 我们将训练目标定义为偏好损失 $\mathcal{L}{\text{p}}$、质量损失 $\mathcal{L}{\text{q}}$ 和生成损失 $\mathcal{L}_{\text{g}}$ 的组合，称为混合偏好优化：

\mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}},

其中 $w_{*}$ 表示分配给每个损失组件的权重。在本研究中，我们通过实验比较了不同变体的偏好损失。基于实验结果，我们采用 DPO 作为偏好损失，BCO 作为质量损失。

具体而言，DPO 作为偏好损失，用于使模型学习所选响应与被拒响应之间的相对偏好。该算法优化以下损失函数：

\mathcal{L}_{\text{p}}=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)}\right),

其中 $\beta$ 是 KL 惩罚系数，$x$、$y_c$ 和 $y_r$ 分别是用户查询、所选响应和被拒响应。策略模型 $\pi_\theta$ 由模型 $\pi_0$ 初始化。

此外，BCO 损失被用作质量损失，帮助模型理解单个响应的绝对质量。该损失函数定义为：

\mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,

其中 $\mathcal{L}{\text{q}}^{+}$ 和 $\mathcal{L}{\text{q}}^{-}$ 分别表示所选响应和被拒响应的损失。每种响应类型的损失独立计算，要求模型区分单个响应的绝对质量。损失项如下：

\mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),

\mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),

其中 $\delta$ 表示奖励偏移，通过计算先前奖励的移动平均值来稳定训练。

最后，SFT 损失被用作生成损失，帮助模型学习首选响应的生成过程。该损失函数定义为：

\mathcal{L}_{\text{gen}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.

多模态能力评估

为全面对比MPO前后InternVL的性能，我们采用了OpenCompass排行榜中的基准测试集，包括成熟的经典数据集和新引入的数据集。这些基准测试集涵盖了广泛的类别，旨在对InternVL在各类多模态任务上的能力进行全面且均衡的评估。评估结果如下表所示。

模型	平均得分	MMBench v1.1	MMStar	MMMU	MathVista	HallusionBench	AI2D	OCRBench	MMVet
InternVL2-5-1B	54.9	66.5	51.3	41.2	47.1	39.4	69.0	77.4	47.2
InternVL2-5-1B-MPO	56.4	67.2	49.7	40.8	53.0	40.0	69.4	83.6	47.2
InternVL2-5-2B	59.9	70.9	54.3	43.2	51.1	42.3	74.9	80.2	62.6
InternVL2-5-2B-MPO	62.0	71.6	55.0	45.0	56.4	43.0	75.3	84.2	65.4
InternVL2-5-4B	65.1	78.2	58.7	51.8	60.8	46.6	81.4	82.0	61.5
InternVL2-5-4B-MPO	67.6	78.6	60.2	51.6	65.3	47.8	82.0	88.0	67.1
InternVL2-5-8B	68.9	82.5	63.2	56.2	64.5	49.0	84.6	82.1	62.8
InternVL2-5-8B-MPO	70.4	82.4	65.7	54.9	68.9	51.4	84.5	88.3	66.9
InternVL2-5-26B	71.6	84.6	66.5	60.7	68.0	55.8	86.2	85.4	65.4
InternVL2-5-26B-MPO	72.7	84.2	67.2	57.7	72.8	55.3	86.2	91.2	67.1
InternVL2-5-38B	73.5	85.4	68.5	64.6	72.4	57.9	87.6	84.1	67.2
InternVL2-5-38B-MPO	75.5	85.6	69.8	64.1	73.8	61.5	88.1	88.5	72.5
InternVL2-5-78B	75.2	87.5	69.5	70.0	70.6	57.4	89.1	85.3	71.8
InternVL2-5-78B-MPO	76.6	87.3	73.1	68.3	73.8	58.7	89.3	91.2	71.4

快速开始

我们提供了一个使用 transformers 运行 InternVL2_5-1B-MPO 的示例代码。

请使用 transformers>=4.37.2 以确保模型正常工作。

模型加载

16位（bf16 / fp16）

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2_5-1B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8位量化

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2_5-1B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

多GPU

采用这种方式编写代码的原因是为了避免在多GPU推理过程中因张量不在同一设备上而产生的错误。通过确保大型语言模型（LLM）的第一层和最后一层位于同一设备上，我们可以防止此类错误的发生。

import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL2_5-1B-MPO"
device_map = split_model('InternVL2_5-1B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()

使用 Transformers 进行推理

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2_5-1B-MPO'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话，拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话，独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

流式输出

除了此方法外，您还可以使用以下代码获取流式输出。

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

微调

目前已有多个代码库支持 InternVL 系列模型的微调，包括 InternVL、SWIFT、XTurner 等。有关微调的更多详细信息，请参考它们的文档。

部署

LMDeploy

LMDeploy 是一个用于压缩、部署和服务大语言模型（LLMs）及视觉语言模型（VLMs）的工具包。

pip install lmdeploy>=0.6.4

LMDeploy 将多模态视觉语言模型（VLM）复杂的推理过程抽象为易于使用的流水线，类似于大语言模型（LLM）的推理流水线。

“Hello, world”示例

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)

如果在执行此案例时出现 ImportError，请根据提示安装所需的依赖包。

多图像推理

处理多张图像时，可将所有图像放入同一个列表中。请注意，多张图像会导致输入 token 数量增加，因此通常需要增大上下文窗口的大小。

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

批量提示词推理

使用批量提示词进行推理非常简单，只需将它们放入列表结构中即可：

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

多轮对话

使用该pipeline进行多轮对话有两种方式。一种是按照OpenAI的格式构建消息，并使用上述介绍的方法；另一种是使用pipeline.chat接口。

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

服务

LMDeploy 的 api_server 能够通过单条命令轻松将模型打包为服务。其提供的 RESTful API 兼容 OpenAI 的接口。以下是服务启动示例：

lmdeploy serve api_server OpenGVLab/InternVL2_5-1B-MPO --server-port 23333

要使用 OpenAI 风格的接口，您需要安装 OpenAI：

pip install openai

然后，使用以下代码进行 API 调用：

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

许可协议

本项目基于 MIT 许可协议发布。本项目使用预训练模型 Qwen2.5-0.5B-Instruct 作为组件，该模型基于 Apache License 2.0 许可协议。

引用

如果您发现本项目对您的研究有所帮助，请考虑引用：

@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}

InternVL2_5-1B-MPO

[

$📂 GitHub$

](https://github.com/OpenGVLab/InternVL) [

$📜 InternVL 1.0$

](https://huggingface.co/papers/2312.14238) [

$📜 InternVL 1.5$

](https://huggingface.co/papers/2404.16821) [

$📜 InternVL 2.5$

](https://huggingface.co/papers/2412.05271) [

$📜 InternVL2.5-MPO$

](https://huggingface.co/papers/2411.10442)

[

$🆕 博客$

](https://internvl.github.io/blog/) [

$🗨️ 聊天演示$

](https://internvl.opengvlab.com/) [

$🤗 HF 演示$

](https://huggingface.co/spaces/OpenGVLab/InternVL) [

$🚀 快速开始$

](#quick-start) [

$📖 文档$

](https://internvl.readthedocs.io/en/latest/)

简介

image/png

InternVL 2.5 系列

下表为您概述了 InternVL2.5-MPO 系列的相关信息。

模型名称	视觉部分	语言部分	HF 链接
InternVL2_5-1B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-0.5B-Instruct	🤗 链接
InternVL2_5-2B-MPO	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 链接
InternVL2_5-4B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 链接
InternVL2_5-8B-MPO	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 链接
InternVL2_5-26B-MPO	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 链接
InternVL2_5-38B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 链接
InternVL2_5-78B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 链接

模型架构

image/png

关键设计

多模态偏好数据集

MMPR 是一个大规模、高质量的多模态推理偏好数据集。该数据集包含约 300 万条样本。

image/jpeg

为构建此数据集，我们提出了一个高效的数据构建流程。具体而言，我们将多模态数据分为具有明确标准答案的样本和无明确标准答案的样本。

对于具有明确标准答案的样本： 模型被提示首先提供推理过程，然后以 Final Answer: *** 类似的格式给出最终答案。与标准答案匹配的响应构成正集 $\mathcal{Y}_p$，不匹配的响应构成负集 $\mathcal{Y}_n$。此外，未能提供明确最终答案的响应也被合并到 $\mathcal{Y}_n$ 中。给定这些标记为正或负的响应，我们通过从 $\mathcal{Y}_p$ 中选择一个优选响应 $y_c$ 和从 $\mathcal{Y}_n$ 中选择一个负响应 $y_r$ 来构建偏好对。
对于无明确标准答案的样本： 我们提出了一种简单而有效的方法：Dropout 下一个 Token 预测（Dropout NTP）。具体来说，我们使用 InternVL2-8B 生成的响应作为优选答案。给定优选答案，我们将其截断一半，然后提示 InternVL2-8B 在不访问图像输入的情况下完成截断答案的剩余部分。这种生成的补全内容作为配对样本的拒绝答案。值得注意的是，虽然 InternVL2-8B 生成的响应可能并不完美，但在没有图像输入的情况下生成的补全内容会比有图像输入时产生更多的幻觉。因此，优选响应和拒绝响应之间的偏序关系是成立的。

数据构建流程已开源，更多细节请参见我们的文档。

混合偏好优化

\mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}},

具体而言，DPO 作为偏好损失，用于使模型学习所选响应与被拒响应之间的相对偏好。该算法优化以下损失函数：

\mathcal{L}_{\text{p}}=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)}\right),

其中 $\beta$ 是 KL 惩罚系数，$x$、$y_c$ 和 $y_r$ 分别是用户查询、所选响应和被拒响应。策略模型 $\pi_\theta$ 由模型 $\pi_0$ 初始化。

此外，BCO 损失被用作质量损失，帮助模型理解单个响应的绝对质量。该损失函数定义为：

\mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,

\mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),

\mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),

其中 $\delta$ 表示奖励偏移，通过计算先前奖励的移动平均值来稳定训练。

最后，SFT 损失被用作生成损失，帮助模型学习首选响应的生成过程。该损失函数定义为：

\mathcal{L}_{\text{gen}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.

多模态能力评估

模型	平均得分	MMBench v1.1	MMStar	MMMU	MathVista	HallusionBench	AI2D	OCRBench	MMVet
InternVL2-5-1B	54.9	66.5	51.3	41.2	47.1	39.4	69.0	77.4	47.2
InternVL2-5-1B-MPO	56.4	67.2	49.7	40.8	53.0	40.0	69.4	83.6	47.2
InternVL2-5-2B	59.9	70.9	54.3	43.2	51.1	42.3	74.9	80.2	62.6
InternVL2-5-2B-MPO	62.0	71.6	55.0	45.0	56.4	43.0	75.3	84.2	65.4
InternVL2-5-4B	65.1	78.2	58.7	51.8	60.8	46.6	81.4	82.0	61.5
InternVL2-5-4B-MPO	67.6	78.6	60.2	51.6	65.3	47.8	82.0	88.0	67.1
InternVL2-5-8B	68.9	82.5	63.2	56.2	64.5	49.0	84.6	82.1	62.8
InternVL2-5-8B-MPO	70.4	82.4	65.7	54.9	68.9	51.4	84.5	88.3	66.9
InternVL2-5-26B	71.6	84.6	66.5	60.7	68.0	55.8	86.2	85.4	65.4
InternVL2-5-26B-MPO	72.7	84.2	67.2	57.7	72.8	55.3	86.2	91.2	67.1
InternVL2-5-38B	73.5	85.4	68.5	64.6	72.4	57.9	87.6	84.1	67.2
InternVL2-5-38B-MPO	75.5	85.6	69.8	64.1	73.8	61.5	88.1	88.5	72.5
InternVL2-5-78B	75.2	87.5	69.5	70.0	70.6	57.4	89.1	85.3	71.8
InternVL2-5-78B-MPO	76.6	87.3	73.1	68.3	73.8	58.7	89.3	91.2	71.4

快速开始

我们提供了一个使用 transformers 运行 InternVL2_5-1B-MPO 的示例代码。

请使用 transformers>=4.37.2 以确保模型正常工作。

模型加载

16位（bf16 / fp16）

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2_5-1B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8位量化

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2_5-1B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

多GPU

import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL2_5-1B-MPO"
device_map = split_model('InternVL2_5-1B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()

使用 Transformers 进行推理

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2_5-1B-MPO'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话，拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话，独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

流式输出

除了此方法外，您还可以使用以下代码获取流式输出。

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

微调

目前已有多个代码库支持 InternVL 系列模型的微调，包括 InternVL、SWIFT、XTurner 等。有关微调的更多详细信息，请参考它们的文档。

部署

LMDeploy

LMDeploy 是一个用于压缩、部署和服务大语言模型（LLMs）及视觉语言模型（VLMs）的工具包。

pip install lmdeploy>=0.6.4

LMDeploy 将多模态视觉语言模型（VLM）复杂的推理过程抽象为易于使用的流水线，类似于大语言模型（LLM）的推理流水线。

“Hello, world”示例

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)

如果在执行此案例时出现 ImportError，请根据提示安装所需的依赖包。

多图像推理

处理多张图像时，可将所有图像放入同一个列表中。请注意，多张图像会导致输入 token 数量增加，因此通常需要增大上下文窗口的大小。

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

批量提示词推理

使用批量提示词进行推理非常简单，只需将它们放入列表结构中即可：

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

多轮对话

使用该pipeline进行多轮对话有两种方式。一种是按照OpenAI的格式构建消息，并使用上述介绍的方法；另一种是使用pipeline.chat接口。

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

服务

LMDeploy 的 api_server 能够通过单条命令轻松将模型打包为服务。其提供的 RESTful API 兼容 OpenAI 的接口。以下是服务启动示例：

lmdeploy serve api_server OpenGVLab/InternVL2_5-1B-MPO --server-port 23333

要使用 OpenAI 风格的接口，您需要安装 OpenAI：

pip install openai

然后，使用以下代码进行 API 调用：

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

许可协议

本项目基于 MIT 许可协议发布。本项目使用预训练模型 Qwen2.5-0.5B-Instruct 作为组件，该模型基于 Apache License 2.0 许可协议。

引用

如果您发现本项目对您的研究有所帮助，请考虑引用：

@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}