Ovis2.6-80B-A3B

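Introduction

We introduce Ovis2.6-80B-A3B, the latest release in the Ovis series of Multimodal Large Language Models (MLLMs). Building on Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering stronger multimodal performance at lower serving cost. It also brings major improvements in long-context understanding, high-resolution image recognition, visual reasoning with active image analysis, and information-dense document understanding.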

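Key Features

MoE architecture: superior performance with low serving cost
The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This lets Ovis2.6 scale to 80 billion total parameters and capture a broad range of knowledge and nuance. Crucially, only about 3 billion parameters are activated during inference, which keeps serving cost low and throughput high.
Enhanced long-sequence and high-resolution processing
Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to handle high-resolution and information-dense visual inputs. These enhancements are especially effective for long-document question answering, where the model must gather and synthesize clues across multiple pages to reach the correct answer.
Think with Image
We introduce the "Think with Image" capability, which turns vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (such as cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs and higher accuracy on complex tasks.
Strengthened OCR, document, and chart capabilities
Continuing our focus on information-dense visual tasks, we have further strengthened the model's optical character recognition (OCR), document understanding, and chart analysis. Ovis2.6 not only extracts structured information accurately from visual data but also reasons in depth over the extracted content.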
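The parameter counts quoted for the MoE backbone can be turned into rough serving arithmetic. The sketch below is only a back-of-the-envelope estimate: it takes the rounded figures from the text (80 billion total, about 3 billion active) and assumes bfloat16 weights at 2 bytes per parameter, ignoring activations, KV cache, and any framework overhead.

```python
# Rough arithmetic on the parameter counts quoted above (rounded figures).
TOTAL_PARAMS = 80e9          # total parameters across all experts
ACTIVE_PARAMS = 3e9          # approximate parameters activated per token
BYTES_PER_PARAM_BF16 = 2     # assumption: weights stored in bfloat16

# Share of the model that participates in each forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Weight storage only; every expert is counted, active or not.
weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_BF16 / 1e9

print(f"Active fraction per token: {active_fraction:.2%}")
print(f"bf16 weight footprint: {weights_gb:.0f} GB")
```

The per-token compute scales with the active parameters, while the weight footprint is computed from the total count, so the estimate for memory is much larger than the active fraction alone would suggest.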

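Performance

The table below presents a detailed performance comparison. Note that results marked with a superscript are taken from external technical reports, and Qwen scores are the higher of its Think and Instruct versions. For quick reference, the best result is highlighted in red and the second-best result is underlined. All values are rounded to one decimal place.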

[Image: benchmark comparison table]

Quick Inference (transformers)

Below is a simple example that shows how to run Ovis2.6 with a single image input.

First, install the required dependencies:

pip install torch==2.7.1 transformers==4.57.0 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3 accelerate==1.12.0
pip install --no-build-isolation --no-cache-dir flash-attn==2.8.3

Then run the following code.

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM

# Thinking mode & budget
enable_thinking = True
enable_thinking_budget = True  # Only effective if enable_thinking is True.

# Total tokens for thinking + answer. Ensure: max_new_tokens > thinking_budget + 25
max_new_tokens = 2048
thinking_budget = 1024

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.6-80B-A3B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open(requests.get("https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png", stream=True).raw)},
        {"type": "text", "text": "Calculate the sum of the numbers in the middle box in figure (c)."},
    ],
}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=enable_thinking
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
)

response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

The same thinking and thinking-budget logic applies to multi-image, video, and text-only scenarios.
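The constraint stated in the code comment above (max_new_tokens > thinking_budget + 25) can be checked before calling generate. This is a minimal sketch; the helper name check_thinking_budget and its margin parameter are hypothetical and not part of the Ovis API.

```python
def check_thinking_budget(max_new_tokens: int, thinking_budget: int, margin: int = 25) -> int:
    """Validate max_new_tokens > thinking_budget + margin.

    Returns the number of new tokens left once the thinking budget is spent.
    The default margin of 25 mirrors the comment in the quick-start code.
    """
    if thinking_budget < 0:
        raise ValueError("thinking_budget must be non-negative")
    if max_new_tokens <= thinking_budget + margin:
        raise ValueError(
            f"max_new_tokens ({max_new_tokens}) must exceed "
            f"thinking_budget + {margin} ({thinking_budget + margin})"
        )
    return max_new_tokens - thinking_budget


# Values from the quick-start example.
remaining = check_thinking_budget(max_new_tokens=2048, thinking_budget=1024)
print(remaining)
```

Failing early with a clear message is easier to debug than a truncated answer after a long generation run.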

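Note (answer extraction for CoT/Thinking): To make evaluation and usage easier, we recommend appending a fixed suffix to the prompt when using chain-of-thought (CoT) or thinking mode. This ensures the model clearly outputs a final answer that can be extracted programmatically: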

End your response with 'Final answer: '.

For example:

Calculate the sum of the numbers in the middle box in figure (c).
End your response with 'Final answer: '.
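Once the suffix is in place, the final answer can be pulled out of the decoded response with a small amount of string handling. The sketch below is one possible approach; the helper names with_final_answer_suffix and extract_final_answer are hypothetical, and the regular expression assumes the model follows the suffix instruction literally.

```python
import re
from typing import Optional

FINAL_ANSWER_SUFFIX = "End your response with 'Final answer: '."


def with_final_answer_suffix(question: str) -> str:
    """Append the fixed suffix on its own line, as in the example above."""
    return f"{question.rstrip()}\n{FINAL_ANSWER_SUFFIX}"


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    matches = re.findall(r"Final answer:\s*(.*)", response)
    return matches[-1].strip() if matches else None


prompt = with_final_answer_suffix(
    "Calculate the sum of the numbers in the middle box in figure (c)."
)
print(prompt)
print(extract_final_answer("The numbers are 3 and 4.\nFinal answer: 7"))
```

Taking the last match guards against the marker also appearing inside the reasoning text.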

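Tip: The sections below include an optional streaming helper (compatible with two-phase thinking/budget runs) and additional inference modes: multi-image, video, and text-only.

Optional: Streaming (Advanced)

To support the thinking budget, we modified the implementation of the Ovis generate method, and the default TextIteratorStreamer is no longer compatible with it. If you need to stream the model output, be sure to use the helper class below.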

# --- Budget-aware streamer helper ---
from transformers import TextIteratorStreamer

class BudgetAwareTextStreamer(TextIteratorStreamer):
    """A streamer compatible with Ovis two-phase generation.

    Call .manual_end() after generation to flush any remaining text.
    """
    def manual_end(self):
        if len(self.token_cache) > 0:
            text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
            printable_text = text[self.print_len:]
            self.token_cache = []
            self.print_len = 0
        else:
            printable_text = ""
        self.next_tokens_are_prompt = True
        self.on_finalized_text(printable_text, stream_end=True)

    # Disable base class's end hook; we'll finalize via manual_end()
    def end(self):
        pass

Example usage:

streamer = BudgetAwareTextStreamer(
    model.text_tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model.generate(
    inputs=input_ids,
    pixel_values=pixel_values,
    grid_thws=grid_thws,
    enable_thinking=enable_thinking,
    enable_thinking_budget=enable_thinking_budget,
    max_new_tokens=max_new_tokens,
    thinking_budget=thinking_budget,
    streamer=streamer
)

Example: Multi-image

Demonstrates how to run inference with multiple images and a related question.

# Multi-image inference
multi_image_files = [
    "/path/to/image_1.jpg",
    "/path/to/image_2.jpg",
    "/path/to/image_3.jpg",
]

content = [{"type": "image", "image": Image.open(p).convert("RGB")} for p in multi_image_files]
content.append({"type": "text", "text": "Describe the images."})
messages = [{"role": "user", "content": content}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(messages=messages, add_generation_prompt=True, max_pixels=896*896)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda().to(model.dtype) if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

with torch.no_grad():
    outputs = model.generate(inputs=input_ids, pixel_values=pixel_values, grid_thws=grid_thws,
                             max_new_tokens=1024, do_sample=True,
                             eos_token_id=model.text_tokenizer.eos_token_id,
                             pad_token_id=model.text_tokenizer.pad_token_id)
print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))

Example: Video

Demonstrates how to run inference on a video by sampling several frames and asking the model to describe the content.

# Video inference
from moviepy.editor import VideoFileClip  # pip install moviepy==1.0.3

video_file = "/path/to/video_1.mp4"
num_frames = 8

with VideoFileClip(video_file) as clip:
    total_frames = int(clip.fps * clip.duration)
    indices = [int(i * total_frames / num_frames) for i in range(num_frames)]
    frames = [Image.fromarray(clip.get_frame(t)) for t in (idx / clip.fps for idx in indices)]

messages = [{"role": "user", "content": [
    {"type": "video", "video": frames},
    {"type": "text", "text": "Describe this video in detail."},
]}]

input_ids, pixel_values, grid_thws = model.preprocess_inputs(messages=messages, add_generation_prompt=True, max_pixels=896*896)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda().to(model.dtype) if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

with torch.no_grad():
    outputs = model.generate(inputs=input_ids, pixel_values=pixel_values, grid_thws=grid_thws,
                             max_new_tokens=1024, do_sample=True,
                             eos_token_id=model.text_tokenizer.eos_token_id,
                             pad_token_id=model.text_tokenizer.pad_token_id)
print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
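The frame-sampling step in the video example can be isolated into a small function that is easy to test without a video file. This is a sketch of the same uniform index formula; the helper name sample_frame_indices is hypothetical, and capping num_frames at total_frames is an added assumption to avoid duplicate frames on very short clips (the original snippet does not do this).

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick evenly spaced frame indices, starting at frame 0."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Added assumption: never request more frames than the clip contains.
    num_frames = min(num_frames, total_frames)
    return [int(i * total_frames / num_frames) for i in range(num_frames)]


def indices_to_timestamps(indices: List[int], fps: float) -> List[float]:
    """Convert frame indices to seconds, as passed to clip.get_frame(t)."""
    return [idx / fps for idx in indices]


indices = sample_frame_indices(total_frames=80, num_frames=8)
print(indices)
print(indices_to_timestamps(indices, fps=10.0))
```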

Example: Text-only

Demonstrates how to run inference with text input only, without any image or video.

# Text-only inference
messages = [{"role": "user", "content": "Hi, please introduce Yellow Mountain."}]

input_ids, _, _ = model.preprocess_inputs(messages=messages, add_generation_prompt=True)
input_ids = input_ids.cuda()

with torch.no_grad():
    outputs = model.generate(inputs=input_ids, max_new_tokens=1024, do_sample=True,
                             eos_token_id=model.text_tokenizer.eos_token_id,
                             pad_token_id=model.text_tokenizer.pad_token_id)
print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))

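To enable grounding, append "Please provide the bounding box coordinates." (for boxes) or "Please provide the point coordinates." (for points) to the end of the prompt. To localize a specific object, wrap its description in <ref> tags, for example: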

Find the <ref>red apple</ref> in the image. Please provide the bounding box coordinates.
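Grounding prompts of this shape can be assembled programmatically. The sketch below reuses the two suffixes quoted above; the helper name grounding_prompt and the "Find the ... in the image." template are illustrative choices taken from the example, not a required format.

```python
GROUNDING_SUFFIXES = {
    "box": "Please provide the bounding box coordinates.",
    "point": "Please provide the point coordinates.",
}


def grounding_prompt(target: str, mode: str = "box") -> str:
    """Wrap the target description in <ref> tags and append the grounding suffix."""
    if mode not in GROUNDING_SUFFIXES:
        raise ValueError(f"mode must be one of {sorted(GROUNDING_SUFFIXES)}")
    return f"Find the <ref>{target}</ref> in the image. {GROUNDING_SUFFIXES[mode]}"


print(grounding_prompt("red apple"))
print(grounding_prompt("red apple", mode="point"))
```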

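Coordinates are normalized to the range [0,1), with the origin (0,0) at the top-left corner of the image.

Point: <point>(x,y)</point>
Bounding box: <box>(x1,y1),(x2,y2)</box>, where (x1,y1) is the top-left corner and (x2,y2) is the bottom-right corner.
Multiple results can be listed in square brackets: [<box>(...)</box>,<box>(...)</box>]

Example: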

The image features a serene scene with <ref>three birds</ref>[
  <box>(0.401,0.526),(0.430,0.557)</box>,
  <box>(0.489,0.494),(0.516,0.526)</box>,
  <box>(0.296,0.529),(0.324,0.576)</box>
] flying in formation against a clear blue sky.
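Output in this format can be parsed with regular expressions and mapped back to pixel coordinates. The sketch below is a minimal example under stated assumptions: the helper names parse_boxes, parse_points, and box_to_pixels are hypothetical, and the pixel mapping simply scales the normalized values by the image width and height with rounding.

```python
import re
from typing import List, Tuple

NUM = r"\s*([0-9]*\.?[0-9]+)\s*"
BOX_RE = re.compile(rf"<box>\({NUM},{NUM}\)\s*,\s*\({NUM},{NUM}\)</box>")
POINT_RE = re.compile(rf"<point>\({NUM},{NUM}\)</point>")

Box = Tuple[float, float, float, float]


def parse_boxes(text: str) -> List[Box]:
    """Return every <box> as normalized (x1, y1, x2, y2)."""
    return [tuple(float(v) for v in m) for m in BOX_RE.findall(text)]


def parse_points(text: str) -> List[Tuple[float, float]]:
    """Return every <point> as normalized (x, y)."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(text)]


def box_to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to pixel coordinates (origin at the top-left)."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


sample = (
    "The image features a serene scene with <ref>three birds</ref>["
    "<box>(0.401,0.526),(0.430,0.557)</box>,"
    "<box>(0.489,0.494),(0.516,0.526)</box>,"
    "<box>(0.296,0.529),(0.324,0.576)</box>"
    "] flying in formation against a clear blue sky."
)
boxes = parse_boxes(sample)
print(boxes)
print(box_to_pixels(boxes[0], width=1000, height=1000))
```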

Citation

If you find Ovis useful, please consider citing the following papers.

@article{lu2025ovis25technicalreport,
  title={Ovis2.5 Technical Report}, 
  author={Shiyin Lu and Yang Li and Yu Xia and Yuwei Hu and Shanshan Zhao and Yanqing Ma and Zhichao Wei and Yinglun Li and Lunhao Duan and Jianshan Zhao and Yuxuan Han and Haijun Li and Wanying Chen and Junke Tang and Chengkun Hou and Zhixing Du and Tianli Zhou and Wenjie Zhang and Huping Ding and Jiahe Li and Wen Li and Gui Hu and Yiliang Gu and Siran Yang and Jiamang Wang and Hailong Sun and Yibo Wang and Hui Sun and Jinlong Huang and Yuping He and Shengze Shi and Weihong Zhang and Guodong Zheng and Junpeng Jiang and Sensen Gao and Yi-Feng Wu and Sijia Chen and Yuhui Chen and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang},
  year={2025},
  journal={arXiv:2508.11737}
}

@article{lu2024ovis,
  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
  year={2024},
  journal={arXiv:2405.20797}
}

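License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used compliance-checking algorithms during training to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.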