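MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem, which aims to advance visual understanding.
The checkpoint is built on MOSS-VL-Base-0408 through supervised fine-tuning (SFT) and is intended as a high-performance offline multimodal engine. It performs strongly across the full range of vision-language tasks, including image understanding, OCR, document parsing, visual reasoning, and instruction following. It is particularly strong at video understanding, covering long-video comprehension, fine-grained temporal reasoning, and action recognition.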
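MOSS-VL-Instruct-0408 uses a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design brings latency down to the millisecond level, allowing immediate responses to dynamic video streams. Interleaved modalities are supported natively, so complex sequences of images and videos can be processed in a single pipeline without heavy preprocessing.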
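To ensure the model accurately perceives the pace and duration of events, MOSS-VL-Instruct-0408 injects an absolute timestamp alongside each sampled frame, grounding its reasoning in a precise temporal reference.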
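To make the idea of per-frame absolute timestamps concrete, here is a minimal sketch in plain Python. It is not the MOSS-VL implementation: the uniform sampling rule, the helper names `sample_timestamps` and `tag_frames`, and the `[1.5s] <frame 0>` text format are all assumptions chosen for illustration. Only the parameter names `fps`, `min_frames`, and `max_frames` echo settings that appear in the usage examples.

```python
def sample_timestamps(duration_s: float, fps: float = 1.0,
                      min_frames: int = 1, max_frames: int = 256) -> list[float]:
    """Return absolute timestamps (in seconds) for uniformly sampled frames.

    Hypothetical sampling rule: aim for `fps` frames per second, clamp the
    frame count to [min_frames, max_frames], then take the midpoint of each
    equal-length segment of the video.
    """
    if duration_s < 0 or fps <= 0 or min_frames < 1 or max_frames < min_frames:
        raise ValueError("invalid sampling parameters")
    # Frame count at the requested rate, clamped to the allowed range.
    n = max(min_frames, min(max_frames, int(duration_s * fps)))
    step = duration_s / n
    return [round((i + 0.5) * step, 3) for i in range(n)]


def tag_frames(timestamps: list[float]) -> list[str]:
    """Pair each frame placeholder with its absolute timestamp (assumed format)."""
    return [f"[{t:.1f}s] <frame {i}>" for i, t in enumerate(timestamps)]


if __name__ == "__main__":
    for line in tag_frames(sample_timestamps(4.0, fps=1.0)):
        print(line)
```

Because the timestamps are absolute rather than frame indices, the gap between two sampled frames stays visible to the model even when long videos are capped at `max_frames` and the sampling interval grows.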
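MOSS-VL uses Cross-attention Rotary Position Embedding (XRoPE), which is designed specifically for its cross-attention-based vision-language architecture. The mechanism maps text tokens and video patches into a unified 3D coordinate space defined by time (t), height (h), and width (w).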
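The sketch below illustrates what a shared (t, h, w) coordinate space could look like. It is a hypothetical assignment scheme, not the actual XRoPE rule: the function names, the choice to place text tokens on the time axis with h = w = 0, and the one-step-per-frame time index are assumptions made purely for illustration.

```python
Coord = tuple[int, int, int]  # (t, h, w)


def patch_coords(num_frames: int, grid_h: int, grid_w: int,
                 t_start: int = 0) -> list[Coord]:
    """Assign every video patch a (t, h, w) coordinate.

    Assumption: t advances by one per sampled frame, while h and w are the
    row and column of the patch inside that frame's grid.
    """
    return [
        (t_start + f, r, c)
        for f in range(num_frames)
        for r in range(grid_h)
        for c in range(grid_w)
    ]


def text_coords(num_tokens: int, t_start: int = 0) -> list[Coord]:
    """Assign text tokens coordinates in the same 3D space.

    Assumption: text has no spatial extent, so tokens advance along t only
    and keep h = w = 0.
    """
    return [(t_start + i, 0, 0) for i in range(num_tokens)]


if __name__ == "__main__":
    video = patch_coords(num_frames=2, grid_h=2, grid_w=2)
    text = text_coords(num_tokens=3, t_start=2)
    print(video)
    print(text)
```

With both modalities in one coordinate system, a rotary embedding can encode the relative offset between a text query and a video patch along each axis separately, which is the property the paragraph above describes.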
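We evaluated MOSS-VL-Instruct-0408 along four key dimensions: multimodal perception, multimodal reasoning, document/OCR, and video understanding. The results show strong overall performance, with particular strengths in general multimodal perception and complex video analysis.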
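On video benchmarks such as VideoMME, MLVU, EgoSchema, and VSI-bench, the model shows strong temporal consistency and action recognition (on VSI-bench it outperforms Qwen3-VL-8B-Instruct by 8.3 points). On benchmarks such as BLINK and MMBench, it stands out in fine-grained object recognition and spatial reasoning.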
```shell
# Create and activate a dedicated conda environment, then install the dependencies.
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```

```python
# Single-image offline inference.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    """Load the processor and model from a checkpoint that ships custom code."""
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    # Image preprocessing settings.
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    # Generation settings (sampling is disabled via do_sample=False).
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```

```python
# Single-video offline inference.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    """Load the processor and model from a checkpoint that ships custom code."""
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    # Video preprocessing settings.
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    # Frame sampling settings.
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    # Generation settings (sampling is disabled via do_sample=False).
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)
print(text)
```

```python
# Batched offline inference over several queries.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Each query carries its own prompt, media, and generation settings.
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)

# Collect the generated text for each query.
texts = [item["text"] for item in result["results"]]
```

MOSS-VL-Instruct-0408 is an early milestone on the MOSS-VL roadmap, and we are actively working to improve the model in the following directions:
> [!NOTE]
> We welcome community feedback and contributions on any of the directions above.
```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```