MOSS-Video-Preview-SFT

📄 Technical Report | 💻 GitHub

Introduction

We introduce MOSS-Video-Preview-SFT, the offline supervised fine-tuning (SFT) checkpoint in the MOSS-Video-Preview series.

[!Important] This is an offline SFT checkpoint (an instruction-tuned model). It is not the Real-Time SFT streaming checkpoint.

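This checkpoint is mainly intended for:

Offline video/image understanding with improved instruction following
A strong starting point for further Real-Time SFT or domain adaptation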

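Model Architecture

MOSS-Video-Preview is built on the Llama-3.2-Vision backbone and adopts a pioneering unified image-video cross-attention architecture:

Native unified design: unlike conventional projection-based approaches, the architecture natively supports both image and video understanding in one design, ensuring smooth temporal consistency.
Deep multimodal fusion: dedicated cross-attention mechanisms provide high-fidelity alignment between visual-temporal features and the language context.
Unified spatio-temporal encoding: video frame sequences are aligned with text tokens to enable robust long-context multimodal reasoning.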
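
To make the cross-attention idea above concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention, in which each text query attends over a set of visual tokens. It is an illustration only and is not the MOSS-Video-Preview implementation: the function names are hypothetical, and the learned query/key/value projections, multiple heads, and gating found in real models are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, visual_keys, visual_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    text_queries:  list of query vectors, one per text token
    visual_keys:   list of key vectors, one per visual token
    visual_values: list of value vectors, one per visual token
    Returns one output vector per text token.
    """
    dim = len(visual_keys[0])
    value_dim = len(visual_values[0])
    outputs = []
    for query in text_queries:
        # Similarity of this text token to every visual token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_keys
        ]
        weights = softmax(scores)
        # Weighted average of the visual values.
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, visual_values))
            for j in range(value_dim)
        ])
    return outputs
```

Because the queries come from the text stream and the keys and values come from the visual stream, the language model can read visual features without having them inserted into its own token sequence.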

For the architecture diagram and full system details, see the top-level repository: OpenMOSS/MOSS-Video-Preview.

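🌟 Key Highlights

🧩 Native cross-attention foundation: a novel approach that decouples visual perception from language generation, enabling smooth real-time video understanding.
🔄 Dynamic interaction support: although this SFT release targets offline scenarios, the underlying architecture is designed to support "silence-dialogue" switching and real-time interruption.
⚡ Efficient inference: optimized for Flash Attention 2 on CUDA and NPU, enabling low-latency processing even for long video streams.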

🚀 Quickstart

Video inference (Python)

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Use Hugging Face model id (or load from a local folder with the same name).
checkpoint = "OpenMOSS-Team/moss-video-preview-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
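
The video example passes video_fps, video_minlen, and video_maxlen to the processor. The sketch below shows one plausible reading of those parameters: sample at roughly video_fps frames per second, clamp the frame count to the range [video_minlen, video_maxlen], and spread the frames evenly over the clip. This interpretation is an assumption, the helper estimate_frame_indices is hypothetical, and the processor shipped with the checkpoint may sample frames differently.

```python
def estimate_frame_indices(total_frames, native_fps,
                           video_fps=1.0, video_minlen=8, video_maxlen=16):
    """Return evenly spaced frame indices under the ASSUMED sampling rule.

    total_frames: number of frames in the source video
    native_fps:   frame rate of the source video
    """
    if total_frames <= 0 or native_fps <= 0:
        raise ValueError("total_frames and native_fps must be positive")

    duration_s = total_frames / native_fps
    # Frames wanted at the target sampling rate, clamped to [minlen, maxlen].
    target = int(duration_s * video_fps)
    target = max(video_minlen, min(video_maxlen, target))
    # Never request more frames than the video contains.
    target = min(target, total_frames)

    if target == 1:
        return [0]
    step = (total_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]
```

Under this reading, a 60-second clip at 30 fps would be capped at 16 frames, while a 3-second clip would be raised to the 8-frame minimum.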

Image inference (Python)

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "OpenMOSS-Team/moss-video-preview-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
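
Both examples build the same single-turn message structure and differ only in the media placeholder type. A small helper like the one below could remove that duplication; build_messages is a hypothetical name used for illustration and is not part of the model's API.

```python
def build_messages(prompt, media_type):
    """Build the single-turn chat message used in the examples above.

    media_type must be "video" or "image"; it becomes the placeholder entry
    that the processor later pairs with the actual media input.
    """
    if media_type not in ("video", "image"):
        raise ValueError(f"unsupported media_type: {media_type!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": media_type},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The result can be passed to processor.apply_chat_template(messages, add_generation_prompt=True) in the same way as the hand-written messages list.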

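✅ Intended Use

Offline instruction following: suited to video/image understanding (the recommended default checkpoint for most users).
Fine-tuning starting point: use this checkpoint as a base if you plan to train your own Real-Time SFT or domain-specific variant.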

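⚠️ Limitations and Future Outlook

Offline SFT only: this checkpoint is optimized for offline instruction following. For real-time streaming and dynamic interruption, refer to our Real-Time SFT variant.
Performance benchmarks: although the model leads in real-time architecture, a performance gap remains compared with top-tier models such as Qwen2.5-VL. Closing this gap is the main focus of our future iterations.
Distributed training and scaling: the current release is an architecture validation. Future releases will integrate the Megatron-LM framework and use 3D parallelism for large-scale pre-training.
Data diversity: ongoing work focuses on increasing the scale and diversity of the training data to improve generalization in more complex scenarios.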

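🧩 Requirements

Python: 3.10+
PyTorch: 1.13.1+ (GPU strongly recommended)
Tested configuration: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
CPU-only: PyTorch 2.4.0
Transformers: trust_remote_code=True is required when using this model family (because of the custom code referenced by auto_map)
Optional (recommended): FlashAttention 2 (attn_implementation="flash_attention_2")
Video decoding: the streaming demo requires OpenCV (cv2); the offline demo relies on the processor's video-loading backend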

For the full environment setup (including the optional FlashAttention 2 extension), see README.md in the top-level repository.
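
Before loading the model, it can help to check which of the packages listed above are importable. The sketch below uses only the standard library. The helper names are hypothetical, and the fallback to "sdpa" when FlashAttention 2 is missing is an assumption: "sdpa" and "eager" are standard Transformers values for attn_implementation, but whether this checkpoint's custom code supports them has not been verified here.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report whether the Python version and key packages look usable."""
    def installed(module_name):
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(module_name) is not None

    return {
        "python_ok": sys.version_info >= min_python,
        "torch": installed("torch"),
        "transformers": installed("transformers"),
        "flash_attn": installed("flash_attn"),  # FlashAttention 2 package
        "cv2": installed("cv2"),                # OpenCV, streaming demo only
    }


def pick_attn_implementation(report):
    """Prefer FlashAttention 2; otherwise fall back to "sdpa" (ASSUMPTION)."""
    return "flash_attention_2" if report["flash_attn"] else "sdpa"


if __name__ == "__main__":
    env = check_environment()
    for name, ok in env.items():
        print(f"{name}: {'ok' if ok else 'missing'}")
    print("attn_implementation:", pick_attn_implementation(env))
```

The returned string can be passed as attn_implementation to from_pretrained in place of the hard-coded "flash_attention_2".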

[!IMPORTANT]

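🌟 Our Mission and Community Invitation

We have filled a gap in cross-attention-based foundation models for video understanding.

We warmly welcome experts in representation learning and model efficiency to explore, experiment with, and build on our architecture. Let's push the boundaries of video intelligence together and advance the open-source community!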

Citation

@article{wang2026mossvideo,
  title         = {{MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention}},
  author        = {Pengyu Wang and Chenkun Tan and Shaojun Zhou and Wei Huang and Qirui Zhou and Zhan Huang and Zhen Ye and Jijun Cheng and Xiaomeng Qian and Yanxin Chen and Xingyang He and Huazheng Zeng and Chenghao Wang and Pengfei Wang and Hongkai Wang and Shanqing Gao and Yixian Tian and Chenghao Liu and Xinghao Wang and Botian Jiang and Xipeng Qiu},
  year          = {2026},
  journal       = {arXiv preprint arXiv:2606.07639},
  eprint        = {2606.07639},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.07639}
}