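We introduce MOSS-Video-Preview-SFT, the offline supervised fine-tuning (SFT) checkpoint in the MOSS-Video-Preview series.

> [!Important]
> This is an offline SFT checkpoint (an instruction-tuned model). It is not the Real-Time SFT streaming checkpoint.

This checkpoint is mainly intended for:

MOSS-Video-Preview is built on the Llama-3.2-Vision backbone and adopts a pioneering unified image-video cross-attention architecture:

For architecture diagrams and full system details, see the top-level repository: OpenMOSS/MOSS-Video-Preview.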
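Before running the inference examples in this card, it can help to confirm that the packages they rely on are importable. The following is a minimal pre-flight sketch, not part of the official tooling. The helper names (`is_installed`, `pick_attn_implementation`, `check_environment`) are hypothetical, `flash_attn` is assumed to be the import name of the FlashAttention 2 package, and falling back to `"sdpa"` is an assumption: it is a standard `transformers` option, but the checkpoint's custom code may not support it.

```python
import importlib.util

# Import names the examples in this card use or rely on.
# "flash_attn" (FlashAttention 2) and "cv2" (OpenCV) are assumed import names.
PACKAGES = ("torch", "transformers", "PIL", "cv2", "flash_attn")


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_attn_implementation(has_flash_attn: bool) -> str:
    """Choose a value for the `attn_implementation` argument.

    Assumption: "sdpa" is used as a fallback when FlashAttention 2 is missing.
    The custom model code may or may not accept it.
    """
    return "flash_attention_2" if has_flash_attn else "sdpa"


def check_environment() -> dict:
    """Map each package name to whether it is importable."""
    return {name: is_installed(name) for name in PACKAGES}


if __name__ == "__main__":
    status = check_environment()
    for name, ok in status.items():
        print(f"{name:>12}: {'found' if ok else 'MISSING'}")
    print("attn_implementation =", pick_attn_implementation(status["flash_attn"]))
```

If `flash_attn` is reported missing, the value returned by `pick_attn_implementation` can be passed to `from_pretrained` in place of the hard-coded `"flash_attention_2"`, subject to the caveat above.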
```python
# Offline video inference.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Use Hugging Face model id (or load from a local folder with the same name).
checkpoint = "OpenMOSS-Team/moss-video-preview-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# A single user turn: one video placeholder followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

```python
# Offline image inference.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "OpenMOSS-Team/moss-video-preview-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# A single user turn: one image placeholder followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

- `trust_remote_code=True` is required (the checkpoint loads custom code through `auto_map`).
- FlashAttention 2 (`attn_implementation="flash_attention_2"`).
- OpenCV (`cv2`); the offline demos rely on the processor's video-loading backend.

For the full environment setup (including the optional FlashAttention 2 extension), see `README.md` in the top-level repository.
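The video example passes `video_fps=1.0`, `video_minlen=8`, and `video_maxlen=16` to the processor. This card does not define these parameters, so the sketch below only illustrates one plausible reading: sample at roughly `video_fps` frames per second, clamp the frame count to the range `[video_minlen, video_maxlen]`, and spread the frames evenly over the clip. The function `sample_frame_indices` is hypothetical and is not the processor's actual implementation, which may behave differently.

```python
def sample_frame_indices(
    total_frames: int,
    native_fps: float,
    video_fps: float = 1.0,
    video_minlen: int = 8,
    video_maxlen: int = 16,
) -> list[int]:
    """Pick evenly spaced frame indices under ASSUMED parameter semantics.

    Assumption: the target count is duration * video_fps, clamped to
    [video_minlen, video_maxlen] and never more than the frames available.
    """
    if total_frames <= 0 or native_fps <= 0:
        raise ValueError("total_frames and native_fps must be positive")

    duration_s = total_frames / native_fps
    target = int(duration_s * video_fps)
    target = max(video_minlen, min(target, video_maxlen))
    target = min(target, total_frames)  # cannot sample more frames than exist

    if target == 1:
        return [0]
    # Integer arithmetic keeps the indices strictly increasing,
    # starting at frame 0 and ending at the last frame.
    return [(i * (total_frames - 1)) // (target - 1) for i in range(target)]


if __name__ == "__main__":
    # A 10-second clip recorded at 30 fps.
    print(sample_frame_indices(total_frames=300, native_fps=30.0))
```

Under these assumptions, very short clips are padded up to `video_minlen` frames where enough frames exist, and long clips are capped at `video_maxlen`, which bounds the number of visual tokens the model has to attend to.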
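> [!IMPORTANT]
> 🌟 Our Mission and Community Invitation
>
> We have filled a gap in cross-attention-based foundation models for video understanding.
>
> We warmly welcome experts in representation learning and model efficiency to explore, experiment with, and innovate on top of our architecture. Let's push the boundaries of video intelligence together and advance the open-source community!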
```bibtex
@article{wang2026mossvideo,
  title         = {{MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention}},
  author        = {Pengyu Wang and Chenkun Tan and Shaojun Zhou and Wei Huang and Qirui Zhou and Zhan Huang and Zhen Ye and Jijun Cheng and Xiaomeng Qian and Yanxin Chen and Xingyang He and Huazheng Zeng and Chenghao Wang and Pengfei Wang and Hongkai Wang and Shanqing Gao and Yixian Tian and Chenghao Liu and Xinghao Wang and Botian Jiang and Xipeng Qiu},
  year          = {2026},
  journal       = {arXiv preprint arXiv:2606.07639},
  eprint        = {2606.07639},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.07639}
}
```