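We introduce MOSS-Video-Preview-Base, the pretrained base model checkpoint in the MOSS-Video-Preview series.
[!Important] This is a pretrained checkpoint. It has not undergone supervised instruction fine-tuning (no offline SFT and no real-time SFT).
This repository contains the pretrained weights, which are intended as a starting point for downstream tasks.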
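MOSS-Video-Preview-Base is the foundation checkpoint of the series. It is built on a pioneering unified image-video cross-attention architecture:
driven by the VideoMllamaTextCrossAttention mechanism, it achieves efficient semantic alignment between temporal visual features and language context. For the architecture diagram and full system details, see the top-level repository: OpenMOSS/MOSS-Video-Preview.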
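For readers new to cross-attention, the alignment described above can be written in the generic scaled dot-product form. This is the textbook formulation, not a transcription of VideoMllamaTextCrossAttention. The symbols $H_{\text{text}}$ (text hidden states), $F_{\text{vis}}$ (visual features from the image or video frames), the projections $W_Q, W_K, W_V$, and the key dimension $d_k$ are notation introduced here for illustration only.

```latex
Q = H_{\text{text}} W_Q, \qquad K = F_{\text{vis}} W_K, \qquad V = F_{\text{vis}} W_V

\mathrm{CrossAttn}(H_{\text{text}}, F_{\text{vis}})
  = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```

In this form the queries come from the language side and the keys and values come from the visual side, so each text position reads from the visual features rather than having visual tokens concatenated into the text sequence.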
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "OpenMOSS-Team/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # For the base model, the prompt is left empty so the model performs a completion task.

# trust_remote_code=True is required because the checkpoint ships custom code.
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# One user turn holding a video placeholder followed by the (empty) text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the text and extract video frames, then move the tensors to the model's device.
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
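The video example passes video_fps, video_minlen, and video_maxlen to the processor. The sketch below shows one plausible way such parameters could interact: sample at the requested rate, then clamp the frame count between the minimum and maximum. This is an assumption made for illustration. The helper sample_frame_times is hypothetical, and the processor's actual frame-selection logic lives in the repository's custom code and may differ.

```python
def sample_frame_times(duration_s, video_fps=1.0, video_minlen=8, video_maxlen=16):
    """Hypothetical helper: pick timestamps (in seconds) for the frames to decode.

    Assumes the frame count is duration * fps, clamped to [video_minlen, video_maxlen].
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Frames requested by the sampling rate alone.
    target = int(duration_s * video_fps)
    # Clamp so short clips still get enough frames and long clips stay bounded.
    num_frames = max(video_minlen, min(video_maxlen, target))
    # Spread frames evenly, taking the midpoint of each equal-length segment.
    step = duration_s / num_frames
    return [round((i + 0.5) * step, 3) for i in range(num_frames)]


short_clip = sample_frame_times(4.0)    # too short for 1 fps, raised to video_minlen
long_clip = sample_frame_times(120.0)   # too long for 1 fps, capped at video_maxlen
print(len(short_clip), len(long_clip))
```

Under these assumptions, a bounded frame count keeps the visual input size predictable regardless of clip length.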
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "OpenMOSS-Team/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # For the base model, the prompt is left empty so the model performs a completion task.

# Load the image and normalize it to three-channel RGB.
image = Image.open(image_path).convert("RGB")

# trust_remote_code=True is required because the checkpoint ships custom code.
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# One user turn holding an image placeholder followed by the (empty) text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the text and preprocess the image, then move the tensors to the model's device.
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
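print(processor.decode(output_ids[0], skip_special_tokens=True))

Requirements: trust_remote_code=True (because of the auto_map custom code), FlashAttention 2 (attn_implementation="flash_attention_2"), and OpenCV (cv2). For the complete environment setup, including the optional FlashAttention2 extras, see the README.md of the top-level repository.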
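Because FlashAttention 2 is described as an optional extra, a loading script may want to check for it before requesting it. The sketch below is one way to do that using only the standard library. The helper pick_attn_implementation is hypothetical, and whether this checkpoint's custom code works with the "sdpa" fallback is an assumption that has not been verified here.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Hypothetical helper: return `preferred` unless it needs a missing package.

    "flash_attention_2" requires the `flash_attn` package; "sdpa" and "eager"
    are other values Transformers accepts for attn_implementation.
    """
    needs_flash_attn = preferred == "flash_attention_2"
    if needs_flash_attn and importlib.util.find_spec("flash_attn") is None:
        # flash_attn is not installed, so fall back (assumed to be supported).
        return fallback
    return preferred


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as attn_implementation=... to AutoModelForCausalLM.from_pretrained in the examples above.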
The custom code is referenced through auto_map in config.json.
[!IMPORTANT]
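🌟 Our Mission and Community Invitation
We fill a gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in representation learning and model efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence together and advance the open-source community!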
@article{wang2026mossvideo,
  title = {{MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention}},
  author = {Pengyu Wang and Chenkun Tan and Shaojun Zhou and Wei Huang and Qirui Zhou and Zhan Huang and Zhen Ye and Jijun Cheng and Xiaomeng Qian and Yanxin Chen and Xingyang He and Huazheng Zeng and Chenghao Wang and Pengfei Wang and Hongkai Wang and Shanqing Gao and Yixian Tian and Chenghao Liu and Xinghao Wang and Botian Jiang and Xipeng Qiu},
  year = {2026},
  journal = {arXiv preprint arXiv:2606.07639},
  eprint = {2606.07639},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2606.07639}
}