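We introduce MOSS-Video-Preview-Base, the pretrained foundation model checkpoint in the MOSS-Video-Preview series.
[!Important] This is a pretrained checkpoint that has not undergone supervised instruction fine-tuning (no offline SFT and no real-time SFT).
This repository contains the pretrained weights, which are intended as a starting point for downstream tasks.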
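As the foundation checkpoint of the series, MOSS-Video-Preview-Base adopts a pioneering unified image-video cross-attention architecture:
VideoMllamaTextCrossAttention mechanism: provides efficient semantic alignment between temporal visual features and language context.
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.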
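To make the cross-attention idea concrete, here is a minimal single-head sketch in plain Python in which each text token (query) attends over per-frame visual features (keys and values). It illustrates generic scaled dot-product cross-attention only. It is not the actual VideoMllamaTextCrossAttention implementation, and the function name, shapes, and toy values are assumptions made for this example. Learned projections, multiple heads, and masking are left out.

```python
import math


def cross_attention(text_states, frame_features):
    """Single-head scaled dot-product cross-attention (illustrative only).

    text_states:    list of query vectors, one per text token.
    frame_features: list of key/value vectors, one per video frame.
    Returns (outputs, weights); weights[i][j] is how strongly text token i
    attends to frame j.
    """
    dim = len(frame_features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in text_states:
        # Similarity of this text token to every frame.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in frame_features
        ]
        # Numerically stable softmax over frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Weighted sum of frame features (values equal keys in this toy version).
        mixed = [
            sum(a * value[d] for a, value in zip(attn, frame_features))
            for d in range(dim)
        ]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights


# Toy input: 2 text tokens attending over 3 frames with 2-dimensional features.
text_states = [[1.0, 0.0], [0.0, 1.0]]
frame_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
outputs, weights = cross_attention(text_states, frame_features)
```

In this toy setup the first text token puts most of its attention on the first frame and the second token on the second frame, which is the kind of text-to-frame alignment a cross-attention layer is meant to learn.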
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # For the base model, the prompt is left empty to perform a completion task.

# trust_remote_code=True is needed because the checkpoint loads custom code through auto_map.
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# A single user turn containing one video placeholder followed by the (empty) text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the text and preprocess the video, then move the tensors to the model's device.
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# do_sample=False disables sampling, so decoding is deterministic.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
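The video_fps, video_minlen, and video_maxlen arguments in the call above control how many frames are taken from the video. The actual sampling logic lives in the checkpoint's custom processor code and is not shown here. The sketch below illustrates one plausible interpretation, stated purely as an assumption: sample at the target fps, clamp the frame count to the range [video_minlen, video_maxlen], then spread the indices evenly over the clip. The helper name sample_frame_indices, its native_fps parameter, and its rounding behavior are hypothetical.

```python
def sample_frame_indices(total_frames, native_fps, video_fps=1.0,
                         video_minlen=8, video_maxlen=16):
    """Pick evenly spaced frame indices (hypothetical helper, not the processor's code)."""
    if total_frames <= 0 or native_fps <= 0:
        raise ValueError("total_frames and native_fps must be positive")
    duration_s = total_frames / native_fps
    # Frames wanted at the target sampling rate, clamped to [minlen, maxlen].
    wanted = int(duration_s * video_fps)
    wanted = max(video_minlen, min(video_maxlen, wanted))
    # Never request more frames than the video actually contains.
    wanted = min(wanted, total_frames)
    if wanted == 1:
        return [0]
    # Evenly spaced indices from the first frame to the last frame.
    step = (total_frames - 1) / (wanted - 1)
    return [round(i * step) for i in range(wanted)]


# A 60-second clip at 30 fps: 60 frames wanted at 1 fps, capped at 16.
long_clip = sample_frame_indices(total_frames=1800, native_fps=30.0)
# A 3-second clip at 30 fps: 3 frames wanted, raised to the minimum of 8.
short_clip = sample_frame_indices(total_frames=90, native_fps=30.0)
```

Under this reading, long videos are always reduced to at most video_maxlen frames, while very short clips are sampled more densely so the model still sees at least video_minlen frames.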
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # For the base model, the prompt is left empty to perform a completion task.

# Load the image and normalize it to three-channel RGB.
image = Image.open(image_path).convert("RGB")

# trust_remote_code=True is needed because the checkpoint loads custom code through auto_map.
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# A single user turn containing one image placeholder followed by the (empty) text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the text and preprocess the image, then move the tensors to the model's device.
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# do_sample=False disables sampling, so decoding is deterministic.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
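print(processor.decode(output_ids[0], skip_special_tokens=True))

Requirements referenced by the examples above: trust_remote_code=True (because of the custom code loaded through auto_map), attn_implementation="flash_attention_2", and cv2.
For the complete environment setup, including the optional FlashAttention2 extras, see README.md in the top-level repository.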
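Because loading depends on trust_remote_code=True and the auto_map entries, it can be useful to check which custom classes a config.json points to before running remote code. The sketch below parses a made-up config fragment. The file and class names in it are hypothetical placeholders, not the actual contents of this repository's config.json. It assumes the usual `module.ClassName` form of auto_map string values, optionally prefixed with `repo--`.

```python
import json

# Hypothetical config.json fragment; the real file's entries may differ.
EXAMPLE_CONFIG = """
{
  "model_type": "example_video_model",
  "auto_map": {
    "AutoConfig": "configuration_example.ExampleConfig",
    "AutoModelForCausalLM": "modeling_example.ExampleForCausalLM"
  }
}
"""


def list_custom_classes(config_text):
    """Return {auto_class: (module_file, class_name)} from a config's auto_map.

    Assumes every auto_map value is a single string.
    """
    config = json.loads(config_text)
    resolved = {}
    for auto_class, reference in config.get("auto_map", {}).items():
        # Drop an optional "repo--" prefix, then split "module.ClassName".
        reference = reference.split("--")[-1]
        module_name, class_name = reference.rsplit(".", 1)
        resolved[auto_class] = (module_name + ".py", class_name)
    return resolved


custom_classes = list_custom_classes(EXAMPLE_CONFIG)
for auto_class, (module_file, class_name) in custom_classes.items():
    print(f"{auto_class}: {class_name} defined in {module_file}")
```

Reading the mapping this way shows which Python files in a checkpoint repository would be executed when trust_remote_code=True is passed, so they can be reviewed first.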
The custom code is referenced through auto_map in config.json.
[!IMPORTANT]
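🌟 Our Mission and Community Invitation
We fill the gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in representation learning and model efficiency to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence together and advance the open-source community!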
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}