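We are releasing MOSS-Video-Preview-SFT, the offline supervised fine-tuning (SFT) checkpoint in the MOSS-Video-Preview series.
[!IMPORTANT] This is an offline SFT checkpoint (instruction-tuned). It is not the Real-Time SFT streaming checkpoint.
This checkpoint is intended for:
MOSS-Video-Preview is built on the Llama-3.2-Vision backbone and uses a pioneering unified image-video cross-attention architecture:
For architecture diagrams and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.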
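To make the term "cross-attention" concrete, here is a minimal pure-Python sketch in which text tokens act as queries over visual tokens. It is not the MOSS-Video-Preview implementation: the function names, the toy vectors, the absence of learned projections, and the idea that an image is handled as a one-frame video are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_states):
    """Single-head cross-attention with no learned projections (toy version).

    text_states:   query vectors, one per text token
    visual_states: key/value vectors, one per visual token
    Returns (outputs, weights), one row per text token.
    """
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for query in text_states:
        # Scaled dot product between this text token and every visual token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in visual_states]
        weights = softmax(scores)
        # Weighted average of the visual vectors.
        mixed = [sum(w * vec[d] for w, vec in zip(weights, visual_states))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Assumption for illustration: an image is a one-frame video, so both
# inputs go through the same attention path.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]                  # 1 frame x 2 patches
video_tokens = image_tokens + [[1.0, 1.0], [0.5, 0.5]]   # 2 frames x 2 patches
text_tokens = [[1.0, 0.0]]                               # 1 text token

img_out, img_w = cross_attention(text_tokens, image_tokens)
vid_out, vid_w = cross_attention(text_tokens, video_tokens)
print(len(img_w[0]), len(vid_w[0]))
```

The text sequence length is unchanged by the number of frames; only the set of visual tokens being attended to grows, which is the general appeal of cross-attention designs for video.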
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Use Hugging Face model id (or load from a local folder with the same name).
checkpoint = "fnlp-vision/moss-video-preview-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

# trust_remote_code=True is needed for the custom code shipped with the checkpoint.
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Single-turn chat: a video placeholder followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."

# Load the image and normalize it to 3-channel RGB.
image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Single-turn chat: an image placeholder followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
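print(processor.decode(output_ids[0], skip_special_tokens=True))

- trust_remote_code=True (due to the custom code referenced by auto_map)
- attn_implementation="flash_attention_2"
- cv2; the offline demos rely on the processor's video-loading backend

For the complete environment setup (including the optional FlashAttention2 extras), see README.md in the top-level repository.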
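The runtime requirements mentioned here can be checked before loading the model. The sketch below uses only the standard library; the helper names are hypothetical, and falling back to "sdpa" (the PyTorch-native attention option in transformers) assumes this checkpoint's custom code supports it, which should be verified against the top-level repository.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module named `name` can be found."""
    return importlib.util.find_spec(name) is not None


def choose_attn_implementation():
    # "flash_attention_2" needs the flash_attn package to be installed.
    # Assumption: "sdpa" is an acceptable fallback for this checkpoint.
    if module_available("flash_attn"):
        return "flash_attention_2"
    return "sdpa"


attn_implementation = choose_attn_implementation()
has_cv2 = module_available("cv2")

print(f"attn_implementation={attn_implementation}")
print(f"cv2 available: {has_cv2}")
```

The chosen value could then be passed as the attn_implementation argument to from_pretrained in place of the hard-coded string.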
[!IMPORTANT]
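🌟 Our Mission and Community Invitation
We have filled the gap in cross-attention-based foundation models for video understanding.
We warmly welcome experts in representation learning and model efficiency to explore, experiment with, and build on our architecture. Let's push the boundaries of video intelligence together and advance the open-source community!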
@misc{moss_video_2026,
  title = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author = {OpenMOSS Team},
  year = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note = {GitHub repository}
}