OpenMOSS/moss-video-preview-realtime-sft

MOSS-Video-Preview-Real-Time-SFT

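Introduction

We introduce MOSS-Video-Preview-Real-Time-SFT, a specialized model derived from MOSS-Video-Preview-SFT and further optimized with an additional Real-Time Supervised Fine-Tuning (Real-Time SFT) stage.

[!Important] This is a Real-Time SFT checkpoint. It is optimized for low-latency, high-frequency real-time video understanding.

This checkpoint is intended for:

Real-time video understanding with a true "speak while watching" capability.
Low-latency interactive applications where Time to First Token (TTFT) is critical.
Continuous video monitoring with immediate action feedback.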
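
TTFT can be measured directly from a token queue like the one used in the streaming example in the Quick Start section. The sketch below is a minimal illustration, assuming tokens arrive on a standard `queue.Queue`. Both `measure_ttft` and `fake_model` are hypothetical helpers introduced here, not part of the model's API, and the simulated delay stands in for real generation.

```python
import queue
import threading
import time


def measure_ttft(token_queue, start_time, timeout=5.0):
    """Return (seconds until the first token, the token), or (None, None) on timeout."""
    try:
        first = token_queue.get(timeout=timeout)
    except queue.Empty:
        return None, None
    return time.perf_counter() - start_time, first


def fake_model(token_queue, delay=0.05):
    # Stand-in for a real generation thread: emits one token after `delay` seconds.
    time.sleep(delay)
    token_queue.put("Hello")


tokens = queue.Queue()
start = time.perf_counter()
threading.Thread(target=fake_model, args=(tokens,), daemon=True).start()
ttft, first_token = measure_ttft(tokens, start)
print(f"TTFT: {ttft * 1000:.1f} ms (first token: {first_token!r})")
```

With a real model, `start` would be taken just before the prompt is queued, and the first non-special token would mark the end of the interval.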

Model Architecture

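**MOSS-Video-Preview-Real-Time-SFT** is the flagship model of the series. It features a **pioneering unified image-video cross-attention architecture** optimized for streaming:

Native unified design: Unlike conventional models, the architecture supports native frame-by-frame video injection, so the visual context stays synchronized with the generation process.
Duplex interaction: Tuned specifically for switching between silence and speech. As the video scene changes, the model can be interrupted in real time and correct its own response.
Unified spatio-temporal encoding: Optimized gated positional embeddings and a cross-attention KV cache let the model maintain robust temporal context over extended streams.

For the architecture diagram and full system details, see the top-level repository: fnlp-vision/MOSS-Video-Preview.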
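
To make the idea of a cross-attention KV cache concrete, here is a toy sketch in plain Python. It illustrates the general technique only: each injected frame contributes one key/value pair, and a text-side query attends over everything cached so far. The class name `CrossAttentionKVCache`, the one-vector-per-frame layout, and the absence of gating or positional terms are all simplifying assumptions; the model's actual implementation is documented in the repository, not here.

```python
import math


class CrossAttentionKVCache:
    """Toy cache: one (key, value) vector pair per injected frame."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append_frame(self, key, value):
        # New frames extend the cache; earlier frames are never recomputed.
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        """Scaled dot-product attention of one query vector over all cached frames."""
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in self.keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values)) for d in range(dim)]


cache = CrossAttentionKVCache()
cache.append_frame(key=[1.0, 0.0], value=[10.0, 0.0])  # frame 0
cache.append_frame(key=[0.0, 1.0], value=[0.0, 10.0])  # frame 1
context = cache.attend(query=[0.0, 4.0])  # query aligned with frame 1's key
print(context)
```

Because the query matches the key of frame 1, most of the attention weight, and therefore most of the returned context vector, comes from that frame's value.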

🌊 Streaming Inference Mechanism

The model's core strength is its asynchronous streaming capability, which enables true "speak while watching" video intelligence.

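Asynchronous single-frame streaming: Video frames are injected at a steady rate. The input pipeline is non-blocking and decoupled from text generation, which keeps perception continuous.
Persistent state maintenance: Using the cross-attention KV cache and temporal positional encoding, the model maintains long-range context dependencies across consecutive frames.
Immediate streaming response: Built on an optimized MllamaVideoModel, the model runs autoregressive generation in parallel with the visual stream, achieving very low Time to First Token (TTFT).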
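
The decoupling described above can be sketched with two threads and two queues. This is a simulation of the pattern under stated assumptions, not the model's code: `frame_producer` and `text_generator` are hypothetical names, frames are plain strings, and a "token" is just a label recording the newest frame seen when it was produced.

```python
import queue
import threading
import time


def frame_producer(frames, n_frames, interval):
    # Inject frames at a fixed rate, independent of how fast text is generated.
    for i in range(n_frames):
        frames.put(f"frame-{i}")
        time.sleep(interval)
    frames.put(None)  # sentinel: stream finished


def text_generator(frames, tokens):
    seen = []  # stands in for persistent state such as a KV cache
    done = False
    while not done:
        # Drain whatever frames have arrived without blocking generation.
        while True:
            try:
                frame = frames.get_nowait()
            except queue.Empty:
                break
            if frame is None:
                done = True
                break
            seen.append(frame)
        if seen:
            # One "decoding step", conditioned on the newest frame seen so far.
            tokens.put(f"token-after-{seen[-1]}")
        time.sleep(0.005)  # simulated per-token compute
    tokens.put("[DONE]")


frames, tokens = queue.Queue(), queue.Queue()
producer = threading.Thread(target=frame_producer, args=(frames, 5, 0.02))
worker = threading.Thread(target=text_generator, args=(frames, tokens))
producer.start()
worker.start()
producer.join()
worker.join()

output = list(iter(tokens.get, "[DONE]"))
print(len(output), "tokens; last:", output[-1])
```

The generator never waits on the frame queue, so slow or bursty input does not stall decoding, and each step sees the most recent visual context available.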

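🌟 Key Highlights

🧩 Decoupled cross-attention mechanism: A novel approach that decouples visual perception from language generation for seamless real-time video understanding.
🔄 Millisecond-level interaction: Supports real-time interruption and dynamically adjusts responses as the environment changes.
⚡ Hardware-optimized performance: Full support for FlashAttention 2, compatibility with CUDA and NPU platforms, and optimizations for long-context video stream processing.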
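
Since FlashAttention 2 is optional, a loading script may want to pick the attention backend at runtime. The helper below is a hedged sketch: `pick_attn_implementation` is a hypothetical function, and it assumes the checkpoint's custom model code also accepts `"sdpa"` as a fallback value for `attn_implementation`, which should be verified against the repository before relying on it.

```python
import importlib.util


def pick_attn_implementation(has_cuda):
    """Choose a value for `attn_implementation` based on what is installed."""
    # The FlashAttention 2 package is importable as `flash_attn` and needs a CUDA GPU.
    if has_cuda and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumed fallback: PyTorch's built-in scaled-dot-product attention.
    return "sdpa"


print(pick_attn_implementation(has_cuda=False))
```

In practice `has_cuda` would come from `torch.cuda.is_available()`, and the result would be passed as `attn_implementation=...` to `from_pretrained`.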

🚀 Quick Start

Streaming Video Inference (recommended for Real-Time SFT)

This mode uses the real_time_generate() API for low-latency streaming.

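import queue
import threading
import time

import cv2
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def feed(video, q, fps=1.0):
    """Push frames from `video` into `q` at roughly `fps` frames per second."""
    cap = cv2.VideoCapture(video)
    # Keep every `step`-th frame; fall back to 25 fps if the source rate is unknown.
    step = max(1, round((cap.get(cv2.CAP_PROP_FPS) or 25) / fps))
    i = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            # OpenCV decodes to BGR; convert to RGB before wrapping in a PIL image.
            q.put(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
            time.sleep(1 / fps)  # pace the feed to simulate a live stream
        i += 1
    cap.release()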

checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, device_map="auto")

image_queue, prompt_queue, token_queue = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=feed, args=(video_path, image_queue), daemon=True).start()
time.sleep(1)
prompt_queue.put(prompt)
threading.Thread(
    target=lambda: model.real_time_generate(image_queue, prompt_queue, token_queue, processor),
    daemon=True,
).start()

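END = {"[DONE]", "[ERROR]", "<|round_end|>"}
BANNER = "\n" + "-" * 30 + " [Silence / Observing] " + "-" * 30

pending = None      # most recent text token, held back one step before printing
silent = False      # True while the model is emitting <|silence|>
last = time.time()  # arrival time of the most recent token
got = False         # whether any token has arrived yet

while True:
    try:
        tok = token_queue.get(timeout=0.1)
    except queue.Empty:
        if pending:
            print(pending, end="", flush=True)
            pending = None
        # Stop after 5 seconds without new tokens once output has started.
        if got and time.time() - last > 5:
            break
        continue
    got, last = True, time.time()
    if tok == "<|round_start|>":
        pending = None  # discard the token held back just before the round marker
        continue
    if tok in END:
        if pending:
            print(pending, end="", flush=True)
        break
    if tok == "<|silence|>":
        if not silent:
            if pending:
                print(pending, end="", flush=True)
                pending = None
            print(BANNER, flush=True)  # print the banner once per silent stretch
            silent = True
        continue
    silent = False
    if pending:
        print(pending, end="", flush=True)
    pending = tok

if hasattr(model, "stop_real_time_generate"):
    model.stop_real_time_generate()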
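
The display logic in the loop above can be checked without loading the model by expressing it as a pure function over a token list. The sketch below mirrors that loop under stated assumptions: `render_stream` and `SILENCE_MARK` are hypothetical names, the special tokens are the ones that appear in the example, and timing-based behavior (the idle flush and the 5-second stop) is left out.

```python
END_TOKENS = {"[DONE]", "[ERROR]", "<|round_end|>"}
SILENCE_MARK = "[silence]"


def render_stream(tokens):
    """Return the pieces that the streaming loop would display, in order."""
    out = []
    pending = None  # most recent text token, held back one step
    silent = False
    for tok in tokens:
        if tok == "<|round_start|>":
            pending = None  # the held-back token before a round marker is dropped
            continue
        if tok in END_TOKENS:
            break
        if tok == "<|silence|>":
            if not silent:
                if pending is not None:
                    out.append(pending)
                    pending = None
                out.append(SILENCE_MARK)  # one marker per run of silence tokens
                silent = True
            continue
        silent = False
        if pending is not None:
            out.append(pending)
        pending = tok
    if pending is not None:
        out.append(pending)
    return out


demo = ["<|silence|>", "<|silence|>", "hdr", "<|round_start|>",
        "A", " cat", "<|silence|>", "<|round_end|>", "ignored"]
print(render_stream(demo))
```

Consecutive silence tokens collapse into a single marker, the token preceding `<|round_start|>` is discarded, and nothing after an end token is shown.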

Offline Video Inference

import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0

video_fps = 1.0
video_minlen = 8
video_maxlen = 256


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint, trust_remote_code=True, frame_extract_num_threads=1
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)
new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()

new_queries.put(
    {
        "prompt": f"\n{prompt}",
        "images": [],
        "videos": [video_path],
        "media_kwargs": {
            "video_fps": video_fps,
            "video_minlen": video_minlen,
            "video_maxlen": video_maxlen,
        },
        "thinking_mode": "no_thinking",
        "system_prompt_type": "video",
        "generate_kwargs": {
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": repetition_penalty,
        },
        "stop_offline_generate": False,
    }
)
new_queries.put({"stop_offline_generate": True})


def drain_output():
    while True:
        tok = output_text_queue.get()
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)


t = threading.Thread(target=drain_output, daemon=True)
t.start()
with torch.no_grad():
    model.offline_generate(processor, new_queries, output_text_queue, vision_chunked_length=64)
t.join(timeout=5.0)

Offline Image Inference



import os, queue, threading, torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
checkpoint = "fnlp-vision/moss-video-preview-realtime-sft"
image_path = "data/example_image.jpg"
prompt = "Describe this image."
if not os.path.isfile(image_path):
    raise FileNotFoundError(image_path)

processor = AutoProcessor.from_pretrained(
    checkpoint, trust_remote_code=True, frame_extract_num_threads=1
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)

new_q, out_q = queue.Queue(), queue.Queue()
new_q.put(
    {
        "prompt": f"\n{prompt}",
        "images": [Image.open(image_path).convert("RGB")],
        "videos": [],
        "system_prompt_type": "text_image",
        "thinking_mode": "no_thinking",
        "generate_kwargs": {"temperature": 1.0, "top_k": 50, "top_p": 1.0, "max_new_tokens": 256, "repetition_penalty": 1.0},
        "stop_offline_generate": False,
    }
)
new_q.put({"stop_offline_generate": True})

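def drain_output():
    # Print tokens as they arrive until the end-of-round marker.
    for tok in iter(out_q.get, "<|round_end|>"):
        print(tok, end="", flush=True)


printer = threading.Thread(target=drain_output, daemon=True)
printer.start()

with torch.no_grad():
    model.offline_generate(processor, new_q, out_q, vision_chunked_length=64)
printer.join(timeout=5.0)  # give the printer time to flush any remaining tokens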
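
The two offline examples build nearly identical request dictionaries, so a small helper can reduce the duplication. The sketch below is illustrative: `build_query` and `STOP` are hypothetical names, the keys and default values are copied from the examples above, and choosing `"video"` versus `"text_image"` based on whether any videos are present is an assumption drawn from those two examples only.

```python
def build_query(prompt, images=None, videos=None, media_kwargs=None, **generate_kwargs):
    """Assemble one request dict in the shape used by the offline examples."""
    images = list(images or [])
    videos = list(videos or [])
    query = {
        "prompt": f"\n{prompt}",
        "images": images,
        "videos": videos,
        "thinking_mode": "no_thinking",
        # Assumption: video requests use "video", everything else "text_image".
        "system_prompt_type": "video" if videos else "text_image",
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            **generate_kwargs,  # caller overrides win over the defaults
        },
        "stop_offline_generate": False,
    }
    if media_kwargs:
        query["media_kwargs"] = dict(media_kwargs)
    return query


# Sentinel request that tells the offline loop to stop, as in the examples.
STOP = {"stop_offline_generate": True}

video_query = build_query(
    "Describe the video.",
    videos=["data/example_video.mp4"],
    media_kwargs={"video_fps": 1.0, "video_minlen": 8, "video_maxlen": 256},
    max_new_tokens=1024,
)
print(video_query["system_prompt_type"], video_query["generate_kwargs"]["max_new_tokens"])
```

A request built this way would be put on the query queue, followed by `STOP`, before calling `offline_generate`.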

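✅ Intended Use

Real-time visual narration: Provide instant descriptions and question answering for live video streams.
Low-latency monitoring: Detect events or actions in real time with minimal delay.
Interactive multimodal agents: Build responsive AI assistants that can "see" and interact.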

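⚠️ Limitations and Future Work

High-end hardware recommended: For the best real-time experience (lowest latency), a modern GPU with FlashAttention 2 support (such as A100, H100, or H200) is strongly recommended.
Benchmark performance: Although the model leads in real-time interaction, it still trails models such as Qwen2.5-VL on general benchmarks. Closing this gap through continued optimization is our main focus.
Scalable distributed training: We are migrating the training pipeline to the Megatron-LM framework and will use 3D parallelism to support larger-scale pre-training and fine-tuning in future releases.
Open-source commitment: The complete training codebase and experiment configurations will be released in the next major update.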

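🧩 Requirements

Python: 3.10 or later
PyTorch: 1.13.1 or later (GPU strongly recommended)
Tested configuration: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
CPU-only environment: PyTorch 2.4.0
Transformers: requires the trust_remote_code=True argument
FlashAttention 2: strongly recommended for low-latency inference.
OpenCV: required for video frame extraction in the streaming demo.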
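
The requirements above can be checked with a short script before downloading the checkpoint. This is a sketch using only the standard library; `check_environment` and `parse_version` are hypothetical helpers, and the distribution names `opencv-python` and `flash-attn` are the common PyPI names, assumed here (an install of `opencv-python-headless`, for example, would be reported as missing).

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)
MIN_TORCH = (1, 13, 1)


def parse_version(text):
    """Turn '2.4.0+cu121' into (2, 4, 0); local and pre-release suffixes are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text.split("+")[0])[:3])


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Map each requirement to True (satisfied) or False (missing or too old)."""
    report = {"python": sys.version_info[:2] >= MIN_PYTHON}
    torch_version = installed_version("torch")
    report["torch"] = torch_version is not None and parse_version(torch_version) >= MIN_TORCH
    for package in ("transformers", "opencv-python", "flash-attn"):
        report[package] = installed_version(package) is not None
    return report


for name, ok in check_environment().items():
    print(f"{name:15s} {'ok' if ok else 'missing or too old'}")
```

FlashAttention 2 is recommended rather than required, so a missing `flash-attn` entry is a warning, not an error.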

[!IMPORTANT]

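🌟 Our Mission and an Invitation to the Community

We fill the gap in cross-attention-based foundation models for video understanding.

We warmly welcome experts in representation learning and model efficiency to explore, experiment, and innovate on our architecture. Let's push the boundaries of video intelligence together and advance the open-source community!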

Citation

@misc{moss_video_2026,
  title         = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note          = {GitHub repository}
}