TimeLens-8B

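✨ Model Description

TimeLens-8B is a multimodal large language model (MLLM) that achieves state-of-the-art video temporal grounding (VTG) performance among open-source models. It is fine-tuned from Qwen3-VL-8B-Instruct using the carefully designed RLVR (reinforcement learning with verifiable rewards) training recipe proposed in our paper, together with TimeLens-100K, a high-quality VTG training dataset.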

📊 Performance

TimeLens-8B achieves state-of-the-art video temporal grounding performance among open-source models:

Model	Charades-TimeLens				ActivityNet-TimeLens				QVHighlights-TimeLens
Model	R1@0.3	R1@0.5	R1@0.7	mIoU	R1@0.3	R1@0.5	R1@0.7	mIoU	R1@0.3	R1@0.5	R1@0.7	mIoU
Qwen2.5-VL-7B-Instruct	59.7	37.8	16.6	39.3	44.1	31.0	16.1	31.4	41.5	27.8	15.2	31.6
TimeLens-7B🚀	70.5	55.6	28.4	48.8	62.8	51.0	32.6	46.2	74.1	62.7	43.1	56.0
Qwen3-VL-8B-Instruct	69.2	53.4	27.5	48.3	62.1	51.2	34.4	46.8	74.2	64.6	49.3	59.4
TimeLens-8B🚀	76.6	63.0	35.2	55.2	68.9	58.4	40.6	53.2	80.2	71.6	55.5	65.5

For a detailed comparison with other models, see the 🏆 Leaderboard.
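
The column headers follow the usual temporal grounding convention: R1@t is the percentage of queries whose top-1 predicted segment overlaps the ground-truth segment with IoU of at least t, and mIoU is the mean IoU over all queries. The snippet below is a minimal sketch of these metrics under that standard definition. It is not the official TimeLens evaluation code, and the names `temporal_iou` and `grounding_metrics` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: List[Span],
    gts: List[Span],
    thresholds: Tuple[float, ...] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Compute R1@t (in percent) for each threshold, plus mIoU (in percent)."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {
        f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds
    }
    metrics["mIoU"] = 100.0 * sum(ious) / n
    return metrics
```

For example, `grounding_metrics([(0, 10), (5, 10)], [(0, 10), (0, 10)])` scores one exact match and one half-overlapping prediction, so the two IoU values are 1.0 and 0.5.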

🚀 Usage

Install the following packages:

pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

Run inference with 🤗 Transformers:

import requests
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info


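def download_video(url):
    """Download the video to the current directory (skipped if the file already exists)."""
    save_path = os.path.basename(url)
    if not os.path.exists(save_path):
        print(f"Downloading video from {url}...")
        # Stream the response so the whole file is never held in memory
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return save_path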

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",
    do_resize=False,
)

# Prepare input
query = "A man drinks water with a glass"
video_path = download_video("https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4")

GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 32 * 32,
            'total_pixels': 14336 * 32 * 32,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)

# Each entry of `videos` is a (frames, metadata) pair; split them into two parallel lists
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to(model.device)  # follow the device chosen by device_map="auto"

output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
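
The prompt above asks the model to answer in the form 'The event happens in <start time> - <end time> seconds'. To use the result downstream, the timestamps usually need to be extracted as numbers. The snippet below is a minimal sketch of such a parser, assuming the model follows the requested format. `parse_time_span` is a hypothetical helper, not part of the TimeLens code base, and the example strings are illustrative rather than real model output.

```python
import re
from typing import Optional, Tuple


def parse_time_span(answer: str) -> Optional[Tuple[float, float]]:
    """Extract (start_sec, end_sec) from an answer such as
    'The event happens in 12.5 - 18.0 seconds'. Returns None if no span is found."""
    # Two decimal numbers separated by '-' or 'to', with optional whitespace
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Assumption: if the numbers come out reversed, swap them rather than fail
    return (start, end) if start <= end else (end, start)
```

Calling `parse_time_span(answer)` on the decoded output yields a tuple of floats, or `None` when the model deviates from the requested format, which callers should handle explicitly.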

Citation

If you find our work helpful for your research and applications, please cite our paper:

@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}