TimeLens-8B

📑 Paper | 💻 Code | 🏠 Project Page | 🤗 Model & Data

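✨ Model Description

TimeLens-8B is a multimodal large language model (MLLM) that achieves state-of-the-art video temporal grounding (VTG) performance among open-source models. It is fine-tuned from Qwen3-VL-8B-Instruct using the carefully designed RLVR (reinforcement learning with verifiable rewards) training recipe proposed in our paper, together with TimeLens-100K, a high-quality VTG training dataset.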

📊 Performance

TimeLens-8B achieves state-of-the-art video temporal grounding performance among open-source models:

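| Model | Charades-TimeLens R1@0.3 | Charades-TimeLens R1@0.5 | Charades-TimeLens R1@0.7 | Charades-TimeLens mIoU | ActivityNet-TimeLens R1@0.3 | ActivityNet-TimeLens R1@0.5 | ActivityNet-TimeLens R1@0.7 | ActivityNet-TimeLens mIoU | QVHighlights-TimeLens R1@0.3 | QVHighlights-TimeLens R1@0.5 | QVHighlights-TimeLens R1@0.7 | QVHighlights-TimeLens mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B 🚀 | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B 🚀 | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |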

For detailed comparisons with other models, see the 🏆 Leaderboard.
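
For readers unfamiliar with the column names: R1@m is conventionally the share of queries whose top-1 predicted segment overlaps the ground-truth segment with a temporal IoU of at least m, and mIoU is the mean temporal IoU over all queries. The following is a minimal sketch of those conventional definitions, not the official TimeLens evaluation script, which may differ in details. The function names and the percentage scaling are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Percentage of queries whose single prediction reaches IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Mean temporal IoU over all queries, expressed as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)
```

Each query contributes exactly one predicted span here, which matches the "R1" (top-1) setting.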

🚀 Usage

Install the following packages:

pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

Inference with 🤗 Transformers:

import requests
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info


def download_video(url):
    save_path = os.path.basename(url)
    if not os.path.exists(save_path):
        print(f"Downloading video from {url}...")
        response = requests.get(url, stream=True)
        response.raise_for_status()
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return save_path

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",
    do_resize=False,
)

# Prepare input
query = "A man drinks water with a glass"
video_path = download_video("https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4")

GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 32 * 32,
            'total_pixels': 14336 * 32 * 32,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

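# Render the chat template to a prompt string (tokenization happens in the processor call below)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Decode and sample the video frames; with return_video_metadata=True,
# each entry in `videos` is a (frames, metadata) pair
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)

# Split the (frames, metadata) pairs into two parallel lists
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to("cuda")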

output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
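
Downstream code usually needs the predicted segment as numbers rather than as a sentence. The following is a minimal sketch of one way to extract them, assuming the model follows the format requested in GROUNDER_PROMPT ('The event happens in <start time> - <end time> seconds'). The helper name `parse_time_span` and the regular expression are illustrative and not part of the TimeLens codebase, and real outputs may deviate from this format.

```python
import re
from typing import Optional, Tuple

# Matches "<number> - <number> seconds", allowing optional decimals.
# "to" is accepted as an alternative separator (an assumption).
_SPAN_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)\s*seconds?",
    re.IGNORECASE,
)


def parse_time_span(answer: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds, or None if no span is found."""
    match = _SPAN_RE.search(answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Swap a reversed pair so that start <= end (a design choice of this sketch)
    if end < start:
        start, end = end, start
    return start, end
```

For example, `parse_time_span(answer)` can be called on the decoded string above. Returning None instead of raising lets a caller decide how to treat unparseable outputs.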

Citation

If you find our work helpful for your research and applications, please cite our paper:

@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}