📑 Paper | 💻 Code | 🏠 Project Page | 🤗 Model & Data
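TimeLens-8B is a multimodal large language model (MLLM) that achieves state-of-the-art video temporal grounding (VTG) performance among open-source models. It is fine-tuned from Qwen3-VL-8B-Instruct using the carefully designed RLVR (reinforcement learning with verifiable rewards) training recipe proposed in our paper, together with TimeLens-100K, a high-quality VTG training dataset.

Results on the three evaluation benchmarks: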
| Model | Charades-TimeLens | | | | ActivityNet-TimeLens | | | | QVHighlights-TimeLens | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B🚀 | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B🚀 | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |

For a detailed comparison with other models, see the 🏆 Leaderboard.
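
The column names in the table (R1@0.3, R1@0.5, R1@0.7, mIoU) are the usual temporal grounding metrics. As a rough illustration only, the sketch below computes them under the common definitions: temporal IoU between a predicted and a ground-truth span, the percentage of queries whose top-1 prediction reaches an IoU threshold, and the mean IoU over all queries. The function names and the `>=` comparison are assumptions made for this example; the official evaluation code may differ in detail.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Percentage of queries whose single prediction has IoU >= threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average IoU over all queries, as a percentage."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)


# Toy data, invented for illustration (not taken from any benchmark).
example_preds = [(2.0, 8.0), (0.0, 5.0)]
example_gts = [(4.0, 10.0), (0.0, 4.0)]

for thr in (0.3, 0.5, 0.7):
    print(f"R1@{thr}: {recall_at_1(example_preds, example_gts, thr):.1f}")
print(f"mIoU: {mean_iou(example_preds, example_gts):.1f}")
```

In this toy example the two predictions have IoU 0.5 and 0.8, so only one of them counts toward R1@0.7 while both count toward R1@0.3 and R1@0.5.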

Install the following packages:

    pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
    pip install "qwen-vl-utils[decord]==0.0.14"
    # use Flash-Attention 2 to speed up generation
    pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

Run inference with 🤗 Transformers:


    import os

    import requests
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor
    from qwen_vl_utils import process_vision_info


    def download_video(url):
        """Download the video to the current directory unless it is already there."""
        save_path = os.path.basename(url)
        if not os.path.exists(save_path):
            print(f"Downloading video from {url}...")
            response = requests.get(url, stream=True)
            response.raise_for_status()
            with open(save_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
        return save_path


    # Load model and processor
    model = AutoModelForImageTextToText.from_pretrained(
        "TencentARC/TimeLens-8B",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(
        "TencentARC/TimeLens-8B",
        padding_side="left",
        do_resize=False,
    )

    # Prepare input
    query = "A man drinks water with a glass"
    video_path = download_video("https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4")

    GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

    messages = [{
        'role': 'user',
        'content': [
            {
                'type': 'video',
                'video': video_path,
                'min_pixels': 64 * 32 * 32,
                'total_pixels': 14336 * 32 * 32,
                'fps': 2,
            },
            {
                'type': 'text',
                'text': GROUNDER_PROMPT.format(query)
            }
        ]
    }]

    # Build the chat prompt and extract the video frames with their metadata
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos, video_kwargs = process_vision_info(
        messages,
        image_patch_size=16,
        return_video_kwargs=True,
        return_video_metadata=True,
    )
    # Each entry is a (video, metadata) pair; split them into two lists
    videos, video_metadatas = zip(*videos)
    videos, video_metadatas = list(videos), list(video_metadatas)

    inputs = processor(
        text=[text],
        images=images,
        videos=videos,
        video_metadata=video_metadatas,
        padding=True,
        return_tensors='pt',
        **video_kwargs,
    ).to("cuda")

    # Greedy decoding
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=512,
    )
    # Drop the prompt tokens so that only the generated answer is decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
    ]
    answer = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(f"Answer: {answer}")

If you find our work helpful for your research and applications, please cite our paper:

    @article{zhang2025timelens,
      title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
      author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
      journal={arXiv preprint arXiv:2512.14698},
      year={2025}
    }
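
The inference example earlier in this README asks the model to answer in the form 'The event happens in <start time> - <end time> seconds'. For downstream use it is often convenient to turn that string into numbers. The sketch below is one possible way to do so with a regular expression; the helper name `parse_time_span` and the sample answer string are hypothetical, and real model outputs may deviate from the requested format, in which case the helper returns `None`.

```python
import re
from typing import Optional, Tuple

# Matches "<number> - <number> second(s)"; accepts a hyphen or an en dash.
_SPAN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*seconds?")


def parse_time_span(answer: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds, or None if no time span is found."""
    match = _SPAN_RE.search(answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical answer string, written by hand to illustrate the format.
sample_answer = "The event happens in 12.5 - 18.0 seconds."
print(parse_time_span(sample_answer))
```

Returning `None` instead of raising keeps batch evaluation simple: unparseable answers can be counted as misses.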