Accepted at CVPR 2026 🎉
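uAI-NEXUS-MedVLM-1.0a-7B-RL is a medical video understanding model fine-tuned from Qwen2.5-VL-7B-Instruct. It is the 7B-RL member of the uAI-NEXUS-MedVLM 1.0 family (variant a = Qwen2.5-VL base; variants b and c use Qwen3-VL-4B and Qwen3.5-4B, respectively). Training follows a two-stage pipeline, with GRPO as the second stage.

The model achieves state-of-the-art results on multiple medical video understanding tasks, including temporal action localization, spatiotemporal grounding, video summarization, region captioning, and surgical skill/CVS assessment.

The model handles 8 medical video understanding tasks (11 variants in total):
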
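| Task category | Tasks |
|---|---|
| Temporal understanding | Temporal action localization (TAL), spatiotemporal grounding (STG), next action prediction |
| Captioning | Dense captioning (GPT / Gemini), video summarization (GPT / Gemini), region captioning (GPT / Gemini) |
| Assessment | Skill assessment, CVS (Critical View of Safety) |
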
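Trained on 51,505 balanced video-instruction pairs (MedVidBench standard split) drawn from 8 source datasets: AVOS, CholecT50, CholecTrack20, Cholec80-CVS, CoPESD, EgoSurgery, JIGSAWS, and NurViD.

Stage 2 (GRPO) uses a task-balanced subset of the standard split (see the paper for details).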
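
For readers unfamiliar with GRPO (Group Relative Policy Optimization): its core idea is to score each sampled response relative to the other responses sampled for the same prompt, rather than against a learned value function. The snippet below is a minimal sketch of that group-wise normalization only. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions; this is not the MedGRPO training code and says nothing about how the paper defines its rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here as an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to a single prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one.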

```bash
pip install transformers accelerate torch pillow qwen-vl-utils
```

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor from the Hugging Face Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL")

video_frames = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # list of frame paths

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_frames},
        {"type": "text", "text": "When does the surgeon grasp the gallbladder? Provide start and end times in seconds."},
    ],
}]

# Build the chat prompt and extract the visual inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so that only the newly generated answer is decoded.
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Example: "The surgeon grasps the gallbladder from 45.2 to 58.7 seconds."
```

For full batch inference with proper video-frame handling, use the reference pipeline in UII-AI/MedGRPO-Code:

```bash
git clone https://github.com/UII-AI/MedGRPO-Code
cd MedGRPO-Code
pip install -r requirements.txt
bash run_inference.sh
```

The model was evaluated on MedVidBench (6,245 test samples across 8 tasks). GRPO consistently outperforms the SFT baseline.

To benchmark your own model, submit predictions to the MedVidBench leaderboard.
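
Free-text answers, such as the example response in the usage snippet, have to be turned into numbers before a localization prediction can be compared with a reference. The sketch below shows one way to do that, assuming the answer contains a "from X to Y seconds" phrase. The helper names `parse_interval` and `temporal_iou` are hypothetical and not part of MedGRPO-Code, the ground-truth interval is made up, and temporal IoU is used only as an illustrative overlap measure; the leaderboard's actual submission format and metrics may differ.

```python
import re


def parse_interval(response):
    """Return (start, end) in seconds from a 'from X to Y seconds' phrase, or None."""
    match = re.search(
        r"from\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\s+seconds", response
    )
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Reject intervals that end before they start.
    return (start, end) if end >= start else None


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


pred = parse_interval("The surgeon grasps the gallbladder from 45.2 to 58.7 seconds.")
gt = (44.0, 60.0)  # hypothetical ground-truth interval, for illustration only
print(pred, temporal_iou(pred, gt))
```

Returning `None` for answers that do not match the pattern lets a caller count unparseable responses separately instead of silently scoring them as zero.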

Released under the Apache 2.0 license.

If you use this model or the MedVidBench benchmark, please cite:

```bibtex
@inproceedings{su2026medgrpo,
  title     = {{MedGRPO}: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author    = {Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and
               Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and
               Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

Please open an issue on the GitHub repository.