Model Card for chenjoya/videollm-online-8b-v1plus

https://showlab.github.io/videollm-online/

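Model Details

LLM: meta-llama/Meta-Llama-3-8B-Instruct
Vision Strategy:
- Frame Encoder: google/siglip-large-patch16-384
- Frame Tokens: CLS token + 3x3 average-pooled tokens (10 tokens per frame)
- Frame FPS: 2 for training, 2 to 10 for inference
- Frame Resolution: max resolution 384, zero-padded to keep the aspect ratio
- Video Length: 10 minutes
Training Data: Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K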
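
To make the numbers above concrete, here is a minimal Python sketch of the per-frame token count and of the resize-then-pad geometry. The function names are hypothetical and not taken from the repository. The geometry assumes the longer side is always scaled to exactly 384, which the card does not state explicitly, and it leaves open how the padding is split between the two sides.

```python
import math

PATCH_POOL = 3  # 3x3 average-pooled tokens per frame (from the model card)
TOKENS_PER_FRAME = 1 + PATCH_POOL * PATCH_POOL  # CLS token + pooled tokens


def visual_token_budget(duration_s: float, fps: float) -> int:
    """Number of frame tokens for a clip sampled at `fps` frames per second."""
    num_frames = math.floor(duration_s * fps)
    return num_frames * TOKENS_PER_FRAME


def letterbox_geometry(width: int, height: int, target: int = 384):
    """Scale the longer side to `target`, then zero-pad to a square.

    Returns (new_width, new_height, pad_x, pad_y), where pad_x and pad_y are
    the total padding in pixels along each axis. Assumption: the longer side
    is always scaled to `target`, including for smaller inputs.
    """
    longer = max(width, height)
    new_w = max(1, round(width * target / longer))
    new_h = max(1, round(height * target / longer))
    return new_w, new_h, target - new_w, target - new_h
```

Under these assumptions, a 10-minute video at the training rate of 2 FPS yields 1,200 frames and 12,000 frame tokens, and a 1920x1080 frame becomes 384x216 with 168 rows of zero padding.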

Model Sources

Repository: https://github.com/showlab/videollm-online
Paper: https://arxiv.org/abs/2406.11816

Uses

First, clone the GitHub repository and follow the installation instructions:

git clone https://github.com/showlab/videollm-online

Make sure Miniconda and Python >= 3.10 are installed, then run:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation
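
Before moving on, it can help to confirm that the environment matches the requirements above. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the module list simply maps the pip and conda package names to their import names (for example, `flash-attn` is imported as `flash_attn` and `pytorch` as `torch`).

```python
import importlib.util
import sys

# Import names for the packages installed by the commands above.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "transformers", "accelerate",
    "deepspeed", "peft", "editdistance", "Levenshtein", "tensorboard",
    "gradio", "moviepy", "submitit", "flash_attn",
]


def python_ok(version=sys.version_info, minimum=(3, 10)) -> bool:
    """True if the (major, minor) version meets the stated minimum."""
    return tuple(version[:2]) >= minimum


def missing_modules(names=REQUIRED_MODULES) -> list:
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `python_ok()` and `missing_modules()` in the activated conda environment should return `True` and an empty list once installation has succeeded.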

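Installing PyTorch from the conda channel also installs ffmpeg, but it is an old version that usually produces very low-quality preprocessing. Install the latest ffmpeg release with the following steps: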

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-*-amd64-static ffmpeg
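
To check which ffmpeg build is actually in place after these steps, a small sketch like the one below can read the version banner. It assumes the static binary ends up at `./ffmpeg/ffmpeg` relative to the current directory, which follows from the `mv` command above but is not confirmed as the path the repository uses. The helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional


def parse_ffmpeg_version(banner: str) -> Optional[str]:
    """Extract the version string from the output of `ffmpeg -version`."""
    match = re.search(r"ffmpeg version (\S+)", banner)
    return match.group(1) if match else None


def ffmpeg_version(binary: str = "./ffmpeg/ffmpeg") -> Optional[str]:
    """Run `<binary> -version`; return None if it is missing or fails."""
    try:
        result = subprocess.run(
            [binary, "-version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_ffmpeg_version(result.stdout)
```

If `ffmpeg_version()` returns `None`, the binary is not where this sketch expects it, so check the directory name produced by `tar`.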

(Optional) If you want to try the model with real-time streaming audio, also clone ChatTTS:

pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/

Launch the Gradio demo locally with:

python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus

Or launch the CLI locally with:

python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
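
If you prefer to start either entry point from a script, the two commands above can be assembled programmatically. This is only a sketch: `demo_command` is a hypothetical helper, and it uses nothing beyond the module names and the `--resume_from_checkpoint` flag shown above.

```python
import sys

CHECKPOINT = "chenjoya/videollm-online-8b-v1plus"


def demo_command(mode: str, checkpoint: str = CHECKPOINT) -> list:
    """Build the argv list for `python -m demo.app` or `python -m demo.cli`."""
    if mode not in ("app", "cli"):
        raise ValueError(f"mode must be 'app' or 'cli', got {mode!r}")
    # sys.executable keeps the launch inside the current Python environment.
    return [sys.executable, "-m", f"demo.{mode}", "--resume_from_checkpoint", checkpoint]
```

The resulting list can be passed to `subprocess.run(..., check=True)` with the working directory set to the cloned repository root, since `demo` is resolved as a package relative to that directory.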

Citation

@inproceedings{videollm-online,
  author       = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title        = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle    = {CVPR},
  year         = {2024},
}
