[GitHub](https://github.com/NVlabs/EAGLE) [arXiv](http://arxiv.org/abs/2501.14818) [Demo](https://huggingface.co/spaces/nvidia/Eagle2-Demo)
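Updated to `eagle_2_5_vl` to support the `generate` function.

We are thrilled to release our latest Eagle2 series of vision-language models. Open-source vision-language models (VLMs) have made significant progress in narrowing the gap with proprietary models. However, critical details about data strategies and implementation are often missing, which limits reproducibility and innovation. In this project, we focus on VLM post-training from a data-centric perspective and share insights into building effective data strategies from scratch. By combining these strategies with robust training recipes and model design, we introduce Eagle2, a family of performant VLMs. Our work aims to empower the open-source community to develop competitive VLMs through transparent processes.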
In this repo, we are open-sourcing Eagle2-2B, a lightweight model that achieves remarkable efficiency and speed while maintaining solid performance.
We provide the following models:

| Model name | LLM | Vision | Max length | HF link |
|---|---|---|---|---|
| Eagle2-1B | Qwen2.5-0.5B-Instruct | Siglip | 16K | 🤗 link |
| Eagle2-2B | Qwen2.5-1.5B-Instruct | Siglip | 16K | 🤗 link |
| Eagle2-9B | Qwen2.5-7B-Instruct | Siglip+ConvNext | 16K | 🤗 link |

| Benchmark | InternVL2-2B | InternVL2.5-2B | InternVL2-4B | Qwen2-VL-2B | Eagle2-2B |
|---|---|---|---|---|---|
| DocVQA (test) | 86.9 | 88.7 | 89.2 | 90.1 | 88.0 |
| ChartQA (test) | 76.2 | 79.2 | 81.5 | 73.0 | 82.0 |
| InfoVQA (test) | 58.9 | 60.9 | 67.0 | 65.5 | 65.8 |
| TextVQA (val) | 73.4 | 74.3 | 74.4 | 79.7 | 79.1 |
| OCRBench | 784 | 804 | 788 | 809 | 818 |
| MME (sum) | 1876.8 | 2138.2 | 2059.8 | 1872.0 | 2109.8 |
| RealWorldQA | 57.3 | 60.1 | 60.7 | 62.6 | 63.1 |
| AI2D (test) | 74.1 | 74.9 | 74.7 | 78.9 | 79.3 |
| MMMU (val) | 36.3 | 43.6 | 47.9 | 41.1 | 43.1 |
| MMVet (GPT-4-Turbo) | 39.5 | 60.8 | 51.0 | 49.5 | 53.8 |
| HallBench (avg) | 37.9 | 42.6 | 41.9 | 41.7 | 45.8 |
| MathVista (testmini) | 46.3 | 51.3 | 58.6 | 43.0 | 54.7 |
| MMstar | 50.1 | 53.7 | 54.3 | 48.0 | 56.4 |
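
The benchmark table can also be tallied programmatically. The snippet below is a minimal sketch and not part of the Eagle2 codebase: it copies the scores from the table above and lists the benchmarks on which a given model has the strictly highest score. The names `MODELS`, `SCORES`, and `benchmarks_led_by` are illustrative.

```python
# Scores copied from the benchmark table; columns follow the order in MODELS.
MODELS = ["InternVL2-2B", "InternVL2.5-2B", "InternVL2-4B", "Qwen2-VL-2B", "Eagle2-2B"]
SCORES = {
    "DocVQA": [86.9, 88.7, 89.2, 90.1, 88.0],
    "ChartQA": [76.2, 79.2, 81.5, 73.0, 82.0],
    "InfoVQA": [58.9, 60.9, 67.0, 65.5, 65.8],
    "TextVQA": [73.4, 74.3, 74.4, 79.7, 79.1],
    "OCRBench": [784, 804, 788, 809, 818],
    "MME": [1876.8, 2138.2, 2059.8, 1872.0, 2109.8],
    "RealWorldQA": [57.3, 60.1, 60.7, 62.6, 63.1],
    "AI2D": [74.1, 74.9, 74.7, 78.9, 79.3],
    "MMMU": [36.3, 43.6, 47.9, 41.1, 43.1],
    "MMVet": [39.5, 60.8, 51.0, 49.5, 53.8],
    "HallBench": [37.9, 42.6, 41.9, 41.7, 45.8],
    "MathVista": [46.3, 51.3, 58.6, 43.0, 54.7],
    "MMstar": [50.1, 53.7, 54.3, 48.0, 56.4],
}


def benchmarks_led_by(model: str) -> list[str]:
    """Return the benchmarks where `model` has the strictly highest score."""
    idx = MODELS.index(model)
    led = []
    for name, row in SCORES.items():
        others = row[:idx] + row[idx + 1:]
        if row[idx] > max(others):
            led.append(name)
    return led


eagle_wins = benchmarks_led_by("Eagle2-2B")
print(f"Eagle2-2B leads on {len(eagle_wins)} of {len(SCORES)} benchmarks: {eagle_wins}")
```

With the scores as listed, Eagle2-2B has the top score on 6 of the 13 benchmarks: ChartQA, OCRBench, RealWorldQA, AI2D, HallBench, and MMstar.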
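
We provide an inference script to help you get started with the model quickly. Several input types are supported:

Install the dependencies:

pip install transformers
pip install flash-attn

Single-image inference:

from PIL import Image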
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)  # trust_remote_code runs the model code shipped in the Hub repo
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"  # left-pad so each prompt ends right where generation starts

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Streaming generation:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
import torch
from transformers import TextIteratorStreamer
import threading
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)  # yields decoded text chunks as tokens are generated
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)  # generate in a background thread so the main thread can read the stream
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)

Multi-image inference:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe these two images."},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
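
The image examples above all build the same `messages` structure by hand: one user turn whose `content` lists the image entries followed by a text entry. As a convenience, that pattern could be wrapped in a small helper. The sketch below is illustrative only; `build_user_message` is a hypothetical name and not part of the Eagle2 processor or of `transformers`.

```python
from typing import Any


def build_user_message(text: str, image_urls: list[str]) -> list[dict[str, Any]]:
    """Build a single-turn `messages` list: all images first, then the text prompt."""
    content: list[dict[str, Any]] = [
        {"type": "image", "image": url} for url in image_urls
    ]
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


# Reproduces the structure used in the multi-image example.
messages = build_user_message(
    "Describe these two images.",
    [
        "https://www.ilankelman.org/stopsigns/australia.jpg",
        "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
    ],
)
print(messages[0]["content"][-1])
```

The returned list has the same shape as the hand-written `messages` in the examples, so it can be passed to `processor.apply_chat_template` and `processor.process_vision_info` in the same way.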

Single-video inference:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)  # also return the video settings to forward to the processor
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
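
Vision-language models typically see a video as a limited set of sampled frames, and the multi-video example in this document passes an `nframes` value per video. How the Eagle2 processor actually selects frames is not described here. The sketch below only illustrates one common strategy, uniform sampling, under that assumption; `uniform_frame_indices` is a hypothetical helper, not an Eagle2 or `transformers` API.

```python
def uniform_frame_indices(total_frames: int, nframes: int) -> list[int]:
    """Pick `nframes` evenly spaced frame indices from a clip of `total_frames` frames."""
    if total_frames <= 0 or nframes <= 0:
        raise ValueError("total_frames and nframes must be positive")
    nframes = min(nframes, total_frames)  # cannot sample more frames than exist
    if nframes == 1:
        return [0]
    # Spread the samples so the first and last frames are always included.
    step = (total_frames - 1) / (nframes - 1)
    return [round(i * step) for i in range(nframes)]


# Example: choose 10 frames from a 300-frame clip.
indices = uniform_frame_indices(total_frames=300, nframes=10)
print(indices)
```

Uniform sampling keeps temporal coverage of the whole clip at a fixed cost, which is why a frame budget such as `nframes` trades detail against sequence length.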

Multi-video inference:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
                "nframes": 10,
            },
            {
                "type": "video",
                "video": "../Eagle2-8B/video_ocr.mp4",
                "nframes": 10,
            },
            {"type": "text", "text": "Describe these two videos respectively."},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Batch inference:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"  # batched prompts differ in length, so pad on the left

messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]  # one prompt string per conversation
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
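print(output_text)

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure that it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI concerns here.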