[
](https://github.com/NVlabs/EAGLE) [
](http://arxiv.org/abs/2501.14818) [
](https://huggingface.co/spaces/nvidia/Eagle2-Demo)
eagle_2_5_vl,以支持 generate 功能。我们非常高兴发布最新的 Eagle2 系列视觉语言模型。开源视觉语言模型(VLM)在缩小与专有模型差距方面取得了显著进展。然而,关于数据策略和实现的关键细节往往缺失,限制了可复现性和创新。在本项目中,我们从数据中心视角专注于 VLM 后训练,分享从零开始构建有效数据策略的见解。通过将这些策略与稳健的训练方法和模型设计相结合,我们推出了 Eagle2——一系列高性能 VLM。我们的工作旨在赋能开源社区,以透明流程开发具有竞争力的 VLM。
在此仓库中,我们开源了 Eagle2-1B,这是一款紧凑高效的模型,专为需要快速推理和最小计算资源的场景设计,同时不牺牲基本性能。
我们提供以下模型:
| 模型名称 | LLM | 视觉模型 | 最大长度 | HF 链接 |
|---|---|---|---|---|
| Eagle2-1B | Qwen2.5-0.5B-Instruct | Siglip | 16K | 🤗 链接 |
| Eagle2-2B | Qwen2.5-1.5B-Instruct | Siglip | 16K | 🤗 链接 |
| Eagle2-9B | Qwen2.5-7B-Instruct | Siglip+ConvNext | 16K | 🤗 链接 |
| 基准测试 | LLaVa-One-Vision-0.5B | InternVL2-1B | InternVL2.5-1B | Qwen2-VL-2B | Eagle2-1B |
|---|---|---|---|---|---|
| DocVQAtest | 70.0 | 81.7 | 84.8 | 90.1 | 81.8 |
| ChartQAtest | 61.4 | 72.9 | 75.9 | 73.0 | 77.0 |
| InfoVQAtest | 41.8 | 50.9 | 56.0 | 65.5 | 54.8 |
| TextVQAval | - | 70.0 | 72.0 | 79.7 | 76.6 |
| OCRBench | 565 | 754 | 785 | 809 | 767 |
| MMEsum | 1438.0 | 1794.4 | 1950.5 | 1872.0 | 1790.2 |
| RealWorldQA | 55.6 | 50.3 | 57.5 | 62.6 | 55.4 |
| AI2Dtest | 57.1 | 64.1 | 69.3 | 74.7 | 70.9 |
| MMMUval | 31.4 | 36.7 | 40.9 | 41.1 | 38.8 |
| MMVetGPT-4-Turbo | 32.2 | 32.7 | 48.8 | 49.5 | 40.9 |
| HallBenchavg | 27.9 | 34.0 | 39.0 | 41.7 | 35.3 |
| MathVistatestmini | 33.8 | 37.7 | 43.2 | 43.0 | 45.3 |
| MMstar | 37.7 | 45.7 | 50.1 | 48.0 | 48.5 |
我们提供了一个推理脚本,帮助您快速开始使用模型。我们支持多种输入类型:
pip install transformers
pip install flash-attnfrom PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B",trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text = text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
import torch
from transformers import TextIteratorStreamer
import threading
model = AutoModel.from_pretrained("nvidia/Eagle2-1B",trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text = text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
streamer=streamer,
max_new_tokens=1024,
do_sample=True,
top_p=0.95,
temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
print(new_text, end="", flush=True)from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B",trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{
"type": "image",
"image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
},
{"type": "text", "text": "Describe these two images."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text = text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B",trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../Eagle2-8B/space_woaudio.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text = text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B",trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../Eagle2-8B/space_woaudio.mp4",
"nframes": 10,
},
{
"type": "video",
"video": "../Eagle2-8B/video_ocr.mp4",
"nframes": 10,
},
{"type": "text", "text": "Describe these two videos respectively."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text = text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B",trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages1 = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
messages2 = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text = text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)NVIDIA 认为可信 AI 是一项共同责任,我们已制定相关政策和实践,以支持广泛 AI 应用的开发。当根据我们的服务条款下载或使用本模型时,开发者应与其内部模型团队合作,确保该模型满足相关行业和用例的要求,并应对不可预见的产品误用问题。
请通过 此处 报告安全漏洞或 NVIDIA AI 相关问题。