📍在智谱AI开放平台上体验更大规模的CogVLM模型。
我们推出了新一代的CogVLM2系列模型,并开源了两个基于Meta-Llama-3-8B-Instruct构建的模型。与上一代的CogVLM开源模型相比,CogVLM2系列开源模型有以下改进:
TextVQA、DocVQA等多个基准测试中取得了显著提升。CogVlM2 Int4模型需要16G GPU内存,且必须在配备Nvidia GPU的Linux系统上运行。
| 模型名称 | cogvlm2-llama3-chat-19B-int4 | cogvlm2-llama3-chat-19B |
|---|---|---|
| 所需GPU内存 | 16G | 42G |
| 系统要求 | Linux(配备Nvidia GPU) | Linux(配备Nvidia GPU) |
我们的开源模型在许多列表中与上一代CogVLM开源模型相比取得了良好的成绩。其出色的性能可以与某些非开源模型竞争,如下表所示:
| 模型 | 开源 | 语言模型大小 | TextVQA | DocVQA | ChartQA | OCRbench | MMMU | MMVet | MMBench |
|---|---|---|---|---|---|---|---|---|---|
| CogVLM1.1 | ✅ | 7B | 69.7 | - | 68.3 | 590 | 37.3 | 52.0 | 65.8 |
| LLaVA-1.5 | ✅ | 13B | 61.3 | - | - | 337 | 37.0 | 35.4 | 67.7 |
| Mini-Gemini | ✅ | 34B | 74.1 | - | - | - | 48.0 | 59.3 | 80.6 |
| LLaVA-NeXT-LLaMA3 | ✅ | 8B | - | 78.2 | 69.5 | - | 41.7 | - | 72.1 |
| LLaVA-NeXT-110B | ✅ | 110B | - | 85.7 | 79.7 | - | 49.1 | - | 80.5 |
| InternVL-1.5 | ✅ | 20B | 80.6 | 90.9 | 83.8 | 720 | 46.8 | 55.4 | 82.3 |
| QwenVL-Plus | ❌ | - | 78.9 | 91.4 | 78.1 | 726 | 51.4 | 55.7 | 67.0 |
| Claude3-Opus | ❌ | - | - | 89.3 | 80.8 | 694 | 59.4 | 51.7 | 63.3 |
| Gemini Pro 1.5 | ❌ | - | 73.5 | 86.5 | 81.3 | - | 58.5 | - | - |
| GPT-4V | ❌ | - | 78.0 | 88.4 | 78.5 | 656 | 56.8 | 67.7 | 75.0 |
| CogVLM2-LLaMA3 (Ours) | ✅ | 8B | 84.2 | 92.3 | 81.0 | 756 | 44.3 | 60.4 | 80.5 |
| CogVLM2-LLaMA3-Chinese (Ours) | ✅ | 8B | 85.0 | 88.4 | 74.7 | 780 | 42.8 | 60.5 | 78.9 |
所有评估结果均未使用任何外部OCR工具(“仅像素”)获得。
以下是一个如何使用模型与CogVLM2模型进行聊天的简单示例。更多用例请查看我们的GitHub。
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=TORCH_TYPE,
trust_remote_code=True,
low_cpu_mem_usage=True,
).eval()
text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
while True:
image_path = input("image path >>>>> ")
if image_path == '':
print('You did not enter image path, the following will be a plain text conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
if text_only_first_query:
query = text_only_template.format(query)
text_only_first_query = False
else:
old_prompt = ''
for _, (old_query, response) in enumerate(history):
old_prompt += old_query + " " + response + "\n"
query = old_prompt + "USER: {} ASSISTANT:".format(query)
if image is None:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
template_version='chat'
)
else:
input_by_model = model.build_conversation_input_ids(
tokenizer,
query=query,
history=history,
images=[image],
template_version='chat'
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
}
gen_kwargs = {
"max_new_tokens": 2048,
"pad_token_id": 128002,
}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0])
response = response.split("<|end_of_text|>")[0]
print("\nCogVLM2:", response)
history.append((query, response))本模型基于 CogVLM2 许可证 发布。对于使用 Meta Llama 3 构建的模型,请同时遵守 LLAMA3_LICENSE。
如果您认为我们的工作对您有帮助,请考虑引用以下论文:
@misc{wang2023cogvlm,
title={CogVLM: Visual Expert for Pretrained Language Models},
author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2311.03079},
archivePrefix={arXiv},
primaryClass={cs.CV}
}当然,我可以帮助您进行文本的翻译。不过,您需要先提供需要翻译的原始文本。请您提供文本内容,我将会按照您的要求进行翻译。