CogVLM2

📍Try larger-scale CogVLM models on the Zhipu AI Open Platform.

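Model introduction

We have launched the new-generation CogVLM2 series and open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 open-source models bring the following improvements: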

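Significant improvements on many benchmarks, such as TextVQA and DocVQA.
Support for 8K content length.
Support for image resolutions up to 1344 * 1344.
An open-source model version that supports both Chinese and English.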

The CogVLM2 Int4 model requires 16 GB of GPU memory and must run on a Linux system with an Nvidia GPU.

Model name	cogvlm2-llama3-chinese-chat-19B-int4	cogvlm2-llama3-chinese-chat-19B
GPU memory required	16G	42G
System required	Linux (with Nvidia GPU)	Linux (with Nvidia GPU)
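
The requirements table can be turned into a small pre-flight check. The sketch below is a minimal illustration and not part of the CogVLM2 code: `pick_variant` is a hypothetical helper, and it assumes the figures in the table can be treated as minimum total GPU memory in GB. In practice the input would come from your own GPU, for example from `torch.cuda.get_device_properties(0).total_memory` converted from bytes.

```python
from typing import Optional

# Memory figures copied from the requirements table above (GB).
# Treating them as hard minimums is an assumption of this sketch.
VARIANT_MIN_GB = [
    ("cogvlm2-llama3-chinese-chat-19B", 42),
    ("cogvlm2-llama3-chinese-chat-19B-int4", 16),
]


def pick_variant(total_gpu_gb: float) -> Optional[str]:
    """Return the largest variant whose listed memory need fits, or None."""
    for name, min_gb in VARIANT_MIN_GB:  # ordered from largest to smallest
        if total_gpu_gb >= min_gb:
            return name
    return None  # below 16 GB, neither variant in the table fits


if __name__ == "__main__":
    print(pick_variant(24))
```

A 24 GB card falls between the two thresholds, so the helper picks the Int4 variant; a return value of `None` means the GPU is below the smaller requirement.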

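Benchmark

Compared with the previous generation of open-source CogVLM models, our open-source models achieve good results on many leaderboards. Their performance is competitive with some non-open-source models, as shown in the table below:

Model	Open source	LLM size	TextVQA	DocVQA	ChartQA	OCRbench	MMMU	MMVet	MMBench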
CogVLM1.1	✅	7B	69.7	-	68.3	590	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	56.8	67.7	75.0
CogVLM2-LLaMA3 (Ours)	✅	8B	84.2	92.3	81.0	756	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese (Ours)	✅	8B	85.0	88.4	74.7	780	42.8	60.5	78.9

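All evaluation results were obtained without using any external OCR tools ("pixel only").

Quick start

Below is a simple example of how to chat with the CogVLM2 model. For more use cases, see our GitHub page.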

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chinese-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Use bfloat16 on GPUs with compute capability >= 8, otherwise fall back to float16.
TORCH_TYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
    else torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter an image path; the following will be a plain-text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                # Rebuild the running prompt from earlier turns without shadowing `response`.
                for old_query, old_response in history:
                    old_prompt += old_query + " " + old_response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
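
The text-only prompt assembly and the output trimming in the loop above can be isolated as two pure functions, which makes them easy to check without loading the model. This is a minimal sketch: `build_text_only_prompt` and `trim_response` are hypothetical helper names, not part of the CogVLM2 API. It assumes `history` holds `(prompt, response)` pairs as in the loop, and that an empty history corresponds to the first text-only turn.

```python
from typing import List, Tuple

TEXT_ONLY_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: {} ASSISTANT:"
)
END_OF_TEXT = "<|end_of_text|>"


def build_text_only_prompt(query: str, history: List[Tuple[str, str]]) -> str:
    """Mirror the loop's text-only branch: template first, then replay history."""
    if not history:
        # First turn: wrap the query in the full system-style template.
        return TEXT_ONLY_TEMPLATE.format(query)
    # Later turns: concatenate earlier prompt/response pairs, one per line.
    old_prompt = "".join(f"{old_query} {old_response}\n" for old_query, old_response in history)
    return old_prompt + "USER: {} ASSISTANT:".format(query)


def trim_response(decoded: str) -> str:
    """Keep only the text before the first end-of-text marker."""
    return decoded.split(END_OF_TEXT)[0]


if __name__ == "__main__":
    first = build_text_only_prompt("Hello", [])
    second = build_text_only_prompt("And then?", [(first, "Hi there.")])
    print(second)
    print(trim_response("Hi there.<|end_of_text|>ignored"))
```

Keeping these steps separate from `model.generate` means the prompt format can be inspected or unit-tested on a machine without a GPU.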

License

This model is released under the CogVLM2 license. For models built with Meta Llama 3, please also comply with the LLAMA3_LICENSE.

Citation

If you find our work helpful, please consider citing the following paper:

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
