CogVLM2

👋 Join us on WeChat

📍 Try larger-scale CogVLM models on the Zhipu AI Open Platform.

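Model Introduction

We are launching the new-generation CogVLM2 series and open-sourcing two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 open-source models offer the following improvements:

Significantly better performance on many benchmarks, such as TextVQA and DocVQA.
Support for 8K content length.
Support for image resolutions up to 1344 * 1344.
An open-source model version that supports both Chinese and English.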
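
As a rough illustration of the 1344 * 1344 limit, the sketch below computes a target size that keeps an image within that bound while preserving its aspect ratio. The helper name `fit_within` is hypothetical, and whether the model's own preprocessing already resizes inputs is not stated here, so treat this as an optional pre-check rather than a required step.

```python
MAX_SIDE = 1344  # maximum image resolution listed above


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return (width, height) scaled down, if needed, so both sides are <= max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit; never upscale
    scale = max_side / longest
    # Guard against a side collapsing to 0 for extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The returned pair can be passed to an image library's resize call (for example `Image.resize` in Pillow) before the image is handed to the model.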

The table below lists the details of the CogVLM2 open-source models:

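Model name	cogvlm2-llama3-chat-19B	cogvlm2-llama3-chinese-chat-19B
Base model	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct
Language	English	Chinese, English
Model size	19B	19B
Task	Image understanding, dialogue model	Image understanding, dialogue model
Text length	8K	8K
Image resolution	1344 * 1344	1344 * 1344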
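
Long multi-turn conversations can outgrow the 8K text length listed in the table. One possible way to stay under it is to drop the oldest turns first, as in the sketch below. This is an assumption-laden illustration: `trim_history` and `whitespace_count` are hypothetical helpers, "8K" is taken to mean 8192 tokens, and the whitespace counter is only a stand-in for a real tokenizer. It also ignores tokens consumed by images and prompt templates.

```python
from typing import Callable, List, Tuple

CONTEXT_BUDGET = 8192  # assumption: "8K" interpreted as 8192 tokens


def trim_history(
    history: List[Tuple[str, str]],
    query: str,
    count_tokens: Callable[[str], int],
    budget: int = CONTEXT_BUDGET,
) -> List[Tuple[str, str]]:
    """Keep the most recent (query, response) turns that fit within the budget."""
    remaining = budget - count_tokens(query)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for old_query, response in reversed(history):
        cost = count_tokens(old_query) + count_tokens(response)
        if cost > remaining:
            break
        kept.append((old_query, response))
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


def whitespace_count(text: str) -> int:
    """Stand-in token counter for illustration; a real tokenizer gives different counts."""
    return len(text.split())
```

In practice the counter would wrap the model's tokenizer, and the budget would leave headroom for the generated reply.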

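Benchmark

Compared with the previous generation of open-source CogVLM models, our open-source models achieve strong results on many benchmarks. Their performance is competitive with some closed-source models, as shown in the table below: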

Model	Open Source	LLM Size	TextVQA	DocVQA	ChartQA	OCRbench	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	56.8	67.7	75.0
CogVLM2-LLaMA3 (Ours)	✅	8B	84.2	92.3	81.0	756	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese (Ours)	✅	8B	85.0	88.4	74.7	780	42.8	60.5	78.9

All results were obtained without using any external OCR tools ("pixel only").
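
To make the comparison with the previous generation concrete, the short snippet below computes the per-benchmark difference between the CogVLM1.1 and CogVLM2-LLaMA3 rows. The scores are copied from the table above; DocVQA is left out because CogVLM1.1 has no entry for it. The variable names are illustrative only.

```python
# Scores copied from the benchmark table (DocVQA omitted: no CogVLM1.1 entry).
cogvlm1_1 = {"TextVQA": 69.7, "ChartQA": 68.3, "OCRbench": 590,
             "MMMU": 37.3, "MMVet": 52.0, "MMBench": 65.8}
cogvlm2 = {"TextVQA": 84.2, "ChartQA": 81.0, "OCRbench": 756,
           "MMMU": 44.3, "MMVet": 60.4, "MMBench": 80.5}

# Round to one decimal to hide floating-point noise in the subtraction.
gains = {name: round(cogvlm2[name] - cogvlm1_1[name], 1) for name in cogvlm1_1}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```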

Quick Start

Below is a simple example of how to chat with the CogVLM2 model.

import torch
import torch_npu
from PIL import Image
from openmind import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "AI-Research/cogvlm2-llama3-chinese-chat-19b"
DEVICE = 'npu' if torch.npu.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.npu.is_available() else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('No image path entered; continuing as a text-only conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                # Replay earlier turns as plain text ahead of the new query.
                old_prompt = ''
                for old_query, response in history:
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
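            # Drop the prompt tokens and keep only the newly generated ones.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            # Cut the reply at the end-of-text marker.
            response = response.split("<|end_of_text|>")[0]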
            print("\nCogVLM2:", response)
        history.append((query, response))
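
The text-only prompt handling in the loop above can also be read as two small pure functions, which makes it easier to check in isolation. The sketch below is a restatement under assumptions, not part of the model's API: `build_text_only_query` and `trim_response` are hypothetical names, and an empty `history` is used in place of the `text_only_first_query` flag to detect the first turn.

```python
from typing import List, Tuple

TEXT_ONLY_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {} ASSISTANT:"
)
END_OF_TEXT = "<|end_of_text|>"


def build_text_only_query(query: str, history: List[Tuple[str, str]]) -> str:
    """Build the text-only prompt the same way the chat loop does."""
    if not history:
        # First turn: wrap the query in the full template.
        return TEXT_ONLY_TEMPLATE.format(query)
    # Later turns: replay earlier (prompt, response) pairs, then append the new query.
    old_prompt = "".join(f"{old_query} {response}\n" for old_query, response in history)
    return old_prompt + "USER: {} ASSISTANT:".format(query)


def trim_response(decoded: str) -> str:
    """Drop everything from the first end-of-text marker onward."""
    return decoded.split(END_OF_TEXT)[0]
```

Because neither function touches the model or tokenizer, both can be exercised without loading any weights.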

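License

This model is released under the CogVLM2 LICENSE. For models built with Meta Llama 3, please also adhere to the LLAMA3_LICENSE.

Citation

If you find our work helpful, please consider citing the following paper: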

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
