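Introducing DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL has general multimodal understanding capabilities and can handle logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.
Haoyu Lu*, Wen Liu*, Bo Zhang**, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (*Equal Contribution, **Project Lead)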

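DeepSeek-VL-7b-base uses SigLIP-L and SAM-B as a hybrid vision encoder supporting 1024 x 1024 image input, and is built on DeepSeek-LLM-7b-base, which was trained on approximately 2T text tokens. The whole DeepSeek-VL-7b-base model is then trained on around 400B vision-language tokens. DeepSeek-VL-7b-chat is the instruction-tuned version based on DeepSeek-VL-7b-base.

In a Python >= 3.8 environment, install the necessary dependencies by running the following commands: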
git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .

A simple inference example:
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images
# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)
# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
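
The conversation list in the example above is plain Python data, so it can be built by a small helper when the prompt or images change. The following is only a sketch: `build_conversation` and `IMAGE_PLACEHOLDER` are hypothetical names that are not part of `deepseek_vl`, and prepending one `<image_placeholder>` per image is an assumption extrapolated from the single-image example.

```python
from typing import Dict, List

# Placeholder string copied from the single-image example above.
IMAGE_PLACEHOLDER = "<image_placeholder>"


def build_conversation(prompt: str, image_paths: List[str]) -> List[Dict]:
    """Build a single-turn conversation in the format shown above.

    Hypothetical helper: one placeholder is prepended per image (an
    assumption), and the Assistant turn is left empty for the model to fill.
    """
    if not image_paths:
        # This sketch only covers the image-plus-text case shown above.
        raise ValueError("at least one image path is required")
    user_turn = {
        "role": "User",
        "content": IMAGE_PLACEHOLDER * len(image_paths) + prompt,
        "images": list(image_paths),
    }
    assistant_turn = {"role": "Assistant", "content": ""}
    return [user_turn, assistant_turn]
```

Calling `build_conversation("Describe each stage of this image.", ["./images/training_pipelines.png"])` yields the same list as the hand-written `conversation` above, which can then be passed to `load_pil_images` and `vl_chat_processor` unchanged.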

To chat with the model from the command line:
python cli_chat.py --model_path "deepseek-ai/deepseek-vl-7b-chat"
# or local path
python cli_chat.py --model_path "local model path"
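
When the CLI needs to be launched from another Python script, the same command can be assembled as an argument list, which avoids shell quoting problems with local paths that contain spaces. This is a minimal sketch: `cli_chat_command` is a hypothetical helper, it uses only the `--model_path` flag shown above, and it assumes the working directory is the repository root.

```python
import sys
from typing import List


def cli_chat_command(model_path: str, script: str = "cli_chat.py") -> List[str]:
    """Return the argv list equivalent to the shell command above.

    Hypothetical helper: only the --model_path flag from the README is used,
    and the current interpreter is reused so the installed package is found.
    """
    return [sys.executable, script, "--model_path", model_path]
```

The result can be handed to `subprocess.run(cli_chat_command("deepseek-ai/deepseek-vl-7b-chat"), check=True)` to start the interactive session.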
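
This code repository is licensed under the MIT License. Use of the DeepSeek-VL Base/Chat models is subject to the DeepSeek Model License. The DeepSeek-VL series (including Base and Chat) supports commercial use.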
@misc{lu2024deepseekvl,
title={DeepSeek-VL: Towards Real-World Vision-Language Understanding},
author={Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Hao Yang and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan},
year={2024},
eprint={2403.05525},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

If you have any questions, please raise an issue or contact us at service@deepseek.com.