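deepseek-vl-1.3b-base: DeepSeek-VL-1.3b-base is a compact open-source model for vision and language understanding. It handles images, diagrams, and web content, recognizes formulas, and reads scientific literature, offering an integrated vision-language solution for complex, real-world scenarios.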

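1. Introduction

Introducing DeepSeek-VL, an open-source vision-language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL has general multimodal understanding capabilities and can process logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.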

Haoyu Lu*, Wen Liu*, Bo Zhang**, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (*Equal contribution, **Project lead)

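2. Model Summary

DeepSeek-VL-1.3b-base is a small vision-language model. It uses SigLIP-L as the vision encoder, which supports 384 x 384 image input, and is built on DeepSeek-LLM-1.3b-base, which was trained on approximately 500B text tokens. The complete DeepSeek-VL-1.3b-base model was then trained on around 400B vision-language tokens.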
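
To give a rough sense of what a 384 x 384 input means for sequence length, the sketch below counts the non-overlapping patches a ViT-style encoder would produce. The patch size of 16 is an assumption and is not stated in this document; the function name is hypothetical. Check the released model configuration for the actual values.

```python
def num_image_patches(image_size: int = 384, patch_size: int = 16) -> int:
    """Count non-overlapping square patches in a square image.

    patch_size=16 is an assumed value, not taken from this README.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# 384 / 16 = 24 patches per side, so 24 * 24 patches in total
print(num_image_patches())
```

Under that assumption, each image would contribute a fixed number of visual tokens to the language model's input, regardless of image content.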

3. Quick Start

Installation

In a Python >= 3.8 environment, install the necessary dependencies by running the following commands:

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL

pip install -e .
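
Before running inference, it can help to confirm that the environment meets the requirements above. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the check only looks for an importable `deepseek_vl` package without validating its version or its dependencies.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def package_is_installed(name: str = "deepseek_vl") -> bool:
    """Return True if a top-level package with this name can be located."""
    return importlib.util.find_spec(name) is not None


if python_is_supported():
    print("Python version OK")
else:
    print("Python >= 3.8 is required")

if package_is_installed():
    print("deepseek_vl is importable")
else:
    print("deepseek_vl not found; run `pip install -e .` inside the cloned repository")
```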

Simple Inference Example

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images


# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-1.3b-base"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
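
The `conversation` list in the example above follows a fixed shape: a `User` turn whose content begins with `<image_placeholder>` and carries an `images` list, followed by an empty `Assistant` turn. The sketch below wraps that shape in a small helper. `build_conversation` is a hypothetical name, not part of the `deepseek_vl` package, and emitting one placeholder per image path is an assumption extrapolated from the single-image example.

```python
from typing import Dict, List

IMAGE_PLACEHOLDER = "<image_placeholder>"  # tag used in the example above


def build_conversation(prompt: str, image_paths: List[str]) -> List[Dict[str, object]]:
    """Build a single-turn conversation in the format shown in the example.

    Assumption: one placeholder is prepended per image path.
    """
    placeholders = IMAGE_PLACEHOLDER * len(image_paths)
    return [
        {
            "role": "User",
            "content": placeholders + prompt,
            "images": list(image_paths),
        },
        # An empty Assistant turn marks where the model's reply will go
        {"role": "Assistant", "content": ""},
    ]


conversation = build_conversation(
    "Describe each stage of this image.",
    ["./images/training_pipelines.png"],
)
print(conversation[0]["content"])
```

The resulting list can be passed to `load_pil_images` and `vl_chat_processor` in place of the hand-written `conversation` literal.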

CLI Chat


python cli_chat.py --model_path "deepseek-ai/deepseek-vl-1.3b-base"

# or local path
python cli_chat.py --model_path "local model path"

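4. License

This code repository is licensed under the MIT License. Use of the DeepSeek-VL Base/Chat models is subject to the DeepSeek Model License. The DeepSeek-VL series (including Base and Chat) supports commercial use.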

5. Citation

@misc{lu2024deepseekvl,
      title={DeepSeek-VL: Towards Real-World Vision-Language Understanding}, 
      author={Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan},
      year={2024},
      eprint={2403.05525},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

6. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.