Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

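Domain capabilities strengthened through continued pre-training | 3B to 70B parameters | Enhanced document understanding and OCR | Chain-of-thought reasoning support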

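Model Description

Qianfan-VL is a series of enhanced general-purpose multimodal large language models built for enterprise multimodal applications. The models retain strong general capabilities while being deeply optimized for the scenarios that occur most often in industrial deployment.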

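Model Variants

Model	Parameters	Context Length	CoT Support	Use Cases
Qianfan-VL-3B	3B	32k	❌	Edge deployment, real-time OCR
Qianfan-VL-8B	8B	32k	✅	General server-side scenarios, fine-tuning
Qianfan-VL-70B	70B	32k	✅	Complex reasoning, data synthesis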
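
As a rough illustration of how the table might guide the choice of checkpoint, the sketch below encodes the three rows and filters them by parameter budget and chain-of-thought support. The names `VARIANTS` and `pick_variant` are hypothetical and introduced only for this example; they are not part of any Qianfan-VL API.

```python
# Hypothetical helper: the data mirrors the variant table above;
# VARIANTS and pick_variant are illustrative names, not a real API.
VARIANTS = [
    {"name": "Qianfan-VL-3B", "params_b": 3, "context": "32k", "cot": False},
    {"name": "Qianfan-VL-8B", "params_b": 8, "context": "32k", "cot": True},
    {"name": "Qianfan-VL-70B", "params_b": 70, "context": "32k", "cot": True},
]

def pick_variant(need_cot, max_params_b):
    """Return the largest variant that fits the parameter budget, or None."""
    candidates = [
        v for v in VARIANTS
        if v["params_b"] <= max_params_b and (v["cot"] or not need_cot)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["params_b"])["name"]

print(pick_variant(need_cot=True, max_params_b=10))   # Qianfan-VL-8B
print(pick_variant(need_cot=False, max_params_b=4))   # Qianfan-VL-3B
print(pick_variant(need_cot=True, max_params_b=4))    # None
```

Real deployments would also weigh memory, latency, and accuracy on the target task, which this sketch does not model.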

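Technical Architecture

Language model:
- Qianfan-VL-3B: based on Qwen2.5-3B
- Qianfan-VL-8B/70B: based on the Llama 3.1 architecture
- Enhanced with a 3T multilingual corpus
Vision encoder: based on InternViT, with dynamic tiling and support for up to 4K resolution
Cross-modal fusion: an MLP adapter that bridges vision and language efficiently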

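Core Capabilities

🔍 OCR and Document Understanding

Full-scenario OCR: handwriting, formulas, natural scenes, cards and documents
Document intelligence: layout analysis, table parsing, chart understanding, document question answering
High accuracy: Qianfan-VL-70B has the top score on three of the six OCR and document benchmarks reported below (AI2D_TEST, OCRVQA_TEST, ChartQA_TEST)

🧮 Chain-of-Thought Reasoning (8B and 70B)

Complex chart analysis and reasoning
Mathematical problem solving with step-by-step derivation
Visual reasoning and logical inference
Statistical computation and trend prediction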

📊 Benchmark Performance

General Vision-Language Benchmarks

Benchmark	Qianfan-VL-3B	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
A-Bench_VAL	75.65	75.72	78.1	75.86	75.86	76.49	79.22
CCBench	66.86	70.39	80.98	77.84	70.78	57.65	73.73
SEEDBench_IMG	76.55	78.02	79.13	77.0	77.52	76.98	78.34
SEEDBench2_Plus	67.59	70.97	73.17	69.52	68.47	70.93	73.25
MMVet	48.17	53.21	67.34	80.28	78.9	70.64	75.69
MMMU_VAL	46.44	47.11	58.33	56.11	60.78	51.0	65.78
ScienceQA_TEST	95.19	97.62	98.76	97.97	97.17	85.47	92.51
ScienceQA_VAL	93.85	97.62	98.81	97.81	95.14	83.59	91.32
MMT-Bench_VAL	62.23	63.22	71.06	65.17	63.67	61.4	69.49
MTVQA_TEST	26.5	30.14	32.18	30.3	27.62	29.08	31.48
BLINK	49.97	56.81	59.44	55.87	51.87	54.55	63.02
MMStar	57.93	64.07	69.47	68.4	66.07	61.53	66.0
RealWorldQA	65.75	70.59	71.63	71.11	74.25	69.28	73.86
Q-Bench1_VAL	73.51	75.25	77.46	75.99	77.99	78.1	79.93
POPE	85.08	86.06	88.97	90.59	88.87	85.97	83.35
RefCOCO (Avg)	85.94	89.37	91.01	89.65	91.40	86.56	90.25

OCR and Document Understanding

Benchmark	Qianfan-VL-3B	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-3B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
OCRBench	831	854	873	881	847	810	883	874
AI2D_TEST	81.38	85.07	87.23	85.07	83.55	77.07	80.472	83.84
OCRVQA_TEST	66.15	68.98	74.06	39.03	35.58	69.24	71.02	66.8
TextVQA_VAL	80.11	82.13	84.48	82.15	83.52	79.09	84.962	83.26
DocVQA_VAL	90.85	93.54	94.75	92.04	83.82	92.71	94.91	95.75
ChartQA_TEST	81.79	87.72	89.6	85.76	82.04	83.4	86.68	87.16
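
The OCRBench row is reported as a raw score while the other rows read as percentages. The sketch below puts the Qianfan-VL OCRBench scores on a 0 to 100 scale for side-by-side reading. It assumes the OCRBench maximum is 1000, which is not stated in this document; the `to_percent` helper is a hypothetical name.

```python
# Scores copied from the OCRBench row of the table above.
ocrbench_raw = {
    "Qianfan-VL-3B": 831,
    "Qianfan-VL-8B": 854,
    "Qianfan-VL-70B": 873,
}

def to_percent(raw_score, max_score=1000):
    """Convert a raw score to a percentage of the (assumed) maximum score."""
    return 100.0 * raw_score / max_score

ocrbench_pct = {name: to_percent(score) for name, score in ocrbench_raw.items()}
for name, pct in ocrbench_pct.items():
    print(f"{name}: {pct:.1f}")
```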

Mathematical Reasoning

Benchmark	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
Mathvista-mini	69.19	78.6	69.5	70.1	67.2	73.9
Mathvision	32.82	50.29	29.61	34.8	25.95	39.34
Mathverse	48.4	61.04	43.68	49.26	44.21	55.18
ChartQA Pro	50.43	52	37.32	44.43	43.73	45.3
HallusionBench	51.72	54.52	49.2	40.2	47.9	49.9
InHouse Dataset A	59.87	71.78	40.64	41.47	45.58	57.2
InHouse Dataset B	61.33	75.6	36.25	42.65	30.62	59.68

Quick Start

Installation

pip install transformers accelerate torch torchvision pillow einops
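
Before loading a model it can help to confirm that the packages from the command above are actually present in the active environment. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the document does not state which versions are required.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
REQUIRED = ["transformers", "accelerate", "torch", "torchvision", "pillow", "einops"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```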

Using Transformers

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # aspect ratio of the input image
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) tile grid with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16)

# Inference (the prompt asks the model to recognize all text in the image)
prompt = "<image>请识别图中所有文字"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
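
To make the tiling step concrete, here is a dependency-free sketch that mirrors the grid selection in `find_closest_aspect_ratio` and `dynamic_preprocess` above and reports which grid a given image size maps to. `closest_grid` is a hypothetical name introduced for this example; it works on sizes only and never touches pixel data.

```python
def closest_grid(width, height, max_num=12, min_num=1, image_size=448):
    """Return the (columns, rows) tile grid chosen for an image of the given size."""
    aspect_ratio = width / height
    # All grids whose tile count lies in [min_num, max_num], smallest first.
    grids = sorted(
        {(i, j) for i in range(1, max_num + 1) for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: g[0] * g[1],
    )
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size**2 * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels.
            best = (cols, rows)
    return best

cols, rows = closest_grid(1920, 1080)
tiles = cols * rows
# grid, resized size, tile count, tile count including the thumbnail
print((cols, rows), (cols * 448, rows * 448), tiles, tiles + 1)
# (4, 2) (1792, 896) 8 9
```

With the defaults used in `load_image` (`max_num=12`, `use_thumbnail=True`), a 1920×1080 image is therefore resized to 1792×896, cut into 8 tiles, and stacked with one thumbnail, giving a `pixel_values` tensor of 9 tiles.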

Using vLLM

You can deploy Qianfan-VL with the official vLLM Docker image to get high-performance inference through an OpenAI-compatible API:

Start the vLLM Server

docker run -d --name qianfan-vl \
  --gpus all \
  -v /path/to/Qianfan-VL-8B:/model \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name qianfan-vl \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
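
Loading the weights can take a while after the container starts, so a client may want to wait until the server answers before sending requests. The sketch below polls the `/v1/models` route of the OpenAI-compatible API using only the standard library. `wait_for_server` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url, timeout_s=300.0, interval_s=2.0):
    """Poll GET {base_url}/v1/models until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval_s)
    return False

# Example (assumes the container above is mapped to port 8000):
# if not wait_for_server("http://127.0.0.1:8000"):
#     raise SystemExit("vLLM server did not come up in time")
```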

Call the API

curl 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qianfan-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"
            }
          },
          {
            "type": "text",
            "text": "<image>请识别图中所有文字"
          }
        ]
      }
    ]
  }'

Or use Python with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1"
)

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"}
                },
                {
                    "type": "text",
                    "text": "<image>请描述这张图片"
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
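
The requests above reference an image by public URL. For a local file, one common approach with OpenAI-compatible servers is to embed the image as a base64 data URL in the same `image_url` field. The sketch below builds such a message; `image_to_data_url` and `build_messages` are hypothetical helper names, and whether a given server version accepts data URLs should be checked against its documentation.

```python
import base64
import mimetypes
from pathlib import Path

def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def build_messages(image_path, prompt):
    """Build a chat `messages` list in the same shape as the requests above."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            {"type": "text", "text": prompt},
        ],
    }]

# Usage with the client defined above (requires a running server):
# response = client.chat.completions.create(
#     model="qianfan-vl",
#     messages=build_messages("./example/scene_ocr.png", "<image>请描述这张图片"),
#     max_tokens=512,
# )
```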

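Training Details

Four-Stage Progressive Training

Cross-modal alignment (100B tokens): establishes the vision-language connection
General knowledge injection (3.5T tokens): builds strong foundational capabilities
Domain enhancement (300B tokens): specializes OCR and reasoning capabilities
Post-training (1B tokens): instruction following and preference alignment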
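
The four stages differ in size by several orders of magnitude. The arithmetic below sums the token counts listed above and shows each stage's share of the total; it uses only the figures from this section.

```python
# Token counts per stage, as listed above.
STAGES = {
    "cross-modal alignment": 100_000_000_000,          # 100B
    "general knowledge injection": 3_500_000_000_000,  # 3.5T
    "domain enhancement": 300_000_000_000,             # 300B
    "post-training": 1_000_000_000,                    # 1B
}

total_tokens = sum(STAGES.values())
print(f"total: {total_tokens / 1e12:.3f}T tokens")  # total: 3.901T tokens
for stage, tokens in STAGES.items():
    print(f"{stage}: {100 * tokens / total_tokens:.2f}% of the budget")
```

General knowledge injection accounts for roughly 90% of the roughly 3.9T tokens, while domain enhancement takes under 8%.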

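Infrastructure

Trained on more than 5,000 Baidu Kunlun chips
A single training job runs in parallel across 5,000 chips
Large-scale distributed training reaches over 90% scaling efficiency
Novel communication-computation fusion techniques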
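
The document does not define how scaling efficiency is measured. Under the common definition (measured throughput on N chips divided by N times the single-chip throughput), the figure above can be read as in the sketch below. Both function names are hypothetical, and the 0.90 value is the lower bound quoted above rather than a measured number.

```python
def scaling_efficiency(throughput_n, throughput_1, n_chips):
    """Measured throughput on n chips divided by ideal linear throughput."""
    return throughput_n / (throughput_1 * n_chips)

def effective_chips(n_chips, efficiency):
    """Number of ideally scaled chips that would deliver the same throughput."""
    return n_chips * efficiency

# Illustrative numbers only: 5000 chips at the 90% lower bound quoted above.
print(effective_chips(5000, 0.90))  # 4500.0
```

Read this way, a 5,000-chip job at 90% efficiency delivers the throughput of about 4,500 chips scaling perfectly.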

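Model Card

Developed by: Baidu AI Cloud Qianfan Team
Model type: vision-language Transformer
Languages: multilingual
License: [see the model card for specific license information]
Base architecture: see the technical report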

Citation

If you use Qianfan-VL in your research, please cite:

@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}

Contact

For more information and API access, visit the Baidu Qianfan Platform.

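Acknowledgments

This model series represents a significant advance in multimodal AI, combining general capabilities with domain-specific enhancement to meet the needs of real-world applications.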