This repository contains the FP8-quantized version of the Qwen3-VL-4B-Instruct model. The quantization method is fine-grained FP8 quantization with a block size of 128, and its performance is nearly identical to that of the original BF16 model. Enjoy!
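For intuition, the sketch below shows what fine-grained, block-wise FP8 quantization with 128x128 blocks and one scale per block can look like in PyTorch. It is a simplified illustration under assumed details (e4m3 format, per-block absolute-maximum scaling) and not the exact recipe used to produce this checkpoint.

# A minimal sketch of block-wise FP8 (e4m3) quantization with 128x128 blocks.
# Illustration only; not the exact procedure used for this checkpoint.
import torch

def quantize_fp8_blockwise(weight: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to FP8 with one scale per 128x128 block."""
    out_f, in_f = weight.shape
    assert out_f % block == 0 and in_f % block == 0
    # Split rows and columns into 128-sized blocks: (O//b, b, I//b, b).
    w = weight.reshape(out_f // block, block, in_f // block, block)
    # Per-block absolute maximum -> per-block scale (448 is the e4m3 max value).
    amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / 448.0
    q = (w / scale).to(torch.float8_e4m3fn)
    return q.reshape(out_f, in_f), scale.reshape(out_f // block, in_f // block)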
Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.
This generation delivers comprehensive upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, significantly longer context, stronger understanding of spatial relationships and video dynamics, and more capable agent interaction.
Qwen3-VL is available in Dense and Mixture-of-Experts (MoE) architectures that scale from edge devices to the cloud, and comes in instruction-tuned (Instruct) and reasoning-enhanced (Thinking) editions for flexible, on-demand deployment.
Visual Agent: operates computer and mobile GUIs; it recognizes interface elements, understands their functions, invokes tools, and completes tasks.
Visual Coding Boost: generates Draw.io diagrams and HTML, CSS, and JavaScript code directly from images or videos.
Advanced Spatial Perception: accurately judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and supports 3D grounding, laying the foundation for spatial reasoning and embodied AI.
Long Context & Video Understanding: native 256K context length, expandable to 1 million tokens; handles entire books and hours-long videos with full recall and second-level indexing.
Enhanced Multimodal Reasoning: excels in STEM and math, with strong causal analysis and logical, evidence-based answers.
Upgraded Visual Recognition: broader, higher-quality pretraining lets it "recognize everything", including celebrities, anime characters, products, landmarks, plants, animals, and more.
Expanded OCR: supports 32 languages (up from 19); robust under low light, blur, and tilt; more accurate on rare characters, ancient scripts, and specialized terminology; improved parsing of long-document structure.
Text Understanding on par with pure LLMs: seamless text-vision fusion for lossless, unified comprehension.
Interleaved-MRoPE: robust positional embeddings with full-frequency allocation over the time, width, and height dimensions, significantly strengthening long-horizon video reasoning (see the conceptual sketch after this list).
DeepStack: fuses multi-level Vision Transformer (ViT) features to capture fine-grained details and sharpen image-text alignment.
Text-Timestamp Alignment: moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.
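To make the Interleaved-MRoPE idea concrete, here is a small conceptual sketch that assigns rotary frequency channels round-robin to the time, height, and width axes, so every axis covers the full low-to-high frequency range. The function name and layout are illustrative assumptions, not the actual Qwen3-VL implementation.

# A conceptual sketch of interleaved multi-axis RoPE frequency allocation.
# Channels are assigned round-robin to (t, h, w), so each axis receives
# both low- and high-frequency channels. Illustration only.
import torch

def interleaved_mrope_angles(t, h, w, head_dim=128, base=10000.0):
    """Return per-channel rotation angles for a single (t, h, w) position."""
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    pos = torch.tensor([t, h, w], dtype=torch.float32)
    # Channel i belongs to axis i % 3: t, h, w, t, h, w, ...
    axis = torch.arange(half) % 3
    return pos[axis] * inv_freq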
This repository hosts the weights of Qwen3-VL-4B-Instruct-FP8.
Multimodal performance

Pure text performance

At the moment, 🤗 Transformers does not support loading these weights directly. Stay tuned!
We recommend deploying the model with vLLM or SGLang; example launch commands are provided below. For details on the runtime environment and deployment, please refer to this link.
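Once a vLLM or SGLang server is serving this checkpoint, it can also be queried through the standard OpenAI-compatible Chat Completions API. The sketch below assumes a local server at http://localhost:8000/v1 with an empty API key; these are placeholders, so adjust them to your deployment.

# A minimal sketch of querying an OpenAI-compatible vLLM/SGLang server that is
# already serving Qwen/Qwen3-VL-4B-Instruct-FP8. base_url and api_key are
# assumptions for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-4B-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/receipt.png"}},
            {"type": "text", "text": "Read all the text in the image."},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)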
Here is a code snippet showing how to run Qwen3-VL inference locally with vLLM. For more details on efficient deployment with vLLM, please refer to the community deployment guide.
# -*- coding: utf-8 -*-
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
import os
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
def prepare_inputs_for_vllm(messages, processor):
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # qwen_vl_utils 0.0.14+ required
    image_inputs, video_inputs, video_kwargs = process_vision_info(
        messages,
        image_patch_size=processor.image_processor.patch_size,
        return_video_kwargs=True,
        return_video_metadata=True
    )
    print(f"video_kwargs: {video_kwargs}")

    mm_data = {}
    if image_inputs is not None:
        mm_data['image'] = image_inputs
    if video_inputs is not None:
        mm_data['video'] = video_inputs

    return {
        'prompt': text,
        'multi_modal_data': mm_data,
        'mm_processor_kwargs': video_kwargs
    }
if __name__ == '__main__':
    # messages = [
    #     {
    #         "role": "user",
    #         "content": [
    #             {
    #                 "type": "video",
    #                 "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
    #             },
    #             {"type": "text", "text": "How long is this video?"},
    #         ],
    #     }
    # ]

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/receipt.png",
                },
                {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]

    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-4B-Instruct-FP8"
    processor = AutoProcessor.from_pretrained(checkpoint_path)
    inputs = [prepare_inputs_for_vllm(message, processor) for message in [messages]]

    llm = LLM(
        model=checkpoint_path,
        trust_remote_code=True,
        gpu_memory_utilization=0.70,
        enforce_eager=False,
        tensor_parallel_size=torch.cuda.device_count(),
        seed=0
    )

    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=1024,
        top_k=-1,
        stop_token_ids=[],
    )

    for i, input_ in enumerate(inputs):
        print()
        print('=' * 40)
        print(f"Inputs[{i}]: {input_['prompt']=!r}")
    print('\n' + '>' * 40)

    outputs = llm.generate(inputs, sampling_params=sampling_params)
    for i, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        print()
        print('=' * 40)
        print(f"Generated text: {generated_text!r}")

Below is a code snippet showing how to run Qwen3-VL inference locally with SGLang.
import time

import torch
from PIL import Image
from sglang import Engine
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoConfig
if __name__ == "__main__":
    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-4B-Instruct-FP8"
    processor = AutoProcessor.from_pretrained(checkpoint_path)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/receipt.png",
                },
                {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages, image_patch_size=processor.image_processor.patch_size)

    llm = Engine(
        model_path=checkpoint_path,
        enable_multimodal=True,
        mem_fraction_static=0.8,
        tp_size=torch.cuda.device_count(),
        attention_backend="fa3"
    )

    start = time.time()
    sampling_params = {"max_new_tokens": 1024}
    response = llm.generate(prompt=text, image_data=image_inputs, sampling_params=sampling_params)
    print(f"Response costs: {time.time() - start:.2f}s")
    print(f"Generated text: {response['text']}")

Recommended generation hyperparameters for vision-language tasks:

export greedy='false'
export top_p=0.8
export top_k=20
export temperature=0.7
export repetition_penalty=1.0
export presence_penalty=1.5
export out_seq_length=16384

Recommended generation hyperparameters for text-only tasks:

export greedy='false'
export top_p=1.0
export top_k=40
export repetition_penalty=1.0
export presence_penalty=2.0
export temperature=1.0
export out_seq_length=32768
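As a usage note, the sketch below shows one possible way to apply the recommended vision-language hyperparameters above with vLLM's offline SamplingParams API used earlier in this document. The mapping from the environment-variable names to SamplingParams fields is an assumption: out_seq_length is mapped to max_tokens, and greedy='false' is taken to mean sampling is enabled.

# A minimal sketch mapping the recommended VL hyperparameters above onto
# vLLM's SamplingParams (the env-var-to-field mapping is an assumption).
from vllm import SamplingParams

vl_sampling_params = SamplingParams(
    temperature=0.7,         # export temperature=0.7
    top_p=0.8,               # export top_p=0.8
    top_k=20,                # export top_k=20
    repetition_penalty=1.0,  # export repetition_penalty=1.0
    presence_penalty=1.5,    # export presence_penalty=1.5
    max_tokens=16384,        # export out_seq_length=16384
)
# Greedy decoding (greedy='true') would instead correspond to temperature=0.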
If you find our work helpful, feel free to cite it.

@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}
@article{Qwen2.5-VL,
title={Qwen2.5-VL Technical Report},
author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
journal={arXiv preprint arXiv:2502.13923},
year={2025}
}
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}