Notes:
- Infinite generation loop caused by an incorrect eos_token_id setting in config.json (PR #abdf3).
- Some previously reported numbers were affected by max_tokens being set to 32k; we are re-running the tests and will provide corrected values in the next version of the technical report.

STEP3-VL-10B is a lightweight, open-source foundation model designed to redefine the balance between model compactness and frontier-level multimodal intelligence. Despite its compact size of only 10 billion parameters, STEP3-VL-10B excels at visual perception, complex reasoning, and human alignment. It not only consistently outperforms models under 10B parameters, but also matches or even surpasses open-source models 10 to 20 times its size, such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as leading proprietary flagship models such as Gemini 2.5 Pro and Seed-1.5-VL.
Figure 1: Performance comparison of STEP3-VL-10B against mainstream multimodal foundation models. SeRe: sequential reasoning; PaCoRe: parallel collaborative reasoning.
The success of STEP3-VL-10B stems from two key strategic design choices:
| Model | Type | Hugging Face | ModelScope |
|---|---|---|---|
| STEP3-VL-10B-Base | Base model | 🤗 Download | 🤖 Download |
| STEP3-VL-10B | Chat model | 🤗 Download | 🤖 Download |
| STEP3-VL-10B-FP8 | Quantized model | 🤗 Download | 🤖 Download |
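The weights can also be fetched ahead of time with huggingface_hub; a minimal sketch (the local_dir path is an arbitrary example, and the other variants follow the repository names in the table above):

```python
# Optional: pre-download the chat model weights; the transformers example below
# will otherwise download them automatically on first use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stepfun-ai/Step3-VL-10B",  # Base / FP8 variants are listed in the table above
    local_dir="./Step3-VL-10B",         # arbitrary example path
)
```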
STEP3-VL-10B delivers outstanding performance across mainstream multimodal benchmarks, setting a new standard for compact models. The results show that STEP3-VL-10B is the strongest open-source model in the 10B-parameter class.
| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
|---|---|---|---|---|---|---|
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| MathVision | 70.81 | 75.95 | 63.50 | 72.10 | 73.30 | 68.70 |
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| LiveCodeBench | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |
Reasoning mode notes:
SeRe (Sequential Reasoning): the standard inference mode, using sequential (chain-of-thought) generation with a maximum length of 64K tokens.
PaCoRe (Parallel Collaborative Reasoning): an advanced mode that scales test-time compute. It aggregates evidence from 16 parallel reasoning paths to synthesize the final answer, with a maximum context length of 128K tokens. (An illustrative client-side sketch follows these notes.)
Unless otherwise noted, the scores below refer to the standard SeRe mode. Higher scores obtained with PaCoRe are explicitly marked.
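PaCoRe itself runs on the serving side and is not exposed by the examples in this README. Purely to illustrate the idea, the sketch below samples several reasoning paths from an OpenAI-compatible endpoint (such as the vLLM server described in the deployment section) and takes a majority vote over the final answers; the endpoint URL, prompt, and voting rule are illustrative assumptions, not the actual PaCoRe aggregation.

```python
# Illustrative approximation of parallel reasoning-path aggregation.
# NOT the PaCoRe implementation; it only sketches the general idea.
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # assumes a local OpenAI-compatible server

NUM_PATHS = 16  # matches the number of parallel paths quoted for PaCoRe

resp = client.chat.completions.create(
    model="stepfun-ai/Step3-VL-10B",
    messages=[{"role": "user", "content": "Compute 17 * 24. Answer with the number only."}],
    n=NUM_PATHS,       # sample several independent reasoning paths in one request
    temperature=1.0,   # keep sampling diversity across paths
    max_tokens=2048,
)

# Naive aggregation: majority vote over the final answers of the sampled paths.
answers = [choice.message.content.strip() for choice in resp.choices]
final_answer, votes = Counter(answers).most_common(1)[0]
print(f"{votes}/{NUM_PATHS} paths agree on: {final_answer}")
```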
| Category | Benchmark | STEP3-VL-10B | GLM-4.6V-Flash (9B) | Qwen3-VL-Thinking (8B) | InternVL-3.5 (8B) | MiMo-VL-RL-2508 (7B) |
|---|---|---|---|---|---|---|
| STEM Reasoning | MMMU | 78.11 | 71.17 | 73.53 | 71.69 | 71.14 |
| | MathVision | 70.81 | 54.05 | 59.60 | 52.05 | 59.65 |
| | MathVista | 83.97 | 82.85 | 78.50 | 76.78 | 79.86 |
| | PhyX | 59.45 | 52.28 | 57.67 | 50.51 | 56.00 |
| Recognition | MMBench (EN) | 92.05 | 91.04 | 90.55 | 88.20 | 89.91 |
| | MMStar | 77.48 | 74.26 | 73.58 | 69.83 | 72.93 |
| | ReMI | 67.29 | 60.75 | 57.17 | 52.65 | 63.13 |
| OCR & Document | OCRBench | 86.75 | 85.97 | 82.85 | 83.70 | 85.40 |
| | AI2D | 89.35 | 88.93 | 83.32 | 82.34 | 84.96 |
| GUI Grounding | ScreenSpot-V2 | 92.61 | 92.14 | 93.60 | 84.02 | 90.82 |
| | ScreenSpot-Pro | 51.55 | 45.68 | 46.60 | 15.39 | 34.84 |
| | OSWorld-G | 59.02 | 54.71 | 56.70 | 31.91 | 50.54 |
| Spatial Perception | BLINK | 66.79 | 64.90 | 62.78 | 55.40 | 62.57 |
| | All-Angles-Bench | 57.21 | 53.24 | 45.88 | 45.29 | 51.62 |
| Code | HumanEval-V | 66.05 | 29.26 | 26.94 | 24.31 | 31.96 |
Deployment Resource Specifications
This section describes how to run inference with our model using the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.57.0 as the development environment. Currently we only support bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is consistent with vLLM.
Note: if you run into an infinite generation loop, see Discussion #9 for a workaround.
from transformers import AutoProcessor, AutoModelForCausalLM
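# Remap checkpoint weight prefixes onto the module layout expected by the remote code
# (vision tower, language model, and projector).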
key_mapping = {
"^vision_model": "model.vision_model",
r"^model(?!\.(language_model|vision_model))": "model.language_model",
"vit_large_projector": "model.vit_large_projector",
}
model_path = "stepfun-ai/Step3-VL-10B"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
{"type": "text", "text": "What's in this picture?"}
]
},
]
model = AutoModelForCausalLM.from_pretrained(
model_path,
trust_remote_code=True,
device_map="auto",
torch_dtype="auto",
key_mapping=key_mapping).eval()
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
).to(model.device)
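# If generation fails to terminate (see the note above / Discussion #9), one possible
# workaround is to pass eos_token_id=processor.tokenizer.eos_token_id to generate() below.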
generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1] :], skip_special_tokens=True)
print(decoded)

For deployment, you can use vLLM to create an OpenAI-compatible API endpoint.
Install a vLLM nightly build (choose one of the following):
Python / pip
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Requires Python ≥ 3.10. Make sure the vLLM version is ≥ 0.14.0rc2.dev143+gc0a350ca7.
Docker (nightly image)
docker pull vllm/vllm-openai:nightly-963dc0b865a3b6011fde7e0d938f86245dccbfac

The tag above pins the nightly build we have verified; update to the latest nightly tag if needed.
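Whichever install option you choose, a quick way to confirm the build meets the stated minimum version (for the Docker image, run it inside the container):

```python
# Print the installed vLLM version; it should be >= 0.14.0rc2.dev143+gc0a350ca7.
import vllm
print(vllm.__version__)
```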
Start the server:
vllm serve stepfun-ai/Step3-VL-10B -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code

Key step: you must add the --trust-remote-code flag to the deployment command; it is required for models whose architectures use custom code.
Call the endpoint with any OpenAI-compatible SDK (Python example):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="stepfun-ai/Step3-VL-10B",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
                },
            },
            {"type": "text", "text": "what's in this picture?"},
        ],
    }],
)
print(resp.choices[0].message.content)
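Because the server was launched with --reasoning-parser deepseek_r1, vLLM separates the reasoning trace from the final answer. A minimal follow-up sketch (continuing from resp above; field availability depends on the vLLM build):

```python
# With a reasoning parser enabled, the reasoning trace is returned in a
# separate field from the final answer.
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)
```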
Alternatively, you can deploy with SGLang. Install (choose one option):

Python / pip
pip install "sglang @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
pip install nvidia-cudnn-cu12==9.16.0.29

Docker

docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path stepfun-ai/Step3-VL-10B --host 0.0.0.0 --port 30000

Start the server:
sglang serve --model-path stepfun-ai/Step3-VL-10B --trust-remote-code --port 2345 --reasoning-parser deepseek-r1 --tool-call-parser hermes

Call the endpoint with any OpenAI-compatible SDK (Python example):
from openai import OpenAI
port = 30000
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="stepfun-ai/Step3-VL-10B",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?",
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
},
},
],
}
],
)
print(response.choices[0].message.content)

If you find this project helpful for your research, please cite our technical report:
@misc{huang2026step3vl10btechnicalreport,
title={STEP3-VL-10B Technical Report},
author={Ailin Huang and Chengyuan Yao and Chunrui Han and Fanqi Wan and Hangyu Guo and Haoran Lv and Hongyu Zhou and Jia Wang and Jian Zhou and Jianjian Sun and Jingcheng Hu and Kangheng Lin and Liang Zhao and Mitt Huang and Song Yuan and Wenwen Qu and Xiangfeng Wang and Yanlin Lai and Yingxiu Zhao and Yinmin Zhang and Yukang Shi and Yuyang Chen and Zejia Weng and Ziyang Meng and Ang Li and Aobo Kong and Bo Dong and Changyi Wan and David Wang and Di Qi and Dingming Li and En Yu and Guopeng Li and Haiquan Yin and Han Zhou and Hanshan Zhang and Haolong Yan and Hebin Zhou and Hongbo Peng and Jiaran Zhang and Jiashu Lv and Jiayi Fu and Jie Cheng and Jie Zhou and Jisheng Yin and Jingjing Xie and Jingwei Wu and Jun Zhang and Junfeng Liu and Kaijun Tan and Kaiwen Yan and Liangyu Chen and Lina Chen and Mingliang Li and Qian Zhao and Quan Sun and Shaoliang Pang and Shengjie Fan and Shijie Shang and Siyuan Zhang and Tianhao You and Wei Ji and Wuxun Xie and Xiaobo Yang and Xiaojie Hou and Xiaoran Jiao and Xiaoxiao Ren and Xiangwen Kong and Xin Huang and Xin Wu and Xing Chen and Xinran Wang and Xuelin Zhang and Yana Wei and Yang Li and Yanming Xu and Yeqing Shen and Yuang Peng and Yue Peng and Yu Zhou and Yusheng Li and Yuxiang Yang and Yuyang Zhang and Zhe Xie and Zhewei Huang and Zhenyi Lu and Zhimin Fan and Zihui Cheng and Daxin Jiang and Qi Han and Xiangyu Zhang and Yibo Zhu and Zheng Ge},
year={2026},
eprint={2601.09668},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.09668},
}

This project is open-sourced under the Apache 2.0 License.