
SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models

Overview

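Despite remarkable progress, multimodal foundation models still show clear deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family. The family is built on established multimodal foundations, including visual understanding models (e.g., Qwen3-VL and InternVL3) and unified understanding and generation models (e.g., Bagel). We take a principled approach to building high-performing and robust spatial intelligence: guided by a rigorous taxonomy of spatial capabilities, we systematically curate SenseNova-SI-8M, a dataset of eight million diverse samples. SenseNova-SI achieves unprecedented performance across a broad range of spatial intelligence benchmarks while maintaining strong general multimodal understanding. More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization enabled by training on diverse data, examine the risks of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate potential downstream applications. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction. In the future, SenseNova-SI will be integrated with larger-scale in-house models.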

Model Zoo

Model	Base Model	SI Dataset Size	EASI-8	Notes
SenseNova-SI-1.5-InternVL3-8B	SenseNova-SI-1.4-InternVL3-8B	1.5M	64.4	Improved solid geometry capability
SenseNova-SI-1.4-InternVL3-8B	InternVL3	29M	63.7	Improved object grounding and depth estimation
SenseNova-SI-1.3-InternVL3-8B	InternVL3	14M	65.2	Best spatial intelligence performance; improved open-ended short QA
SenseNova-SI-1.3-Qwen3-VL-8B	Qwen3-VL	14M	61.4	Improved open-ended short QA
SenseNova-SI-1.2-InternVL3-8B	InternVL3	10M	64.5	-
SenseNova-SI-1.1-InternVL3-8B	InternVL3	8M	61.5	-
SenseNova-SI-1.1-InternVL3-2B	InternVL3	8M	49.4	-
SenseNova-SI-1.1-Qwen3-VL-8B	Qwen3-VL	8M	58.1	-
SenseNova-SI-1.1-Qwen2.5-VL-7B	Qwen2.5-VL	8M	51.0	-
SenseNova-SI-1.1-Qwen2.5-VL-3B	Qwen2.5-VL	8M	45.7	-
SenseNova-SI-1.1-BAGEL-7B-MoT	BAGEL	8M	48.6	Unified understanding and generation model

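Release Information

Currently, we build SenseNova-SI on popular open-source foundation models to maximize compatibility with existing research pipelines. This release includes the following models: SenseNova-SI-1.3-Qwen3-VL-8B, SenseNova-SI-1.5-InternVL3-8B, SenseNova-SI-1.4-InternVL3-8B, SenseNova-SI-1.3-InternVL3-8B, SenseNova-SI-1.2-InternVL3-8B, SenseNova-SI-1.1-Qwen2.5-VL-3B, SenseNova-SI-1.1-Qwen2.5-VL-7B, and SenseNova-SI-1.1-Qwen3-VL-8B.

SenseNova-SI-1.3-Qwen3-VL-8B demonstrates strong spatial intelligence across a range of benchmarks, and its open-ended spatial QA capability is significantly improved over previous versions.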

Model	VSI	MMSI	MindCube-Tiny	ViewSpatial	SITE	BLINK	3DSRBench	EmbSpatial-Bench
Open-source models (~2B parameters)
InternVL3-2B	32.9	26.5	37.5	32.5	30.0	50.8	47.7	60.1
Qwen3-VL-2B-Instruct	50.3	28.9	34.5	36.9	35.6	53.2	47.5	70.1
MindCube-3B-RawQA-SFT	17.2	1.7	51.7	24.1	6.3	35.1	2.8	37.0
SpatialLadder-3B	44.8	27.4	43.4	39.8	27.9	43.0	42.8	58.2
SpatialMLLM-4B	46.3	26.1	33.4	34.6	18.0	40.5	36.2	50.0
VST-3B-SFT	57.9	30.2	35.9	52.8	35.8	58.8	54.1	69.0
Cambrian-S-3B	57.3	25.2	32.5	39.0	28.3	37.7	50.9	63.5
Open-source models (~8B parameters)
InternVL3-8B	42.1	28.0	41.5	38.6	41.1	53.5	44.3	76.4
Qwen3-VL-8B-Instruct	57.9	31.1	29.4	42.2	45.8	66.7	53.9	77.7
BAGEL-7B-MoT	31.4	31.0	34.7	41.3	37.0	63.7	50.2	73.1
SpaceR-7B	41.5	27.4	37.9	35.8	34.2	49.6	40.5	66.9
ViLaSR-7B	44.6	30.2	35.1	35.7	38.7	51.4	46.6	67.3
VST-7B-SFT	60.6	32.0	39.7	50.5	39.6	61.9	54.6	73.7
Cambrian-S-7B	67.5	25.8	39.6	40.9	33.0	37.9	54.8	72.8
SenseNova-SI-1.3-Qwen3-VL-8B	67.8	39.5	68.3	55.8	57.5	63.0	57.3	82.1
Proprietary models
Gemini-2.5-pro-2025-06	53.5	38.0	57.6	46.0	57.0	73.5	59.3	78.9
Grok-4-2025-07-09	47.9	37.8	63.5	43.2	47.0	56.4	54.9	75.7
GPT-5-2025-08-07	55.0	41.8	56.3	45.5	61.8	68.0	60.3	81.6
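
To make the comparison with the base model concrete, the short sketch below computes the per-benchmark difference between SenseNova-SI-1.3-Qwen3-VL-8B and Qwen3-VL-8B-Instruct. It uses only the scores from the table above; the variable names are illustrative and not part of the repository.

```python
# Scores copied from the table above, in the table's column order.
BENCHMARKS = ["VSI", "MMSI", "MindCube-Tiny", "ViewSpatial",
              "SITE", "BLINK", "3DSRBench", "EmbSpatial-Bench"]
base = [57.9, 31.1, 29.4, 42.2, 45.8, 66.7, 53.9, 77.7]  # Qwen3-VL-8B-Instruct
si = [67.8, 39.5, 68.3, 55.8, 57.5, 63.0, 57.3, 82.1]    # SenseNova-SI-1.3-Qwen3-VL-8B

# Difference in points (SenseNova-SI minus base), rounded to one decimal.
gains = {name: round(s - b, 1) for name, b, s in zip(BENCHMARKS, base, si)}

for name, gain in gains.items():
    print(f"{name:18s} {gain:+.1f}")
```

The largest gain is on MindCube-Tiny (+38.9 points). BLINK is the only benchmark in this table where the base model scores higher (66.7 vs. 63.0).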

Quick Start

Installation

We recommend using uv to manage the environment.

uv installation guide: https://docs.astral.sh/uv/getting-started/installation/#installing-uv

git clone git@github.com:OpenSenseNova/SenseNova-SI.git
cd SenseNova-SI/
uv sync --extra cu124 # or one of [cu118|cu121|cu124|cu126|cu128|cu129], depending on your CUDA version
uv sync
source .venv/bin/activate
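
If you are unsure which extra to pass to `uv sync --extra`, the sketch below maps a CUDA version string to one of the extras listed in the command above. It assumes the extra names follow the usual `cu<major><minor>` convention (for example, CUDA 12.4 corresponds to `cu124`); `cuda_extra` is a hypothetical helper, not part of the repository.

```python
# Extras listed in the installation command above.
SUPPORTED_EXTRAS = ("cu118", "cu121", "cu124", "cu126", "cu128", "cu129")


def cuda_extra(cuda_version: str) -> str:
    """Map a CUDA version such as '12.4' to an extra name such as 'cu124'.

    Assumption: extras are named cu<major><minor>.
    """
    parts = cuda_version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a version like '12.4', got {cuda_version!r}")
    extra = f"cu{parts[0]}{parts[1]}"
    if extra not in SUPPORTED_EXTRAS:
        raise ValueError(
            f"no extra listed for CUDA {cuda_version}; choose one of {SUPPORTED_EXTRAS}"
        )
    return extra


print("uv sync --extra", cuda_extra("12.4"))
```

The CUDA version itself can be read from the output of `nvidia-smi` or `nvcc --version`.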

Hello World

A simple image-free test to verify the environment setup and model download.

python example.py \
  --question "Hello" \
  --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B

Examples

Example 1

This example is from SITE-Bench:

python example.py \
  --image_paths examples/Q1_1.png \
  --question "Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:" \
  --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
# --model_path sensenova/SenseNova-SI-1.3-InternVL3-8B

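Example 1 details

Q: Consider the real-world 3D locations of the objects. Which is closer to the sink, the toilet paper or the towel?\nOptions: \nA. toilet paper\nB. towel\nGive me the answer letter directly. The best answer is:

Correct answer: A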
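
When preparing many multiple-choice questions, it can help to build the prompt programmatically. The sketch below reproduces the prompt layout of Example 1 (question, an `Options: ` line, one lettered option per line, then the answer instruction). `build_mcq_prompt` is a hypothetical helper, not part of the repository, and it joins lines with real newline characters; whether `example.py` expects real newlines or the literal `\n` sequence is not stated here, so treat that as an assumption.

```python
import string


def build_mcq_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question in the layout used by Example 1."""
    lines = [question, "Options: "]
    # Label options A, B, C, ... in order.
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Give me the answer letter directly. The best answer is:")
    return "\n".join(lines)


prompt = build_mcq_prompt(
    "Consider the real-world 3D locations of the objects. "
    "Which is closer to the sink, the toilet paper or the towel?",
    ["toilet paper", "towel"],
)
print(prompt)
```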

Example 2

This example is from MMSI-Bench:

python example.py \
  --image_paths examples/Q2_1.png examples/Q2_2.png \
  --question "If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within \`\`." \
  --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
# --model_path sensenova/SenseNova-SI-1.3-InternVL3-8B

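Example 2 details

Q: If the landscape painting is on the east side of the bedroom, where is the window located in the bedroom?\nOptions: A. North side, B. South side, C. West side, D. East side\nAnswer with the option's letter from the given choices directly. Enclose the option's letter within ``.

Correct answer: C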
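
Because the prompt in Example 2 asks the model to enclose the option letter in backticks, the answer can be recovered with a small regular expression. This is a minimal sketch that assumes the model response is available as a plain string; `extract_choice` is a hypothetical helper, and the output format of `example.py` is not specified here.

```python
import re
from typing import Optional


def extract_choice(response: str, letters: str = "ABCD") -> Optional[str]:
    """Return the first option letter enclosed in backticks, or None."""
    match = re.search(rf"`([{letters}])`", response)
    return match.group(1) if match else None


print(extract_choice("The window is on the west side, so the answer is `C`."))
```

Returning `None` when no backtick-enclosed letter is found lets the caller count such responses as unanswered instead of guessing.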

Example 3

This example is from MMSI-Bench and tests the model's performance on open-ended short QA:

python example.py \
  --image_paths examples/Q3_1.png examples/Q3_2.png examples/Q3_3.png \
  --question "The robot is making tea. What is the order in which the pictures were taken?" \
  --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B

Example 3 details

Q: The robot is making tea. What is the order in which the pictures were taken?

Correct answer: second, first, third

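Testing Multiple Questions in a Single Run

Prepare a file similar to examples/examples.jsonl, where each line represents one question.

The model is loaded only once and then processes the questions sequentially. The questions are independent of each other.

For more details on the jsonl format, refer to the documentation for single-image data and multi-image data.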

python example.py \
  --jsonl_path examples/examples.jsonl \
  --model_path sensenova/SenseNova-SI-1.3-Qwen3-VL-8B
# --model_path Qwen/Qwen3-VL-8B-Instruct
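
Since loading the model takes time, it can be worth checking the jsonl file before the run. The sketch below is schema-agnostic: it only verifies that every line parses as a JSON object and does not check field names, which are defined in the data format documentation referenced above. `check_jsonl` is a hypothetical helper, not part of the repository, and treating each line as a JSON object is an assumption.

```python
import json


def check_jsonl(path: str) -> list[tuple[int, str]]:
    """Return (line_number, message) for every line that is not a JSON object."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((lineno, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            # Assumption: each question is stored as a JSON object.
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
    return problems
```

For example, `check_jsonl("examples/examples.jsonl")` returns an empty list when every line is a well-formed JSON object.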

Evaluation

To reproduce the benchmark results above, please refer to EASI to evaluate SenseNova-SI on mainstream spatial intelligence benchmarks.

Citation

@InProceedings{sensenova-si,
  title = {Scaling Spatial Intelligence with Multimodal Foundation Models},
  author = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026}
}