tencent_hunyuan/HY-Embodied-0.5

HY-Embodied

A Family of Embodied Foundation Models for Real-World Agents

Tencent Robotics X × Hunyuan Vision Team

Tech Report Paper Models GitHub

🔥 Updates

  • [2026-04-09] 🚀 We officially release HY-Embodied-0.5, open-sourcing the HY-Embodied-0.5 MoT-2B model weights (available on Hugging Face) together with the official inference code!

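📖 Abstract

We introduce HY-Embodied-0.5, a family of foundation models built specifically for real-world embodied intelligence. To bridge the gap between general-purpose vision-language models (VLMs) and the strict demands of physical agents, our models are optimized for spatial-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).

The family adopts a novel Mixture-of-Transformers (MoT) architecture that uses latent tokens to enable modality-specific computation, significantly improving fine-grained perception. It comes in two main variants: an efficient 2B model for edge deployment and a powerful 32B model for complex reasoning tasks. Through a self-evolving post-training paradigm and large-to-small on-policy distillation, our compact MoT-2B outperforms state-of-the-art models of similar size on 16 benchmarks, while the 32B variant reaches frontier-level performance comparable to Gemini 3.0 Pro. Ultimately, HY-Embodied serves as a robust "brain" for vision-language-action (VLA) pipelines and delivers compelling results in real-world physical robot control.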

HY-Embodied Teaser

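⭐️ Key Features

  • 🧠 Evolved MoT architecture: maximizes efficiency without sacrificing visual acuity. The MoT-2B variant has 4B total parameters but activates only 2.2B at inference time. By strengthening modality-specific computation in the visual pathway, it matches the inference speed of a dense 2B model while providing stronger, finer-grained perceptual representations.
  • 🔗 High-quality mixed chain reasoning: we introduce an advanced iterative, self-evolving post-training pipeline. Using on-policy distillation, we transfer the complex step-by-step reasoning, planning, and high-quality "thinking" capabilities of the powerful 32B model directly to the compact 2B variant.
  • 🌍 Large-scale embodied pre-training: built on a carefully curated, large-scale dataset of more than 100 million embodied and spatial-specific data points. Training on a corpus of more than 200B tokens gives the model a deep, native understanding of 3D space, physical object interaction, and agent dynamics.
  • 🦾 Stronger VLA applications: beyond standard academic benchmarks, HY-Embodied is designed to be the core cognitive engine of physical robots. It integrates seamlessly into vision-language-action (VLA) frameworks as a highly robust and efficient "brain", driving high success rates in complex real-world robot control tasks.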
HY-Embodied Architecture
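
The 4B-total and 2.2B-activated figures quoted for MoT-2B are consistent with a simple two-branch accounting, sketched below. This is a back-of-the-envelope illustration, not the published parameter breakdown. It assumes exactly two modality branches of equal size plus a block of shared weights, with each token using one branch, so that total = shared + 2 × branch and activated = shared + branch. The function name `split_params` and the two-branch assumption are ours, not the authors'.

```python
def split_params(total_b: float, active_b: float, n_branches: int = 2):
    """Solve total = shared + n * branch and active = shared + branch.

    Sizes are in billions of parameters. ASSUMPTION (not from the model
    card): all branches are the same size and a token uses exactly one.
    """
    if n_branches < 2:
        raise ValueError("need at least two branches to separate the terms")
    branch = (total_b - active_b) / (n_branches - 1)
    shared = active_b - branch
    return shared, branch


shared, branch = split_params(total_b=4.0, active_b=2.2)
print(f"shared ≈ {shared:.1f}B, per-branch ≈ {branch:.1f}B")
```

Under these assumptions the split comes out at roughly 1.8B parameters per branch and 0.4B shared, which is why the per-token compute resembles that of a dense model of about 2B parameters.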

📅 Plans

  • Transformers inference
  • vLLM inference
  • Online Gradio demo

🛠️ Dependencies and Installation

Prerequisites

  • 🖥️ Operating system: Linux (recommended)
  • 🐍 Python: 3.12+ (recommended and tested)
  • ⚡ CUDA: 12.6
  • 🔥 PyTorch: 2.8.0
  • 🎮 GPU: NVIDIA GPU with CUDA support
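
Before installing, it can help to compare the local environment against the prerequisites listed above. The following is a minimal sketch, not part of the official repository. The function name `check_environment` and the report keys are hypothetical, and PyTorch is imported lazily so the check also runs on a machine where it is not yet installed.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 12)):
    """Return a small report comparing this environment to the prerequisites."""
    report = {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
        "cuda_version": None,
    }
    if report["torch_installed"]:
        import torch  # lazy import: only attempted when PyTorch is present

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

The report only prints what it finds. Comparing `torch_version` and `cuda_version` against 2.8.0 and 12.6 is left to the reader, since nearby versions may also work.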

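Installation

  1. Install the specific Transformers version required by this model:
pip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a

Note: We will merge these improvements into the Transformers main branch later.

  2. Install the other dependencies:
pip install -r requirements.txt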

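Quick Start

  1. Clone the repository:
git clone https://github.com/Tencent-Hunyuan/HY-Embodied
cd HY-Embodied/
  2. Install the dependencies:
pip install -r requirements.txt
  3. Run inference:
python inference.py

The example script demonstrates both single-turn generation and batch generation.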

Model Download

The code automatically downloads the model tencent/HY-Embodied-0.5 from the Hugging Face Hub. Make sure you have enough disk space (8 GB) to store the model weights.
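
If you prefer to fetch the weights ahead of time instead of on first use, the sketch below checks free disk space and then calls `snapshot_download` from the `huggingface_hub` package. The helper names `has_enough_space` and `prefetch_model` are hypothetical, the 8 GB threshold is the figure quoted above, and the download needs network access.

```python
import os
import shutil

MODEL_ID = "tencent/HY-Embodied-0.5"  # repo ID quoted in this README
REQUIRED_GB = 8                       # disk space quoted in this README


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if `path` has at least `required_gb` decimal GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


def prefetch_model(local_dir):
    """Download the model snapshot into `local_dir` (requires network)."""
    from huggingface_hub import snapshot_download  # lazy import

    os.makedirs(local_dir, exist_ok=True)
    if not has_enough_space(local_dir):
        raise RuntimeError(f"less than {REQUIRED_GB} GB free in {local_dir}")
    return snapshot_download(repo_id=MODEL_ID, local_dir=local_dir)
```

The returned path can then be used as `MODEL_PATH` in the inference examples, assuming the snapshot contains everything `from_pretrained` expects.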

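Hardware Requirements

  • GPU: recommended for best performance (NVIDIA GPU with at least 16 GB of VRAM)
  • CPU: supported but slower
  • Memory: at least 16 GB of RAM recommended
  • Storage: 20 GB or more of free space for the model and dependencies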
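
The 8 GB weight size can be sanity-checked with simple arithmetic. The sketch assumes the checkpoint stores roughly 4 billion parameters (the total quoted for MoT-2B in the features list) in bfloat16, i.e. 2 bytes per parameter. The helper name is hypothetical, and activations and the KV cache add to this footprint at run time.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw weights in decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_params * bytes_per_param / 10**9


print(weight_footprint_gb(4e9))     # bfloat16, 2 bytes/param -> 8.0
print(weight_footprint_gb(4e9, 4))  # float32, 4 bytes/param  -> 16.0
```

On this estimate, bfloat16 weights occupy about half of a 16 GB GPU, leaving the rest for activations and long generations.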

🚀 Quick Start with Transformers

Basic Inference Example

import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load model & processor
MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load chat template if available (only found when MODEL_PATH is a local directory)
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path, encoding="utf-8") as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

# Prepare input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figures/example.jpg"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

# Process and generate
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=THINKING_MODE,
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
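
The `messages` structure in the example is a plain list of dictionaries, so it can be assembled with a small helper. The sketch below is illustrative: `build_user_message` is a hypothetical name, and whether the model accepts several images in one turn is an assumption that this README does not confirm.

```python
def build_user_message(text, image_paths=()):
    """Build one user turn in the message format used in the example.

    Image entries come first, followed by the text prompt.
    """
    content = [{"type": "image", "image": path} for path in image_paths]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Same request as the basic inference example.
messages = [build_user_message("Describe the image in detail.",
                               ["./figures/example.jpg"])]

# Text-only request: no image entries are added.
text_only = [build_user_message("How to open a fridge?")]
```

Either list can be passed to `processor.apply_chat_template` in place of the hand-written `messages`.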

Batch Inference

import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load model & processor
MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load chat template if available (only found when MODEL_PATH is a local directory)
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path, encoding="utf-8") as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

# Batch Inference (multiple prompts at once)
messages_batch = [
    # Sample A: image + text
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./figures/example.jpg"},
                {"type": "text", "text": "Describe the image in detail."},
            ],
        }
    ],
    # Sample B: text only
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How to open a fridge?"},
            ],
        }
    ],
]

# Process each message independently
all_inputs = []
for msgs in messages_batch:
    inp = processor.apply_chat_template(
        msgs,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=THINKING_MODE,
    )
    all_inputs.append(inp)

# Left-pad and batch
batch = processor.pad(all_inputs, padding=True, padding_side="left").to(model.device)

with torch.no_grad():
    batch_generated_ids = model.generate(
        **batch,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

# Decode: strip the padded input portion
padded_input_len = batch["input_ids"].shape[1]
for i, msgs in enumerate(messages_batch):
    out_ids = batch_generated_ids[i][padded_input_len:]
    print(f"\n--- Sample {i} ---")
    print(processor.decode(out_ids, skip_special_tokens=True))
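
The decoding step relies on left-padding: every prompt ends at the same column, so slicing at `padded_input_len` removes exactly the prompt from every row. The toy sketch below reproduces that logic with plain Python lists. The token IDs, `pad_id=0`, and the "generated" tokens are made up for illustration and do not come from the model's tokenizer.

```python
def left_pad(sequences, pad_id=0):
    """Left-pad token-id lists to a common length; return ids and masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


prompts = [[11, 12, 13, 14], [21, 22]]  # toy token ids of unequal length
ids, mask = left_pad(prompts)

# Pretend each sample generated two new tokens (901, 902).
generated = [row + [901, 902] for row in ids]

# One slice position works for every row because padding is on the left.
padded_input_len = len(ids[0])
new_tokens = [row[padded_input_len:] for row in generated]
print(new_tokens)  # [[901, 902], [901, 902]]
```

With right-padding, the pad tokens would sit between each prompt and its continuation, and a single slice position would no longer separate them.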

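📊 Evaluation

Visual Perception

Note: We evaluated HY-Embodied-0.5 MoT-2B against models of similar size on 22 embodied-relevant benchmarks. For detailed metrics and methodology, please refer to our technical report.

Note: We observed that the small models in the Qwen3.5 series produce repetitive thinking patterns on some benchmarks, which lowers their overall results. We therefore compare against Qwen3-VL models in our evaluation.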

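| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
| --- | --- | --- | --- | --- | --- |
| CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |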

Embodied Understanding

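| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
| --- | --- | --- | --- | --- | --- |
| ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |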

Spatial Understanding

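| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
| --- | --- | --- | --- | --- | --- |
| 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |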

Note: Results for HY-Embodied-0.5 MoT-2B are reported in thinking mode; for all other models, we report the better of their non-thinking and thinking results.
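
The three result tables can be tallied programmatically to see on how many benchmarks HY-Embodied-0.5 MoT-2B has the top score. The sketch below copies the numbers from the tables and assumes that higher is better on every benchmark, which this README does not state explicitly. The names `SCORES` and `count_wins` are ours.

```python
# Column order: HY-Embodied 0.5 MoT-2B, Qwen3-VL 2B, Qwen3-VL 4B,
# RoboBrain 2.5 4B, MiMo-Embodied 7B (scores copied from the tables).
SCORES = {
    "CV-Bench": [89.2, 80.0, 85.7, 86.9, 88.8],
    "DA-2K": [92.3, 69.5, 76.5, 79.4, 72.2],
    "ERQA": [54.5, 41.8, 47.3, 43.3, 46.8],
    "EmbSpatial-Bench": [82.8, 75.9, 80.7, 73.8, 76.2],
    "RoboBench-MCQ": [49.2, 36.9, 45.8, 44.4, 43.6],
    "RoboBench-Planning": [54.2, 36.2, 36.4, 39.2, 58.7],
    "RoboSpatial-Home": [55.7, 45.3, 63.2, 62.3, 61.8],
    "ShareRobot-Aff.": [26.8, 19.8, 25.5, 25.5, 9.0],
    "ShareRobot-Traj.": [73.3, 41.6, 62.2, 81.4, 50.6],
    "Ego-Plan2": [45.5, 35.5, 38.8, 52.6, 39.9],
    "3DSRBench": [57.0, 39.9, 43.9, 44.8, 42.0],
    "All-Angles Bench": [55.1, 42.3, 46.7, 43.8, 49.0],
    "MindCube": [66.3, 28.4, 31.0, 26.9, 36.2],
    "MMSI-Bench": [33.2, 23.6, 25.1, 20.5, 31.9],
    "RefSpatial-Bench": [45.8, 28.9, 45.3, 56.0, 48.0],
    "SAT": [76.7, 45.3, 56.7, 51.3, 78.7],
    "SIBench-mini": [58.2, 42.0, 50.9, 47.3, 53.1],
    "SITE-Bench-Image": [62.7, 52.3, 61.0, 57.9, 49.9],
    "SITE-Bench-Video": [63.5, 52.2, 58.0, 54.8, 58.9],
    "ViewSpatial": [53.1, 37.2, 41.6, 36.6, 36.1],
    "VSIBench": [60.5, 48.0, 55.2, 41.7, 48.5],
    "Where2Place": [68.0, 45.0, 59.0, 65.0, 63.6],
}


def count_wins(scores, model_index=0):
    """List benchmarks where the chosen column holds the (tied) top score."""
    return [name for name, row in scores.items()
            if row[model_index] == max(row)]


wins = count_wins(SCORES)
print(f"{len(wins)} of {len(SCORES)} benchmarks")  # 16 of 22 benchmarks
```

The tally of 16 out of 22 matches the figure quoted in the abstract. The six exceptions are RoboBench-Planning, RoboSpatial-Home, ShareRobot-Traj., Ego-Plan2, RefSpatial-Bench, and SAT, where a larger model scores higher.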

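📚 Citation

Citation information will be provided after publication. Please check back for updates.

🙏 Acknowledgements

We thank the Hugging Face community for its support, and the open-source contributions that made this implementation possible.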