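[2026-04-09] 🚀 We have officially released HY-Embodied-0.5, open-sourcing the HY-Embodied-0.5 MoT-2B model weights (available on Hugging Face) together with the official inference code!

We introduce HY-Embodied-0.5, a family of foundation models built for real-world embodied intelligence. To bridge the gap between general-purpose vision-language models (VLMs) and the strict demands of physical agents, our models are specifically optimized for spatio-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).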
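The family adopts a Mixture-of-Transformers (MoT) architecture that uses latent variables to enable modality-specific computation, which markedly improves fine-grained perception. It comes in two main variants: an efficient 2B model suited to edge deployment and a more capable 32B model for complex reasoning tasks. Through a self-evolving post-training paradigm and large-to-small on-policy distillation, the compact MoT-2B leads similarly sized state-of-the-art models on 16 of the 22 benchmarks reported below, while the 32B variant reaches frontier-level performance comparable to Gemini 3.0 Pro. Ultimately, HY-Embodied can serve as a robust "brain" for Vision-Language-Action (VLA) pipelines, and it delivers strong results in real-world physical robot control.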
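
Install the pinned Transformers commit:

    pip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a

Note: we will merge these improvements into the Transformers main branch later.

Clone the repository, install the remaining dependencies, and run the example script:

    git clone https://github.com/Tencent-Hunyuan/HY-Embodied
    cd HY-Embodied/
    pip install -r requirements.txt
    python inference.py

The example script demonstrates both single-turn generation and batch generation.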
The code automatically downloads the model tencent/HY-Embodied-0.5 from the Hugging Face Hub. Make sure you have enough disk space (8 GB) to store the model weights.
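
As an optional pre-flight check, the disk-space requirement above can be verified before the first download. The sketch below is illustrative only: `has_free_space` and `REQUIRED_GB` are hypothetical names, and checking the home directory assumes the Hub cache lives in its default location (`~/.cache/huggingface`, unless `HF_HOME` points elsewhere).

```python
import os
import shutil

REQUIRED_GB = 8  # figure quoted above for the model weights


def has_free_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Assumption: the cache sits under the home directory (the default when HF_HOME is unset).
cache_root = os.path.expanduser("~")
if has_free_space(cache_root):
    print("Enough free space for the model weights.")
else:
    print(f"Less than {REQUIRED_GB} GiB free under {cache_root}; free up space first.")
```

If the cache has been relocated, pass that directory to `has_free_space` instead of the home directory.
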

Single-turn inference:

    import os
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    # Load model & processor
    MODEL_PATH = "tencent/HY-Embodied-0.5"
    DEVICE = "cuda"
    THINKING_MODE = False
    TEMPERATURE = 0.8

    processor = AutoProcessor.from_pretrained(MODEL_PATH)

    # Load chat template if available (only matches when MODEL_PATH is a local directory)
    chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
    if os.path.exists(chat_template_path):
        with open(chat_template_path) as f:
            processor.chat_template = f.read()

    model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
    model.to(DEVICE).eval()

    # Prepare input messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./figures/example.jpg"},
                {"type": "text", "text": "Describe the image in detail."},
            ],
        }
    ]

    # Process and generate
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=THINKING_MODE,
    ).to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=32768,
            use_cache=True,
            temperature=TEMPERATURE,
            do_sample=TEMPERATURE > 0,
        )

    # Keep only the newly generated tokens (drop the prompt prefix)
    output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])

Batch inference (multiple prompts at once):

    import os
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    # Load model & processor
    MODEL_PATH = "tencent/HY-Embodied-0.5"
    DEVICE = "cuda"
    THINKING_MODE = False
    TEMPERATURE = 0.8

    processor = AutoProcessor.from_pretrained(MODEL_PATH)

    # Load chat template if available (only matches when MODEL_PATH is a local directory)
    chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
    if os.path.exists(chat_template_path):
        with open(chat_template_path) as f:
            processor.chat_template = f.read()

    model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
    model.to(DEVICE).eval()

    # Batch Inference (multiple prompts at once)
    messages_batch = [
        # Sample A: image + text
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "./figures/example.jpg"},
                    {"type": "text", "text": "Describe the image in detail."},
                ],
            }
        ],
        # Sample B: text only
        [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "How to open a fridge?"},
                ],
            }
        ],
    ]

    # Process each message independently
    all_inputs = []
    for msgs in messages_batch:
        inp = processor.apply_chat_template(
            msgs,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            enable_thinking=THINKING_MODE,
        )
        all_inputs.append(inp)

    # Left-pad and batch
    batch = processor.pad(all_inputs, padding=True, padding_side="left").to(model.device)

    with torch.no_grad():
        batch_generated_ids = model.generate(
            **batch,
            max_new_tokens=32768,
            use_cache=True,
            temperature=TEMPERATURE,
            do_sample=TEMPERATURE > 0,
        )

    # Decode: strip the padded input portion
    padded_input_len = batch["input_ids"].shape[1]
    for i, generated in enumerate(batch_generated_ids):
        out_ids = generated[padded_input_len:]
        print(f"\n--- Sample {i} ---")
        print(processor.decode(out_ids, skip_special_tokens=True))

Note: We evaluated HY-Embodied-0.5 MoT-2B against models of similar size on 22 embodied-relevant benchmarks. For detailed performance metrics and methodology, please refer to our technical report.
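Note: We observed that the small models in the Qwen3.5 series produce repetitive thinking patterns on some benchmarks, which lowers their overall results. We therefore compare against the Qwen3-VL models in our evaluation.

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |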
|---|---|---|---|---|---|
| CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |

| Benchmark | HY-Embodied 0.5 MoT-2B | Qwen3-VL 2B | Qwen3-VL 4B | RoboBrain 2.5 4B | MiMo-Embodied 7B |
|---|---|---|---|---|---|
| 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |

Note: Results for HY-Embodied-0.5 MoT-2B are reported in thinking mode; for all other models, we report the better of their non-thinking and thinking mode results.
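
For readers who want to cross-check the headline comparison, the following sketch tallies how many benchmarks HY-Embodied 0.5 MoT-2B leads. The scores are transcribed from the tables above; the names `SCORES` and `leads` are illustrative, and "leads" is assumed to mean a strictly higher score than every other listed model.

```python
# Scores transcribed from the tables above, in column order:
# HY-Embodied 0.5 MoT-2B, Qwen3-VL 2B, Qwen3-VL 4B, RoboBrain 2.5 4B, MiMo-Embodied 7B
SCORES = {
    "CV-Bench": (89.2, 80.0, 85.7, 86.9, 88.8),
    "DA-2K": (92.3, 69.5, 76.5, 79.4, 72.2),
    "ERQA": (54.5, 41.8, 47.3, 43.3, 46.8),
    "EmbSpatial-Bench": (82.8, 75.9, 80.7, 73.8, 76.2),
    "RoboBench-MCQ": (49.2, 36.9, 45.8, 44.4, 43.6),
    "RoboBench-Planning": (54.2, 36.2, 36.4, 39.2, 58.7),
    "RoboSpatial-Home": (55.7, 45.3, 63.2, 62.3, 61.8),
    "ShareRobot-Aff.": (26.8, 19.8, 25.5, 25.5, 9.0),
    "ShareRobot-Traj.": (73.3, 41.6, 62.2, 81.4, 50.6),
    "Ego-Plan2": (45.5, 35.5, 38.8, 52.6, 39.9),
    "3DSRBench": (57.0, 39.9, 43.9, 44.8, 42.0),
    "All-Angles Bench": (55.1, 42.3, 46.7, 43.8, 49.0),
    "MindCube": (66.3, 28.4, 31.0, 26.9, 36.2),
    "MMSI-Bench": (33.2, 23.6, 25.1, 20.5, 31.9),
    "RefSpatial-Bench": (45.8, 28.9, 45.3, 56.0, 48.0),
    "SAT": (76.7, 45.3, 56.7, 51.3, 78.7),
    "SIBench-mini": (58.2, 42.0, 50.9, 47.3, 53.1),
    "SITE-Bench-Image": (62.7, 52.3, 61.0, 57.9, 49.9),
    "SITE-Bench-Video": (63.5, 52.2, 58.0, 54.8, 58.9),
    "ViewSpatial": (53.1, 37.2, 41.6, 36.6, 36.1),
    "VSIBench": (60.5, 48.0, 55.2, 41.7, 48.5),
    "Where2Place": (68.0, 45.0, 59.0, 65.0, 63.6),
}


def leads(row):
    """True when the first column strictly beats every other column."""
    ours, *others = row
    return ours > max(others)


wins = [name for name, row in SCORES.items() if leads(row)]
print(f"{len(wins)} of {len(SCORES)} benchmarks led by HY-Embodied 0.5 MoT-2B")
# prints: 16 of 22 benchmarks led by HY-Embodied 0.5 MoT-2B
```

The six benchmarks where another model scores higher are RoboBench-Planning, RoboSpatial-Home, ShareRobot-Traj., Ego-Plan2, RefSpatial-Bench, and SAT.
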
If you find this work useful for your research and applications, please cite our paper using the following BibTeX:

    @article{tencent2026hyembodied05,
      title={HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents},
      author={Tencent Robotics X and HY Vision Team},
      journal={arXiv preprint arXiv:2604.07430},
      year={2026}
    }

We thank the Hugging Face community for its support and open-source contributions, which helped make HY-Embodied-0.5 possible.