MolmoWeb-8B

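Important update! At around 6 PM Pacific Time on March 29, 2026, we made several small but important updates to this HF/transformers-compatible checkpoint to ensure that its outputs exactly match those of our native model checkpoint. If you downloaded this checkpoint before then, we recommend downloading it again. See PRs 2 and 3 for details. Thank you for your understanding!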

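MolmoWeb is a family of fully open multimodal web agents. MolmoWeb agents achieve state-of-the-art results, outperforming open-source models of similar scale such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B even surpasses Set-of-Marks (SoM) agents built on much larger closed frontier models such as GPT-4o. We further demonstrate consistent gains from test-time scaling via parallel rollouts with best-of-N selection, reaching 94.7% and 60.5% pass@4 on WebVoyager and Online-Mind2Web respectively (compared with 78.2% and 35.3% pass@1).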
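To make the pass@1 and pass@4 figures concrete, here is a minimal sketch of how a pass@k score might be computed from parallel rollouts. It assumes the standard definition (a task counts as solved if at least one of k rollouts succeeds) and uses the common unbiased estimator. The function names and the sample outcomes are hypothetical, and this is not necessarily the evaluation procedure used for the reported numbers.

```python
from math import comb


def pass_at_k(num_rollouts: int, num_successes: int, k: int) -> float:
    """Estimate pass@k for one task.

    Returns the probability that at least one of k rollouts, drawn without
    replacement from num_rollouts recorded rollouts, is a success.
    """
    if not 1 <= k <= num_rollouts:
        raise ValueError("k must be between 1 and num_rollouts")
    failures = num_rollouts - num_successes
    if failures < k:
        # Fewer than k failures exist, so any draw of k includes a success.
        return 1.0
    return 1.0 - comb(failures, k) / comb(num_rollouts, k)


def mean_pass_at_k(outcomes: list[list[bool]], k: int) -> float:
    """Average pass@k over tasks; outcomes[i] holds the success flags of task i."""
    return sum(pass_at_k(len(r), sum(r), k) for r in outcomes) / len(outcomes)


# Hypothetical outcomes for three tasks with four rollouts each (not real results).
outcomes = [
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
]

print(f"pass@1 = {mean_pass_at_k(outcomes, k=1):.3f}")
print(f"pass@4 = {mean_pass_at_k(outcomes, k=4):.3f}")
```

Note that pass@k only measures whether a successful rollout exists among the k attempts. Turning that into a single answer still requires a best-of-N selection step to pick one rollout, which this sketch does not cover.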

To learn more about the MolmoWeb family, see our announcement blog post and technical report.

MolmoWeb-8B is based on the Molmo2 architecture, which uses Qwen3-8B as the language model and SigLIP 2 as the vision backbone.

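Ai2 is committed to open science. The MolmoWeb datasets are available here. All other artifacts used to create MolmoWeb (training code, evaluations, intermediate checkpoints) will be made publicly available, furthering our commitment to open-source AI development and reproducibility.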

Quick Start

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch
from jinja2 import Template

checkpoint_dir = "allenai/MolmoWeb-8B"

model = AutoModelForImageTextToText.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # we recommend the default float32 precision
    attn_implementation="sdpa",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    padding_side="left",
)


MOLMOWEB_THINK_TEMPLATE = Template(
"""
# GOAL
{{ task_description }}

# PREVIOUS STEPS
{% for action in past_actions: -%}
## Step {{ action['index'] }}
THOUGHT: {{ action['thought'] }}
ACTION: {{ action['action'] }}
{% endfor %}
# CURRENTLY ACTIVE PAGE
Page {{ page_index }}: {{ page_title }} | {{ page_url }}

# NEXT STEP

"""
)

task_description = "Tell me about the Ai2 PRIOR team's recent projects"
past_actions = []  # no previous steps on the first turn
user_message = MOLMOWEB_THINK_TEMPLATE.render(
    page_title=None,
    page_url="about:blank",
    page_index=0,
    task_description=task_description,
    past_actions=past_actions,
)
system_message = "molmo_web_think"
prompt = f"{system_message}: {user_message}"

blank_image = Image.new("RGB", (1280, 720), color="white")

image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image", "image": blank_image},
        ]
    }
]

inputs = processor.apply_chat_template(
    image_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    padding=True,
)

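# Drop token_type_ids: HF uses it to enable bidirectional attention over image tokens, but MolmoWeb is trained with causal attention only.
# Move tensors to the model's device rather than hard-coding "cuda", since device_map="auto" chooses the placement.
inputs = {k: v.to(model.device) for k, v in inputs.items() if k != "token_type_ids"}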

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200)

generated_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.decode(generated_tokens, skip_special_tokens=True))
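The example above covers only the first step, where the history is empty. For later steps, the template iterates over `past_actions` and reads the keys `index`, `thought`, and `action` from each entry. The following is a minimal sketch of how that history might be accumulated. The helper name `append_step`, the 1-based step numbering, and the placeholder strings are assumptions for illustration, not part of the released code.

```python
def append_step(past_actions, thought, action):
    """Return a new history list with one more completed step appended.

    Each entry carries the keys the prompt template reads:
    'index', 'thought', and 'action'.
    """
    step = {
        "index": len(past_actions) + 1,  # assumption: steps are numbered from 1
        "thought": thought,
        "action": action,
    }
    return past_actions + [step]


history = []
# Placeholder strings stand in for whatever the model emitted at each step.
history = append_step(history, "<thought from step 1>", "<action from step 1>")
history = append_step(history, "<thought from step 2>", "<action from step 2>")

for step in history:
    print(step["index"], step["thought"], step["action"])
```

The resulting list would then be passed as `past_actions=history` when rendering `MOLMOWEB_THINK_TEMPLATE` for the next step, presumably along with the current page title, URL, and a fresh screenshot in place of `blank_image`.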

License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines (https://allenai.org/responsible-use).

Citation

If you use this model, please cite:

arXiv:2604.08516

@misc{gupta2026molmowebopenvisualweb,
      title={MolmoWeb: Open Visual Web Agent and Open Data for the Open Web}, 
      author={Tanmay Gupta and Piper Wolters and Zixian Ma and Peter Sushko and Rock Yuren Pang and Diego Llanes and Yue Yang and Taira Anderson and Boyuan Zheng and Zhongzheng Ren and Harsh Trivedi and Taylor Blanton and Caleb Ouellette and Winson Han and Ali Farhadi and Ranjay Krishna},
      year={2026},
      eprint={2604.08516},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08516}, 
}