Molmo 7B-O

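Molmo is a family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs. Molmo achieves state-of-the-art performance among multimodal models of similar size while being fully open-source. You can find all models in the Molmo family here. Learn more about the Molmo family in our announcement blog post or the paper.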

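Molmo 7B-O is based on OLMo-7B-1024 (a preview of the next generation of OLMo models) and uses OpenAI CLIP as its vision backbone. It performs comfortably between GPT-4V and GPT-4o on both academic benchmarks and human evaluation.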

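This checkpoint is a preview of the Molmo release. All artifacts used in creating Molmo (the PixMo dataset, training code, evaluations, and intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.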

Sign up here to be notified as soon as the artifacts are released.

Quick Start

To run Molmo, first install the dependencies:

pip install einops torchvision

Then, follow these steps:

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# load the processor
processor = AutoProcessor.from_pretrained(
    'allenai/Molmo-7B-O-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# load the model
model = AutoModelForCausalLM.from_pretrained(
    'allenai/Molmo-7B-O-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate output; maximum 200 new tokens; stop generation when <|endoftext|> is generated
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# only get generated tokens; decode them to text
generated_tokens = output[0,inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# print the generated text
print(generated_text)

# >>> This photograph captures an adorable black Labrador puppy sitting on a weathered
#     wooden deck. The deck's planks, which are a mix of light and dark brown with ...

To make inference more efficient, run with autocast:

import torch

with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )

We did most of our evaluations with this setting (autocast on, but float32 weights).

To reduce memory requirements even further, the model can be run with bfloat16 weights:

model.to(dtype=torch.bfloat16)
inputs["images"] = inputs["images"].to(torch.bfloat16)
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

Note that this can sometimes change the output of the model compared to running with float32 weights.
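One reason outputs can differ is that bfloat16 keeps the same exponent range as float32 but only 7 explicit mantissa bits instead of 23, so stored values are rounded more coarsely. The sketch below illustrates that rounding with the standard library only. The helper `to_bfloat16` is a hypothetical name introduced here for illustration; it is not part of Molmo or PyTorch, and it ignores NaN and infinity handling.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float.

    bfloat16 is the upper 16 bits of an IEEE 754 float32, so we round away
    the lower 16 bits (round-to-nearest-even). NaN/inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias for round-to-nearest-even on the 16 bits that will be discarded.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

print(to_bfloat16(1.0))      # 1.0 (exactly representable)
print(to_bfloat16(1.001))    # 1.0 (the small offset is rounded away)
print(to_bfloat16(3.14159))  # 3.140625
```

Small per-value differences like these can accumulate across layers, which is consistent with the occasional output changes noted above.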

Evaluations

Model	Average Score on 11 Academic Benchmarks	Human Preference Elo Rating
Molmo 72B	81.2	1077
Molmo 7B-D	77.3	1056
Molmo 7B-O (this model)	74.6	1051
MolmoE 1B	68.6	1032
GPT-4o	78.5	1079
GPT-4V	71.1	1041
Gemini 1.5 Pro	78.3	1074
Gemini 1.5 Flash	75.1	1054
Claude 3.5 Sonnet	76.7	1069
Claude 3 Opus	66.4	971
Claude 3 Haiku	65.3	999
Qwen VL2 72B	79.4	1037
Qwen VL2 7B	73.7	1025
Intern VL2 LLAMA 76B	77.1	1018
Intern VL2 8B	69.4	953
Pixtral 12B	69.5	1016
Phi3.5-Vision 4B	59.7	982
PaliGemma 3B	50.0	937
LLAVA OneVision 72B	76.6	1051
LLAVA OneVision 7B	72.0	1024
Cambrian-1 34B	66.8	953
Cambrian-1 8B	63.4	952
xGen-MM-Interleave 4B	59.5	979
LLAVA-1.5 13B	43.9	960
LLAVA-1.5 7B	40.7	951

Benchmarks: AI2D test, ChartQA test, VQA v2.0 test, DocQA test, InfographicVQA test, TextVQA val, RealWorldQA, MMMU val, MathVista testmini, CountBenchQA, Flickr Count (a new dataset we collected that is significantly harder than CountBenchQA).
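As a quick sanity check, the statement that Molmo 7B-O sits between GPT-4V and GPT-4o can be verified directly from the table. The snippet below copies the three relevant rows by hand; the helper name `is_between` is introduced here purely for illustration.

```python
# (average score on 11 academic benchmarks, human preference Elo),
# copied from the evaluation table.
results = {
    "Molmo 7B-O": (74.6, 1051),
    "GPT-4o": (78.5, 1079),
    "GPT-4V": (71.1, 1041),
}

def is_between(value, low, high):
    """True if value lies strictly between low and high."""
    return low < value < high

metric_names = ("average score", "Elo rating")
checks = {}
for i, name in enumerate(metric_names):
    checks[name] = is_between(
        results["Molmo 7B-O"][i],
        results["GPT-4V"][i],
        results["GPT-4o"][i],
    )
    print(f"{name}: between GPT-4V and GPT-4o = {checks[name]}")
```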

FAQs

I'm getting a broadcast error when processing images!

Your image might not be in RGB format. You can convert it using the following code snippet:

from PIL import Image

image = Image.open(...)

if image.mode != "RGB":
    image = image.convert("RGB")
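The same check can be wrapped in a small helper and tried on synthetic images. This is a minimal sketch: `ensure_rgb` is a hypothetical helper name, and the assumption that the error comes from a channel count other than three (for example 1 for grayscale or 4 for RGBA) is an inference from the fix above rather than something the processor documents.

```python
from PIL import Image

def ensure_rgb(image: Image.Image) -> Image.Image:
    """Return the image unchanged if it is already RGB, else a converted copy."""
    return image if image.mode == "RGB" else image.convert("RGB")

# Synthetic 4x4 images stand in for real files in other common modes.
for mode in ("L", "RGBA", "P", "RGB"):
    sample = Image.new(mode, (4, 4))
    fixed = ensure_rgb(sample)
    print(f"{mode} ({len(sample.getbands())} band(s)) -> "
          f"{fixed.mode} ({len(fixed.getbands())} bands)")
```

Calling `ensure_rgb` on every image before `processor.process` is one way to apply the fix uniformly.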

Molmo doesn't work well with transparent images!

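We have received reports that Molmo models can struggle with transparent images. For now, we recommend adding a white or dark background to your images before passing them to the model. The code snippet below shows how to do this using the Python Imaging Library (PIL):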


import requests
from PIL import Image, ImageStat
from transformers import AutoProcessor

# Load the image
url = "..."
image = Image.open(requests.get(url, stream=True).raw)

# Convert the image to grayscale to calculate brightness
gray_image = image.convert('L')  # Convert to grayscale

# Calculate the average brightness
stat = ImageStat.Stat(gray_image)
average_brightness = stat.mean[0]  # Get the average value

# Define background color based on brightness (threshold can be adjusted)
bg_color = (0, 0, 0) if average_brightness > 127 else (255, 255, 255)

# Create a new image with the same size as the original, filled with the background color
new_image = Image.new('RGB', image.size, bg_color)

# Paste the original image on top of the background (use image as a mask if needed)
new_image.paste(image, (0, 0), image if image.mode == 'RGBA' else None)

# Now you can pass the new_image to Molmo
processor = AutoProcessor.from_pretrained(
    'allenai/Molmo-7B-O-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)
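The snippet above only uses the alpha channel as a mask when the image mode is exactly RGBA. As an alternative sketch, the image can first be converted to RGBA and then composited onto a solid background, which also covers modes such as LA. The helper name `flatten_on_background` and its default white background are assumptions made for this illustration, not part of the Molmo code.

```python
from PIL import Image

def flatten_on_background(image, bg_color=(255, 255, 255)):
    """Composite an image with transparency onto a solid RGB background."""
    rgba = image.convert("RGBA")
    # The background must also be RGBA for alpha_composite; make it opaque.
    background = Image.new("RGBA", rgba.size, tuple(bg_color) + (255,))
    return Image.alpha_composite(background, rgba).convert("RGB")

# A fully transparent red image becomes the background color,
# while a fully opaque one keeps its own color.
transparent = Image.new("RGBA", (2, 2), (255, 0, 0, 0))
opaque = Image.new("RGBA", (2, 2), (255, 0, 0, 255))
print(flatten_on_background(transparent).getpixel((0, 0)))  # (255, 255, 255)
print(flatten_on_background(opaque).getpixel((0, 0)))       # (255, 0, 0)
```

The brightness-based choice between a black and a white background shown earlier can be passed in through `bg_color`.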

License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use. For more information, please see our Responsible Use Guidelines.