LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

LocateAnything teaser

🔗 Quick Links

🚀 Live Demo: LocateAnything (Hugging Face Spaces)
💻 GitHub Code: NVlabs/Eagle/Embodied
📄 Paper: arXiv:2605.27365

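Model Overview

Description:

LocateAnything is a vision-language model for fast, high-quality visual grounding. It enables precise object localization, dense detection, and point-based grounding across diverse domains in enterprise intelligence and physical AI. Its general-purpose design supports referring expression grounding, multi-object detection, GUI element grounding, and text localization, and it performs well in complex, cluttered scenes.

Its core innovation, Parallel Box Decoding (PBD), predicts complete bounding box coordinates in a single parallel step instead of decoding them autoregressively token by token, which improves efficiency while preserving geometric consistency. Compared with prior approaches, this raises throughput by up to 2.5x.

The model is trained on a large-scale multi-domain dataset (12 million images, more than 138 million queries, and 785 million bounding boxes) spanning natural scenes, robotics, driving, GUI interaction, and document understanding. It serves as a foundation for general-purpose multimodal perception and has been integrated into NVIDIA's frontier production-grade vision-language models, such as Nemotron 3 Nano Omni, where it supports grounding, GUI understanding, and multimodal agentic capabilities.

LocateAnything is part of the Eagle VLM model family. This model is for research and development only.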

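Demo video

License/Terms of Use:

This model is released under the NVIDIA License for non-commercial use only. The license permits use, reproduction, and modification for academic and non-profit research purposes. Commercial use is prohibited, except by NVIDIA and its affiliates. Redistributions must retain this license and all applicable copyright and attribution notices. The model is provided **"as is", without warranty of any kind**, and users assume all associated risks.

This model was built with components from third-party models, each under its own license:

Language model: Qwen2.5-3B-Instruct (Qwen Research License)
Vision encoder: MoonViT-SO-400M (MIT License)

The model is improved using Qwen.

Deployment Geography:

Global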

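Use Case:

LocateAnything-3B is intended for developers and researchers building vision-language models and related applications that require fast, precise visual grounding from natural language instructions.

Supported use cases include:

Open-set, common, and long-tail object detection
Dense multi-object detection in complex scenes
Phrase and referring expression grounding
Automated dataset labeling and annotation (e.g., detection, grounding, pointing)
GUI element grounding for interactive and agentic systems
Robotics and autonomous driving perception
Document understanding, layout grounding, and OCR localization
Industrial inspection, surveillance, and remote sensing applications
Point-based localization and fine-grained spatial reasoning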

Release Date:

GitHub [May 26, 2026]: https://github.com/NVlabs/Eagle/tree/main/Embodied
Hugging Face [May 26, 2026]: https://huggingface.co/nvidia/LocateAnything-3B
Demo [May 26, 2026]: https://huggingface.co/spaces/nvidia/LocateAnything
Web page [May 26, 2026]: https://research.nvidia.com/labs/lpr/locate-anything/
Technical report [May 26, 2026]: https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf

References:

Wang et al., LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding, NVIDIA Technical Report, 2026.
Kimi Team, Kimi-VL Technical Report, arXiv:2504.07491, 2025.
Qwen Team, Qwen2.5: A Party of Foundation Models, Qwen blog, 2024.
Chen et al., Pix2Seq: A Language Modeling Framework for Object Detection, International Conference on Learning Representations (ICLR), 2022.
Jiang et al., Detect Anything via Next Point Prediction, arXiv:2510.12798, 2025.
Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, arXiv:2303.05499, 2023.
Lin et al., Microsoft COCO: Common Objects in Context, European Conference on Computer Vision (ECCV), 2014.
Gupta et al., LVIS: A Dataset for Large Vocabulary Instance Segmentation, Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Li et al., ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use, ACM International Conference on Multimedia (ACM MM), 2025.

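Model Architecture:

Architecture Type: Transformer-based vision-language model (VLM).

Network Architecture: Native-resolution vision-language model with the following components:

Vision encoder: MoonViT
Language model: Qwen2.5-3B-Instruct
Multimodal projector: MLP projector
Output formulation: block-based structure for visual grounding

Number of model parameters: 3 billion.

LocateAnything extends a vision-language model with Parallel Box Decoding (PBD), a block-level multi-token prediction framework for efficient visual grounding. Instead of generating coordinates autoregressively, the model predicts complete bounding boxes and points as parallel structured units, which improves decoding efficiency while preserving geometric consistency. The architecture jointly optimizes next-token prediction and multi-token prediction to balance reasoning ability against parallel inference performance. Training follows a four-stage pipeline: initial multimodal knowledge adaptation on captioning, visual question answering (VQA), optical character recognition (OCR), and related data, followed by fine-tuning for grounding and dense-scene grounding.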

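Input:

Input Type(s): Image and text.

Input Format(s):

Image: RGB image input at native resolution.
Text: Natural language prompts or task templates, such as object categories, referring expressions, GUI instructions, OCR/layout requests, or pointing queries.

Input Parameters:

Image: two-dimensional (2D)
Text: one-dimensional (1D)

Other Properties Related to Input:

Product image resolution of up to 2.5K is supported.
Prompt length of up to 24K tokens is supported.
The maximum sequence length used in the detection and grounding training stages is 25,600 tokens.
Inference supports up to 8,192 newly generated tokens.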

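Output:

Output Type(s): Text.

Output Format(s):

Text: A model-generated token sequence containing semantic labels and structured coordinate tokens, such as bounding boxes (<box> x1, y1, x2, y2 </box>) and points (<box> x, y </box>).

Output Parameters:

Text: one-dimensional (1D)
Bounding boxes/points: two-dimensional (2D) spatial coordinates

Other Properties Related to Output:

Output is organized into fixed-length blocks (length 6), including semantic blocks, box blocks, negative blocks, and end blocks.
Box blocks encode quantized spatial coordinates with structured tokens; unused positions are padded with <null>.
Fast mode predicts box-aligned blocks in parallel; slow mode uses autoregressive decoding; hybrid mode defaults to parallel decoding and falls back to autoregressive decoding when it encounters irregular formatting or spatial ambiguity.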
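
The block layout described above can be illustrated with a minimal sketch. It assumes that a box block holds six tokens (`<box>`, four coordinate tokens, `</box>`), that a point block holds four tokens, and that `<null>` padding goes at the end of the block. The token order, the padding position, and all helper names here are assumptions for illustration, not details confirmed by this card. Coordinates are quantized to integers in [0, 1000], the range the worker's parser expects.

```python
BLOCK_LEN = 6    # fixed block length stated in the model card
NULL = "<null>"  # padding token for unused positions


def quantize(value: float, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer in [0, bins]."""
    q = round(value / size * bins)
    return min(max(q, 0), bins)


def pack_block(tokens: list[str]) -> list[str]:
    """Pad a token list with <null> up to BLOCK_LEN (end padding is an assumption)."""
    if len(tokens) > BLOCK_LEN:
        raise ValueError(f"a block holds at most {BLOCK_LEN} tokens, got {len(tokens)}")
    return tokens + [NULL] * (BLOCK_LEN - len(tokens))


def box_block(x1: int, y1: int, x2: int, y2: int) -> list[str]:
    """Hypothetical box block: exactly fills the six positions."""
    return pack_block(["<box>", f"<{x1}>", f"<{y1}>", f"<{x2}>", f"<{y2}>", "</box>"])


def point_block(x: int, y: int) -> list[str]:
    """Hypothetical point block: four tokens plus two <null> paddings."""
    return pack_block(["<box>", f"<{x}>", f"<{y}>", "</box>"])
```

Because every block has the same length, a decoder can emit one block per parallel step without knowing in advance whether it describes a box or a point.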

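Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves higher training and inference performance than CPU-only solutions.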

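Software Integration:

Runtime Engine(s):

Transformers. The inference setup uses standard VLM generation with BF16 precision and a KV cache. TensorRT, TensorRT-LLM, and Triton are not yet supported.

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere (e.g., A100)
NVIDIA Blackwell
NVIDIA Hopper (e.g., H100)
NVIDIA Lovelace (e.g., L40, RTX 4090)

With additional model optimization (including quantization, compression, or distillation), deployment on embedded platforms such as NVIDIA Thor is possible. Other architectures may be supported depending on available memory, precision support, and software configuration.

Supported Operating System(s):

Linux

Integrating foundation and fine-tuned models into AI systems requires additional testing with use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both the unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.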

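Model Version(s):

LocateAnything-3B: a 3-billion-parameter research model variant, evaluated in hybrid mode by default. The same model architecture supports fast, hybrid, and slow inference modes.

LocateAnything-3B can be integrated into systems that need spatial grounding from natural language, such as GUI agents, robotics/embodied agents, document understanding pipelines, OCR/text localization, and open-world detection workflows.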

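Training, Testing, and Evaluation Datasets:

Data Modality:

Image and text.

Training Data Size:

Image Training Data Size:

1 million to 1 billion images: 12 million unique images.

Text Training Data Size:

1 billion to 10 trillion tokens: derived from approximately 140 million natural language queries.

Data Collection Method by Dataset:

Hybrid: human, automated
Data is collected from human-curated open-source datasets and from automated ingestion of publicly available data sources.

Labeling Method by Dataset:

Hybrid: human, synthetic, automated
Labels include original human or open-source annotations, as well as model-assisted and synthetic annotations generated with Qwen3-VL, Molmo, SAM 3, and Rex-Omni, with automated post-verification.

Properties: The training data consists of supervised fine-tuning (SFT) datasets with multimodal inputs, primarily image-text pairs and structured annotations such as bounding boxes, points, and negative samples.

The data covers multiple domains, including grounding, open-world grounding, general and dense object detection, scene text detection, GUI understanding and grounding, document layout understanding, and OCR.

Modalities include visual inputs (images) and natural language queries or instructions. The dataset is derived from a variety of publicly available academic datasets together with model-assisted and synthetic annotations. It may contain publicly available and potentially copyrighted content; users are responsible for ensuring compliance with applicable usage rights.

Language content consists mainly of short, task-oriented natural language expressions, such as object categories, referring expressions, GUI instructions, OCR queries, and grounding prompts, typically in English.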

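Evaluation Dataset:

Data Collection Method by Dataset:

Hybrid: human, automated

Labeling Method by Dataset:

Hybrid: human, synthetic, automated

Properties: The evaluation datasets consist of publicly available benchmarks covering visual grounding, object detection, document understanding, scene text detection, and graphical user interface (GUI) tasks. Modalities include image inputs paired with natural language queries, along with structured annotations such as bounding boxes and points.

The evaluation suite covers box-level and point-level grounding tasks. Across datasets, roughly 48K images are used for box evaluation and roughly 35K for point evaluation. The datasets span natural scenes, documents, aerial imagery, and human-centric interaction, enabling a comprehensive assessment of grounding accuracy and robustness.

Evaluation queries are typically short, task-oriented natural language expressions, such as referring phrases, object categories, and grounding prompts.

Performance is measured with box-based F1 at intersection-over-union (IoU) thresholds of 0.5 and 0.95, and with mean IoU for detection, layout, and optical character recognition (OCR) tasks. Point-level grounding is evaluated by whether the predicted point falls inside the ground-truth segmentation mask or bounding box. Inference efficiency is reported in boxes per second (BPS) on a single NVIDIA H100 GPU with batch size 1.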
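
To make the metrics above concrete, here is a minimal sketch of IoU, a box-based F1 at a given IoU threshold, and a point-in-box check. The card does not specify the matching protocol, so the greedy one-to-one matching below is an assumption, as are the function names; the official evaluation scripts may differ.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: list[Box], gts: list[Box], threshold: float = 0.5) -> float:
    """F1 with greedy one-to-one matching (assumed protocol, not the official one)."""
    unmatched = list(gts)
    true_pos = 0
    for pred in preds:
        # Pick the best remaining ground-truth box that clears the threshold.
        best_idx, best_iou = -1, threshold
        for idx, gt in enumerate(unmatched):
            value = iou(pred, gt)
            if value >= best_iou:
                best_idx, best_iou = idx, value
        if best_idx >= 0:
            unmatched.pop(best_idx)
            true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(gts) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    # Convention chosen here: no predictions and no ground truth counts as perfect.
    return 2 * true_pos / denom if denom else 1.0


def point_in_box(point: tuple[float, float], box: Box) -> bool:
    """True if the predicted point lies inside the ground-truth box."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]
```

At a threshold of 0.95 a prediction must overlap its ground-truth box almost exactly, so that setting rewards tight localization far more than the 0.5 setting does.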

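Quantitative Evaluation Benchmarks

General Object Detection

Dense Object Detection

GUI Understanding

Layout Grounding and OCR

Referring Expression Grounding

Pointing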

Inference:

Test Hardware: H100

We recommend max_new_tokens=8192 and generation_mode="hybrid" to avoid truncated responses and to balance speed and accuracy.

Installation

pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5

PyTorch (torch) must be installed separately to match your CUDA version. See pytorch.org/get-started.

Optional: MagiAttention (Hopper / Blackwell GPUs only; recommended for faster MTP inference):

git clone https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .

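If MagiAttention is installed, the model automatically uses it for efficient MTP block-diffusion attention. If it is not installed, the model falls back to PyTorch SDPA, which is fully functional but slower during MTP decoding.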
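
The model selects the attention backend on its own, but it can be useful to log which path a given environment will take before benchmarking. The sketch below only checks whether a module is importable. The import name `magi_attention` is an assumption about how the MagiAttention package is exposed, and `pick_attention_backend` is a hypothetical helper, not part of the model code.

```python
import importlib.util


def pick_attention_backend(module_name: str = "magi_attention") -> str:
    """Return "magi" if the module can be imported, otherwise "sdpa".

    This mirrors the documented fallback for logging purposes only;
    the model performs its own check when it is loaded.
    """
    found = importlib.util.find_spec(module_name) is not None
    return "magi" if found else "sdpa"


if __name__ == "__main__":
    print("Expected attention backend:", pick_attention_backend())
```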

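Worker (recommended)

Below is a self-contained worker that loads the model once and serves perception queries through a unified predict() method plus task-specific convenience methods. You can integrate this class into any FastAPI / gRPC / Triton serving framework.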

import re
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor


class LocateAnythingWorker:
    """Stateful worker that loads the model once and serves perception queries."""

    def __init__(self, model_path: str, device: str = "cuda", dtype=torch.bfloat16):
        self.device = device
        self.dtype = dtype

        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=dtype,
            trust_remote_code=True,
        ).to(device).eval()

    @torch.no_grad()
    def predict(
        self,
        image: Image.Image,
        question: str,
        generation_mode: str = "hybrid",   # "fast" (MTP) | "slow" (NTP/AR) | "hybrid"
        max_new_tokens: int = 2048,  # the Inference section recommends 8192 to avoid truncated responses
        temperature: float = 0.7,
        verbose: bool = True,
    ) -> dict:
        messages = [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]}
        ]

        text = self.processor.py_apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        images, videos = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text], images=images, videos=videos, return_tensors="pt"
        ).to(self.device)

        pixel_values = inputs["pixel_values"].to(self.dtype)
        input_ids = inputs["input_ids"]
        image_grid_hws = inputs.get("image_grid_hws", None)

        response = self.model.generate(
            pixel_values=pixel_values,
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"],
            image_grid_hws=image_grid_hws,
            tokenizer=self.tokenizer,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            generation_mode=generation_mode,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            verbose=verbose,
        )

        result = {"answer": response[0] if isinstance(response, tuple) else response}
        if isinstance(response, tuple) and len(response) >= 3:
            result["history"] = response[1]
            result["stats"] = response[2]
        return result

    # ---- Convenience methods for each task ----

    def detect(self, image: Image.Image, categories: list[str], **kwargs) -> dict:
        """Object detection / document layout analysis."""
        cats = "</c>".join(categories)
        prompt = f"Locate all the instances that matches the following description: {cats}."
        return self.predict(image, prompt, **kwargs)

    def ground_single(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding — single instance."""
        prompt = f"Locate a single instance that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_multi(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding — multiple instances."""
        prompt = f"Locate all the instances that match the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_text(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Text grounding."""
        prompt = f"Please locate the text referred as {phrase}."
        return self.predict(image, prompt, **kwargs)

    def detect_text(self, image: Image.Image, **kwargs) -> dict:
        """Scene text detection."""
        prompt = "Detect all the text in box format."
        return self.predict(image, prompt, **kwargs)

    def ground_gui(self, image: Image.Image, phrase: str, output_type: str = "box", **kwargs) -> dict:
        """GUI grounding (box or point)."""
        if output_type == "point":
            prompt = f"Point to: {phrase}."
        else:
            prompt = f"Locate the region that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def point(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Pointing."""
        prompt = f"Point to: {phrase}."
        return self.predict(image, prompt, **kwargs)

    # ---- Utility: parse model output ----

    @staticmethod
    def parse_boxes(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate bounding boxes.

        Coordinates in model output are normalized integers in [0, 1000].
        """
        boxes = []
        for m in re.finditer(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>", answer):
            x1, y1, x2, y2 = [int(g) for g in m.groups()]
            boxes.append({
                "x1": x1 / 1000 * image_width,
                "y1": y1 / 1000 * image_height,
                "x2": x2 / 1000 * image_width,
                "y2": y2 / 1000 * image_height,
            })
        return boxes

    @staticmethod
    def parse_points(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate points."""
        points = []
        for m in re.finditer(r"<box><(\d+)><(\d+)></box>", answer):
            x, y = int(m.group(1)), int(m.group(2))
            points.append({
                "x": x / 1000 * image_width,
                "y": y / 1000 * image_height,
            })
        return points

Usage example

worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")

# Object Detection
result = worker.detect(img, ["person", "car", "bicycle"])
print("Detection:", result["answer"])

# Phrase Grounding (multiple)
result = worker.ground_multi(img, "people wearing red shirts")
print("Grounding:", result["answer"])

# Scene Text Detection
result = worker.detect_text(img)
print("Text Detection:", result["answer"])

# Pointing
result = worker.point(img, "the traffic light")
print("Pointing:", result["answer"])

# GUI Grounding (point)
result = worker.ground_gui(img, "the search button", output_type="point")
print("GUI Point:", result["answer"])

# Parse structured output into pixel coordinates
w, h = img.size
boxes = LocateAnythingWorker.parse_boxes(result["answer"], w, h)
points = LocateAnythingWorker.parse_points(result["answer"], w, h)
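
# The parsers return float pixel coordinates. Before cropping or drawing, it can
# help to round them, clamp them to the image, and drop empty boxes. The helper
# below is a hypothetical sketch, not part of the released worker; it assumes the
# dict layout produced by parse_boxes ("x1", "y1", "x2", "y2").

```python
def clamp_boxes(boxes: list[dict], image_width: int, image_height: int) -> list[dict]:
    """Round boxes to integer pixels, clamp to image bounds, drop empty ones."""
    cleaned = []
    for box in boxes:
        xs = sorted((box["x1"], box["x2"]))  # tolerate swapped corners
        ys = sorted((box["y1"], box["y2"]))
        x1 = min(max(round(xs[0]), 0), image_width)
        x2 = min(max(round(xs[1]), 0), image_width)
        y1 = min(max(round(ys[0]), 0), image_height)
        y2 = min(max(round(ys[1]), 0), image_height)
        if x2 > x1 and y2 > y1:  # keep only boxes with positive area
            cleaned.append({"x1": x1, "y1": y1, "x2": x2, "y2": y2})
    return cleaned
```

With the variables from the example above, `clamp_boxes(boxes, w, h)` yields integer boxes that can be passed directly to `img.crop((b["x1"], b["y1"], b["x2"], b["y2"]))`.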

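Supported tasks and prompt templates

Task	Worker method	Output	Prompt template
Object detection	`worker.detect(img, [...])`	Bounding boxes	`Locate all the instances that matches the following description: [CATEGORIES].`
Phrase grounding (single)	`worker.ground_single(img, phrase)`	Single bounding box	`Locate a single instance that matches the following description: [PHRASE].`
Phrase grounding (multiple)	`worker.ground_multi(img, phrase)`	Multiple bounding boxes	`Locate all the instances that match the following description: [PHRASE].`
Text grounding	`worker.ground_text(img, phrase)`	Bounding boxes	`Please locate the text referred as [PHRASE].`
Scene text detection	`worker.detect_text(img)`	Bounding boxes	`Detect all the text in box format.`
Document layout analysis	`worker.detect(img, [...])`	Bounding boxes	`Locate all the instances that matches the following description: [CATEGORIES].`
GUI grounding (box)	`worker.ground_gui(img, phrase, "box")`	Bounding box	`Locate the region that matches the following description: [PHRASE].`
GUI grounding (point) / pointing	`worker.ground_gui(img, phrase, "point")` / `worker.point(img, phrase)`	Point	`Point to: [PHRASE].`

[PHRASE] is a free-form natural language description; [CATEGORIES] is a comma-separated list (multiple categories can also be joined with </c>, as the worker's detect() method does).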

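Generation modes

Mode	Description	Speed	Accuracy
`fast`	MTP only; never falls back to AR	Fastest	Suitable for simple scenes
`slow`	Pure autoregressive decoding	Slowest	Most robust
`hybrid` (default)	Starts with MTP, falls back to AR for uncertain boxes, and switches back to MTP once the box is resolved	Balanced	Best overall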
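
To compare the three modes on your own images, a rough boxes-per-second (BPS) measurement can be sketched as follows. This is a simplified illustration rather than the protocol behind the reported numbers: it times whatever callable it is given, counts both boxes and points with a pattern modeled on the worker's parsers, and the names `count_boxes` and `measure_bps` are hypothetical.

```python
import re
import time

# Matches a 4-coordinate box or a 2-coordinate point, as in the worker's parsers.
BOX_OR_POINT = re.compile(r"<box>(?:(?:<\d+>){4}|(?:<\d+>){2})</box>")


def count_boxes(answer: str) -> int:
    """Count box and point blocks in a raw model answer."""
    return len(BOX_OR_POINT.findall(answer))


def measure_bps(predict_fn, n_runs: int = 3) -> float:
    """Call predict_fn n_runs times and return boxes emitted per second.

    predict_fn takes no arguments and returns the raw answer string.
    """
    total = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        total += count_boxes(predict_fn())
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")
```

With the worker defined earlier, one call per mode could look like `measure_bps(lambda: worker.predict(img, prompt, generation_mode="fast", verbose=False)["answer"])`, where `prompt` is any of the templates above. Because the worker samples with a nonzero temperature, results vary between runs, so averaging over several runs and images gives a steadier comparison.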

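Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated image or video will not blur or maintain proportions of the image subjects included.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns here.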
