Toto-2.0-2.5B

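Toto (Time Series Optimized Transformer for Observability; on observability, see https://www.datadoghq.com/knowledge-center/observability/) is a family of time series foundation models for multivariate forecasting, developed by Datadog (https://www.datadoghq.com/). Toto 2.0 is the latest release. It uses a u-μP-scaled Transformer architecture, spans 4M to 2.5B parameters, and trains every model with a single shared recipe. Forecast quality improves steadily as the parameter count grows.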

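The family sets a new state of the art on three forecasting benchmarks: BOOM (https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; GIFT-Eval (https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and TIME (https://arxiv.org/abs/2602.12147), a recent contamination-resistant benchmark.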

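📊 Performance

Pareto frontier on BOOM and GIFT-Eval: every Toto 2.0 size sits on or near the Pareto frontier on both benchmarks. The three largest models rank first, second, and third among foundation models on GIFT-Eval by CRPS. On TIME, Toto 2.0 models take the top three places on every metric, ahead of all other external foundation models evaluated.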

⚡ Quick start

Inference code is available on GitHub (https://github.com/DataDog/toto).

Installation

pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"

Inference example

import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-2.5B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)

For more examples, see the quickstart notebook and the GluonTS integration notebook.
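
The inference example returns nine quantile forecasts stacked along the leading axis. As a minimal sketch of how that output might be consumed, the snippet below picks out the median and the 0.1/0.9 quantiles, which together bound a central 80% interval. `split_forecast` is a hypothetical helper, not part of `toto2`. It relies only on indexing the first axis, so it should work on the tensor returned by `model.forecast` as well as on the nested-list stand-in used here.

```python
# Quantile levels as documented in the inference example.
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def split_forecast(quantiles, lower=0.1, upper=0.9):
    """Return (median, lower band, upper band) from a quantile-first forecast.

    Hypothetical helper: only indexes the leading quantile axis, so the
    remaining (batch, n_variates, horizon) axes are passed through untouched.
    """
    median = quantiles[QUANTILE_LEVELS.index(0.5)]
    low = quantiles[QUANTILE_LEVELS.index(lower)]
    high = quantiles[QUANTILE_LEVELS.index(upper)]
    return median, low, high


# Stand-in for model output with shape (9, batch=1, n_variates=1, horizon=3).
# The values are synthetic and chosen only so the quantiles are ordered.
fake_quantiles = [[[[float(i + t) for t in range(3)]]] for i in range(9)]

median, low, high = split_forecast(fake_quantiles)
print("median:", median)
print("80% interval:", low, "to", high)
```

The median is a natural point forecast, and the gap between the lower and upper bands gives a rough measure of how uncertain the model is at each step.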

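💾 Available checkpoints

All five Toto 2.0 sizes share the same training recipe; choose the size that fits your accuracy/latency budget. Latency is the forward-pass time for a single 1024-step forecast at batch size 8 on a single A100.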

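Model	Parameters	Weights (fp32)	Latency	Recommended use
Toto-2.0-4m	4M	16 MB	~3.8 ms	Edge/CPU deployment; the tightest latency or memory budgets.
Toto-2.0-22m	22M	84 MB	~5.0 ms	Efficient default; matches or beats Toto 1.0 quality with about 7x fewer parameters.
Toto-2.0-313m	313M	1.2 GB	~15.4 ms	Strong general-purpose checkpoint; a top-three foundation model on GIFT-Eval.
Toto-2.0-1B	1B	3.9 GB	~20.9 ms	Best quality/cost trade-off for production workloads.
Toto-2.0-2.5B	2.5B	9.1 GB	~36.2 ms	Highest accuracy; the top-ranked foundation model on every benchmark.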
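
Since forecast quality is reported to improve with parameter count, one simple way to apply the table is to take the largest checkpoint whose listed latency fits a given budget. The sketch below does that. `CHECKPOINT_LATENCY_MS` and `pick_checkpoint` are hypothetical names, and the numbers are copied from the table, so they only hold for the stated setup (single A100, batch size 8, 1024-step forecast). Latency on other hardware or workloads would need to be measured.

```python
# Latency figures copied from the checkpoint table, smallest model first.
CHECKPOINT_LATENCY_MS = [
    ("Toto-2.0-4m", 3.8),
    ("Toto-2.0-22m", 5.0),
    ("Toto-2.0-313m", 15.4),
    ("Toto-2.0-1B", 20.9),
    ("Toto-2.0-2.5B", 36.2),
]


def pick_checkpoint(budget_ms):
    """Return the largest checkpoint whose listed latency fits the budget, or None."""
    chosen = None
    for name, latency_ms in CHECKPOINT_LATENCY_MS:
        if latency_ms <= budget_ms:
            chosen = name  # later entries are larger models
    return chosen


print(pick_checkpoint(10.0))  # Toto-2.0-22m
print(pick_checkpoint(1.0))   # None: no listed checkpoint is this fast
```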

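✨ Key features

Zero-shot forecasting: Forecast without fine-tuning on your specific time series.
Multivariate support: Handles multiple variates efficiently using alternating time/variate attention.
Probabilistic forecasts: A quantile output head produces point forecasts and uncertainty estimates.
Decoder-only architecture: Supports variable forecast horizons and context lengths.
u-μP scaling: A single training recipe transfers cleanly across all five sizes (4M → 2.5B).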
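
Quantile forecasts like these are usually scored with the quantile (pinball) loss, and averaging that loss over a grid of levels is a common approximation to CRPS, up to a constant factor. The sketch below is a plain-Python illustration of that idea for a single time step. Whether the benchmarks cited here compute CRPS in exactly this form is an assumption, and `pinball_loss` and `mean_pinball_loss` are hypothetical helper names.

```python
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def pinball_loss(actual, predicted, level):
    """Quantile (pinball) loss for one prediction at one quantile level."""
    error = actual - predicted
    # Under-prediction is weighted by `level`, over-prediction by `1 - level`.
    return max(level * error, (level - 1.0) * error)


def mean_pinball_loss(actual, predicted_quantiles, levels=QUANTILE_LEVELS):
    """Average the pinball loss over all quantile levels for one time step."""
    losses = [
        pinball_loss(actual, predicted, level)
        for predicted, level in zip(predicted_quantiles, levels)
    ]
    return sum(losses) / len(losses)


# Illustrative numbers only: nine predicted quantiles for one time step.
predicted = [8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0]
print(mean_pinball_loss(10.0, predicted))
```

Lower is better: a forecast whose quantiles sit close to the observed value and are well calibrated scores lower than one that is biased or overly wide.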

🏗️ Architecture

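🔗 Additional resources

Technical report
Blog post
GitHub repository (https://github.com/DataDog/toto)
Toto 2.0 collection: all five base checkpoints
BOOM dataset: Datadog's observability time series benchmark
Toto 1.0 weights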

📖 Citation

@misc{khwaja2026toto20timeseries,
      title={Toto 2.0: Time Series Forecasting Enters the Scaling Era}, 
      author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
      year={2026},
      eprint={2605.20119},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.20119}, 
}