Trinity-Large-Thinking

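Introduction

Trinity-Large-Thinking is the reasoning-optimized variant in Arcee AI's Trinity-Large family. It is a sparse Mixture-of-Experts (MoE) model with roughly 398B total parameters, of which about 13B are active per token. Built on Trinity-Large-Base, it was post-trained with extended chain-of-thought reasoning and agentic reinforcement learning, and it delivers state-of-the-art results on agentic benchmarks while retaining strong general capabilities.

Before producing its final response, Trinity-Large-Thinking generates an explicit reasoning trace wrapped in a <think>...</think> block. This thinking process is critical to the model's performance: in multi-turn conversations and agentic loops, the thinking tokens must be kept in context for the model to function correctly.

Try the model at chat.arcee.ai.

For more details on the training of Trinity Large, see the technical report.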

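Key Highlights

Agent-first design: built for tool calling, multi-step planning, and agentic workflows
State-of-the-art agentic performance: 94.7% on τ²-Bench, 91.9% on PinchBench, and 98.2% on LiveCodeBench
Native reasoning traces: extended chain-of-thought via <think>...</think> blocks
Compatible with major agent frameworks: works out of the box with OpenClaw and Hermes Agent
Available directly on OpenRouter: full reasoning and tool-calling support through the API, with no extra setup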

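Model Variants

The Trinity Large family consists of four checkpoints:

Trinity-Large-Thinking (this release): reasoning-optimized, agentic post-training with extended chain-of-thought
Trinity-Large-Preview: lightly post-trained, chat-ready instruct model (no reasoning_content)
Trinity-Large-TrueBase: pre-anneal pretraining checkpoint at 10T tokens
Trinity-Large-Base: the full 17T-token pretrained base model, including mid-training annealing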

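Architecture

Trinity-Large-Thinking uses the same sparse MoE architecture as Trinity-Large-Preview.

Hyperparameter	Value
Total parameters	~398B
Active parameters per token	~13B
Experts	256 (1 shared)
Active experts	4
Routing strategy	4-of-256 (1.56% sparsity)
Dense layers	6
Pretraining context length	8,192
Context length after extension	512k
Architecture	Sparse MoE (AfmoeForCausalLM)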
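The sparsity figures in the table can be checked with a few lines of arithmetic. The sketch below only re-derives ratios from the numbers listed above; the constant and function names are illustrative and not part of any Trinity tooling.

```python
# Figures copied from the architecture table above (approximate values).
TOTAL_PARAMS = 398e9    # ~398B total parameters
ACTIVE_PARAMS = 13e9    # ~13B active parameters per token
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 4


def routing_fraction(active: int, total: int) -> float:
    """Fraction of experts selected for each token."""
    return active / total


expert_fraction = routing_fraction(ACTIVE_EXPERTS, NUM_EXPERTS)
param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"experts routed per token: {expert_fraction:.2%}")      # 1.56%
print(f"parameters active per token: {param_fraction:.1%}")    # 3.3%
```

The parameter share (about 3.3%) is higher than the expert share (about 1.56%), presumably because components such as the dense layers are used for every token regardless of routing.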

Benchmarks

[Figure: benchmark chart]

Benchmark	Trinity-Large-Thinking	Opus-4.6	GLM-5	MiniMax-M2.7	Kimi-K2.5
IFBench	52.3	53.1	72.3	75.7	70.2
GPQA-Diamond	76.3	89.2	81.6	86.2	86.9
Tau2-Airline	88.0	82.0	80.5	80.0	80.0
Tau2-Telecom	94.7	92.1	98.2	84.8	95.9
PinchBench	91.9	93.3	86.4	89.8	84.8
AIME25	96.3	99.8	93.3	80.0	96.3
BFCLv4	70.1	77.0	70.8	70.6	68.3
MMLU-Pro	83.4	89.1	85.8	80.8	87.1
SWE-bench Verified*	63.2	75.6	72.8	75.4	70.8

*All models were evaluated with mini-swe-agent-v2.

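Thinking in Context: Important Usage Note

Trinity-Large-Thinking generates a reasoning trace inside a <think>...</think> block before producing its final response.

This means:

Multi-turn conversations: when building a chat application, include the full assistant response (thinking + answer) in the conversation history for subsequent turns.
Agentic loops: when Trinity-Large-Thinking serves as the core of an agent (OpenClaw, Hermes Agent, or a custom agent), make sure the tool-calling loop keeps the reasoning content in the message history between steps.
Context window management: the 512k extended context window accommodates long reasoning chains across many agent steps. If you must truncate history, prefer dropping older turns entirely over stripping thinking tokens from recent turns.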
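The truncation advice above can be sketched as a small helper that removes whole turns from the front of the history and never touches the reasoning on the turns it keeps. This is a minimal illustration under stated assumptions: the function name drop_oldest_turns is hypothetical, and the budget is counted in messages for simplicity, whereas a real deployment would more likely count tokens.

```python
from typing import Dict, List

Message = Dict[str, object]


def drop_oldest_turns(messages: List[Message], max_messages: int) -> List[Message]:
    """Trim history by removing whole turns from the front, oldest first.

    A "turn" here is a user message plus every assistant/tool message that
    follows it, up to the next user message. System messages are always kept,
    and reasoning fields on the surviving messages are left untouched.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Group the remaining messages into turns that each start at a user message.
    turns: List[List[Message]] = []
    for m in rest:
        if m["role"] == "user" or not turns:
            turns.append([m])
        else:
            turns[-1].append(m)

    def size() -> int:
        return len(system) + sum(len(t) for t in turns)

    # Drop the oldest turns until the budget is met, always keeping the latest turn.
    while len(turns) > 1 and size() > max_messages:
        turns.pop(0)

    return system + [m for t in turns for m in t]


history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer", "reasoning": "old thoughts"},
    {"role": "user", "content": "second question"},
    {"role": "assistant", "content": "second answer", "reasoning": "recent thoughts"},
]
trimmed = drop_oldest_turns(history, max_messages=3)
print([m["role"] for m in trimmed])
```

In this example the first user/assistant pair is removed as a unit, while the most recent assistant message keeps its reasoning field intact.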

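How Thinking Works

The model reasons internally before producing a response. When served through vLLM, the reasoning is separated into a dedicated field in the API response: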

// API response structure
{
  "message": {
    "role": "assistant",
    "reasoning": "The user wants flight information. I need to determine the date for next Tuesday, search for flights SFO → JFK, and filter by price < $300.",
    "content": "\n",
    "tool_calls": [{
      "function": {
        "name": "search_flights",
        "arguments": "{\"origin\": \"SFO\", \"destination\": \"JFK\", \"date\": \"2026-04-07\", \"max_price\": 300}"
      }
    }]
  }
}

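Preserving Reasoning in Multi-Turn Conversations

When building a multi-turn agentic loop, you must pass the reasoning field back in the assistant message of subsequent requests. The chat template reads this field and re-wraps it in <think>...</think> tags during tokenization, which preserves the model's chain of thought across turns.

⚠️ Field name compatibility: in vLLM's OpenAI-compatible chat API, input support for reasoning_content may vary by version, and some versions accept only reasoning (see the related issue). For maximum compatibility in multi-turn loops, pass the assistant reasoning back as reasoning. If your SDK exposes reasoning_content on the response, map it to reasoning when appending the assistant turn.

What happens if the reasoning field is omitted entirely? If an assistant message has no reasoning field at all (neither reasoning nor reasoning_content), or if content is null, the model may lose its earlier chain-of-thought context. For simple tasks this may matter little, but in complex multi-step agentic tasks the model may produce malformed tool calls (for example, tool-call XML appearing inside the reasoning field instead of in the structured tool_calls). For best results, always preserve the reasoning field and use "" rather than null as the content on tool-call turns.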
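The two recommendations above (send reasoning rather than reasoning_content, and send "" rather than null) can be applied in one place before an assistant message is appended to the history. The helper below is a sketch that operates on plain dictionaries; normalize_assistant_message is a hypothetical name, not part of vLLM or the OpenAI SDK.

```python
from typing import Any, Dict


def normalize_assistant_message(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an assistant message that is safe to send back.

    - reasoning_content is renamed to reasoning (an existing reasoning value wins)
    - a content of None becomes ""
    """
    msg = dict(raw)
    legacy = msg.pop("reasoning_content", None)
    if not msg.get("reasoning") and legacy:
        msg["reasoning"] = legacy
    if msg.get("content") is None:
        msg["content"] = ""
    return msg


example = normalize_assistant_message({
    "role": "assistant",
    "content": None,
    "reasoning_content": "Look up the customer first.",
    "tool_calls": [],
})
print(example)
```

Centralizing the mapping this way keeps the rest of the loop independent of which field name a particular server version returns.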

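Training Configuration

Pretraining

Training tokens: 17 trillion
Data partner: Datology

Post-Training

Instruction tuning and agentic reinforcement learning (with extended chain-of-thought)
Trained on tool-calling trajectories, multi-step agentic tasks, and reasoning chains

Infrastructure

Hardware: 2,048 NVIDIA B300 GPUs
Parallelism: HSDP + expert parallelism
Compute partner: Prime Intellect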

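Usage

Running the Model

vLLM (recommended for agentic deployments)
Transformers
API

vLLM

Supported in vLLM 0.11.1 and later. To serve the model with both reasoning and tool calling for agentic use: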

vllm serve arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

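In this configuration:

--reasoning-parser deepseek_r1: parses the <think>...</think> reasoning block and exposes it through the reasoning field of the API response
--tool-call-parser qwen3_coder: parses structured tool calls from the model output into the OpenAI-compatible tool_calls array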
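To make the role of the reasoning parser concrete, the sketch below separates a <think>...</think> block from the rest of a raw completion. It is an illustration of the idea only and is not vLLM's deepseek_r1 parser, which also has to deal with cases this sketch ignores, such as streaming output. The name split_reasoning and the sample text are hypothetical.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Split raw model text into (reasoning, content).

    If no complete <think>...</think> block is present, the reasoning is empty
    and the text is returned unchanged as content.
    """
    match = THINK_BLOCK.search(raw)
    if match is None:
        return "", raw
    reasoning = match.group(1).strip()
    # Content is everything outside the first think block.
    content = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, content


reasoning, content = split_reasoning(
    "<think>The user greeted me; reply briefly.</think>Hello! How can I help?"
)
print(repr(reasoning))
print(repr(content))
```

When the model is served with the flags above, this separation is already done by the server, so client code should read the parsed fields rather than re-parse the text.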

Single-Turn Example

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="arcee-ai/Trinity-Large-Thinking",
    messages=[
        {"role": "user", "content": "What's the weather like in Paris?"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }],
)

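msg = response.choices[0].message

# Access reasoning (thinking) content. Depending on the vLLM version, the field
# is exposed as "reasoning" or "reasoning_content", so check both.
reasoning = getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None)

# Access final response or tool calls
content = msg.content
tool_calls = msg.tool_calls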

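Multi-Turn Agentic Loop Example

The key pattern: after each turn, append the full assistant response (including the reasoning) to the message history, then append the tool results, and send the updated history for the next turn.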

import json
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
MODEL = "arcee-ai/Trinity-Large-Thinking"

tools = [
    {"type": "function", "function": {
        "name": "get_customer_by_email",
        "description": "Look up a customer by email.",
        "parameters": {"type": "object", "properties": {"email": {"type": "string"}}, "required": ["email"]}
    }},
    {"type": "function", "function": {
        "name": "cancel_subscription",
        "description": "Cancel a subscription. Requires customer_id.",
        "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["customer_id"]}
    }}
]

def execute_tool(name, arguments):
    """Simulate tool execution; replace with real implementations."""
    args = json.loads(arguments)
    if name == "get_customer_by_email":
        return json.dumps({"customer_id": "C2001", "name": "Jane Doe", "plan": "Premium", "status": "active"})
    elif name == "cancel_subscription":
        return json.dumps({"success": True, "message": f"Subscription cancelled for {args['customer_id']}"})
    # Unknown tool: return an error payload instead of None so the tool message content stays a string
    return json.dumps({"error": f"Unknown tool: {name}"})

messages = [
    {"role": "system", "content": "You are a helpful customer service agent."},
    {"role": "user", "content": "I want to cancel my subscription. My email is jane@example.com"}
]

# Agent loop
while True:
    response = client.chat.completions.create(
        model=MODEL, messages=messages, tools=tools,
        tool_choice="auto", temperature=0, max_tokens=1000
    )
    msg = response.choices[0].message

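    # Build assistant message and PRESERVE the reasoning field.
    # Use "" rather than None as content on tool-call turns.
    assistant_msg = {"role": "assistant", "content": msg.content or ""}
    if getattr(msg, "reasoning_content", None):
        assistant_msg["reasoning"] = msg.reasoning_content  # ← critical for multi-turn
    elif getattr(msg, "reasoning", None):
        # Some vLLM versions expose the field as "reasoning" on the response
        assistant_msg["reasoning"] = msg.reasoning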
    if msg.tool_calls:
        assistant_msg["tool_calls"] = [
            {"id": tc.id, "type": "function", "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
            for tc in msg.tool_calls
        ]
    messages.append(assistant_msg)

    # If no tool calls, model gave its final response — done
    if not msg.tool_calls:
        print(f"Final response: {msg.content}")
        break

    # Execute tool calls and append results
    for tc in msg.tool_calls:
        result = execute_tool(tc.function.name, tc.function.arguments)
        print(f"  Tool: {tc.function.name}({tc.function.arguments}) → {result}")
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

Expected output:

  Tool: get_customer_by_email({"email": "jane@example.com"}) → {"customer_id": "C2001", ...}
  Tool: cancel_subscription({"customer_id": "C2001", ...}) → {"success": true, ...}
  Final response: Your subscription has been cancelled successfully.

The key line is:

assistant_msg["reasoning"] = msg.reasoning_content  # ← pass reasoning back as "reasoning"

The OpenAI SDK exposes this field as reasoning_content on the response object, but vLLM 0.18+ expects reasoning on input messages. The chat template then automatically re-wraps it in <think>...</think> tags.

Transformers

Use the main branch of transformers, or pass trust_remote_code=True with a released version.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/Trinity-Large-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_k=50,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

API

OpenRouter

Available on OpenRouter with full reasoning and tool-calling support:

curl -X POST "https://openrouter.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arcee-ai/trinity-large-thinking",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
  }'

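Multi-turn with OpenRouter: OpenRouter returns reasoning in a reasoning_details object (its unified reasoning format). For multi-turn conversations, pass reasoning_details back unchanged in the assistant message of subsequent requests; OpenRouter handles the model-specific upstream translation (for Trinity, it is sent upstream as reasoning_content on the assistant turn). For debugging, you can enable echo to inspect the upstream API call: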

{"debug": {"echo_upstream_body": true}}

See the OpenRouter debugging documentation for details.
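As a rough sketch of the multi-turn pattern described for OpenRouter, the snippet below builds the request body for a second turn and copies reasoning_details through unchanged. No network call is made. The helper name append_assistant_turn and the sample first_reply are hypothetical, and the shape of the reasoning_details entries is illustrative only; the code never inspects them.

```python
import json
from typing import Any, Dict, List


def append_assistant_turn(
    messages: List[Dict[str, Any]], response_message: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Return a new history with the assistant turn appended.

    reasoning_details and tool_calls are copied through unchanged when present.
    """
    turn: Dict[str, Any] = {
        "role": "assistant",
        "content": response_message.get("content") or "",
    }
    for key in ("reasoning_details", "tool_calls"):
        if response_message.get(key):
            turn[key] = response_message[key]
    return messages + [turn]


# Hypothetical response message from a first request.
first_reply = {
    "role": "assistant",
    "content": "Try the High Line and a Broadway show.",
    "reasoning_details": [{"type": "reasoning.text", "text": "List popular activities."}],
}
history = [{"role": "user", "content": "What are some fun things to do in New York?"}]
history = append_assistant_turn(history, first_reply)
history.append({"role": "user", "content": "Which of those are free?"})

payload = {"model": "arcee-ai/trinity-large-thinking", "messages": history}
body = json.dumps(payload)  # request body for the next chat completions call
print(body)
```

Treating reasoning_details as an opaque value avoids coupling client code to the internal format, which is the point of passing it back as-is.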

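Agentic Use Cases

Trinity-Large-Thinking is optimized for deployment as the reasoning core of AI agent systems. It has been evaluated and performs strongly in the following settings:

OpenClaw

Trinity-Large-Thinking works as a drop-in "brain" for OpenClaw agents. Its native tool-calling format is compatible with OpenClaw's execution loop, and its strong reasoning supports reliable completion of multi-step tasks ranging from email triage and code generation to meeting scheduling. Our 91.9% PinchBench score reflects its performance on real-world OpenClaw tasks.

Deployment note for OpenClaw users: OpenClaw preserves the full assistant turn between steps. For compatibility with vLLM in public deployments, make sure the assistant reasoning is passed on the next turn as reasoning (not only reasoning_content), and keep the assistant content non-null (an empty string is fine). If your SDK emits reasoning_content, add a small adapter at the gateway that maps it to reasoning before the request is sent to vLLM.

Hermes Agent

Compatible with the Hermes Agent framework from Nous Research. Trinity-Large-Thinking's reasoning traces fit naturally with Hermes's skill-learning loop: the model's explicit chain of thought makes skill extraction more reliable, and its strong tool calling integrates directly through the Hermes tool-use protocol.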

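Custom Agent Loops

For custom implementations, the key integration pattern is:

1. Send the user message with tool definitions
2. Receive a response containing reasoning + content + tool_calls
3. Execute the tool calls
4. Append the full assistant response (reasoning + content + tool calls) and the tool results to the message history
5. Send the updated history back for the next step
6. Repeat until the model produces a final response with no tool calls

Important: step 4 must include the reasoning field in the assistant message. The chat template reads this field and re-wraps it in <think>...</think> tags during tokenization. Omitting it degrades multi-step performance; see Preserving Reasoning in Multi-Turn Conversations for details.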
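The six steps can also be written as a transport-agnostic loop that runs without a server, which is convenient for unit-testing the history handling. Everything named here (run_agent_loop, fake_model, fake_tool, max_steps) is a hypothetical stand-in; in a real deployment call_model would wrap a chat completions request and run_tool would dispatch to actual tools. The max_steps guard is an added safety assumption, not something the pattern above requires.

```python
import json
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def run_agent_loop(
    call_model: Callable[[List[Message]], Message],
    run_tool: Callable[[str, str], str],
    messages: List[Message],
    max_steps: int = 8,
) -> List[Message]:
    """Drive the six-step pattern until the model stops calling tools."""
    for _ in range(max_steps):
        reply = call_model(messages)                      # steps 1-2, and 5 on later passes
        assistant = {"role": "assistant", "content": reply.get("content") or ""}
        if reply.get("reasoning"):
            assistant["reasoning"] = reply["reasoning"]   # step 4: keep the reasoning
        if reply.get("tool_calls"):
            assistant["tool_calls"] = reply["tool_calls"]
        messages.append(assistant)
        if not reply.get("tool_calls"):
            return messages                               # step 6: final response
        for call in reply["tool_calls"]:                  # step 3: execute tools
            fn = call["function"]
            result = run_tool(fn["name"], fn["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    raise RuntimeError("agent loop did not finish within max_steps")


def fake_model(history: List[Message]) -> Message:
    """Scripted stand-in for the model: one tool call, then a final answer."""
    if not any(m["role"] == "tool" for m in history):
        return {"content": "", "reasoning": "Need the weather first.", "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {"name": "get_weather", "arguments": json.dumps({"location": "Paris"})},
        }]}
    return {"content": "It is sunny in Paris.", "reasoning": "Tool returned sunny."}


def fake_tool(name: str, arguments: str) -> str:
    """Scripted stand-in for a tool backend."""
    return json.dumps({"location": json.loads(arguments)["location"], "forecast": "sunny"})


final_history = run_agent_loop(fake_model, fake_tool, [{"role": "user", "content": "Weather in Paris?"}])
print([m["role"] for m in final_history])
```

Injecting the model and tool callables keeps the history logic testable in isolation, so a regression that drops the reasoning field can be caught without a running model.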

License

Trinity-Large-Thinking is released under the Apache License, Version 2.0.

Citation

If you use this model, please cite:

@misc{singh2026arceetrinity,
  title        = {Arcee Trinity Large Technical Report},
  author       = {Varun Singh and Lucas Krauss and Sami Jaghouar and Matej Sirovatka and Charles Goddard and Fares Obied and Jack Min Ong and Jannik Straube and Fern and Aria Harley and Conner Stewart and Colin Kealty and Maziyar Panahi and Simon Kirsten and Anushka Deshpande and Anneketh Vij and Arthur Bresnu and Pranav Veldurthi and Raghav Ravishankar and Hardik Bishnoi and DatologyAI Team and Arcee AI Team and Prime Intellect Team and Mark McQuade and Johannes Hagemann and Lucas Atkins},
  year         = {2026},
  eprint       = {2602.17004},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  doi          = {10.48550/arXiv.2602.17004},
  url          = {https://arxiv.org/abs/2602.17004}
}