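Trinity-Large-Thinking is the reasoning-optimized variant of Arcee AI's Trinity-Large family. It is a sparse Mixture-of-Experts (MoE) model with 398B total parameters, of which roughly 13B are active per token. Built on Trinity-Large-Base, it was post-trained with extended chain-of-thought reasoning and agentic reinforcement learning, and it delivers state-of-the-art performance on agentic benchmarks while retaining strong general capabilities.
Before producing its final response, Trinity-Large-Thinking generates an explicit reasoning trace wrapped in <think>...</think> blocks. This thinking process is critical to the model's performance: in multi-turn conversations and agentic loops, thinking tokens must be kept in context for the model to function correctly.
Try the model at chat.arcee.ai.
For more details on Trinity Large training, see the technical report.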
Extended chain-of-thought reasoning via <think>...</think> blocks
The Trinity Large family consists of four checkpoints:
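Trinity-Large-Thinking shares the same sparse MoE architecture as Trinity-Large-Preview.
| Hyperparameter | Value |
|---|---|
| Total parameters | ~398B |
| Active parameters per token | ~13B |
| Experts | 256 (1 shared) |
| Active experts | 4 |
| Routing strategy | 4-of-256 (1.56% sparsity) |
| Dense layers | 6 |
| Pretraining context length | 8,192 |
| Context length after extension | 512k |
| Architecture | Sparse MoE (AfmoeForCausalLM) |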
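
The routing figures in the table can be checked with a little arithmetic. The sketch below recomputes the expert sparsity (4 of 256) and, as a rough extra figure that the table does not state, the share of parameters active per token from the approximate 13B and 398B values. The variable names are illustrative.

```python
# Figures taken from the hyperparameter table (parameter counts are approximate).
total_params = 398e9
active_params = 13e9
num_experts = 256
active_experts = 4

# Fraction of experts selected per token: 4 / 256.
expert_sparsity_pct = active_experts / num_experts * 100

# Rough share of all parameters used per token (derived here, not listed in the table).
active_share_pct = active_params / total_params * 100

print(f"expert sparsity: {expert_sparsity_pct:.2f}%")
print(f"active parameter share: {active_share_pct:.1f}%")
```

The parameter share comes out higher than the expert sparsity, which is plausibly because the shared expert, the dense layers, and the non-expert weights run for every token; that explanation is an assumption, not a statement from the model card.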

| Benchmark | Trinity-Large-Thinking | Opus-4.6 | GLM-5 | MiniMax-M2.7 | Kimi-K2.5 |
|---|---|---|---|---|---|
| IFBench | 52.3 | 53.1 | 72.3 | 75.7 | 70.2 |
| GPQA-Diamond | 76.3 | 89.2 | 81.6 | 86.2 | 86.9 |
| Tau2-Airline | 88.0 | 82.0 | 80.5 | 80.0 | 80.0 |
| Tau2-Telecom | 94.7 | 92.1 | 98.2 | 84.8 | 95.9 |
| PinchBench | 91.9 | 93.3 | 86.4 | 89.8 | 84.8 |
| AIME25 | 96.3 | 99.8 | 93.3 | 80.0 | 96.3 |
| BFCLv4 | 70.1 | 77.0 | 70.8 | 70.6 | 68.3 |
| MMLU-Pro | 83.4 | 89.1 | 85.8 | 80.8 | 87.1 |
| SWE-bench Verified* | 63.2 | 75.6 | 72.8 | 75.4 | 70.8 |
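*All models were evaluated with mini-swe-agent-v2.
Trinity-Large-Thinking produces its reasoning trace inside <think>...</think> blocks before generating the final response.
This means:
The model reasons internally before it responds. When served through vLLM, the reasoning is separated into a dedicated field in the API response: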
// API response structure
{
  "message": {
    "role": "assistant",
    "reasoning": "The user wants flight information. I need to determine the date for next Tuesday, search for flights SFO → JFK, and filter by price < $300.",
    "content": "\n",
    "tool_calls": [{
      "function": {
        "name": "search_flights",
        "arguments": "{\"origin\": \"SFO\", \"destination\": \"JFK\", \"date\": \"2026-04-07\", \"max_price\": 300}"
      }
    }]
  }
}
When building a multi-turn agentic loop, you must pass the reasoning field back in the assistant message of subsequent requests. The chat template reads this field and re-wraps it in <think>...</think> tags during tokenization, which preserves the model's chain of thought across turns.
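⚠️ Field name compatibility: In the vLLM OpenAI-compatible chat API, input support for reasoning_content may vary by version, and some versions accept only reasoning (see the related issue). For maximum compatibility in multi-turn loops, pass the assistant's reasoning back in the reasoning field. If your SDK exposes reasoning_content in responses, map it to reasoning when you append the assistant turn.
What if the reasoning field is omitted entirely? If an assistant message has no reasoning field at all (neither reasoning nor reasoning_content), or if content is null, the model may lose its earlier chain-of-thought context. For simple tasks this may matter little, but in complex multi-step agentic tasks the model may emit malformed tool calls (for example, tool-call XML appearing inside the reasoning field instead of in structured tool_calls). For best results, always preserve the reasoning field and use "" rather than null as the content on tool-call turns.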
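
The mapping described above can be sketched as a small helper that turns a response message into the assistant dict for the next request. This is a minimal sketch, not part of any SDK: the function name assistant_turn_from_response is hypothetical, and a SimpleNamespace stands in for the SDK's response object so the example runs without a server.

```python
from types import SimpleNamespace


def assistant_turn_from_response(msg):
    """Convert a response message object into an assistant dict for the next request."""
    # Accept either field name on the response; which one is present depends on the server and SDK.
    reasoning = getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None)

    # Use "" rather than None so tool-call turns never carry null content.
    turn = {"role": "assistant", "content": msg.content or ""}
    if reasoning:
        turn["reasoning"] = reasoning  # always send it back under "reasoning"

    tool_calls = getattr(msg, "tool_calls", None)
    if tool_calls:
        turn["tool_calls"] = [
            {
                "id": tc.id,
                "type": "function",
                "function": {"name": tc.function.name, "arguments": tc.function.arguments},
            }
            for tc in tool_calls
        ]
    return turn


# Stand-in for an SDK response message, for illustration only.
fake_msg = SimpleNamespace(
    content=None,
    reasoning_content="Look up the customer first.",
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(
                name="get_customer_by_email",
                arguments='{"email": "jane@example.com"}',
            ),
        )
    ],
)
turn = assistant_turn_from_response(fake_msg)
print(turn)
```

Reading the field with getattr avoids an AttributeError when the response carries only one of the two names.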
Supported in vLLM 0.11.1+. For agentic use with both reasoning and tool calling:
vllm serve arcee-ai/Trinity-Large-Thinking \
  --dtype bfloat16 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
This configuration:
--reasoning-parser deepseek_r1: parses <think>...</think> reasoning blocks and exposes them through the reasoning field in the API response
--tool-call-parser qwen3_coder: parses structured tool calls from the model output into the OpenAI-compatible tool_calls array
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="arcee-ai/Trinity-Large-Thinking",
    messages=[
        {"role": "user", "content": "What's the weather like in Paris?"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }],
)
# Access reasoning (thinking) content
reasoning = response.choices[0].message.reasoning_content
# Access final response or tool calls
content = response.choices[0].message.content
tool_calls = response.choices[0].message.tool_calls
Key pattern: after each turn, append the full assistant response (including its reasoning) to the message history, then append the tool results, and send the updated history on the next turn.
import json
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
MODEL = "arcee-ai/Trinity-Large-Thinking"
tools = [
    {"type": "function", "function": {
        "name": "get_customer_by_email",
        "description": "Look up a customer by email.",
        "parameters": {"type": "object", "properties": {"email": {"type": "string"}}, "required": ["email"]}
    }},
    {"type": "function", "function": {
        "name": "cancel_subscription",
        "description": "Cancel a subscription. Requires customer_id.",
        "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["customer_id"]}
    }}
]
def execute_tool(name, arguments):
    """Simulate tool execution; replace with real implementations."""
    args = json.loads(arguments)
    if name == "get_customer_by_email":
        return json.dumps({"customer_id": "C2001", "name": "Jane Doe", "plan": "Premium", "status": "active"})
    elif name == "cancel_subscription":
        return json.dumps({"success": True, "message": f"Subscription cancelled for {args['customer_id']}"})
messages = [
    {"role": "system", "content": "You are a helpful customer service agent."},
    {"role": "user", "content": "I want to cancel my subscription. My email is jane@example.com"}
]
# Agent loop
while True:
    response = client.chat.completions.create(
        model=MODEL, messages=messages, tools=tools,
        tool_choice="auto", temperature=0, max_tokens=1000
    )
    msg = response.choices[0].message
    # Build the assistant message and PRESERVE the reasoning field.
    # Use "" rather than None as content on tool-call turns.
    assistant_msg = {"role": "assistant", "content": msg.content or ""}
    if msg.reasoning_content:
        assistant_msg["reasoning"] = msg.reasoning_content  # ← critical for multi-turn
    if msg.tool_calls:
        assistant_msg["tool_calls"] = [
            {"id": tc.id, "type": "function", "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
            for tc in msg.tool_calls
        ]
    messages.append(assistant_msg)
    # If there are no tool calls, the model gave its final response and the loop is done
    if not msg.tool_calls:
        print(f"Final response: {msg.content}")
        break
    # Execute tool calls and append results
    for tc in msg.tool_calls:
        result = execute_tool(tc.function.name, tc.function.arguments)
        print(f" Tool: {tc.function.name}({tc.function.arguments}) → {result}")
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
Expected output:
Tool: get_customer_by_email({"email": "jane@example.com"}) → {"customer_id": "C2001", ...}
Tool: cancel_subscription({"customer_id": "C2001", ...}) → {"success": true, ...}
Final response: Your subscription has been cancelled successfully.
The key line is:
assistant_msg["reasoning"] = msg.reasoning_content  # ← pass reasoning back as "reasoning"
The OpenAI SDK exposes this field as reasoning_content on the response object, but vLLM 0.18+ expects reasoning in input messages. The chat template then automatically re-wraps it in <think>...</think> tags.
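
Before sending a long history, it can help to scan it for assistant turns that break the guidance above. The following is a hedged sketch of such a check: find_history_problems is a hypothetical helper, not part of vLLM or the OpenAI SDK. It is also only a heuristic, because an assistant turn may legitimately have no reasoning if the model produced none.

```python
def find_history_problems(messages):
    """Return (index, description) pairs for assistant turns that may hurt multi-turn quality."""
    problems = []
    for i, m in enumerate(messages):
        if m.get("role") != "assistant":
            continue
        # Null or missing content should be sent as an empty string.
        if m.get("content") is None:
            problems.append((i, 'content is null; use "" instead'))
        # Reasoning should travel under the "reasoning" key.
        if "reasoning" not in m:
            if "reasoning_content" in m:
                problems.append((i, "reasoning_content should be sent as reasoning"))
            else:
                problems.append((i, "no reasoning field"))
    return problems


# Illustrative history with one problematic assistant turn and one good one.
history = [
    {"role": "user", "content": "Cancel my subscription."},
    {"role": "assistant", "content": None, "reasoning_content": "Need the customer id."},
    {"role": "assistant", "content": "", "reasoning": "Now cancel."},
]
for index, issue in find_history_problems(history):
    print(index, issue)
```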
Use the main transformers branch, or pass trust_remote_code=True with a released version.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "arcee-ai/Trinity-Large-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and move the token ids to the model's device
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_k=50,
    top_p=0.95
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Available on OpenRouter with full reasoning and tool-calling support:
curl -X POST "https://openrouter.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arcee-ai/trinity-large-thinking",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
  }'
Multi-turn with OpenRouter: OpenRouter returns reasoning in a reasoning_details object (its unified reasoning format). For multi-turn conversations, pass reasoning_details back unchanged in the assistant message of subsequent requests; OpenRouter handles the model-specific upstream conversion (for Trinity, it is sent upstream as reasoning_content on the assistant turn). To debug, enable echo to inspect the upstream API call:
{"debug": {"echo_upstream_body": true}}
See the OpenRouter debugging documentation for details.
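
The OpenRouter multi-turn pattern can be sketched without any network call. The example below assumes the response body follows the usual choices[0].message shape and that the debug option is merged into the request body. The helper name append_openrouter_turn and the contents of the sample reasoning_details value are made up for illustration; the code treats reasoning_details as opaque and copies it back untouched.

```python
import json


def append_openrouter_turn(messages, response_body):
    """Append the assistant turn from a response body, keeping reasoning_details unchanged."""
    msg = response_body["choices"][0]["message"]
    turn = {"role": "assistant", "content": msg.get("content") or ""}
    # Pass reasoning_details back exactly as received; do not rename or rewrite it.
    if msg.get("reasoning_details") is not None:
        turn["reasoning_details"] = msg["reasoning_details"]
    if msg.get("tool_calls"):
        turn["tool_calls"] = msg["tool_calls"]
    messages.append(turn)
    return messages


# Hypothetical response body; the reasoning_details payload is a placeholder.
fake_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "Central Park is a good start.",
            "reasoning_details": [{"type": "reasoning.text", "text": "placeholder reasoning"}],
        }
    }]
}

messages = [{"role": "user", "content": "What are some fun things to do in New York?"}]
append_openrouter_turn(messages, fake_response)
messages.append({"role": "user", "content": "Which of those are free?"})

# Follow-up request body, with upstream echo enabled for debugging.
payload = {
    "model": "arcee-ai/trinity-large-thinking",
    "messages": messages,
    "debug": {"echo_upstream_body": True},
}
print(json.dumps(payload, indent=2))
```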
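Trinity-Large-Thinking is optimized for deployment as the reasoning core of AI agent systems. It has been evaluated and performs strongly in the following settings:
Trinity-Large-Thinking works as a drop-in "brain" for OpenClaw agents. Its native tool-calling format is compatible with OpenClaw's execution loop, and its strong reasoning supports reliable completion of multi-step tasks ranging from email triage and code generation to meeting scheduling. Our 91.9% PinchBench score reflects its performance on real-world OpenClaw tasks.
Deployment note for OpenClaw users: OpenClaw preserves full assistant turns between steps. For compatibility with vLLM in public deployments, make sure the assistant's reasoning is passed to the next turn in the reasoning field (not only reasoning_content), and keep the assistant content non-null (an empty string is fine). If your SDK emits reasoning_content, add a small adapter at the gateway that maps it to reasoning before requests are sent to vLLM.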
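
A gateway adapter of the kind mentioned above could look roughly like this. It is a sketch under assumptions: the function name normalize_for_vllm is hypothetical, the request is assumed to be an OpenAI-style chat payload already parsed into a dict, and how it is wired into a real gateway is left out.

```python
import copy


def normalize_for_vllm(payload):
    """Return a copy of a chat request with assistant turns adjusted for vLLM input."""
    out = copy.deepcopy(payload)  # leave the caller's payload untouched
    for m in out.get("messages", []):
        if m.get("role") != "assistant":
            continue
        # Move reasoning_content to reasoning unless reasoning is already set.
        legacy = m.pop("reasoning_content", None)
        if legacy and not m.get("reasoning"):
            m["reasoning"] = legacy
        # Replace null or missing content with an empty string.
        if m.get("content") is None:
            m["content"] = ""
    return out


# Illustrative incoming request from an SDK that emits reasoning_content.
incoming = {
    "model": "arcee-ai/Trinity-Large-Thinking",
    "messages": [
        {"role": "user", "content": "Triage my inbox."},
        {"role": "assistant", "content": None, "reasoning_content": "List unread mail first."},
    ],
}
outgoing = normalize_for_vllm(incoming)
print(outgoing["messages"][1])
```

Working on a deep copy keeps the adapter free of side effects, so the original request can still be logged or retried as received.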
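Compatible with the Hermes Agent framework from Nous Research. Trinity-Large-Thinking's reasoning traces fit naturally with Hermes's skill-learning loop: the model's clear chain of thought makes skill extraction more reliable, and its strong tool-calling ability integrates directly through the Hermes tool-use protocol.
For custom implementations, the key integration pattern is: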
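a response with reasoning + content + tool_calls
Important: Step 4 must include the reasoning field in the assistant message. The chat template reads this field and re-wraps it in <think>...</think> tags during tokenization. Omitting it degrades multi-step performance; see Preserving reasoning in multi-turn conversations for details.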
Trinity-Large-Thinking is released under the Apache License, Version 2.0.
If you use this model, please cite:
@misc{singh2026arceetrinity,
title = {Arcee Trinity Large Technical Report},
author = {Varun Singh and Lucas Krauss and Sami Jaghouar and Matej Sirovatka and Charles Goddard and Fares Obied and Jack Min Ong and Jannik Straube and Fern and Aria Harley and Conner Stewart and Colin Kealty and Maziyar Panahi and Simon Kirsten and Anushka Deshpande and Anneketh Vij and Arthur Bresnu and Pranav Veldurthi and Raghav Ravishankar and Hardik Bishnoi and DatologyAI Team and Arcee AI Team and Prime Intellect Team and Mark McQuade and Johannes Hagemann and Lucas Atkins},
year = {2026},
eprint = {2602.17004},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
doi = {10.48550/arXiv.2602.17004},
url = {https://arxiv.org/abs/2602.17004}
}