/ˈɑː.pri.əl/
Apriel-1.6-15B-Thinker is the updated multimodal reasoning model in ServiceNow's Apriel SLM series, built on top of Apriel-1.5-15B-Thinker. Apriel-1.6 delivers substantially improved text and image reasoning, achieving competitive performance against models 10x its size. Like its predecessor, it benefits from extensive continual pretraining across both text and image domains. We additionally performed post-training focused on supervised fine-tuning (SFT) and reinforcement learning (RL). Apriel-1.6 achieves frontier performance without sacrificing reasoning token efficiency: compared to Apriel-1.5-15B-Thinker, the model improves or maintains task performance while cutting reasoning token usage by more than 30%.
Highlights
Structured output markers (`<tool_calls>`, `</tool_calls>`, `[BEGIN FINAL RESPONSE]`, `<|end|>`) make output parsing easier. See our blog post for more details.
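As a quick illustration (a minimal sketch; the helper name is ours, not an official API), the tool-call payloads can be pulled out with a standard regular expression over these markers:

```python
import re

# Sketch: extract tool-call payloads emitted between the
# <tool_calls> ... </tool_calls> markers. Illustrative only.
def extract_tool_calls(output: str) -> list[str]:
    return re.findall(r"<tool_calls>(.*?)</tool_calls>", output, re.DOTALL)
```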
Text benchmarks included in the Artificial Analysis Index v3.0 use the scores reported by Artificial Analysis. All other benchmarks are internal evaluations.
| Category | Benchmark | Apriel-1.6-15B-Thinker | Apriel-1.5-15B-Thinker | GPT OSS 120B | DeepSeek R1 0528 | Gemini 2.5 Flash (Sep) | GPT 5 mini (high) | Claude 4.5 Sonnet (thinking) | o3-mini (high) |
|---|---|---|---|---|---|---|---|---|---|
| | Average Score** | 53.62 | 46.56 | 52.56 | 51.92 | 50.71 | 62.58 | 60.37 | 48.85 |
| Function Calling | BFCL v3 only | 63.50 | 51.88 | 50.62 | 39.75 | 39.75 | 17.62 | - | 50 |
| | Tau2 bench Telecom | 69 | 57.8 | 66 | 37 | 32 | 68 | 50.8 | 31 |
| | Tau2 bench Retail | 66.67 | 46.78 | 61.4 | 59.94 | 61.69 | 73.39 | 69.8 | 75.73 |
| | Tau2 bench Airline | 58 | 52 | 45.3 | 47.33 | 56.66 | 59.33 | 58 | 61.33 |
| | ComplexFuncBench | 33.2 | 19 | 24.6 | 24.2 | 26.3 | 37.5 | 24.6 | 18.9 |
| Instruction Following | Agent IF | 57.2 | 55 | 54.20 | 52.20 | 49.70 | 57.60 | 54.50 | 54.90 |
| | Multi IF | 83.34 | 76.91 | 82.95 | 73.76 | 82.49 | 85.37 | 84.32 | 87.28 |
| | Multi-Challenge | 46.15 | 41.39 | 46.90 | 44.50 | 49.08 | 57.90 | 42.49 | 38.46 |
| | IF Bench | 69 | 62 | 69 | 40 | 50 | 75 | 57 | 70.07 |
| Math | AIME 25 | 88 | 88 | 93 | 76 | 73 | 91 | 88 | 86.67 |
| Coding | Struct Eval | 79 | 48.50 | 71 | 73 | 70 | 69.92 | 76 | 73 |
| | LCB | 81 | 73 | 88 | 77 | 70 | 84 | 71 | 73 |
| | SciCode | 37 | 35 | 39 | 40 | 41 | 39 | 45 | 40 |
| Agentic | DeepresearchBench | 36.47 | 32.73 | 36.30 | 34.19 | 38.15 | - | - | 33.40 |
| | GAIA | 40 | 30.91 | 21.21 | 32.12 | 47.88 | 65.45 | 69.09 | 23.03 |
| | Work-Arena L1 | 59.1 | 51.5 | 50.9 | 63.9 | 51.8 | 65.5 | 62.7 | 52.4 |
| | OS World Small | 16.70 | 13.90 | 16.70 | 25 | 19.40 | 22.20 | 30.60 | 19.40 |
| | SWE Bench Verified | 23 | 16 | 31 | 29.60 | 34.20 | 61 | 64.2 | 22.60 |
| | Terminal Bench | 14 | 10 | 22 | 15 | 13 | 31 | 33 | 5.67 |
| | Aider Polyglot | 37.68 | 26.37 | 42 | 71.40 | 40 | 71.60 | 78 | 60.40 |
| Knowledge | MMLU Pro | 79 | 77 | 81 | 85 | 83 | 84 | 88 | 80 |
| Creative Writing | Creative writing v3 / EQ Bench | 59.73 | 60.24 | 53.70 | 79.40 | 74.25 | 75.25 | 80.70 | 30.40 |
| Other | GPQA Diamond | 73 | 71 | 78 | 81 | 79 | 83 | 83 | 77 |
| | HLE | 10 | 12 | 18.5 | 14.9 | 11.1 | 19.7 | 17.3 | 12.3 |
| Long Context | AA LCR | 50* | 20 | 51 | 55 | 62 | 68 | 66 | 30*** |
* This score was obtained with DCA enabled. Without it, the model scores 36.
** The average score is computed over all benchmarks except BFCL v3 Only and DeepResearchBench, since scores for these two are not available for some models.
*** The AA LCR score for o3-mini-high is a predicted value based on its AA index score.
| Benchmark | Apriel-1.6-15B-Thinker | Apriel-1.5-15B-Thinker | GPT-5 (high) | GLM-4.5V (Thinking) | Gemini 2.5 Flash (high) | Claude Sonnet 3.7 (Thinking) | GPT-5 (Minimal) | Grok 4 Fast (Thinking) |
|---|---|---|---|---|---|---|---|---|
| MMMU (validation) | 72 | 70.22 | 81.33 | 74.33 | 70.66 | 73.66 | 66.66 | 70.11 |
| MMMU-PRO (10 choice) | 60.28 | 55.38 | 74.73 | 64.16 | 67.86 | 64.50 | 66.06 | 61.61 |
| MMMU-PRO (Vision Only) | 52.89 | 48.21 | 66.93 | 61.50 | 56.76 | 60.11 | 57.68 | 22.94 |
| LogicVista | 58.61 | 58.39 | 69.35 | 63.53 | 63.75 | 69.12 | 44.51 | 47.42 |
| MathVision | 60.85 | 50.99 | 67.10 | 59.53 | 59.21 | 50.32 | 35.52 | 48.35 |
| MathVista | 79.90 | 75.50 | 83.30 | 83.60 | 78.50 | 74.60 | 61.20 | 68.20 |
| MathVerse (Vision Dominant) | 66.75 | 58.38 | 79.82 | 68.65 | 70.68 | 56.09 | 39.84 | 54.69 |
| MathVerse (Text Dominant) | 79.06 | 76.40 | 84.64 | 77.41 | 78.80 | 69.28 | 43.78 | 72.20 |
| MMStar | 70.66 | 67.73 | 77.74 | 74.46 | 73.86 | 70 | 63.60 | 64.80 |
| CharXiv (descriptive) | 89.85 | 88.20 | 91.25 | 90.80 | 83.60 | 93.27 | 82.45 | 68.15 |
| CharXiv (reasoning) | 56.00 | 50.10 | 71.50 | 63.00 | 56.50 | 70.90 | 52.80 | 33.50 |
| AI2D Test | 86.04 | 82.87 | 90.05 | 87.75 | 82.09 | 84.19 | 85.16 | 81.86 |
| BLINK | 63.96 | 58.71 | 70.22 | 66.59 | 65.64 | 64.49 | 64.59 | 54.39 |
The Apriel family of models is designed for a variety of general-purpose instruction tasks.
They are not recommended for use in safety-critical applications without human oversight, or in scenarios that require guaranteed factual accuracy.
pip install transformers

The following snippet demonstrates how to use the model via the transformers library's `generate` function:
# Tested with transformers==4.48
import re
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText
# Load model
model_id = "ServiceNow-AI/Apriel-1.6-15b-Thinker"
model = AutoModelForImageTextToText.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Example 1: Text-only prompt
chat = [
{
"role": "user",
"content": [
{"type": "text", "text": "What is the capital for France?"},
],
}
]
inputs = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
inputs.pop("token_type_ids", None)
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
generated_ids = output_ids[:, inputs['input_ids'].shape[1]:]
output = processor.decode(generated_ids[0], skip_special_tokens=True)
response = re.findall(r"
$$BEGIN FINAL RESPONSE$$
(.*?)(?:<\|end\|>)", output, re.DOTALL)[0].strip()
print("Text-only Response:", response)
# Example 2: Image understanding
url = "https://picsum.photos/id/237/200/300"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
chat = [
{
"role": "user",
"content": [
{"type": "text", "text": "Which animal is this?"},
{"type": "image"},
],
}
]
prompt = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
generated_ids = output_ids[:, inputs['input_ids'].shape[1]:]
output = processor.decode(generated_ids[0], skip_special_tokens=True)
response = re.findall(r"
$$BEGIN FINAL RESPONSE$$
(.*?)(?:<\|end\|>)", output, re.DOTALL)[0].strip()
print("Image Response:", response)
We recommend setting the sampling temperature to 0.6. The model's response should begin with `Here are my reasoning steps:\n`; this is already implemented in the default chat template, which has the following structure:

<|begin_system|>
You are a thoughtful, systematic AI assistant from ServiceNow Language Models (SLAM) lab. Analyze each question carefully, present your reasoning step-by-step, then provide the final response after the marker [BEGIN FINAL RESPONSE].
<|begin_user|>
# user message here
<|begin_assistant|>
Here are my reasoning steps:
# thoughts here
[BEGIN FINAL RESPONSE]
# assistant response here
<|end|>

The model will first generate its thinking process, then the final response, which begins with [BEGIN FINAL RESPONSE]. The following snippet shows how the chat template is applied:
from transformers import AutoTokenizer
model_name = "ServiceNow-AI/Apriel-1.6-15b-Thinker"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# prepare the model input
custom_system_prompt = "Answer like a pirate."
prompt = "You are an expert assistant in the implementation of customer experience management aspect of retail applications \n \nYou will be using Python as the programming language. \n \nYou will utilize a factory design pattern for the implementation and following the dependency inversion principle \n \nYou will modify the implementation based on user requirements. \n \nUpon user request, you will add, update, and remove the features & enhancements in the implementation provided by you. \n \nYou will ask whether the user wants to refactor the provided code or needs a sample implementation for reference. Upon user confirmation, I will proceed accordingly. \n \n**Guidelines:** \n 1. **User Requirements:** \n - You have to ask users about their requirements, clarify the user expectations, and suggest the best possible solution by providing examples of Python code snippets. \n - Ask users about which type of reports they need to assess the AI model's performance, accuracy, and reliability. \n - After providing the solution, you have to ask the user about the trial of the solution and modify the solution based on the user feedback. \n \n 2. **Libraries/Frameworks:** \n - You will be utilizing Python as a programming language. \n - You will be using Flask framework for REST APIS implementation \n \n 3. **Communication Gesture:** \n - Your conversation with the user should be interactive, supportive, courageous, and professional. \n - You have to break down the complex concepts into sub-concepts and try to explain them to the user. \n - You have to ask the user for the required parameters. If the user refuses to provide in 2 attempts, politely exit the conversation. \n - You have to provide your supported parameters to the user, if the user refuses to accept them then you have to put an apology note and exit the conversation. \n - You have to track the conversation about unasked questions by the user. If some/one of the questions remain then you have to remind the user about these questions and proceed to answer them based on the user's confirmation \n \n 4. **Implementation:** \n - Your code/implementations should be reliable, scaleable, modular, and reusable. \n - You will be providing unit tests for the implementation upon user request. \n - You will be following MVC architecture for the applications \n - Your implementations must be well-commented and readable \n \n \n- Today's date is 23rd August 2024. \n- The default sender email is sender-assistant@email.com.\nHi, I am conducting research on retail customer feedback systems and I need assistance with designing and implementing them. Could you kindly provide me with a list of general customer feedback system modules?"
messages = [
{"role": "user", "content": custom_system_prompt + "\n\n" + prompt}
]
# example tools
tools = [{"type": "function", "function": {"name": "getRetailFeedbackModules", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"page": {"type": "integer", "description": "The current page number.", "default": 1}, "page_size": {"type": "integer", "description": "The number of items per page.", "default": 3}}}}}, {"type": "function", "function": {"name": "verifyImplementation", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"coding_language": {"type": "string", "description": "The supported languages for verification of implementation.", "default": "python", "enum": ["python", "java", "php"]}, "code": {"type": "string", "description": "The code which needs verification"}, "design_pattern": {"type": "string", "description": "The design pattern to verify in the implementation", "enum": ["factory", "strategy", "singleton"]}, "verify_best_practices": {"type": "boolean", "description": "The verification of the coding style based on the language selected", "default": true}}}}}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
tools=tools
)
model_inputs = tokenizer([text], return_tensors="pt")

Since the upstream PR has not yet been merged, you can use this custom image as an alternative way to run the model with the tool and reasoning parsers enabled:
docker.io/amant555/vllm_apriel:latest

python3 -m vllm.entrypoints.openai.api_server \
--model ServiceNow-AI/Apriel-1.6-15b-Thinker \
--served-model-name Apriel-1p6-15B-Thinker \
--trust_remote_code \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser apriel \
    --reasoning-parser apriel

Apriel-1.6-15b-Thinker is available for hosted inference on Together AI and can also be run locally with Ollama.
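Once the server is running, it exposes an OpenAI-compatible API. Here is a minimal client sketch, assuming the server listens on vLLM's default localhost:8000 and uses the served model name from the launch command above:

```python
from openai import OpenAI

# Assumes the vLLM server started above is listening on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Apriel-1p6-15B-Thinker",  # --served-model-name from the launch command
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.6,  # recommended sampling temperature
)
print(resp.choices[0].message.content)
```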
Continual pretraining: billions of tokens covering math, code, science, logical reasoning, and multimodal image-text data.
Supervised fine-tuning (SFT): 2.4M samples spanning math, code, instruction following, function calling, and conversation, followed by an incremental lightweight multimodal SFT pass.
Reinforcement learning (RL): multi-stage RL with verifiable rewards using the GSPO algorithm, applied to both text and vision tasks. Our RL stage optimizes reasoning efficiency: it cuts token consumption by removing unnecessary intermediate steps, stops reasoning early when confidence is high, and answers simple queries directly.
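The card does not spell out the reward implementation; as a toy illustration of what a verifiable reward can look like (helper names are ours; the final-response marker comes from the chat template above):

```python
import re

def extract_final_answer(output: str) -> str:
    # The final response follows the [BEGIN FINAL RESPONSE] marker.
    m = re.search(r"\[BEGIN FINAL RESPONSE\](.*?)(?:<\|end\|>|$)", output, re.DOTALL)
    return m.group(1).strip() if m else ""

def verifiable_reward(output: str, reference: str) -> float:
    # Binary reward: 1.0 on an exact match with the reference answer, else 0.0.
    return 1.0 if extract_final_answer(output) == reference.strip() else 0.0
```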
For more details on the training methodology, see our blog post.
Safety responsibility:
Deployers and users are strongly encouraged to align their safety practices with established frameworks and regulatory guidance such as the EU AI Act and the NIST AI Risk Management Framework (RMF).
Disclaimer:
Users bear responsibility for the safe deployment, management, and use of this open-source LLM. The model is provided "as is", without express or implied warranties as to its safety or suitability for any particular application or environment.
MIT
@misc{radhakrishna2025apriel1515bthinker,
title={Apriel-1.5-15b-Thinker},
author={Shruthan Radhakrishna and Aman Tiwari and Aanjaneya Shukla and Masoud Hashemi and Rishabh Maheshwary and Shiva Krishna Reddy Malay and Jash Mehta and Pulkit Pattnaik and Saloni Mittal and Khalil Slimi and Kelechi Ogueji and Akintunde Oladipo and Soham Parikh and Oluwanifemi Bamgbose and Toby Liang and Ahmed Masry and Khyati Mahajan and Sai Rajeswar Mudumba and Vikas Yadav and Sathwik Tejaswi Madhusudhan and Torsten Scholak and Sagar Davasam and Srinivas Sunkara and Nicholas Chapados},
year={2025},
eprint={2510.01141},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.01141},
}