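Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B-parameter model with a 256k context window that handles instruction following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessors Mistral Medium 3.1 and Magistral in Le Chat, and replaces Devstral 2 in our coding agent Vibe. Concretely, you can expect better performance on instruction-following, reasoning, and coding tasks than with our previously released models.
Reasoning effort is configurable per request, so the same model can answer a quick chat message or work through a complex agentic run. We trained the vision encoder from scratch to handle variable image sizes and aspect ratios.
For more information, see our blog.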
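[!Note] To speed up local inference with vLLM or SGLang, check out our released EAGLE model.
[!Warning] The Transformers config originally contained a faulty entry that degraded long-context performance. This was fixed in this commit. GGUF files generated from the Transformers config before this commit are also affected. Please use the corrected config for best performance.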
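Mistral Medium 3.5 includes the following architectural choices:
Mistral Medium 3.5 has the following capabilities:
We release this model under a Modified MIT License: an open-source license that allows both commercial and non-commercial use, with exceptions for high-revenue companies.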
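Reasoning effort:
- 'none' → do not use reasoning
- 'high' → use reasoning (recommended for complex prompts and agentic use cases)
Use reasoning_effort="high" for complex tasks and agentic coding.
Temperature: 0.7 is recommended with reasoning_effort="high". With reasoning_effort="none", the temperature can range from 0.0 to 0.7 depending on the task. In general, a lower temperature gives more to-the-point answers, while a higher one lets the model be more creative. It is good practice to try different values and tune the model for your needs.
Top-p: 0.95 is recommended with reasoning_effort="high". You can try other values, but staying close to this one usually gives the best performance. With reasoning_effort="none", None (or 1.0) is recommended.

Mistral Medium 3.5 outperforms all of our previous coding models, namely Devstral, on every benchmark. It scores 91.4% on τ³-Telecom and 77.6% on SWE-Bench Verified. Thanks to its stronger agentic capabilities, Mistral Medium 3.5 has replaced Devstral 2 in our coding agent, Vibe CLI.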

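We compared Mistral Medium 3.5 with competing models on instruction-following, reasoning (math), and coding benchmarks. Thanks to its unified capabilities, it scores well across all of these tasks, and Mistral Medium 3.5 now powers Le Chat.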

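Support for Mistral Medium 3.5, for both inference and fine-tuning, is available in several libraries.
We would like to thank all the contributors and maintainers who helped make this happen.
Use Mistral Medium 3.5 through Mistral Vibe.
Install the latest version:
uv pip install mistral-vibe --upgrade

You can select the Mistral Medium 3.5 model when launching vibe. If this is your first time launching vibe, it will do the following:
Now select mistral-medium-3.5 and start building!
If you would rather use a local vLLM server than call the Mistral API, follow these steps: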
Add the model configuration to ~/.vibe/config.toml:
display_name = "Mistral Medium 3.5 (local vLLM)"
description = "Mistral Medium 3.5 mode using local vLLM"
safety = "neutral"
active_model = "mistral-medium-3.5" # Make sure this is the only active_model entry
[[providers]]
name = "vllm"
api_base = "http://<your-host-url>:8000/v1"
api_key_env_var = ""
backend = "generic"
api_style = "reasoning"
[[models]]
name = "mistralai/Mistral-Medium-3.5-128B"
provider = "vllm"
alias = "mistral-medium-3.5"
thinking = "high"
temperature = 0.7
auto_compact_threshold = 168000
[tools.bash]
default_timeout = 1200

Note: replace <your-host-url> with your server's URL. Then restart vibe and switch to the "mistral-medium-3.5" mode with Shift+Tab.
Try some agentic coding tasks and start building something cool!
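The model can be deployed with:
- vllm (recommended): see here.
- llama.cpp: see here for Unsloth's GGUF files.
- LM Studio: in progress, stay tuned!
- Ollama: see here.
- SGLang: see here.
- transformers: see here.

[!Note] For optimal performance, we recommend using the Mistral AI API if local serving underperforms.
[!Warning] Make sure that frameworks relying on the Transformers config (including GGUF files) have been updated to include the fix introduced in this commit. Otherwise you will see degraded performance, especially in long-context sessions.
The model can be fine-tuned with: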
We recommend using Mistral Medium 3.5 with the vLLM library for production-ready inference.
[!Note] To speed up local inference with vLLM, check out our released EAGLE model.
Make sure to install the vllm nightly build:
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly

Doing so should automatically install mistral_common >= 1.11.1 and transformers >= 5.4.0.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"

You can also use a ready-to-go docker image, or the image on the docker hub.
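The two version checks above can also be scripted so that the comparison against the minimum versions is automatic. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the parser deliberately looks at leading numeric components only, so a pre-release such as `5.4.0.dev0` is treated as `5.4.0`.

```python
from importlib import metadata

# Minimum versions quoted in this section.
MINIMUMS = {"mistral_common": "1.11.1", "transformers": "5.4.0"}


def parse_version(version: str) -> tuple[int, ...]:
    """Keep leading numeric components only, e.g. '5.4.0.dev0' becomes (5, 4, 0)."""
    parts: list[int] = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums: dict[str, str] = MINIMUMS) -> dict[str, str]:
    """Report the status of each package without importing it."""
    report: dict[str, str] = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
        report[name] = f"{installed} ({status})"
    return report


print(check_installed())
```

Reading versions through `importlib.metadata` avoids importing the packages themselves, which keeps the check fast even for heavy libraries.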
We recommend a server/client setup:
vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
    --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
    --max_num_batched_tokens 16384 --max_num_seqs 128 \
    --gpu_memory_utilization 0.8

Mistral Medium 3.5 can follow your instructions to the letter.
from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none"  # Toggle reasoning with 'high'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first (and only) model exposed by the local server.
models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    # Download the prompt template from the Hub and fill in its placeholders.
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)

Let's solve some equations with our simple Python calculator tool.
import json
from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none"  # Toggle reasoning with 'high'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # Demo only: eval() executes arbitrary code, so never use it on untrusted input.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

# The model may answer without calling a tool, in which case tool_calls is None.
tool_calls = response.choices[0].message.tool_calls or []

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
    else:
        # Keep exactly one result per tool call so the zip below stays aligned.
        result = f"Tool {function_name!r} is not implemented in this example."
    results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )

# Send the tool results back so the model can write its final answer.
response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)

Let's see if Mistral Medium 3.5 knows when to make its move!
from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "high"  # Remove reasoning with 'none'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)

Deploy Mistral Medium 3.5 with the SGLang library for production-grade inference.
[!Note] To speed up local inference with SGLang, check out our released EAGLE model.
Day-0 support is available in dedicated Docker tags:
docker pull lmsysorg/sglang:dev-mistral-medium-3.5       # H100 / H200 (Hopper, CUDA 12.9)
docker pull lmsysorg/sglang:dev-cu13-mistral-medium-3.5  # B200 / B300 (Blackwell, CUDA 13.0)

Alternatively, follow the SGLang installation guide. transformers >= 5.4.0 is required.
python -m sglang.launch_server --model-path mistralai/Mistral-Medium-3.5-128B \
    --tp 8 --tool-call-parser mistral --reasoning-parser mistral

For the full deployment guide, benchmarks, and per-request examples (reasoning, tool calling, vision, streaming), see the SGLang reference documentation for Mistral Medium 3.5.
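Loading a 128B model takes a while, so client scripts are easier to run if they first wait for the server to come up. The sketch below polls the OpenAI-compatible `/models` route using only the standard library. The function names are hypothetical. It assumes the server exposes `/v1/models`, as the vLLM examples in this card rely on through `client.models.list()`; that SGLang serves the same route, and that its default port is 30000, are assumptions to verify against your deployment.

```python
import http.client
import json
import time
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Turn an API base such as 'http://localhost:8000/v1' into its /models URL."""
    return base_url.rstrip("/") + "/models"


def wait_for_server(
    base_url: str,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll until the server answers with valid JSON, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(models_url(base_url), timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    json.load(resp)  # raises ValueError if the body is not JSON
                    return True
        except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
            pass  # server not ready yet; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


print(models_url("http://localhost:30000/v1"))
```

A client script could start with `if not wait_for_server("http://localhost:8000/v1"): raise SystemExit("server did not start")` before creating the `OpenAI` client.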
Start by installing the Transformers framework to use Mistral Medium 3.5:
uv pip install transformers

import torch
from transformers import AutoProcessor, Mistral3ForConditionalGeneration

REASONING_EFFORT = "high"  # Remove reasoning with 'none'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = 1.0
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

model_id = "mistralai/Mistral-Medium-3.5-128B"
processor = AutoProcessor.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
    reasoning_effort=REASONING_EFFORT,
)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=TEMP,
    top_p=TOP_P,
)[0]

# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
# Slice off the prompt tokens so only the newly generated text is decoded.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]):], skip_special_tokens=False)
print(decoded_output)

This model is licensed under the Modified MIT License.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.