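Mistral Small 4 is a powerful hybrid model that can act as both a general instruct model and a reasoning model. It unifies the capabilities of three model families, Instruct, Reasoning (previously called Magistral), and Devstral, in a single model.
With its multimodal capabilities, efficient architecture, and flexible mode switching, it is a strong general-purpose model for any task. In a latency-optimized setup, Mistral Small 4 reduces end-to-end completion time by 40%; in a throughput-optimized setup, it handles 3x more requests per second than Mistral Small 3.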
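To improve efficiency further, you can use either of the following:
mistralai/Mistral-Small-4-119B-2603-eagle for speculative decoding.
mistralai/Mistral-Small-4-119B-2603-NVFP4 for 4-bit floating-point quantization.

Mistral Small 4 includes the following architectural choices: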
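Mistral Small 4 offers the following capabilities:

The reasoning_effort parameter takes one of two values:
'none' → no reasoning
'high' → reasoning enabled (recommended for complex prompts)

For complex tasks, use reasoning_effort="high". With reasoning_effort="high", a temperature of 0.7 is recommended. With reasoning_effort="none", the temperature can be set anywhere from 0.0 to 0.7, depending on the task.

Mistral Small 4 is suited to general chat assistants, coding, agentic tasks, and reasoning tasks (with reasoning mode enabled). Its multimodal capabilities also support document and image understanding for data extraction and analysis.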
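The temperature recommendations above can be captured in a small helper. This is a minimal sketch, not part of any Mistral or vLLM API: the function name `recommended_temperature` and its default task temperature of 0.1 are assumptions made for illustration.

```python
def recommended_temperature(reasoning_effort: str, task_temperature: float = 0.1) -> float:
    """Pick a sampling temperature that follows the recommendations above.

    `task_temperature` is a hypothetical, caller-chosen value that is only
    used when reasoning is disabled.
    """
    if reasoning_effort == "high":
        # Reasoning mode: the recommended temperature is fixed at 0.7.
        return 0.7
    if reasoning_effort == "none":
        # No reasoning: any value in [0.0, 0.7] is acceptable, so clamp into that range.
        return min(max(task_temperature, 0.0), 0.7)
    raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")


print(recommended_temperature("high"))
print(recommended_temperature("none", task_temperature=0.3))
```

Clamping rather than raising on an out-of-range temperature is a design choice of this sketch; a stricter caller may prefer to reject such values.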
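Its capabilities are particularly well suited to the following scenarios:
Mistral Small 4 is also a good fit for customization and fine-tuning for more specialized tasks.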
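Depending on the task, you can enable reasoning through the per-request parameter reasoning_effort. Set it as follows:
reasoning_effort="none": fast, lightweight responses for everyday tasks, with a chat style equivalent to mistralai/Mistral-Small-3.2-24B-Instruct-2506.
reasoning_effort="high": deep, step-by-step reasoning for complex problems, with verbosity equivalent to the earlier Magistral models such as mistralai/Magistral-Small-2509.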
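Because the mode is selected per request, a single deployment can serve both styles. The sketch below only assembles request arguments and sends nothing; the helper name `build_chat_request` and the two sample prompts are hypothetical, and the resulting dictionaries are assumed to be passed to an OpenAI-compatible client as keyword arguments.

```python
from typing import Any


def build_chat_request(
    model: str,
    messages: list[dict[str, Any]],
    use_reasoning: bool,
) -> dict[str, Any]:
    """Assemble keyword arguments for one chat completion request."""
    return {
        "model": model,
        "messages": messages,
        # The mode is chosen on each request, not at server start-up.
        "reasoning_effort": "high" if use_reasoning else "none",
    }


MODEL_ID = "mistralai/Mistral-Small-4-119B-2603"

# Everyday request: fast, lightweight answer.
quick = build_chat_request(
    MODEL_ID,
    [{"role": "user", "content": "Say hello."}],
    use_reasoning=False,
)

# Complex request: step-by-step reasoning.
deep = build_chat_request(
    MODEL_ID,
    [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    use_reasoning=True,
)

print(quick["reasoning_effort"], deep["reasoning_effort"])
```

With a client such as the `OpenAI` object used in the vLLM examples, each dictionary would be expanded as `client.chat.completions.create(**quick)`; temperature is left out here and should be set according to the chosen mode.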

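With reasoning enabled, Mistral Small 4 achieves competitive scores, matching or surpassing GPT-OSS 120B on all three benchmarks while generating significantly shorter outputs. On AA LCR, Mistral Small 4 scores 0.72 with just 1.6K characters, whereas the Qwen models need 3.5-4x more output (5.8-6.1K characters) to reach comparable performance. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B while producing 20% less output. This efficiency lowers latency and inference cost and improves the user experience.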
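The AA LCR comparison above comes down to simple arithmetic on the reported output lengths. The sketch below recomputes it from the figures quoted in the paragraph (1.6K characters versus 5.8-6.1K characters). The constant names are illustrative, and the sketch assumes the "3.5-4x" figure is meant as a ratio of output lengths.

```python
# Output lengths quoted above, in thousands of characters.
MISTRAL_SMALL_4_KCHARS = 1.6
QWEN_KCHARS_RANGE = (5.8, 6.1)

# Ratio of each Qwen output length to the Mistral Small 4 output length.
ratios = [qwen / MISTRAL_SMALL_4_KCHARS for qwen in QWEN_KCHARS_RANGE]

for qwen, ratio in zip(QWEN_KCHARS_RANGE, ratios):
    print(f"{qwen}K / {MISTRAL_SMALL_4_KCHARS}K = {ratio:.2f}x")
```

The two ratios come out at roughly 3.6x and 3.8x, which is consistent with the quoted 3.5-4x range.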

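Mistral Small 4 is supported by several libraries for inference and fine-tuning. We thank all the contributors and maintainers who helped make this possible.
The model can be deployed with:
For best performance, we recommend using the Mistral AI API if local serving is subpar.
The model can be fine-tuned with: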
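We recommend using Mistral Small 4 with the vLLM library for production-ready inference.

Make sure to install the vllm nightly build:

    uv pip install -U vllm \
        --torch-backend=auto \
        --extra-index-url https://wheels.vllm.ai/nightly

Running this command should automatically install mistral_common >= 1.11.0.

To check the installation:

    python -c "import mistral_common; print(mistral_common.__version__)"

You can also use a ready-to-go docker image, or an image from Docker Hub.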
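The version check above prints the installed version but leaves the comparison against 1.11.0 to the reader. A minimal sketch of an automated check follows, using only the standard library. The helpers `parse_release` and `meets_minimum` are hypothetical names, and the parsing is deliberately simple: it compares numeric release components only and treats a pre-release such as `1.11.0rc1` as `1.11.0`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string: str) -> tuple[int, ...]:
    """Turn '1.11.0' into (1, 11, 0), dropping any non-numeric suffix."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.11.0") -> bool:
    """Compare numerically, so that 1.9.x correctly sorts below 1.11.0."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.11' and '1.11.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = version("mistral_common")
    status = "OK" if meets_minimum(installed) else "older than 1.11.0"
    print(f"mistral_common {installed}: {status}")
except PackageNotFoundError:
    print("mistral_common is not installed")
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "1.9.5" above "1.11.0".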
Install transformers from main:

    uv pip install git+https://github.com/huggingface/transformers.git

We recommend a server/client setup:

    vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2 --attention-backend FLASH_ATTN_MLA \
        --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
        --gpu_memory_utilization 0.8

Mistral Small 4 can follow your instructions to the letter.

    from datetime import datetime, timedelta

    from openai import OpenAI
    from huggingface_hub import hf_hub_download

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    TEMP = 0.1
    # use TEMP = 0.7 for reasoning_effort="high"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Use the first model listed by the vLLM server.
    models = client.models.list()
    model = models.data[0].id


    def load_system_prompt(repo_id: str, filename: str) -> str:
        # Download the prompt template from the Hub and fill in its placeholders.
        file_path = hf_hub_download(repo_id=repo_id, filename=filename)
        with open(file_path, "r") as file:
            system_prompt = file.read()
        today = datetime.today().strftime("%Y-%m-%d")
        yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
        model_name = repo_id.split("/")[-1]
        return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


    SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
        },
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=TEMP,
        reasoning_effort="none",
    )

    assistant_message = response.choices[0].message.content
    print(assistant_message)

Let's solve some equations with our simple Python calculator tool.

    import json
    from datetime import datetime, timedelta

    from openai import OpenAI
    from huggingface_hub import hf_hub_download

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    TEMP = 0.1

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Use the first model listed by the vLLM server.
    models = client.models.list()
    model = models.data[0].id


    def load_system_prompt(repo_id: str, filename: str) -> str:
        # Download the prompt template from the Hub and fill in its placeholders.
        file_path = hf_hub_download(repo_id=repo_id, filename=filename)
        with open(file_path, "r") as file:
            system_prompt = file.read()
        today = datetime.today().strftime("%Y-%m-%d")
        yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
        model_name = repo_id.split("/")[-1]
        return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


    SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

    image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


    def my_calculator(expression: str) -> str:
        # eval() runs arbitrary Python: acceptable for a demo, unsafe for untrusted input.
        return str(eval(expression))


    tools = [
        {
            "type": "function",
            "function": {
                "name": "my_calculator",
                "description": "A calculator that can evaluate a mathematical expression.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string",
                            "description": "The mathematical expression to evaluate.",
                        },
                    },
                    "required": ["expression"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "rewrite",
                "description": "Rewrite a given text for improved clarity",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "text": {
                            "type": "string",
                            "description": "The input text to rewrite",
                        }
                    },
                },
            },
        },
    ]

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url,
                    },
                },
            ],
        },
    ]

    # First call: the model decides which tools to call.
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=TEMP,
        tools=tools,
        tool_choice="auto",
        reasoning_effort="none",
    )

    tool_calls = response.choices[0].message.tool_calls

    # Run every requested tool call, keeping results aligned with tool_calls.
    results = []
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        function_args = tool_call.function.arguments
        if function_name == "my_calculator":
            result = my_calculator(**json.loads(function_args))
        else:
            # No local implementation exists for this tool (e.g. "rewrite").
            result = f"Tool '{function_name}' is not implemented."
        results.append(result)

    # Send the tool calls and their results back to the model.
    messages.append({"role": "assistant", "tool_calls": tool_calls})
    for tool_call, result in zip(tool_calls, results):
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": tool_call.function.name,
                "content": result,
            }
        )

    # Second call: the model writes the final answer from the tool results.
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=TEMP,
        reasoning_effort="none",
    )

    print(response.choices[0].message.content)

Let's see whether Mistral Small 4 knows when to pick a fight!

    from datetime import datetime, timedelta

    from openai import OpenAI
    from huggingface_hub import hf_hub_download

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    # 0.7 is the recommended temperature for reasoning_effort="high".
    TEMP = 0.7

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Use the first model listed by the vLLM server.
    models = client.models.list()
    model = models.data[0].id


    def load_system_prompt(repo_id: str, filename: str) -> str:
        # Download the prompt template from the Hub and fill in its placeholders.
        file_path = hf_hub_download(repo_id=repo_id, filename=filename)
        with open(file_path, "r") as file:
            system_prompt = file.read()
        today = datetime.today().strftime("%Y-%m-%d")
        yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
        model_name = repo_id.split("/")[-1]
        return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


    SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

    image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
                },
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=TEMP,
        reasoning_effort="high",
    )

    print(response.choices[0].message.content)

To use Mistral Small 4 with Transformers, you need to install Transformers from main:

    uv pip install git+https://github.com/huggingface/transformers.git

The following script then runs the model with reasoning enabled:

    import torch
    from transformers import AutoProcessor, Mistral3ForConditionalGeneration

    model_id = "mistralai/Mistral-Small-4-119B-2603"

    processor = AutoProcessor.from_pretrained(model_id)
    model = Mistral3ForConditionalGeneration.from_pretrained(
        model_id, device_map="auto"
    )

    image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
                },
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]

    # Tokenize the conversation; reasoning_effort is forwarded to the chat template.
    inputs = processor.apply_chat_template(
        messages,
        return_tensors="pt",
        tokenize=True,
        return_dict=True,
        reasoning_effort="high",
    )
    inputs = inputs.to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
    )[0]

    # Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
    # Slice off the prompt tokens so that only the newly generated text is decoded.
    decoded_output = processor.decode(output[len(inputs["input_ids"][0]):], skip_special_tokens=False)
    print(decoded_output)

This model is licensed under the Apache 2.0 License.
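You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.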