
LFM2.5-350M

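LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.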

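Best-in-class performance: A 350M-parameter model that rivals much larger models, bringing high-quality AI to your pocket.
Fast edge inference: 313 tok/s decode on an AMD CPU and 188 tok/s on Snapdragon Gen4. Runs in under 1GB of memory, with day-one support for llama.cpp, MLX, and vLLM.
Scaled training: Pre-training extended from 10T to 28T tokens, followed by large-scale multi-stage reinforcement learning.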

Find more information about LFM2.5-350M in our blog post.

[!NOTE] 💻 Demo: https://huggingface.co/spaces/webml-community/lfm2.5-webgpu-summarizer

🗒️ Model Details

Model	Parameters	Description
LFM2.5-350M-Base	350M	Pre-trained base model for fine-tuning
LFM2.5-350M	350M	General-purpose instruction-tuned model

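LFM2.5-350M is a general-purpose text-only model with the following features:

Number of parameters: 350M
Number of layers: 16 (10 double-gated LIV convolution blocks + 6 GQA blocks)
Training budget: 28T tokens
Context length: 32,768 tokens
Vocabulary size: 65,536
Knowledge cutoff: Mid-2024
Languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Spanish
Generation parameters: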
- temperature: 0.1
- top_k: 50
- repetition_penalty: 1.05

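Model	Description
LFM2.5-350M	Original model checkpoint in native format. Best for fine-tuning or inference with Transformers and vLLM.
LFM2.5-350M-GGUF	Quantized format for llama.cpp and compatible tools. Optimized for CPU inference and local deployment with reduced memory usage.
LFM2.5-350M-ONNX	ONNX Runtime format for cross-platform deployment. Enables hardware-accelerated inference across diverse environments (cloud, edge, mobile).
LFM2.5-350M-MLX	MLX format for Apple Silicon. Optimized for fast inference on Mac devices using the MLX framework.
LFM2.5-350M-OpenVINO	OpenVINO format for Intel hardware acceleration. Optimized for efficient inference on Intel CPUs, GPUs, and NPUs.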

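We recommend using it for data extraction, structured outputs, and tool use. It is not recommended for knowledge-intensive tasks or programming.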

Chat Template

LFM2.5 uses a ChatML-like format. See the Chat Template documentation for details. Example:

<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant

You can use tokenizer.apply_chat_template() to format your messages automatically.
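To make the layout concrete, the sketch below rebuilds the example prompt by hand. The helper `render_chatml` is hypothetical (it is not part of any library), and it assumes the template is exactly the token layout shown in the example above. The template shipped with the tokenizer remains the authoritative source and may differ in details such as whitespace handling.

```python
from typing import Dict, List

# Special tokens as they appear in the example above.
BOS = "<|startoftext|>"
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in the ChatML-like layout shown above (illustrative only)."""
    parts = [BOS]
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "What is C. elegans?"},
]
prompt = render_chatml(messages)
print(prompt)
```

In practice, prefer tokenizer.apply_chat_template(), which applies the template stored with the model rather than a hand-written approximation.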

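Tool Use

LFM2.5 supports function calling as follows:

Function definition: We recommend providing the list of tools as a JSON object in the system prompt. You can also pass the tools to tokenizer.apply_chat_template().
Function call: By default, LFM2.5 writes Pythonic function calls (a Python list between the <|tool_call_start|> and <|tool_call_end|> special tokens) as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.
Function execution: The function call is executed and the result is returned with the "tool" role.
Final answer: LFM2.5 interprets the result of the function call and answers the original user prompt in plain text.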
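As a rough sketch of the function definition step, the snippet below serializes a tool list into a system prompt. The tool schema and the "List of tools: " prefix are copied from the example in this section. The helper name `build_tool_system_prompt` is hypothetical, and whether the model requires this exact wording is an assumption.

```python
import json

# Tool schema copied from the example in this section.
tools = [
    {
        "name": "get_candidate_status",
        "description": "Retrieves the current status of a candidate in the recruitment process",
        "parameters": {
            "type": "object",
            "properties": {
                "candidate_id": {
                    "type": "string",
                    "description": "Unique identifier for the candidate",
                }
            },
            "required": ["candidate_id"],
        },
    }
]


def build_tool_system_prompt(tool_list):
    """Serialize the tool list as JSON (the prefix wording is an assumption)."""
    return "List of tools: " + json.dumps(tool_list)


messages = [
    {"role": "system", "content": build_tool_system_prompt(tools)},
    {"role": "user", "content": "What is the current status of candidate ID 12345?"},
]
print(messages[0]["content"])
```

The resulting messages list can then be formatted with tokenizer.apply_chat_template() like any other conversation.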

See the Tool Use documentation for the full guide. Example:

<|startoftext|><|im_start|>system
List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
<|im_start|>tool
[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
<|im_start|>assistant
The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
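To execute a call like the one above, the application has to extract it from the assistant message first. The following is a minimal sketch, not an official parser: `parse_tool_calls` is a hypothetical helper, it assumes the default Pythonic format with keyword arguments only, and it assumes the special tokens are still present in the decoded text (they are lost if decoding skips special tokens).

```python
import ast
import re

# Matches the text between the tool-call special tokens.
TOOL_CALL_PATTERN = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def parse_tool_calls(assistant_text):
    """Return a list of (function_name, kwargs) pairs found in an assistant message."""
    calls = []
    for block in TOOL_CALL_PATTERN.findall(assistant_text):
        tree = ast.parse(block.strip(), mode="eval")
        if not isinstance(tree.body, ast.List):
            raise ValueError("expected a Python list of function calls")
        for node in tree.body.elts:
            if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
                raise ValueError("expected a plain function call")
            if node.args or any(kw.arg is None for kw in node.keywords):
                raise ValueError("only keyword arguments are handled in this sketch")
            # literal_eval accepts literals only, so no model-generated code is executed.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


assistant_text = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>'
    "Checking the current status of candidate ID 12345."
)
print(parse_tool_calls(assistant_text))
# [('get_candidate_status', {'candidate_id': '12345'})]
```

Parsing with `ast` instead of `eval` keeps the model output as data: the application looks up the function name in its own registry and decides whether to run it. The result is then appended to the conversation as a message with the "tool" role.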

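🏃 Inference

LFM2.5 is supported by many inference frameworks. See the Inference documentation for the full list.

Name	Description	Docs	Notebook
Transformers	Simple inference with direct access to model internals.	Link
vLLM	High-throughput production deployments on GPU.	Link
llama.cpp	Cross-platform inference with CPU offloading.	Link
MLX	Apple's machine learning framework, optimized for Apple Silicon.	Link	-
LM Studio	Desktop application for running LLMs locally.	Link	-
OpenVINO	Intel toolkit for optimized inference on CPUs, GPUs, and NPUs.	Link	-

Here is a quick start example with Transformers: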

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the model and tokenizer
model_id = "LiquidAI/LFM2.5-350M"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    # attn_implementation="flash_attention_2",  # uncomment on a compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Print decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "What is C. elegans?"

# Apply the chat template; return_dict=True returns input_ids together with
# the attention mask so both can be passed to generate()
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
).to(model.device)

# Generate with the recommended sampling parameters
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
    streamer=streamer,
)

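🔧 Fine-Tuning

For best results, we recommend fine-tuning LFM2.5 on your specific use case.

Name	Description	Docs
CPT (Unsloth)	Continued pre-training for text completion using Unsloth.	Link
CPT (Unsloth)	Continued pre-training for translation using Unsloth.	Link
SFT (Unsloth)	Supervised fine-tuning with LoRA using Unsloth.	Link
SFT (TRL)	Supervised fine-tuning with LoRA using TRL.	Link
DPO (TRL)	Direct preference optimization with LoRA using TRL.	Link
GRPO (Unsloth)	GRPO with LoRA using Unsloth.	Link
GRPO (TRL)	GRPO with LoRA using TRL.	Link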

📊 Performance

Benchmarks

Model	GPQA Diamond	MMLU-Pro	IFEval	IFBench	Multi-IF
LFM2.5-350M	30.64	20.01	76.96	40.69	44.92
LFM2-350M	27.58	19.29	64.96	18.20	32.92
Granite 4.0-H-350M	22.32	13.14	61.27	17.22	28.70
Granite 4.0-350M	25.91	12.84	53.48	15.98	24.21
Qwen3.5-0.8B (Instruct)	27.41	37.42	59.94	22.87	41.68
Qwen3.5-0.8B (Thinking)	19.29	-*	32.93	22.00	26.44
Gemma 3 1B IT	23.89	14.04	63.49	20.33	44.25

Model	CaseReportBench	BFCLv3	BFCLv4	τ²-Bench Telecom	τ²-Bench Retail
LFM2.5-350M	32.45	44.11	21.86	18.86	17.84
LFM2-350M	11.67	22.95	12.29	10.82	5.56
Granite 4.0-H-350M	12.44	43.07	13.28	13.74	6.14
Granite 4.0-350M	0.84	39.58	13.73	2.92	6.14
Qwen3.5-0.8B (Instruct)	13.83	35.08	18.70	12.57	6.14
Qwen3.5-0.8B (Thinking)	0.39	39.64	25.39	14.33	7.02
Gemma 3 1B IT	2.28	16.61	7.17	9.36	6.43

*The evaluation could not be completed because the model got stuck in an infinite generation loop.

CPU Inference

GPU Inference

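📬 Contact

Have questions or want to connect? Join our Discord community.
If you are interested in custom solutions for edge deployment, please contact our sales team.

Citation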

@article{liquidAI2026350M,
  author = {Liquid AI},
  title = {LFM2.5-350M: No Size Left Behind},
  journal = {Liquid AI Blog},
  year = {2026},
  note = {www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind},
}

@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}