LongCat-Flash-Lite-FP8

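Model Introduction

We introduce LongCat-Flash-Lite, a non-thinking Mixture-of-Experts (MoE) model with 68.5B total parameters and roughly 3B activated parameters, supporting a 256k context length through the YaRN method. Building on the LongCat-Flash architecture, LongCat-Flash-Lite integrates an N-gram embedding table designed to improve both model performance and inference speed. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only outperforms parameter-equivalent MoE baselines but is also highly competitive with existing models of comparable scale, particularly in agentic and coding domains.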

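Key Features

🌟 Superior Scaling Efficiency: A Better Alternative to MoE

Through comprehensive scaling experiments across diverse scenarios, we identify specific regimes in which embedding scaling achieves a better Pareto frontier than increasing the number of experts, offering an efficient alternative for model scaling. We further characterize the key architectural factors that determine how effective embedding scaling is: integration timing, parameter budgeting, hash collision mitigation, hyperparameter settings, embedding initialization, and the effects of model width and depth.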
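To make the idea of a hashed N-gram embedding table concrete, here is a minimal sketch in plain Python. It is not the LongCat-Flash-Lite implementation: the table size, the polynomial hash, the use of two independently hashed tables to soften collisions, and every name (`NGramEmbedding`, `lookup`, `_slot`) are assumptions chosen for illustration only.

```python
import random


class NGramEmbedding:
    """Toy hashed N-gram embedding table (illustrative only)."""

    def __init__(self, num_slots, dim, n=2, num_hashes=2, seed=0):
        rng = random.Random(seed)
        self.num_slots = num_slots
        self.dim = dim
        self.n = n
        # One independent table per hash function; averaging them means two
        # N-grams must collide in every table to receive identical vectors.
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_slots)]
            for _ in range(num_hashes)
        ]
        # Distinct odd multipliers give each table its own polynomial hash.
        self.multipliers = [1_000_003 + 2 * i * 7919 for i in range(num_hashes)]

    def _slot(self, ngram, multiplier):
        """Map an N-gram (tuple of token ids) to a row index."""
        h = 0
        for token_id in ngram:
            h = (h * multiplier + token_id + 1) % (2**61 - 1)
        return h % self.num_slots

    def lookup(self, token_ids):
        """Return one vector per position, built from the N-gram ending there."""
        out = []
        for pos in range(len(token_ids)):
            start = max(0, pos - self.n + 1)
            ngram = tuple(token_ids[start:pos + 1])
            vec = [0.0] * self.dim
            for table, mult in zip(self.tables, self.multipliers):
                row = table[self._slot(ngram, mult)]
                vec = [v + r / len(self.tables) for v, r in zip(vec, row)]
            out.append(vec)
        return out
```

For example, `NGramEmbedding(num_slots=64, dim=4).lookup([5, 9, 5, 9])` returns four vectors, and the vectors at positions 1 and 3 are identical because both positions end the same bigram `(5, 9)`. Each position costs a few table reads rather than a matrix multiplication; how such a vector is merged with the ordinary token embedding is left out of this sketch.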

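🌟 Superior Inference Efficiency with Specialized System Optimization

Compared with FFN-based experts, the N-gram embedding table inherently mitigates the I/O bottleneck in MoE layers, substantially reducing inference latency. In addition, we introduce a dedicated N-gram cache and develop synchronized kernels, which together significantly boost inference efficiency.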

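🌟 Strong Agentic and Coding Performance

LongCat-Flash-Lite shows strong capabilities in agentic tool use and coding, and is highly competitive for its model size.

For more details, please refer to our technical report!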

Evaluation Results

Benchmark	Kimi-Linear-48B-A3B	Qwen3-Next-80B-A3B-Instruct	Gemini 2.5 Flash-Lite	LongCat-Flash-Lite
Architecture	MoE	MoE	-	MoE + NE
# Total Params	48B	80B	-	68.5B
# Activated Params	3B	3B	-	2.9B~4.5B
Agentic Tool Use
Tau2-Airline (avg@8)	44.00	45.5*	35.00	58.00
Tau2-Retail (avg@8)	18.86	57.3*	37.50	73.10
Tau2-Telecom (avg@8)	15.68	13.2*	21.93	72.80
Agentic Coding
SWE-Bench (acc)	32.80	37.60	41.3*	54.40
TerminalBench (acc)	20.00	15.19	20.00	33.75
SWE-Bench Multilingual	37.20	31.30	-	38.10
PRDBench	-	15.36	-	39.63
General Domains
GPQA-Diamond (avg@16)	69.89	74.33	70.20*	66.78
MMLU (acc)	79.91	89.28	84.68	85.52
MMLU-Pro (acc)	67.22	82.93	78.95	78.29
CEval (acc)	78.48	90.91	75.16	86.55
CMMLU (acc)	76.26	86.50	72.06	82.48
Mathematical Reasoning
MATH500 (acc)	94.20	98.00	95.20	96.80
AIME24 (avg@32)	70.52	81.35	63.33	72.19
AIME25 (avg@32)	59.58	68.44	50.1*	63.23

Note: Values marked with * are sourced from public reports. NE is an abbreviation of N-gram Embedding.

Quick Start

To run LongCat-Flash-Lite with transformers, at least 2 GPUs are required (80GB of VRAM each, e.g., H100/A100 80GB). We recommend the following environment:

python >= 3.10
torch >= 2.6
transformers >= 4.57.6
accelerate >= 1.10.0

pip install -U transformers==4.57.6 accelerate==1.10.0
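The two-GPU requirement stated above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that weights dominate memory and are stored at 2 bytes per parameter (BF16), with 1 byte per parameter (FP8) shown for comparison. Both byte widths are assumptions rather than figures from this model card, and activation and KV-cache memory are ignored.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


TOTAL_PARAMS_B = 68.5   # total parameters, from the model card
GPU_MEMORY_GB = 80      # per-GPU memory, from the model card

bf16_gb = weight_memory_gb(TOTAL_PARAMS_B, 2)   # assumed BF16 weights
fp8_gb = weight_memory_gb(TOTAL_PARAMS_B, 1)    # assumed FP8 weights

print(f"BF16 weights: {bf16_gb:.1f} GB, FP8 weights: {fp8_gb:.1f} GB")
print("fits on one GPU (BF16):", bf16_gb < GPU_MEMORY_GB)
print("fits on two GPUs (BF16):", bf16_gb < 2 * GPU_MEMORY_GB)
```

Under the BF16 assumption the weights alone take about 137 GB, which exceeds a single 80 GB device but leaves roughly 23 GB of headroom across two.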

Basic usage example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meituan-longcat/LongCat-Flash-Lite"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a brief introduction to large language models."}
]
input_ids = tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=True, 
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
output_ids = generated_ids[0][len(input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(response)

Tool calling example:

tools = [
    {
        "type": "function",
        "function": {
            "name": "func_add",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "x1": {"type": "number", "description": "The first addend"},
                    "x2": {"type": "number", "description": "The second addend"}
                },
                "required": ["x1", "x2"]
            }
        }
    }
]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please tell me what is $$125679 + 234519$$?"},
    {
        "role": "assistant", 
        "content": "I'll calculate the sum of 125679 and 234519 for you.", 
        "tool_calls": [{"type": "function", "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}]
    },
    {"role": "tool", "name": "func_add", "content": '{"ans": 360198}'}
]

input_ids = tokenizer.apply_chat_template(
    messages, 
    tools=tools,
    add_generation_prompt=True, 
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
output_ids = generated_ids[0][len(input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(response)

Response parsing:

from parse_model_response import parse_model_response

response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
parsed_message = parse_model_response(response, tools)

See parse_model_response.py for the detailed implementation and examples.
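Once a tool call has been parsed, the application still has to run the tool and feed the result back to the model. The following sketch shows one way to do that. It assumes the parsed assistant message has the same shape as the assistant turn in the tool calling example above (a `tool_calls` list whose entries carry `function.name` and `function.arguments`), which may differ from what `parse_model_response` actually returns. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names introduced for illustration.

```python
import json


def func_add(x1, x2):
    """Local implementation of the `func_add` tool declared in `tools`."""
    return {"ans": x1 + x2}


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"func_add": func_add}


def run_tool_calls(assistant_message):
    """Execute each tool call and return the matching `tool` messages."""
    tool_messages = []
    for call in assistant_message.get("tool_calls", []):
        function = call["function"]
        arguments = function["arguments"]
        # Some chat formats deliver arguments as a JSON string, so accept both.
        if isinstance(arguments, str):
            arguments = json.loads(arguments)
        result = TOOL_REGISTRY[function["name"]](**arguments)
        tool_messages.append(
            {"role": "tool", "name": function["name"], "content": json.dumps(result)}
        )
    return tool_messages


assistant_message = {
    "role": "assistant",
    "content": "I'll calculate the sum of 125679 and 234519 for you.",
    "tool_calls": [
        {"type": "function",
         "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}
    ],
}
print(run_tool_calls(assistant_message))
```

The returned messages can be appended to `messages` before calling `apply_chat_template` again, as in the tool calling example.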

Recommended sampling settings:

{ "repetition_penalty": 1.06, "temperature": 0.7, "top_p": 0.95, "top_k": 4 }
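As a sketch of how these settings might be passed to `generate` in transformers: the dictionary below copies the recommended values, and `do_sample=True` is added on the assumption that the model's own generation config does not already enable sampling (without sampling, temperature, top_p and top_k have no effect). The names `RECOMMENDED_SAMPLING` and `generation_kwargs` are illustrative.

```python
# Recommended sampling settings copied from the model card.
RECOMMENDED_SAMPLING = {
    "repetition_penalty": 1.06,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 4,
}

# `max_new_tokens=256` mirrors the earlier examples; `do_sample=True` is an
# assumption, needed only if sampling is not already enabled by default.
generation_kwargs = {"do_sample": True, "max_new_tokens": 256, **RECOMMENDED_SAMPLING}

# With `model` and `input_ids` from the basic usage example:
# generated_ids = model.generate(inputs=input_ids, **generation_kwargs)
print(generation_kwargs)
```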

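Deployment

We have implemented basic adaptations in SGLang (PR) to support the deployment of LongCat-Flash-Lite.

LongCat-Flash-Lite can be served on a single node (e.g., 8xH20-141G) using a combination of Tensor Parallelism and Expert Parallelism.

First, compile and update sgl-kernel.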

cd sgl-kernel
python3 -m uv build --wheel --color=always --no-build-isolation \
        -Ccmake.define.SGL_KERNEL_ENABLE_SM90A=1 \
        -Ccmake.define.CMAKE_POLICY_VERSION_MINIMUM=3.5 \
        -Cbuild-dir=build .
pip3 install dist/sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl --force-reinstall

Then launch the server.

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Lite \
    --port 8080 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --trust-remote-code \
    --skip-server-warmup \
    --attention-backend flashinfer \
    --ep 8 \
    --tp 8 \
    --disable-cuda-graph
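Once the server is up, it can be queried through SGLang's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at `localhost` on the port given above. `build_chat_request` and `send` are hypothetical helper names, and only `temperature` and `top_p` are set here because they are standard fields of the chat completions API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes the server runs on this machine


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": "meituan-longcat/LongCat-Flash-Lite",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(send(build_chat_request("Give me a brief introduction to large language models.")))
```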

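License

This repository, including both the model weights and the source code, is released under the MIT License.

Unless otherwise stated, any contribution to this repository is licensed under the MIT License. This license does not grant any rights to use Meituan trademarks or patents.

For details, see the LICENSE file.

Usage Considerations

This model has not been specifically designed or comprehensively evaluated for every possible downstream application.

Developers should take into account the known limitations of large language models, including performance variations across languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios. Developers and downstream users are responsible for understanding and complying with all laws and regulations applicable to their use case, including but not limited to data protection, privacy, and content safety requirements.

Nothing in this model card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.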

Citation

If you find our work useful, we kindly ask that you cite it.

@misc{liu2026scalingembeddingsoutperformsscaling,
      title={Scaling Embeddings Outperforms Scaling Experts in Language Models}, 
      author={Hong Liu and Jiaqi Zhang and Chao Wang and Xing Hu and Linkun Lyu and Jiaqi Sun and Xurui Yang and Bo Wang and Fengcun Li and Yulei Qian and Lingtong Si and Yerui Sun and Rumei Li and Peng Pei and Yuchen Xie and Xunliang Cai},
      year={2026},
      eprint={2601.21204},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.21204}, 
}

Contact

If you have any questions, please contact us at longcat-team@meituan.com or open an issue.