MoonshotAI/Kimi-Linear-48B-A3B-Base


Kimi Linear: An Expressive, Efficient Attention Architecture

  Paper   Code   Model

(a) On MMLU-Pro (4k context length), Kimi Linear reaches 51.0 while matching the speed of full attention. On RULER (128k context length), it achieves Pareto-optimal performance (84.3) with a 3.98× speedup. (b) Kimi Linear achieves up to 6.3× faster TPOT than MLA, a significant speedup on long sequences (1M tokens).

Overview

Kimi Linear is a hybrid linear-attention architecture that outperforms traditional full attention across a wide range of scenarios, including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to make better use of the finite-state RNN memory.

Kimi Linear delivers particularly strong performance and hardware efficiency on long-context tasks. It reduces the need for large KV caches by up to 75% and boosts decoding throughput by up to 6× at a context length of 1M tokens.
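
The 75% figure follows directly from the hybrid layout: only the global MLA layers keep a length-dependent KV cache, while KDA layers carry a fixed-size recurrent state instead. A back-of-the-envelope sketch (the layer count below is a hypothetical value for illustration):

# With a 3:1 KDA:MLA ratio, only 1 in 4 layers stores a KV cache.
layers = 48                       # hypothetical total layer count
mla_layers = layers // 4          # layers that keep a length-dependent KV cache
kv_fraction = mla_layers / layers
print(f"KV cache vs. full attention: {kv_fraction:.0%}")  # 25%, i.e. up to 75% savings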

We open-source the KDA kernel in FLA and release two versions of the model checkpoints, both trained on 5.7T tokens.

Model                 Total Params  Activated Params  Context Length  Download
Kimi-Linear-Base      48B           3B                1M              🤗 Hugging Face
Kimi-Linear-Instruct  48B           3B                1M              🤗 Hugging Face

Key Features

  • Kimi Delta Attention (KDA): a linear attention mechanism that refines the gated delta rule with fine-grained gating (a toy sketch of the recurrence follows this list).
  • Hybrid architecture: a 3:1 ratio of KDA to global MLA layers, reducing memory use while matching or exceeding full-attention quality.
  • Strong performance: outperforms full attention across a range of tasks, including long-context and RL-style benchmarks, in fair comparisons at 1.4T training tokens.
  • High throughput: up to 6× faster decoding and a significantly lower time per output token (TPOT).
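
To make the KDA bullet concrete, here is a deliberately naive recurrence in the spirit of a gated delta rule with per-channel (fine-grained) decay gates. It is a sketch only: the tensor shapes, gate placement, and decay-then-write ordering are illustrative assumptions, and the actual KDA formulation and its chunked FLA kernels are specified in the paper.

import torch

def gated_delta_rule_sketch(q, k, v, beta, alpha):
    # q, k: (T, d_k); v: (T, d_v); beta: (T,) write strengths in [0, 1];
    # alpha: (T, d_k) per-channel decay gates in [0, 1] (the "fine-grained" part).
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)             # fixed-size RNN memory
    outs = []
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S    # decay each key channel independently
        v_old = k[t] @ S                  # value currently associated with k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - v_old)  # delta-rule write
        outs.append(q[t] @ S)             # read out with the query
    return torch.stack(outs)              # (T, d_v)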

Usage

Inference with Hugging Face Transformers

We recommend the following environment when using the Kimi Linear models:

  • python >= 3.10
  • torch >= 2.6
  • fla-core >= 0.4.0

pip install -U fla-core
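
As a quick sanity check that the installed versions meet the requirements above (a standard-library-only sketch; package names are as published on PyPI):

from importlib.metadata import version

print("torch:", version("torch"))        # expect >= 2.6
print("fla-core:", version("fla-core"))  # expect >= 0.4.0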

Example code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

# trust_remote_code is required: the hybrid KDA/MLA architecture ships as custom code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",    # use the checkpoint's native dtype
    device_map="auto",     # shard across available devices
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
    {"role": "user", "content": "Is 123 a prime?"}
]

# Render the chat template and move the token ids to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
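
Note that batch_decode returns the full sequence, prompt included. To print only the newly generated reply, a common pattern (not specific to Kimi Linear) is to slice off the prompt tokens first:

# Drop the echoed prompt tokens, then decode just the reply.
new_tokens = generated_ids[0][input_ids.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))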

Deployment

For deployment, you can use the latest version of vllm to create an OpenAI-compatible API endpoint.

vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code
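
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package, assuming the default local port from the command above:

from openai import OpenAI

# vLLM ignores the API key, but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-Linear-48B-A3B-Instruct",
    messages=[{"role": "user", "content": "Is 123 a prime?"}],
    max_tokens=500,
)
print(response.choices[0].message.content)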

Citation

If you find our work useful, please cite:

@misc{team2025kimi,
    title         = {Kimi Linear: An Expressive, Efficient Attention Architecture},
    author        = {Zhang, Yu  and Lin, Zongyu  and Yao, Xingcheng  and Hu, Jiaxi  and Meng, Fanqing  and Liu, Chengyin  and Men, Xin  and Yang, Songlin  and Li, Zhiyuan  and Li, Wentao  and Lu, Enzhe  and Liu, Weizhou  and Chen, Yanru  and Xu, Weixin  and Yu, Longhui  and Wang, Yejie  and Fan, Yu  and Zhong, Longguang  and Yuan, Enming  and Zhang, Dehao  and Zhang, Yizhi  and T. Liu, Y.  and Wang, Haiming  and Fang, Shengjun  and He, Weiran  and Liu, Shaowei  and Li, Yiwei  and Su, Jianlin  and Qiu, Jiezhong  and Pang, Bo  and Yan, Junjie  and Jiang, Zhejun  and Huang, Weixiao  and Yin, Bohong  and You, Jiacheng  and Wei, Chu  and Wang, Zhengtao  and Hong, Chao  and Chen, Yutian  and Chen, Guanduo  and Wang, Yucheng  and Zheng, Huabin  and Wang, Feng  and Liu, Yibo  and Dong, Mengnan  and Zhang, Zheng  and Pan, Siyuan  and Wu, Wenhao  and Wu, Yuhao  and Guan, Longyu  and Tao, Jiawen  and Fu, Guohong  and Xu, Xinran  and Wang, Yuzhi  and Lai, Guokun  and Wu, Yuxin  and Zhou, Xinyu  and Yang, Zhilin  and Du, Yulun},
    year          = {2025},
    eprint        = {2510.26692},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}