(a) On MMLU-Pro (4k context length), Kimi Linear reaches 51.0 while matching the speed of full attention; on RULER (128k context length), it attains Pareto-optimal performance (84.3) with a 3.98× speedup. (b) Kimi Linear's TPOT is 6.3× faster than MLA's, yielding a substantial speedup at long sequence lengths (1M tokens).
Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention across a range of scenarios, including short-context, long-context, and reinforcement-learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to make better use of the finite-state RNN memory.
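For a rough intuition of what a gated delta rule computes, here is a minimal per-token sketch: a channel-wise decay gate applied to a delta-rule memory update. The function name `gated_delta_step`, the tensor shapes, and the exact gate placement are our assumptions for exposition; the released kernel is a chunked, hardware-efficient implementation, not this token-by-token loop.

```python
import torch

def gated_delta_step(S, k, v, alpha, beta):
    """One naive step of a channel-wise gated delta rule (illustrative only).

    S:     (d_k, d_v) finite-state RNN memory
    k, v:  (d_k,) key and (d_v,) value for the current token
    alpha: (d_k,) per-channel decay gate in (0, 1); finer-grained than the
           single scalar gate used by Gated DeltaNet
    beta:  scalar write strength in (0, 1)
    """
    S = alpha.unsqueeze(-1) * S                  # forget: decay each key channel
    pred = S.T @ k                               # recall: predicted value for this key
    S = S + beta * torch.outer(k, v - pred)      # delta rule: correct the prediction error
    return S
```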
Kimi Linear is particularly strong on long-context tasks, in both quality and hardware efficiency: it cuts the demand for large KV caches by up to 75% and delivers up to 6× higher decoding throughput at a context length of 1M tokens.
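A back-of-the-envelope sketch of where the cache figure comes from, assuming a hybrid layout in which three of every four layers are KDA (constant-size recurrent state, no per-token cache) and only the fourth is global attention with a per-token KV cache; the 3:1 ratio is our assumption here:

```python
# Hypothetical 4-layer group: 3 KDA layers (no per-token KV cache) plus
# 1 global-attention layer, versus 4 full-attention layers that all cache.
layers_per_group = 4
caching_layers_hybrid = 1
reduction = 1 - caching_layers_hybrid / layers_per_group
print(f"KV-cache reduction: {reduction:.0%}")  # -> 75%
```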
We open-source the KDA kernel in FLA and release two versions of model checkpoints, both trained on 5.7T tokens.
| Model | Total Params | Activated Params | Context Length | Download |
|---|---|---|---|---|
| Kimi-Linear-Base | 48B | 3B | 1M | 🤗 Hugging Face |
| Kimi-Linear-Instruct | 48B | 3B | 1M | 🤗 Hugging Face |
We recommend the following environment when using the Kimi Linear models:
- python >= 3.10
- torch >= 2.6
- fla-core >= 0.4.0

```bash
pip install -U fla-core
```

Example code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
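
# Load the checkpoint; dtype and device placement are resolved automatically.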
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
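
# Build a simple two-turn conversation to query the model.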
messages = [
{"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
{"role": "user", "content": "Is 123 a prime?"}
]
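
# Apply the model's chat template, append the generation prompt, and move
# the token ids to the model's device.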
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
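
# Generate up to 500 new tokens and decode the full sequence (prompt included).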
generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```

For deployment, you can use the latest version of vLLM to create an OpenAI-compatible API endpoint:

```bash
vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 1048576 \
    --trust-remote-code
```
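Once the server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the official `openai` Python package; the base URL and placeholder API key reflect an assumed default local deployment:

```python
from openai import OpenAI

# Point the client at the local vLLM server (assumed default host and port).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-Linear-48B-A3B-Instruct",
    messages=[{"role": "user", "content": "Is 123 a prime?"}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```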
If you find our work helpful, please cite:

```bibtex
@misc{team2025kimi,
title = {Kimi Linear: An Expressive, Efficient Attention Architecture},
author = {Zhang, Yu and Lin, Zongyu and Yao, Xingcheng and Hu, Jiaxi and Meng, Fanqing and Liu, Chengyin and Men, Xin and Yang, Songlin and Li, Zhiyuan and Li, Wentao and Lu, Enzhe and Liu, Weizhou and Chen, Yanru and Xu, Weixin and Yu, Longhui and Wang, Yejie and Fan, Yu and Zhong, Longguang and Yuan, Enming and Zhang, Dehao and Zhang, Yizhi and T. Liu, Y. and Wang, Haiming and Fang, Shengjun and He, Weiran and Liu, Shaowei and Li, Yiwei and Su, Jianlin and Qiu, Jiezhong and Pang, Bo and Yan, Junjie and Jiang, Zhejun and Huang, Weixiao and Yin, Bohong and You, Jiacheng and Wei, Chu and Wang, Zhengtao and Hong, Chao and Chen, Yutian and Chen, Guanduo and Wang, Yucheng and Zheng, Huabin and Wang, Feng and Liu, Yibo and Dong, Mengnan and Zhang, Zheng and Pan, Siyuan and Wu, Wenhao and Wu, Yuhao and Guan, Longyu and Tao, Jiawen and Fu, Guohong and Xu, Xinran and Wang, Yuzhi and Lai, Guokun and Wu, Yuxin and Zhou, Xinyu and Yang, Zhilin and Du, Yulun},
year = {2025},
eprint = {2510.26692},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
```