Technical report: arXiv:2405.04434

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

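1. Introduction

Today, we are introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It has 236B total parameters, of which only 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and raising the maximum generation throughput to 5.76 times that of DeepSeek 67B.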
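As a quick sanity check on these figures, the short sketch below works out what fraction of the parameters is active per token and what a 93.3% KV cache reduction means as a shrink factor. It uses only the numbers quoted in this section; the variable names are illustrative.

```python
# Figures quoted in the introduction (B = billions of parameters).
total_params_b = 236
active_params_b = 21

# Share of the model that participates in the forward pass of one token.
active_fraction = active_params_b / total_params_b
print(f"Active parameters per token: {active_fraction:.1%}")

# KV cache size relative to DeepSeek 67B, per the reported reduction.
kv_cache_reduction = 0.933
kv_cache_remaining = 1 - kv_cache_reduction
print(f"KV cache remaining: {kv_cache_remaining:.1%}")
print(f"KV cache shrink factor: {1 / kv_cache_remaining:.1f}x")
```

In other words, roughly 9% of the parameters are used for any single token, and the KV cache is about one fifteenth of its former size.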

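We pretrained DeepSeek-V2 on a diverse, high-quality corpus of 8.1 trillion tokens. After this pretraining, we applied Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock the model's capabilities. The evaluation results confirm the effectiveness of our approach: DeepSeek-V2 performs remarkably well on both standard benchmarks and open-ended generation evaluation.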

2. Model Downloads

Model	Context Length	Download
DeepSeek-V2	128k	🤗 HuggingFace
DeepSeek-V2-Chat (RL)	128k	🤗 HuggingFace

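Due to the constraints of HuggingFace, the current open-source code runs somewhat slower on GPUs than our internal codebase. To help users run our model efficiently, we provide a dedicated vLLM solution that optimizes runtime performance.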

3. Evaluation Results

Base Model

Standard Benchmark

Benchmark	Domain	LLaMA3 70B	Mixtral 8x22B	DeepSeek-V1 (Dense-67B)	DeepSeek-V2 (MoE-236B)
MMLU	English	78.9	77.6	71.3	78.5
BBH	English	81.0	78.9	68.7	78.9
C-Eval	Chinese	67.5	58.6	66.1	81.7
CMMLU	Chinese	69.3	60.0	70.8	84.0
HumanEval	Code	48.2	53.1	45.1	48.8
MBPP	Code	68.6	64.2	57.4	66.6
GSM8K	Math	83.0	80.3	63.4	79.2
Math	Math	42.2	42.5	18.7	43.6

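For more evaluation details, such as few-shot settings and prompts, please refer to our paper.

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.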

Chat Model

Standard Benchmark

Benchmark	Domain	Qwen1.5 72B Chat	Mixtral 8x22B	LLaMA3 70B Instruct	DeepSeek-V1 Chat (SFT)	DeepSeek-V2 Chat (SFT)	DeepSeek-V2 Chat (RL)
MMLU	English	76.2	77.8	80.3	71.1	78.4	77.8
BBH	English	65.9	78.4	80.1	71.7	81.3	79.7
C-Eval	Chinese	82.2	60.0	67.9	65.2	80.9	78.0
CMMLU	Chinese	82.9	61.0	70.7	67.8	82.4	81.6
HumanEval	Code	68.9	75.0	76.2	73.8	76.8	81.1
MBPP	Code	52.2	64.4	69.8	61.4	70.4	72.0
LiveCodeBench (0901-0401)	Code	18.8	25.0	30.5	18.3	28.7	32.5
GSM8K	Math	81.9	87.9	93.2	84.1	90.8	92.2
Math	Math	40.6	49.8	48.5	32.6	52.7	53.9

English Open-Ended Generation Evaluation

We evaluated our model on AlpacaEval 2.0 and MTBench. The results show that DeepSeek-V2-Chat-RL is competitive in English conversation generation.

Chinese Open-Ended Generation Evaluation

Alignbench (https://arxiv.org/abs/2311.18743)

Model	Open/Closed Source	Total	Chinese Reasoning	Chinese Language
gpt-4-1106-preview	Closed	8.01	7.73	8.29
DeepSeek-V2 Chat (RL)	Open	7.91	7.45	8.35
erniebot-4.0-202404 (文心一言)	Closed	7.89	7.61	8.17
DeepSeek-V2 Chat (SFT)	Open	7.74	7.30	8.17
gpt-4-0613	Closed	7.53	7.47	7.59
erniebot-4.0-202312 (文心一言)	Closed	7.36	6.84	7.88
moonshot-v1-32k-202404 (月之暗面)	Closed	7.22	6.42	8.02
Qwen1.5-72B-Chat (通义千问)	Open	7.19	6.45	7.93
DeepSeek-67B-Chat	Open	6.43	5.75	7.11
Yi-34B-Chat (零一万物)	Open	6.12	4.86	7.38
gpt-3.5-turbo-0613	Closed	6.08	5.35	6.71

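Coding Benchmarks

We evaluated our model on LiveCodeBench (0901-0401), a benchmark designed for live coding challenges. As the figure shows, DeepSeek-V2 demonstrates considerable proficiency on LiveCodeBench, with a Pass@1 score that surpasses several other sophisticated models. The chat-model table above reports 32.5 for DeepSeek-V2 Chat (RL) on this benchmark, against 30.5 for LLaMA3 70B Instruct and 25.0 for Mixtral 8x22B. This result highlights the model's effectiveness on live coding tasks.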
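For readers unfamiliar with the metric, the sketch below shows one common way to compute Pass@k: the unbiased estimator widely used for code benchmarks, averaged over problems. This document does not state exactly how the LiveCodeBench figures were computed, so treat this as a general illustration; the per-problem results are made-up sample data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must contain a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples drawn, samples that passed).
results = [(10, 3), (10, 0), (10, 10), (10, 5)]
score = benchmark_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged across problems.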

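4. Model Architecture

DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:

For attention, we designed MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference.
For Feed-Forward Networks (FFNs), we adopted the DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower cost.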
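To make the idea of low-rank key-value joint compression concrete, here is a minimal pure-Python sketch. It caches one small latent vector per token and rebuilds keys and values from it on demand. All dimensions and weight names are toy, hypothetical values rather than DeepSeek-V2's real configuration, and the sketch leaves out position encoding and every other detail of the actual MLA design described in the paper.

```python
import random

random.seed(0)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols):
    """Random matrix standing in for learned weights or activations."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical, chosen only to keep the example small).
d_model, n_heads, d_head, d_latent = 16, 4, 4, 2
seq_len = 5

w_down = rand_matrix(d_model, d_latent)           # shared down-projection
w_up_k = rand_matrix(d_latent, n_heads * d_head)  # up-projection for keys
w_up_v = rand_matrix(d_latent, n_heads * d_head)  # up-projection for values

hidden = rand_matrix(seq_len, d_model)            # one hidden state per token

# Only this compressed latent is stored in the cache during generation.
latent_cache = matmul(hidden, w_down)             # seq_len x d_latent

# Keys and values are reconstructed from the latent when attention needs them.
keys = matmul(latent_cache, w_up_k)               # seq_len x (n_heads * d_head)
values = matmul(latent_cache, w_up_v)

standard_cache_per_token = 2 * n_heads * d_head   # separate K and V entries
latent_cache_per_token = d_latent
print(f"cached values per token: {standard_cache_per_token} -> {latent_cache_per_token}")
```

The saving comes from caching the joint latent instead of full per-head keys and values; the price is the extra up-projection at attention time.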

5. Chat Website

You can chat with DeepSeek-V2 on DeepSeek's official website: chat.deepseek.com

6. API Platform

We also provide an OpenAI-compatible API on the DeepSeek Platform: platform.deepseek.com. Sign up to receive millions of free tokens. You can also pay as you go at a highly competitive price.
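Because the API is OpenAI-compatible, a request takes the familiar chat-completions shape. The sketch below builds such a request with the Python standard library, without sending it. The base URL, endpoint path, model name, and environment variable are assumptions for illustration and are not stated in this document; check the platform documentation for the actual values.

```python
import json
import os
import urllib.request

# Assumptions, not taken from this document: base URL, path, and model name.
BASE_URL = "https://api.deepseek.com"
MODEL_NAME = "deepseek-chat"


def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request (not sent here)."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# DEEPSEEK_API_KEY is a hypothetical variable name; the fallback is a placeholder.
api_key = os.environ.get("DEEPSEEK_API_KEY", "sk-placeholder")
request = build_chat_request("Who are you?", api_key)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Any OpenAI-compatible client library should also work once it is pointed at the platform's base URL.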

7. How to Run Locally

To run inference with DeepSeek-V2 in BF16 format, 8 GPUs with 80GB of memory each are required.
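A back-of-the-envelope calculation shows where this requirement comes from. The sketch below counts weight storage only, using the parameter count from the introduction and 2 bytes per BF16 value. Activations, the KV cache, and framework overhead all need additional room, so this is a lower bound rather than a precise sizing.

```python
total_params = 236e9   # total parameters, from the introduction
bytes_per_param = 2    # BF16 stores each value in 16 bits
gpu_count = 8
gpu_memory_gb = 80     # per-GPU memory, in decimal gigabytes

weights_gb = total_params * bytes_per_param / 1e9
cluster_gb = gpu_count * gpu_memory_gb
headroom_gb = cluster_gb - weights_gb

print(f"Weights alone:  {weights_gb:.0f} GB")
print(f"Cluster memory: {cluster_gb} GB")
print(f"Headroom:       {headroom_gb:.0f} GB for activations, KV cache, overhead")
```

The weights take about 472 GB, which does not fit on fewer 80GB cards once runtime memory is included.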

Inference with Huggingface Transformers

You can use Hugging Face Transformers directly for model inference.

Text Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Chat Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

The complete chat template can be found in the tokenizer_config.json file in the Hugging Face model repository.

An example of the chat template is as follows:

<｜begin▁of▁sentence｜>User: {user_message_1}

Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

Assistant:

You can also add an optional system message:

<｜begin▁of▁sentence｜>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

Assistant:
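The layout above can be reproduced by hand, which is useful for inspecting prompts. The following is a minimal sketch inferred only from the two examples shown; `render_chat` is a hypothetical helper, and the template in tokenizer_config.json (applied through `tokenizer.apply_chat_template`) remains the authoritative version and may differ in whitespace or edge cases.

```python
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"


def render_chat(messages):
    """Render a message list into the prompt layout shown above."""
    prompt = BOS
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # The optional system message comes first, followed by a blank line.
            prompt += f"{content}\n\n"
        elif role == "user":
            # Each user turn ends with the cue for the assistant to answer.
            prompt += f"User: {content}\n\nAssistant:"
        elif role == "assistant":
            # A completed assistant turn is closed with the end-of-sentence token.
            prompt += f" {content}{EOS}"
        else:
            raise ValueError(f"unknown role: {role}")
    return prompt


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Who are you?"},
]
print(render_chat(messages))
```

The rendered string ends with `Assistant:` so that generation continues from the assistant's turn.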

Inference with vLLM (recommended)

To use vLLM for model inference, merge the following pull request into your vLLM codebase: https://github.com/vllm-project/vllm/pull/4650.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

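8. License

This code repository is licensed under the MIT License. Use of the DeepSeek-V2 Base/Chat models is subject to the Model License. The DeepSeek-V2 series (including Base and Chat) supports commercial use.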

9. Citation

@misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2405.04434},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

10. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
