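DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

1. Introduction

Today we are introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It has 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and raising the maximum generation throughput to 5.76 times.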
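
To put the headline figures in perspective, the short sketch below derives a few ratios from them. It uses only the numbers quoted in this section; the constant and function names are illustrative, and the derived values are plain arithmetic rather than additional measurements.

```python
# Figures quoted in the introduction (parameter counts in billions).
TOTAL_PARAMS_B = 236
ACTIVE_PARAMS_B = 21
KV_CACHE_REDUCTION = 0.933    # relative to DeepSeek 67B
TRAINING_COST_SAVING = 0.425  # relative to DeepSeek 67B


def derived_ratios() -> dict:
    """Return ratios implied by the quoted figures (pure arithmetic)."""
    return {
        # Share of all parameters that is activated for each token.
        "active_fraction": ACTIVE_PARAMS_B / TOTAL_PARAMS_B,
        # KV cache size that remains, relative to DeepSeek 67B.
        "kv_cache_remaining": 1.0 - KV_CACHE_REDUCTION,
        # How many times smaller the KV cache becomes.
        "kv_cache_shrink_factor": 1.0 / (1.0 - KV_CACHE_REDUCTION),
        # Training cost that remains, relative to DeepSeek 67B.
        "training_cost_remaining": 1.0 - TRAINING_COST_SAVING,
    }


if __name__ == "__main__":
    for name, value in derived_ratios().items():
        print(f"{name}: {value:.3f}")
```

Read this way, roughly 9% of the parameters are active per token, and a 93.3% reduction corresponds to a KV cache about 15 times smaller.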

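We pretrained DeepSeek-V2 on a diverse, high-quality corpus of 8.1 trillion tokens. This comprehensive pretraining was followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock the model's capabilities. The evaluation results confirm the effectiveness of our approach: DeepSeek-V2 achieves remarkable performance on both standard benchmarks and open-ended generation evaluation.

2. Model Downloads

Model	Context Length	Download
DeepSeek-V2	128k	🤗 HuggingFace
DeepSeek-V2-Chat (RL)	128k	🤗 HuggingFace

Due to the constraints of HuggingFace, the open-source code currently runs slower on GPUs than our internal codebase. To make running our model efficient, we provide a dedicated vLLM solution that optimizes its runtime performance.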

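3. Evaluation Results

Base Model

Standard Benchmark

Benchmark	Domain	LLaMA3 70B	Mixtral 8x22B	DeepSeek-V1 (Dense-67B)	DeepSeek-V2 (MoE-236B)
MMLU	English	78.9	77.6	71.3	78.5
BBH	English	81.0	78.9	68.7	78.9
C-Eval	Chinese	67.5	58.6	66.1	81.7
CMMLU	Chinese	69.3	60.0	70.8	84.0
HumanEval	Code	48.2	53.1	45.1	48.8
MBPP	Code	68.6	64.2	57.4	66.6
GSM8K	Math	83.0	80.3	63.4	79.2
Math	Math	42.2	42.5	18.7	43.6

For more evaluation details, such as few-shot settings and prompts, please check our paper.

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.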

Chat Model

Standard Benchmark

Benchmark	Domain	QWen1.5 72B Chat	Mixtral 8x22B	LLaMA3 70B Instruct	DeepSeek-V1 Chat (SFT)	DeepSeek-V2 Chat (SFT)	DeepSeek-V2 Chat (RL)
MMLU	English	76.2	77.8	80.3	71.1	78.4	77.8
BBH	English	65.9	78.4	80.1	71.7	81.3	79.7
C-Eval	Chinese	82.2	60.0	67.9	65.2	80.9	78.0
CMMLU	Chinese	82.9	61.0	70.7	67.8	82.4	81.6
HumanEval	Code	68.9	75.0	76.2	73.8	76.8	81.1
MBPP	Code	52.2	64.4	69.8	61.4	70.4	72.0
LiveCodeBench (0901-0401)	Code	18.8	25.0	30.5	18.3	28.7	32.5
GSM8K	Math	81.9	87.9	93.2	84.1	90.8	92.2
Math	Math	40.6	49.8	48.5	32.6	52.7	53.9

English Open-Ended Generation Evaluation

We evaluated our model on AlpacaEval 2.0 and MTBench; the results show that DeepSeek-V2-Chat-RL is competitive on English conversation generation.

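Chinese Open-Ended Generation Evaluation

Alignbench (https://arxiv.org/abs/2311.18743)

Model	Open/Closed Source	Overall	Chinese Reasoning	Chinese Language
gpt-4-1106-preview	Closed	8.01	7.73	8.29
DeepSeek-V2 Chat (RL)	Open	7.91	7.45	8.35
erniebot-4.0-202404 (文心一言)	Closed	7.89	7.61	8.17
DeepSeek-V2 Chat (SFT)	Open	7.74	7.30	8.17
gpt-4-0613	Closed	7.53	7.47	7.59
erniebot-4.0-202312 (文心一言)	Closed	7.36	6.84	7.88
moonshot-v1-32k-202404 (月之暗面)	Closed	7.22	6.42	8.02
Qwen1.5-72B-Chat (通义千问)	Open	7.19	6.45	7.93
DeepSeek-67B-Chat	Open	6.43	5.75	7.11
Yi-34B-Chat (零一万物)	Open	6.12	4.86	7.38
gpt-3.5-turbo-0613	Closed	6.08	5.35	6.71

Coding Benchmarks

We evaluated our model on LiveCodeBench (0901-0401), a benchmark designed for live coding challenges. DeepSeek-V2 demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that surpasses several other sophisticated models (see the LiveCodeBench row in the chat model table above). This performance highlights the model's effectiveness on live coding tasks.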

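4. Model Architecture

DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:

For attention, we design MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference.
For Feed-Forward Networks (FFNs), we adopt the DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower cost.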
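
The idea behind low-rank key-value joint compression can be illustrated with a toy cache. The sketch below is a simplified illustration under stated assumptions, not the model's actual implementation: each token's hidden state is projected down to one small latent vector, only that latent is cached, and keys and values are rebuilt from it on demand. The class name, the dimensions, and the random weights are all hypothetical, and the sketch leaves out positional encoding, the split into heads, and the attention computation itself.

```python
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Create a rows x cols matrix with random entries (stand-in for trained weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LowRankKVCache:
    """Toy cache that keeps one small latent vector per token.

    Hypothetical sizes: d_model is the hidden size, d_latent the compressed
    size, and d_kv the width of the keys (and of the values).
    """

    def __init__(self, d_model, d_latent, d_kv, seed=0):
        rng = random.Random(seed)
        self.w_down = random_matrix(d_latent, d_model, rng)  # joint down-projection
        self.w_up_k = random_matrix(d_kv, d_latent, rng)     # latent -> keys
        self.w_up_v = random_matrix(d_kv, d_latent, rng)     # latent -> values
        self.latents = []                                    # the only per-token state

    def append(self, hidden):
        """Compress one token's hidden state and cache the latent."""
        self.latents.append(matvec(self.w_down, hidden))

    def keys_values(self):
        """Rebuild keys and values for every cached token from the latents."""
        keys = [matvec(self.w_up_k, c) for c in self.latents]
        values = [matvec(self.w_up_v, c) for c in self.latents]
        return keys, values

    def cached_floats(self):
        """Number of floats currently held in the cache."""
        return sum(len(c) for c in self.latents)


def full_kv_floats(num_tokens, d_kv):
    """Floats a conventional cache would hold: one key and one value per token."""
    return num_tokens * 2 * d_kv


if __name__ == "__main__":
    cache = LowRankKVCache(d_model=16, d_latent=4, d_kv=16)
    rng = random.Random(1)
    for _ in range(5):
        cache.append([rng.uniform(-1.0, 1.0) for _ in range(16)])
    keys, values = cache.keys_values()
    print("latent cache floats:", cache.cached_floats())
    print("full KV cache floats:", full_kv_floats(5, 16))
```

With these toy sizes the latent cache holds 4 floats per token instead of 32, which is where the memory saving comes from; the cost is the extra up-projection when keys and values are needed.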

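5. Chat Website

You can chat with DeepSeek-V2 on DeepSeek's official website: chat.deepseek.com

6. API Platform

We also provide an OpenAI-compatible API on the DeepSeek Platform: platform.deepseek.com. Sign up to receive millions of free tokens. You can also choose pay-as-you-go pricing at a very competitive rate.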
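
Because the API is described as OpenAI-compatible, a request would follow the usual chat-completions shape. The sketch below only builds such a request with the Python standard library and does not send it. The base URL, the endpoint path, and the model name are assumptions that this document does not state, so check the platform documentation for the real values.

```python
import json
import urllib.request

# Assumptions (not stated in this document): base URL and model identifier.
API_BASE_URL = "https://api.deepseek.com"  # assumed endpoint
MODEL_NAME = "deepseek-chat"               # assumed model name


def build_chat_request(api_key, user_message,
                       base_url=API_BASE_URL, model=MODEL_NAME):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_chat_request("YOUR_API_KEY", "Who are you?")
    print(request.get_method(), request.full_url)
    # To send it, pass the request to urllib.request.urlopen() and
    # parse the response body with json.load().
```

Building the request separately from sending it keeps the example runnable without network access or a real API key.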

7. How to Run Locally

Running inference with DeepSeek-V2 in BF16 format requires 8 GPUs with 80GB of memory each.
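
A rough back-of-the-envelope check shows why this much memory is needed. The sketch below counts only the weights, at 2 bytes per BF16 parameter, and compares them with the combined GPU memory. It uses decimal gigabytes and ignores activations, the KV cache, and framework overhead, so treat it as a lower bound and not a sizing guide; the function and constant names are illustrative.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 uses 16 bits per value


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_BF16_PARAM):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def total_gpu_memory_gb(num_gpus, gb_per_gpu):
    """Combined memory across all GPUs, in gigabytes."""
    return num_gpus * gb_per_gpu


if __name__ == "__main__":
    weights = weight_memory_gb(236e9)        # 236B total parameters
    available = total_gpu_memory_gb(8, 80)   # 8 GPUs with 80GB each
    print(f"weights: {weights:.0f} GB")
    print(f"available: {available} GB")
    print(f"headroom: {available - weights:.0f} GB")
```

The weights alone take about 472 GB of the 640 GB available, which leaves the remainder for activations and the KV cache.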

Inference with Hugging Face Transformers

You can use Hugging Face Transformers directly for model inference.

Text Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Chat Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

The complete chat template can be found in tokenizer_config.json in the Hugging Face model repository.

An example of the chat template is shown below:

<｜begin▁of▁sentence｜>User: {user_message_1}

Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

Assistant:

You can also add an optional system message:

<｜begin▁of▁sentence｜>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

Assistant:
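
The two layouts above can be reproduced with a small helper, which is useful for checking how a conversation will be laid out. This is a minimal sketch whose whitespace is inferred from the examples above; the function name is hypothetical, and the template in tokenizer_config.json (applied via tokenizer.apply_chat_template) remains the authoritative version.

```python
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"


def render_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages in the layout shown above."""
    parts = [BOS]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # The optional system message comes right after the BOS token.
            parts.append(f"{content}\n\n")
        elif role == "user":
            parts.append(f"User: {content}\n\n")
        elif role == "assistant":
            # Each finished assistant turn is closed with the EOS token.
            parts.append(f"Assistant: {content}{EOS}")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if add_generation_prompt:
        # Leave the prompt open so the model writes the next assistant turn.
        parts.append("Assistant:")
    return "".join(parts)


if __name__ == "__main__":
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there."},
        {"role": "user", "content": "Who are you?"},
    ]
    print(render_chat(conversation))
```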

Inference with vLLM (recommended)

To use vLLM for model inference, please merge the following pull request into your vLLM codebase: https://github.com/vllm-project/vllm/pull/4650.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

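8. License

This code repository is licensed under the MIT License. Use of the DeepSeek-V2 Base/Chat models is subject to the Model License. The DeepSeek-V2 series (including Base and Chat) supports commercial use.

9. Citation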

@misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2405.04434},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

10. Contact

If you have any questions, please open an issue or contact us at service@deepseek.com.
