DeepSeek-V3:DeepSeek-V3：强大开源的混合专家模型，671B总参数，激活37B，采用多头潜在注意力机制与DeepSeekMoE架构，训练高效、成本低，性能卓越，开源界表现领先，逼近闭源模型水平，推理加速，推理稳定，适用于多种硬件和开源软件。【此简介由AI生成】。

transformers 库

论文链接👁️

1. 简介

我们介绍 DeepSeek-V3，这是一个强大的混合专家（MoE）语言模型，总参数量为 6710 亿，每个标记激活 370 亿个参数。为了实现高效的推理和具有成本效益的训练，DeepSeek-V3 采用了多头潜在注意力（MLA）和 DeepSeekMoE 架构，这些架构已在 DeepSeek-V2 中得到了充分验证。此外，DeepSeek-V3 开创了一种无辅助损失的负载平衡策略，并设定了一个多标记预测训练目标，以实现更强的性能。我们在 14.8 万亿个不同且高质量的标记上预训练 DeepSeek-V3，然后进行监督式微调和强化学习阶段，以充分发挥其能力。综合评估表明，DeepSeek-V3 的性能优于其他开源模型，并且达到了与领先的封闭源代码模型相当的性能。尽管性能优异，但 DeepSeek-V3 的完整训练只需要 278.8 万个 H800 GPU 小时。此外，它的训练过程非常稳定。在整个训练过程中，我们没有遇到任何无法恢复的损失峰值或执行任何回滚。

2. 模型概述

架构：创新的负载平衡策略和训练目标

在 DeepSeek-V2 效率高的架构之上，我们开创了一种无辅助损失的负载平衡策略，它将因鼓励负载平衡而产生的性能下降降至最低。
我们研究了一种多标记预测（MTP）目标，并证明它对模型性能有益。它还可以用于推理加速的投机解码。

预训练：走向极致的训练效率

我们设计了一个 FP8 混合精度训练框架，并首次验证了 FP8 训练在超大规模模型上的可行性和有效性。
通过算法、框架和硬件的共同设计，我们克服了跨节点 MoE 训练中的通信瓶颈，几乎实现了计算和通信的完全重叠。这大大提高了我们的训练效率并降低了训练成本，使我们能够在没有额外开销的情况下扩大模型规模。
仅需 266.4 万个 H800 GPU 小时，我们就完成了 DeepSeek-V3 在 14.8 万亿个标记上的预训练，生成了目前最强劲的开源基模型。预训练后的后续训练阶段只需要 0.1 万个 GPU 小时。

后训练：从 DeepSeek-R1 中提取知识

我们引入了一种创新方法，可以从长链式思维（CoT）模型，特别是 DeepSeek R1 系列模型之一中提取推理能力，并将它融入标准 LLM，特别是 DeepSeek-V3 中。我们的管道巧妙地将 R1 的验证和反思模式融入 DeepSeek-V3 中，并显着提高了其推理性能。同时，我们还控制了 DeepSeek-V3 的输出风格和长度。

3. 模型下载

模型	#总参数	#激活参数	上下文长度	下载
DeepSeek-V3-Base	6710 亿	370 亿	128K	🤗 HuggingFace
DeepSeek-V3	6710 亿	370 亿	128K	🤗 HuggingFace

注意：HuggingFace 上 DeepSeek-V3 模型的总大小为 6850 亿，其中包括 6710 亿的主模型权重和 140 亿的多标记预测（MTP）模块权重。

为了确保最佳性能和灵活性，我们与开源社区和硬件供应商合作，提供了多种方式在本地运行模型。有关逐步指南，请查看第 6 节：如何在本地运行。

对于希望深入研究的开发人员，我们建议查看 README_WEIGHTS.md，了解主模型权重和多标记预测（MTP）模块的详细信息。请注意，MTP 支持目前正在社区内积极开发中，我们欢迎您的贡献和反馈。

4. Evaluation Results

Base Model

Standard Benchmarks

	Benchmark (Metric)	# Shots	DeepSeek-V2	Qwen2.5 72B	LLaMA3.1 405B	DeepSeek-V3
	Architecture	-	MoE	Dense	Dense	MoE
	# Activated Params	-	21B	72B	405B	37B
	# Total Params	-	236B	72B	405B	671B
English	Pile-test (BPB)	-	0.606	0.638	0.542	0.548
	BBH (EM)	3-shot	78.8	79.8	82.9	87.5
	MMLU (Acc.)	5-shot	78.4	85.0	84.4	87.1
	MMLU-Redux (Acc.)	5-shot	75.6	83.2	81.3	86.2
	MMLU-Pro (Acc.)	5-shot	51.4	58.3	52.8	64.4
	DROP (F1)	3-shot	80.4	80.6	86.0	89.0
	ARC-Easy (Acc.)	25-shot	97.6	98.4	98.4	98.9
	ARC-Challenge (Acc.)	25-shot	92.2	94.5	95.3	95.3
	HellaSwag (Acc.)	10-shot	87.1	84.8	89.2	88.9
	PIQA (Acc.)	0-shot	83.9	82.6	85.9	84.7
	WinoGrande (Acc.)	5-shot	86.3	82.3	85.2	84.9
	RACE-Middle (Acc.)	5-shot	73.1	68.1	74.2	67.1
	RACE-High (Acc.)	5-shot	52.6	50.3	56.8	51.3
	TriviaQA (EM)	5-shot	80.0	71.9	82.7	82.9
	NaturalQuestions (EM)	5-shot	38.6	33.2	41.5	40.0
	AGIEval (Acc.)	0-shot	57.5	75.8	60.6	79.6
Code	HumanEval (Pass@1)	0-shot	43.3	53.0	54.9	65.2
	MBPP (Pass@1)	3-shot	65.0	72.6	68.4	75.4
	LiveCodeBench-Base (Pass@1)	3-shot	11.6	12.9	15.5	19.4
	CRUXEval-I (Acc.)	2-shot	52.5	59.1	58.5	67.3
	CRUXEval-O (Acc.)	2-shot	49.8	59.9	59.9	69.8
Math	GSM8K (EM)	8-shot	81.6	88.3	83.5	89.3
	MATH (EM)	4-shot	43.4	54.4	49.0	61.6
	MGSM (EM)	8-shot	63.6	76.2	69.9	79.8
	CMath (EM)	3-shot	78.7	84.5	77.3	90.7
Chinese	CLUEWSC (EM)	5-shot	82.0	82.5	83.0	82.7
	C-Eval (Acc.)	5-shot	81.4	89.2	72.5	90.1
	CMMLU (Acc.)	5-shot	84.0	89.5	73.7	88.8
	CMRC (EM)	1-shot	77.4	75.8	76.0	76.3
	C3 (Acc.)	0-shot	77.4	76.7	79.7	78.6
	CCPM (Acc.)	0-shot	93.0	88.5	78.6	92.0
Multilingual	MMMLU-non-English (Acc.)	5-shot	64.0	74.8	73.8	79.4

Note: Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks.

For more evaluation details, please check our paper.

Context Window

Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

Chat Model

Standard Benchmarks (Models larger than 67B)

	Benchmark (Metric)	DeepSeek V2-0506	DeepSeek V2.5-0905	Qwen2.5 72B-Inst.	LLaMA3.1 405B-Inst.	Claude-3.5-Sonnet-1022	GPT-4o 0513	DeepSeek V3
	Architecture	MoE	MoE	Dense	Dense	-	-	MoE
	# Activated Params	21B	21B	72B	405B	-	-	37B
	# Total Params	236B	236B	72B	405B	-	-	671B
English	MMLU (EM)	78.2	80.6	85.3	88.6	88.3	87.2	88.5
	MMLU-Redux (EM)	77.9	80.3	85.6	86.2	88.9	88.0	89.1
	MMLU-Pro (EM)	58.5	66.2	71.6	73.3	78.0	72.6	75.9
	DROP (3-shot F1)	83.0	87.8	76.7	88.7	88.3	83.7	91.6
	IF-Eval (Prompt Strict)	57.7	80.6	84.1	86.0	86.5	84.3	86.1
	GPQA-Diamond (Pass@1)	35.3	41.3	49.0	51.1	65.0	49.9	59.1
	SimpleQA (Correct)	9.0	10.2	9.1	17.1	28.4	38.2	24.9
	FRAMES (Acc.)	66.9	65.4	69.8	70.0	72.5	80.5	73.3
	LongBench v2 (Acc.)	31.6	35.4	39.4	36.1	41.0	48.1	48.7
Code	HumanEval-Mul (Pass@1)	69.3	77.4	77.3	77.2	81.7	80.5	82.6
	LiveCodeBench (Pass@1-COT)	18.8	29.2	31.1	28.4	36.3	33.4	40.5
	LiveCodeBench (Pass@1)	20.3	28.4	28.7	30.1	32.8	34.2	37.6
	Codeforces (Percentile)	17.5	35.6	24.8	25.3	20.3	23.6	51.6
	SWE Verified (Resolved)	-	22.6	23.8	24.5	50.8	38.8	42.0
	Aider-Edit (Acc.)	60.3	71.6	65.4	63.9	84.2	72.9	79.7
	Aider-Polyglot (Acc.)	-	18.2	7.6	5.8	45.3	16.0	49.6
Math	AIME 2024 (Pass@1)	4.6	16.7	23.3	23.3	16.0	9.3	39.2
	MATH-500 (EM)	56.3	74.7	80.0	73.8	78.3	74.6	90.2
	CNMO 2024 (Pass@1)	2.8	10.8	15.9	6.8	13.1	10.8	43.2
Chinese	CLUEWSC (EM)	89.9	90.4	91.4	84.7	85.4	87.9	90.9
	C-Eval (EM)	78.6	79.5	86.1	61.5	76.7	76.0	86.5
	C-SimpleQA (Correct)	48.5	54.1	48.4	50.4	51.3	59.3	64.8

Note: All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.

Open Ended Generation Evaluation

Model	Arena-Hard	AlpacaEval 2.0
DeepSeek-V2.5-0905	76.2	50.5
Qwen2.5-72B-Instruct	81.2	49.1
LLaMA-3.1 405B	69.3	40.5
GPT-4o-0513	80.4	51.1
Claude-Sonnet-3.5-1022	85.2	52.0
DeepSeek-V3	85.5	70.0

Note: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

5. Chat Website & API Platform

You can chat with DeepSeek-V3 on DeepSeek's official website: chat.deepseek.com

We also provide OpenAI-Compatible API at DeepSeek Platform: platform.deepseek.com

6. How to Run Locally

DeepSeek-V3 can be deployed locally using the following hardware and open-source community software:

DeepSeek-Infer Demo: We provide a simple and lightweight demo for FP8 and BF16 inference.
SGLang: Fully support the DeepSeek-V3 model in both BF16 and FP8 inference modes.
LMDeploy: Enables efficient FP8 and BF16 inference for local and cloud deployment.
TensorRT-LLM: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon.
vLLM: Support DeekSeek-V3 model with FP8 and BF16 modes for tensor parallelism and pipeline parallelism.
AMD GPU: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
Huawei Ascend NPU: Supports running DeepSeek-V3 on Huawei Ascend devices.

Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation.

Here is an example of converting FP8 weights to BF16:

cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights

注意：目前尚不支持直接使用 Huggingface 的 Transformers。

6.1 使用 DeepSeek-Infer 演示进行推理（仅供示例）

模型权重和演示代码准备

首先，克隆我们的 DeepSeek-V3 GitHub 仓库：

git clone https://github.com/deepseek-ai/DeepSeek-V3.git

进入 inference 文件夹并安装 requirements.txt 中列出的依赖项。

cd DeepSeek-V3/inference
pip install -r requirements.txt

从 HuggingFace 下载模型权重，并将它们放入 /path/to/DeepSeek-V3 文件夹中。

模型权重转换

将 HuggingFace 模型权重转换为特定格式：

python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16

运行

然后，您可以与 DeepSeek-V3 聊天：

torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200

或者对给定文件进行批量推理：

torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE

6.2 使用 SGLang 进行推理（推荐）

SGLang 目前支持 MLA 优化、FP8（W8A8）、FP8 KV 缓存和 Torch Compile，在开源框架中提供最先进的延迟和吞吐量性能。

值得注意的是，SGLang v0.4.1 完全支持在 NVIDIA 和 AMD GPU 上运行 DeepSeek-V3，使其成为一个高度通用且强大的解决方案。

以下是 SGLang 团队提供的启动说明：https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

6.3 使用 LMDeploy 进行推理（推荐）

LMDeploy 是一个灵活高效的推理和服务框架，专为大型语言模型量身打造，现已支持 DeepSeek-V3。它提供离线管道处理和在线部署功能，无缝集成基于 PyTorch 的工作流程。

有关使用 LMDeploy 运行 DeepSeek-V3 的详细逐步说明，请参考此处：https://github.com/InternLM/lmdeploy/issues/2960

6.4 使用 TRT-LLM 进行推理（推荐）

TensorRT-LLM 现已支持 DeepSeek-V3 模型，提供 BF16 和 INT4/INT8 权重only 等精度选项。FP8 的支持目前正在进行中，并将很快发布。您可以通过以下链接访问 TRTLLM 的自定义分支，专门针对 DeepSeek-V3 支持，以便直接体验新功能：https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3。

6.5 使用 vLLM 进行推理（推荐）

vLLM v0.6.6 支持在 NVIDIA 和 AMD GPU 上针对 FP8 和 BF16 模式进行 DeepSeek-V3 推理。除了标准技术之外，vLLM 还提供 管道并行 功能，使您能够在多台通过网络连接的机器上运行此模型。有关详细指南，请参阅 vLLM 指示。请随时关注增强计划。

6.6 推荐在 AMD GPU 上进行推理的功能

与 AMD 团队的合作，我们实现了对 AMD GPU 的第一天支持，使用 SGLang，完全兼容 FP8 和 BF16 精度。有关详细指南，请参考 SGLang 指示。

6.7 推荐在华为昇腾 NPU 上进行推理的功能

华为昇腾社区的 MindIE 框架已成功适配 DeepSeek-V3 的 BF16 版本。有关昇腾 NPU 的逐步指南，请遵循此处的说明。

7. 许可证

此代码存储库受 MIT 许可证许可。使用 DeepSeek-V3 Base/Chat 模型受模型许可证约束。DeepSeek-V3 系列（包括 Base 和 Chat）支持商业用途。

8. 引用

@misc{deepseekai2024deepseekv3technicalreport,
      title={DeepSeek-V3 Technical Report}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2412.19437},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.19437}, 
}

9. 联系方式

如果您有任何疑问，请提交问题或通过 service@deepseek.com 联系我们。