Baichuan 2

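The Baichuan API now supports search augmentation and a 192K long context window, and adds knowledge-base retrieval, free for a limited time!
🚀 The Baichuan chat platform is now officially open to the public 🎉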

Modification

Optimized the quick start code samples.

Table of Contents

📖 Introduction
⚙️ Quick Start
📊 Benchmark Evaluation
📜 Terms and Conditions

Introduction

Baichuan 2 is the new generation of open-source large language models developed by Baichuan Intelligence Inc. Trained on 2.6 trillion high-quality tokens, it achieves the best performance among models of the same size on authoritative Chinese and English benchmarks. This release includes 7B and 13B versions of both the base and chat models, plus 4-bit quantized variants of the chat models. All versions are fully open to academic research, and developers may also use them commercially free of charge after applying by email and obtaining an official commercial license. The available versions are listed below:

	Base Model	Chat Model	4-bit Quantized Chat Model
7B	Baichuan2-7B-Base	Baichuan2-7B-Chat	Baichuan2-7B-Chat-4bits
13B	Baichuan2-13B-Base	Baichuan2-13B-Chat	Baichuan2-13B-Chat-4bits
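To give a rough sense of what the 4-bit variants save, the sketch below estimates the memory needed for the weights alone. It assumes nominal parameter counts of 7 and 13 billion (taken from the model names) and a 16-bit baseline for the unquantized models; neither figure is stated above, and real usage also includes activations, the KV cache, and quantization metadata.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1024**3


# Assumption: nominal parameter counts implied by the model names.
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9}

for name, params in NOMINAL_PARAMS.items():
    baseline = weight_memory_gib(params, 16)  # assumed 16-bit baseline
    quantized = weight_memory_gib(params, 4)  # 4-bit quantized weights
    print(f"{name}: ~{baseline:.1f} GiB at 16-bit, ~{quantized:.1f} GiB at 4-bit")
```

Under these assumptions the 4-bit weights take one quarter of the 16-bit footprint, which is the main reason to pick a quantized chat model on memory-constrained hardware.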

Benchmark Evaluation

We evaluated the models extensively on authoritative Chinese and English datasets across six domains: general, legal, medical, mathematics, code, and multilingual translation. For detailed results, see GitHub.

7B Model Results

	C-Eval	MMLU	CMMLU	Gaokao	AGIEval	BBH
	5-shot	5-shot	5-shot	5-shot	5-shot	3-shot
GPT-4	68.40	83.93	70.33	66.15	63.27	75.12
GPT-3.5 Turbo	51.10	68.54	54.06	47.07	46.13	61.59
LLaMA-7B	27.10	35.10	26.75	27.81	28.17	32.38
LLaMA2-7B	28.90	45.73	31.38	25.97	26.53	39.16
MPT-7B	27.15	27.93	26.00	26.54	24.83	35.20
Falcon-7B	24.23	26.03	25.66	24.24	24.10	28.77
ChatGLM2-6B	50.20	45.90	49.00	49.44	45.28	31.65
Baichuan-7B	42.80	42.30	44.02	36.34	34.44	32.48
Baichuan2-7B-Base	54.00	54.16	57.07	47.47	42.73	41.56

13B Model Results

	C-Eval	MMLU	CMMLU	Gaokao	AGIEval	BBH
	5-shot	5-shot	5-shot	5-shot	5-shot	3-shot
GPT-4	68.40	83.93	70.33	66.15	63.27	75.12
GPT-3.5 Turbo	51.10	68.54	54.06	47.07	46.13	61.59
LLaMA-13B	28.50	46.30	31.15	28.23	28.22	37.89
LLaMA2-13B	35.80	55.09	37.99	30.83	32.29	46.98
Vicuna-13B	32.80	52.00	36.28	30.11	31.55	43.04
Chinese-Alpaca-Plus-13B	38.80	43.90	33.43	34.78	35.46	28.94
XVERSE-13B	53.70	55.21	58.44	44.69	42.54	38.06
Baichuan-13B-Base	52.40	51.60	55.30	49.69	43.20	43.01
Baichuan2-13B-Base	58.10	59.17	61.97	54.33	48.17	48.78
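As a reading aid for the two tables above, the following sketch computes how much each Baichuan 2 base model improves over its first-generation counterpart on every benchmark. The scores are copied from the tables; the `gains` helper and the dictionary layout are illustrative choices, not part of any Baichuan tooling.

```python
BENCHMARKS = ["C-Eval", "MMLU", "CMMLU", "Gaokao", "AGIEval", "BBH"]

# Scores copied from the 7B and 13B tables above.
SCORES = {
    "Baichuan-7B": [42.80, 42.30, 44.02, 36.34, 34.44, 32.48],
    "Baichuan2-7B-Base": [54.00, 54.16, 57.07, 47.47, 42.73, 41.56],
    "Baichuan-13B-Base": [52.40, 51.60, 55.30, 49.69, 43.20, 43.01],
    "Baichuan2-13B-Base": [58.10, 59.17, 61.97, 54.33, 48.17, 48.78],
}


def gains(new_model: str, old_model: str) -> dict:
    """Per-benchmark score difference (new minus old), rounded to 2 decimals."""
    return {
        bench: round(new - old, 2)
        for bench, new, old in zip(BENCHMARKS, SCORES[new_model], SCORES[old_model])
    }


for new_model, old_model in [
    ("Baichuan2-7B-Base", "Baichuan-7B"),
    ("Baichuan2-13B-Base", "Baichuan-13B-Base"),
]:
    print(new_model, "vs", old_model, gains(new_model, old_model))
```

The differences are in absolute score points, so they are only comparable within a single benchmark column.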

Training Dynamics

In addition to the final Baichuan2-7B-Base model, we release 11 intermediate checkpoints from training (corresponding to roughly 0.2 to 2.4 trillion training tokens) for the research community (Checkpoints). The figure below shows how these checkpoints perform on the C-Eval, MMLU, and CMMLU benchmarks:

[Figure: checkpoint performance on C-Eval, MMLU, and CMMLU]
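To make the checkpoint range concrete, the sketch below lists one possible schedule. It assumes the 11 checkpoints are evenly spaced at 220 billion tokens each, which is consistent with the approximate 0.2 to 2.4 trillion range but is not stated in the text; the 2.6 trillion total is the training corpus size given in the introduction.

```python
NUM_CHECKPOINTS = 11         # stated in the text
STEP_BILLION_TOKENS = 220    # assumption: even spacing, not stated in the text
TOTAL_BILLION_TOKENS = 2600  # 2.6 trillion training tokens

# Token count (in billions) at which each hypothetical checkpoint was saved.
schedule = [STEP_BILLION_TOKENS * i for i in range(1, NUM_CHECKPOINTS + 1)]

for tokens in schedule:
    share = tokens / TOTAL_BILLION_TOKENS
    print(f"checkpoint at ~{tokens / 1000:.2f}T tokens ({share:.0%} of training)")
```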

Terms and Conditions

Disclaimer

We hereby declare that our team has not developed any applications based on Baichuan 2 models, whether on iOS, Android, the web, or any other platform. We strongly urge all users not to use Baichuan 2 models for any activity that harms national or social security or violates the law, and not to deploy them in internet services that have not undergone appropriate security review and registration. We hope that all users will help keep technology development within a compliant environment.

Although we have done our best to ensure the compliance of the training data, unforeseen issues may remain because of the complexity of the model and data. We therefore assume no liability for any problems arising from the use of Baichuan 2 open-source models, including but not limited to data security issues, public opinion risks, or any risks arising from the models being misused, abused, or illegally distributed.

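License

Use of Baichuan 2 models is subject to the Apache 2.0 license and the Baichuan 2 Model Community License Agreement. The models support commercial use. If you plan to use Baichuan 2 models or their derivatives commercially, make sure your entity meets all of the following conditions:

The daily active users (DAU) of your or your affiliates' services or products do not exceed 1 million.
Neither you nor your affiliates are software service providers or cloud service providers.
Neither you nor your affiliates will sublicense the commercial license to third parties without Baichuan's authorization.

If these conditions are met, send the application materials required by the Baichuan 2 Model Community License Agreement to opensource@baichuan-inc.com. Once the application is approved, Baichuan will grant you a non-exclusive, worldwide, non-transferable, non-sublicensable, and revocable commercial license.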

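Quick Start

Fine-tuning Guide

Data Preparation

We provide a preprocessing and fine-tuning example based on the belle_chat_ramdon dataset, which is available at the following link: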

belle_chat_ramdon_10k

Run the belle_preprocess.py script to convert the raw data, with the prompt template applied, into MindRecord format. This completes data preprocessing.

# Script path: example/dataset/belle_preprocess.py
python example/dataset/belle_preprocess.py \
--input_glob /{path}/belle_chat_ramdon_10k.json \
--output_file /{path}/belle_512.mindrecord \
--seq_length 512

# Parameters
input_glob: path to the input dataset
model_file: path to the vocabulary file
output_file: path and name of the output dataset
seq_length: sequence length of the generated dataset
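The seq_length parameter fixes the length of every sample in the generated dataset. The sketch below illustrates the general idea of forcing a tokenized sample to a fixed length by truncating or right-padding it. The `to_fixed_length` helper and the pad id of 0 are hypothetical and shown only for illustration; belle_preprocess.py may handle long and short samples differently.

```python
from typing import List


def to_fixed_length(token_ids: List[int], seq_length: int = 512, pad_id: int = 0) -> List[int]:
    """Truncate or right-pad a list of token ids to exactly seq_length entries.

    pad_id=0 is a placeholder; the real pad id depends on the tokenizer.
    """
    clipped = token_ids[:seq_length]
    return clipped + [pad_id] * (seq_length - len(clipped))


short_sample = [11, 22, 33]      # shorter than seq_length: gets padded
long_sample = list(range(600))   # longer than seq_length: gets truncated

print(len(to_fixed_length(short_sample)), len(to_fixed_length(long_sample)))
```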

Training

cd example
bash msrun.sh "finetune.py --train_dataset /{path}/belle_512.mindrecord"

Inference

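from mindspore import set_context
from openmind import pipeline

# mode=0 selects MindSpore graph mode; device_id=0 selects the first device.
set_context(mode=0, device_id=0)

# Build a text-generation pipeline that runs on the MindSpore framework.
pipeline_task = pipeline(task="text_generation", model="MindSpore-Lab/baichuan2_7b_base", framework="ms", trust_remote_code=True)
# The prompt asks "Who are you?"; do_sample=False disables sampling.
pipeline_result = pipeline_task("<reserved_106>你是谁？<reserved_107>", do_sample=False)
print(pipeline_result)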
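The prompt in the example wraps the question between `<reserved_106>` and `<reserved_107>`. The sketch below builds such prompts programmatically. It assumes that `<reserved_106>` marks the start of a user turn and `<reserved_107>` marks the start of the model's reply, as the example suggests; the `build_prompt` helper and the way earlier turns are concatenated are illustrative assumptions, not a documented API.

```python
from typing import Sequence, Tuple

USER_TOKEN = "<reserved_106>"       # assumed marker for the start of a user turn
ASSISTANT_TOKEN = "<reserved_107>"  # assumed marker for the start of the reply


def build_prompt(user_message: str, history: Sequence[Tuple[str, str]] = ()) -> str:
    """Concatenate earlier (user, assistant) turns and the new user message.

    The returned prompt ends with ASSISTANT_TOKEN so the model continues
    with its reply.
    """
    parts = []
    for user_turn, assistant_turn in history:
        parts.append(f"{USER_TOKEN}{user_turn}{ASSISTANT_TOKEN}{assistant_turn}")
    parts.append(f"{USER_TOKEN}{user_message}{ASSISTANT_TOKEN}")
    return "".join(parts)


# Reproduces the single-turn prompt used in the inference example.
print(build_prompt("你是谁？"))
```

The resulting string can be passed to `pipeline_task` in place of the hard-coded prompt.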