Optimized the quick start code samples.
Baichuan 2 is the new generation of open-source large language models from Baichuan Intelligence Inc. Trained on a high-quality corpus of 2.6 trillion tokens, it achieves the best performance among models of the same size on authoritative Chinese and English benchmarks. This release includes Base and Chat versions at 7B and 13B, plus 4-bit quantized versions of the Chat models. All versions are fully open to academic research, and developers can use them commercially free of charge after applying by email and obtaining an official commercial license. The available versions and download links are listed below:
| | Base Model | Chat Model | 4-bit Quantized Chat Model |
|---|---|---|---|
| 7B | Baichuan2-7B-Base | Baichuan2-7B-Chat | Baichuan2-7B-Chat-4bits |
| 13B | Baichuan2-13B-Base | Baichuan2-13B-Chat | Baichuan2-13B-Chat-4bits |
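
The naming pattern in the table is regular enough to generate programmatically. The sketch below is illustrative only: the function `model_name` and its argument names are hypothetical, and it encodes just what the table shows, namely that a 4-bit variant is listed only for the Chat models.

```python
SIZES = ("7B", "13B")
VARIANTS = ("Base", "Chat")


def model_name(size: str, variant: str, four_bit: bool = False) -> str:
    """Return a model name following the pattern used in the table above."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if four_bit and variant != "Chat":
        # The table lists 4-bit quantized models for Chat only.
        raise ValueError("4-bit quantized models are listed only for Chat")
    name = f"Baichuan2-{size}-{variant}"
    return f"{name}-4bits" if four_bit else name


if __name__ == "__main__":
    for size in SIZES:
        print(model_name(size, "Base"),
              model_name(size, "Chat"),
              model_name(size, "Chat", four_bit=True))
```

Rejecting `four_bit=True` for Base models keeps the helper from producing names that the table does not contain.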
The models were evaluated extensively on authoritative Chinese and English benchmarks across six domains: general, legal, medical, mathematics, coding, and multilingual translation. For detailed results, see GitHub.

7B model results:

| | C-Eval | MMLU | CMMLU | Gaokao | AGIEval | BBH |
|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot |
| GPT-4 | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
| GPT-3.5 Turbo | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
| LLaMA-7B | 27.10 | 35.10 | 26.75 | 27.81 | 28.17 | 32.38 |
| LLaMA2-7B | 28.90 | 45.73 | 31.38 | 25.97 | 26.53 | 39.16 |
| MPT-7B | 27.15 | 27.93 | 26.00 | 26.54 | 24.83 | 35.20 |
| Falcon-7B | 24.23 | 26.03 | 25.66 | 24.24 | 24.10 | 28.77 |
| ChatGLM2-6B | 50.20 | 45.90 | 49.00 | 49.44 | 45.28 | 31.65 |
| Baichuan-7B | 42.80 | 42.30 | 44.02 | 36.34 | 34.44 | 32.48 |
| Baichuan2-7B-Base | 54.00 | 54.16 | 57.07 | 47.47 | 42.73 | 41.56 |

13B model results:

| | C-Eval | MMLU | CMMLU | Gaokao | AGIEval | BBH |
|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot |
| GPT-4 | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
| GPT-3.5 Turbo | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
| LLaMA-13B | 28.50 | 46.30 | 31.15 | 28.23 | 28.22 | 37.89 |
| LLaMA2-13B | 35.80 | 55.09 | 37.99 | 30.83 | 32.29 | 46.98 |
| Vicuna-13B | 32.80 | 52.00 | 36.28 | 30.11 | 31.55 | 43.04 |
| Chinese-Alpaca-Plus-13B | 38.80 | 43.90 | 33.43 | 34.78 | 35.46 | 28.94 |
| XVERSE-13B | 53.70 | 55.21 | 58.44 | 44.69 | 42.54 | 38.06 |
| Baichuan-13B-Base | 52.40 | 51.60 | 55.30 | 49.69 | 43.20 | 43.01 |
| Baichuan2-13B-Base | 58.10 | 59.17 | 61.97 | 54.33 | 48.17 | 48.78 |
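
One way to read the generational improvement off the tables is to subtract each first-generation Baichuan score from its Baichuan 2 counterpart. The sketch below copies the four relevant rows from the tables above; the helper name `gains` is hypothetical and chosen for illustration.

```python
BENCHMARKS = ("C-Eval", "MMLU", "CMMLU", "Gaokao", "AGIEval", "BBH")

# Scores copied from the 7B and 13B result tables.
SCORES = {
    "Baichuan-7B": (42.80, 42.30, 44.02, 36.34, 34.44, 32.48),
    "Baichuan2-7B-Base": (54.00, 54.16, 57.07, 47.47, 42.73, 41.56),
    "Baichuan-13B-Base": (52.40, 51.60, 55.30, 49.69, 43.20, 43.01),
    "Baichuan2-13B-Base": (58.10, 59.17, 61.97, 54.33, 48.17, 48.78),
}


def gains(old: str, new: str) -> dict:
    """Per-benchmark score difference (new minus old), rounded to 2 decimals."""
    return {
        bench: round(new_score - old_score, 2)
        for bench, old_score, new_score in zip(BENCHMARKS, SCORES[old], SCORES[new])
    }


if __name__ == "__main__":
    for old, new in (("Baichuan-7B", "Baichuan2-7B-Base"),
                     ("Baichuan-13B-Base", "Baichuan2-13B-Base")):
        diff = gains(old, new)
        mean_gain = sum(diff.values()) / len(diff)
        print(f"{new} vs {old}: {diff} (mean {mean_gain:.2f})")
```

These are plain point differences on the reported scores. They say nothing about statistical significance or about differences in evaluation setup.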
In addition to the final Baichuan2-7B-Base model, we release 11 intermediate checkpoints, corresponding to roughly 0.2 to 2.4 trillion training tokens, for the research community (checkpoint download). The figure below shows how these checkpoints perform on the C-Eval, MMLU, and CMMLU benchmarks as training progresses:

We hereby declare that our team has not developed any application based on Baichuan 2 models, whether on iOS, Android, the web, or any other platform. We strongly urge all users not to use Baichuan 2 models for any activity that harms national or social security or violates the law, and not to deploy them in internet services that have not undergone proper security review and registration. We hope all users will follow these principles so that the technology develops in a lawful and compliant environment.
Although we have made every effort to ensure the compliance of the training data, unforeseen issues may still arise because of the complexity of the models and the data. We therefore accept no liability for any problems arising from the use of the Baichuan 2 open-source models, including but not limited to data security issues, public opinion risks, or risks caused by misuse, abuse, or illegal distribution of the models.
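Use of Baichuan 2 models is subject to the Apache 2.0 license and the Baichuan 2 Model Community License Agreement. The models support commercial use. If you plan to use Baichuan 2 models or their derivatives for commercial purposes, please make sure your entity meets the following conditions:
Once the above conditions are met, send the application materials required by the Baichuan 2 Model Community License Agreement to opensource@baichuan-inc.com. After approval, Baichuan will grant you a non-exclusive, worldwide, non-transferable, non-sublicensable, and revocable commercial license.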
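We provide a preprocessing and fine-tuning example based on the belle_chat_ramdon dataset. The dataset can be obtained from the following link:
Run the belle_preprocess.py script to convert the raw data, with the prompt template applied, into MindRecord format. This completes data preprocessing.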
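
Before running the script, it may help to see what a conversion of this kind typically involves: wrapping each conversation in a prompt template, tokenizing it, and padding or truncating to a fixed sequence length. The sketch below is a simplified illustration and not the code in belle_preprocess.py. The `Human:`/`Assistant:` template, the character-level `toy_tokenize`, the padding id `0`, and the `-100` ignore label are all assumptions, and writing the result to a MindRecord file is left out.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumed label for positions excluded from the loss
PAD_ID = 0           # assumed padding token id


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one id per character. A real run would use the model vocabulary."""
    return [ord(ch) for ch in text]


def build_sample(prompt: str, response: str, seq_length: int) -> Dict[str, List[int]]:
    """Turn one prompt/response pair into fixed-length input_ids and labels."""
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)

    # Concatenate, then truncate to the target length.
    input_ids = (prompt_ids + response_ids)[:seq_length]
    # Only response tokens carry labels; prompt positions are ignored.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:seq_length]

    # Pad both sequences up to seq_length.
    pad = seq_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * pad
    labels = labels + [IGNORE_INDEX] * pad
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    # Hypothetical prompt template, used here only for illustration.
    prompt = "Human: Hi\nAssistant: "
    sample = build_sample(prompt, "Hello!", seq_length=32)
    print(len(sample["input_ids"]), len(sample["labels"]))
```

Every sample ends up with exactly `seq_length` entries, which is why the command below passes `--seq_length 512` and names its output `belle_512.mindrecord`.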

```shell
# Script path: example/dataset/belle_preprocess.py
python example/dataset/belle_preprocess.py \
    --input_glob /{path}/belle_chat_ramdon_10k.json \
    --output_file /{path}/belle_512.mindrecord \
    --seq_length 512
```
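```shell
# Parameters
# input_glob: path to the input dataset
# model_file: path to the vocabulary file
# output_file: path and name of the output dataset
# seq_length: sequence length of the generated dataset
```

Start fine-tuning on the preprocessed dataset:

```shell
cd example
bash msrun.sh "finetune.py --train_dataset /{path}/belle_512.mindrecord"
```

Run inference with the openMind pipeline:

```python
from mindspore import set_context
from openmind import pipeline

# mode=0 selects MindSpore graph mode; device_id selects the device to run on.
set_context(mode=0, device_id=0)

pipeline_task = pipeline(
    task="text_generation",
    model='MindSpore-Lab/baichuan2_7b_base',
    framework='ms',
    trust_remote_code=True,
)

# The prompt "你是谁?" means "Who are you?".
pipeline_result = pipeline_task("<reserved_106>你是谁?<reserved_107>", do_sample=False)
print(pipeline_result)
```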
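
The prompt in the inference example wraps the question in two reserved tokens, `<reserved_106>` before the user text and `<reserved_107>` after it. A small helper can build such strings. The sketch below is an illustration under assumptions: the name `build_prompt` is hypothetical, and the multi-turn layout (earlier turns concatenated in the same pattern) is a guess that the source does not specify.

```python
from typing import Sequence, Tuple

USER_TOKEN = "<reserved_106>"       # precedes the user message in the example
ASSISTANT_TOKEN = "<reserved_107>"  # follows it, where the model's reply begins


def build_prompt(user_message: str,
                 history: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a prompt string in the single-turn format shown in the example.

    `history` holds earlier (user, reply) pairs; how the model expects
    multi-turn input is an assumption here.
    """
    parts = []
    for past_user, past_reply in history:
        parts.append(f"{USER_TOKEN}{past_user}{ASSISTANT_TOKEN}{past_reply}")
    parts.append(f"{USER_TOKEN}{user_message}{ASSISTANT_TOKEN}")
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt("你是谁?"))
```

With an empty history, the helper reproduces the exact string passed to `pipeline_task` in the example, so it can be swapped in without changing behavior.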