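Introduction

Aquila-135M is a small bilingual (Chinese and English) language model trained in two stages: pre-training followed by annealing. Pre-training used 1.66TB of bilingual Chinese and English tokens; annealing used a further 100B tokens of curated, high-quality bilingual data, which produced the final model.

Aquila-135M-Instruct is obtained by fine-tuning on Infinity Instruct.

The entire training process was carried out on Triton, using FlagGems together with the FlagScale parallel training framework.

In addition, we have open-sourced all intermediate checkpoints.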

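Datasets

We have open-sourced all bilingual datasets used in the pre-training and annealing stages. The figure below shows the dataset composition and mixing ratios.

[Figure: datasets composition]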

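Evaluation

We follow the evaluation setup of the SmolLM models and evaluate all models with the lighteval tool.

Parameter counts exclude embedding parameters, and Aquila-135M has the same architecture as SmolLM2-135M. Aquila-135M performs on par with SmolLM2-135M on English benchmarks and significantly better on Chinese benchmarks.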

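Among small models with around 400M total parameters or fewer, Aquila-135M has the highest Chinese average in the table below, and its English average is exceeded only by SmolLM-360M and SmolLM2-360M.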

Metric (zero-shot)	Aquila-135M (Triton)	Aquila-135M (CUDA)	SmolLM-135M	SmolLM2-135M	gpt2-medium-360M	TinyMistral-248M	TinyMistral-248M-2.5	OpenELM-270M	Wide-Sheared-LLaMA-290M	opt-350m	MobileLLM-350M	pythia-410m	SmolLM-360M	SmolLM2-360M
HellaSwag	41.19	41.12	41.15	42.10	37.08	27.06	26.80	45.74	24.94	36.08	26.28	39.22	51.73	54.66
ARC (Average)	44.76	44.15	42.34	43.93	34.34	29.71	27.63	35.74	26.20	31.91	27.72	35.14	49.95	53.24
PIQA	66.38	67.52	68.28	68.44	66.38	57.40	53.92	69.75	50.60	64.36	50.27	67.19	71.55	71.98
MMLU (cloze)	31.07	30.67	30.26	31.58	27.75	25.82	25.59	27.89	24.75	26.58	24.86	28.88	34.32	36.09
CommonsenseQA	32.10	31.70	32.02	32.92	31.70	24.57	21.46	35.71	16.54	32.10	17.53	31.45	36.61	38.74
TriviaQA	6.65	7.02	4.24	4.03	2.36	0.50	0.08	1.34	0.00	1.38	0.00	2.06	9.19	16.92
WinoGrande	51.07	51.70	51.22	50.99	49.49	49.25	49.01	52.41	49.72	51.54	49.41	49.96	53.12	52.49
OpenBookQA	34.40	34.40	33.80	34.60	31.40	29.40	27.40	30.60	26.00	27.80	24.80	28.40	37.20	37.00
GSM8K (5-shot)	2.12	2.12	1.00	1.52	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	2.81
SIQA	41.81	42.32	41.15	41.45	41.30	41.86	39.71	42.73	39.76	42.37	37.10	42.02	43.45	41.61
CEval	29.22	29.82	28.28	26.41	25.40	25.38	26.89	26.69	26.37	26.67	25.68	27.97	27.66	28.51
CMMLU	29.48	29.63	26.01	26.66	27.20	26.67	25.57	26.25	26.33	26.93	25.61	26.91	27.06	27.39
Average-English	35.16	35.27	34.55	35.16	32.18	28.56	27.16	34.19	25.85	31.41	25.80	32.43	38.71	40.55
Average-Chinese	29.35	29.73	27.15	26.54	26.30	26.03	26.23	26.47	26.35	26.80	25.65	27.44	27.36	27.95
Average	32.25	32.50	30.85	30.85	29.24	27.29	26.70	30.33	26.10	29.11	25.72	29.94	33.04	34.25

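For the comparison models, evaluation was run in our local environment, so scores may differ slightly from those reported in their papers.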
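
The three summary rows can be checked by hand. The sketch below recomputes them for the Aquila-135M (Triton) column, assuming that Average-English is the plain mean of the ten English benchmarks, Average-Chinese is the mean of CEval and CMMLU, and Average is the mean of those two group averages. That grouping is an inference from the numbers rather than something the evaluation setup states, but it agrees with the reported 35.16, 29.35, and 32.25 to within rounding.

```python
# Aquila-135M (Triton) scores copied from the table above.
english = {
    "HellaSwag": 41.19,
    "ARC (Average)": 44.76,
    "PIQA": 66.38,
    "MMLU (cloze)": 31.07,
    "CommonsenseQA": 32.10,
    "TriviaQA": 6.65,
    "WinoGrande": 51.07,
    "OpenBookQA": 34.40,
    "GSM8K (5-shot)": 2.12,
    "SIQA": 41.81,
}
chinese = {"CEval": 29.22, "CMMLU": 29.48}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_english = mean(english.values())
avg_chinese = mean(chinese.values())
# Assumption: the overall figure averages the two group means,
# not all twelve benchmark scores at once.
avg_overall = mean([avg_english, avg_chinese])

print(f"Average-English: {avg_english:.3f}")
print(f"Average-Chinese: {avg_chinese:.3f}")
print(f"Average:         {avg_overall:.3f}")
```

Averaging the two groups gives the two Chinese benchmarks the same weight as the ten English ones, which is why the overall figure is lower than a mean over all twelve scores would be.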

How to use

Instruct model

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M-Instruct"
device = "cuda"  # use "cpu" to run without a GPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# For multiple GPUs, install accelerate and load with
# `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Chinese prompt: render the chat template, generate, then decode.
messages = [{"role": "user", "content": "什么是引力？"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
## 引力是宇宙中的一个基本力，由多个物体相互作用而产生的。它由能量和质量组成，与引力定律密切相关。

# English prompt: same steps as above.
messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
## Gravity is the force that keeps us on Earth as we orbit it. It pulls objects towards each other with a strength that depends on how far apart they are from each other, and how strong the gravitational pull is. The stronger the object's mass, the greater its gravitational pull.
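
The example above decodes the full output sequence, so the printed text contains the rendered prompt and any special tokens as well as the answer. If only the reply is wanted, one option is to slice off the prompt tokens before decoding. The helper below is a minimal sketch of that idea; `generate_reply` is a hypothetical name, not part of the released code, and it assumes the model is decoder-only, so that `generate` returns the prompt followed by the continuation.

```python
def generate_reply(model, tokenizer, user_message, device="cpu", max_new_tokens=500):
    """Return only the model's reply text for a single user message."""
    messages = [{"role": "user", "content": user_message}]
    # Same prompt construction as in the example above.
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Assumption: outputs[0] is prompt + continuation, so everything past
    # the prompt length is newly generated text.
    prompt_length = inputs.shape[-1]
    new_tokens = outputs[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With the `model`, `tokenizer`, and `device` objects from the example above, `generate_reply(model, tokenizer, "What is gravity?", device)` would return just the answer text.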

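Future plans

We plan to further optimize the composition and mixing ratios of the datasets.
We plan to further explore applications of small-scale models in specific scenarios.

Citation

If you find this project helpful, please cite the following work: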

@misc{aquila-135m,
      title={Aquila-135M: A Bilingual Small Language Model in Chinese and English}, 
      author={BAAI},
      year={},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={}, 
}

