Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (EMNLP'24)

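News

September 20, 2024 - Our paper was accepted to EMNLP'24.
March 12, 2024 - We released Qwen2idae-16x14B-v1.0 on 🤗 HuggingFace. It performs strongly on math and code with 15B activated parameters.
February 7, 2024 - Serp-ai added unsloth support to our parameter-efficient sparsity crafting method for faster, more memory-efficient training, and released a new sparsetral model based on mistral-7B.
January 10, 2024 - Camelidae models are now available on 🤗 HuggingFace.
January 4, 2024 - We released the paper "Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks": https://arxiv.org/abs/2401.02731.
December 22, 2023 - We released the training repository for crafting dense models with the LLaMA architecture into MoE models.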

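Introduction

Camelidae and Qwen2idae models are trained with Parameter-Efficient Sparsity Crafting.

We propose Parameter-Efficient Sparsity Crafting to help dense models learn knowledge from different domains, including code and math. The method performs instruction tuning while making efficient use of the Mixture-of-Experts (MoE) structure.

Specifically, Parameter-Efficient Sparsity Crafting uses parameter-efficient techniques, including QLoRA and Adapter, to perform efficient sparse upcycling.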
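The idea of adapter-based sparse upcycling can be illustrated with a small NumPy sketch. This is not the repository's implementation: it assumes one plausible layout in which all experts share the frozen dense FFN weights, each expert adds only a small bottleneck adapter, and a softmax router picks the top 2 experts per token. The class name `AdapterMoE`, the dimensions, and the routing details are all hypothetical, and QLoRA quantization is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AdapterMoE:
    """Shared dense FFN plus small per-expert adapters with top-k routing (illustrative only)."""

    def __init__(self, d_model=16, d_ff=32, d_adapter=4, n_experts=8, top_k=2):
        self.top_k = top_k
        # Shared FFN weights: a stand-in for the frozen pretrained dense layer.
        self.w_in = rng.normal(0, 0.1, (d_model, d_ff))
        self.w_out = rng.normal(0, 0.1, (d_ff, d_model))
        # Per-expert bottleneck adapters: the only expert-specific parameters.
        self.down = rng.normal(0, 0.1, (n_experts, d_model, d_adapter))
        # Zero-initialised up-projection, so the layer starts out equal to the dense FFN.
        self.up = np.zeros((n_experts, d_adapter, d_model))
        # Router that scores every expert for every token.
        self.router = rng.normal(0, 0.1, (d_model, n_experts))

    def forward(self, x):
        """x has shape (tokens, d_model); returns the same shape."""
        h = np.maximum(x @ self.w_in, 0.0) @ self.w_out  # shared FFN output
        logits = x @ self.router
        top = np.argsort(-logits, axis=-1)[:, : self.top_k]  # chosen experts per token
        gate = softmax(np.take_along_axis(logits, top, axis=-1))  # weights over chosen experts
        out = np.zeros_like(h)
        for slot in range(self.top_k):
            for e in np.unique(top[:, slot]):
                rows = top[:, slot] == e
                # Expert e = shared FFN output plus its residual adapter correction.
                delta = np.maximum(h[rows] @ self.down[e], 0.0) @ self.up[e]
                out[rows] += gate[rows, slot][:, None] * (h[rows] + delta)
        return out


layer = AdapterMoE()
x = rng.normal(size=(5, 16))
y = layer.forward(x)
print(y.shape)  # (5, 16)
```

In this sketch the expert-specific parameters are only the adapter and router matrices, which is why the number of trainable parameters stays small relative to copying the full FFN for every expert.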

Model Lists

Camelidae Series	Download
Camelidae-8x7B	🤗 HuggingFace
Camelidae-8x13B	🤗 HuggingFace
Camelidae-8x34B	🤗 HuggingFace
Camelidae-8x34B-pro	🤗 Coming Soon

Qwen2idae Series	Download
Qwen2idae-16x14B-v1.0	🤗 HuggingFace
Qwen2idae-16x7B-v1.0	🤗 Coming Soon
Qwen2idae-16x1.8B-v1.0	🤗 Coming Soon

Performance

Model	Activated Params	MMLU (5-shot)	GSM8k (5-shot)	MATH (4-shot)	HumanEval (0-shot)	MBPP (4-shot)	HellaSwag (10-shot)
GPT3.5	-	70.0%	57.1%	34.1%	48.1%	-	85.5%
LLaMA2-70B-chat	70B	63.8%	59.3%	10.4%	32.3%	35.6%	84.8%
Camelidae-8x34B-pro	35B	75.7%	79.4%	24.0%	48.8%	43.2%	85.2%
Camelidae-8x34B	35B	75.6%	78.3%	22.6%	43.9%	41.4%	85.3%
SUSChat-34B	34B	76.4%	72.3%	22.0%	11.6%	40.2%	83.9%
Yi-34B-chat	34B	74.8%	67.6%	17.3%	20.1%	41.0%	83.9%
Qwen2idae-16x14B-v1.0	15B	66.7%	77.8%	29.9%	62.8%	48.6%	82.3%
Mixtral-8x7B-instruct	14B	68.7%	71.7%	22.1%	25.6%	40.6%	86.5%
Camelidae-8x13B	13B	54.4%	52.6%	9.8%	30.6%	30.4%	82.5%
LLaMA2-13B-chat	13B	53.9%	37.1%	5.2%	18.9%	27.2%	81.9%
Camelidae-8x7B	7B	48.3%	44.0%	5.8%	18.3%	23.4%	79.2%
LLaMA2-7B-chat	7B	47.2%	26.3%	3.9%	12.2%	17.6%	78.6%

The top three scores on each benchmark, ranked across all models, are set in bold.
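The ranking can be recomputed from the table itself. The sketch below copies the scores from the table above and lists the three best models per benchmark. It assumes that "top three" means the three highest scores in each benchmark column, and that a missing score (shown as "-") is skipped. The helper name `top_k_per_benchmark` is illustrative.

```python
BENCHMARKS = ["MMLU", "GSM8k", "MATH", "HumanEval", "MBPP", "HellaSwag"]

# Scores in percent, copied from the table; None marks a missing entry ("-").
SCORES = {
    "GPT3.5": [70.0, 57.1, 34.1, 48.1, None, 85.5],
    "LLaMA2-70B-chat": [63.8, 59.3, 10.4, 32.3, 35.6, 84.8],
    "Camelidae-8x34B-pro": [75.7, 79.4, 24.0, 48.8, 43.2, 85.2],
    "Camelidae-8x34B": [75.6, 78.3, 22.6, 43.9, 41.4, 85.3],
    "SUSChat-34B": [76.4, 72.3, 22.0, 11.6, 40.2, 83.9],
    "Yi-34B-chat": [74.8, 67.6, 17.3, 20.1, 41.0, 83.9],
    "Qwen2idae-16x14B-v1.0": [66.7, 77.8, 29.9, 62.8, 48.6, 82.3],
    "Mixtral-8x7B-instruct": [68.7, 71.7, 22.1, 25.6, 40.6, 86.5],
    "Camelidae-8x13B": [54.4, 52.6, 9.8, 30.6, 30.4, 82.5],
    "LLaMA2-13B-chat": [53.9, 37.1, 5.2, 18.9, 27.2, 81.9],
    "Camelidae-8x7B": [48.3, 44.0, 5.8, 18.3, 23.4, 79.2],
    "LLaMA2-7B-chat": [47.2, 26.3, 3.9, 12.2, 17.6, 78.6],
}


def top_k_per_benchmark(scores, benchmarks, k=3):
    """Return {benchmark: [best k model names]}, ignoring missing scores."""
    result = {}
    for col, name in enumerate(benchmarks):
        ranked = sorted(
            ((vals[col], model) for model, vals in scores.items() if vals[col] is not None),
            reverse=True,
        )
        result[name] = [model for _, model in ranked[:k]]
    return result


top3 = top_k_per_benchmark(SCORES, BENCHMARKS)
for name in BENCHMARKS:
    print(f"{name}: {', '.join(top3[name])}")
```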

Usage

import argparse
import time

from openmind import AutoModelForCausalLM, AutoTokenizer, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/Camelidae-8x13B",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use an Ascend NPU when one is available; otherwise fall back to CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"

    # The timer covers both model loading and generation.
    start_time = time.time()
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.eval()

    prompt = "Hello, who are you?"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    max_new_tokens = 100
    # Pass max_new_tokens rather than max_length so the limit counts generated
    # tokens only and does not include the prompt.
    outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.3,
        top_k=0,
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(response)
    end_time = time.time()
    print(f"Device: {device}, total time (model load + inference): {end_time - start_time:.2f} s")


if __name__ == "__main__":
    main()

Citation

@article{wu2024parameter,
  title={Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks},
  author={Wu, Haoyuan and Zheng, Haisheng and Yu, Bei},
  journal={arXiv preprint arXiv:2401.02731},
  year={2024}
}

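License

The source code in this repository is licensed under the Apache 2.0 License. Camelidae models are intended for academic research and free commercial use. All usage must comply with the licenses from facebookresearch and 01-ai.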
