MPT-7B-Instruct-8k

MPT-7B-Instruct-8k is a model for long-form instruction following, and is especially suited to question answering and summarization over long documents.

It was built by finetuning MPT-7B-8k on Dolly HHRLHF, a dataset derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets. It was also trained on Competition Math, Duorc, CoT GSM8k, Qasper, Quality, Summ Screen FD and Spider.

This is the same dataset that MPT-30B-Instruct was trained on.

License: Apache 2.0

This model was trained by MosaicML and follows a modified decoder-only transformer architecture.

Model Date

July 18, 2023

Model License

Apache 2.0

Documentation

How to Use

This model is best used with the MosaicML llm-foundry repository for training and finetuning.

import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-instruct-8k',
  trust_remote_code=True
)

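Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom MPT model architecture that is not yet part of the Hugging Face transformers package. MPT includes options for many training efficiency features such as FlashAttention, ALiBi, QK LayerNorm, and more.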

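To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0) with attn_impl='triton' and bfloat16 precision. The script below takes a different route: it is an openmind variant that loads the model on an NPU (npu:0) when one is available and falls back to CPU otherwise, and it does not set attn_impl: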

from openmind import AutoTokenizer, AutoModelForCausalLM, is_torch_npu_available
import argparse
import time

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/mpt-7b-8k-instruct",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
    start_time = time.time()  # the timer covers model loading as well as generation
    model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.eval()

    prompt = "Hello, who are you?"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(input_ids=input_ids, max_length=100)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(response)
    end_time = time.time()
    print(f"Device: {device}, load + inference time: {end_time - start_time:.2f} s")

if __name__ == "__main__":
    main()
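
The triton setting mentioned before the openmind script can be sketched with the Hugging Face transformers API. This is a sketch under stated assumptions, not a tested recipe: it assumes the checkpoint's remote code exposes attn_config['attn_impl'] and init_device on its config, and that a CUDA GPU and the triton dependencies are installed. The names TRITON_OVERRIDES and load_mpt_triton are hypothetical and exist only for illustration.

```python
# Hypothetical helper names; the config fields are assumed to match the
# MPT remote code shipped with the checkpoint.
MODEL_NAME = 'mosaicml/mpt-7b-instruct-8k'

TRITON_OVERRIDES = {
    'attn_impl': 'triton',    # request the triton FlashAttention kernel
    'init_device': 'cuda:0',  # initialize weights directly on the GPU
}


def load_mpt_triton(name=MODEL_NAME, overrides=TRITON_OVERRIDES):
    """Load the model in bfloat16 with triton attention (needs a CUDA GPU)."""
    # Imported lazily so the settings above can be inspected without torch.
    import torch
    import transformers

    config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
    config.attn_config['attn_impl'] = overrides['attn_impl']
    config.init_device = overrides['init_device']

    return transformers.AutoModelForCausalLM.from_pretrained(
        name,
        config=config,
        torch_dtype=torch.bfloat16,  # load the weights in bfloat16
        trust_remote_code=True,
    )
```

Calling load_mpt_triton() downloads the checkpoint and its remote code, so it is best run only on a machine that meets the assumptions above.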

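The model was initially trained with a sequence length of 2048, followed by an additional pretraining stage that adapted it to sequence lengths of up to 8192. However, ALiBi lets users increase the maximum sequence length even further during finetuning and/or inference. For example: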

import transformers

name = 'mosaicml/mpt-7b-instruct-8k'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 16384 # (input + output) tokens can now be up to 16384

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True
)
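
Because max_seq_len bounds input and output tokens together, it can help to check the budget before calling generate. The helpers below are a minimal sketch of that arithmetic; fits_context and max_new_tokens_for are hypothetical names, not part of transformers or the MPT code.

```python
def fits_context(n_prompt_tokens, max_new_tokens, max_seq_len=16384):
    """Return True if prompt plus generated tokens stay within max_seq_len."""
    if n_prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return n_prompt_tokens + max_new_tokens <= max_seq_len


def max_new_tokens_for(n_prompt_tokens, max_seq_len=16384):
    """Largest generation budget left after the prompt (never negative)."""
    return max(0, max_seq_len - n_prompt_tokens)
```

In practice n_prompt_tokens would come from the tokenizer output, for example the length of input_ids for the prompt.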

This model was trained with the MPT-7B-chat tokenizer, which is based on the EleutherAI/gpt-neox-20b tokenizer and includes additional ChatML tokens.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('mosaicml/mpt-7b-8k')

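The model can then be used, for example, within a text-generation pipeline.
Note: when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.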

import torch
from transformers import pipeline

with torch.autocast('cuda', dtype=torch.bfloat16):
    inputs = tokenizer('Here is a recipe for vegan banana bread:\n', return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# or using the HF pipeline
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Here is a recipe for vegan banana bread:\n',
            max_new_tokens=100,
            do_sample=True,
            use_cache=True))

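Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:

It uses FlashAttention
It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings
It does not use biases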
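
To make the ALiBi item concrete, the sketch below computes per-head slopes and the resulting attention bias in plain Python. It follows the formulation in the ALiBi paper for power-of-two head counts and is not taken from the MPT source, whose implementation may differ in detail. The function names are hypothetical.

```python
import math


def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8*h/n_heads) for h = 1..n_heads (power-of-two n_heads)."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]


def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (q - k) if k <= q else -math.inf for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

The bias depends only on the distance between query and key positions, so the same function applies to any seq_len. That property is what allows the maximum sequence length to be raised without learned positional embeddings.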

Hyperparameter	Value
n_parameters	6.7B
n_layers	32
n_heads	32
d_model	4096
vocab size	50432
sequence length	2048
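
As a sanity check on the table, the parameter count can be estimated from the other hyperparameters. The sketch below assumes an MLP expansion ratio of 4 and input embeddings shared with the output layer. Neither is stated in this document, so treat both as assumptions. The function name estimate_params is hypothetical.

```python
def estimate_params(n_layers=32, d_model=4096, vocab_size=50432, expansion=4):
    """Rough parameter count for a bias-free decoder-only transformer."""
    attn = 4 * d_model * d_model             # fused QKV (3*d^2) plus output projection (d^2)
    mlp = 2 * expansion * d_model * d_model  # up and down projections
    norms = 2 * d_model                      # two LayerNorm weight vectors per layer, no bias
    per_layer = attn + mlp + norms
    embeddings = vocab_size * d_model        # assumed shared with the output layer
    final_norm = d_model
    return n_layers * per_layer + embeddings + final_norm


if __name__ == '__main__':
    print(f"{estimate_params() / 1e9:.2f}B parameters")
```

With the table's values the estimate comes to roughly 6.65 billion parameters, close to the listed 6.7B.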

Data Mix

The model was trained on the following data mix:

Data Source	Number of Tokens in Source	Proportion
competition_math	1.6 M	3.66%
cot_gsm8k	3.36 M	7.67%
dialogsum	0.1 M	0.23%
dolly_hhrlhf	5.89 M	13.43%
duorc	7.8 M	17.80%
qasper	8.72 M	19.90%
quality	11.29 M	25.78%
scrolls/summ_screen_fd	4.97 M	11.33%
spider	0.089 M	0.20%
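
The proportions in the table can be recomputed from the token counts. The sketch below copies both columns from the table and checks that they agree; small differences are expected because the token counts are rounded. The names TOKENS_M, LISTED_PCT and proportions are hypothetical.

```python
# Token counts in millions, copied from the data mix table.
TOKENS_M = {
    'competition_math': 1.6,
    'cot_gsm8k': 3.36,
    'dialogsum': 0.1,
    'dolly_hhrlhf': 5.89,
    'duorc': 7.8,
    'qasper': 8.72,
    'quality': 11.29,
    'scrolls/summ_screen_fd': 4.97,
    'spider': 0.089,
}

# Proportions in percent, copied from the same table.
LISTED_PCT = {
    'competition_math': 3.66,
    'cot_gsm8k': 7.67,
    'dialogsum': 0.23,
    'dolly_hhrlhf': 13.43,
    'duorc': 17.80,
    'qasper': 19.90,
    'quality': 25.78,
    'scrolls/summ_screen_fd': 11.33,
    'spider': 0.20,
}


def proportions(tokens):
    """Return each source's share of the total token count, in percent."""
    total = sum(tokens.values())
    return {name: 100.0 * count / total for name, count in tokens.items()}


if __name__ == '__main__':
    for name, pct in proportions(TOKENS_M).items():
        print(f"{name}: computed {pct:.2f}%, listed {LISTED_PCT[name]:.2f}%")
```

The listed counts sum to about 43.8 M tokens, and every recomputed share falls within 0.05 percentage points of the listed value.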

Training Configuration

This model was trained on 8 80GB A100s for about 6.3 hours using the MosaicML Platform. Training used sharded data parallelism via FSDP and the AdamW optimizer.

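Limitations and Biases

The following language is modified from EleutherAI's GPT-NeoX-20B

MPT-7B-Instruct-8k can produce factually incorrect output and should not be relied on to produce factually accurate information. MPT-7B-Instruct-8k was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.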

Acknowledgements

This model was finetuned by the MosaicML NLP team.

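Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.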

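MosaicML Platform

If you are interested in training and deploying your own MPT models or other large language models (LLMs) on the MosaicML Platform, you can sign up with MosaicML.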

Citation

Please cite this model using the following format:

@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-30B: Raising the bar for open-source foundation models},
    year      = {2023},
    url       = {www.mosaicml.com/blog/mpt-30b},
    note      = {Accessed: 2023-06-22},
    urldate   = {2023-06-22}
}