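neural-chat-7b-v3-2: A 7B-parameter large language model developed by Intel. It builds on Mistral-7B-v0.1, is fine-tuned on the MetaMathQA dataset, and is aligned with DPO. It supports FP32, BF16, and INT4 inference, is strong at solving math problems, and can reason step by step. (This summary was AI-generated.)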

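Model Details: Neural-Chat-v3-2

This model is a 7B-parameter large language model fine-tuned from Intel/neural-chat-7b-v3-1 on the meta-math/MetaMathQA dataset using Intel Gaudi 2 processors. It was aligned with the Direct Preference Optimization (DPO) method using the Intel/orca_dpo_pairs dataset. Intel/neural-chat-7b-v3-1 was itself fine-tuned from mistralai/Mistral-7B-v0.1. For more technical details, see the Medium article "The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2".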
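
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It illustrates the general objective only and is not the training code used for this model. The function name `dpo_loss`, the `beta=0.1` default, and the sample log-probabilities are assumptions made for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the policy already prefers the chosen answer
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
# No preference relative to the reference model gives log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

The loss falls below log(2) once the policy ranks the chosen response above the rejected one by more than the reference model does, which is the behavior the alignment step optimizes for.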

Photo by Google DeepMind on Unsplash

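Model Details	Description
Model Authors - Company	Intel NeuralChat team (members from DCAI/AISE/AIPT). Core members: Kaokao Lv, Liang Lv, Chang Wang, Wenxin Zhang, Xuhui Ren, and Haihao Shen.
Date	December 2023
Version	v3-2
Type	7B-parameter large language model
Paper or Other Resources	Medium blog post
License	Apache 2.0
Questions or Comments	Community tab or the Intel developer Discord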

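Intended Use	Description
Primary intended uses	The model can be used for a range of language-related tasks. See the LLM leaderboard for its performance.
Primary intended users	Developers who need to run inference on language tasks
Out-of-scope uses	In most cases the model needs to be fine-tuned for the specific task. It must not be used to create hostile or alienating environments.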

How to Use

The model's context length is 8192 tokens (the same as mistralai/Mistral-7B-v0.1).
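
Since the prompt and the generated tokens share that 8192-token window, it can help to check the budget before calling `generate`. The following is a minimal sketch; the helper names are hypothetical, and in practice the prompt token count would come from something like `len(tokenizer(prompt).input_ids)`.

```python
CONTEXT_LENGTH = 8192  # context length stated in this model card


def fits_in_context(prompt_tokens, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Return True if prompt plus generated tokens stay within the window."""
    return prompt_tokens + max_new_tokens <= context_length


def max_new_tokens_allowed(prompt_tokens, context_length=CONTEXT_LENGTH):
    """Largest generation budget left after the prompt (never negative)."""
    return max(context_length - prompt_tokens, 0)


# Hypothetical prompt of 8000 tokens
print(fits_in_context(8000, 300))
print(max_new_tokens_allowed(8000))
```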

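Reproduce the Model

Sample code for reproducing the model is available in the GitHub sample code. The steps for reproducing the model build are as follows: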

git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers

docker build --no-cache ./ --target hpu --build-arg REPO=https://github.com/intel/intel-extension-for-transformers.git --build-arg ITREX_VER=main -f ./intel_extension_for_transformers/neural_chat/docker/Dockerfile -t chatbot_finetuning:latest

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest

# after entering docker container
cd examples/finetuning/finetune_neuralchat_v3

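We use the latest pretrained model, mistralai/Mistral-7B-v0.1, and the open-source dataset Open-Orca/SlimOrca for the experiment.

The following script launches training on 8 Gaudi2 cards using DeepSpeed ZeRO stage 2. In finetune_neuralchat_v3.py, the defaults use_habana=True, use_lazy_mode=True, device="hpu" target Gaudi2. To run on NVIDIA GPUs instead, set use_habana=False, use_lazy_mode=False, device="auto".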

deepspeed --include localhost:0,1,2,3,4,5,6,7 \
    --master_port 29501 \
    finetune_neuralchat_v3.py
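
The hardware switch described above can be summarized as a small lookup. This is a sketch only: the helper `training_device_settings` and the target names `"gaudi2"` and `"nvidia"` are hypothetical, and how finetune_neuralchat_v3.py actually consumes these three settings is not shown in this document.

```python
def training_device_settings(target):
    """Return the three settings named in this section for a hardware target."""
    if target == "gaudi2":
        # Defaults described for Gaudi2
        return {"use_habana": True, "use_lazy_mode": True, "device": "hpu"}
    if target == "nvidia":
        # Values described for NVIDIA GPUs
        return {"use_habana": False, "use_lazy_mode": False, "device": "auto"}
    raise ValueError(f"unknown target: {target!r}")


print(training_device_settings("nvidia"))
```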

Merge the LoRA weights:

python apply_lora.py \
    --base-model-path mistralai/Mistral-7B-v0.1 \
    --lora-model-path finetuned_model/ \
    --output-path finetuned_model_lora
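
The internals of apply_lora.py are not shown here, but the general arithmetic of a LoRA merge is simple: each adapted weight matrix W becomes W + (alpha / r) * B A, where A and B are the low-rank adapter matrices. The sketch below illustrates that formula on a tiny example in plain Python; the function names and the toy matrices are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(base_weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for a single weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape matches base_weight
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base_weight, delta)]


# Toy 2x2 weight with a rank-1 adapter
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.0]]        # shape (rank, in_features)
lora_b = [[1.0], [2.0]]      # shape (out_features, rank)
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged output can be loaded like an ordinary checkpoint.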

Use the Model

FP32 Inference with Transformers

import transformers


model_name = 'Intel/neural-chat-7b-v3-2'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.
"""

BF16 Inference with Intel Extension for Transformers and Intel Extension for PyTorch

from transformers import AutoTokenizer, TextStreamer
import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import intel_extension_for_pytorch as ipex

model_name = "Intel/neural-chat-7b-v3-2"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = ipex.optimize(model.eval(), dtype=torch.bfloat16, inplace=True, level="O1", auto_kernel_selection=True)

outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

INT4 Inference with Transformers and Intel Extension for Transformers

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
model_name = "Intel/neural-chat-7b-v3-2"

# For INT8, set weight_dtype="int8" instead
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int4")
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
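
To put the three precisions in perspective, the sketch below estimates the memory needed for the weights alone at each weight data type. It assumes a nominal 7 billion parameters (the exact count is not stated here) and ignores quantization scales, activations, and the KV cache, so real usage will be higher. The helper name is hypothetical.

```python
BITS_PER_WEIGHT = {"fp32": 32, "bf16": 16, "int8": 8, "int4": 4}

# Assumption: nominal "7B" parameter count, not the exact figure
NOMINAL_PARAMS = 7_000_000_000


def weight_memory_gb(num_params, weight_dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[weight_dtype]
    return num_params * bits / 8 / 1e9


for dtype in ("fp32", "bf16", "int8", "int4"):
    print(f"{dtype}: ~{weight_memory_gb(NOMINAL_PARAMS, dtype):.1f} GB")
```

Under these assumptions the weights take roughly 28 GB in FP32, 14 GB in BF16, and 3.5 GB in INT4.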

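Factors	Description
Groups	More details about the dataset and annotations can be found at meta-math/MetaMathQA, the project page https://meta-math.github.io/, and the associated paper https://arxiv.org/abs/2309.12284.
Input sensitivity	Model performance varies with the input. Prompt design can significantly affect the language model's predictions.
Training environment	The model was trained on Intel Gaudi 2 processors (8 cards).
Hardware caveats	Deploying the model on different hardware and software changes its performance. The evaluation metrics come from the Hugging Face LLM leaderboard: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K (see Quantitative Analyses below).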

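Metrics	Description
Model performance measures	The model is compared against other LLMs using the LLM leaderboard metrics, which have become an industry standard for measuring LLM performance.
Decision thresholds	No decision thresholds were used.
Approaches to uncertainty and variability	-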

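Training and Evaluation Data	Description
Datasets	The training data comes from meta-math/MetaMathQA, which is augmented from the GSM8k and MATH training sets. GSM8k test-set data was excluded from training, so there is no contamination from that test set.
Motivation	-
Preprocessing	-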

Quantitative Analyses

The Open LLM Leaderboard results can be found at https://huggingface.co/datasets/open-llm-leaderboard/details_Intel__neural-chat-7b-v3-2. The metrics are as follows:

Metric	Value
Avg.	68.29
ARC (25-shot)	67.49
HellaSwag (10-shot)	83.92
MMLU (5-shot)	63.55
TruthfulQA (0-shot)	59.68
Winogrande (5-shot)	79.95
GSM8K (5-shot)	55.12
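
The reported average is consistent with the arithmetic mean of the six benchmark scores, which the short check below reproduces. The scores are copied from the table; the dictionary layout is just a convenience for this example.

```python
from statistics import mean

# Scores copied from the table above
scores = {
    "ARC (25-shot)": 67.49,
    "HellaSwag (10-shot)": 83.92,
    "MMLU (5-shot)": 63.55,
    "TruthfulQA (0-shot)": 59.68,
    "Winogrande (5-shot)": 79.95,
    "GSM8K (5-shot)": 55.12,
}

average = mean(scores.values())
print(f"mean of six benchmarks: {average:.3f}")
```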

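Ethical Considerations and Limitations

Neural-chat-7b-v3-2 can produce factually incorrect output and should not be relied on for accurate information. Because of the limitations of the pretrained model and the fine-tuning datasets, the model may generate lewd, biased, or otherwise offensive content.

Developers should therefore perform safety testing before deploying any application built on neural-chat-7b-v3-2.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.

The following links provide more information about Intel's AI software:

Intel Neural Compressor link
Intel Extension for Transformers link

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.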