Advancing Open-source Language Models with Mixed-Quality Data

Sponsored by RunPod

OPENCHAT3.5 1210
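🏆 The best-performing open-source 7B model overall 🏆
🤖 Outperforms ChatGPT (March) and Grok-1 🤖
🚀 15-point improvement in coding over OpenChat-3.5 🚀

New Features
💡 Two modes: Coding + Generalist, and Mathematical Reasoning 💡
🧑‍⚖️ Experimental support for Evaluator and Feedback capabilities 🧑‍⚖️

Usage in Openmind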

import argparse
import time

import torch
from openmind import pipeline, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/openchat-3.5-1210-openmind",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use the first NPU when one is available, otherwise fall back to CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"

    # The timer starts before the pipeline is built, so it covers model loading as well as generation.
    start_time = time.time()

    pipe = pipeline("text-generation", model=model_path, torch_dtype=torch.bfloat16, device_map=device)

    # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    print(outputs[0]["generated_text"])

    end_time = time.time()
    print(f"Device: {device}, elapsed time (model load + inference): {end_time - start_time:.2f} s")


if __name__ == "__main__":
    main()

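Usage

To use this model, we highly recommend installing the OpenChat package by following the installation guide in our repository, then running the OpenChat OpenAI-compatible API server with the serving command from the table below. The server is optimized for high-throughput deployment using vLLM and can run on a consumer GPU with 24 GB of memory. To enable tensor parallelism, append --tensor-parallel-size N to the serving command.

Once started, the server listens at localhost:18888 for requests and is compatible with the OpenAI ChatCompletion API specification. Refer to the example requests below. You can also use the OpenChat Web UI for a more user-friendly experience.

If you want to deploy the server as an online service, use --api-keys sk-KEY1 sk-KEY2 ... to specify the allowed API keys, and --disable-log-requests --disable-log-stats --log-file openchat.log to log only to a file. For security, we recommend putting an HTTPS gateway in front of the server.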
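The options described above can be combined into one serving command. The following is a minimal sketch that only assembles the argument list; it uses the module path and flags quoted in this section, while the helper name `serve_command` and its parameters are hypothetical and not part of the OpenChat package.

```python
import shlex


def serve_command(
    model="openchat/openchat-3.5-1210",
    tensor_parallel_size=None,
    api_keys=(),
    log_file=None,
):
    """Assemble the serving command as an argument list (hypothetical helper)."""
    cmd = [
        "python", "-m", "ochat.serving.openai_api_server",
        "--model", model,
        "--engine-use-ray", "--worker-use-ray",
    ]
    if tensor_parallel_size is not None:
        # Tensor parallelism across N GPUs.
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if api_keys:
        # Restrict access to the listed API keys.
        cmd += ["--api-keys", *api_keys]
    if log_file is not None:
        # Log only to a file, as suggested for online deployments.
        cmd += ["--disable-log-requests", "--disable-log-stats", "--log-file", log_file]
    return cmd


print(shlex.join(serve_command(tensor_parallel_size=2)))
# python -m ochat.serving.openai_api_server --model openchat/openchat-3.5-1210 --engine-use-ray --worker-use-ray --tensor-parallel-size 2
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell-quoting issues with the API keys.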

Model	Size	Context	Weights	Serving
OpenChat 3.5 1210	7B	8192	Huggingface	`python -m ochat.serving.openai_api_server --model openchat/openchat-3.5-1210 --engine-use-ray --worker-use-ray`

Example requests

💡 Default mode (GPT4 Correct): best for coding, chat, and general tasks

curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openchat_3.5",
    "messages": [{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}]
  }'

🧮 Mathematical reasoning mode: tailored for solving math problems

curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openchat_3.5",
    "condition": "Math Correct",
    "messages": [{"role": "user", "content": "10.3 − 7988.8133 = "}]
  }'
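The two curl requests above can also be issued from Python. This is a minimal sketch using only the standard library; it assumes the server is running at the address given earlier and that the response follows the OpenAI ChatCompletion layout (`choices[0].message.content`). The helper names `build_request` and `chat` are illustrative, not part of OpenChat.

```python
import json
import urllib.request

API_URL = "http://localhost:18888/v1/chat/completions"  # address from the text above


def build_request(content, condition=None, model="openchat_3.5", url=API_URL):
    """Build the POST request; `condition` selects a non-default mode."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    if condition is not None:
        payload["condition"] = condition  # e.g. "Math Correct"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content, condition=None, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_request(content, condition)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("10.3 − 7988.8133 = ", condition="Math Correct")` sends the same payload as the second curl example, and omitting `condition` reproduces the default-mode request.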

Conversation templates

💡 Default mode (GPT4 Correct): best for coding, chat, and general tasks

GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:

🧮 Mathematical reasoning mode: tailored for solving math problems

Math Correct User: 10.3 − 7988.8133=<|end_of_turn|>Math Correct Assistant:

⚠️ Note: remember to set <|end_of_turn|> as the end-of-generation token.
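For clients that assemble the prompt string themselves, both templates above follow the same pattern: a mode prefix, a role name, the content, and the end-of-turn marker. The sketch below reproduces them; the function names are illustrative, and only the `user` and `assistant` roles are handled because those are the only ones the templates show.

```python
END_OF_TURN = "<|end_of_turn|>"
ROLE_NAMES = {"user": "User", "assistant": "Assistant"}


def build_prompt(messages, condition="GPT4 Correct"):
    """Render chat messages in the OpenChat template for the given mode."""
    turns = [
        f"{condition} {ROLE_NAMES[m['role']]}: {m['content']}{END_OF_TURN}"
        for m in messages
    ]
    # The trailing, unterminated assistant prefix is the generation prompt.
    return "".join(turns) + f"{condition} Assistant:"


def cut_at_end_of_turn(generated):
    """Client-side fallback: keep only the text before the first end-of-turn marker."""
    return generated.split(END_OF_TURN, 1)[0]


print(build_prompt([{"role": "user", "content": "Hello"}]))
# GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:
```

Passing `condition="Math Correct"` yields the mathematical reasoning template. `cut_at_end_of_turn` is only a safeguard for backends where the stop token cannot be configured; setting the end-of-generation token remains the preferred approach.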

The default (GPT4 Correct) template is also available as the integrated tokenizer.chat_template, so it can be used without specifying the template manually:

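from transformers import AutoTokenizer

# Load the tokenizer that ships with the model (model ID taken from the serving table).
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-1210")

messages = [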
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"}
]
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747, 15359, 32000, 420, 6316, 28781, 3198, 3123, 1247, 28747, 1602, 460, 368, 3154, 28804, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]

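(Experimental) Evaluator / Feedback Capabilities

This release includes evaluator capabilities, aimed at advancing open-source models as evaluators. You can use `Default Mode (GPT4 Correct)` with the following prompt (the same as [Prometheus](https://huggingface.co/datasets/kaist-ai/Feedback-Collection)) to evaluate a response.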

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:
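The prompt above can be filled in and its output parsed programmatically. The sketch below is one possible way to do so: the template text is copied from this section, while `build_eval_prompt`, `parse_feedback`, and the regular expression are illustrative assumptions about how a client might handle the `Feedback: ... [RESULT] n` format.

```python
import re

EVAL_TEMPLATE = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:"""

# Optional "Feedback:" prefix, free-form feedback, then "[RESULT]" and a score from 1 to 5.
RESULT_RE = re.compile(
    r"^(?:Feedback:\s*)?(?P<feedback>.*?)\s*\[RESULT\]\s*(?P<score>[1-5])\b",
    re.DOTALL,
)


def build_eval_prompt(**fields):
    """Fill every {orig_*} placeholder; raises KeyError if one is missing."""
    return EVAL_TEMPLATE.format(**fields)


def parse_feedback(output):
    """Return (feedback, score), or None when the output does not match the format."""
    match = RESULT_RE.search(output.strip())
    if match is None:
        return None
    return match.group("feedback"), int(match.group("score"))
```

The rendered prompt is sent as the user message in Default Mode. Returning `None` on malformed output lets the caller decide whether to retry or discard the sample.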

Benchmarks

Model	# Params	Average	MT-Bench	HumanEval	BBH MC	AGIEval	TruthfulQA	MMLU	GSM8K	BBH CoT
OpenChat-3.5-1210	7B	63.8	7.76	68.9	49.5	48.0	61.8	65.3	77.3	61.8
OpenChat-3.5	7B	61.6	7.81	55.5	47.6	47.4	59.1	64.3	77.3	63.5
ChatGPT (March)*	Unknown	61.5	7.94	48.1	47.6	47.1	57.7	67.3	74.9	70.1

OpenHermes 2.5	7B	59.3	7.54	48.2	49.4	46.5	57.5	63.8	73.5	59.9
OpenOrca Mistral	7B	52.7	6.86	38.4	49.4	42.9	45.9	59.3	59.1	58.1
Zephyr-β^	7B	34.6	7.34	22.0	40.6	39.0	40.8	39.8	5.1	16.0
Mistral	7B	-	6.84	30.5	39.0	38.0	-	60.1	52.2	-

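Evaluation details

*: ChatGPT (March) results are from the [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774), [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub), and our evaluation. Note that ChatGPT is not a fixed baseline and evolves rapidly over time.

^: Zephyr-β often fails to follow few-shot CoT instructions, likely because it was aligned only on chat data and not trained on few-shot data.

**: Mistral and open-source SOTA results are taken from the results reported in instruction-tuned model papers and official repositories.

All models are evaluated in chat mode (e.g. with the respective conversation template applied). All zero-shot benchmarks follow the same setting as in the AGIEval paper and the Orca paper. CoT tasks use the same configuration as Chain-of-Thought Hub, HumanEval is evaluated with EvalPlus, and MT-bench is run using FastChat. To reproduce our results, follow the instructions in our repository.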
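The section does not state how the Average column of the benchmark table is computed. The sketch below checks one plausible reading, inferred from the numbers rather than taken from the authors: the unweighted mean of the eight scores, with MT-Bench (a 1 to 10 scale) multiplied by 10 so that it is comparable with the percentage metrics. Under that assumption, every reported average is reproduced to within rounding.

```python
# model: (reported average, [MT-Bench, HumanEval, BBH MC, AGIEval, TruthfulQA, MMLU, GSM8K, BBH CoT])
ROWS = {
    "OpenChat-3.5-1210": (63.8, [7.76, 68.9, 49.5, 48.0, 61.8, 65.3, 77.3, 61.8]),
    "OpenChat-3.5": (61.6, [7.81, 55.5, 47.6, 47.4, 59.1, 64.3, 77.3, 63.5]),
    "ChatGPT (March)": (61.5, [7.94, 48.1, 47.6, 47.1, 57.7, 67.3, 74.9, 70.1]),
    "OpenHermes 2.5": (59.3, [7.54, 48.2, 49.4, 46.5, 57.5, 63.8, 73.5, 59.9]),
    "OpenOrca Mistral": (52.7, [6.86, 38.4, 49.4, 42.9, 45.9, 59.3, 59.1, 58.1]),
    "Zephyr-β": (34.6, [7.34, 22.0, 40.6, 39.0, 40.8, 39.8, 5.1, 16.0]),
}


def average(scores):
    """Mean of all scores after rescaling MT-Bench to a 0-100 range (assumed convention)."""
    mt_bench, *rest = scores
    return (mt_bench * 10 + sum(rest)) / (len(rest) + 1)


for model, (reported, scores) in ROWS.items():
    print(f"{model}: computed {average(scores):.3f}, reported {reported}")
```

The Mistral row is left out because it has missing entries and no reported average.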

HumanEval+

Model	Size	HumanEval+ pass@1
ChatGPT (December 12, 2023)	-	64.6
WizardCoder-Python-34B-V1.0	34B	64.6
OpenChat 3.5 (Dec 10)	7B	63.4
OpenHermes 2.5	7B	41.5
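The table reports pass@1 without saying how many samples were drawn per problem. As a reference point, the sketch below implements the widely used unbiased pass@k estimator for a single problem (n samples, c of them correct); whether the evaluation here used it is an assumption. With one sample per problem, pass@1 reduces to the fraction of problems solved, and 104 of HumanEval's 164 problems would give the 63.4 shown for OpenChat 3.5; that match is an observation, not a figure from the source.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One sample per problem: the benchmark score is the mean of 0/1 outcomes.
solved, total = 104, 164  # illustrative counts, not reported by the source
print(f"{100 * solved / total:.1f}")  # 63.4
```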

OpenChat-3.5-1210 vs. Grok

Model	License	# Params	Average	MMLU	HumanEval	MATH	GSM8k
OpenChat 3.5 1210	Apache-2.0	7B	60.1	65.3	68.9	28.9	77.3
OpenChat 3.5	Apache-2.0	7B	56.4	64.3	55.5	28.6	77.3
Grok-0	Proprietary	33B	44.5	65.7	39.7	15.7	56.8
Grok-1	Proprietary	Unknown	55.8	73	63.2	23.9	62.9

*: Grok results are reported by X.AI.

Chinese Evaluations

⚠️ Note that this model was not explicitly trained in Chinese (only < 0.1% of the data is in Chinese).

Multi-Level Multi-Discipline Chinese Evaluation Suite (CEVAL)

Model	Avg	STEM	Social Science	Humanities	Others
ChatGPT	54.4	52.9	61.8	50.9	53.6
OpenChat	47.29	45.22	52.49	48.52	45.08

Chinese Massive Multitask Language Understanding (CMMLU, 5-shot)

Model	STEM	Humanities	Social Sciences	Other	China-specific	Avg
ChatGPT	47.81	55.68	56.5	62.66	50.69	55.51
OpenChat	38.7	45.99	48.32	50.23	43.27	45.85

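Limitations

Foundation Model Limitations
Despite its advanced capabilities, OpenChat is still bound by the limitations inherent in its foundation models. These limitations may affect the model's performance in areas such as:

Complex reasoning
Mathematical and arithmetic tasks
Programming and coding challenges

Hallucination of Non-existent Information
OpenChat may sometimes generate information that does not exist or is not accurate, also known as "hallucination". Users should be aware of this possibility and verify any critical information obtained from the model.

Safety
OpenChat may sometimes generate harmful content, hate speech, or biased responses, or answer unsafe questions. It is crucial to apply additional AI safety measures in use cases that require safe and moderated responses.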

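License

Our OpenChat 3.5 code and models are distributed under the Apache License 2.0.

Dataset Details

OpenChat 3.5 was trained with C-RLFT on a collection of publicly available high-quality instruction data, with a custom processing pipeline. We detail some notable subsets included here:

Citation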

@article{wang2023openchat,
  title={OpenChat: Advancing Open-source Language Models with Mixed-Quality Data},
  author={Wang, Guan and Cheng, Sijie and Zhan, Xianyuan and Li, Xiangang and Song, Sen and Liu, Yang},
  journal={arXiv preprint arXiv:2309.11235},
  year={2023}
}

💌 Contact

We look forward to hearing from you and collaborating on this exciting project!

Project leads:

Guan Wang [imonenext at gmail dot com]
Alpay Ariyak [aariyak at wpi dot edu]
