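Banner image: a mythological-style depiction of Notus, the god of the south wind. Powerful swirling winds convey the south wind's warm, humid character, and several paper airplanes glide through the scene, carried by Notus's gentle yet forceful gusts. The background blends warm tones, symbolizing the heat of the south, with touches of blue and green that represent the moisture the wind carries. The overall mood is one of dynamic motion and warmth.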

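Model Card for Notus 7B v1

Notus is a collection of models fine-tuned with Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO on top of zephyr-7b-sft-full, the SFT model that was used to create zephyr-7b-beta.

Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO.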

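Specifically, when we started building distilabel, we invested time in understanding and deep-diving into the UltraFeedback dataset. Using Argilla, we found data issues in the original UltraFeedback dataset that led to high scores for low-quality responses (more details in the Training Data section). After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique overall_score, and verified the new dataset with Argilla.

Using preference ratings instead of critique scores produced a new dataset in which the chosen response differs in about 50% of cases. With this new dataset we fine-tuned Notus via DPO, obtaining a 7B model that surpasses Zephyr-7B-beta and Claude 2 on AlpacaEval.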
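To make the difference between the two binarization strategies concrete, here is a minimal sketch. It assumes a hypothetical record layout (the keys "text", "ratings", and "overall_score" are illustrative, not the dataset's actual schema) and assumes the rejected response is drawn at random from the lower-rated candidates; neither detail is confirmed by this card.

```python
import random
from statistics import mean


def binarize(responses, rng=None):
    """Pick (chosen, rejected) using the mean of the per-aspect preference ratings.

    Each response is a dict with hypothetical keys:
      "text"          -> the completion
      "ratings"       -> dict mapping an aspect name to a numeric rating
      "overall_score" -> the critique score (deliberately ignored here)
    Returns None when no response scores strictly lower than the best one.
    """
    rng = rng or random.Random(0)
    scored = [(mean(list(r["ratings"].values())), r) for r in responses]
    best_score, chosen = max(scored, key=lambda pair: pair[0])
    # Assumption: the rejected response is sampled from the strictly lower-rated ones.
    lower = [r for score, r in scored if score < best_score]
    if not lower:
        return None
    return chosen, rng.choice(lower)


responses = [
    # High critique score, low preference ratings: the kind of mismatch described above.
    {"text": "A", "overall_score": 10,
     "ratings": {"helpfulness": 1, "honesty": 2, "instruction_following": 1, "truthfulness": 2}},
    {"text": "B", "overall_score": 7.5,
     "ratings": {"helpfulness": 5, "honesty": 5, "instruction_following": 4, "truthfulness": 5}},
    {"text": "C", "overall_score": 6,
     "ratings": {"helpfulness": 3, "honesty": 3, "instruction_following": 3, "truthfulness": 3}},
]

chosen, rejected = binarize(responses)
by_critique = max(responses, key=lambda r: r["overall_score"])
print(chosen["text"], by_critique["text"])
```

In this toy example the two strategies disagree: the rating average selects "B", while the critique score would have selected "A".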

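Important note: while we fix the original dataset, we opted for the average of the multi-aspect ratings, but a very interesting open question remains: once the critique data is fixed, what works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, stay tuned!

This model wouldn't have been possible without the amazing Alignment Handbook and the UltraFeedback dataset released by OpenBMB, and it builds on fruitful discussions with the HuggingFace H4 team. In particular, we used the zephyr-7b-beta training recipe, which worked out of the box and let us focus on what we do best: high-quality data.

Notus models are intended to be used as assistants via chat-like applications, and are evaluated with chat (MT-Bench, AlpacaEval) and academic (Open LLM Leaderboard) benchmarks to allow a direct comparison with the original Zephyr dDPO model and other 7B models.

Why Notus?: Notus takes its name from the ancient Greek god Notus, as a nod to Zephyr, which is named after the ancient Greek god Zephyrus; the difference is that Notus is the god of the south wind and Zephyrus is the god of the west wind. More information at https://en.wikipedia.org/wiki/Anemoi.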

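Model Details

Model Description

Developed by: Argilla (based on the previous efforts and amazing work of HuggingFace H4 and MistralAI)
Shared by: Argilla
Model type: GPT-like 7B model, DPO fine-tuned
Language(s) (NLP): Mainly English
License: MIT (same as Zephyr 7B-beta)
Finetuned from model: alignment-handbook/zephyr-7b-sft-full

Model Sources

Repository: https://github.com/argilla-io/notus
Paper: N/A
Demo: https://argilla-notus-chat-ui.hf.space/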

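Performance

Chat Benchmarks

The table below is adapted from the original Zephyr-7b-β and Starling tables for the MT-Bench and AlpacaEval benchmarks. Rows are ordered approximately by AlpacaEval win rate, and some models larger than 7B are omitted for brevity.

Notus stays on par with Zephyr on MT-Bench (7.30 vs. 7.34) while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval (91.42% win rate). This makes Notus the most competitive commercially usable 7B model on AlpacaEval.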

Model	Size	Alignment	MT-Bench (score)	AlpacaEval (win rate %)	License
GPT-4-turbo	-	?	9.32	97.70	Proprietary
XwinLM 70b V0.1	70B	dPPO	-	95.57	LLaMA 2 License
GPT-4	-	RLHF	8.99	95.03	Proprietary
Tulu 2+DPO 70B V0.1	70B	dDPO	6.29	95.28	Proprietary
LLaMA2 Chat 70B	70B	RLHF	6.86	92.66	LLaMA 2 License
Starling-7B	7B	C-RLFT + APA	8.09	91.99	CC-BY-NC-4.0
Notus-7b-v1	7B	dDPO	7.30	91.42	MIT
Claude 2	-	RLHF	8.06	91.36	Proprietary
Zephyr-7b-β	7B	dDPO	7.34	90.60	MIT
Cohere Command	-	RLHF	-	90.62	Proprietary
GPT-3.5-turbo	-	RLHF	7.94	89.37	Proprietary

Academic Benchmarks

Results from the Open LLM Leaderboard:

Model	Average	ARC	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8K	DROP
Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta)	52.15	62.03	84.36	61.07	57.45	77.74	12.74	9.66
argilla/notus-7b-v1	52.89	64.59	84.78	63.03	54.37	79.4	15.16	8.91
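As a quick sanity check on the table, the following sketch recomputes the Average column, assuming it is the unweighted mean of the seven benchmark columns (the card does not state how the average is computed). The scores are copied from the rows above.

```python
# Per-benchmark scores in column order:
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, DROP
scores = {
    "HuggingFaceH4/zephyr-7b-beta": [62.03, 84.36, 61.07, 57.45, 77.74, 12.74, 9.66],
    "argilla/notus-7b-v1": [64.59, 84.78, 63.03, 54.37, 79.4, 15.16, 8.91],
}

# Assumption: "Average" is the plain arithmetic mean over the seven columns.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Under that assumption the recomputed values match the reported 52.15 and 52.89.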

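⚠️ As pointed out by AllenAI researchers, UltraFeedback contains prompts from the TruthfulQA dataset, so the results we report on that benchmark are likely inaccurate. We were not aware of this issue when fine-tuning Notus-7B-v1, so the model was trained on TruthfulQA prompts and preferences. For future releases, we will remove the TruthfulQA prompts.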

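Training Details

Training Hardware

We used a VM with 8 x A100 40GB GPUs hosted by Lambda Labs, and also tried other cloud providers, such as GCP, while experimenting.

Training Data

We used a new curated version of openbmb/UltraFeedback, named ultrafeedback-binarized-preferences.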

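TL;DR

After browsing some examples using Argilla's sort and filter features (sorting by the highest rating for chosen responses), we noticed a strong mismatch between the overall_score in the original UF dataset (and in Zephyr's train_prefs dataset) and the quality of the chosen response.

By adding the critique rationale to our Argilla dataset, we confirmed that the rationale was often highly negative while the rating was very high (in most cases the highest possible: 10).

The screenshot below shows an example of this issue.

After some quick investigation, we:

identified hundreds of examples with the same issue,
reported a bug on the UltraFeedback repo,
and informed the H4 team, who were incredibly responsive and ran an additional experiment to validate the new rating binarization approach.

While we work on fixing the original dataset (already narrowed down to about 2,000 problematic examples), we decided to leverage the multi-preference ratings, which led to Notus!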
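The kind of mismatch described above could be surfaced programmatically with a filter like the sketch below. This is only an illustration, not the procedure we actually followed: the record keys ("id", "overall_score", "ratings") and the rating_threshold value are hypothetical.

```python
from statistics import mean


def is_suspicious(record, max_score=10, rating_threshold=2.5):
    """Flag records whose critique score is maximal while the mean aspect rating is low.

    max_score mirrors the "highest: 10" score mentioned above;
    rating_threshold is an arbitrary, hypothetical cut-off.
    """
    ratings = list(record["ratings"].values())
    return record["overall_score"] >= max_score and mean(ratings) <= rating_threshold


records = [
    {"id": "a", "overall_score": 10, "ratings": {"helpfulness": 5, "honesty": 5}},
    {"id": "b", "overall_score": 10, "ratings": {"helpfulness": 1, "honesty": 2}},
    {"id": "c", "overall_score": 4, "ratings": {"helpfulness": 2, "honesty": 2}},
]

# Only "b" combines a top critique score with low preference ratings.
flagged = [r["id"] for r in records if is_suspicious(r)]
print(flagged)
```

Records flagged this way would still need manual review, for example in Argilla, before being treated as errors.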

[Screenshot: an example record whose critique rationale is negative while its rating is the maximum]

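Important note: while we fix the dataset, we opted for the average of the ratings. A very interesting open question remains: once the data is fixed, which works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, stay tuned!

For more details about the dataset analysis and curation, see the ultrafeedback-binarized-preferences dataset card.

Prompt template

We use the same prompt template as HuggingFaceH4/zephyr-7b-beta: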

<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
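For cases where the tokenizer's chat template is not available, the template above can be filled in by hand. The helper below is a minimal sketch: format_prompt is a hypothetical name, and the trailing newline after <|assistant|> is an assumption rather than something the template specifies.

```python
def format_prompt(user_message: str, system_message: str = "") -> str:
    """Render a single-turn prompt following the template shown above."""
    return (
        f"<|system|>\n{system_message}</s>\n"
        f"<|user|>\n{user_message}</s>\n"
        "<|assistant|>\n"  # assumption: generation starts on the next line
    )


print(format_prompt("What is the capital of France?"))
```

When the tokenizer is available, tokenizer.apply_chat_template (used in the examples below) is the safer choice, since it applies the template stored with the model.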

Usage

You will first need to install transformers and accelerate (the latter just to ease device placement), then you can run any of the following:

Use in Openmind

from openmind import AutoTokenizer, AutoModelForCausalLM, is_torch_npu_available
import torch
import openmind
import argparse
import time

def generate_text(prompt, model, tokenizer, device):
    """Optional helper (not called by main): sample a completion through an openmind pipeline."""
    text_generator = openmind.pipeline(
        "text-generation",
        model=model,
        torch_dtype=torch.float16,
        device_map=device,
        tokenizer=tokenizer,
    )

    formatted_prompt = f"Question: {prompt} Answer:"

    sequences = text_generator(
        formatted_prompt,
        do_sample=True,
        top_k=5,
        top_p=0.9,
        num_return_sequences=1,
        repetition_penalty=1.5,
        max_new_tokens=128,
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/notus-7b-v1-openmind",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_path,trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path,trust_remote_code=True)
    model = model.to(device)
    
    start_time = time.time()
    
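    # Run inference: build the chat prompt, tokenize it, and sample a reply.
    messages = [{"role": "user", "content": "What is the capital of France?"}]
    # add_generation_prompt=True appends the assistant header so the model answers
    # instead of continuing the user turn.
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(input_text)
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
    print(tokenizer.decode(outputs[0]))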

    
    end_time = time.time()
    print(f"Hardware: {device}, inference time: {end_time - start_time} s")
    
if __name__ == "__main__":
    main()

Via `generate`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
# Pass the messages list (not an undefined `prompt`) and move the token ids to the model's device.
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, num_return_sequences=1, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

Via `pipeline` method

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
generated_text = outputs[0]["generated_text"]
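By default, the text-generation pipeline returns the prompt followed by the completion in generated_text. The helper below is a small sketch (strip_prompt is a hypothetical name) for keeping only the assistant's reply; passing return_full_text=False to the pipeline call is an alternative.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the newly generated text when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


# Toy strings standing in for the real pipeline prompt and output.
example_prompt = "<|user|>\nHi</s>\n<|assistant|>\n"
example_output = example_prompt + "Hello! How can I help?"
print(strip_prompt(example_output, example_prompt))
```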
