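We introduce Starling-7B, an open large language model (LLM) trained with Reinforcement Learning from AI Feedback (RLAIF). The model draws on our new GPT-4-labeled ranking dataset, berkeley-nest/Nectar, and on our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 on MT Bench with GPT-4 as the judge, outperforming every existing model on MT-Bench except OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset Nectar, the reward model Starling-RM-7B-alpha, and the language model Starling-LM-7B-alpha on HuggingFace, along with an online demo in LMSYS Chatbot Arena. Stay tuned for our forthcoming code and paper, which will describe the whole process in detail.
Starling-LM-7B-alpha is a language model trained from Openchat 3.5 with the reward model berkeley-nest/Starling-RM-7B-alpha and the policy optimization method advantage-induced policy alignment (APA). The evaluation results are listed in the table below.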
| Model | Tuning Method | MT Bench | AlpacaEval | MMLU |
|---|---|---|---|---|
| GPT-4-Turbo | ? | 9.32 | 97.70 | |
| GPT-4 | SFT + PPO | 8.99 | 95.28 | 86.4 |
| Starling-7B | C-RLFT + APA | 8.09 | 91.99 | 63.9 |
| Claude-2 | ? | 8.06 | 91.36 | 78.5 |
| GPT-3.5-Turbo | ? | 7.94 | 89.37 | 70 |
| Claude-1 | ? | 7.9 | 88.39 | 77 |
| Tulu-2-dpo-70b | SFT + DPO | 7.89 | 95.1 | |
| Openchat-3.5 | C-RLFT | 7.81 | 88.51 | 64.3 |
| Zephyr-7B-beta | SFT + DPO | 7.34 | 90.60 | 61.4 |
| Llama-2-70b-chat-hf | SFT + PPO | 6.86 | 92.66 | 63 |
| Neural-chat-7b-v3-1 | SFT + DPO | 6.84 | 84.53 | 62.4 |
| Tulu-2-dpo-7b | SFT + DPO | 6.29 | 85.1 | |
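For a more detailed discussion, please see our blog post, and stay tuned for our upcoming code and paper!
Important: Please use the exact chat template provided below when interacting with the model. Otherwise, performance will degrade. In rare cases the model output can be overly verbose. Consider setting temperature to 0 to make this happen less often.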
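As a minimal sketch of the temperature advice, the helper below maps a requested temperature to keyword arguments for a Hugging Face `generate` call. It assumes that a temperature of 0 is expressed as greedy decoding (`do_sample=False`), since sampling in `transformers` expects a strictly positive temperature. The function name `generation_kwargs` and the default `max_new_tokens=256` are illustrative choices, not part of the original card.

```python
from typing import Any, Dict


def generation_kwargs(temperature: float = 0.0, max_new_tokens: int = 256) -> Dict[str, Any]:
    """Build keyword arguments for `model.generate` (hypothetical helper).

    A temperature of 0 is mapped to greedy decoding; any positive
    temperature enables sampling with that temperature.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")

    kwargs: Dict[str, Any] = {"max_new_tokens": max_new_tokens}
    if temperature == 0:
        # Greedy decoding: always pick the most likely next token.
        kwargs["do_sample"] = False
    else:
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs
```

The result can be unpacked into a call such as `model.generate(input_ids, **generation_kwargs(0))`.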
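Our model follows exactly the same chat template and usage as Openchat 3.5. Please refer to its model card for more details. In addition, our model is hosted on LMSYS Chatbot Arena, where it can be tested for free.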
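Because the exact template matters, it can help to build prompts programmatically rather than by hand. The following is a small sketch that renders a list of messages into the prompt strings used by the tokenizer checks in this card. The helper name `build_prompt`, the message dictionary layout, and the `mode` parameter are assumptions introduced for illustration.

```python
from typing import Dict, List

END_OF_TURN = "<|end_of_turn|>"


def build_prompt(messages: List[Dict[str, str]], mode: str = "GPT4 Correct") -> str:
    """Render chat messages into an Openchat 3.5 style prompt (hypothetical helper).

    `mode` is the role prefix: "GPT4 Correct" for general chat, "Code" for coding mode.
    """
    role_names = {"user": f"{mode} User", "assistant": f"{mode} Assistant"}
    parts = []
    for message in messages:
        role = role_names[message["role"]]  # raises KeyError for unsupported roles
        parts.append(f"{role}: {message['content']}{END_OF_TURN}")
    # Leave the final assistant turn open so the model generates the reply.
    parts.append(f"{role_names['assistant']}:")
    return "".join(parts)
```

For example, `build_prompt([{"role": "user", "content": "Hello"}])` yields the single-turn string used in this card, and passing `mode="Code"` yields the coding-mode variant.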
The conversation template is the same as Openchat 3.5:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("openchat/openchat_3.5")
# Single-turn
tokens = tokenizer("GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:").input_ids
assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]
# Multi-turn
tokens = tokenizer("GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:").input_ids
assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747, 15359, 32000, 420, 6316, 28781, 3198, 3123, 1247, 28747, 1602, 460, 368, 3154, 28804, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]
# Coding Mode
tokens = tokenizer("Code User: Implement quicksort using C++<|end_of_turn|>Code Assistant:").input_ids
assert tokens == [1, 7596, 1247, 28747, 26256, 2936, 7653, 1413, 334, 1680, 32000, 7596, 21631, 28747]

from openmind import AutoTokenizer, AutoModelForCausalLM, is_torch_npu_available
from openmind_hub import snapshot_download
import torch
import openmind
import argparse
import time


def generate_text(prompt, model, tokenizer, device):
    # Note: this helper is not called by main(). It uses a plain
    # "Question/Answer" prompt rather than the chat template above.
    text_generator = openmind.pipeline(
        "text-generation",
        model=model,
        torch_dtype=torch.float16,
        device_map=device,
        tokenizer=tokenizer,
    )
    formatted_prompt = f"Question: {prompt} Answer:"
    sequences = text_generator(
        formatted_prompt,
        do_sample=True,
        top_k=5,
        top_p=0.9,
        num_return_sequences=1,
        repetition_penalty=1.5,
        max_new_tokens=128,
    )
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/Starling-LM-7B-alpha-openmind",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Prefer an Ascend NPU when available, otherwise fall back to CPU.
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model = model.to(device)

    # Time only the inference step, not model loading.
    start_time = time.time()

    messages = [{"role": "user", "content": "What is the capital of France."}]
    # add_generation_prompt=True leaves the assistant turn open,
    # as the chat template above requires.
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(input_text)
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
    print(tokenizer.decode(outputs[0]))

    end_time = time.time()
    print(f"Device: {device}, inference time: {end_time - start_time} seconds")


if __name__ == "__main__":
    main()

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
model = transformers.AutoModelForCausalLM.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")


def generate_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = model.generate(
        input_ids,
        max_length=256,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    # The decoded text contains the prompt followed by the model's reply.
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text


# Single-turn conversation
prompt = "Hello, how are you?"
single_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(single_turn_prompt)
print("Response:", response_text)

# Multi-turn conversation
prompt = "Hello"
follow_up_question = "How are you today?"
# Placeholder for the assistant's reply to the first turn.
response = ""
multi_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant: {response}<|end_of_turn|>GPT4 Correct User: {follow_up_question}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(multi_turn_prompt)
print("Multi-turn conversation response:", response_text)

# Coding conversation
prompt = "Implement quicksort using C++"
coding_prompt = f"Code User: {prompt}<|end_of_turn|>Code Assistant:"
response = generate_response(coding_prompt)
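print("Coding conversation response:", response)

The dataset, model, and online demo are a research preview intended for non-commercial use only, subject to the data distillation License of LLaMA, the Terms of Use of the data generated by OpenAI, and the Privacy Practices of ShareGPT. Please contact us if you find any potential violation.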
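We would like to thank Wei-Lin Chiang from UC Berkeley for detailed feedback on the blog and the project. We would like to thank the LMSYS Organization for their support with the lmsys-chat-1M dataset, evaluation, and the online demo. We would like to thank the open source community for providing the datasets and base models we used to develop this project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan, and ShareGPT.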
@misc{starling2023,
title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
url = {},
author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
month = {November},
year = {2023}
}