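harrier-oss-v1 is a family of multilingual text embedding models developed by Microsoft. The models use a decoder-only architecture with last-token pooling and L2 normalization to produce dense text embeddings. They can be applied to a wide range of tasks, including but not limited to retrieval, clustering, semantic similarity, classification, bitext mining, and reranking. As of the release date, the models achieve state-of-the-art results on the Multilingual MTEB v2 benchmark.

| Model | Parameters | Embedding dimension | Max tokens | MTEB v2 score |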
|---|---|---|---|---|
| harrier-oss-v1-270m | 270M | 640 | 32,768 | 66.5 |
| harrier-oss-v1-0.6b | 0.6B | 1,024 | 32,768 | 69.0 |
| harrier-oss-v1-27b | 27B | 5,376 | 32,768 | 74.3 |
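
All models are trained with a contrastive learning objective on a large-scale mixture of multilingual datasets covering a diverse set of tasks. The 270m and 0.6b variants are additionally trained with knowledge distillation from larger embedding models.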
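
The source does not state the exact loss, so the following is only a sketch of one common contrastive objective, InfoNCE with in-batch negatives, written in dependency-free Python. The function name `info_nce_loss`, the temperature value, and the toy vectors are assumptions for illustration, not details of how harrier-oss-v1 was trained.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """InfoNCE with in-batch negatives (assumed form, not from the source).

    query_embs[i] is paired with doc_embs[i]; every other document in the
    batch acts as a negative. Inputs are assumed to be L2-normalized.
    """
    losses = []
    for i, query in enumerate(query_embs):
        logits = [dot(query, doc) / temperature for doc in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Negative log-probability of the matching document
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query matches the document at the same index
aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same queries, but the documents are swapped so every pair is mismatched
mismatched_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, mismatched_loss)
```

The loss is near zero when each query is closest to its own document and grows large when the pairs are mismatched, which is the signal a contrastive objective of this kind optimizes.
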

The example below encodes queries and passages from the MS MARCO passage ranking dataset.

```python
from sentence_transformers import SentenceTransformer

# dtype="auto" leaves the choice of precision to the library
model = SentenceTransformer("microsoft/harrier-oss-v1-0.6b", model_kwargs={"dtype": "auto"})

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

# Queries are encoded with a task prompt; documents are encoded without one
query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

# Query-by-document similarity matrix, scaled by 100
scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
```

See `config_sentence_transformers.json` for the preconfigured prompts, such as `web_search_query`, `sts_query`, and `bitext_query`. You can also pass a custom instruction directly, for example `model.encode(queries, prompt="Instruct: Retrieve semantically similar text\nQuery: ")`.
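
To make the instruction format concrete, here is a minimal sketch of how a custom prompt string is built and what the model effectively sees once the prompt is prepended to each input text. The helper names `build_instruction_prompt` and `apply_prompt` are hypothetical and are not part of Sentence Transformers; only the `Instruct: ...\nQuery: ` layout comes from the text above.

```python
def build_instruction_prompt(task_description: str) -> str:
    """Format a one-sentence task description as an instruction prefix."""
    return f"Instruct: {task_description}\nQuery: "


def apply_prompt(prompt: str, texts: list[str]) -> list[str]:
    """Illustrate prompt handling by prepending the prompt to every text."""
    return [prompt + text for text in texts]


prompt = build_instruction_prompt("Retrieve semantically similar text")
prompted_queries = apply_prompt(prompt, ["summit define"])
print(repr(prompted_queries[0]))
```

In practice the prepending is left to the library: the string returned by `build_instruction_prompt` would be passed as `model.encode(queries, prompt=prompt)`.
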

The same embeddings can be computed with the Transformers library directly, in which case pooling and normalization are applied explicitly:

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds a real token in every sequence
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, index the last non-padding token of each sequence
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('microsoft/harrier-oss-v1-0.6b')
model = AutoModel.from_pretrained('microsoft/harrier-oss-v1-0.6b', dtype='auto')
model.eval()
model.cuda()

max_length = 32768
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.cuda() for k, v in batch_dict.items()}

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# The first two rows are queries, the remaining rows are documents
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```

The models are trained on multilingual data and support a wide range of languages, including but not limited to: Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Lithuanian, Latvian, Macedonian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Serbian, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Chinese.

To reproduce our scores, see the mteb repository. The evaluation prompts used for each task are also available in `mteb_v2_eval_prompts.json`.
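
1. Do I need to add instructions to the query?
Yes. This is how the models were trained, and performance degrades without them. The task definition should be a one-sentence instruction that describes the task. This is a way to customize text embeddings for different scenarios through natural-language instructions.
On the other hand, no instruction is needed on the document side.
2. Why are my reproduced results slightly different from those reported in the model card?
Different versions of transformers and pytorch can cause small but non-zero performance differences.
3. What pooling strategy do the models use?
The models use last-token pooling: the embedding of the last non-padding token is taken as the sentence representation, and the result is then L2-normalized. Sentence Transformers handles this automatically.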
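
The pooling rule in the last answer can be illustrated on a toy batch. The sketch below uses plain Python lists instead of tensors so that the indexing is explicit; the function names and the numbers are made up for illustration, and real use would rely on the tensor-based `last_token_pool` helper or on Sentence Transformers.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the vector of the last non-padding token for each sequence.

    hidden_states: nested list shaped [batch][seq_len][dim]
    attention_mask: nested list shaped [batch][seq_len], 1 = real token
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        # Highest position whose mask is 1; works for left or right padding
        last_index = max(i for i, flag in enumerate(mask) if flag == 1)
        pooled.append(token_vectors[last_index])
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Two sequences of length 3; the first is right-padded by one position
hidden_states = [
    [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]],  # last real token is at index 1
    [[0.0, 2.0], [5.0, 5.0], [6.0, 8.0]],  # last real token is at index 2
]
attention_mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = last_token_pool(hidden_states, attention_mask)
embeddings = [l2_normalize(vector) for vector in pooled]
print(pooled)
print(embeddings)
```

The padded position in the first sequence is ignored, and both pooled vectors have unit length after normalization, so their dot products are cosine similarities.
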