Cross-Encoder for multilingual MS Marco

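This model was trained on the MMARCO dataset, a version of MS MARCO machine-translated into 14 languages with Google Translate. In our experiments, we observed that it also performs well for other languages.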

We used the multilingual MiniLMv2 model as the base model.

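The model can be used for information retrieval: given a query, encode the query together with each candidate passage (e.g. passages retrieved with ElasticSearch), then sort the passages in decreasing order of score. See SBERT.net Retrieve & Re-rank for more details. The training code is available here: SBERT.net Training MS Marco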
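
The re-ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `rerank` and `toy_overlap_score` are hypothetical helpers introduced here, and the toy scorer is only a stand-in so the example runs without downloading a model. In practice, `score_fn` would presumably be the `predict` method of a loaded CrossEncoder.

```python
import re
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the passages best-first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (NOT the model): counts words shared by query and passage."""
    def words(text: str) -> set:
        return set(re.findall(r"\w+", text.lower()))

    return [float(len(words(query) & words(passage))) for query, passage in pairs]


if __name__ == "__main__":
    candidates = [
        "New York City is famous for the Metropolitan Museum of Art.",
        "Berlin has a population of 3,520,031 registered inhabitants.",
    ]
    results = rerank("How many people live in Berlin?", candidates, toy_overlap_score)
    for passage, score in results:
        print(f"{score:.1f}\t{passage}")
```

Passing the scorer as an argument keeps the sorting logic independent of how the scores are produced, so the same function works with any model that returns one score per pair.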

Usage with SentenceTransformers

Usage is easiest when you have SentenceTransformers installed. You can then use the pre-trained model like this:

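from sentence_transformers import CrossEncoder

# 'model_name' is a placeholder for the model identifier
model = CrossEncoder('model_name')
# Each input is a (query, passage) pair; one score is returned per pair
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])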

Usage with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# The first list holds the queries, the second the passages; they are paired by position
features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    [
        'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
        'New York City is famous for the Metropolitan Museum of Art.',
    ],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
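
The printed logits still need to be turned into a ranking. The sketch below shows one way to do that, assuming the model emits a single logit per pair and that the tensor has already been converted to a plain list (for example with `scores.squeeze(-1).tolist()`). `sigmoid` and `top_k` are hypothetical helpers, and the logit values are made up for illustration, not real model output.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a logit into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def top_k(
    passages: Sequence[str], logits: Sequence[float], k: int = 10
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring passages with sigmoid-normalized scores."""
    if len(passages) != len(logits):
        raise ValueError("passages and logits must have the same length")
    # Sort indices by raw logit; sigmoid is monotonic, so the order is unchanged.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(passages[i], sigmoid(logits[i])) for i in order[:k]]


if __name__ == "__main__":
    # Illustrative values only, NOT real model output.
    example_logits = [-6.0, 4.0]
    example_passages = [
        "New York City is famous for the Metropolitan Museum of Art.",
        "Berlin has a population of 3,520,031 registered inhabitants.",
    ]
    for passage, score in top_k(example_passages, example_logits, k=1):
        print(f"{score:.3f}\t{passage}")
```

Applying the sigmoid is optional: it only rescales the scores into (0, 1) for easier thresholding or display and leaves the ranking itself untouched.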
