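MMLW (muszę mieć lepszą wiadomość) is a family of neural text encoders for Polish. This model is optimized for information retrieval tasks. It transforms queries and passages into 1024-dimensional vectors. The model was developed using a two-step procedure.
⚠️ 2023-12-26: We have updated the model to a new version with improved results. You can still download the previous version using the v1 tag: AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-large", revision="v1") ⚠️
⚠️ Our dense retrievers require specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with "query: " and passages with "passage: " ⚠️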
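As a minimal sketch of the prefix convention, the helpers below prepend the required strings before texts are handed to the tokenizer. The names `format_query` and `format_passage` are hypothetical and not part of any library, and the sample query is illustrative only. The two prefix strings are the only parts taken from the note above.

```python
# Prefixes required by this model, as stated in the note above.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def format_query(text: str) -> str:
    """Prepend the query prefix (hypothetical helper, not a library function)."""
    return QUERY_PREFIX + text.strip()


def format_passage(text: str) -> str:
    """Prepend the passage prefix (hypothetical helper, not a library function)."""
    return PASSAGE_PREFIX + text.strip()


# Illustrative inputs: one query and two passages.
queries = [format_query("Jak dożyć 100 lat?")]
passages = [
    format_passage("Trzeba zdrowo się odżywiać i uprawiać sport."),
    format_passage("Trzeba pić alkohol, imprezować i jeździć szybkimi autami."),
]

print(queries[0])
print(passages[0])
```

The formatted strings can then be passed to the tokenizer in place of the raw texts.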
You can use the model with the openmind library like this:
import argparse

import torch
from openmind import AutoTokenizer, AutoModel, is_torch_npu_available


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum only the non-padding token vectors and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/mmlw-retrieval-e5-large",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use an Ascend NPU when available, otherwise fall back to the CPU
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    # Load the tokenizer and model from the given path or hub ID
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path).to(device)

    # Note: for retrieval, prefix queries with "query: " and passages with "passage: " before encoding
    sentences = ['Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu.', 'Trzeba pić alkohol, imprezować i jeździć szybkimi autami.']

    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(device)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    print("Sentence embeddings:")
    print(sentence_embeddings)


if __name__ == "__main__":
    main()
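Once the embeddings are computed, a retrieval step usually ranks passages by their similarity to the query. The sketch below assumes cosine similarity, which is a common choice but is not stated in this document. It uses toy 3-dimensional vectors in place of the 1024-dimensional embeddings, and the function names `cosine_similarity` and `rank_passages` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real model embeddings (assumption for illustration).
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.1, 0.9],
    [0.5, 0.5, 0.5],
]

ranking = rank_passages(query_vec, passage_vecs)
for index, score in ranking:
    print(f"passage {index}: {score:.4f}")
```

With real embeddings, each row of `sentence_embeddings` can be converted to a list with `.tolist()` and passed to the same functions.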
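The model achieves an NDCG@10 of 58.30 on the Polish Information Retrieval Benchmark. See the PIRB Leaderboard for detailed results.
The training of this model was supported by an A100 GPU cluster provided by the Gdansk University of Technology within the TASK center initiative.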
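For readers unfamiliar with the metric, the sketch below shows one common way to compute NDCG@10 for a single query. It assumes the standard formulation with a log2 rank discount and linear gains; the exact evaluation code behind PIRB is not described here and may differ in details. The reported score is presumably an average over queries expressed on a 0 to 100 scale. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels (rank starts at 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the system returned documents.
    all_relevances: relevance labels of every judged document for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # No relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy example: two relevant documents, retrieved at ranks 2 and 4.
ranked = [0, 1, 0, 1]
judged = [1, 1, 0, 0]
score = ndcg_at_k(ranked, judged, k=10)
print(f"NDCG@10 = {score:.4f}")
```

A ranking that places both relevant documents first would score 1.0 under this definition.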
@article{dadas2024pirb,
title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
year={2024},
eprint={2402.13350},
archivePrefix={arXiv},
primaryClass={cs.CL}
}