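MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish. This is a distilled model that can be used to generate embeddings for many tasks, such as semantic similarity, clustering, and information retrieval. The model can also serve as a base for further fine-tuning. It transforms texts into 1024-dimensional vectors. The model was initialized with a Polish RoBERTa checkpoint and then trained with the multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs. We used English FlagEmbeddings (BGE) as teacher models for distillation.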
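To make the training idea more concrete, here is a minimal sketch of a multilingual knowledge distillation objective as it is commonly formulated: the student's embeddings of an English sentence and of its Polish translation are both pulled toward the teacher's embedding of the English sentence with a mean squared error. The exact loss used to train this model is not given here, so the formulation, the function names `mse` and `distillation_loss`, and the toy vectors are all assumptions for illustration.

```python
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: Vector, student_en: Vector, student_pl: Vector) -> float:
    """Pull both student embeddings toward the teacher's English embedding.

    Assumed formulation, not confirmed as the authors' exact objective.
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_pl)


# Toy 4-dimensional vectors standing in for real 1024-dimensional embeddings.
teacher_en = [1.0, 0.0, 2.0, -1.0]
student_en = [1.0, 0.0, 2.0, -1.0]  # already matches the teacher
student_pl = [0.0, 0.0, 2.0, -1.0]  # differs in the first coordinate

loss = distillation_loss(teacher_en, student_en, student_pl)
print(loss)
```

In a real training loop the vectors would come from the teacher and student encoders, and the loss would be averaged over a batch of text pairs.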
⚠️ Our embedding models require specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix "zapytanie: " ⚠️
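A small helper can make the prefix requirement harder to forget. The sketch below prepends "zapytanie: " to each query before encoding. The names `QUERY_PREFIX` and `add_query_prefix` and the sample query are hypothetical, and the helper covers queries only, since that is the only prefix named above.

```python
from typing import List

QUERY_PREFIX = "zapytanie: "  # prefix required for queries, as stated above


def add_query_prefix(queries: List[str]) -> List[str]:
    """Prepend the query prefix to each query unless it is already present."""
    return [q if q.startswith(QUERY_PREFIX) else QUERY_PREFIX + q for q in queries]


# Illustrative Polish query (hypothetical example text).
queries = ["Jak dożyć 100 lat?"]
prefixed = add_query_prefix(queries)
print(prefixed)
```

The prefixed strings can then be passed to the tokenizer in place of the raw queries.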
You can use the model with openmind like this:
import argparse

import torch
from openmind import AutoTokenizer, AutoModel, is_torch_npu_available


# Mean pooling: average the token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Clamp the denominator to avoid division by zero for fully masked inputs
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/mmlw-roberta-large-openmind",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use an NPU when one is available, otherwise fall back to the CPU
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    # Load the tokenizer and model from the given path or repository ID
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path).to(device)

    sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']

    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(device)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    print("Sentence embeddings:")
    print(sentence_embeddings)


if __name__ == "__main__":
    main()

This model was trained with the support of an A100 GPU cluster provided by the Gdansk University of Technology within the TASK center initiative.
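The usage script stops at printing the sentence embeddings. For the semantic similarity and retrieval tasks mentioned at the top, those vectors are typically compared with cosine similarity. The following is a minimal, dependency-free sketch using toy 3-dimensional vectors in place of real embeddings. The function names `cosine_similarity` and `rank_by_similarity` are hypothetical, and cosine similarity is an assumed choice of metric rather than something this document prescribes.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real 1024-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

order = rank_by_similarity(query_vec, doc_vecs)
print(order)
```

With real model output, each row of `sentence_embeddings` can be converted with `.tolist()` and passed to these functions.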
@article{dadas2024pirb,
title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
year={2024},
eprint={2402.13350},
archivePrefix={arXiv},
primaryClass={cs.CL}
}