HuggingFace Mirror/Solon-embeddings-base-0.1-openmind

How to use with openmind

from openmind import AutoTokenizer, AutoModel, is_torch_npu_available
import torch
import argparse

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/Solon-embeddings-base-0.1-openmind",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
        
    # Load the tokenizer and model from the given local path or hub repository ID
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path).to(device)
    sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(device)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
    
if __name__ == "__main__":
    main()
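
The script above prints the raw sentence embeddings. A common next step, assumed here rather than stated in this model card, is to compare two embeddings with cosine similarity. The sketch below uses only the standard library and toy vectors in place of real rows from `sentence_embeddings.tolist()`; `cosine_similarity` is an illustrative helper, not part of openmind.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against all-zero vectors to avoid a division by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for rows of sentence_embeddings.tolist() (hypothetical values).
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # parallel to emb_a: similarity close to 1
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a: similarity of 0

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

With real model output, the same comparison can be done on tensors by L2-normalizing the embeddings and taking their dot product.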

Solon Embeddings — Base 0.1

State-of-the-art open-source French embedding model.

Instructions:
Prepend "query : " to each query to improve retrieval performance.
Passages need no additional instruction.
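
As a minimal sketch of the instruction above, the snippet below prefixes queries and leaves passages untouched before they are passed to the tokenizer. The helper name `build_retrieval_inputs` and the French example strings are hypothetical; only the prefix itself comes from this model card.

```python
QUERY_PREFIX = "query : "  # prefix quoted in the usage instructions


def build_retrieval_inputs(queries, passages):
    """Return (prefixed_queries, passages); only queries receive the prefix."""
    prefixed_queries = [QUERY_PREFIX + q for q in queries]
    return prefixed_queries, list(passages)


# Hypothetical example texts, for illustration only.
queries = ["Comment changer la carte bancaire associée ?"]
passages = ["Pour changer de carte, ouvrez les paramètres de paiement."]

prefixed, docs = build_retrieval_inputs(queries, passages)
print(prefixed[0])
print(docs[0])
```

Both lists can then be tokenized and embedded in the same way as `sentences` in the usage script.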

Model                                           Mean Score
OrdalieTech/Solon-embeddings-large-0.1          0.7490
cohere/embed-multilingual-v3                    0.7402
OrdalieTech/Solon-embeddings-base-0.1           0.7306
openai/ada-002                                  0.7290
cohere/embed-multilingual-light-v3              0.6945
antoinelouis/biencoder-camembert-base-mmarcoFR  0.6826
dangvantuan/sentence-camembert-large            0.6756
voyage/voyage-01                                0.6753
intfloat/multilingual-e5-large                  0.6660
intfloat/multilingual-e5-base                   0.6597
Sbert/paraphrase-multilingual-mpnet-base-v2     0.5975
dangvantuan/sentence-camembert-base             0.5456
EuropeanParliament/eubert_embedding_v1          0.5063

These results were obtained on 9 French benchmarks covering a variety of text-similarity tasks (classification, reranking, STS):

  • AmazonReviewsClassification (MTEB)
  • MassiveIntentClassification (MTEB)
  • MassiveScenarioClassification (MTEB)
  • MTOPDomainClassification (MTEB)
  • MTOPIntentClassification (MTEB)
  • STS22 (MTEB)
  • MiraclFRRerank (Miracl)
  • OrdalieFRSTS (Ordalie)
  • OrdalieFRReranking (Ordalie)

We created OrdalieFRSTS and OrdalieFRReranking to strengthen the benchmarks available for evaluating French STS and reranking.

(The evaluation script is available here: github.com/OrdalieTech/mteb)