HuggingFace Mirror/Arabic-labse-Matryoshka-openmind

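SentenceTransformer based on sentence-transformers/LaBSE

This is a sentence-transformers model fine-tuned from sentence-transformers/LaBSE on the Omartificial-Intelligence-Space/arabic-n_li-triplet dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.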

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base Model: sentence-transformers/LaBSE
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • Omartificial-Intelligence-Space/arabic-n_li-triplet

Model Sources

  • Documentation: Sentence Transformers Documentation
  • Repository: Sentence Transformers on GitHub
  • Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
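
The module list above reads as a four-stage pipeline: the transformer produces one vector per token, pooling keeps the first ([CLS]) token, a dense layer with Tanh activation transforms it, and the result is scaled to unit length. The following is a minimal, dependency-free sketch of the last three stages on toy data. The function names, the 2-dimensional vectors, and the identity weight matrix are illustrative assumptions, not the model's real weights or the library's implementation.

```python
import math

def cls_pool(token_embeddings):
    """Return the vector of the first ([CLS]) token of one sentence."""
    return token_embeddings[0]

def dense_tanh(vector, weight, bias):
    """Apply y = tanh(W x + b), where `weight` is a list of rows."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weight, bias)
    ]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def sentence_embedding(token_embeddings, weight, bias):
    """Chain the three stages: CLS pooling, Dense + Tanh, Normalize."""
    return l2_normalize(dense_tanh(cls_pool(token_embeddings), weight, bias))

# Toy input: 3 tokens with hidden size 2 (the real model uses 768).
tokens = [[1.0, 2.0], [5.0, 5.0], [9.0, 9.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights for illustration
embedding = sentence_embedding(tokens, identity, [0.0, 0.0])
print(embedding)
```

Because the final stage normalizes the vector, dot product and cosine similarity give the same ranking for full-size embeddings.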

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-labse")
# Run inference
sentences = [
    'يجلس شاب ذو شعر أشقر على الحائط يقرأ جريدة بينما تمر امرأة وفتاة شابة.',
    'ذكر شاب ينظر إلى جريدة بينما تمر إمرأتان بجانبه',
    'الشاب نائم بينما الأم تقود ابنتها إلى الحديقة',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
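
Since the model was trained with Matryoshka dimensions 768, 512, 256, 128, and 64, a shorter embedding can be obtained by keeping only the leading components and renormalizing. The sketch below shows that step on toy 8-dimensional vectors standing in for real model output; the helper names are hypothetical. Sentence Transformers also documents a truncate_dim option for this purpose, though whether it is available depends on the installed version.

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional vectors (assumed values, not real embeddings).
full_a = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
full_b = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

small_a = truncate_embedding(full_a, 4)
small_b = truncate_embedding(full_b, 4)
print(len(small_a), cosine(full_a, full_b), cosine(small_a, small_b))
```

Similarity scores generally change after truncation, which is why the evaluation section reports results separately for each dimension.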

How to use with openmind

from openmind import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('models/Arabic-labse-Matryoshka')
model = AutoModel.from_pretrained('models/Arabic-labse-Matryoshka')
sentences = ['شخص على حصان يقفز فوق طائرة معطلة', 'أطفال يبتسمون و يلوحون للكاميرا']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
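# Perform pooling. In this case, mean pooling.
# Note: the SentenceTransformer pipeline listed earlier uses CLS pooling followed by
# Dense and Normalize, so these vectors may differ from model.encode() output.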
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
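
To see what the attention mask contributes in mean_pooling, the same computation can be written out for a single sentence in plain Python. This is only an illustrative sketch with made-up numbers; masked_mean is a hypothetical helper, not part of openmind or PyTorch.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero.
    denominator = max(count, 1e-9)
    return [t / denominator for t in total]

# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)
```

Without the mask, the padding row would dominate the average, so sentences padded to different lengths would not be comparable.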

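Evaluation

Metrics

Semantic Similarity

  • Dataset: sts-test-768
  • Evaluated with EmbeddingSimilarityEvaluator
  • This table and the next four (dimensions 768 down to 64) match the step-0 row of the training logs.

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.7269 |
| spearman_cosine | 0.7225 |
| pearson_manhattan | 0.7259 |
| spearman_manhattan | 0.721 |
| pearson_euclidean | 0.726 |
| spearman_euclidean | 0.7225 |
| pearson_dot | 0.7269 |
| spearman_dot | 0.7225 |
| pearson_max | 0.7269 |
| spearman_max | 0.7225 |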

Semantic Similarity

  • Dataset: sts-test-512
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.7268 |
| spearman_cosine | 0.7224 |
| pearson_manhattan | 0.7241 |
| spearman_manhattan | 0.7195 |
| pearson_euclidean | 0.7248 |
| spearman_euclidean | 0.7213 |
| pearson_dot | 0.7253 |
| spearman_dot | 0.7205 |
| pearson_max | 0.7268 |
| spearman_max | 0.7224 |

Semantic Similarity

  • Dataset: sts-test-256
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.7283 |
| spearman_cosine | 0.7264 |
| pearson_manhattan | 0.7228 |
| spearman_manhattan | 0.7181 |
| pearson_euclidean | 0.7251 |
| spearman_euclidean | 0.7215 |
| pearson_dot | 0.7243 |
| spearman_dot | 0.7221 |
| pearson_max | 0.7283 |
| spearman_max | 0.7264 |

Semantic Similarity

  • Dataset: sts-test-128
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.7102 |
| spearman_cosine | 0.7104 |
| pearson_manhattan | 0.7135 |
| spearman_manhattan | 0.7089 |
| pearson_euclidean | 0.7172 |
| spearman_euclidean | 0.713 |
| pearson_dot | 0.6778 |
| spearman_dot | 0.6746 |
| pearson_max | 0.7172 |
| spearman_max | 0.713 |

Semantic Similarity

  • Dataset: sts-test-64
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.6931 |
| spearman_cosine | 0.6982 |
| pearson_manhattan | 0.6971 |
| spearman_manhattan | 0.6942 |
| pearson_euclidean | 0.7013 |
| spearman_euclidean | 0.6987 |
| pearson_dot | 0.6377 |
| spearman_dot | 0.6345 |
| pearson_max | 0.7013 |
| spearman_max | 0.6987 |

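Semantic Similarity

  • Dataset: sts-test-768
  • Evaluated with EmbeddingSimilarityEvaluator
  • This table and the remaining four (dimensions 768 down to 64) match the final row of the training logs (step 8717).

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.8144 |
| spearman_cosine | 0.8205 |
| pearson_manhattan | 0.8203 |
| spearman_manhattan | 0.8204 |
| pearson_euclidean | 0.8202 |
| spearman_euclidean | 0.8205 |
| pearson_dot | 0.8144 |
| spearman_dot | 0.8205 |
| pearson_max | 0.8203 |
| spearman_max | 0.8205 |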

Semantic Similarity

  • Dataset: sts-test-512
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.8143 |
| spearman_cosine | 0.8212 |
| pearson_manhattan | 0.8217 |
| spearman_manhattan | 0.8216 |
| pearson_euclidean | 0.8216 |
| spearman_euclidean | 0.8219 |
| pearson_dot | 0.8097 |
| spearman_dot | 0.8147 |
| pearson_max | 0.8217 |
| spearman_max | 0.8219 |

Semantic Similarity

  • Dataset: sts-test-256
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.8076 |
| spearman_cosine | 0.8159 |
| pearson_manhattan | 0.8209 |
| spearman_manhattan | 0.8197 |
| pearson_euclidean | 0.821 |
| spearman_euclidean | 0.8203 |
| pearson_dot | 0.7871 |
| spearman_dot | 0.7875 |
| pearson_max | 0.821 |
| spearman_max | 0.8203 |

Semantic Similarity

  • Dataset: sts-test-128
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.8024 |
| spearman_cosine | 0.8118 |
| pearson_manhattan | 0.8189 |
| spearman_manhattan | 0.8181 |
| pearson_euclidean | 0.8198 |
| spearman_euclidean | 0.8185 |
| pearson_dot | 0.7513 |
| spearman_dot | 0.7428 |
| pearson_max | 0.8198 |
| spearman_max | 0.8185 |

Semantic Similarity

  • Dataset: sts-test-64
  • Evaluated with EmbeddingSimilarityEvaluator

| Metric | Value |
|:-------|------:|
| pearson_cosine | 0.7855 |
| spearman_cosine | 0.7949 |
| pearson_manhattan | 0.806 |
| spearman_manhattan | 0.8041 |
| pearson_euclidean | 0.8088 |
| spearman_euclidean | 0.806 |
| pearson_dot | 0.6778 |
| spearman_dot | 0.6616 |
| pearson_max | 0.8088 |
| spearman_max | 0.806 |
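
The pearson_* and spearman_* rows above are correlations between the model's similarity scores and human-assigned gold scores. The following sketch shows how both coefficients can be computed in plain Python. The gold and predicted values are made-up illustrations, and the helper functions are not the evaluator's actual code.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gold = [0.0, 1.0, 2.5, 4.0, 5.0]            # hypothetical human scores
predicted = [0.10, 0.20, 0.35, 0.90, 0.80]  # hypothetical cosine similarities
print(pearson(gold, predicted), spearman(gold, predicted))
```

Spearman depends only on the ordering of the scores, so it is unaffected by any monotonic rescaling of the similarity function.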

Training Details

Training Dataset

Omartificial-Intelligence-Space/arabic-n_li-triplet

  • Dataset: Omartificial-Intelligence-Space/arabic-n_li-triplet
  • Size: 557,850 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    |      | anchor | positive | negative |
    |:-----|:-------|:---------|:---------|
    | type | string | string | string |
    | min  | 4 tokens | 4 tokens | 5 tokens |
    | mean | 9.99 tokens | 12.44 tokens | 13.82 tokens |
    | max  | 51 tokens | 49 tokens | 49 tokens |
  • Samples:
    | anchor | positive | negative |
    |:-------|:---------|:---------|
    | شخص على حصان يقفز فوق طائرة معطلة | شخص في الهواء الطلق، على حصان. | شخص في مطعم، يطلب عجة. |
    | أطفال يبتسمون و يلوحون للكاميرا | هناك أطفال حاضرون | الاطفال يتجهمون |
    | صبي يقفز على لوح التزلج في منتصف الجسر الأحمر. | الفتى يقوم بخدعة التزلج | الصبي يتزلج على الرصيف |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
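
The configuration above combines two ideas: a ranking loss that treats the other positives in a batch as negatives, and a Matryoshka wrapper that applies that loss to each truncated prefix of the embeddings and sums the weighted results. The sketch below illustrates this on toy 4-dimensional vectors. It is a simplification under stated assumptions: it ignores the explicit negative column, uses a similarity scale of 20.0 as an assumed value, and is not the Sentence Transformers implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ranking_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy: anchor i should score highest with positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embedding prefixes."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        short_anchors = [a[:dim] for a in anchors]
        short_positives = [p[:dim] for p in positives]
        total += weight * ranking_loss(short_anchors, short_positives)
    return total

# Toy batch of two pairs; the real setup uses dims [768, 512, 256, 128, 64].
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Because every prefix contributes to the loss, the leading components are pushed to carry most of the semantic signal, which is what makes truncated embeddings usable.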

Evaluation Dataset

Omartificial-Intelligence-Space/arabic-n_li-triplet

  • Dataset: Omartificial-Intelligence-Space/arabic-n_li-triplet
  • Size: 6,584 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    |      | anchor | positive | negative |
    |:-----|:-------|:---------|:---------|
    | type | string | string | string |
    | min  | 4 tokens | 4 tokens | 4 tokens |
    | mean | 19.71 tokens | 9.37 tokens | 10.49 tokens |
    | max  | 100 tokens | 38 tokens | 34 tokens |
  • Samples:
    | anchor | positive | negative |
    |:-------|:---------|:---------|
    | امرأتان يتعانقان بينما يحملان حزمة | إمرأتان يحملان حزمة | الرجال يتشاجرون خارج مطعم |
    | طفلين صغيرين يرتديان قميصاً أزرق، أحدهما يرتدي الرقم 9 والآخر يرتدي الرقم 2 يقفان على خطوات خشبية في الحمام ويغسلان أيديهما في المغسلة. | طفلين يرتديان قميصاً مرقماً يغسلون أيديهم | طفلين يرتديان سترة يذهبان إلى المدرسة |
    | رجل يبيع الدونات لعميل خلال معرض عالمي أقيم في مدينة أنجليس | رجل يبيع الدونات لعميل | امرأة تشرب قهوتها في مقهى صغير |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
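
The no_duplicates batch sampler matters here because the ranking loss treats other samples in the same batch as negatives, so a repeated text inside a batch would act as a false negative. The sketch below illustrates the idea with a greedy pass over string triplets. It is a simplified stand-in written for this explanation; the library's own sampler also shuffles and may treat leftover samples differently.

```python
def no_duplicate_batches(samples, batch_size):
    """Greedily build batches in which no text appears twice.

    `samples` is a list of (anchor, positive, negative) string triplets.
    A sample that would repeat a text already in the current batch is
    deferred to a later batch; a final short batch is kept.
    """
    remaining = list(samples)
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for sample in remaining:
            texts = set(sample)
            if len(batch) < batch_size and not (texts & seen):
                batch.append(sample)
                seen |= texts
            else:
                deferred.append(sample)
        batches.append(batch)
        remaining = deferred
    return batches

# Hypothetical triplets: the first two share an anchor, the first and last share a positive.
samples = [
    ("a", "b", "c"),
    ("a", "d", "e"),
    ("f", "g", "h"),
    ("i", "b", "j"),
]
batches = no_duplicate_batches(samples, batch_size=2)
print(batches)
```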

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
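
Several of the values above determine the learning-rate curve: learning_rate 5e-05, lr_scheduler_type linear, warmup_ratio 0.1, and one epoch over 557,850 samples at batch size 64. The following is a rough sketch of that schedule, assuming a single device and no gradient accumulation; the linear_lr function is an illustration of linear warmup followed by linear decay, not the trainer's own code.

```python
import math

TRAIN_SAMPLES = 557_850  # training set size stated in this card
BATCH_SIZE = 64          # per_device_train_batch_size
BASE_LR = 5e-05          # learning_rate
WARMUP_RATIO = 0.1       # warmup_ratio

# One epoch on one device, keeping the last partial batch (dataloader_drop_last: False).
total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def linear_lr(step):
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)

print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the step count works out to 8717, which agrees with the final step reported in the training logs.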

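Training Logs

| Epoch | Step | Training Loss | sts-test-128_spearman_cosine | sts-test-256_spearman_cosine | sts-test-512_spearman_cosine | sts-test-64_spearman_cosine | sts-test-768_spearman_cosine |
|:------|:-----|:--------------|:-----------------------------|:-----------------------------|:-----------------------------|:----------------------------|:-----------------------------|
| None | 0 | - | 0.7104 | 0.7264 | 0.7224 | 0.6982 | 0.7225 |
| 0.0229 | 200 | 13.1738 | - | - | - | - | - |
| 0.0459 | 400 | 8.8127 | - | - | - | - | - |
| 0.0688 | 600 | 8.0984 | - | - | - | - | - |
| 0.0918 | 800 | 7.2984 | - | - | - | - | - |
| 0.1147 | 1000 | 7.5749 | - | - | - | - | - |
| 0.1377 | 1200 | 7.1292 | - | - | - | - | - |
| 0.1606 | 1400 | 6.6146 | - | - | - | - | - |
| 0.1835 | 1600 | 6.6523 | - | - | - | - | - |
| 0.2065 | 1800 | 6.1095 | - | - | - | - | - |
| 0.2294 | 2000 | 6.0841 | - | - | - | - | - |
| 0.2524 | 2200 | 6.3024 | - | - | - | - | - |
| 0.2753 | 2400 | 6.1941 | - | - | - | - | - |
| 0.2983 | 2600 | 6.1686 | - | - | - | - | - |
| 0.3212 | 2800 | 5.8317 | - | - | - | - | - |
| 0.3442 | 3000 | 6.0597 | - | - | - | - | - |
| 0.3671 | 3200 | 5.7832 | - | - | - | - | - |
| 0.3900 | 3400 | 5.7088 | - | - | - | - | - |
| 0.4130 | 3600 | 5.6988 | - | - | - | - | - |
| 0.4359 | 3800 | 5.5268 | - | - | - | - | - |
| 0.4589 | 4000 | 5.5543 | - | - | - | - | - |
| 0.4818 | 4200 | 5.3152 | - | - | - | - | - |
| 0.5048 | 4400 | 5.2894 | - | - | - | - | - |
| 0.5277 | 4600 | 5.1805 | - | - | - | - | - |
| 0.5506 | 4800 | 5.4559 | - | - | - | - | - |
| 0.5736 | 5000 | 5.3836 | - | - | - | - | - |
| 0.5965 | 5200 | 5.2626 | - | - | - | - | - |
| 0.6195 | 5400 | 5.2511 | - | - | - | - | - |
| 0.6424 | 5600 | 5.3308 | - | - | - | - | - |
| 0.6654 | 5800 | 5.2264 | - | - | - | - | - |
| 0.6883 | 6000 | 5.2881 | - | - | - | - | - |
| 0.7113 | 6200 | 5.1349 | - | - | - | - | - |
| 0.7342 | 6400 | 5.0872 | - | - | - | - | - |
| 0.7571 | 6600 | 4.5515 | - | - | - | - | - |
| 0.7801 | 6800 | 3.4312 | - | - | - | - | - |
| 0.8030 | 7000 | 3.1008 | - | - | - | - | - |
| 0.8260 | 7200 | 2.9582 | - | - | - | - | - |
| 0.8489 | 7400 | 2.8153 | - | - | - | - | - |
| 0.8719 | 7600 | 2.7214 | - | - | - | - | - |
| 0.8948 | 7800 | 2.5392 | - | - | - | - | - |
| 0.9177 | 8000 | 2.584 | - | - | - | - | - |
| 0.9407 | 8200 | 2.5384 | - | - | - | - | - |
| 0.9636 | 8400 | 2.4937 | - | - | - | - | - |
| 0.9866 | 8600 | 2.4155 | - | - | - | - | - |
| 1.0 | 8717 | - | 0.8118 | 0.8159 | 0.8212 | 0.7949 | 0.8205 |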

Framework Versions

  • Python: 3.9.18
  • Sentence Transformers: 3.0.1
  • Transformers: 4.40.0
  • PyTorch: 2.2.2+cu121
  • Accelerate: 0.26.1
  • Datasets: 2.19.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

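Acknowledgments

The author would like to thank Prince Sultan University for its invaluable support of this project. The assistance and resources it provided were critical to the development and fine-tuning of these models.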

## Citation

If you use the Arabic Matryoshka Embeddings Model, please cite it as follows:

@misc{nacar2024enhancingsemanticsimilarityunderstanding,
      title={Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning}, 
      author={Omer Nacar and Anis Koubaa},
      year={2024},
      eprint={2407.21139},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.21139}, 
}