
gte-large

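General Text Embeddings (GTE) model. Towards General Text Embeddings with Multi-stage Contrastive Learning

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently come in three sizes: [GTE-large], [GTE-base], and [GTE-small]. The models are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios. This lets them be applied to a variety of downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.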

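Metrics

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, see the [MTEB leaderboard].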

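| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [gte-large] | 0.67 | 1024 | 512 | 63.13 | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
| [gte-base] | 0.22 | 768 | 512 | 62.39 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
| [e5-large-v2] | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
| [e5-base-v2] | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
| [gte-small] | 0.07 | 384 | 512 | 61.36 | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
| [text-embedding-ada-002] | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
| [e5-small-v2] | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
| [sentence-t5-xxl] | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
| [all-mpnet-base-v2] | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
| [sgpt-bloom-7b1-msmarco] | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
| [all-MiniLM-L12-v2] | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
| [all-MiniLM-L6-v2] | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
| [contriever-base-msmarco] | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
| [sentence-t5-base] | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |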
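
The number in parentheses after each task category is the count of datasets in that category, and the counts sum to 56. As a rough sanity check, the sketch below assumes the Average (56) column is the mean over all 56 datasets, so that it can be approximated by weighting each category score by its dataset count. The dictionary names are illustrative only, and because the category scores in the table are already rounded, other rows may differ slightly in the last digit.

```python
# Dataset counts per MTEB task category, taken from the table header.
dataset_counts = {
    "Clustering": 11,
    "Pair Classification": 3,
    "Reranking": 4,
    "Retrieval": 15,
    "STS": 10,
    "Summarization": 1,
    "Classification": 12,
}

# Category scores for gte-large, copied from its table row.
gte_large_scores = {
    "Clustering": 46.84,
    "Pair Classification": 85.00,
    "Reranking": 59.13,
    "Retrieval": 52.22,
    "STS": 83.35,
    "Summarization": 31.66,
    "Classification": 73.33,
}

total_datasets = sum(dataset_counts.values())

# Weight each category score by the number of datasets it summarizes.
weighted_average = sum(
    dataset_counts[name] * gte_large_scores[name] for name in dataset_counts
) / total_datasets

print(total_datasets)              # 56
print(round(weighted_average, 2))  # 63.13
```

The result agrees with the 63.13 reported for gte-large in the Average (56) column.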

Usage

Code example

import torch.nn.functional as F
from torch import Tensor
from openmind import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

model_path = "SY_AICC/gte-large"

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Tokenize the input texts
# Tensors stay on the default device; `device` was never defined in this example
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
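
To make the pooling and scoring steps in the example easier to follow, here is a minimal pure-Python sketch of the same arithmetic on toy numbers: padded positions (mask value 0) are excluded from the mean, the pooled vector is L2-normalized, and the score is the dot product scaled by 100. The toy vectors and the helper names `l2_normalize` and `score` are illustrative assumptions, not model output or part of any library.

```python
import math

def average_pool(hidden_states, attention_mask):
    """Mean of the token vectors whose mask value is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    dim = len(hidden_states[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def score(a, b):
    """Dot product of two unit vectors (cosine similarity), scaled by 100."""
    return 100 * sum(x * y for x, y in zip(a, b))

# One toy "sentence": three token vectors, the last one is padding.
hidden_states = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = average_pool(hidden_states, attention_mask)
query = l2_normalize(pooled)

doc_same_direction = l2_normalize([4.0, 0.0])
doc_orthogonal = l2_normalize([0.0, 5.0])

print(pooled)                            # [2.0, 0.0]
print(score(query, doc_same_direction))  # 100.0
print(score(query, doc_orthogonal))      # 0.0
```

The padding vector `[9.0, 9.0]` has no effect on the pooled result, which is the purpose of the `masked_fill` call in `average_pool` above.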

Limitations

This model works only with English text, and any long text is truncated to at most 512 tokens.
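
One common workaround for the truncation limit, which is not described by the model authors, is to split a long token sequence into windows that each fit within the limit and embed each window separately. The sketch below shows only the splitting step on a plain list of token IDs. The function name `chunk_token_ids` and the `overlap` parameter are hypothetical, and the default `max_length=510` assumes two positions are reserved for the special tokens a BERT-style tokenizer adds. How to combine the per-window embeddings is left open.

```python
def chunk_token_ids(token_ids, max_length=510, overlap=0):
    """Split token IDs into windows of at most max_length, sharing `overlap` tokens."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once the current window already reaches the end of the input.
        if start + max_length >= len(token_ids):
            break
    return chunks

# Hypothetical input: 1200 token IDs, far beyond a single window.
long_input = list(range(1200))
windows = chunk_token_ids(long_input, max_length=510, overlap=60)

print([len(w) for w in windows])  # [510, 510, 300]
```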

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}