gte-large

General Text Embeddings (GTE) model, from the paper Towards General Text Embeddings with Multi-stage Contrastive Learning.

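The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently come in three sizes: [GTE-large], [GTE-base], and [GTE-small]. The models are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, which lets them be applied to various downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.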

Metrics

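We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the [MTEB leaderboard].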

Model Name	Model Size (GB)	Dimension	Sequence Length	Average (56)	Clustering (11)	Pair Classification (3)	Reranking (4)	Retrieval (15)	STS (10)	Summarization (1)	Classification (12)
[gte-large]	0.67	1024	512	63.13	46.84	85.00	59.13	52.22	83.35	31.66	73.33
[gte-base]	0.22	768	512	62.39	46.2	84.57	58.61	51.14	82.3	31.17	73.01
[e5-large-v2]	1.34	1024	512	62.25	44.49	86.03	56.61	50.56	82.05	30.19	75.24
[e5-base-v2]	0.44	768	512	61.5	43.80	85.73	55.91	50.29	81.05	30.28	73.84
[gte-small]	0.07	384	512	61.36	44.89	83.54	57.7	49.46	82.07	30.42	72.31
[text-embedding-ada-002]	-	1536	8192	60.99	45.9	84.89	56.32	49.25	80.97	30.8	70.93
[e5-small-v2]	0.13	384	512	59.93	39.92	84.67	54.32	49.04	80.39	31.16	72.94
[sentence-t5-xxl]	9.73	768	512	59.51	43.72	85.06	56.42	42.24	82.63	30.08	73.42
[all-mpnet-base-v2]	0.44	768	514	57.78	43.69	83.04	59.36	43.81	80.28	27.49	65.07
[sgpt-bloom-7b1-msmarco]	28.27	4096	2048	57.59	38.93	81.9	55.65	48.22	77.74	33.6	66.19
[all-MiniLM-L12-v2]	0.13	384	512	56.53	41.81	82.41	58.44	42.69	79.8	27.9	63.21
[all-MiniLM-L6-v2]	0.09	384	512	56.26	42.35	82.37	58.04	41.95	78.9	30.81	63.05
[contriever-base-msmarco]	0.44	768	512	56.00	41.1	82.54	53.14	41.88	76.51	30.36	66.68
[sentence-t5-base]	0.22	768	512	55.27	40.21	85.18	53.09	33.63	81.14	31.39	69.81

Usage

Code example

import torch.nn.functional as F
from torch import Tensor
from openmind import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

model_path = "SY_AICC/gte-large"

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
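
To make the pooling and scoring steps in the example above concrete, the following is a minimal, dependency-free sketch of the same arithmetic on toy data: a masked mean over token vectors, L2 normalization, and a dot product scaled by 100. The helper names (`masked_mean`, `l2_normalize`, `cosine_score`) and the two-dimensional vectors are illustrative assumptions, not part of the model or of any library API.

```python
import math
from typing import List


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average only the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = sum(mask)
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    totals = [0.0] * len(token_vectors[0])
    for vector, keep in zip(token_vectors, mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / kept for total in totals]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [value / norm for value in vector]


def cosine_score(a: List[float], b: List[float], scale: float = 100.0) -> float:
    """Cosine similarity of two vectors, multiplied by `scale`."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return scale * sum(x * y for x, y in zip(a_unit, b_unit))


# Toy "hidden states": the third row is padding and must not affect the mean.
query_tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
query_vector = masked_mean(query_tokens, query_mask)

doc_same_direction = [4.0, 0.0]
doc_orthogonal = [0.0, 5.0]

print(cosine_score(query_vector, doc_same_direction))
print(cosine_score(query_vector, doc_orthogonal))
```

Because the vectors are normalized before the dot product, the score depends only on direction, so a document pointing the same way as the query scores highest regardless of its length.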

Limitations

This model works only with English text, and any longer text is truncated to at most 512 tokens.
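
One common workaround for the 512-token limit, which is not something this model card prescribes, is to split a long token sequence into overlapping windows, embed each window separately, and then combine the window embeddings (for example by averaging). The sketch below covers only the splitting step. The function name `split_into_windows`, the default window of 510 tokens (assuming the tokenizer adds two special tokens, as BERT-style tokenizers typically do), and the overlap of 64 are all illustrative assumptions.

```python
from typing import List, Sequence


def split_into_windows(token_ids: Sequence[int],
                       window: int = 510,
                       overlap: int = 64) -> List[List[int]]:
    """Split token ids into consecutive windows that share `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if len(token_ids) == 0:
        return []

    step = window - overlap  # how far each new window advances
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += step
    return windows


# Hypothetical token ids standing in for a tokenized long document.
long_document = list(range(1200))
chunks = split_into_windows(long_document)
print([len(chunk) for chunk in chunks])
```

The overlap keeps sentences that straddle a window boundary visible in full to at least one window, at the cost of embedding some tokens twice.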

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}