
Modifications

  • Updated the example and added NPU support
  • Added the dependency list

gte-large-zh

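General Text Embeddings (GTE) model, introduced in "Towards General Text Embeddings with Multi-stage Contrastive Learning".

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently available in several sizes for both Chinese and English. The models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which allows them to be applied to a variety of downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.

Model List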

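| Model | Language | Max Sequence Length | Dimension | Model Size |
|---|---|---|---|---|
| GTE-large-zh | Chinese | 512 | 1024 | 0.67GB |
| GTE-base-zh | Chinese | 512 | 512 | 0.21GB |
| GTE-large | English | 512 | 1024 | 0.67GB |
| GTE-small | English | 512 | 384 | 0.10GB |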

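Metrics

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese).

  • Evaluation results on CMTEB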
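| Model | Model Size (GB) | Embedding Dimension | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
|---|---|---|---|---|---|---|---|---|---|---|
| gte-large-zh | 0.65 | 1024 | 512 | 66.72 | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
| gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
| stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
| stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
| piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
| gte-small-zh | 0.1 | 512 | 512 | 60.04 | 64.35 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
| bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
| m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
| text-embedding-ada-002 (openai) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |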

Dependencies

  • transformers==4.44.2
  • psutil==6.0.0
  • better_profanity==0.7.0
  • einops==0.6.1
  • protobuf==5.28.2
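
Before running the example, it can be useful to confirm that the environment matches these pins. The following is a minimal sketch, not part of the original model card: the helper names `installed_version` and `find_mismatches` are hypothetical, and the sketch assumes each pinned name can be looked up as an installed distribution through the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "transformers": "4.44.2",
    "psutil": "6.0.0",
    "better_profanity": "0.7.0",
    "einops": "0.6.1",
    "protobuf": "5.28.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple of (expected version, found version or None)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.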

Usage

Code Example

# coding=utf-8
import argparse

import torch
import torch.nn.functional as F
from openmind import AutoModel, AutoTokenizer, is_torch_npu_available

def parse_args():
    parser = argparse.ArgumentParser()
    # The path is required: from_pretrained() cannot load a model from None
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        required=True,
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
    input_texts = [
        "中国的首都是哪里",
        "你喜欢去哪里旅游",
        "北京",
        "今天中午吃什么"
    ]

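    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Move the model to the selected device and switch to inference mode
    model = AutoModel.from_pretrained(model_path).to(device)
    model.eval()

    # Tokenize the input texts and move the tensors to the same device
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt').to(device)

    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = outputs.last_hidden_state[:, 0]
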
    # (Optionally) normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    scores = (embeddings[:1] @ embeddings[1:].T) * 100
    print(scores.tolist())


if __name__ == "__main__":
    main()
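
To make the scoring step in the example concrete, here is a minimal pure-Python sketch of what the `F.normalize` and matrix-product lines compute: each vector is scaled to unit L2 norm, so the dot product equals the cosine similarity, which is then multiplied by 100. The helper names and the 3-dimensional toy vectors are made up for illustration; real embeddings from this model have 1024 dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_scores(query, candidates):
    """Cosine similarity (times 100) between one query and each candidate."""
    q = l2_normalize(query)
    scores = []
    for cand in candidates:
        c = l2_normalize(cand)
        scores.append(100.0 * sum(a * b for a, b in zip(q, c)))
    return scores


def rank(texts, scores):
    """Pair texts with scores and sort from most to least similar."""
    return sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model embeddings (hypothetical values).
query_vec = [1.0, 0.0, 0.0]
candidate_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
candidate_texts = ["text A", "text B", "text C"]

scores = similarity_scores(query_vec, candidate_vecs)
for text, score in rank(candidate_texts, scores):
    print(f"{score:6.2f}  {text}")
```

In the model example, the first input text plays the role of the query and the remaining three are the candidates, so sorting the printed scores gives a simple retrieval ranking.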

Limitations

This model supports Chinese text only, and longer inputs are truncated to a maximum of 512 tokens.
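
One common workaround for the length limit, which the model card itself neither describes nor evaluates, is to split a long token sequence into overlapping windows and embed each window separately. The sketch below shows only the windowing step on a plain list of token IDs. The function name `split_into_windows` is hypothetical, and the default of 510 assumes that two of the 512 positions are reserved for special tokens such as [CLS] and [SEP].

```python
def split_into_windows(token_ids, max_tokens=510, overlap=128):
    """Split token_ids into windows of at most max_tokens items,
    where consecutive windows share `overlap` items."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    if not token_ids:
        return []

    step = max_tokens - overlap  # how far each window start advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Example with 1000 dummy token IDs.
windows = split_into_windows(list(range(1000)))
print([len(w) for w in windows])
```

How the per-window embeddings are then combined (for example by averaging, or by keeping the best-scoring window) is an application-level choice and may affect quality.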

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}