Modifications

Updated the example and added NPU support
Added dependencies

gte-large-zh

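The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently available in several sizes for both Chinese and English. The models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which lets them be applied to various downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.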

Model List

Model	Language	Max Sequence Length	Dimensions	Model Size
GTE-large-zh	Chinese	512	1024	0.67GB
GTE-base-zh	Chinese	512	512	0.21GB
GTE-large	English	512	1024	0.67GB
GTE-small	English	512	384	0.10GB
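
The dimension column also determines how much space stored embeddings take up. As a rough sketch, assuming vectors are kept as uncompressed float32 (4 bytes per value) and ignoring any index overhead, the raw size can be estimated as follows. The helper name `embedding_storage_bytes` is hypothetical and used only for illustration.

```python
def embedding_storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value


# 1,000,000 texts embedded with a 1024-dimensional model, stored as float32
size = embedding_storage_bytes(1_000_000, 1024)
print(f"{size / 1e9:.2f} GB")  # 4.10 GB
```

Halving the dimension, or storing values as float16 (`bytes_per_value=2`), halves this figure.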

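Metrics

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese).

Evaluation results on CMTEB

Model	Model Size (GB)	Embedding Dimensions	Sequence Length	Average (35 datasets)	Classification (9 datasets)	Clustering (4 datasets)	Pair Classification (2 datasets)	Reranking (4 datasets)	Retrieval (8 datasets)	STS (8 datasets)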
gte-large-zh	0.65	1024	512	66.72	71.34	53.07	81.14	67.42	72.49	57.82
gte-base-zh	0.20	768	512	65.92	71.26	53.86	80.44	67.00	71.71	55.96
stella-large-zh-v2	0.65	1024	1024	65.13	69.05	49.16	82.68	66.41	70.14	58.66
stella-large-zh	0.65	1024	1024	64.54	67.62	48.65	78.72	65.98	71.02	58.3
bge-large-zh-v1.5	1.3	1024	512	64.53	69.13	48.99	81.6	65.84	70.46	56.25
stella-base-zh-v2	0.21	768	1024	64.36	68.29	49.4	79.96	66.1	70.08	56.92
stella-base-zh	0.21	768	1024	64.16	67.77	48.7	76.09	66.95	71.07	56.54
piccolo-large-zh	0.65	1024	512	64.11	67.03	47.04	78.38	65.98	70.93	58.02
piccolo-base-zh	0.2	768	512	63.66	66.98	47.12	76.61	66.68	71.2	55.9
gte-small-zh	0.1	512	512	60.04	64.35	48.95	69.99	66.21	65.50	49.72
bge-small-zh-v1.5	0.1	512	512	57.82	63.96	44.18	70.4	60.92	61.77	49.1
m3e-base	0.41	768	512	57.79	67.52	47.68	63.99	59.54	56.91	50.47
text-embedding-ada-002(openai)	-	1536	8192	53.02	64.31	45.68	69.56	54.28	52.0	43.35

Dependencies

transformers==4.44.2
psutil==6.0.0
better_profanity==0.7.0
einops==0.6.1
protobuf==5.28.2
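
Before running the example, it can help to confirm that the installed versions match these pins. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and it assumes each pinned name matches the name of the installed distribution.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    "transformers": "4.44.2",
    "psutil": "6.0.0",
    "better_profanity": "0.7.0",
    "einops": "0.6.1",
    "protobuf": "5.28.2",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for each mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If nothing is printed, every pinned package is installed at the expected version.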

Usage

Code example

# coding=utf-8
import argparse

import torch
import torch.nn.functional as F
from openmind import AutoModel, AutoTokenizer, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        required=True,
    )
    return parser.parse_args()


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use the first NPU when one is available, otherwise fall back to CPU
    device = "npu:0" if is_torch_npu_available() else "cpu"

    input_texts = [
        "中国的首都是哪里",
        "你喜欢去哪里旅游",
        "北京",
        "今天中午吃什么"
    ]

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path).to(device)
    model.eval()

    # Tokenize the input texts and move the tensors to the model's device
    batch_dict = tokenizer(
        input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt"
    ).to(device)

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**batch_dict)

    # Take the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = outputs.last_hidden_state[:, 0]

    # (Optionally) normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    # Similarity of the first text to each of the others, scaled by 100
    scores = (embeddings[:1] @ embeddings[1:].T) * 100
    print(scores.tolist())


if __name__ == "__main__":
    main()
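
Because the embeddings are L2-normalized before the matrix product, each printed score is the cosine similarity between the first text and one of the remaining texts, multiplied by 100. The pure-Python sketch below reproduces that arithmetic on toy 2-dimensional vectors so the scores are easier to interpret. The vectors and the helper names `l2_normalize` and `scaled_cosine_scores` are illustrative assumptions, not model outputs or library functions.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def scaled_cosine_scores(query, candidates):
    """Cosine similarity of `query` to each candidate, multiplied by 100."""
    q = l2_normalize(query)
    return [100 * sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]


# Toy vectors standing in for real embeddings
query = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

scores = scaled_cosine_scores(query, candidates)
# Index of the candidate most similar to the query
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

A candidate pointing in the same direction as the query scores 100 regardless of its length, and an orthogonal one scores 0. In the example above, the text with the highest score is the one the model considers closest in meaning to the first text.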

Limitations

This model works only with Chinese text, and longer inputs are truncated to a maximum of 512 tokens.
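
One common workaround for the length limit is to split a long document into overlapping windows and embed each window separately. The sketch below is a rough illustration, not part of the model's API: the helper `chunk_text` is hypothetical, and it assumes roughly one token per Chinese character, which is only an approximation. The default of 510 characters leaves room for the two special tokens a BERT-style tokenizer adds. For exact limits, count tokens with the model's tokenizer instead.

```python
def chunk_text(text, max_chars=510, overlap=50):
    """Split text into character windows of at most max_chars, overlapping by `overlap`."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once this window reaches the end of the text
        if start + max_chars >= len(text):
            break
    return chunks


long_text = "字" * 1200
chunks = chunk_text(long_text)
print([len(c) for c in chunks])  # [510, 510, 280]
```

Each chunk can then be passed to the tokenizer in place of the original text. How the per-chunk embeddings are combined afterwards (for example, averaging them or keeping the best-scoring chunk) is a design choice that depends on the task.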

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
