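This is a general-purpose embedding model: at test time, it maps any piece of text (e.g., a title, a sentence, or a document) to a fixed-length vector without further training. Through instructions, the embeddings become domain-specific (e.g., specialized for science or finance) and task-aware (e.g., customized for classification or information retrieval).
The model is easy to use with the sentence-transformer library.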
git clone https://github.com/HKUNLP/instructor-embedding
cd sentence-transformers
pip install -e .
You can then use the model like this to compute domain-specific and task-aware embeddings:
import torch
import torch_npu
from InstructorEmbedding import INSTRUCTOR
device = torch.device('npu:0')
# Load the model from the current directory and move it to the NPU.
model = INSTRUCTOR('./').to(device)
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title; Input:"
embeddings = model.encode([[instruction,sentence,0]])
print(embeddings)
You can further use the model to compute similarities between two groups of sentences, using customized embeddings.
from sklearn.metrics.pairwise import cosine_similarity
import torch
import torch_npu
from InstructorEmbedding import INSTRUCTOR
device = torch.device('npu:0')
model = INSTRUCTOR('./').to(device)
# Each entry is [instruction, text, 0].
sentences_a = [['Represent the Science sentence; Input: ','Parton energy loss in QCD matter',0],
               ['Represent the Financial statement; Input: ','The Federal Reserve on Wednesday raised its benchmark interest rate.',0]]
sentences_b = [['Represent the Science sentence; Input: ','The Chiral Phase Transition in Dissipative Dynamics',0],
               ['Represent the Financial statement; Input: ','The funds rose less than 0.5 per cent on Friday',0]]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)
print(similarities)
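To make the similarity step easier to follow, here is a minimal pure-Python sketch of the computation that a pairwise cosine similarity performs: entry (i, j) of the result compares row i of the first group with row j of the second. The helper name `cosine_similarity_matrix` and the toy vectors are hypothetical and stand in for real embeddings; this is an illustration of the idea, not the scikit-learn implementation.

```python
import math

def cosine_similarity_matrix(rows_a, rows_b):
    """Return a len(rows_a) x len(rows_b) matrix of cosine similarities."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    result = []
    for a in rows_a:
        row = []
        for b in rows_b:
            dot = sum(x * y for x, y in zip(a, b))
            denom = norm(a) * norm(b)
            # Assumption for this sketch: a zero vector gets similarity 0.
            row.append(dot / denom if denom else 0.0)
        result.append(row)
    return result

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
toy_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
toy_b = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_similarities = cosine_similarity_matrix(toy_a, toy_b)
print(toy_similarities)
```

Because the score depends only on the angle between vectors, `[1, 0, 0]` and `[2, 0, 0]` count as identical in direction even though their lengths differ.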