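This is the Investopedia embedding model developed by the FinLang team for finance applications. It was trained on our open-source finance dataset, available at https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset.
This model is an embedding model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search in RAG applications.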
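As a rough illustration of the semantic search use case, the sketch below ranks passages by cosine similarity to a query. It uses toy 3-dimensional vectors in place of the 768-dimensional embeddings the model would produce, and the helper names `cosine_similarity` and `rank_passages` are hypothetical, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): stand-ins for real 768-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```

In a real pipeline the vectors would come from the embedding model rather than being written by hand, and a vector store would typically handle the ranking step.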
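This project is for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses.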
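When indexing documents for a finance RAG application, simply specify the FinLang embedding model.
# Note: on llama-index >= 0.10 this class is imported from llama_index.embeddings.huggingface instead.
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="FinLang/investopedia_embedding")
Using this model is straightforward once sentence-transformers is installed (see https://huggingface.co/sentence-transformers):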
pip install -U sentence-transformers
You can then use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('FinLang/investopedia_embedding')
embeddings = model.encode(sentences)
print(embeddings)
Example test code:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("FinLang/investopedia_embedding")
query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys, and is it possible to decrypt a private key?"
query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency in a custodial relationship. While it is theoretically possible to decrypt a private key, with current technology, it would take centuries or millennia for the 115 quattuorvigintillion possibilities. Most hacks and thefts occur in wallets, where private keys are stored."
embedding_1 = model.encode(query_1)
embedding_2 = model.encode(query_2)
scores = (embedding_1*embedding_2).sum()
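print(scores)  # 0.862
We evaluated the model on unseen sentence pairs (to assess similarity) and on unseen shuffled sentence pairs (to assess dissimilarity). Our evaluation suite contains sentence pairs from Investopedia (to test proficiency in the finance domain) and from Gooaq, MSMARCO, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer (to assess how well the model avoids forgetting after fine-tuning).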
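The evaluation idea described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not give the exact metric or shuffling procedure, so mean cosine similarity and a simple rotation of the answers are used here as stand-ins. `toy_encode` is a hypothetical bag-of-words encoder so the example runs without downloading the model, and the three question/answer pairs are invented sample data.

```python
import math
from collections import Counter


def toy_encode(text):
    """Stand-in encoder (bag of words); swap in the real model's encoder in practice."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shuffle_pairs(pairs):
    """Rotate answers by one position so each question gets a non-matching answer."""
    questions = [q for q, _ in pairs]
    answers = [a for _, a in pairs]
    return list(zip(questions, answers[1:] + answers[:1]))


def mean_similarity(pairs, encode=toy_encode):
    """Average similarity between the two sides of each pair."""
    scores = [cosine(encode(q), encode(a)) for q, a in pairs]
    return sum(scores) / len(scores)


# Invented sample pairs (assumption), not drawn from the evaluation datasets.
pairs = [
    ("what is a bond", "a bond is a fixed income instrument"),
    ("what is a stock", "a stock is a share of ownership in a company"),
    ("what is inflation", "inflation is a rise in prices over time"),
]

matched = mean_similarity(pairs)
mismatched = mean_similarity(shuffle_pairs(pairs))
print(f"matched={matched:.3f} mismatched={mismatched:.3f}")
```

A well-behaved embedding model should score matched pairs clearly higher than shuffled ones. Running the same comparison on the general-domain datasets gives a rough check that fine-tuning has not degraded the base model.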
Because non-commercial datasets were used during fine-tuning, we release this model under the cc-by-nc-4.0 license.