FinLang/finance-embeddings-investopedia

This is the Investopedia embedding model developed by the FinLang team for finance applications. The model is trained on our open-sourced finance dataset, available at https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset.

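This is an embedding model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search in RAG applications.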

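This project is for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses.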

Plans

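The research paper will be published soon.
We are working on a v2 of the model that adds more financial data to the training corpus and uses improved techniques for training embeddings.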

Usage (LlamaIndex)

To use the FinLang embedding model in a finance RAG application, simply specify it as the embedding model during indexing.

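# Note: in llama-index 0.10 and later, this class is imported from
# llama_index.embeddings.huggingface (package: llama-index-embeddings-huggingface).
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="FinLang/investopedia_embedding")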

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed (see https://huggingface.co/sentence-transformers):

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('FinLang/investopedia_embedding')
embeddings = model.encode(sentences)
print(embeddings)
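
For the semantic search use case mentioned earlier, the vectors returned by `model.encode` can be ranked against a query vector. The following is a minimal sketch and not part of the original model card: `cosine_similarity` and `top_k` are hypothetical helper names, and the 3-dimensional toy vectors stand in for real 768-dimensional embeddings so the snippet runs without downloading the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k=3):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for model.encode(...) outputs (assumption for illustration).
corpus_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, corpus_vecs, k=2)
print(results)
```

In a real pipeline, `corpus_vecs` would be the encoded document chunks and `query_vec` the encoded user question; a vector store typically performs this ranking step for you.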

Example test code:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FinLang/investopedia_embedding")

query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys, and is it possible to decrypt a private key?"
query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency in a custodial relationship. While it is theoretically possible to decrypt a private key, with current technology, it would take centuries or millennia for the 115 quattuorvigintillion possibilities. Most hacks and thefts occur in wallets, where private keys are stored."

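# Encode both texts into 768-dimensional vectors
embedding_1 = model.encode(query_1)
embedding_2 = model.encode(query_2)
# Element-wise product summed over all dimensions, i.e. the dot product of the two embeddings
scores = (embedding_1*embedding_2).sum()
print(scores) # 0.862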
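
The score in the test above is a plain dot product. A dot product equals cosine similarity only when both vectors have unit length; whether `model.encode` returns unit-length vectors for this checkpoint is not stated here, so the sketch below treats that as an assumption and normalizes explicitly. The helpers `l2_normalize` and `dot` are hypothetical names, and the 2-dimensional vectors are toy values chosen for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit L2 length (hypothetical helper)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

# Toy vectors with length 5, so the raw dot product is not bounded by 1.
a = [3.0, 4.0]
b = [4.0, 3.0]

raw_score = dot(a, b)                                   # depends on vector length
cosine_score = dot(l2_normalize(a), l2_normalize(b))    # always within [-1, 1]
print(raw_score, cosine_score)
```

If the embeddings are already unit length, the two scores coincide and the extra normalization is harmless.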

Evaluation Results

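We evaluated the model on unseen sentence pairs (for similarity evaluation) and unseen shuffled sentence pairs (for dissimilarity evaluation). Our evaluation suite contains sentence pairs from Investopedia (to test proficiency in the finance domain) and from Gooaq, MSMARCO, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer (to assess whether the model avoids forgetting after fine-tuning).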
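
The exact evaluation procedure is not described here, so the following is only a sketch of the general idea under stated assumptions: matched pairs should score high, and shuffled pairs should score low. `make_shuffled_pairs`, `mean_score`, and `word_overlap` are hypothetical helpers, the three question/answer pairs are made-up toy data, and word overlap stands in for embedding similarity so the snippet runs without the model.

```python
def make_shuffled_pairs(pairs):
    """Pair each query with the next pair's answer (a rotation).

    Assuming all answers are distinct, no query keeps its own answer.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to shuffle")
    answers = [answer for _, answer in pairs]
    rotated = answers[1:] + answers[:1]
    return [(query, answer) for (query, _), answer in zip(pairs, rotated)]

def mean_score(pairs, score_fn):
    """Average of score_fn(query, answer) over all pairs."""
    return sum(score_fn(q, a) for q, a in pairs) / len(pairs)

def word_overlap(a, b):
    """Jaccard overlap of word sets; a toy stand-in for embedding similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

# Made-up toy pairs for illustration only.
pairs = [
    ("what is a bond", "a bond is a fixed income instrument"),
    ("what is a stock", "a stock is a share of ownership"),
    ("what is inflation", "inflation is a rise in prices"),
]
shuffled = make_shuffled_pairs(pairs)

similar = mean_score(pairs, word_overlap)        # matched pairs
dissimilar = mean_score(shuffled, word_overlap)  # mismatched pairs
print(similar, dissimilar)
```

With the real model, `score_fn` would encode both texts and return their similarity; a wide gap between the two averages suggests the embeddings separate related from unrelated text.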

License

Because non-commercial datasets were used for fine-tuning, we release this model under the cc-by-nc-4.0 license.

Citation [Coming Soon]