gliner-multitask-v1.0

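🚀 Meet the first multi-task prompt-tunable GLiNER model 🚀

GLiNER-Multitask is a model that extracts various kinds of information from plain text, guided by a custom prompt supplied by the user. It uses a bidirectional transformer encoder similar to BERT, which gives it strong generalization and compute efficiency despite its compact size.

The gliner-multitask-v1.0 variant achieves state-of-the-art performance on zero-shot NER benchmarks, demonstrating its robustness and flexibility. Beyond named entity recognition, it also handles a range of other information extraction tasks, which makes it useful across many natural language processing applications.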

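Supported tasks:

  • Named Entity Recognition (NER): identifies and categorizes entities in text, such as names, organizations, dates, and other specific items.
  • Relation Extraction: detects and classifies relationships between entities in the text.
  • Summarization: extracts the most important sentences that summarize the input text, capturing the essential information.
  • Sentiment Extraction: identifies the parts of the text that express a positive, negative, or neutral sentiment.
  • Key-Phrase Extraction: identifies and extracts important phrases and keywords from the text.
  • Question Answering: finds the answer to a given question in the text.
  • Open Information Extraction: extracts pieces of text in response to an open prompt from the user, for example product description extraction.
  • Text Classification: classifies text by matching the labels specified in the prompt.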

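Installation

To use this model, you must install the GLiNER Python library:

pip install gliner

Once the GLiNER library is installed, you can import the GLiNER class and load this model with GLiNER.from_pretrained.

How to use for NER: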

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-multitask-v1.0")

text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""

labels = ["founder", "computer", "software", "position", "date"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
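
The list returned by `predict_entities` holds one dictionary per extracted span, so a small helper can regroup it by label. This is a minimal sketch, assuming each dictionary carries the `text` and `label` keys used above; `group_by_label` is a hypothetical helper and `sample` is hand-written illustrative data, not real model output:

```python
from collections import defaultdict


def group_by_label(entities):
    """Collect entity texts under their label, dropping repeats but keeping first-seen order."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["text"] not in grouped[entity["label"]]:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-in for model output (hypothetical values).
sample = [
    {"text": "Bill Gates", "label": "founder"},
    {"text": "Paul Allen", "label": "founder"},
    {"text": "April 4, 1975", "label": "date"},
    {"text": "Bill Gates", "label": "founder"},
]

print(group_by_label(sample))
```

In practice the `entities` list from the NER example would be passed in place of `sample`.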

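Performance:

| Model | Dataset | Precision | Recall | F1 Score | F1 Score (Decimal) |
|---|---|---|---|---|---|
| knowledgator/gliner-multitask-v0.5 | CrossNER_AI | 51.00% | 51.11% | 51.05% | 0.5105 |
| | CrossNER_literature | 72.65% | 65.62% | 68.96% | 0.6896 |
| | CrossNER_music | 74.91% | 73.70% | 74.30% | 0.7430 |
| | CrossNER_politics | 78.84% | 77.71% | 78.27% | 0.7827 |
| | CrossNER_science | 69.20% | 65.48% | 67.29% | 0.6729 |
| | mit-movie | 61.29% | 52.59% | 56.60% | 0.5660 |
| | mit-restaurant | 50.65% | 38.13% | 43.51% | 0.4351 |
| | Average | | | | 0.6276 |
| knowledgator/gliner-multitask-v1.0 | CrossNER_AI | 67.15% | 56.10% | 61.13% | 0.6113 |
| | CrossNER_literature | 71.60% | 64.74% | 68.00% | 0.6800 |
| | CrossNER_music | 73.57% | 69.29% | 71.36% | 0.7136 |
| | CrossNER_politics | 77.54% | 76.52% | 77.03% | 0.7703 |
| | CrossNER_science | 74.54% | 66.00% | 70.01% | 0.7001 |
| | mit-movie | 61.86% | 42.02% | 50.04% | 0.5004 |
| | mit-restaurant | 58.87% | 36.67% | 45.19% | 0.4519 |
| | Average | | | | 0.6325 |
| knowledgator/gliner-llama-multitask-1B-v1.0 | CrossNER_AI | 63.24% | 55.60% | 59.17% | 0.5917 |
| | CrossNER_literature | 69.74% | 60.10% | 64.56% | 0.6456 |
| | CrossNER_music | 74.03% | 67.22% | 70.46% | 0.7046 |
| | CrossNER_politics | 76.96% | 71.64% | 74.20% | 0.7420 |
| | CrossNER_science | 73.79% | 63.73% | 68.39% | 0.6839 |
| | mit-movie | 56.89% | 46.70% | 51.30% | 0.5130 |
| | mit-restaurant | 48.45% | 38.13% | 42.67% | 0.4267 |
| | Average | | | | 0.6153 |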
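
For readers who want to see how the columns relate, the sketch below recomputes one F1 value and one average from the figures reported for gliner-multitask-v1.0. It assumes the F1 column is the usual harmonic mean of precision and recall and that the average is a plain mean of the per-dataset F1 scores; `f1_score` is a hypothetical helper written for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CrossNER_AI row for knowledgator/gliner-multitask-v1.0 (percentages from the table).
print(round(f1_score(67.15, 56.10), 2))

# Decimal F1 scores of the seven gliner-multitask-v1.0 rows, in table order.
v1_f1 = [0.6113, 0.6800, 0.7136, 0.7703, 0.7001, 0.5004, 0.4519]
print(round(sum(v1_f1) / len(v1_f1), 4))
```

Under those assumptions the two printed values line up with the 61.13% and 0.6325 entries reported for that model.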

How to use for relation extraction:

text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""

labels = ["Microsoft <> founder", "Microsoft <> inception date", "Bill Gates <> held position"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["label"], "=>", entity["text"])
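
Each label above follows a "head <> relation" pattern, and the predicted span plays the role of the tail. A minimal sketch of turning such predictions into triples is shown here; it assumes the ` <> ` separator used in the labels above, `to_triples` is a hypothetical helper, and `sample` is hand-written data rather than real model output:

```python
def to_triples(predictions, separator=" <> "):
    """Turn 'head <> relation' labels plus predicted spans into (head, relation, tail) triples."""
    triples = []
    for item in predictions:
        head, found, relation = item["label"].partition(separator)
        if not found:
            continue  # label does not follow the 'head <> relation' pattern
        triples.append((head.strip(), relation.strip(), item["text"]))
    return triples


# Hand-written stand-in for model output (hypothetical values).
sample = [
    {"label": "Microsoft <> founder", "text": "Bill Gates"},
    {"label": "Microsoft <> founder", "text": "Paul Allen"},
    {"label": "Microsoft <> inception date", "text": "April 4, 1975"},
    {"label": "date", "text": "May 2014"},  # plain NER label, skipped
]

for triple in to_triples(sample):
    print(triple)
```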

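Construct a relation extraction pipeline with utca

First, import the necessary components of the library and initialize the predictor (the GLiNER model), then construct a pipeline that combines NER and relation extraction: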

from utca.core import RenameAttribute
from utca.implementation.predictors import (
    GLiNERPredictor,
    GLiNERPredictorConfig
)
from utca.implementation.tasks import (
    GLiNER,
    GLiNERPreprocessor,
    GLiNERRelationExtraction,
    GLiNERRelationExtractionPreprocessor,
)

predictor = GLiNERPredictor( # Predictor manages the model that will be used by tasks
    GLiNERPredictorConfig(
        model_name = "knowledgator/gliner-multitask-v1.0", # Model to use
        device = "cuda:0", # Device to use
    )
)

pipe = (
    GLiNER( # GLiNER task produces classified entities that will be at the "output" key.
        predictor=predictor,
        preprocess=GLiNERPreprocessor(threshold=0.7) # Entities threshold
    ) 
    | RenameAttribute("output", "entities") # Rename output entities from GLiNER task to use them as inputs in GLiNERRelationExtraction
    | GLiNERRelationExtraction( # GLiNERRelationExtraction is used for relation extraction.
        predictor=predictor,
        preprocess=(
            GLiNERPreprocessor(threshold=0.5) # Relations threshold
            | GLiNERRelationExtractionPreprocessor()
        )
    )
)

To run the pipeline, specify the entity types and the relations with their parameters:

r = pipe.run({
    "text": text, # Text to process
    "labels": ["organisation", "founder", "position", "date"],
    "relations": [{ # Relation parameters
        "relation": "founder", # Relation label. Required parameter.
        "pairs_filter": [("organisation", "founder")], # Optional parameter. It specifies possible members of relations by their entity labels.
        "distance_threshold": 100, # Optional parameter. It specifies the max distance between spans in the text (i.e., the end of the span that is closer to the start of the text and the start of the next one).
    }, {
        "relation": "inception date",
        "pairs_filter": [("organisation", "date")],
    }, {
        "relation": "held position",
        "pairs_filter": [("founder", "position")],
    }]
})

print(r["output"])
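
The comments on `pairs_filter` and `distance_threshold` describe two filters on candidate entity pairs. The sketch below illustrates that idea in plain Python. It is not utca's implementation: it assumes entities are dictionaries with `label`, `start`, and `end` character offsets, the helper names are made up for this illustration, and the offsets in `sample` are hypothetical.

```python
from itertools import permutations


def span_gap(a, b):
    """Characters between two spans: end of the earlier one to start of the later one."""
    first, second = sorted((a, b), key=lambda e: e["start"])
    return max(0, second["start"] - first["end"])


def candidate_pairs(entities, pairs_filter, distance_threshold=None):
    """Keep ordered (source, target) pairs whose labels are allowed and whose spans are close enough."""
    allowed = set(pairs_filter)
    kept = []
    for source, target in permutations(entities, 2):
        if (source["label"], target["label"]) not in allowed:
            continue
        if distance_threshold is not None and span_gap(source, target) > distance_threshold:
            continue
        kept.append((source["text"], target["text"]))
    return kept


# Hypothetical entities with illustrative character offsets.
sample = [
    {"text": "Microsoft", "label": "organisation", "start": 1, "end": 10},
    {"text": "Bill Gates", "label": "founder", "start": 26, "end": 36},
    {"text": "May 2014", "label": "date", "start": 322, "end": 330},
]

print(candidate_pairs(sample, [("organisation", "founder")], distance_threshold=100))
print(candidate_pairs(sample, [("organisation", "date")], distance_threshold=100))
```

With these sample offsets, the organisation and founder spans sit close together and survive the filter, while the organisation and date spans are more than 100 characters apart and are dropped.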

Performance:

| Model | Dataset | Precision | Recall | F1 Score |
|---|---|---|---|---|
| knowledgator/gliner-llama-multitask-1B-v1.0 | CrossRe | 0.606472 | 0.511444 | 0.554919 |
| | DocRed | 0.707483 | 0.589355 | 0.643039 |
| knowledgator/gliner-multitask-v0.5 | CrossRe | 0.585319 | 0.800176 | 0.676088 |
| | DocRed | 0.713392 | 0.772826 | 0.74192 |
| knowledgator/gliner-multitask-v1.0 | CrossRe | 0.760653 | 0.738556 | 0.749442 |
| | DocRed | 0.770644 | 0.761373 | 0.76598 |

How to use for open information extraction:

prompt = """Find all positive aspects about the product:\n"""
text = """
I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.

The headphones themselves are remarkable. The noise-canceling feature works like a charm in the bustling city environment, and the 30-hour battery life means I don't have to charge them every day. Connecting them to my Samsung Galaxy S21 was a breeze, and the sound quality is second to none.

I also appreciated the customer service from Amazon when I had a question about the warranty. They responded within an hour and provided all the information I needed.

However, the headphones did not come with a hard case, which was listed in the product description. I contacted Amazon, and they offered a 10% discount on my next purchase as an apology.

Overall, I'd give these headphones a 4.5/5 rating and highly recommend them to anyone looking for top-notch quality in both product and service.
"""

input_ = prompt+text

labels = ["match"]

matches = model.predict_entities(input_, labels)

for match in matches:
    print(match["text"], "=>", match["score"])

Performance:

Dataset: WiRe57_343-manual-oie

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 0.9047 | 0.2794 | 0.4269 |
| knowledgator/gliner-multitask-v0.5 | 0.9278 | 0.2779 | 0.4287 |
| knowledgator/gliner-multitask-v1.0 | 0.8775 | 0.2733 | 0.4168 |

How to use for question answering:

question = "Who was the CEO of Microsoft?"
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""

labels = ["answer"]

input_ = question+text
answers = model.predict_entities(input_, labels)

for answer in answers:
    print(answer["text"], "=>", answer["score"])
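
The model may return several candidate spans for one question, or none at all. A minimal sketch of reducing them to a single answer follows, assuming the `text` and `score` keys printed above; `best_answer` and its `min_score` parameter are hypothetical, and `sample` is hand-written data rather than real model output:

```python
def best_answer(answers, min_score=0.0):
    """Return the highest-scoring span text, or None when nothing clears min_score."""
    kept = [a for a in answers if a["score"] >= min_score]
    if not kept:
        return None
    return max(kept, key=lambda a: a["score"])["text"]


# Hand-written stand-in for model output (hypothetical values).
sample = [
    {"text": "Bill Gates", "score": 0.62},
    {"text": "Gates", "score": 0.91},
]

print(best_answer(sample))
print(best_answer([]))
```

Returning None for an empty result gives the caller a way to treat the question as unanswered instead of forcing a low-confidence span.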

Performance:

Dataset: SQuAD 2.0

| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 0.578296 | 0.795821 | 0.669841 |
| knowledgator/gliner-multitask-v0.5 | 0.429213 | 0.94378 | 0.590072 |
| knowledgator/gliner-multitask-v1.0 | 0.601354 | 0.874784 | 0.712745 |

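How to use for summarization:

The threshold parameter controls how much information is extracted: a lower threshold keeps more spans, a higher one keeps fewer.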

prompt = "Summarize the given text, highlighting the most important information:\n"

text = """
Several studies have reported its pharmacological activities, including anti-inflammatory, antimicrobial, and antitumoral effects.
The effect of E-anethole was studied in the osteosarcoma MG-63 cell line, and the antiproliferative activity was evaluated by an MTT assay.
It showed a GI50 value of 60.25 μM with apoptosis induction through the mitochondrial-mediated pathway. Additionally, it induced cell cycle arrest at the G0/G1 phase, up-regulated the expression of p53, caspase-3, and caspase-9, and down-regulated Bcl-xL expression.
Moreover, the antitumoral activity of anethole was assessed against oral tumor Ca9-22 cells, and the cytotoxic effects were evaluated by MTT and LDH assays.
It demonstrated a LD50 value of 8 μM, and cellular proliferation was 42.7% and 5.2% at anethole concentrations of 3 μM and 30 μM, respectively.
It was reported that it could selectively and in a dose-dependent manner decrease cell proliferation and induce apoptosis, as well as induce autophagy, decrease ROS production, and increase glutathione activity. The cytotoxic effect was mediated through NF-kB, MAP kinases, Wnt, caspase-3 and -9, and PARP1 pathways. Additionally, treatment with anethole inhibited cyclin D1 oncogene expression, increased cyclin-dependent kinase inhibitor p21WAF1, up-regulated p53 expression, and inhibited the EMT markers.
"""

labels = ["summary"]

input_ = prompt+text

threshold = 0.1
summaries = model.predict_entities(input_, labels, threshold=threshold)

for summary in summaries:
    print(summary["text"], "=>", summary["score"])
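
Because the summary is extractive, the output is a list of scored spans rather than a single string. The sketch below shows one way to assemble them and how the threshold changes the result. It assumes each span dictionary also carries a `start` character offset alongside `text` and `score`; `assemble_summary` is a hypothetical helper, and the scores and offsets in `sample` are invented for illustration.

```python
def assemble_summary(spans, threshold):
    """Join spans scoring at least `threshold`, ordered by position in the source text."""
    kept = [s for s in spans if s["score"] >= threshold]
    kept.sort(key=lambda s: s["start"])
    return " ".join(s["text"].strip() for s in kept)


# Hand-written stand-in for model output (hypothetical scores and offsets).
sample = [
    {"text": "It showed a GI50 value of 60.25 μM.", "start": 270, "score": 0.35},
    {"text": "Several studies have reported its pharmacological activities.", "start": 1, "score": 0.80},
    {"text": "It demonstrated a LD50 value of 8 μM.", "start": 690, "score": 0.12},
]

print(assemble_summary(sample, threshold=0.1))  # all three spans, in text order
print(assemble_summary(sample, threshold=0.5))  # only the highest-scoring span
```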

How to use for text classification:

The threshold parameter lets you trade off recall against precision in text classification.

prompt = "Classify text into the following classes: positive review, negative review"

text = """
"I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.
"""

labels = ["match"]

input_ = prompt+text

threshold = 0.5
classes = model.predict_entities(input_, labels, threshold=threshold)

for label in classes:
    print(label["text"], "=>", label["score"])
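
In this setup the class names live inside the prompt, and the model marks the matching one as a span. A minimal sketch of mapping those spans back to the known class list is given here, assuming the `text` and `score` keys printed above; `pick_classes` is a hypothetical helper and `sample` is hand-written data, not real model output:

```python
def pick_classes(matches, classes, threshold=0.5):
    """Map extracted spans back to known class names and keep the best score per class."""
    known = {c.lower(): c for c in classes}
    scores = {}
    for match in matches:
        name = known.get(match["text"].strip().lower())
        if name is None or match["score"] < threshold:
            continue  # span is not one of the classes, or is too uncertain
        scores[name] = max(scores.get(name, 0.0), match["score"])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


classes = ["positive review", "negative review"]

# Hand-written stand-in for model output (hypothetical values).
sample = [
    {"text": "positive review", "score": 0.93},
    {"text": "negative review", "score": 0.21},
    {"text": "Amazon", "score": 0.70},  # a span that is not a class name
]

print(pick_classes(sample, classes))
print(pick_classes(sample, classes, threshold=0.2))
```

Lowering the threshold admits the weaker class as well, which mirrors the recall and precision trade-off described above.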

Performance:

| Model | Dataset | Micro F1 Score |
|---|---|---|
| knowledgator/gliner-multitask-v1.0 | Emotion | 0.322 |
| | AG News | 0.7436 |
| | IMDb | 0.7907 |
| knowledgator/gliner-llama-multitask-1B-v1.0 | Emotion | 0.3475 |
| | AG News | 0.7436 |
| | IMDb | 0.7907 |

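Extensive NER benchmarks:

Model performance

Across a range of zero-shot benchmarks, our multitask model performs comparably to dedicated models that focus on NER alone (all labels were lowercased in this test):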

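| Dataset | Precision | Recall | F1 Score | F1 Score (Decimal) |
|---|---|---|---|---|
| ACE 2004 | 53.25% | 23.20% | 32.32% | 0.3232 |
| ACE 2005 | 43.25% | 18.00% | 25.42% | 0.2542 |
| AnatEM | 51.75% | 25.98% | 34.59% | 0.3459 |
| Broad Tweet Corpus | 69.54% | 72.50% | 70.99% | 0.7099 |
| CoNLL 2003 | 68.33% | 68.43% | 68.38% | 0.6838 |
| CrossNER_AI | 67.15% | 56.10% | 61.13% | 0.6113 |
| CrossNER_literature | 71.60% | 64.74% | 68.00% | 0.6800 |
| CrossNER_music | 73.57% | 69.29% | 71.36% | 0.7136 |
| CrossNER_politics | 77.54% | 76.52% | 77.03% | 0.7703 |
| CrossNER_science | 74.54% | 66.00% | 70.01% | 0.7001 |
| FabNER | 69.28% | 62.62% | 65.78% | 0.6578 |
| FindVehicle | 49.75% | 51.25% | 50.49% | 0.5049 |
| GENIA_NER | 60.98% | 46.91% | 53.03% | 0.5303 |
| HarveyNER | 24.27% | 35.66% | 28.88% | 0.2888 |
| MultiNERD | 54.33% | 89.34% | 67.57% | 0.6757 |
| Ontonotes | 27.26% | 36.64% | 31.26% | 0.3126 |
| PolyglotNER | 33.54% | 64.29% | 44.08% | 0.4408 |
| TweetNER7 | 44.77% | 38.67% | 41.50% | 0.4150 |
| WikiANN en | 56.33% | 57.09% | 56.71% | 0.5671 |
| WikiNeural | 71.70% | 86.60% | 78.45% | 0.7845 |
| bc2gm | 64.71% | 51.68% | 57.47% | 0.5747 |
| bc4chemd | 69.24% | 50.08% | 58.12% | 0.5812 |
| bc5cdr | 79.22% | 69.19% | 73.87% | 0.7387 |
| mit-movie | 61.86% | 42.02% | 50.04% | 0.5004 |
| mit-restaurant | 58.87% | 36.67% | 45.19% | 0.4519 |
| ncbi | 68.72% | 54.86% | 61.01% | 0.6101 |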

Join our Discord

Connect with our community on Discord for news, support, and discussion about our models. Join Discord.

Citation:

@misc{stepanov2024gliner,
      title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks}, 
      author={Ihor Stepanov and Mykhailo Shtopko},
      year={2024},
      eprint={2406.12925},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}