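🚀 Meet the first multi-task prompt-tunable GLiNER model 🚀
GLiNER-Multitask is a model designed to extract various pieces of information from plain text based on a user-provided custom prompt. This versatile model uses a bidirectional transformer encoder, similar to BERT, which gives it both high generalization and compute efficiency despite its compact size.
The gliner-multitask-v1.0 variant achieves state-of-the-art performance on NER zero-shot benchmarks, demonstrating its robustness and flexibility. It excels not only at named entity recognition but also at various other information extraction tasks, making it a powerful tool for diverse natural language processing applications.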
To use this model, you must install the GLiNER Python library:
pip install gliner
Once the GLiNER library is installed, you can import the GLiNER class and load this model with GLiNER.from_pretrained.
How to use for NER:
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-v1.0")
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""
labels = ["founder", "computer", "software", "position", "date"]
entities = model.predict_entities(text, labels)
for entity in entities:
    print(entity["text"], "=>", entity["label"])

| Model | Dataset | Precision | Recall | F1 score | F1 score (decimal) |
|---|---|---|---|---|---|
| knowledgator/gliner-multitask-v0.5 | CrossNER_AI | 51.00% | 51.11% | 51.05% | 0.5105 |
| | CrossNER_literature | 72.65% | 65.62% | 68.96% | 0.6896 |
| | CrossNER_music | 74.91% | 73.70% | 74.30% | 0.7430 |
| | CrossNER_politics | 78.84% | 77.71% | 78.27% | 0.7827 |
| | CrossNER_science | 69.20% | 65.48% | 67.29% | 0.6729 |
| | mit-movie | 61.29% | 52.59% | 56.60% | 0.5660 |
| | mit-restaurant | 50.65% | 38.13% | 43.51% | 0.4351 |
| | Average | | | | 0.6276 |
| knowledgator/gliner-multitask-v1.0 | CrossNER_AI | 67.15% | 56.10% | 61.13% | 0.6113 |
| | CrossNER_literature | 71.60% | 64.74% | 68.00% | 0.6800 |
| | CrossNER_music | 73.57% | 69.29% | 71.36% | 0.7136 |
| | CrossNER_politics | 77.54% | 76.52% | 77.03% | 0.7703 |
| | CrossNER_science | 74.54% | 66.00% | 70.01% | 0.7001 |
| | mit-movie | 61.86% | 42.02% | 50.04% | 0.5004 |
| | mit-restaurant | 58.87% | 36.67% | 45.19% | 0.4519 |
| | Average | | | | 0.6325 |
| knowledgator/gliner-llama-multitask-1B-v1.0 | CrossNER_AI | 63.24% | 55.60% | 59.17% | 0.5917 |
| | CrossNER_literature | 69.74% | 60.10% | 64.56% | 0.6456 |
| | CrossNER_music | 74.03% | 67.22% | 70.46% | 0.7046 |
| | CrossNER_politics | 76.96% | 71.64% | 74.20% | 0.7420 |
| | CrossNER_science | 73.79% | 63.73% | 68.39% | 0.6839 |
| | mit-movie | 56.89% | 46.70% | 51.30% | 0.5130 |
| | mit-restaurant | 48.45% | 38.13% | 42.67% | 0.4267 |
| | Average | | | | 0.6153 |
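
The F1 column in the table above is the harmonic mean of precision and recall, and the "Average" rows appear to be a plain mean of the per-dataset F1 values. The sketch below recomputes both for gliner-multitask-v1.0 from the rounded figures in the table; `f1_score` and `v1_rows` are illustrative names, not part of the GLiNER library, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for knowledgator/gliner-multitask-v1.0, copied from the table.
v1_rows = {
    "CrossNER_AI": (0.6715, 0.5610),
    "CrossNER_literature": (0.7160, 0.6474),
    "CrossNER_music": (0.7357, 0.6929),
    "CrossNER_politics": (0.7754, 0.7652),
    "CrossNER_science": (0.7454, 0.6600),
    "mit-movie": (0.6186, 0.4202),
    "mit-restaurant": (0.5887, 0.3667),
}

# Per-dataset F1, then the unweighted mean across datasets.
f1_by_dataset = {name: f1_score(p, r) for name, (p, r) in v1_rows.items()}
macro_f1 = sum(f1_by_dataset.values()) / len(f1_by_dataset)

for name, value in f1_by_dataset.items():
    print(f"{name}: {value:.4f}")
print(f"average: {macro_f1:.4f}")
```
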
How to use for relation extraction:
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""
labels = ["Microsoft <> founder", "Microsoft <> inception date", "Bill Gates <> held position"]
entities = model.predict_entities(text, labels)
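for entity in entities:
    print(entity["label"], "=>", entity["text"])
First, import the necessary components of the utca library and initialize the predictor (the GLiNER model), then construct a pipeline that combines NER and relation extraction: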
from utca.core import RenameAttribute
from utca.implementation.predictors import (
    GLiNERPredictor,
    GLiNERPredictorConfig
)
from utca.implementation.tasks import (
    GLiNER,
    GLiNERPreprocessor,
    GLiNERRelationExtraction,
    GLiNERRelationExtractionPreprocessor,
)
predictor = GLiNERPredictor( # Predictor manages the model that will be used by tasks
    GLiNERPredictorConfig(
        model_name = "knowledgator/gliner-multitask-v1.0", # Model to use
        device = "cuda:0", # Device to use
    )
)
pipe = (
    GLiNER( # GLiNER task produces classified entities that will be at the "output" key.
        predictor=predictor,
        preprocess=GLiNERPreprocessor(threshold=0.7) # Entities threshold
    )
    | RenameAttribute("output", "entities") # Rename output entities from GLiNER task to use them as inputs in GLiNERRelationExtraction
    | GLiNERRelationExtraction( # GLiNERRelationExtraction is used for relation extraction.
        predictor=predictor,
        preprocess=(
            GLiNERPreprocessor(threshold=0.5) # Relations threshold
            | GLiNERRelationExtractionPreprocessor()
        )
    )
)
To run the pipeline, specify the entity types and the relations with their parameters:
r = pipe.run({
    "text": text, # Text to process
    "labels": ["organisation", "founder", "position", "date"],
    "relations": [{ # Relation parameters
        "relation": "founder", # Relation label. Required parameter.
        "pairs_filter": [("organisation", "founder")], # Optional parameter. It specifies possible members of relations by their entity labels.
        "distance_threshold": 100, # Optional parameter. It specifies the max distance between spans in the text (i.e., the end of the span that is closer to the start of the text and the start of the next one).
    }, {
        "relation": "inception date",
        "pairs_filter": [("organisation", "date")],
    }, {
        "relation": "held position",
        "pairs_filter": [("founder", "position")],
    }]
})
print(r["output"])

| Model | Dataset | Precision | Recall | F1 score |
|---|---|---|---|---|
| knowledgator/gliner-llama-multitask-1B-v1.0 | CrossRe | 0.606472 | 0.511444 | 0.554919 |
| | DocRed | 0.707483 | 0.589355 | 0.643039 |
| knowledgator/gliner-multitask-v0.5 | CrossRe | 0.585319 | 0.800176 | 0.676088 |
| | DocRed | 0.713392 | 0.772826 | 0.74192 |
| knowledgator/gliner-multitask-v1.0 | CrossRe | 0.760653 | 0.738556 | 0.749442 |
| | DocRed | 0.770644 | 0.761373 | 0.76598 |
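
The `pairs_filter` and `distance_threshold` comments in the pipeline call describe how candidate entity pairs are narrowed down before relations are scored. The sketch below is one plausible reading of that description in plain Python, not the utca implementation: `span_distance`, `candidate_pairs`, and `make_entity` are hypothetical helpers, the entity dictionaries only imitate the `start`/`end`/`text`/`label` keys returned by `predict_entities`, and treating each filter entry as an ordered (source label, target label) pair is an assumption.

```python
from itertools import permutations


def span_distance(a: dict, b: dict) -> int:
    """Characters between the end of the earlier span and the start of the later one."""
    first, second = sorted((a, b), key=lambda span: span["start"])
    return max(0, second["start"] - first["end"])


def candidate_pairs(entities, pairs_filter, distance_threshold=None):
    """Keep ordered entity pairs whose labels are allowed and that are close enough."""
    allowed = set(pairs_filter)
    pairs = []
    for source, target in permutations(entities, 2):
        if (source["label"], target["label"]) not in allowed:
            continue
        if distance_threshold is not None and span_distance(source, target) > distance_threshold:
            continue
        pairs.append((source["text"], target["text"]))
    return pairs


def make_entity(text: str, surface: str, label: str) -> dict:
    """Build a mock entity by locating `surface` in `text` (illustration only)."""
    start = text.find(surface)
    return {"start": start, "end": start + len(surface), "text": surface, "label": label}


sample = "Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975."
entities = [
    make_entity(sample, "Microsoft", "organisation"),
    make_entity(sample, "Bill Gates", "founder"),
    make_entity(sample, "Paul Allen", "founder"),
]
pairs_filter = [("organisation", "founder")]

print(candidate_pairs(entities, pairs_filter, distance_threshold=100))
print(candidate_pairs(entities, pairs_filter, distance_threshold=20))
```

With a generous threshold both founders remain candidates for "Microsoft"; a tight threshold keeps only the nearer span, which is the trade-off the `distance_threshold` parameter exposes.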
How to use for open information extraction:
prompt = """Find all positive aspects about the product:\n"""
text = """
I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.
The headphones themselves are remarkable. The noise-canceling feature works like a charm in the bustling city environment, and the 30-hour battery life means I don't have to charge them every day. Connecting them to my Samsung Galaxy S21 was a breeze, and the sound quality is second to none.
I also appreciated the customer service from Amazon when I had a question about the warranty. They responded within an hour and provided all the information I needed.
However, the headphones did not come with a hard case, which was listed in the product description. I contacted Amazon, and they offered a 10% discount on my next purchase as an apology.
Overall, I'd give these headphones a 4.5/5 rating and highly recommend them to anyone looking for top-notch quality in both product and service.
"""
input_ = prompt+text
labels = ["match"]
matches = model.predict_entities(input_, labels)
for match in matches:
    print(match["text"], "=>", match["score"])
Dataset: WiRe57_343-manual-oie

| Model | Precision | Recall | F1 score |
|---|---|---|---|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 0.9047 | 0.2794 | 0.4269 |
| knowledgator/gliner-multitask-v0.5 | 0.9278 | 0.2779 | 0.4287 |
| knowledgator/gliner-multitask-v1.0 | 0.8775 | 0.2733 | 0.4168 |
How to use for question answering:
question = "Who was the CEO of Microsoft?"
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""
labels = ["answer"]
input_ = question+text
answers = model.predict_entities(input_, labels)
for answer in answers:
    print(answer["text"], "=>", answer["score"])
Dataset: SQuAD 2.0

| Model | Precision | Recall | F1 score |
|---|---|---|---|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 0.578296 | 0.795821 | 0.669841 |
| knowledgator/gliner-multitask-v0.5 | 0.429213 | 0.94378 | 0.590072 |
| knowledgator/gliner-multitask-v1.0 | 0.601354 | 0.874784 | 0.712745 |
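
The question-answering example prints every span labeled "answer" with its score. When a single answer is wanted, or when a question may have no answer in the text (as in SQuAD 2.0), one option is to keep only the top-scoring span above a cutoff. The sketch below assumes the `text` and `score` keys shown in the example; `best_answer`, `min_score`, and the mock predictions with their scores are made up for illustration.

```python
from typing import Optional


def best_answer(answers: list, min_score: float = 0.5) -> Optional[dict]:
    """Return the highest-scoring answer at or above min_score, or None if there is none."""
    candidates = [a for a in answers if a["score"] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a["score"])


# Mock predictions shaped like the dictionaries printed in the example above.
mock_answers = [
    {"text": "Paul Allen", "label": "answer", "score": 0.31},
    {"text": "Gates", "label": "answer", "score": 0.91},
    {"text": "Bill Gates", "label": "answer", "score": 0.62},
]

top = best_answer(mock_answers)
print(top["text"] if top else "no answer found")

# Raising the cutoff above every score yields None, i.e. "unanswerable".
print(best_answer(mock_answers, min_score=0.95))
```
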
How to use for summarization:
With the threshold parameter, you can control how much information is extracted.
prompt = "Summarize the given text, highlighting the most important information:\n"
text = """
Several studies have reported its pharmacological activities, including anti-inflammatory, antimicrobial, and antitumoral effects.
The effect of E-anethole was studied in the osteosarcoma MG-63 cell line, and the antiproliferative activity was evaluated by an MTT assay.
It showed a GI50 value of 60.25 μM with apoptosis induction through the mitochondrial-mediated pathway. Additionally, it induced cell cycle arrest at the G0/G1 phase, up-regulated the expression of p53, caspase-3, and caspase-9, and down-regulated Bcl-xL expression.
Moreover, the antitumoral activity of anethole was assessed against oral tumor Ca9-22 cells, and the cytotoxic effects were evaluated by MTT and LDH assays.
It demonstrated a LD50 value of 8 μM, and cellular proliferation was 42.7% and 5.2% at anethole concentrations of 3 μM and 30 μM, respectively.
It was reported that it could selectively and in a dose-dependent manner decrease cell proliferation and induce apoptosis, as well as induce autophagy, decrease ROS production, and increase glutathione activity. The cytotoxic effect was mediated through NF-kB, MAP kinases, Wnt, caspase-3 and -9, and PARP1 pathways. Additionally, treatment with anethole inhibited cyclin D1 oncogene expression, increased cyclin-dependent kinase inhibitor p21WAF1, up-regulated p53 expression, and inhibited the EMT markers.
"""
labels = ["summary"]
input_ = prompt+text
threshold = 0.1
summaries = model.predict_entities(input_, labels, threshold=threshold)
for summary in summaries:
    print(summary["text"], "=>", summary["score"])
How to use for text classification:
With the threshold parameter, you can control the recall and precision of text classification.
prompt = "Classify text into the following classes: positive review, negative review"
text = """
"I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.
"""
labels = ["match"]
input_ = prompt+text
threshold = 0.5
classes = model.predict_entities(input_, labels, threshold=threshold)
for label in classes:
    print(label["text"], "=>", label["score"])

| Model | Dataset | Micro F1 score |
|---|---|---|
| knowledgator/gliner-multitask-v1.0 | Emotion | 0.322 |
| | AG News | 0.7436 |
| | IMDb | 0.7907 |
| knowledgator/gliner-llama-multitask-1B-v1.0 | Emotion | 0.3475 |
| | AG News | 0.7436 |
| | IMDb | 0.7907 |

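Across different zero-shot benchmarks, our multitask model performs comparably to dedicated models built specifically for the NER task (all labels were lowercased in this test):

| Dataset | Precision | Recall | F1 score | F1 score (decimal) |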
|---|---|---|---|---|
| ACE 2004 | 53.25% | 23.20% | 32.32% | 0.3232 |
| ACE 2005 | 43.25% | 18.00% | 25.42% | 0.2542 |
| AnatEM | 51.75% | 25.98% | 34.59% | 0.3459 |
| Broad Tweet Corpus | 69.54% | 72.50% | 70.99% | 0.7099 |
| CoNLL 2003 | 68.33% | 68.43% | 68.38% | 0.6838 |
| CrossNER_AI | 67.15% | 56.10% | 61.13% | 0.6113 |
| CrossNER_literature | 71.60% | 64.74% | 68.00% | 0.6800 |
| CrossNER_music | 73.57% | 69.29% | 71.36% | 0.7136 |
| CrossNER_politics | 77.54% | 76.52% | 77.03% | 0.7703 |
| CrossNER_science | 74.54% | 66.00% | 70.01% | 0.7001 |
| FabNER | 69.28% | 62.62% | 65.78% | 0.6578 |
| FindVehicle | 49.75% | 51.25% | 50.49% | 0.5049 |
| GENIA_NER | 60.98% | 46.91% | 53.03% | 0.5303 |
| HarveyNER | 24.27% | 35.66% | 28.88% | 0.2888 |
| MultiNERD | 54.33% | 89.34% | 67.57% | 0.6757 |
| Ontonotes | 27.26% | 36.64% | 31.26% | 0.3126 |
| PolyglotNER | 33.54% | 64.29% | 44.08% | 0.4408 |
| TweetNER7 | 44.77% | 38.67% | 41.50% | 0.4150 |
| WikiANN en | 56.33% | 57.09% | 56.71% | 0.5671 |
| WikiNeural | 71.70% | 86.60% | 78.45% | 0.7845 |
| bc2gm | 64.71% | 51.68% | 57.47% | 0.5747 |
| bc4chemd | 69.24% | 50.08% | 58.12% | 0.5812 |
| bc5cdr | 79.22% | 69.19% | 73.87% | 0.7387 |
| mit-movie | 61.86% | 42.02% | 50.04% | 0.5004 |
| mit-restaurant | 58.87% | 36.67% | 45.19% | 0.4519 |
| ncbi | 68.72% | 54.86% | 61.01% | 0.6101 |
Connect with our community on Discord for the latest news, technical support, and discussion about our models. Join Discord.
@misc{stepanov2024gliner,
title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks},
author={Ihor Stepanov and Mykhailo Shtopko},
year={2024},
eprint={2406.12925},
archivePrefix={arXiv},
primaryClass={cs.LG}
}