🔑 Keyphrase Extraction Model: KBIR-inspec

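Keyphrase extraction is a text analysis technique for pulling the important keyphrases out of a document. With these keyphrases, readers can grasp what a text is about quickly and easily without reading it in full. Keyphrase extraction was originally done mostly by human annotators, who read the text in detail and then wrote down the most important keyphrases. The drawback is that this process becomes very time-consuming when there are many documents to handle ⏳.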

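This is where Artificial Intelligence 🤖 comes in. Classical machine learning methods that use statistical and linguistic features are currently widely used for extraction. With deep learning, it is now possible to capture the semantic meaning of a text better than these classical methods do. Classical methods look at the frequency, occurrence, and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words in a text.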

📓 Model Description

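This model uses KBIR as its base model and fine-tunes it on the Inspec dataset. KBIR, short for Keyphrase Boundary Infilling with Replacement, is a pre-trained model that uses a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC). For more information about the architecture, see the paper by Kulkarni et al. (2021).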

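Keyphrase extraction models are transformer models fine-tuned on a token classification problem, where each word in the document is classified as being part of a keyphrase or not.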

Label	Description
B-KEY	At the beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase
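
To make the labeling scheme concrete, here is a minimal sketch of how a sequence of these tags could be turned back into keyphrases. The function name `group_keyphrases` and the example words and tags are hypothetical illustrations, not part of the model or its pipeline.

```python
def group_keyphrases(words, tags):
    """Group words into keyphrases from parallel B-KEY / I-KEY / O tags."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-KEY":
            # A new keyphrase starts; close the one that was open, if any.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-KEY" and current:
            # Continue the keyphrase that is currently open.
            current.append(word)
        else:
            # "O", or a stray I-KEY with no open keyphrase: close and reset.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical example input (hand-written tags, not model output).
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(group_keyphrases(words, tags))
# ['Keyphrase extraction', 'text analysis']
```

This sketch works on whole words; the model itself predicts labels per subword token, so real outputs need the aggregation step shown in the usage and postprocessing sections.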

Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learning Rich Representation of Keyphrases from Text." arXiv preprint arXiv:2112.08547 (2021).

Sahrawat, Dhruva, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. "Keyphrase extraction as sequence labeling using contextualized embeddings." In European Conference on Information Retrieval, pp. 328-335. Springer, Cham, 2020.

✋ Intended Uses & Limitations

🛑 Limitations

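This keyphrase extraction model is highly domain-specific and performs very well on abstracts of scientific papers. It is not recommended for other domains, but you are free to test it.
Only works for English documents.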

❓ How to Use

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np
import torch
import torch_npu

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
device = torch.device('npu:0')
model_name = "ml6team/keyphrase-extraction-kbir-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name, device=device)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['Artificial Intelligence' 'Keyphrase extraction' 'deep learning'
 'linguistic features' 'machine learning' 'semantic meaning'
 'text analysis']

📚 Training Dataset

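Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the domains of Computers and Control and Information Technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.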

You can find more information in the paper.

👷‍♂️ Training Procedure

Training Parameters

Parameter	Value
Learning Rate	1e-4
Epochs	50
Early Stopping Patience	3
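
As an illustration of how the epoch limit and the early stopping patience interact, here is a minimal, framework-free sketch. It assumes that the monitored metric is an evaluation loss where lower is better; the model card does not state which metric was monitored, and the function name and loss values below are hypothetical.

```python
def train_with_early_stopping(eval_losses, max_epochs=50, patience=3):
    """Return (epochs_run, best_loss) for a sequence of per-epoch eval losses."""
    best_loss = float("inf")
    bad_epochs = 0  # consecutive epochs without improvement
    epochs_run = 0
    for epoch, loss in enumerate(eval_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # stop: no improvement for `patience` epochs in a row
    return epochs_run, best_loss


# Hypothetical loss curve: the best value is reached at epoch 3.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
print(train_with_early_stopping(losses))
# (6, 0.6)
```

With a patience of 3, training in this sketch stops after epoch 6, well before the 50-epoch limit, and never sees the later improvement at epoch 7.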

Preprocessing

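The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining steps are tokenization and realigning the labels so that they line up with the right subword tokens.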

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR", add_prefix_space=True)
max_length = 512

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize pre-split words; each word may become several subword tokens
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word keeps the word's label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of the same word: "B" becomes "I"
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
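
The label realignment can be hard to follow inside the batched function, so here is a simplified, tokenizer-free sketch of the same idea. It assumes word ids are consecutive integers and indexes the word labels by word id directly. The `word_ids` list below is hand-written and hypothetical, standing in for what a fast tokenizer's `word_ids()` would return.

```python
def align_labels(word_ids, word_labels, special_label="O"):
    """Expand word-level B/I/O labels to subword tokens."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special or padding token: no source word
            aligned.append(special_label)
        elif wid != prev_wid:
            # First subword of a word keeps the word's label
            aligned.append(word_labels[wid])
        else:
            # Continuation subword: a "B" word continues as "I"
            label = word_labels[wid]
            aligned.append("I" if label == "B" else label)
        prev_wid = wid
    return aligned


# Hypothetical case: word 0 is split into two subwords, and the
# sequence is wrapped in one special token on each side.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = ["B", "I", "O"]
print(align_labels(word_ids, word_labels))
# ['O', 'B', 'I', 'I', 'O', 'O']
```

Only the first subword of a keyphrase-initial word keeps the B label, so each keyphrase still has exactly one B token after tokenization.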

Postprocessing (Without Pipeline Function)

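If you do not use the pipeline function, you must filter for the tokens labeled B and I. Each B token and the I tokens that follow it are then merged into one keyphrase. Finally, strip the keyphrases so that all unnecessary whitespace is removed.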

import numpy as np

# Define post_process functions
# (idx2label is the mapping defined in the preprocessing snippet above)
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])

📝 Evaluation Results

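Traditional evaluation methods are precision, recall, and F1-score at k and at M (P@k, R@k, F1@k and P@M, R@M, F1@M), where k is the number of top-ranked predicted keyphrases considered and M stands for the average number of predicted keyphrases.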

The model achieves the following results on the Inspec test set:

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M
Inspec Test Set	0.53	0.47	0.46	0.36	0.58	0.41	0.58	0.60	0.56
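
For readers unfamiliar with these metrics, the following is a rough sketch of how precision, recall, and F1 at k could be computed for a single document. It is not the evaluation script used for the numbers above. It makes two simplifying assumptions: keyphrases are matched by lowercase exact match (published evaluations often stem the phrases first), and `k=None` is taken to mean "use all predicted keyphrases" as a stand-in for the @M setting. The function name and example data are hypothetical.

```python
def normalize(phrases):
    """Lowercase, strip, and deduplicate phrases while keeping their order."""
    seen, out = set(), []
    for phrase in phrases:
        key = phrase.lower().strip()
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out


def precision_recall_f1_at_k(predicted, gold, k=None):
    """Score ranked predictions against gold keyphrases for one document."""
    ranked = normalize(predicted)
    top = ranked if k is None else ranked[:k]
    gold_set = set(normalize(gold))
    matches = len(set(top) & gold_set)
    precision = matches / len(top) if top else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical predictions (ranked) and gold annotations.
predicted = ["deep learning", "text analysis", "neural network", "machine learning"]
gold = ["deep learning", "machine learning", "keyphrase extraction"]

print(tuple(round(x, 2) for x in precision_recall_f1_at_k(predicted, gold, k=2)))
# (0.5, 0.33, 0.4)
print(tuple(round(x, 2) for x in precision_recall_f1_at_k(predicted, gold)))
# (0.5, 0.67, 0.57)
```

Corpus-level scores are typically obtained by averaging these per-document values over the test set.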

🚨 Issues

Feel free to start a discussion in the Community Tab.