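A machine learning model that generates tags for machine learning articles. The model is a fine-tuned version of t5-small, trained on a refined version of the 190k Medium Articles dataset, and it generates tags from the text content of an article. Tag generation is usually formulated as a multi-label classification problem, but this model treats it as a text-to-text generation task (inspiration and reference: fabiochiu/t5-base-tag-generation).
Fine-tuning notebook reference: Hugging Face summarization notebook.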
Install the required packages:

pip install transformers nltk

Example inference script:

import argparse
import time

import torch
from openmind import pipeline, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser(description="Eval the model")
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="path or model",
        default="zhouhui/t5-small-machine-articles-tag-generation",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use the first NPU when one is available, otherwise fall back to CPU.
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
    # device = "cpu"  # uncomment to force CPU

    # Tag generation is run through the summarization (text-to-text) pipeline.
    seq2seq = pipeline("summarization", model=model_path, device_map=device)

    start_time = time.time()
    sample_text = """Paige, AI in pathology and genomics
Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why?
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer."""

    result = seq2seq(sample_text)
    print(result)
    end_time = time.time()
    print(f"Hardware: {device}, inference time: {end_time - start_time} s")


if __name__ == "__main__":
    main()

Of the more than 190k articles in the dataset available on Kaggle, around 12k are machine learning articles, and their tags are fairly high-level. More specific tags would be useful when building a system for a technical blog platform. We filtered out the machine learning articles and sampled around 1,000 of them. Tags for these articles were generated with the GPT3 API and then preprocessed so that the final dataset keeps only articles with 4 or 5 tags, which left around 940 articles.
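The tag-count filter described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code: the record layout (a dict with a `tags` list) and the function name `keep_articles_with_tag_count` are assumptions made for the example.

```python
def keep_articles_with_tag_count(articles, min_tags=4, max_tags=5):
    """Keep only articles whose tag list has min_tags..max_tags entries.

    `articles` is assumed to be a list of dicts with a "tags" list;
    this layout is hypothetical and not taken from the original dataset.
    """
    return [a for a in articles if min_tags <= len(a["tags"]) <= max_tags]


# Toy records, invented for illustration only.
articles = [
    {"title": "A", "tags": ["NLP", "Transformers", "T5"]},
    {"title": "B", "tags": ["NLP", "Transformers", "T5", "Fine-Tuning"]},
    {"title": "C", "tags": ["CV", "CNN", "PyTorch", "Augmentation", "Training"]},
    {"title": "D", "tags": ["ML", "AI", "Data", "Python", "Stats", "Math"]},
]

filtered = keep_articles_with_tag_count(articles)
kept_titles = [a["title"] for a in filtered]
print(kept_titles)
```

Only the articles with exactly 4 or 5 tags survive the filter; those with fewer or more are dropped.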
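The model is mainly intended for generating tags for machine learning articles. It can also be used for other technical articles, though accuracy and level of detail may be lower. The generated output may contain duplicate tags, which need to be handled when post-processing the results.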
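One way to handle duplicate tags in post-processing is sketched below. It assumes the generated text is a single comma-separated string of tags, which the text above does not state explicitly; the helper name `dedupe_tags` is hypothetical.

```python
def dedupe_tags(generated, sep=","):
    """Split generated text into tags and drop duplicates, keeping first-seen order.

    Assumes tags are separated by `sep` (a comma by default); comparison is
    case-insensitive, and the first spelling encountered is the one kept.
    """
    seen = set()
    unique = []
    for tag in generated.split(sep):
        tag = tag.strip()
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique


tags = dedupe_tags("Deep Learning, NLP, deep learning, , NLP")
print(tags)
```

Case-insensitive matching is a design choice for this sketch; drop the `.lower()` call if differently cased tags should be treated as distinct.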
On the evaluation set, the model achieved the following results:
The dataset of over 940 articles was split into training, validation, and test sets in an 80:10:10 ratio.
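An 80:10:10 split of this kind can be sketched as below. This is an illustration under stated assumptions rather than the original split code: the shuffling, the seed value, and the function name `split_dataset` are all choices made for the example.

```python
import random


def split_dataset(items, percents=(80, 10, 10), seed=42):
    """Shuffle items and split them into train/validation/test lists.

    Uses integer percentages to avoid floating-point rounding; any remainder
    after the train and validation cuts goes to the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# 940 placeholder article ids stand in for the real articles.
train, val, test = split_dataset(range(940))
print(len(train), len(val), len(test))
```

With 940 items this gives 752 training, 94 validation, and 94 test examples.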
The following hyperparameters were used during training: