✨ bert-restore-punctuation

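This is a bert-base-uncased model fine-tuned for punctuation restoration on the Yelp Reviews dataset.

The model predicts punctuation and upper-casing for plain, lowercase text. A typical use case is automatic speech recognition (ASR) output, or any other text that has lost its punctuation.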
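
To illustrate the kind of input described here, the sketch below turns ordinary punctuated text into plain lowercase text, which can be handy for building test inputs. `strip_for_restoration` is a hypothetical helper, not part of rpunct, and the choice to keep in-word hyphens and apostrophes is an assumption based on the sample input in this card (which retains words such as "high-powered").

```python
import re

# Hypothetical helper, not part of rpunct.
# Sentence-level marks to drop; hyphens and apostrophes are kept on purpose.
_SENTENCE_PUNCT = re.compile(r"[!?.,:;]")


def strip_for_restoration(text: str) -> str:
    """Lowercase text and remove sentence punctuation, collapsing whitespace."""
    without_punct = _SENTENCE_PUNCT.sub(" ", text)
    return " ".join(without_punct.lower().split())


plain = strip_for_restoration("Hello, World! It's state-of-the-art.")
```

Note that this simple pattern also splits decimals such as "3.5", so real ASR-style preprocessing may need more care.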

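The model is intended for direct use as a punctuation restoration model for general English. Alternatively, you can fine-tune it further on domain-specific text for punctuation restoration.

The model restores the following punctuation marks: [! ? . , - : ; ' ]

The model also restores the upper-casing of words.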
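
As a rough sketch of how per-word predictions could be turned back into text, the code below applies one label to each word. The label format (`none`, `Upper`, a punctuation mark, or a mark followed by `+Upper`) mirrors the label names in the per-label table of this card; that the mark attaches after the word, and the helper names `apply_label` and `restore`, are assumptions for illustration and may differ from what rpunct does internally.

```python
from typing import Iterable

# Marks the card lists as restorable.
PUNCT_MARKS = set("!?.,-:;'")


def apply_label(word: str, label: str) -> str:
    """Apply one hypothetical label to one word."""
    if label == "none":
        return word
    mark, _, case = label.partition("+")
    if mark == "Upper":
        # Capitalize only the first character, leave the rest untouched.
        return word[:1].upper() + word[1:]
    if mark not in PUNCT_MARKS:
        raise ValueError(f"unknown label: {label!r}")
    if case == "Upper":
        word = word[:1].upper() + word[1:]
    # Assumption: the predicted mark follows the word it is attached to.
    return word + mark


def restore(words: Iterable[str], labels: Iterable[str]) -> str:
    """Join words after applying their labels."""
    return " ".join(apply_label(w, l) for w, l in zip(words, labels))


sentence = restore(
    ["in", "2018", "cornell", "researchers"],
    ["Upper", ",", "Upper", "none"],
)
```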


🚋 Usage

Below is a quick way to get up and running with the model.

  1. First, install the package.
pip install rpunct
  2. Sample Python code.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

The model works on English text of arbitrary length and uses a GPU if one is available.
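
BERT-style models accept a bounded number of tokens per pass, so handling arbitrary-length text generally means splitting it into pieces. The sketch below shows one common way to do that with overlapping word windows. It is an illustration only: `split_with_overlap` is a hypothetical helper, and the window and overlap sizes are assumed values, not a description of rpunct's actual settings.

```python
from typing import List


def split_with_overlap(words: List[str], size: int = 250, overlap: int = 30) -> List[List[str]]:
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once a window reaches the end of the text.
        if start + size >= len(words):
            break
    return chunks


demo_words = [str(i) for i in range(10)]
demo_chunks = split_with_overlap(demo_words, size=4, overlap=1)
```

The overlap gives each boundary word context on both sides; predictions for the shared words then have to be reconciled when the windows are merged, for example by keeping the prediction from the window where the word sits further from the edge.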


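📡 Training data

The number of product reviews used to fine-tune the model:

| Language | Number of text samples |
| --- | --- |
| English | 560,000 |

We found the best convergence at around 3 epochs. The results presented here, and the model available for download, come from that point in training.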


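🎯 Accuracy

On 45,990 held-out text samples, the fine-tuned model achieved the following:

| Accuracy | Overall F1 | Eval support |
| --- | --- | --- |
| 91% | 90% | 45,990 |

The per-label performance breakdown is as follows: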

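| label | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| ! | 0.45 | 0.17 | 0.24 | 424 |
| !+Upper | 0.43 | 0.34 | 0.38 | 98 |
| ' | 0.60 | 0.27 | 0.37 | 11 |
| , | 0.59 | 0.51 | 0.55 | 1522 |
| ,+Upper | 0.52 | 0.50 | 0.51 | 239 |
| - | 0.00 | 0.00 | 0.00 | 18 |
| . | 0.69 | 0.84 | 0.75 | 2488 |
| .+Upper | 0.65 | 0.52 | 0.57 | 274 |
| : | 0.52 | 0.31 | 0.39 | 39 |
| :+Upper | 0.36 | 0.62 | 0.45 | 16 |
| ; | 0.00 | 0.00 | 0.00 | 17 |
| ? | 0.54 | 0.48 | 0.51 | 46 |
| ?+Upper | 0.40 | 0.50 | 0.44 | 4 |
| none | 0.96 | 0.96 | 0.96 | 35352 |
| Upper | 0.84 | 0.82 | 0.83 | 5442 |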
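
The card does not say how the overall F1 was averaged. As a sanity check, the sketch below recomputes two common averages from the rounded per-label values above, under the assumption that the overall figure is a support-weighted average. The variable names are illustrative.

```python
# (label, f1-score, support) copied from the per-label table.
ROWS = [
    ("!", 0.24, 424), ("!+Upper", 0.38, 98), ("'", 0.37, 11),
    (",", 0.55, 1522), (",+Upper", 0.51, 239), ("-", 0.00, 18),
    (".", 0.75, 2488), (".+Upper", 0.57, 274), (":", 0.39, 39),
    (":+Upper", 0.45, 16), (";", 0.00, 17), ("?", 0.51, 46),
    ("?+Upper", 0.44, 4), ("none", 0.96, 35352), ("Upper", 0.83, 5442),
]

total_support = sum(support for _, _, support in ROWS)

# Support-weighted average: frequent labels dominate the result.
weighted_f1 = sum(f1 * support for _, f1, support in ROWS) / total_support

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)

print(f"support={total_support} weighted={weighted_f1:.3f} macro={macro_f1:.3f}")
```

The supports add up to the 45,990 reported above, and the weighted average lands at roughly 0.905, close to the reported 90%. The unweighted macro average is only about 0.46, which reflects the weak scores on rare labels such as `-` and `;`; the headline numbers are driven mostly by the `none` and `Upper` classes.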

☕ Contact

For questions, feedback, or requests for similar models, please contact Daulet Nurmanbetov.