
TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations


This repository contains the models, code, and pointers to the datasets from our paper: TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations. [PDF] [HuggingFace Models]

Overview

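TwHIN-BERT is a new multilingual language model for tweets, trained on 7 billion tweets in more than 100 distinct languages. Unlike earlier pre-trained language models, TwHIN-BERT is trained not only with text-based self-supervision (e.g., MLM) but also with a social objective based on the rich social engagements in a Twitter Heterogeneous Information Network (TwHIN).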
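To make the idea of combining a text objective with a social objective concrete, here is a minimal sketch in PyTorch. It assumes the social objective is an in-batch contrastive loss over pairs of tweets that are linked through user engagements, and that the two losses are simply added with a weight. The function names, the `temperature` and `social_weight` parameters, and the exact loss form are illustrative assumptions, not the training code from this repository.

```python
import torch
import torch.nn.functional as F


def social_contrastive_loss(anchor, positive, temperature=0.1):
    """In-batch contrastive loss (hypothetical sketch).

    anchor[i] and positive[i] are embeddings of two tweets assumed to be
    socially related; every other row in the batch acts as a negative.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    # Cosine similarity of every anchor against every candidate.
    logits = a @ p.T / temperature
    # The matching pair for row i sits on the diagonal.
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


def joint_loss(mlm_loss, anchor, positive, social_weight=1.0):
    """Text self-supervision plus a weighted social term (assumed weighting)."""
    return mlm_loss + social_weight * social_contrastive_loss(anchor, positive)
```

In this sketch, `mlm_loss` would come from a standard masked-language-modeling head, while `anchor` and `positive` would be pooled tweet embeddings produced by the same encoder.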

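TwHIN-BERT can be used as a drop-in replacement for BERT in a variety of natural language processing (NLP) and recommendation tasks. It outperforms comparable models on semantic understanding tasks such as text classification, and it also performs well on social recommendation tasks such as predicting user engagement with tweets.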

0. Using the model with Openmind

# Only the names actually used below are imported.
from openmind import pipeline, AutoTokenizer, is_torch_npu_available
import torch
import argparse
import time

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/twhin-bert-base-openmind",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

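    # Prefer an Ascend NPU when one is available; otherwise fall back to CPU.
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    # Note: the timer covers tokenizer and model loading as well as inference.
    start_time = time.time()

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    pipe = pipeline('fill-mask', model=model_path, torch_dtype=torch.bfloat16, device_map=device)
    # Use the tokenizer's own mask token instead of hard-coding it.
    MASK_TOKEN = tokenizer.mask_token
    result = pipe("Hello I'm a {} model.".format(MASK_TOKEN))
    print(result)

    end_time = time.time()
    print(f"Device: {device}, elapsed time (loading + inference): {end_time - start_time} s")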
    
if __name__ == "__main__":
    main()

1. Pretrained Models

We initially released two pretrained TwHIN-BERT models (base and large) that are compatible with the HuggingFace BERT models.

| Model | Size | Download Link (🤗 HuggingFace) |
| --- | --- | --- |
| TwHIN-BERT-base | 280M parameters | Twitter/TwHIN-BERT-base |
| TwHIN-BERT-large | 550M parameters | Twitter/TwHIN-BERT-large |

To use these models in 🤗 Transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('Twitter/twhin-bert-base')
model = AutoModel.from_pretrained('Twitter/twhin-bert-base')
inputs = tokenizer("I'm using TwHIN-BERT! #TwHIN-BERT #NLP", return_tensors="pt")
outputs = model(**inputs)
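The snippet above stops at the raw model outputs. One common way to turn them into a single vector per tweet is mask-aware mean pooling over the last hidden state. The sketch below illustrates that approach. The pooling strategy and the helper names `mean_pool` and `embed_tweets` are assumptions made for illustration; this document does not specify which pooling the authors use.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def embed_tweets(texts, model_name="Twitter/twhin-bert-base"):
    """Hypothetical helper: encode a list of tweets into pooled embeddings."""
    # Imported lazily so that mean_pool can be used without transformers.
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```

Calling `embed_tweets(["I'm using TwHIN-BERT! #NLP"])` would download the model on first use and return a tensor with one row per input tweet; the rows can then be compared with cosine similarity or fed to a downstream classifier.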

Citation

If you use TwHIN-BERT or our datasets in your research, please cite the following:

@article{zhang2022twhin,
  title={TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations},
  author={Zhang, Xinyang and Malkov, Yury and Florez, Omar and Park, Serim and McWilliams, Brian and Han, Jiawei and El-Kishky, Ahmed},
  journal={arXiv preprint arXiv:2209.07562},
  year={2022}
}