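vi-mrc-large: a model for Vietnamese extractive question answering that extracts an answer from a given context and question. It is fine-tuned from the pre-trained XLM-RoBERTa model, supports Vietnamese and English, and scores EM 85.847 and F1 83.826 on the VLSP MRC 2021 public test set. It recombines sub-word representations into word representations to improve performance.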

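Model description

Language model: XLM-RoBERTa
Fine-tune: MRCQuestionAnswering
Language: Vietnamese, English
Downstream task: Extractive QA
Dataset (English and Vietnamese combined):

This model is intended for question answering in Vietnamese, so the validation set contains Vietnamese data only (English input also works). The evaluation results below use the VLSP MRC 2021 test set. This experiment took first place on the leaderboard.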

Model	EM	F1
large public_test_set	85.847	83.826
large private_test_set	82.072	78.071
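
For readers unfamiliar with the two columns, the sketch below shows how exact match (EM) and token-level F1 are commonly computed for extractive QA, following the SQuAD-style definition. It is an illustration only: the official VLSP MRC 2021 scoring script is not reproduced here, and the normalization (lowercasing and whitespace splitting) and the function names are assumptions made for this example.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a simplified, assumed normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Google Developer Expert", "google developer expert"))
    print(f1_score("Developer Expert", "Google Developer Expert"))
```

A partially correct span such as "Developer Expert" gets an EM of 0 but a non-zero F1, which is why the two metrics are usually reported together.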

[Images: public leaderboard and private leaderboard]

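MRCQuestionAnswering uses XLM-RoBERTa as its pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In my implementation, however, I recombine the sub-word representations (after they are encoded by the BERT layer) into word representations using a sum strategy.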
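
The idea of summing sub-word vectors into word vectors can be illustrated with a small, dependency-free sketch. This is not the code from the project repository: the function name `sum_subwords_to_words`, the toy 2-dimensional vectors, and the sub-word split are all hypothetical. The `word_ids` list follows the same convention as the `word_ids()` method of Hugging Face fast tokenizers, where each sub-word maps to the index of its source word and special tokens map to `None`.

```python
from typing import List, Optional


def sum_subwords_to_words(
    subword_vectors: List[List[float]],
    word_ids: List[Optional[int]],
) -> List[List[float]]:
    """Sum the vectors of all sub-words that belong to the same word.

    word_ids[i] is the index of the word that sub-word i came from,
    or None for special tokens, which are skipped.
    """
    num_words = max((w for w in word_ids if w is not None), default=-1) + 1
    dim = len(subword_vectors[0]) if subword_vectors else 0
    word_vectors = [[0.0] * dim for _ in range(num_words)]
    for vector, word_id in zip(subword_vectors, word_ids):
        if word_id is None:
            continue  # skip special tokens such as <s> and </s>
        for k, value in enumerate(vector):
            word_vectors[word_id][k] += value
    return word_vectors


if __name__ == "__main__":
    # Hypothetical split: <s>, "xử", "lý", "ngôn", "ng" + "ữ", </s>
    word_ids = [None, 0, 1, 2, 3, 3, None]
    vectors = [
        [9.0, 9.0],  # <s> (ignored)
        [1.0, 0.0],
        [0.0, 1.0],
        [2.0, 2.0],
        [1.0, 1.0],  # first piece of word 3
        [0.5, 0.5],  # second piece of word 3
        [9.0, 9.0],  # </s> (ignored)
    ]
    print(sum_subwords_to_words(vectors, word_ids))
```

With this strategy the answer span can be predicted over whole words rather than over sub-word pieces, so a predicted boundary cannot fall in the middle of a word.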

Using the pre-trained model

Hugging Face pipeline style (does NOT use the sum features strategy).

from transformers import pipeline
# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
                   tokenizer=model_checkpoint)
QA_input = {
  'question': "Bình là chuyên gia về gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}
res = nlp(QA_input)
print('pipeline: {}'.format(res))
#{'score': 0.5782045125961304, 'start': 45, 'end': 68, 'answer': 'xử lý ngôn ngữ tự nhiên'}
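
The `start` and `end` values in the pipeline result are character offsets into the context string, with `end` exclusive, so the answer can be recovered by slicing. The short check below reuses the result shown above as a literal, so it runs without downloading the model. The NFC normalization is an added precaution for this example, since Vietnamese text may be stored in either composed or decomposed Unicode form and the offsets assume the composed form.

```python
import unicodedata

# Context and result copied from the pipeline example above.
context = unicodedata.normalize(
    "NFC",
    "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . "
    "Anh nhận chứng chỉ Google Developer Expert năm 2020",
)
result = {
    "score": 0.5782045125961304,
    "start": 45,
    "end": 68,
    "answer": unicodedata.normalize("NFC", "xử lý ngôn ngữ tự nhiên"),
}

# Slice the context with the character offsets; `end` is exclusive.
span = context[result["start"]:result["end"]]
print(span == result["answer"], span)
```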

More accurate inference process (uses the sum features strategy)

from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

model_checkpoint = "nguyenvulebinh/vi-mrc-large"
#model_checkpoint = "nguyenvulebinh/vi-mrc-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
  'question': "Bình được công nhận với danh hiệu gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

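# Note: in Python, * applied to a dict unpacks its keys, not its values.
# Check the signature of tokenize_function in the project repository before running.
inputs = [tokenize_function(*QA_input)]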
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)

print(answer)
# answer: Google Developer Expert. Score start: 0.9926977753639221, Score end: 0.9909810423851013

About

Developed by Binh Nguyen. For more details, visit the project repository.
