Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.6

Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data

Abstract: Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese…
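The abstract only sketches the approach at a high level. As a rough illustration (not the authors' actual pipeline), the snippet below shows one common way to turn multiple-choice QA items plus retrieved reference text into pseudo-labeled MRC examples; the retriever, data fields, and helper names are illustrative assumptions.

```python
# Minimal sketch, assuming QA items with a known correct option and a text
# corpus to retrieve reading passages from; not the paper's exact method.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QAItem:
    question: str
    options: List[str]
    answer_idx: int  # index of the correct option


@dataclass
class MRCExample:
    context: str  # reading passage
    question: str
    answer: str


def retrieve_passage(query: str, corpus: List[str]) -> Optional[str]:
    """Toy lexical retriever: pick the corpus passage with the largest
    word overlap with the query (a stand-in for a real search engine)."""
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & set(p.lower().split())), p) for p in corpus]
    best_score, best_passage = max(scored, default=(0, None))
    return best_passage if best_score > 0 else None


def qa_to_mrc(items: List[QAItem], corpus: List[str]) -> List[MRCExample]:
    """Pair each QA item with a retrieved passage to form a pseudo-labeled
    MRC example; items with no supporting passage are dropped."""
    examples = []
    for item in items:
        answer = item.options[item.answer_idx]
        passage = retrieve_passage(item.question + " " + answer, corpus)
        if passage is not None:
            examples.append(MRCExample(passage, item.question, answer))
    return examples
```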

Cited by 1 publication (1 citation statement). References 40 publications.
“…For example, just using the pre-trained language model to initialize 2 and 3 hurts F1 by 0.2% (2 vs. 2A) and 0.4% (3 vs. 3A), respectively. This finding, at first glance, seems to be contrary to those in some previous studies (e.g., Yu et al., 2021) that also leverage pseudo-labeled or distantly-labeled data. This is perhaps because in CST, teachers (except for 0) are trained with pseudo-labeled data constructed based on DIFFERENT sets of books, instead of relying on a FIXED set of unlabeled or distantly-labeled resources.…”
Section: Discussion on Continual Self-training (citation type: contrasting)
confidence: 99%
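To make the contrast in this citation statement concrete, here is a hedged sketch of a continual self-training (CST) loop in which each round's pseudo-labels come from a different pool of books; the function names, interfaces, and the init_from_previous switch (mimicking the re-initialization ablation described above) are assumptions, not the cited paper's actual code.

```python
from typing import Callable, List, Sequence


def continual_self_training(
    pretrained_lm,                    # pre-trained language model checkpoint
    gold_data,                        # labeled data used to train teacher 0
    book_sets: Sequence[List[str]],   # a DIFFERENT pool of books per round
    pseudo_label: Callable,           # (model, books) -> pseudo-labeled data
    train: Callable,                  # (init_model, data) -> trained model
    init_from_previous: bool = True,
):
    """Round t pseudo-labels book_sets[t] with the current teacher and trains
    the next model. init_from_previous=False mimics the ablation discussed
    above, where later models are re-initialized from the pre-trained LM
    rather than from the previous teacher."""
    teacher = train(pretrained_lm, gold_data)  # teacher 0, from gold data
    for books in book_sets:
        pseudo_data = pseudo_label(teacher, books)
        init = teacher if init_from_previous else pretrained_lm
        teacher = train(init, pseudo_data)
    return teacher
```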