Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1620

Synthetic QA Corpora Generation with Roundtrip Consistency

Abstract: We introduce a novel method of generating synthetic question answering corpora by combining models of question generation and answer extraction, and by filtering the results to ensure roundtrip consistency. By pretraining on the resulting corpora we obtain significant improvements on SQuAD2 (Rajpurkar et al., 2018) and NQ (Kwiatkowski et al., 2019), establishing a new state-of-the-art on the latter. Our synthetic data generation models, for both question generation and answer extraction, can be fully reproduce…
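The roundtrip-consistency filter described in the abstract can be summarized as: extract an answer from a passage, generate a question for that answer, re-answer the generated question, and keep the triple only if the recovered answer matches. The sketch below illustrates that loop; the three model functions are hypothetical placeholders (in the paper they are BERT-based answer extraction, question generation, and question answering models), not the authors' released code.

```python
# Minimal sketch of roundtrip-consistency filtering for synthetic QA data.
# The three callables are placeholders for trained models.

from typing import Callable, Iterable, List, Tuple


def roundtrip_filter(
    passages: Iterable[str],
    extract_answer: Callable[[str], str],          # C -> A
    generate_question: Callable[[str, str], str],  # (C, A) -> Q
    answer_question: Callable[[str, str], str],    # (C, Q) -> A'
) -> List[Tuple[str, str, str]]:
    """Keep a synthetic (passage, question, answer) triple only if the QA
    model recovers the same answer from the generated question."""
    kept = []
    for passage in passages:
        answer = extract_answer(passage)
        question = generate_question(passage, answer)
        predicted = answer_question(passage, question)
        if predicted == answer:  # roundtrip-consistent triple
            kept.append((passage, question, answer))
    return kept
```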

Cited by 158 publications (162 citation statements)
References 14 publications
“…Further, we introduce a data filter to remove poorly generated examples and a mixing mini-batch training strategy to more effectively use the synthetic data. Similar methods have also been applied in some very recent concurrent works (Dong et al., 2019; Alberti et al., 2019) on SQuADv2.0. The main difference is that we also propose to generate new questions from existing articles without introducing new articles.…”
Section: Related Work (mentioning)
confidence: 97%
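The "mixing mini-batch training strategy" mentioned above could, for illustration, look roughly like the sketch below: each batch draws a fixed fraction of gold examples and fills the rest with filtered synthetic examples. The ratio, batch size, and sampling scheme are illustrative assumptions, not taken from the cited work.

```python
# Hypothetical sketch of a mixing mini-batch strategy for gold + synthetic data.
# Assumes both pools contain at least one batch worth of examples.

import random
from typing import Dict, Iterator, List


def mixed_batches(
    gold: List[Dict],
    synthetic: List[Dict],
    batch_size: int = 32,
    gold_fraction: float = 0.5,
    num_batches: int = 1000,
) -> Iterator[List[Dict]]:
    n_gold = max(1, int(batch_size * gold_fraction))
    n_syn = batch_size - n_gold
    for _ in range(num_batches):
        batch = random.sample(gold, n_gold) + random.sample(synthetic, n_syn)
        random.shuffle(batch)  # avoid a fixed gold/synthetic ordering within the batch
        yield batch
```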
“…Related work in the context of semi-supervised learning has focused on developing methods to generate synthetic training instances for different tasks (Sennrich et al., 2016; Hayashi et al., 2018; Alberti et al., 2019; Winata et al., 2019), in order to accelerate the learning process. Sennrich et al. (2016) create artificial training instances for machine translation, using monolingual data paired with automatic back-translations.…”
Section: Discussion (mentioning)
confidence: 99%
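For context, the back-translation idea referenced above can be sketched as follows: monolingual target-language sentences are paired with automatic translations back into the source language, yielding synthetic parallel pairs. The `translate_to_source` function is a placeholder for any target-to-source MT model, not a real library call.

```python
# Minimal sketch of back-translation for synthetic MT training data.

from typing import Callable, Iterable, List, Tuple


def back_translate(
    monolingual_target: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        source_sentence = translate_to_source(target_sentence)  # synthetic source side
        synthetic_pairs.append((source_sentence, target_sentence))
    return synthetic_pairs
```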
“…For example, Hayashi et al. (2018) augment the training data for attention-based end-to-end automatic speech recognition with synthetic instances, and Winata et al. (2019) generate artificial training examples to improve automatic speech recognition on code-switching material. Alberti et al. (2019) use a large number of synthetic instances to pre-train a Question Answering (QA) model that is then fine-tuned on the target QA dataset. Their approach results in significant improvements over models that are trained without the synthetic datapoints.…”
Section: Discussion (mentioning)
confidence: 99%
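The two-stage recipe described in that statement, pre-training on synthetic examples and then fine-tuning on the target dataset, is sketched below with a generic training step. The `train_one_epoch` callable, the epoch counts, and the learning rate are illustrative assumptions, not the authors' configuration.

```python
# Sketch of pre-training on synthetic QA data followed by fine-tuning on the target set.

from typing import Callable, Sequence


def pretrain_then_finetune(
    model,
    synthetic_data: Sequence,
    target_data: Sequence,
    train_one_epoch: Callable[[object, Sequence, float], None],
    pretrain_epochs: int = 1,
    finetune_epochs: int = 2,
    learning_rate: float = 3e-5,
) -> None:
    # Stage 1: pre-train on the large synthetic corpus.
    for _ in range(pretrain_epochs):
        train_one_epoch(model, synthetic_data, learning_rate)
    # Stage 2: fine-tune on the target dataset (e.g. SQuAD 2.0 or NQ).
    for _ in range(finetune_epochs):
        train_one_epoch(model, target_data, learning_rate)
```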
“…Note that we use this pre-trained model for experimental purposes, and it is not included in the final submission. In our experiments, we initialize the parameters of the encoding layers from the checkpoint of the model (Alberti et al., 2019), namely BERT + N-Gram Masking + Synthetic Self-Training. The model is initialized from Whole Word Masking BERT (BERT wwm), further fine-tuned on the SQuAD 2.0 task with synthetic generated question answering corpora.…”
Section: Pre-trained Models (mentioning)
confidence: 99%
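Initializing only the encoding layers from a released checkpoint, as described in that statement, can be sketched with PyTorch state-dict filtering. The checkpoint path, key prefix, and model object here are hypothetical; the actual released checkpoint is in TensorFlow format and would need conversion before anything like this could run.

```python
# Sketch: load only encoder parameters from a saved state dict, leaving the
# remaining layers at their existing initialization.

import torch


def load_encoder_weights(model: torch.nn.Module, checkpoint_path: str,
                         encoder_prefix: str = "encoder.") -> None:
    # Assumes the checkpoint file contains a plain state dict.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Keep only parameters belonging to the encoding layers.
    encoder_weights = {k: v for k, v in checkpoint.items()
                       if k.startswith(encoder_prefix)}
    missing, unexpected = model.load_state_dict(encoder_weights, strict=False)
    print(f"initialized {len(encoder_weights)} encoder tensors; "
          f"{len(missing)} missing, {len(unexpected)} unexpected keys")
```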