Synthetic QA Corpora Generation with Roundtrip Consistency

Alberti, Chris; Andor, Daniel; Pitler, Emily; Devlin, Jacob; Collins, Michael

doi:10.18653/v1/p19-1620

Cited by 158 publications

(162 citation statements)

References 14 publications

Supporting

Mentioning

159

Contrasting

Order By: Relevance

“…Further, we introduce a data filter to remove poorly generated examples and a mixing mini-batch training strategy to more effectively use the synthetic data. Similar methods have also been applied in some very recent concurrent works (Dong et al, 2019;Alberti et al, 2019) on SQuADv2.0. The main difference is that we also propose to generate new questions from existing articles without introducing new articles.…”

Section: Related Workmentioning

confidence: 97%

Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering

Zhang¹,

Bansal²

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

115

133

View full text Add to dashboard Cite

Text-based Question Generation (QG) aims at generating natural and relevant questions that can be answered by a given answer in some context. Existing QG models suffer from a "semantic drift" problem, i.e., the semantics of the model-generated question drifts away from the given context and answer. In this paper, we first propose two semantics-enhanced rewards obtained from downstream question paraphrasing and question answering tasks to regularize the QG model to generate semantically valid questions. Second, since the traditional evaluation metrics (e.g., BLEU) often fall short in evaluating the quality of generated questions, we propose a QA-based evaluation method which measures the QG model's ability to mimic human annotators in generating QA training data. Experiments show that our method achieves the new state-of-theart performance w.r.t. traditional metrics, and also performs best on our QA-based evaluation metrics. Further, we investigate how to use our QG model to augment QA datasets and enable semi-supervised QA. We propose two ways to generate synthetic QA pairs: generate new questions from existing articles or collect QA pairs from new articles. We also propose two empirically effective strategies, a data filter and mixing mini-batch training, to properly use the QG-generated data for QA. Experiments show that our method improves over both BiDAF and BERT QA baselines, even without introducing new articles. 1

show abstract

Section: Related Workmentioning

confidence: 97%

Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering

Zhang¹,

Bansal²

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

115

133

View full text Add to dashboard Cite

show abstract

“…Related work in the context of semi-supervised learning has focused on developing methods to generate synthetic training instances for different tasks (Sennrich et al, 2016;Hayashi et al, 2018;Alberti et al, 2019;Winata et al, 2019), in order to accelerate the learning process. Sennrich et al (2016) create artificial training instances for machine translation, using monolingual data paired with automatic back-translations.…”

Section: Discussionmentioning

confidence: 99%

“…For example, Hayashi et al (2018) augment the training data for attention-based end-to-end automatic speech recognition with synthetic instances, and Winata et al (2019) generate artificial training examples to improve automatic speech recognition on code-switching material. Alberti et al (2019) use a large number of synthetic instances to pre-train a Question Answering (QA) model that is then fine-tuned on the target QA dataset. Their approach results in significant improvements over models that are trained without the synthetic datapoints.…”

Section: Discussionmentioning

confidence: 99%

Active Learning via Membership Query Synthesis for Semi-Supervised Sentence Classification

Schumann¹,

Rehbein²

2019

Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

View full text Add to dashboard Cite

Active learning (AL) is a technique for reducing manual annotation effort during the annotation of training data for machine learning classifiers. For NLP tasks, pool-based and stream-based sampling techniques have been used to select new instances for AL while generating new, artificial instances via Membership Query Synthesis was, up to know, considered to be infeasible for NLP problems. We present the first successful attempt to use Membership Query Synthesis for generating AL queries for natural language processing, using Variational Autoencoders for query generation. We evaluate our approach in a text classification task and demonstrate that query synthesis shows competitive performance to pool-based AL strategies while substantially reducing annotation time.

show abstract

“…Note that we use this pre-trained model for experimental purpose, and it is not included in the final submission. In our experiments, we initialize the parameters of the encoding layers from the checkpoint 2 of the model (Alberti et al, 2019) namely BERT + N-Gram Masking + Synthetic Self-Training. The model is initialized from Whole Word Masking BERT (BERT wwm ), further fine-tuned on the SQuAD 2.0 task with synthetic generated question answering corpora.…”

Section: Pre-trained Modelsmentioning

confidence: 99%

D-NET: A Pre-Training and Fine-Tuning Framework for Improving the Generalization of Machine Reading Comprehension

Xi-yuan²,

Liu³

et al. 2019

Proceedings of the 2nd Workshop on Machine Reading for Question Answering

View full text Add to dashboard Cite

In this paper, we introduce a simple system Baidu submitted for MRQA (Machine Reading for Question Answering) 2019 Shared Task that focused on generalization of machine reading comprehension (MRC) models. Our system is built on a framework of pretraining and fine-tuning, namely D-NET. The techniques of pre-trained language models and multi-task learning are explored to improve the generalization of MRC models and we conduct experiments to examine the effectiveness of these strategies. Our system is ranked at top 1 of all the participants in terms of averaged F1 score. Our codes and models will be released at PaddleNLP 1 .

show abstract

Synthetic QA Corpora Generation with Roundtrip Consistency

Cited by 158 publications

References 14 publications

Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering

Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering

Active Learning via Membership Query Synthesis for Semi-Supervised Sentence Classification

D-NET: A Pre-Training and Fine-Tuning Framework for Improving the Generalization of Machine Reading Comprehension

Contact Info

Product

Resources

About