RACE: Large-scale ReAding Comprehension Dataset From Examinations

Lai, Guokun; Xie, Qizhe; Liu, Hanxiao; Yang, Yiming; Hovy, Eduard

doi:10.18653/v1/d17-1082

Cited by 783 publications

(736 citation statements)

References 19 publications

Supporting

Mentioning

733

Contrasting

Unclassified

“…In recent years, more and more large-scale RC datasets became available. These datasets focus on different types of RC tasks, such as cloze-style RC (Hermann et al, 2015; Hill et al, 2016), span-based RC with or without unanswerable questions (Rajpurkar et al, 2016, 2018) and multi-choice RC (Lai et al, 2017). Some tasks require the model to answer yes/no questions in addition to spans (Reddy et al, 2019).…”

Section: Related Work (mentioning)

confidence: 99%

Robust Machine Comprehension Models via Adversarial Training

Wang¹,

Bansal²

2018

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu

It is shown that many published models for the Stanford Question Answering Dataset (Rajpurkar et al., 2016) lack robustness, suffering an over 50% decrease in F1 score during adversarial evaluation based on the AddSent (Jia and Liang, 2017) algorithm. It has also been shown that retraining models on data generated by AddSent has limited effect on their robustness. We propose a novel alternative adversary-generation algorithm, AddSentDiverse, that significantly increases the variance within the adversarial training data by providing effective examples that punish the model for making certain superficial assumptions. Further, in order to improve robustness to AddSent's semantic perturbations (e.g., antonyms), we jointly improve the model's semantic-relationship learning capabilities in addition to our AddSentDiverse-based adversarial training data augmentation. With these additions, we show that we can make a state-of-the-art model significantly more robust, achieving a 36.5% increase in F1 score under many different types of adversarial evaluation while maintaining performance on the regular SQuAD task.

“…We evaluate our method on the representative datasets SQuAD1.1 (Rajpurkar et al, 2016), SQuAD2.0 (Rajpurkar et al, 2018) and RACE (Lai et al, 2017). The passages in SQuAD1.1 are retrieved from Wikipedia articles and the questions are crafted by crowd-workers.…”

Section: Datasets (mentioning)

confidence: 99%

Robust Machine Comprehension Models via Adversarial Training

Wang¹,

Bansal²

2018

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu

“…Recently, multiple datasets have been proposed for multi-hop QA, in which questions can only be answered when considering information from multiple sentences and/or documents (Khashabi et al, 2018a; Welbl et al, 2018; Mihaylov et al, 2018; Bauer et al, 2018; Dunn et al, 2017; Dhingra et al, 2017; Lai et al, 2017; Rajpurkar et al, 2018). The task of selecting justification sentences is complex for multi-hop QA, because of the additional knowledge aggregation requirement (examples of such questions and answers are shown in Figures 1 and 2).…”

Section: Introduction (mentioning)

confidence: 99%

Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering

Yadav

Bethard

Surdeanu

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

We propose an unsupervised strategy for the selection of justification sentences for multihop question answering (QA) that (a) maximizes the relevance of the selected sentences, (b) minimizes the overlap between the selected facts, and (c) maximizes the coverage of both question and answer. This unsupervised sentence selection method can be coupled with any supervised QA approach. We show that the sentences selected by our method improve the performance of a state-of-the-art supervised QA model on two multi-hop QA datasets: AI2's Reasoning Challenge (ARC) and Multi-Sentence Reading Comprehension (MultiRC). We obtain new state-of-the-art performance on both datasets among approaches that do not use external resources for training the QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1% EM0 on MultiRC. Our justification sentences have higher quality than the justifications selected by a strong information retrieval baseline, e.g., by 5.4% F1 in MultiRC. We also show that our unsupervised selection of justification sentences is more stable across domains than a state-of-the-art supervised sentence selection method.

“…In our dataset, in contrast, answering requires drawing inferences using knowledge not explicit in the text. Another recently published multiple choice dataset is RACE (Lai et al, 2017), which contains 100,000 questions on reading examination data. Rajpurkar et al (2016) have proposed the Stanford Question Answering Dataset (SQuAD), a data set of 100,000 questions on Wikipedia articles collected via crowdsourcing.…”

Section: Related Work (mentioning)

confidence: 99%

SemEval-2018 Task 11: Machine Comprehension Using Commonsense Knowledge

Ostermann¹,

Roth²,

Modi³

et al. 2018

Proceedings of the 12th International Workshop on Semantic Evaluation

This report summarizes the results of the SemEval 2018 task on machine comprehension using commonsense knowledge. For this machine comprehension task, we created a new corpus, MCScript. It contains a high number of questions that require commonsense knowledge for finding the correct answer. 11 teams from 4 different countries participated in this shared task, most of them used neural approaches. The best performing system achieves an accuracy of 83.95%, outperforming the baselines by a large margin, but still far from the human upper bound, which was found to be at 98%.
