Latent Retrieval for Weakly Supervised Open Domain Question Answering

Lee, Kenton; Chang, Ming‐Wei; Toutanova, Kristina

doi:10.18653/v1/p19-1612

Cited by 462 publications

(294 citation statements)

References 30 publications

Supporting

Mentioning

292

Contrasting

“…BM25-based methods remain to be the mainstream methods for document retrieval in industry. Previous work in open domain question answering has shown that BM25 is a difficult baseline to surpass when questions were written by workers who have prior knowledge of the answer (Lee et al, 2019a). We will leave more comprehensive comparisons against other learning-based methods to future work, since the main goal of this demo paper is to present the system along with its dataset.…”

Section: Results For Soco-qa Performance (mentioning)

confidence: 99%

Talk to Papers: Bringing Neural Question Answering to Academic Search

Zhao, Lee, 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce Talk to Papers, which exploits the recent open-domain question answering (QA) techniques to improve the current experience of academic search. It's designed to enable researchers to use natural language queries to find precise answers and extract insights from a massive amount of academic papers. We present a large improvement over classic search engine baseline on several standard QA datasets, and provide the community a collaborative data collection tool to curate the first natural language processing research QA dataset via a community effort.

“…has shown that using paragraphs as the unit of passage outperform sentences or documents. Lee et al (2019a) proposes a trainable first-stage retriever that improves the recall performance. Pipeline-based system often suffer from error propagation (Zhao and Eskenazi, 2016).…”

Section: Related Work (mentioning)

confidence: 99%

“…Knowledge Incorporation of knowledge into language models has shown promising results for downstream tasks, such as factual correct generation (Logan et al, 2019), commonsense knowledge graph construction (Bosselut et al, 2019), entity typing (Zhang et al, 2019) and etc. More recently, several works have shown that inclusion of learned mechanisms for explicit or implicit knowledge can lead to the state-of-the-art results in Question Answering (Guu et al, 2020; Karpukhin et al, 2020; Lee et al, 2019; Lewis et al, 2020) and dialogue modeling (Roller et al, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models

Xu, Patwary, Shoeybi, et al., 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Existing pre-trained large language models have shown unparalleled generative capabilities. However, they are not controllable. In this paper, we propose MEGATRON-CNTRL, a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base. Our framework consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. As we do not have access to groundtruth supervision for the knowledge ranker, we make use of weak supervision from sentence embedding. The empirical results show that our model generates more fluent, consistent, and coherent stories with less repetition and higher diversity compared to prior work on the ROC story dataset. We showcase the controllability of our model by replacing the keywords used to generate stories and re-running the generation process. Human evaluation results show that 77.5% of these stories are successfully controlled by the new keywords. Furthermore, by scaling our model from 124 million to 8.3 billion parameters we demonstrate that larger models improve both the quality of generation (from 74.5% to 93.0% for consistency) and controllability (from 77.5% to 91.5%).

“…Pseudoquery is a declarative sentence; it is different from the actual query, which is an interrogative sentence. ORQA, which uses learned ICT with pseudo-data to predict the context related to the query, performed better than the baseline model [43]. Pseudo-evidence consists of the surrounding sentences of the pseudo-query that are not the context that contains the information about query.…”

Section: Evidence Extraction (mentioning)

confidence: 96%

“…Here, the unsupervised Inverse Cloze Task (ICT) proposed by the Open Retrieval Question Answering System (ORQA) [43] is used to confirm the relevance of the paragraph and query. ICT is a task that finds related context for a sentence, which is the inverse of Cloze task [44].…”

Section: Evidence Extraction (mentioning)

confidence: 99%
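
As a rough illustration of the Inverse Cloze Task described in the two statements above, the sketch below builds one pseudo-query and pseudo-evidence pair from a passage. It is a minimal sketch under stated assumptions, not the ORQA implementation: the function name `make_ict_pair`, the pre-split list of sentences, and the choice to always remove the pseudo-query from its context are hypothetical simplifications made for this example.

```python
import random
from typing import List, Tuple


def make_ict_pair(sentences: List[str], rng: random.Random) -> Tuple[str, str]:
    """Build one (pseudo-query, pseudo-evidence) pair from a passage.

    Hypothetical helper for illustration only. `sentences` is assumed to be
    a passage that has already been split into sentences.
    """
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to form a pair")

    # Pick one sentence at random to act as the pseudo-query.
    idx = rng.randrange(len(sentences))
    pseudo_query = sentences[idx]

    # The surrounding sentences form the pseudo-evidence.
    # Assumption: the pseudo-query is always removed from its context.
    context = sentences[:idx] + sentences[idx + 1:]
    pseudo_evidence = " ".join(context)
    return pseudo_query, pseudo_evidence


# Usage example with a made-up passage.
passage = [
    "Zebras have four gaits.",
    "They are generally slower than horses.",
    "Their great stamina helps them outrun predators.",
]
query, evidence = make_ict_pair(passage, random.Random(0))
print("pseudo-query:   ", query)
print("pseudo-evidence:", evidence)
```

A retriever trained on such pairs would be asked to match each pseudo-query to the context it was taken from, which is the inverse of predicting a masked span from its context as in the Cloze task.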

Improved Machine Reading Comprehension Using Data Validation for Weakly Labeled Data

2020

Machine reading comprehension (MRC) is a natural language processing task wherein a given question is answered according to a holistic understanding of a given context. Recently, many researchers have shown interest in MRC, for which a considerable number of datasets are being released. Datasets for MRC, which are composed of the context-query-answer triple, are designed to answer a given query by referencing and understanding a readily-available, relevant context text. The TriviaQA dataset is a weakly labeled dataset, because it contains irrelevant context that forms no basis for answering the query. The existing syntactic data cleaning method struggles to deal with the contextual noise this irrelevancy creates. Therefore, a semantic data cleaning method using reasoning processes is necessary. To address this, we propose a new MRC model in which the TriviaQA dataset is validated and trained using a high-quality dataset. The data validation method in our MRC model improves the quality of the training dataset, and the answer extraction model learns with the validated training data, because of our validation method. Our proposed method showed a 4.33% improvement in performance for the TriviaQA Wiki, compared to the existing baseline model. Accordingly, our proposed method can address the limitation of irrelevant context in MRC better than the human supervision.

Index Terms: Computational and artificial intelligence, data validation, natural language processing, neural networks, machine reading comprehension, weak label.
