“…Question answering can also be interpreted as an exercise in verifying the knowledge of experts: answering trivia questions carefully crafted by someone who already knows the answer so that exactly one answer is correct, as in TriviaQA and Quizbowl/Jeopardy! questions (Ferrucci et al., 2010; Dunn et al., 2017; Joshi et al., 2017; Peskov et al., 2019). This information-verifying paradigm also describes reading comprehension datasets such as NewsQA (Trischler et al., 2017), SQuAD (Rajpurkar et al., 2016, 2018), CoQA (Reddy et al., 2019), and the multiple-choice RACE (Lai et al., 2017). The paradigm has been taken even further by biasing the distribution of questions toward especially hard-to-model examples, as in QAngaroo (Welbl et al., 2018), HotpotQA (Yang et al., 2018), and DROP (Dua et al., 2019).…”