2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639505

ODSQA: Open-Domain Spoken Question Answering Dataset

Abstract: Reading comprehension by machine has been widely studied, but machine comprehension of spoken content is still a less investigated problem. In this paper, we release the Open-Domain Spoken Question Answering Dataset (ODSQA) with more than three thousand questions. To the best of our knowledge, this is the largest real SQA dataset. On this dataset, we found that ASR errors have a catastrophic impact on SQA. To mitigate the effect of ASR errors, subword units are involved, which brings consistent improvements over all…
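A minimal sketch (not the paper's implementation) of why subword units can soften ASR errors: a misrecognized word often shares no word-level token with the reference transcript, yet still shares many of its character n-grams, so subword-level matching degrades more gracefully. The example phrase and simulated ASR output below are illustrative assumptions, not data from the dataset.

```python
# Illustrative sketch: word-level vs. character-n-gram ("subword") overlap
# between a reference phrase and a simulated ASR transcription of it.

def char_ngrams(text, n=3):
    """Return the list of character n-grams of a string (spaces removed)."""
    s = text.replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def overlap(a, b):
    """Fraction of items in sequence a that also appear anywhere in b."""
    if not a:
        return 0.0
    b_set = set(b)
    return sum(1 for x in a if x in b_set) / len(a)

ref = "spoken question answering"
asr = "spoke in question answer ring"   # simulated ASR errors

word_overlap = overlap(ref.split(), asr.split())
subword_overlap = overlap(char_ngrams(ref), char_ngrams(asr))

print(round(word_overlap, 2), round(subword_overlap, 2))  # → 0.33 0.86
```

Only one of three reference words survives the simulated errors intact, but most character trigrams do, which is the intuition behind subword-based mitigation.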


Cited by 32 publications (21 citation statements). References 38 publications.
“…The authors further propose subword unit sequence embedding based mitigation strategies. This work was further extended to the ODSQA dataset (Lee et al, 2018a), where the question is also given in speech et al, 2020), where the authors explore Spoken Conversational Question Answering (Spoken-CoQA). They used both speech and transcript in their feature vector embedding.…”
Section: Related Work
confidence: 99%
“…Recently, there has been a significant increase in the construction of extractive MRC datasets with formal written texts such as SQuAD [2], CNN/Daily Mail [1], CBT [28], NewsQA [29], TriviaQA [31], WIKIHOP [32], DRCD [37], and CMRC2018 [38]. There are also datasets of which reading texts are spoken language, such as ODSQA [33] and Spoken SQuAD [34] and conversation-based datasets [30], [35]. • In contrast to extractive MRC, abstractive MRC requires computers to generate answers or synthetic summaries because answers to such questions in abstractive MRC are usually not spans in the reading text.…”
Section: MRC Datasets
confidence: 99%
“…Intentional noise has been added to machine translation data [9,10]. Alternate methods for collecting large scale audio data include Generative Adversarial Networks [11] and manual recording [12].…”
Section: Spoken Question Answering Datasets
confidence: 99%