2019
DOI: 10.48550/arxiv.1909.07005
Preprint
KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension

Cited by 27 publications (38 citation statements)
References 0 publications
“…The DRCD [25] is a native Chinese QA dataset consisting of 30,000+ questions posed by annotators on 10,014 paragraphs extracted from 2,108 Wikipedia articles. KorQuAD [26] is a Korean QA dataset and PIAF [27] is a French QA dataset, consisting of 70,000+ and 3,835 question-answer pairs, respectively. Since large-scale QA datasets in languages other than English rarely exist and building native QA datasets is time- and cost-consuming, developing QA systems for these languages is challenging.…”
Section: B. Other Languages
confidence: 99%
“…

| Dataset / Paper | Ref | Year | Language | Type | Size |
|---|---|---|---|---|---|
| … | [7] | 2018 | English | Native | 150K+ |
| WikiQA: A challenge dataset for open-domain question answering | [8] | 2015 | English | Native | 3K+ |
| MS MARCO: A human generated machine reading comprehension dataset | [9] | 2016 | English | Native | 100K+ |
| Natural Questions: a benchmark for question answering research | [10] | 2019 | English | Native | 300K+ |
| QuAC: Question answering in context | [11] | 2018 | English | Native | 100K+ |
| CoQA: A conversational question answering challenge | [12] | 2019 | English | Native | 127K+ |
| NewsQA: A machine comprehension dataset | [13] | 2016 | English | Native | 100K+ |
| Constructing datasets for multi-hop reading comprehension across documents | [15] | 2018 | English | Native, Multi-hop | 50K+ |
| HotpotQA: A dataset for diverse, explainable multi-hop question answering | [16] | 2018 | English | Native, Multi-hop | 113K+ |
| Repartitioning of the ComplexWebQuestions dataset | [17] | 2018 | English | Native, Multi-hop | 63K+ |
| R4C: A benchmark for evaluating RC systems to get the right answer for the right reason | [18] | 2019 | English | Native, Multi-hop | 4K+ |
| Automatic Spanish translation of the SQuAD dataset for multilingual question answering | [19] | 2019 | Spanish | Translation | 100K+ |
| Neural Arabic question answering | [20] | 2019 | Arabic | Translation | 48K+ |
| Semi-supervised training data generation for multilingual question answering | [21] | 2018 | Korean | Translation | 81K+ |
| Neural learning for question answering in Italian | [22] | 2018 | Italian | Translation | 60K+ |
| SberQuAD-Russian reading comprehension dataset: Description and analysis | [24] | 2020 | Russian | Native | 50K+ |
| DRCD: a Chinese machine reading comprehension dataset | [25] | 2018 | Chinese | Native | 30K+ |
| KorQuAD 1.0: Korean QA dataset for machine reading comprehension | [26] | 2018 | Korean | Native | 70K+ |
| Project PIAF: Building a Native French Question-Answering Dataset | [27] | 2020 | French | Native | 3K+ |
| ParsiNLU: a suite of language understanding challenges for Persian | [34] | 2021 | Persian | Native | 1K+ |
| ParSQuAD: Persian Question Answering Dataset based on Machine Translation of SQuAD 2.0 | [33] | 2021 | Persian | Translation | 25K, 70K |

…”
Section: B. Other Languages
confidence: 99%
“…The schema of NSMC is similar to SST-2 (Socher et al, 2013). KorQuAD 1.0 (Lim et al, 2019) is a Korean machine reading comprehension dataset. It consists of 10,645 training passages with 66,181 training questions and 5,774 validation questions.…”
Section: Experimental Setting
confidence: 99%
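KorQuAD 1.0 follows the nested SQuAD JSON layout (articles, each holding paragraphs, each holding question-answer pairs), which is how the passage/question counts above are derived. A minimal sketch of that structure, using hypothetical placeholder strings rather than real KorQuAD data (actual passages are Korean Wikipedia text):

```python
# Minimal, hypothetical record in the SQuAD-style layout that KorQuAD 1.0
# follows. Field names ("data", "paragraphs", "qas", "answer_start") match
# the SQuAD convention; the string values are illustrative placeholders.
sample = {
    "version": "KorQuAD_v1.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Example passage text.",
                    "qas": [
                        {
                            "id": "q-0001",
                            "question": "Example question?",
                            # answer is a character span into "context"
                            "answers": [{"text": "Example", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

def count_questions(dataset: dict) -> int:
    """Count question-answer pairs across all articles and paragraphs."""
    return sum(
        len(para["qas"])
        for article in dataset["data"]
        for para in article["paragraphs"]
    )

print(count_questions(sample))  # → 1
```

Run over the full training file, the same traversal yields the 10,645 passages and 66,181 questions cited above.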
“…Another line of research uses multilingual models such as mBERT, XLM (Conneau and Lample, 2019) or XLM-RoBERTa (Conneau et al, 2020) trained on English QA datasets for zero-shot language transfer to the target domain. While these models perform astonishingly well on QA in unseen languages, they do not perform as well as they do on English QA (Lewis et al, 2020a; d'Hoffschmidt et al, 2020; Lim et al, 2019). Furthermore, multilingual models are much larger than their monolingual counterparts, rendering them unsuitable for most production systems where memory consumption and query speed matter.…”
Section: Introduction and Related Work
confidence: 99%
“…Research on non-English machine reading for question answering (QA) suffers from limited availability of annotated non-English datasets. With the English SQuAD dataset (Rajpurkar et al, 2018) as a role model, there are only a few resources of similar format, such as the French FQuAD (d'Hoffschmidt et al, 2020), the Korean KorQuAD (Lim et al, 2019), and the Russian SberQuAD (Efimov et al, 2020) datasets. As an alternative, there are machine-translated datasets for training (Lewis et al, 2020a).…”
Section: Introduction and Related Work
confidence: 99%