Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu 2018
DOI: 10.18653/v1/n18-3005

Data Collection for Dialogue System: A Startup Perspective

Abstract: Industrial dialogue systems such as Apple Siri and Google Assistant require large scale diverse training data to enable their sophisticated conversation capabilities. Crowdsourcing is a scalable and inexpensive data collection method, but collecting high quality data efficiently requires thoughtful orchestration of crowdsourcing jobs. Prior study of data collection process has focused on tasks with limited scope and performed intrinsic data analysis, which may not be indicative of impact on trained model perfo…

Cited by 26 publications (39 citation statements)
References 9 publications
“…We build on prior work employing online crowd workers to create data by paraphrasing. In particular, we refine the idea of iteratively asking for paraphrases, where each round prompts workers with sentences from the previous round, leading to more diverse data (Negri et al., 2012; Jiang et al., 2017; Kang et al., 2018). We also apply the idea of a multi-stage process, in which a second set of workers check paraphrases to ensure they are correct (Buzek et al., 2010; Burrows et al., 2013; Coucke et al., 2018).…”
Section: Data Collection; citation type: mentioning
confidence: 99%
“…To demonstrate this idea, we developed a novel crowdsourcing pipeline for data collection. Following prior work in crowdsourcing for dialog (Kang et al., 2018; Jiang et al., 2017), we ask crowd workers to write paraphrases of seed sentences with known intents and slot values. This provides linguistic diversity in our data in a way that is easily explained to workers.…”
Section: Application: Uniqueness-driven Data Collection; citation type: mentioning
confidence: 99%