2022
DOI: 10.48550/arxiv.2203.03399
Preprint

Building and curating conversational corpora for diversity-aware language science and technology

Abstract: We present a pipeline and tools to build a maximally natural data set of conversational interaction that covers 66 languages and varieties from 32 phyla. We describe the curation and compilation process moving from diverse language documentation corpora to a unified format and describe an open-source tool 'convo-parse' to help in quality control and assessment of conversational data. We conclude with two case studies of how diverse data sets can inform interactional linguistics and speech recognition technolog…

Citations: cited by 1 publication (2 citation statements)
References: 22 publications
“…Our focus is specifically on corpora of informal conversations among co-present participants, transcribed and time-aligned at the level of conversational turns. Details of our curation and analysis pipeline are described in the Appendix and in Liesenfeld & Dingemanse (2022). While it is impossible to exhaustively list or estimate the size of extant conversational corpora, the quality-controlled subset we consider here represents 63 languages from 26 language families (Figure 1), and amounts to over 800 hours of talk produced by over 11,000 participants, segmented into over 1.6 million turns (9.3 million words) (Figure 2).…”
Section: The Natural Habitat Of Language | mentioning
confidence: 99%
“…With regard to data usage, the corpora considered here are only those for which contributors have granted access openly or to all registered users, usually for research purposes. We cannot redistribute the dataset directly, but have strived to document the process of curation in sufficient detail to enable others to register and access the data (Liesenfeld and Dingemanse, 2022). With regard to technological applications, as with any technology, there is potential for helpful as well as harmful uses (Hovy and Spruit, 2016) and we side with Levow et al. (2021) in stressing the need for computational linguists to work closely with language communities in maximising helpful uses and minimising harmful ones.…”
Section: A Appendix | mentioning
confidence: 99%