Building and curating conversational corpora for diversity-aware language science and technology

Liesenfeld, Andreas; Dingemanse, Mark

doi:10.48550/arxiv.2203.03399

Cited by 1 publication

(2 citation statements)

References 22 publications

“…Our focus is specifically on corpora of informal conversations among co-present participants, transcribed and time-aligned at the level of conversational turns. Details of our curation and analysis pipeline are described in the Appendix and in Liesenfeld & Dingemanse (2022). While it is impossible to exhaustively list or estimate the size of extant conversational corpora, the quality-controlled subset we consider here represents 63 languages from 26 language families (Figure 1), and amounts to over 800 hours of talk produced by over 11.000 participants, segmented into over 1.6 million turns (9.3 million words) (Figure 2).…”

Section: The Natural Habitat Of Language (mentioning)

confidence: 99%

“…With regard to data usage, the corpora considered here are only those for which contributors have granted access openly or to all registered users, usually for research purposes. We cannot redistribute the dataset directly, but have strived to document the process of curation in sufficient detail to enable others to register and access the data (Liesenfeld and Dingemanse, 2022). With regard to technological applications, as with any technology, there is potential for helpful as well as harmful uses (Hovy and Spruit, 2016) and we side with Levow et al (2021) in stressing the need for computational linguists to work closely with language communities in maximising helpful uses and minimising harmful ones.…”

Section: A Appendix (mentioning)

confidence: 99%

From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology

Dingemanse, Mark; Liesenfeld, Andreas

2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Self Cite

Informal social interaction is the primordial home of human language. Linguistically diverse conversational corpora are an important and largely untapped resource for computational linguistics and language technology. Through the efforts of a worldwide language documentation movement, such corpora are increasingly becoming available. We show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure and social action, with implications for language technology, natural language understanding, and the design of conversational interfaces. Harnessing linguistically diverse conversational corpora will provide the empirical foundations for flexible, localizable, humane language technologies of the future.
