Interspeech 2020
DOI: 10.21437/interspeech.2020-3164

FT Speech: Danish Parliament Speech Corpus

Abstract: This paper introduces FT SPEECH, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data. We outline design considerations, including the preprocessing methods and the …

Cited by 9 publications
(9 citation statements)
References 16 publications
“…Identifying word embeddings requires much larger amounts of text than the corpora in our study provide. Although the genre of conversations arguably is not akin to that of online encyclopedias, no sufficiently largescale corpus of conversational Danish is yet available to create reliable word embeddings representations (Kirkedal et al, 2019; Strømberg-Derczynski et al, 2020). Each word was associated with the 300 values identifying its position in the 300-dimension vector space of the word, and we averaged word embeddings within each utterance.…”
Section: Statistical Analyses (mentioning)
confidence: 99%
“…Finally, compared to results achieved on the other parliament corpora in Table 1, our models perform in the same 5%-20% WER range. The authors of Croatian (Ljubešić et al, 2022), Czech (Kratochvil et al, 2020), Danish (Kirkedal et al, 2020), and Icelandic (Helgadóttir et al, 2017) Parliament Corpora each trained a TDNN acoustic model using Kaldi and combined it with an in-domain n-gram language model similar to our work. Their test set WER results for this model combination are 16.38%, 7.1%, 14.01%, and 16.38%.…”
Section: Analysis and Discussion (mentioning)
confidence: 99%
“…One of the earliest examples is the MediaParl Corpus for French and German spoken in the Swiss Valais Parliament by Imseng et al (2012). In recent years, public corpora based on parliament records has also been created for Icelandic (Helgadóttir et al, 2017), Bulgarian (Geneva et al, 2019), Danish (Kirkedal et al, 2020), Czech (Kratochvil et al, 2020), Swiss German (Plüss et al, 2020), Croatian (Ljubešić et al, 2022), and Norwegian (Solberg & Ortiz, 2022). Various event recordings from the European Parliament have also served as raw material for two multi-lingual corpora.…”
Section: Related Work (mentioning)
confidence: 99%
“…Three studies are concerned with the nature and timing of the creakiness that characterizes stød, and Kirkedal (2016) studied the importance of stød for automatic speech recognition.…”
Section: The Most Recent Studies 2015-2022 (mentioning)
confidence: 99%