2019
DOI: 10.1007/978-3-030-31372-2_16

Building an ASR Corpus Based on Bulgarian Parliament Speeches

Cited by 5 publications (5 citation statements)
References 9 publications
“…Three ASR systems are built by using three acoustic models: (1) baseline acoustic models obtained from generic out-of-domain datasets for Basque and Spanish; (2) acoustic models trained on the dataset $S^{(1)}_{\mathrm{train}}$ obtained after one iteration of the data extraction procedure using only those segments with alignment similarity ≥80% and (3) acoustic models trained on the dataset $S^{(2)}_{\mathrm{train}}$ obtained after a second iteration of the data extraction procedure using segments with the highest similarity amounting to the same duration as in (2). The average WER figures obtained on the tuning and evaluation subsets by the three ASR systems in cross-validation experiments are shown in Table 4, disaggregated per language.…”
Section: Training Data Extraction (mentioning)
confidence: 99%
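
The two selection rules described in the statement above (a fixed similarity threshold in the first iteration, then a duration-matched pick of the most similar segments in the second) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the `Segment` type, the function names, and the representation of similarity as a fraction between 0 and 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical record for one aligned audio segment.
    seg_id: str
    duration: float    # length in seconds
    similarity: float  # alignment similarity as a fraction in [0, 1]


def select_by_threshold(segments, min_similarity=0.80):
    """First-iteration rule: keep segments with similarity >= min_similarity."""
    return [s for s in segments if s.similarity >= min_similarity]


def select_by_duration(segments, target_duration):
    """Second-iteration rule: take the most similar segments until the
    accumulated duration reaches target_duration.

    The last segment added may push the total slightly past the target.
    """
    selected, total = [], 0.0
    for seg in sorted(segments, key=lambda s: s.similarity, reverse=True):
        if total >= target_duration:
            break
        selected.append(seg)
        total += seg.duration
    return selected
```

In use, the duration budget for the second call would be the total duration of the segments kept by the first call, so that both training sets are comparable in size.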
“…are suitably covered due to their commercial interest, while languages spoken by few people or lacking the support of governments struggle to be even considered by major technological giants. This issue is not new and has been addressed in two different ways: (1) by fostering the production of language (spoken and text) resources, many of them from parliamentary speeches [2][3][4][5][6][7][8]; and (2) by leveraging the resources produced for other languages, e.g., by adjusting (finetuning) models or systems trained on multilingual data [9,10]. In the case of Basque, to compensate for the lack of interest of private companies, efforts have focused on producing data.…”
Section: Introduction (mentioning)
confidence: 99%
“…One of the earliest examples is the MediaParl Corpus for French and German spoken in the Swiss Valais Parliament by Imseng et al. (2012). In recent years, public corpora based on parliament records has also been created for Icelandic (Helgadóttir et al., 2017), Bulgarian (Geneva et al., 2019), Danish (Kirkedal et al., 2020), Czech (Kratochvil et al., 2020), Swiss German (Plüss et al., 2020), Croatian (Ljubešić et al., 2022), and Norwegian (Solberg & Ortiz, 2022). Various event recordings from the European Parliament have also served as raw material for two multi-lingual corpora.…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to compile a corpus of utterances more akin to rapid spontaneous speech, we follow the recent trend of converting open parliamentary data into ASR speech corpora. This has been accomplished for languages such as Icelandic [7], Finnish [8], and Bulgarian [9]. In addition, a multilingual speech corpus has been constructed from the debates held in the European Parliament [10].…”
Section: Related Work (mentioning)
confidence: 99%