2014 17th Oriental Chapter of the International Committee for the Co-Ordination and Standardization of Speech Databases and Ass 2014
DOI: 10.1109/icsda.2014.7051423

Using closely-related language to build an ASR for a very under-resourced language: Iban

Abstract: This paper describes our work on automatic speech recognition system (ASR) for an under-resourced language, namely the Iban language, which is spoken in Sarawak, a Malaysian Borneo state. To begin this study, we collected 8 hours of speech data due to no resources yet for ASR concerning this language. Following the lack of resources, we employed bootstrapping techniques on a closely-related language to build the Iban system. For this case, we utilized Malay data to bootstrap the grapheme-to-phoneme system (G2P…

Cited by 5 publications (5 citation statements)
References 18 publications

“…Using language resources from high-resourced languages for the recognition of an under-resourced language is common practice [35][36][37]. Being an under-resourced language, Frisian also lacks adequate speech data to train acoustic models that can provide accurate enough recognition.…”
Section: Multilingual Training (mentioning)
confidence: 99%
“…We used data sets for four widely spoken low-resource languages, Fongbe (Laleye et al, 2016), Wolof (Gauthier et al, 2016), Swahili (Gelas et al, 2012), and Iban (Juan et al, 2014), which were previously released as ASR corpora. They include segmented audio with corresponding transcripts, as well as additional written texts for training the language model (see Table 1 for details).…”
Section: Data Descriptions (mentioning)
confidence: 99%
“…Languages were chosen from the CMU Wilderness dataset given availability of additional data in a different setting, and include several language families as well as more closely-related challenge pairs such as Javanese and Sundanese. These included data from the Common Voice project (CV; Ardila et al, 2020) which is read speech typically recorded using built-in laptop microphones; radio news data (SLR24; Juan et al, 2014, 2015); crowd-sourced recordings using portable electronics (SLR35, SLR36; Kjartansson et al, 2018); cleanly recorded microphone data (SLR64, SLR65, SLR66, SLR79; He et al, 2020); and a collection of recordings from varied sources (SS; Shukla, 2020). Table 1 shows the task languages and their data sources for evaluation splits for the robust language identification task.…”
Section: Provided Data (mentioning)
confidence: 99%