Formosa Speech Recognition Challenge 2020 and Taiwanese Across Taiwan Corpus

Liao, Yuan-Fu; Chang, Chia‐Yu; Tiun, Hak-Khiam; Su, Huey‐Jen; Khoo, Hui-Lu; Tsay, Jane S.; Tan, Le-kun; Kang, Peter B.; Thiann, Tsun-guan; Iunn, Un-Gian; Yang, Jyh-Her; Liang, Chih-Neng

doi:10.1109/o-cocosda50338.2020.9295019

Cited by 9 publications

(6 citation statements)

References 1 publication

“…The Hokkien ASR is pre-trained on 10k-hr Mandarin speech from WenetSpeech and 2k-hr Hokkien speech, which is a combination of TAT (480hr), Hokkien dramas (1k-hr) and SpeechOcean (597-hr), with Conformer wav2vec 2.0 LARGE model. We then finetuned the model with CTC loss on 480-hr Hokkien speech and Tâi-lô scripts from TAT (Liao et al, 2020), with each Tâi-lô syllable split into initial and final with tone as the finetuning target. To further improve the ASR accuracy, we apply another round of self-training by generating pseudo labels on the same set of Hokkien speech used in speech encoder pre-training.…”

Section: Discussion (mentioning)

confidence: 99%

“…Since there are not many En↔Hokkien bilingual speakers who can directly translate between the two languages, we use Mandarin as a pivot language during the data creation process whenever possible. We sample from the following data sources and adopt different strategies to create human annotated parallel data: (1) Hokkien dramas, which include Hokkien speech and aligned Mandarin subtitles, (2) Taiwanese Across Taiwan (TAT) (Liao et al, 2020), a Hokkien read speech dataset containing transcripts in Tâi-lô and Hanji, and (3) MuST-C v1.2 En-Zh S2T data (Cattoni et al, 2021).…”

Section: Supervised Human Annotated Data (mentioning)

confidence: 99%

Speech-to-Speech Translation For A Real-world Unwritten Language

Chen, Tran, Yang et al., 2022. Preprint.

We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focuses on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling, we take advantage of recent advances in applying self-supervised discrete representations as target for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field.

“…NSYSU-MITLab participated in the Formosa Speech Recognition Challenge 2020 (FSR-2020), which focused on the low-resource language Taiwanese (Taiwanese Hokkien) [99].…”

Section: DNN-based Approach To Build Acoustic Model (mentioning)

confidence: 99%

Frontier Research on Low-Resource Speech Recognition Technology

Slam, Li, Urouvas, 2023. Sensors.

With the development of continuous speech recognition technology, users have put forward higher requirements in terms of speech recognition accuracy. Low-resource speech recognition, as a typical speech recognition technology under restricted conditions, has become a research hotspot nowadays because of its low recognition rate and great application value. Under the premise of low-resource speech recognition technology, this paper reviews the research status of feature extraction and acoustic models, and conducts research on resource expansion. Especially in terms of the technical challenges faced by this technology, solutions are proposed, and future research directions are prospected.

“…Taiwanese Hokkien, also known as Taiwanese, Hokkien, Taigi, Southern Min, or Min-Nan, is a branched-off variety of Southern Min dialects popular in Taiwan. Under the history background (Chen, 2008), the ability to use Taiwanese Hokkien declines by age (Chen, 2008; Liao et al, 2020; Tan, 2019; of Linguistics at Academia Sinica, 2007; Yang, 2021; Pan, 2016; Ho, 2020). Taiwanese Hokkien has always been the most widely spoken dialect in Taiwan, many people can have conversations in both Mandarin and Taiwanese Hokkien.…”

Section: Background Of Taiwanese Hokkien (mentioning)

confidence: 99%

“…Although Mandarin is the dominant language in Taiwan, Taiwanese Hokkien has nearly as many speakers as Mandarin (Liao et al, 2020). Taiwanese tend to mix dialects and Mandarin in daily communication, creating code-mixed languages such as Taiwanese Hokkien-Mandarin or Hakka-Mandarin.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

Lu, Bo-Han, Lü et al., 2023. Preprint.

In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often have a scarcity of resources and the lack of an official writing system, limiting the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset to mitigate the limitation, overcome the morphological issue under the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train the XLM (cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.
