Investigating End-to-end Speech Recognition for Mandarin-English Code-switching

Shan, Changhao; Weng, Chao; Wang, Guangsen; Su, Dan; Luo, Mingzhang; Yu, Dong; Xie, Lei

doi:10.1109/icassp.2019.8682850

Cited by 65 publications

(50 citation statements)

References 20 publications


“…A multilingual ASR approach does not need an additional LID module to identify speech segments since language information is incorporated directly into the system [1]. One technique is to use a linguistic knowledge-based method to establish a multilingual phone set mapping or clustering of similar phonetic features that share the training data [7]. Common examples are the International Phonetic Alphabet (IPA), Speech Assessment Methods Phonetic Alphabet (SAMPA) and Wordbet [15].…”

Section: Related Work (mentioning)

confidence: 99%

“…The IPA-based phoneme set, and data-driven phoneme set contained 38 phonemes, excluding the silent phonemes. In this case, to train our multilingual acoustic model that effectively handled Sepedi-English code-switched speech, we adopted the technique used by Biswas et al [6], Shan et al [7] and Bhuvanagiri and Kopparapu [8]. Lastly, problematic words of Sepedi or English origin were manually reviewed for correct pronunciation prior to training the HMMs.…”

Section: Multilingual Dictionary and Phoneme Set (mentioning)

confidence: 99%

“…There are two ASR approaches that are reported to handle code-switched speech [1], [6], [7]. The first approach employs two monolingual ASR systems and an LID module.…”

Section: Introduction (mentioning)

confidence: 99%


Phone Clustering Methods for Multilingual Language Identification

Mabokela

2020

Computer Science & Information Technology (CS & IT)


This paper proposes phoneme clustering methods for multilingual language identification (LID) on a mixed-language corpus. A one-pass multilingual automated speech recognition (ASR) system converts spoken utterances into phone sequences. Hidden Markov models were employed to train multilingual acoustic models that handle multiple languages within an utterance. Two phoneme clustering methods were explored to derive the most appropriate phoneme similarities between the target languages. Ultimately, a supervised machine learning technique was employed to learn the language transitions from the phonotactic information, using support vector machine (SVM) models to classify phoneme occurrences. The system performance was evaluated on a mixed-language speech corpus for two South African languages (Sepedi and English) using the phone error rate (PER) and LID classification accuracy separately. We show that the multilingual ASR output, which is fed directly to the LID system, has a direct impact on LID accuracy. Our proposed system achieved acceptable phone recognition and classification accuracy in mixed-language speech and monolingual speech (i.e. either Sepedi or English). Data-driven and knowledge-driven phoneme clustering methods improve ASR and LID for code-switched speech. The data-driven method obtained a PER of 5.1% and an LID classification accuracy of 94.5% when the acoustic models were trained with 64 Gaussian mixtures per state.
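The abstract reports results as a phone error rate (PER). As a rough illustration only, PER is commonly computed as the Levenshtein (edit) distance between the reference and hypothesized phone sequences, divided by the reference length. The sketch below assumes that standard definition; the function names and the example phone sequences are hypothetical and are not taken from the cited paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """PER = edit distance / number of reference phones (assumed standard definition)."""
    if not ref:
        raise ValueError("reference phone sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical phone sequences: one substitution (p -> b) and one deletion (i).
reference = ["s", "e", "p", "e", "d", "i"]
hypothesis = ["s", "e", "b", "e", "d"]
per = phone_error_rate(reference, hypothesis)
```

The same computation over word sequences gives the word error rate, which is why the two metrics are often reported side by side.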


“…In the very first work [23], Seki et al explored an E2E ASR system for code-switching task on an artificially created dataset obtained by concatenating the monolingual utterances. In contrast, Shan et al [24] employed a real Mandarin-English code-switching dataset for developing the attention-based E2E ASR system. For improving the ASR performance, the multi-task learning (MTL) framework involving the language identification (LID) [25] was employed.…”

Section: Introduction (mentioning)

confidence: 99%

Exploration of End-to-End Framework for Code-Switching Speech Recognition Task: Challenges and Enhancements

Sreeram; Sinha

2020

IEEE Access


The end-to-end (E2E) framework has emerged as a viable alternative to conventional hybrid systems in the automatic speech recognition (ASR) domain. Unlike the monolingual case, the challenges faced by an E2E system in the code-switching ASR task include (i) the expansion of the target set to account for the multiple languages involved, (ii) the requirement of a robust target-to-word (T2W) transduction, and (iii) the need for more effective context modeling. In this paper, we aim to address those challenges for reliable training of the E2E ASR system on a limited amount of code-switching data. The main contribution of this work lies in the E2E target set reduction by exploiting the acoustic similarity and the proposal of a novel context-dependent T2W transduction scheme. Additionally, a novel textual feature has been proposed to enhance the context modeling in the case of code-switching data. The experiments are performed on a recently created Hindi-English code-switching corpus. For contrast purposes, the existing combined target set based system is also evaluated. The proposed system outperforms the existing one and yields a target error rate of 18.1% along with a word error rate of 29.79%.

INDEX TERMS: Code-switching, speech recognition, end-to-end system, factored language model, target-to-word transduction.


“…This includes techniques specifically targeting the acoustic model [1,2] and the language model [4,5,6] to handle code-mixing in speech. Apart from these cascaded ASR systems, there is also recent work on using end-to-end systems trained on multilingual data to recognize CM speech [7,8,9,10,11,12]. Leveraging monolingual sentences for code-mixed language models has been extensively studied in prior work [13,14,15,16,17].…”

Section: Introduction (mentioning)

confidence: 99%

Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition

Taneja; Guha; Jyothi; et al.

2019

Interspeech 2019


One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Often, however, monolingual speech corpora are available in abundance for the languages in the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the span lengths corresponding to a particular language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
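To make the span-length idea in this abstract concrete, here is a heavily simplified sketch. It only estimates how long each language's spans tend to be in language-tagged code-mixed text, then samples an alternating plan of span lengths that monolingual segments could be cut to fit. It ignores the phone-transition distributions at switch boundaries that the abstract also mentions, assumes exactly two languages, and uses hypothetical function names and toy data; it is not the authors' implementation.

```python
import random
from collections import Counter, defaultdict
from itertools import groupby


def span_length_counts(tagged_sentences):
    """Count, per language, how often each run length of consecutive same-language words occurs."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        for lang, run in groupby(tags):
            counts[lang][len(list(run))] += 1
    return counts


def sample_span_plan(counts, start_lang, n_spans, rng):
    """Sample alternating (language, span length) pairs from the empirical length distributions."""
    langs = sorted(counts)
    if len(langs) != 2 or start_lang not in counts:
        raise ValueError("this sketch assumes exactly two languages, including start_lang")
    other = langs[0] if langs[1] == start_lang else langs[1]
    plan = []
    lang = start_lang
    for _ in range(n_spans):
        lengths = list(counts[lang].keys())
        weights = list(counts[lang].values())
        length = rng.choices(lengths, weights=weights, k=1)[0]
        plan.append((lang, length))
        lang = other if lang == start_lang else start_lang  # switch language at each boundary
    return plan


# Toy per-word language tags for two hypothetical code-mixed sentences.
tagged = [["hi", "hi", "en", "hi"], ["en", "en", "en", "hi", "hi"]]
counts = span_length_counts(tagged)
plan = sample_span_plan(counts, "hi", 4, random.Random(0))
```

Each entry in `plan` would then be filled with a monolingual segment of the sampled length, so the concatenated utterance follows the span statistics of the real code-mixed text.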
