This paper presents a deep-learning based assessment method for spoken computer-assisted language learning (CALL) for non-native child speakers, which follows a data-driven rather than a rule-based approach. In particular, we focus on the spoken CALL assessment task of the 2017 SLaTE challenge. To this end, the proposed method consists of four main steps: speech recognition, meaning feature extraction, grammar feature extraction, and deep-learning based assessment. First, speech recognition is performed on an input utterance using three automatic speech recognition (ASR) systems. Second, twenty-seven meaning features are extracted from the texts recognized by the three ASR systems using language models (LMs), sentence-embedding models, and word-embedding models. Third, twenty-two grammar features are extracted from the text recognized by one ASR system using linear-order LMs and hierarchical-order LMs. Fourth, the forty-nine extracted features are fed into a fully-connected deep neural network (DNN) based model for classification into acceptance or rejection. Finally, an assessment is made by comparing the probability at an output unit of the DNN-based classifier with a predefined threshold. For the spoken CALL assessment experiments, we use English utterances spoken by Swiss German teenagers. The experiments show that the spoken CALL assessment system employing the proposed method achieves a D score of 4.37.
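As a hedged illustration of the last two steps, the sketch below implements a fully-connected accept/reject classifier over the forty-nine features in PyTorch; the layer sizes, activations, and the 0.5 threshold are illustrative assumptions, not the configuration reported in the paper.

# Minimal sketch of the DNN-based accept/reject step described above.
# Hidden sizes and the 0.5 threshold are assumptions for illustration;
# the paper specifies only a fully-connected DNN over the 49
# (27 meaning + 22 grammar) features.
import torch
import torch.nn as nn

class CallAssessor(nn.Module):
    def __init__(self, n_features: int = 49, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # probability of "accept"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def assess(model: CallAssessor, features: torch.Tensor, threshold: float = 0.5) -> bool:
    # Accept the utterance if the output probability exceeds the threshold.
    with torch.no_grad():
        prob = model(features).item()
    return prob > threshold

# Usage: feed the 49 extracted meaning/grammar features of one utterance
# (random features here, purely for illustration).
model = CallAssessor()
decision = assess(model, torch.rand(1, 49))
print("accept" if decision else "reject")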
This letter proposes a more advanced joint maximum a posteriori (MAP) adaptation using a prior model based on a probabilistic scheme that utilizes the bilinear transformation (BIT) concept. The proposed method not only has scalable parameters but is also based on a single prior distribution, without the heuristic parameters of the previous joint BIT-MAP method. Experimental results show that, irrespective of the amount of adaptation data, the proposed method yields a consistent improvement over the previous method.
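For readers unfamiliar with MAP adaptation, the generic criterion the letter builds on can be written as below; the specific BIT-based prior is the letter's contribution and only its role is sketched here, as an assumption about where it enters the objective.

\begin{equation}
  \hat{\lambda}_{\mathrm{MAP}}
    = \arg\max_{\lambda} \, p(\lambda \mid O)
    = \arg\max_{\lambda} \, p(O \mid \lambda)\, p(\lambda),
\end{equation}

where $O$ denotes the adaptation data, $\lambda$ the model parameters, and the prior $p(\lambda)$, here derived from the probabilistic BIT scheme, regularizes the estimate when adaptation data are scarce.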
and the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology (KAIST), South Korea, in 1996 and 2001, respectively, all in electronic engineering. Since 2001, he has been with the Electronics and Telecommunications Research Institute (ETRI), where he is currently a Principal Researcher with the Visual Intelligence Research Section. At ETRI, he is developing artificial intelligence technologies for self-growing multimodal knowledge graphs. His research interests include multimodal knowledge representation, knowledge graph completion, and self-growing intelligent agents.
In this paper, a novel method for speaker adaptation using a bilinear model is proposed. The bilinear model can independently express the characteristics of speakers (style) and of phonemes across speakers (content) in a training database. The mapping from the speaker and phoneme spaces to the observation space is carried out using a bilinear mapping matrix that is independent of both spaces. We apply the bilinear model to speaker adaptation: using adaptation data from a new speaker, a speaker-adapted model is built by estimating the style (speaker)-specific matrix. Experimental results showed that the proposed method outperformed eigenvoice and MLLR. In vocabulary-independent isolated word recognition with 50 adaptation words, the bilinear model reduced the word error rate by about 38% compared to eigenvoice and about 10% compared to MLLR.
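To make the style/content factorization concrete, here is a minimal NumPy sketch of a bilinear model and a least-squares estimate of a new speaker's style vector; the dimensions and the plain least-squares fit are illustrative assumptions, since the paper estimates the style-specific parameters within an HMM-based recognizer.

# Minimal NumPy sketch of a bilinear model: each observation is the
# bilinear mapping tensor W contracted with a style (speaker) vector
# and a content (phoneme) vector. All sizes are illustrative.
import numpy as np

D, I, J = 39, 5, 10   # observation dim, style dim, content dim
rng = np.random.default_rng(0)

W = rng.standard_normal((D, I, J))  # mapping tensor, independent of speaker and phoneme
a = rng.standard_normal(I)          # style vector of one speaker
B = rng.standard_normal((J, 20))    # content vectors for 20 phoneme units

# Synthesis: observations for each phoneme unit of this speaker.
Y = np.einsum('dij,i,jn->dn', W, a, B)

# Adaptation: given observations from a new speaker, estimate the style
# vector by least squares, keeping W and the content vectors fixed.
a_true = rng.standard_normal(I)
Y_new = np.einsum('dij,i,jn->dn', W, a_true, B)
M = np.einsum('dij,jn->dni', W, B).reshape(-1, I)  # design matrix over (d, n)
a_hat, *_ = np.linalg.lstsq(M, Y_new.reshape(-1), rcond=None)
print(np.allclose(a_hat, a_true))  # True: style recovered from adaptation data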