In this paper, alternative dynamic features for speech recognition are proposed. The goal of this work is to improve speech recognition accuracy by deriving the representation of distinctive dynamic characteristics from a speech spectrum. This work was inspired by two temporal dynamics of a speech signal. One is the highly non‐stationary nature of speech, and the other is the inter‐frame change of a speech spectrum. We adopt the use of a sub‐frame spectrum analyzer to capture very rapid spectral changes within a speech analysis frame. In addition, we attempt to measure spectral fluctuations of a more complex manner as opposed to traditional dynamic features such as delta or double‐delta. To evaluate the proposed features, speech recognition tests over smartphone environments were conducted. The experimental results show that the feature streams simply combined with the proposed features are effective for an improvement in the recognition accuracy of a hidden Markov model–based speech recognizer.
This paper presents a deep-learning based assessment method of a spoken computer-assisted language learning (CALL) for a non-native child speaker, which is performed in a data-driven approach rather than in a rule-based approach. Especially, we focus on the spoken CALL assessment of the 2017 SLaTE challenge. To this end, the proposed method consists of four main steps: speech recognition, meaning feature extraction, grammar feature extraction, and deep-learning based assessment. At first, speech recognition is performed on an input speech using three automatic speech recognition (ASR) systems. Second, twenty-seven meaning features are extracted from the recognized texts via the three ASRs using language models (LMs), sentence-embedding models, and wordembedding models. Third, twenty-two grammar features are extracted from the recognized text via one ASR system using linear-order LMs and hierarchical-order LMs. Fourth, the extracted forty-nine features are fed into a full-connected deep neural network (DNN) based model for the classification of acceptance or rejection. Finally, an assessment is performed by comparing the probability of a output unit of the DNN-based classifier with a predefined threshold. For the experiments of a spoken CALL assessment, we use English spoken utterances by Swiss German teenagers. It is shown from the experiments that the D score is 4.37 for the spoken CALL assessment system employing the proposed method.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.