In this paper we experiment with methods based on Deep Belief Networks (DBNs) to recover measured articulatory data from speech acoustics. Our acoustic-to-articulatory mapping (AAM) processes go through multi-layered and hierarchical (i.e., deep) representations of the acoustic and the articulatory domains obtained through unsupervised learning of DBNs. The unsupervised learning of DBNs can serve two purposes: (i) pre-training of the multi-layer perceptrons that perform AAM; (ii) transformation of the articulatory domain that is recovered from acoustics through AAM. The recovered articulatory features are combined with MFCCs to compute phone posteriors for phone recognition. Tested on the MOCHA-TIMIT corpus, the recovered articulatory features, when combined with MFCCs, lead to a remarkable relative phone error reduction of up to 16.6% with respect to a phone recognizer that only uses MFCCs.
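A minimal sketch of the pipeline described above, under simplifying assumptions: frame-level MFCCs and aligned articulatory trajectories (as in MOCHA-TIMIT) are assumed to be available as arrays, and a plain MLP regressor stands in for the DBN-pretrained network used in the paper. All variable names are illustrative placeholders, not the authors' code.

```python
# Sketch: acoustic-to-articulatory mapping (AAM) + phone classification
# on combined MFCC + recovered articulatory features.
# Assumes `mfcc` (n_frames, n_mfcc), `ema` (n_frames, n_artic) and frame-level
# `labels` are aligned; random data is used here only to keep the sketch runnable.
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
mfcc = rng.randn(2000, 13)          # placeholder acoustic frames
ema = rng.randn(2000, 14)           # placeholder articulatory frames
labels = rng.randint(0, 40, 2000)   # placeholder frame-level phone labels

X_tr, X_te, y_tr, y_te, l_tr, l_te = train_test_split(
    mfcc, ema, labels, test_size=0.2, random_state=0)

# 1) Acoustic-to-articulatory regressor (stand-in for the DBN-pretrained MLP).
aam = MLPRegressor(hidden_layer_sizes=(300, 300), max_iter=200, random_state=0)
aam.fit(X_tr, y_tr)

# 2) Recover articulatory features from acoustics and append them to the MFCCs.
feat_tr = np.hstack([X_tr, aam.predict(X_tr)])
feat_te = np.hstack([X_te, aam.predict(X_te)])

# 3) Frame-level phone classifier on the combined feature vectors.
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=200, random_state=0)
clf.fit(feat_tr, l_tr)
print("frame accuracy:", clf.score(feat_te, l_te))
```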
Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, recognition systems have classically relied solely on acoustic data, achieving rather good performance under optimal listening conditions. The limitations of current ASR become most evident in the realistic use of such systems. These limitations can be partly reduced by using normalization strategies that minimize inter-speaker variability by either explicitly removing speakers' peculiarities or adapting different speakers to a reference model. In this paper we aim to model a motor-based imitation learning mechanism in ASR. We tested the utility of a speaker normalization strategy that uses motor representations of speech and compared it with strategies that ignore the motor domain. Specifically, we first trained a regressor through state-of-the-art machine learning techniques to build an auditory-motor mapping, in a sense mimicking a human learner who tries to reproduce utterances produced by other speakers. This auditory-motor mapping maps the speech acoustics of a speaker into the motor plans of a reference speaker. Since only speech acoustics are available during recognition, the mapping is necessary to "recover" motor information. Subsequently, in a phone classification task, we tested the system either on one of the speakers used during training or on a new one. Results show that in both cases the motor-based speaker normalization strategy slightly but significantly outperforms all other strategies in which only acoustic information is taken into account.
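A minimal sketch of the motor-based normalization idea, assuming that each training speaker's acoustic frames are paired with the motor (articulatory) frames of a single reference speaker for the same utterances. A plain MLP regressor and classifier stand in for the machine learning techniques used in the paper, and all variable names are hypothetical placeholders.

```python
# Sketch: auditory-motor mapping used as a speaker normalization front end.
# `acoustic_spk` holds acoustics from training speakers, `motor_ref` the aligned
# motor plans of the reference speaker; random data keeps the sketch runnable.
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

rng = np.random.RandomState(1)
acoustic_spk = rng.randn(3000, 13)     # acoustics from the training speakers
motor_ref = rng.randn(3000, 14)        # aligned motor plans of the reference speaker
phone_labels = rng.randint(0, 40, 3000)

# Auditory-motor mapping: a speaker's acoustics -> reference speaker's motor plans.
auditory_motor = MLPRegressor(hidden_layer_sizes=(300, 300), max_iter=200,
                              random_state=1).fit(acoustic_spk, motor_ref)

# Phone classifier trained in the normalized (recovered motor) space.
motor_hat = auditory_motor.predict(acoustic_spk)
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=200,
                    random_state=1).fit(motor_hat, phone_labels)

# At recognition time only acoustics are available: frames from a (possibly new)
# speaker are first mapped into the reference motor space, then classified.
acoustic_new_speaker = rng.randn(500, 13)
pred = clf.predict(auditory_motor.predict(acoustic_new_speaker))
```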