Speech emotion recognition (SER) is a key technology for enabling more natural human-machine communication. However, SER has long suffered from a lack of large-scale public labeled datasets. To address this problem, we investigate how unsupervised representation learning on unlabeled datasets can benefit SER. We show that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled data, which improve emotion recognition performance. In our experiments, this method achieved state-of-the-art concordance correlation coefficient (CCC) performance for all emotion primitives (activation, valence, and dominance) on IEMOCAP. Additionally, on the MSP-Podcast dataset, our method obtained considerable performance improvements compared to baselines.
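As a rough illustration of the kind of pretraining objective CPC uses, the sketch below implements an encoder, an autoregressive context network, and an InfoNCE loss in PyTorch. The feature dimensions, layer choices, and prediction horizon are assumptions for illustration only, not the configuration used in the paper.

```python
# Minimal CPC-style pretraining sketch (assumption: PyTorch; sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPC(nn.Module):
    def __init__(self, n_mels=64, z_dim=256, c_dim=256, k=4):
        super().__init__()
        self.encoder = nn.Sequential(            # frame-level encoder producing z_t
            nn.Conv1d(n_mels, z_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, 3, padding=1), nn.ReLU(),
        )
        self.ar = nn.GRU(z_dim, c_dim, batch_first=True)   # context network c_t
        self.predictors = nn.ModuleList(                    # one projection per future step
            [nn.Linear(c_dim, z_dim) for _ in range(k)])
        self.k = k

    def forward(self, x):                        # x: (batch, n_mels, frames)
        z = self.encoder(x).transpose(1, 2)      # (batch, frames, z_dim)
        c, _ = self.ar(z)                        # (batch, frames, c_dim)
        loss = 0.0
        for step, proj in enumerate(self.predictors, start=1):
            pred = proj(c[:, :-step])            # predict z_{t+step} from c_t
            target = z[:, step:]
            # InfoNCE: each prediction must identify its own future frame
            # among the other frames of the same utterance (negatives).
            logits = torch.einsum('btd,bsd->bts', pred, target)
            labels = torch.arange(logits.size(1), device=x.device)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                labels.repeat(logits.size(0)))
        return loss / self.k
```

After pretraining on unlabeled speech, the encoder and context outputs would serve as input features to a downstream regressor for activation, valence, and dominance.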
Several speech processing and audio data-mining applications rely on a description of the acoustic environment as a feature vector for classification. The discriminative properties of the feature domain play a crucial role in the effectiveness of these methods. In this work, we consider three environment identification tasks as well as the task of acoustic model selection for speech recognition. A set of acoustic parameters and machine learning algorithms for feature selection are used, and the resulting feature domains are analyzed for each task. In our experiments, a classification accuracy of 100% is achieved for the majority of tasks, and the Word Error Rate for Automatic Speech Recognition is reduced by 20.73 percentage points when using the resulting domains. Experimental results indicate a significant dissimilarity in the parameter choices for the composition of the domains, which highlights the importance of the feature selection process for individual applications.
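The abstract does not name a specific selection algorithm, so the following is only one plausible pipeline of the kind described: cross-validated recursive feature elimination over a set of candidate acoustic parameters using scikit-learn. The estimator and the candidate parameter names are hypothetical.

```python
# Hypothetical feature-selection sketch (assumption: scikit-learn RFECV with a
# random-forest estimator; not the algorithms or parameters used in the paper).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

def select_features(X, y, feature_names):
    """X: (n_clips, n_acoustic_params) matrix, y: environment labels."""
    selector = RFECV(
        estimator=RandomForestClassifier(n_estimators=200, random_state=0),
        step=1,
        cv=StratifiedKFold(5),
        scoring="accuracy",
    )
    selector.fit(X, y)
    kept = [n for n, keep in zip(feature_names, selector.support_) if keep]
    return kept, selector

# Hypothetical usage with a handful of candidate acoustic parameters:
# kept, sel = select_features(X, y, ["T60", "DRR", "C50", "EDT", "spectral_centroid"])
```

Running such a pipeline per task would yield a task-specific feature domain, mirroring the paper's observation that different tasks favour different parameter subsets.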
Acoustic channels are typically described by their Acoustic Impulse Response (AIR) as a Moving Average (MA) process. Such AIRs are often considered in terms of their early and late parts, describing discrete reflections and the diffuse reverberation tail respectively. We propose an approach for constructing a sparse parametric model for the early part. The model aims to reduce the number of parameters needed to represent the early part while allowing the MA coefficients that describe it to be reconstructed from the representation. It represents the reflections arriving at the receiver as delayed copies of an excitation signal. The Times-Of-Arrival (TOAs) of the reflections are not restricted to integer sample instants, and a dynamically estimated model for the excitation sound is used. We also present a corresponding parameter estimation method based on regularized regression and nonlinear optimization. The proposed method also serves as an analysis tool, since the estimated parameters can be used to estimate the room geometry, the mixing time, and other channel properties. Experiments involving simulated and measured AIRs are presented, in which the AIR coefficient reconstruction-error energy does not exceed 11.4% of the energy of the original AIR coefficients. The results also indicate dimensionality reduction figures exceeding 90% when compared to an MA process representation.
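To make the modeling idea concrete, the sketch below reconstructs the early AIR part as a sum of scaled copies of an excitation signal placed at possibly fractional TOAs, using sinc interpolation for the fractional delay. The interpolation scheme and variable names are assumptions for illustration, not the authors' exact formulation or estimator.

```python
# Sketch of early-AIR reconstruction from a sparse {TOA, gain} parameterization
# (assumption: sinc-based fractional delay; illustrative only).
import numpy as np

def fractional_delay(signal, delay, length):
    """Delay `signal` by a non-integer number of samples via sinc interpolation."""
    n = np.arange(length)
    out = np.zeros(length)
    for i, s in enumerate(signal):
        out += s * np.sinc(n - delay - i)
    return out

def reconstruct_early_air(excitation, toas, gains, length):
    """Sum of scaled copies of the excitation signal at (possibly fractional) TOAs."""
    h = np.zeros(length)
    for tau, beta in zip(toas, gains):
        h += beta * fractional_delay(excitation, tau, length)
    return h

# Hypothetical usage: three reflections arriving at non-integer sample instants.
# h_early = reconstruct_early_air(excitation, toas=[12.3, 20.7, 33.1],
#                                 gains=[1.0, 0.6, 0.4], length=512)
```

The inverse problem, estimating the TOAs, gains, and excitation model from a measured AIR, is what the paper's regularized-regression and nonlinear-optimization procedure addresses.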
Using speech to interact with electronic devices and access services is becoming increasingly common. Using such applications in our households poses new challenges for speech and audio processing algorithms, as these applications should perform robustly in a variety of scenarios. Media devices are very commonly present in such scenarios and can interfere with the user-device communication by contributing to the noise or simply by being mistaken for user-issued voice commands. Detecting the presence of media sounds in the environment can help avoid such issues. In this work we propose a method for this task based on a parallel CNN-GRU-FC classifier architecture, which relies on multi-channel information to discriminate between media and live sources. Experiments performed using 378 hours of in-house audio recordings collected by volunteers show an F1 score of 71% with a recall of 72% in detecting active media sources. Using information from multiple channels gave a 16% relative improvement in F1 score compared to using information from only a single channel.
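One way to read "parallel CNN-GRU-FC" is a per-channel convolutional-recurrent branch whose outputs are fused by fully connected layers. The PyTorch sketch below follows that reading; the branch structure, layer sizes, and number of channels are assumptions, not the deployed configuration.

```python
# Sketch of a parallel CNN-GRU-FC media detector over multi-channel features
# (assumption: PyTorch, one branch per channel, concatenation before FC layers).
import torch
import torch.nn as nn

class Branch(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), hidden, batch_first=True)

    def forward(self, x):                        # x: (batch, n_mels, frames)
        f = self.cnn(x.unsqueeze(1))             # (batch, 32, n_mels/4, frames)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (batch, frames, features)
        _, h = self.gru(f)
        return h[-1]                             # last hidden state of the branch

class MediaDetector(nn.Module):
    def __init__(self, n_channels=4, hidden=128):
        super().__init__()
        self.branches = nn.ModuleList([Branch(hidden=hidden) for _ in range(n_channels)])
        self.fc = nn.Sequential(
            nn.Linear(n_channels * hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):                        # x: (batch, channels, n_mels, frames)
        pooled = [b(x[:, c]) for c, b in enumerate(self.branches)]
        return self.fc(torch.cat(pooled, dim=1))  # media-presence logit
```

Fusing the per-channel embeddings in the FC layers is what lets the model exploit spatial cues that a single-channel classifier cannot see, consistent with the reported multi-channel gain.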
Reverberation is present in our workplaces, our homes, concert halls, and theatres. This paper investigates how deep learning can exploit the effect of reverberation on speech to classify a recording according to the room in which it was recorded. Existing approaches in the literature rely on domain expertise to manually select acoustic parameters as inputs to classifiers. Estimation of these parameters from reverberant speech is adversely affected by estimation errors, impacting the classification accuracy. To overcome the limitations of previously proposed methods, this paper shows how deep neural networks (DNNs) can perform the classification by operating directly on reverberant speech spectra, and a CRNN with an attention mechanism is proposed for the task. The relationship between the reverberant speech representations learned by the DNNs and acoustic parameters is also investigated. For evaluation, acoustic impulse responses (AIRs) measured in 7 real rooms from the ACE Challenge dataset are used. The classification accuracy of the CRNN classifier in the experiments is 78% when using 5 hours of training data and 90% when using 10 hours.
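The sketch below shows a generic CRNN with attention pooling over reverberant speech spectra, of the kind the abstract describes: convolutional feature extraction, a recurrent layer over frames, and attention-weighted temporal pooling before the room classifier. The layer sizes and the simple additive attention are assumptions, not necessarily the paper's exact architecture.

```python
# Sketch of a CRNN with attention pooling for room classification
# (assumption: PyTorch; illustrative dimensions for 7 ACE-style rooms).
import torch
import torch.nn as nn

class RoomCRNN(nn.Module):
    def __init__(self, n_freq=161, n_rooms=7, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.rnn = nn.GRU(32 * (n_freq // 16), hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)         # frame-level attention scores
        self.out = nn.Linear(2 * hidden, n_rooms)

    def forward(self, spec):                         # spec: (batch, n_freq, frames)
        f = self.cnn(spec.unsqueeze(1))              # (batch, 32, n_freq/16, frames)
        f = f.permute(0, 3, 1, 2).flatten(2)         # (batch, frames, features)
        h, _ = self.rnn(f)                           # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)       # attention weights over frames
        pooled = (w * h).sum(dim=1)                  # weighted temporal pooling
        return self.out(pooled)                      # room logits
```

Because the network operates directly on spectra, it avoids the intermediate step of estimating acoustic parameters from reverberant speech and the errors that step introduces.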