Automatic recognition of spontaneous emotion in conversational speech is an important yet challenging problem. In this paper, we propose a deep neural network model to track continuous emotion changes in the two-dimensional arousal-valence space by combining inputs from raw waveform signals and spectrograms, both of which have been shown to be useful for emotion recognition. The architecture combines a set of convolutional neural network (CNN) layers with bidirectional long short-term memory (BLSTM) layers to capture both temporal and spectral variation and to model conversational context. Experimental results for predicting valence and arousal on the SEMAINE and RECOLA databases show that, by exploiting waveforms and spectrograms as input, the proposed model significantly outperforms models that use hand-engineered features. We also compare the effects of waveforms vs. spectrograms and find that waveforms are better at capturing arousal, while spectrograms are better at capturing valence; combining information from both inputs improves performance further.
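The abstract does not specify layer sizes or how the two input streams are fused, so the following is only a minimal sketch of the described CNN+BLSTM architecture: separate convolutional front-ends for the raw waveform and the spectrogram, concatenation of the two feature streams, and a BLSTM with a frame-wise regression head for arousal and valence. The class name, kernel widths, channel counts, and concatenation-based fusion are all illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class WaveSpecEmotionNet(nn.Module):
    """Illustrative dual-input CNN + BLSTM regressor for arousal/valence.
    Layer sizes and fusion strategy are assumptions, not the paper's."""

    def __init__(self, spec_bins=128, hidden=128):
        super().__init__()
        # 1-D convolutions over the raw waveform (single input channel);
        # kernel/stride roughly match 25 ms windows / 10 ms hop at 16 kHz.
        self.wave_cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
        )
        # Convolutions across time over spectrogram frames
        # (frequency bins act as input channels).
        self.spec_cnn = nn.Sequential(
            nn.Conv1d(spec_bins, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        # BLSTM over the fused feature sequence models temporal context.
        self.blstm = nn.LSTM(128, hidden, bidirectional=True,
                             batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)  # per-frame (arousal, valence)

    def forward(self, wave, spec):
        # wave: (batch, samples); spec: (batch, spec_bins, frames)
        w = self.wave_cnn(wave.unsqueeze(1))   # (batch, 64, Tw)
        s = self.spec_cnn(spec)                # (batch, 64, Tf)
        # Pool both streams to a common frame rate before fusing.
        T = min(w.size(-1), s.size(-1))
        w = nn.functional.adaptive_avg_pool1d(w, T)
        s = nn.functional.adaptive_avg_pool1d(s, T)
        x = torch.cat([w, s], dim=1).transpose(1, 2)  # (batch, T, 128)
        out, _ = self.blstm(x)
        return self.head(out)                  # (batch, T, 2)
```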
Acoustic word embeddings have proven useful in query-by-example keyword search. Such embeddings are typically trained to distinguish the same word from different words using exact orthographic representations, so two different words will have dissimilar embeddings even if they are pronounced similarly or share the same stem. However, in real-world applications such as keyword search in low-resource languages, models are expected to find all derived and inflected forms of a given keyword. In this paper, we address this mismatch by incorporating linguistic information when training neural acoustic word embeddings. We propose two linguistically informed methods for training these embeddings, both of which outperform state-of-the-art models on the Switchboard dataset under metrics that credit non-exact matches. We also present results on Sinhala showing that models trained on English can be transferred directly to embed spoken words in a very different language with high accuracy.
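The abstract does not describe the two proposed methods, so the sketch below shows just one plausible way to inject linguistic information into such training: a triplet loss whose margin shrinks with the linguistic similarity between the anchor and negative words, so that derived and inflected forms are pushed less far apart than unrelated words. The function name, the `ling_sim` signal (e.g. normalized edit distance or a shared-stem flag), and the `base_margin` value are hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F

def similarity_weighted_triplet_loss(anchor, positive, negative,
                                     ling_sim, base_margin=0.4):
    """Triplet loss with a linguistically weighted margin (illustrative).

    anchor, positive, negative: (batch, dim) acoustic word embeddings,
        where positive shares the anchor's word label and negative does not.
    ling_sim: (batch,) similarity in [0, 1] between the anchor and negative
        word labels, e.g. normalized edit distance over orthography or stems.
    """
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    # Related word pairs (high ling_sim) get a smaller margin, so inflected
    # or derived forms are allowed to remain close in embedding space.
    margin = base_margin * (1 - ling_sim)
    return F.relu(d_pos - d_neg + margin).mean()
```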