Many people, and most machines, are inherently unable to interpret socio-affective cues such as tone of voice. Thoughtful adoption of intelligent technologies may improve such conversations. Since direct communication often takes place on edge devices, where an additional network connection cannot be guaranteed, we describe a real-time processing method that captures and evaluates emotions in speech directly on a terminal device such as a Raspberry Pi. In this article, we also survey the current state of research on speech emotion recognition. We examine audio files from five widely used emotional speech databases and visualize them in situ as dB-scaled Mel spectrograms using TensorFlow and Matplotlib. The audio files are transformed with the fast Fourier transform to generate the spectrograms. For classification, a support vector machine kernel and a CNN with transfer learning are selected, reaching accuracies of 70% and 77%, respectively; these are good values considering that the algorithms run on an edge device rather than on a server. On a Raspberry Pi, evaluating the emotion in a speech sample with machine learning and producing the corresponding visualization of the speaker's emotional state took less than one second.
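As a rough illustration of the preprocessing step described above, the following Python sketch converts a WAV file into a dB-scaled Mel spectrogram with TensorFlow and renders it with Matplotlib. The frame length, hop size, Mel-band count, and file names are illustrative assumptions, not parameters reported in the article.

```python
import tensorflow as tf
import matplotlib.pyplot as plt

def mel_spectrogram_db(wav_path, n_mels=64, frame_length=1024, frame_step=256):
    """Load a mono WAV file and return a dB-scaled Mel spectrogram (assumed parameters)."""
    audio_bytes = tf.io.read_file(wav_path)
    waveform, sample_rate = tf.audio.decode_wav(audio_bytes, desired_channels=1)
    waveform = tf.squeeze(waveform, axis=-1)
    sr = int(sample_rate)

    # Short-time Fourier transform (FFT-based) -> magnitude spectrogram.
    stft = tf.signal.stft(waveform, frame_length=frame_length,
                          frame_step=frame_step, fft_length=frame_length)
    magnitude = tf.abs(stft)

    # Project the linear-frequency bins onto a Mel filter bank.
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=magnitude.shape[-1],
        sample_rate=sr,
        lower_edge_hertz=20.0,
        upper_edge_hertz=sr / 2.0)
    mel = tf.matmul(tf.square(magnitude), mel_matrix)

    # Convert power to decibels.
    mel_db = 10.0 * tf.math.log(mel + 1e-10) / tf.math.log(10.0)
    return mel_db.numpy().T  # shape: (n_mels, time frames)

if __name__ == "__main__":
    spec = mel_spectrogram_db("sample.wav")  # hypothetical input file
    plt.imshow(spec, origin="lower", aspect="auto", cmap="magma")
    plt.xlabel("Frame"); plt.ylabel("Mel bin"); plt.colorbar(label="dB")
    plt.savefig("mel_spectrogram.png", dpi=150)
```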
Humans, as intricate beings driven by a multitude of emotions, possess a remarkable ability to decipher and respond to socio-affective cues. However, many individuals and machines struggle to interpret such nuanced signals, including variations in tone of voice. This paper explores the potential of intelligent technologies to bridge this gap and improve the quality of conversations. In particular, the authors propose a real-time processing method that captures and evaluates emotions in speech using a terminal device such as the Raspberry Pi. The authors also provide an overview of the current research landscape in speech emotion recognition and describe their methodology, which involves analyzing audio files from well-known emotional speech databases. To aid comprehension, they present in situ visualizations of these audio files as dB-scaled Mel spectrograms generated with TensorFlow and Matplotlib. A support vector machine kernel and a convolutional neural network with transfer learning are used to classify emotions. The classification accuracies achieved are 70% and 77%, respectively, demonstrating the efficacy of the approach when executed on an edge device rather than relying on a server. The system can evaluate the emotion in speech and provide a corresponding visualization of the speaker's emotional state in less than one second on a Raspberry Pi. These findings pave the way for more effective and emotionally intelligent human-machine interactions across various domains.
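For the classification stage, the sketch below shows how a transfer-learning CNN over spectrogram images could be assembled in Keras. The article does not name the pre-trained backbone, so MobileNetV2, the input size, the dropout rate, and the number of emotion classes are assumptions made only for illustration; the SVM baseline mentioned in the abstract could analogously be an `sklearn.svm.SVC` with an RBF kernel over flattened or pooled spectrogram features.

```python
import tensorflow as tf

NUM_EMOTIONS = 7  # placeholder: depends on the label set shared by the databases

def build_transfer_model(input_shape=(224, 224, 3)):
    """CNN with transfer learning: a frozen ImageNet backbone plus a small emotion head.
    MobileNetV2 is an assumption; the article does not specify the pre-trained network."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False,
        weights="imagenet", pooling="avg")
    base.trainable = False  # only the classification head is trained

    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the backbone keeps the number of trainable parameters small, which is what makes fine-tuning and sub-second inference plausible on a device such as the Raspberry Pi.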