Abstract-In this paper, an audio emotion recognition system is proposed that uses a mixture of rule-based and machine learning techniques to improve recognition accuracy along the audio path. The audio path is designed using a combination of prosodic input features (pitch, log-energy, zero crossing rate, and Teager energy operator) and spectral features (Mel-scale frequency cepstral coefficients). Mel-frequency cepstral coefficient (MFCC) feature extraction is a leading approach for speech feature extraction, and current research aims to identify performance enhancements. After MFCC feature extraction, the features are passed to three parallel sub-paths, each applying feature extraction and classification techniques (i.e., BDPCA+LSLDA+RBF). In addition, Naïve Bayes and SVM classifiers are combined with BDPCA and LSLDA for the evaluation of emotion. The extracted audio features are passed into an audio feature-level fusion module that uses a set of rules to determine the most likely emotion contained in the audio signal. The performance of the proposed audio path and of the final system is evaluated on standard databases of audio clips extracted from video.

Keywords-Emotion recognition, audio-visual processing, rule-based, machine learning, multimodal system
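As a concrete illustration of the front-end described above, the following is a minimal sketch of how the prosodic features (pitch, log-energy, zero crossing rate, Teager energy operator) and the MFCCs could be extracted for one clip. The choice of librosa as the toolkit, the frame parameters, n_mfcc=13, and the mean/std summary statistics are illustrative assumptions and are not taken from the paper.

```python
"""Sketch of the prosodic + spectral audio front-end (assumed toolkit: librosa)."""
import numpy as np
import librosa


def teager_energy(x):
    """Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def extract_features(path, sr=16000, n_mfcc=13):
    """Return one fixed-length feature vector per audio clip."""
    y, sr = librosa.load(path, sr=sr)

    # Spectral features: frame-level MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Prosodic features.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)   # pitch contour
    f0 = f0[~np.isnan(f0)]                                  # keep voiced frames only
    log_energy = np.log(librosa.feature.rms(y=y) + 1e-10)   # log-energy
    zcr = librosa.feature.zero_crossing_rate(y)              # zero crossing rate
    teo = teager_energy(y)                                    # Teager energy operator

    # Summarise every frame-level stream by its mean and standard deviation.
    stats = lambda v: [np.mean(v), np.std(v)]
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        stats(f0 if f0.size else np.zeros(1)),
        stats(log_energy), stats(zcr), stats(teo),
    ])
```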
I. INTRODUCTION

Emotion recognition is an automated process for identifying the affective state of a person and has gained increasing attention from researchers in the human-computer interaction (HCI) field for applications such as automotive safety, gaming, mental-health diagnosis in military service, customer service, etc. Over the decades, considerable research effort has been devoted to audio-visual emotion recognition. In the literature, three main approaches can be broadly distinguished: (i) audio-based approaches, (ii) visual-based approaches, and (iii) audio-visual approaches. Initial works focused on treating the audio and visual modalities separately.

Audio-based emotion recognition efforts extract and recognize the emotional states contained in the human speech signal. An important issue is the selection of salient features for discriminating the different emotions. Two types of features have been found useful for recognizing emotion in speech: prosodic and spectral features. Commonly used prosodic features include pitch and energy, while a commonly used spectral representation is the Mel-scale frequency cepstral coefficients (MFCC). Although prosodic features are used in many works, some researchers have demonstrated the usefulness of spectral features for speech emotion recognition. Subsequent work further investigated combining different types of features, such as prosodic and spectral features, for audio-based emotion recognition.

Visual-based emotion recognition efforts extract and recognize the emotional states contained in human facial expressions. An example is a recent work by Tawari & Trivedi which used a representation of image sequen...
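To make the classification stage of the proposed audio path more concrete, the following is a rough sketch of three parallel sub-paths over the combined prosodic and spectral features, followed by fusion of their decisions. It is not the paper's implementation: scikit-learn's PCA and LDA stand in for BDPCA and LSLDA, an MLP stands in for the RBF network, and the paper's (unspecified) rule set is replaced by a simple majority vote; X is assumed to be a matrix of per-clip feature vectors and y the emotion labels.

```python
"""Sketch of three parallel sub-paths plus decision fusion (stand-in components)."""
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_sub_paths(n_components=20):
    """One (reduction + classifier) pipeline per sub-path."""
    return {
        # MLP with one hidden layer as a stand-in for the RBF network.
        "rbf_net": make_pipeline(PCA(n_components), LDA(),
                                 MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)),
        "naive_bayes": make_pipeline(PCA(n_components), LDA(), GaussianNB()),
        "svm": make_pipeline(PCA(n_components), LDA(), SVC(kernel="rbf")),
    }


def fuse_predictions(paths, X):
    """Majority vote over the sub-paths (placeholder for the paper's rule set)."""
    votes = np.stack([p.predict(X) for p in paths.values()])
    fused = []
    for column in votes.T:
        labels, counts = np.unique(column, return_counts=True)
        fused.append(labels[np.argmax(counts)])
    return np.array(fused)


# Usage: fit each sub-path on training data, then fuse decisions on test data.
# paths = build_sub_paths()
# for p in paths.values():
#     p.fit(X_train, y_train)
# y_pred = fuse_predictions(paths, X_test)
```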