Abstract-In this paper, an audio emotion recognition system is proposed that uses a mixture of rule-based and machine learning techniques to improve recognition accuracy along the audio path. The audio path is designed using a combination of prosodic input features (pitch, log-energy, zero crossing rate, and Teager energy operator) and spectral features (Mel-scale frequency cepstral coefficients). Mel-frequency cepstral coefficient (MFCC) feature extraction is a leading approach for speech feature extraction, and current research aims to identify performance enhancements. After MFCC feature extraction, the features are passed to three parallel sub-paths, each applying feature extraction and classification techniques (i.e., BDPCA+LSLDA+RBF). In addition, Naïve Bayes and SVM classifiers are combined with BDPCA and LSLDA for the evaluation of emotion. The extracted audio features are passed into an audio feature-level fusion module that uses a set of rules to determine the most likely emotion contained in the audio signal. The performance of the proposed audio path and of the final system is evaluated on standard databases of audio clips extracted from video.

Keywords-Emotion recognition, audio-visual processing, rule-based, machine learning, multimodal system
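As a concrete illustration of the front-end described above, the following is a minimal sketch of how the prosodic features (pitch, log-energy, zero crossing rate, Teager energy operator) and the MFCCs could be extracted for one clip. The choice of librosa as the toolkit, the frame parameters, n_mfcc=13, and the mean/std summary statistics are illustrative assumptions and are not taken from the paper.

```python
"""Sketch of the prosodic + spectral audio front-end (assumed toolkit: librosa)."""
import numpy as np
import librosa


def teager_energy(x):
    """Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def extract_features(path, sr=16000, n_mfcc=13):
    """Return one fixed-length feature vector per audio clip."""
    y, sr = librosa.load(path, sr=sr)

    # Spectral features: frame-level MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Prosodic features.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)   # pitch contour
    f0 = f0[~np.isnan(f0)]                                  # keep voiced frames only
    log_energy = np.log(librosa.feature.rms(y=y) + 1e-10)   # log-energy
    zcr = librosa.feature.zero_crossing_rate(y)              # zero crossing rate
    teo = teager_energy(y)                                    # Teager energy operator

    # Summarise every frame-level stream by its mean and standard deviation.
    stats = lambda v: [np.mean(v), np.std(v)]
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        stats(f0 if f0.size else np.zeros(1)),
        stats(log_energy), stats(zcr), stats(teo),
    ])
```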
I. INTRODUCTION

Emotion recognition is an automated process for identifying the affective state of a person and has gained increasing attention from researchers in the human-computer interaction (HCI) field for applications such as automotive safety, gaming, mental-health diagnosis in military service, customer service, etc. Over the decades, considerable research effort has been devoted to audio-visual emotion recognition. In the literature, three main approaches can be broadly distinguished: (i) audio-based approaches, (ii) visual-based approaches, and (iii) audio-visual approaches. Initial works focused on treating the audio and visual modalities separately.

Audio-based emotion recognition efforts extract and recognize the emotional states contained in the human speech signal. An important issue is the selection of salient features for discriminating the different emotions. Two types of features have been found useful for recognizing emotion in speech: prosodic and spectral features. Commonly used prosodic features include pitch and energy, while a commonly used spectral representation is the Mel-scale frequency cepstral coefficients (MFCC). Although prosodic features are used in many works, some researchers have demonstrated the usefulness of spectral features for speech emotion recognition. Subsequent work further investigated combining different types of features, such as prosodic and spectral features, for audio-based emotion recognition.

Visual-based emotion recognition efforts extract and recognize the emotional states contained in human facial expressions. An example is a recent work by Tawari & Trivedi which used a representation of image sequen...
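To make the classification stage of the proposed audio path more concrete, the following is a rough sketch of three parallel sub-paths over the combined prosodic and spectral features, followed by fusion of their decisions. It is not the paper's implementation: scikit-learn's PCA and LDA stand in for BDPCA and LSLDA, an MLP stands in for the RBF network, and the paper's (unspecified) rule set is replaced by a simple majority vote; X is assumed to be a matrix of per-clip feature vectors and y the emotion labels.

```python
"""Sketch of three parallel sub-paths plus decision fusion (stand-in components)."""
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_sub_paths(n_components=20):
    """One (reduction + classifier) pipeline per sub-path."""
    return {
        # MLP with one hidden layer as a stand-in for the RBF network.
        "rbf_net": make_pipeline(PCA(n_components), LDA(),
                                 MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)),
        "naive_bayes": make_pipeline(PCA(n_components), LDA(), GaussianNB()),
        "svm": make_pipeline(PCA(n_components), LDA(), SVC(kernel="rbf")),
    }


def fuse_predictions(paths, X):
    """Majority vote over the sub-paths (placeholder for the paper's rule set)."""
    votes = np.stack([p.predict(X) for p in paths.values()])
    fused = []
    for column in votes.T:
        labels, counts = np.unique(column, return_counts=True)
        fused.append(labels[np.argmax(counts)])
    return np.array(fused)


# Usage: fit each sub-path on training data, then fuse decisions on test data.
# paths = build_sub_paths()
# for p in paths.values():
#     p.fit(X_train, y_train)
# y_pred = fuse_predictions(paths, X_test)
```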