Deep convolutional neural networks (CNNs) for face recognition now achieve performance comparable to human ability when a sufficient amount of labelled training data is available. However, training CNNs remains an arduous task when training samples are scarce. To overcome this drawback, one-shot learning aims to improve on traditional machine learning approaches by learning representative information about data categories from only a few training samples. In this context, the Siamese convolutional network (SiConvNet) provides an attractive deep architecture for tackling data limitation. However, applying the convolution operation to real-world images with a trainable Gaussian-like kernel introduces correlations into the output feature maps, and the blurring caused by the kernel hinders recognition. As a result, pixel-wise and channel-wise correlations (redundancies) can appear both within a single feature map and across the multiple feature maps produced by a hidden layer. Convolution-based models therefore fail to generalize the feature representation, because strong correlations between neighboring pixels and high redundancy across the channels of the feature maps hamper effective training. The deconvolution operation overcomes these shortcomings, which limit conventional SiConvNet performance, by successfully learning a correlation-free feature representation. In this paper, a simple but efficient Siamese convolution-deconvolution feature fusion network (SiCoDeF²Net) is proposed to learn the invariant and discriminative complementary features generated by (i) a sub-convolutional network (SCoNet) and (ii) a sub-deconvolutional network (SDeNet), fused through a concatenation operation, which significantly improves one-shot unconstrained face recognition. Extensive experiments on several widely used benchmarks yield promising results: the proposed SiCoDeF²Net significantly outperforms the current state of the art in classification accuracy, F1 score, precision, and recall. The code will be available at: https://github.com/purbayankar/SiCoDeF2Net.
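The abstract describes a Siamese twin whose embedding concatenates a convolutional and a deconvolutional (transposed-convolution) branch. The following is a minimal PyTorch sketch of that fusion idea; the layer counts, channel widths, and input size are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the SCoNet/SDeNet concatenation fusion described above.
# All layer hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn

class SCoNet(nn.Module):
    """Sub-convolutional branch: an ordinary conv feature extractor."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (B, 64)

class SDeNet(nn.Module):
    """Sub-deconvolutional branch: transposed convolutions, intended to
    learn less spatially correlated features than plain convolution."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.ConvTranspose2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (B, 64)

class SiCoDeF2NetBranch(nn.Module):
    """One Siamese twin: concatenates the conv and deconv embeddings,
    giving complementary features as in the abstract."""
    def __init__(self):
        super().__init__()
        self.conv_branch = SCoNet()
        self.deconv_branch = SDeNet()

    def forward(self, x):
        return torch.cat([self.conv_branch(x), self.deconv_branch(x)], dim=1)

# Siamese usage: one weight-shared branch embeds both face images; a
# distance between the embeddings drives the one-shot verification decision.
branch = SiCoDeF2NetBranch()
img_a, img_b = torch.randn(4, 3, 105, 105), torch.randn(4, 3, 105, 105)
dist = torch.pairwise_distance(branch(img_a), branch(img_b))  # shape (4,)
```

Sharing one branch across both inputs is the standard Siamese design choice: both images are mapped into the same embedding space, so their distance is directly comparable.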
Emotion Recognition in Conversations (ERC) is crucial for developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. Due to the inherent characteristics of these modalities, however, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on the text of a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor that obtains latent features from the audio and visual modalities; it is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets, setting a new state-of-the-art performance in ERC.
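Two mechanisms from this abstract lend themselves to a short sketch: the multi-head attention fusion of modality features and the adaptive margin triplet loss. The PyTorch code below illustrates both under stated assumptions: the query/key/value roles (text attending over audio and visual context), the dimensions, and in particular the margin-adaptation rule are all hypothetical, not the exact M2FNet formulation.

```python
# Hedged sketch of attention-based multi-modal fusion and an adaptive-margin
# triplet loss, in the spirit of the abstract. Roles, dimensions, and the
# margin rule are assumptions for illustration only.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Text features attend over concatenated audio/visual context via
    multi-head attention, with a residual connection."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio, visual):
        # text/audio/visual: (B, T, D) utterance-level modality features
        context = torch.cat([audio, visual], dim=1)   # (B, 2T, D)
        fused, _ = self.attn(text, context, context)  # (B, T, D)
        return self.norm(text + fused)                # residual fusion

def adaptive_margin_triplet_loss(anchor, pos, neg, base=0.2, scale=0.3):
    """Triplet loss whose margin widens when the negative is similar to the
    anchor. This specific adaptation rule is an assumption, not necessarily
    the paper's formula."""
    d_ap = torch.pairwise_distance(anchor, pos)
    d_an = torch.pairwise_distance(anchor, neg)
    margin = base + scale * torch.cosine_similarity(anchor, neg).clamp(min=0)
    return torch.relu(d_ap - d_an + margin).mean()

# Usage on dummy features.
fusion = AttentionFusion()
t, a, v = (torch.randn(2, 10, 256) for _ in range(3))
fused = fusion(t, a, v)  # (2, 10, 256) fused emotion representation
loss = adaptive_margin_triplet_loss(*(torch.randn(8, 128) for _ in range(3)))
```

An adaptive margin of this kind penalizes hard negatives (those close to the anchor) more strongly than easy ones, which is the usual motivation for moving beyond a fixed triplet margin.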