Facial Expression Recognition (FER) is a challenging task that improves natural human-computer interaction. This paper focuses on automatic FER on a single in-the-wild (ITW) image. ITW images suffer from real-world problems such as pose variation, face orientation, and low input resolution. In this study, we propose a pyramid with super-resolution (PSR) network architecture to solve the ITW FER task. We also introduce a prior distribution label smoothing (PDLS) loss function that exploits prior knowledge of how each expression is confused with the others in the FER task. Experiments on the three most popular ITW FER datasets show that our approach outperforms all state-of-the-art methods.
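To give a concrete picture of the idea, below is a minimal PyTorch sketch of a prior-distribution-style label smoothing loss. It assumes the prior is supplied as a row-stochastic confusion matrix `prior` (a C x C tensor) estimated elsewhere; the function name, the mixing weight `alpha`, and the exact form of the prior are illustrative assumptions, not the paper's definition of PDLS.

```python
import torch
import torch.nn.functional as F

def pdls_loss(logits, targets, prior, alpha=0.1):
    """Prior-distribution label smoothing (illustrative sketch).

    Standard label smoothing mixes the one-hot target with a uniform
    distribution; here the one-hot target is instead mixed with a
    per-class prior, where row prior[c] encodes how expression c is
    typically confused with the other expressions. `prior` is assumed
    to be a (C, C) row-stochastic matrix.
    """
    num_classes = logits.size(-1)
    one_hot = F.one_hot(targets, num_classes).float()
    # Mix the ground-truth label with the prior confusion distribution.
    soft_targets = (1.0 - alpha) * one_hot + alpha * prior[targets]
    log_probs = F.log_softmax(logits, dim=-1)
    # Cross-entropy against the smoothed target distribution.
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

During training this would replace the standard cross-entropy; note that if every row of `prior` is uniform, the loss reduces to ordinary label smoothing.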
Speech emotion recognition is a challenging but important task in human-computer interaction (HCI). As technology and the understanding of emotion progress, it is necessary to design robust and reliable emotion recognition systems suitable for real-world applications, both to enhance the analytical abilities that support human decision making and to design human-machine interfaces (HMI) that enable efficient communication. This paper presents a multimodal approach to speech emotion recognition based on a multi-level multi-head fusion attention mechanism and recurrent neural networks (RNN). The proposed structure takes inputs from two modalities: audio and text. For audio features, we compute mel-frequency cepstral coefficients (MFCC) from the raw signals using the openSMILE toolbox. For text, we embed the transcripts with a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. These features are fed in parallel into self-attention-based RNNs to exploit the context at each timestep, and all representations are then fused with a multi-head attention technique to predict emotional states. Our experimental results on three databases, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, the Multimodal EmotionLines Dataset (MELD), and the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, reveal that combining the two modalities achieves better performance than either single-modality model. Quantitative and qualitative evaluations on all three datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
INDEX TERMS: Speech emotion recognition, multi-level multi-head fusion attention, RNN, audio features, textual features.
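As an illustration of the fusion step, here is a minimal PyTorch sketch in which audio features (e.g. MFCC frames) and text features (e.g. BERT token embeddings) are each encoded by a bidirectional GRU and then combined with standard multi-head attention. The class name, dimensions, pooling, and the choice of letting the audio sequence query the text sequence are assumptions made for the sketch, not the paper's exact multi-level multi-head fusion design.

```python
import torch
import torch.nn as nn

class MultiHeadFusion(nn.Module):
    """Sketch of audio-text fusion with multi-head attention."""

    def __init__(self, audio_dim=40, text_dim=768, hidden=128,
                 heads=4, num_classes=4):
        super().__init__()
        # Separate bidirectional GRU encoders per modality.
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.text_rnn = nn.GRU(text_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.fusion = nn.MultiheadAttention(2 * hidden, heads,
                                            batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio, text):
        a, _ = self.audio_rnn(audio)   # (B, T_audio, 2*hidden)
        t, _ = self.text_rnn(text)     # (B, T_text, 2*hidden)
        # The audio representation queries the text sequence.
        fused, _ = self.fusion(query=a, key=t, value=t)
        # Mean-pool over time, then classify the emotional state.
        return self.classifier(fused.mean(dim=1))
```

Stacking such fusion layers at several encoder depths would give a multi-level variant of the kind the abstract describes.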
In recent years, deep learning has dominated medical image segmentation. Encoder-decoder architectures such as U-Net underpin state-of-the-art models; their strength comes from skip connections that propagate local information from the encoder path to the decoder path, recovering the detailed spatial information lost through pooling operations. Despite their effectiveness for segmentation, these naïve skip connections still have disadvantages. First, multi-scale skip connections tend to waste information and computational resources, as similar low-level encoder features are used repeatedly at multiple scales. Second, the contextual information in low-level encoder features is insufficient, leading to poor pixel-wise recognition when they are concatenated with the corresponding high-level decoder features. In this study, we propose a novel spatial-channel attention gate that addresses these limitations of plain skip connections. It can be easily integrated into an encoder-decoder network to effectively improve image segmentation performance. Comprehensive results reveal that our spatial-channel attention gate remarkably enhances the segmentation capability of the U-Net architecture with minimal added computational overhead. The experimental results show that our proposed method outperforms conventional deep networks in terms of Dice score, achieving 71.72%.
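To make the idea concrete, the following is a minimal PyTorch sketch of one plausible spatial-channel attention gate placed on a skip connection: a channel branch reweights the encoder feature using the gating decoder feature, and a spatial branch produces a pixel-wise mask. The module name and the specific branch designs are assumptions; the paper's gate may differ.

```python
import torch
import torch.nn as nn

class SpatialChannelAttentionGate(nn.Module):
    """Sketch of a spatial-channel attention gate on a skip connection.

    The decoder feature `g` (upsampled to the encoder feature's spatial
    size, with the same channel count) gates the low-level encoder
    feature `x`. This is one plausible instantiation, not the paper's
    exact design.
    """

    def __init__(self, channels):
        super().__init__()
        # Channel branch: squeeze-and-excitation-style reweighting.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: 1x1 conv producing a pixel-wise mask.
        self.spatial = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x, g):
        x = x * self.channel_mlp(g)                    # channel gating
        mask = self.spatial(torch.cat([x, g], dim=1))  # (B, 1, H, W)
        return x * mask                                # spatial gating
```

In a U-Net, the returned gated feature would be concatenated with the upsampled decoder feature in place of the plain skip connection.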