FuseVis: Interpreting Neural Networks for Image Fusion Using Per-Pixel Saliency Visualization

Kumar, Nishant; Gumhold, Stefan

doi:10.3390/computers9040098

Cited by 11 publications

(7 citation statements)

References 72 publications

“…These global representations are selected from the local features obtained by the attention mechanism. Although existing audiovisual fusion models are capable of obtaining effective joint representations, they are more complex and difficult to explain [5]. As described in [4], selecting key local components from the global representation is beneficial for reducing the complexity of the model.…”

Section: Audiovisual Information Fusion (mentioning)

confidence: 99%

“…Specifically, unimodal representations can only describe changes in emotion from a single perspective [4]. Therefore, compared with modeling unimodal information, previous research has focused on the use of specific deep neural networks (DNNs) to efficiently learn the joint representation of multiple modalities [5]. For instance, a large number of studies seek to tackle these challenges by building complex network structures [6] and fusing multimodal feature matrices [7], which can mine deep multimodal features and enhance interaction between audiovisual signals, respectively.…”

Section: Introduction (mentioning)

confidence: 99%

Video-Based Cross-Modal Auxiliary Network for Multimodal Sentiment Analysis

Chen, Zhou, et al. 2022

IEEE Trans. Circuits Syst. Video Technol.

Multimodal sentiment analysis has a wide range of applications due to its information complementarity in multimodal interactions. Previous works focus more on investigating efficient joint representations, but they rarely consider the insufficient unimodal features extraction and data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, which is comprised of an audio features map module and a cross-modal selection module. The first module is designed to substantially increase feature diversity in audio feature extraction, aiming to improve classification accuracy by providing more comprehensive acoustic representations. To empower the model to handle redundant visual features, the second module is addressed to efficiently filter the redundant visual frames during integrating audiovisual data. Moreover, a classifier group consisting of several image classification networks is introduced to predict sentiment polarities and emotion categories. Extensive experimental results on RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN is significantly superior to the state-of-the-art methods for improving the classification accuracy of multimodal sentiment analysis.

“…Xiaoning Zhang's paper proposes a new attention-guided network model that selectively integrates multilevel contextual information in an incremental manner. In addition to simulating the human attention mechanism, there is some research work that analyzes the importance of the information around the face object in judging the position of the face [29][30][31][32][33].…”

Section: Related Work (mentioning)

confidence: 99%

Application of a Fast RCNN Based on Upper and Lower Layers in Face Recognition

Jiang, Jia, Todo, et al. 2021

Computational Intelligence and Neuroscience

With the development of society, deep learning has been widely used in object detection, face recognition, speech recognition, and other fields. Among them, object detection is a popular direction in computer vision and digital image processing, and face detection is a focus of this hot direction. Although face detection technology has gone through a long research stage, it is still considered as one of the more difficult subjects in human feature detection technology. In addition, the face detection technology itself has two sides, imperceptibility and complexity of the environment, and other defects cause the existing technology to be unable to accurately recognize faces of different proportions, obscured and different postures. Therefore, this paper adopts an advanced deep learning method based on machine vision to detect human faces automatically. In order to accurately detect a variety of human faces, a multiscale fast RCNN method based on upper and lower layers (UPL-RCNN) is proposed. The network is composed of spatial affine transformation components and feature region components (ROI). This method plays a vital role in face detection. First of all, multiscale information can be grouped in detection, so as to deal with small areas of the face. Then, the method can use the inspiration of the human visual system to perform contextual reasoning and spatial transformation, including zooming, cutting, and rotating. Through comparative experiments, the analysis results show that this method can not only accurately detect human faces but also has better performance than fast RCNN. Compared with some advanced methods, this method has the advantages of high accuracy, less time consumption, and no correlation mark.

“…The saliency-based algorithm can make up for this deficiency well, and it can be combined with the above three categories of methods to improve the quality of the result images [42]. Ma et al defined pixel-level saliency to fuse base layers [43]. Zhang et al improved it by squaring the intensity difference to alleviate the problem of poor perception of lesions and edges.…”

Section: Introduction (mentioning)

confidence: 99%

“…The reason is that the characteristics of the image content itself and the sensitivity of the human eye to information, such as high brightness and contrast changes, are not better utilized. The saliency‐based algorithm can make up for this deficiency well, and it can be combined with the above three categories of methods to improve the quality of the result images [42]. Ma et al defined pixel‐level saliency to fuse base layers [43].…”

Section: Introduction (mentioning)

confidence: 99%

A multimodal molecular image fusion method based on relative total variation and co‐saliency detection

Wang, Zhang, Zhu 2022

Int J Imaging Syst Tech

Image fusion can integrate complementary information from multimodal molecular images to provide an informative single result image. In order to obtain a better fusion effect, this article proposes a novel method based on relative total variation and co‐saliency detection (RTVCSD). First, only the gray‐scale anatomical image is decomposed into a base layer and a texture layer according to the relative total variation; then, the three‐channel color functional image is transformed into the luminance and chroma (YUV) color space, and the luminance component Y is directly fused with the base layer of the anatomical image by comparing the co‐saliency information; next, the fused base layer is linearly combined with the texture layer, and the obtained fused result is combined with the chroma information U and V of the functional image. Finally, the fused image is obtained by transforming back to the red–green–blue color space. The dataset consists of magnetic resonance imaging (MRI)/positron emission tomography images, MRI/single photon emission computed tomography (SPECT) images, computed tomography/SPECT images, and green fluorescent protein/phase contrast images, each category with 20 image pairs. Experimental results demonstrate that the proposed method RTVCSD outperforms the nine comparison algorithms in terms of visual effects and objective evaluation. RTVCSD well preserves the texture information of the anatomical image and the metabolism or protein distribution information of the functional image.
