2021 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip42928.2021.9506089

Deep Audio-Visual Fusion Neural Network for Saliency Estimation

Abstract: In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to utilize the temporal information and a spatial feature pyramid module to fuse the spatial information. Then a multi-scale audio-visual fusion method is used to integrate the different modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation…
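To make the two-branch design described in the abstract concrete, here is a minimal PyTorch sketch of an audio-visual saliency network. The branch depths, channel sizes, fusion by concatenation, and all class and layer names are illustrative assumptions; the paper's actual temporal attention, spatial feature pyramid, and multi-scale fusion modules are not reproduced here.

```python
# Minimal sketch of a two-branch audio-visual saliency model (assumed PyTorch).
# Architectures and shapes are illustrative placeholders, not the paper's design.
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, vis_ch=64, aud_ch=64):
        super().__init__()
        # Visual branch: per-frame 2D features (a real model would use a
        # deeper backbone, e.g. a 3D CNN over a clip of frames).
        self.visual_branch = nn.Sequential(
            nn.Conv2d(3, vis_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(vis_ch, vis_ch, 3, padding=1), nn.ReLU(),
        )
        # Audio branch: features from a log-mel spectrogram treated as a
        # single-channel image, pooled to a global audio descriptor.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, aud_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion + prediction head: broadcast the audio descriptor over the
        # visual feature map and decode a one-channel saliency map.
        self.head = nn.Sequential(
            nn.Conv2d(vis_ch + aud_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frame, spectrogram):
        v = self.visual_branch(frame)               # (B, C_v, H, W)
        a = self.audio_branch(spectrogram)          # (B, C_a, 1, 1)
        a = a.expand(-1, -1, v.size(2), v.size(3))  # broadcast over space
        return self.head(torch.cat([v, a], dim=1))  # (B, 1, H, W) saliency map


sal = AudioVisualSaliencyNet()(torch.randn(2, 3, 112, 112),
                               torch.randn(2, 1, 64, 64))
print(sal.shape)  # torch.Size([2, 1, 112, 112])
```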

Cited by 10 publications (3 citation statements); References 21 publications
“…Many works [134], [135], [136], [137] have adopted the multiplicative-based fusion because it can effectively enhance the consistency and compress the inconsistency between audio and visual saliency-related features. After all, those real salient regions tend to be salient in both the audio domain and visual domain simultaneously.…”
Section: Audio-Visual Saliency Detection (AVSD)
Citation type: mentioning, confidence: 99%
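As a hedged illustration of the multiplicative-based fusion mentioned in this excerpt, the short sketch below (PyTorch, with assumed tensor shapes and sigmoid gating that the cited works may not use) shows why an element-wise product reinforces regions that respond in both modalities while suppressing responses present in only one.

```python
# Hypothetical multiplicative audio-visual fusion: the product is large only
# where both modality gates are large, so single-modality responses are damped.
import torch

def multiplicative_fusion(visual_feat, audio_feat):
    """Element-wise product of modality-wise gates on a shared spatial grid.

    visual_feat, audio_feat: tensors of shape (B, C, H, W), assumed already
    projected to a common channel dimension.
    """
    v_gate = torch.sigmoid(visual_feat)
    a_gate = torch.sigmoid(audio_feat)
    return v_gate * a_gate  # high only where both modalities respond
```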
“…To validate the importance of audio information in a visual attention model and reduce the computational complexity of the model, Zhu et al [36] proposed a lightweight audio-visual saliency (LAVS) model for video sequences. Yao et al [37] designed a novel temporal attention module to utilize the temporal information and a spatial feature pyramid module to fuse the spatial information. In addition, they proposed a new dataset for audio-visual saliency estimation, which can be used as a new benchmark in future work.…”
Section: Multi-modal Audio-visual Information Processing
Citation type: mentioning, confidence: 99%
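The excerpt credits Yao et al. with a temporal attention module but does not specify it; as a rough, assumption-laden sketch of the general idea (learned per-frame weights over a clip, aggregated into a single feature map), one could write it in PyTorch as follows. The spatial pooling, scoring layer, and tensor shapes are hypothetical, not the paper's actual module.

```python
# Minimal temporal-attention sketch: score each frame of a clip, softmax the
# scores over time, and use them to weight-average the per-frame features.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)  # per-frame relevance score

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame feature maps for a clip of T frames
        pooled = feats.mean(dim=(3, 4))                   # (B, T, C) spatial pooling
        weights = torch.softmax(self.score(pooled), 1)    # (B, T, 1) over time
        weights = weights.unsqueeze(-1).unsqueeze(-1)     # (B, T, 1, 1, 1)
        return (weights * feats).sum(dim=1)               # (B, C, H, W) aggregated


att = TemporalAttention(channels=64)
out = att(torch.randn(2, 8, 64, 28, 28))  # clip of 8 frames
print(out.shape)                          # torch.Size([2, 64, 28, 28])
```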
“…TV advertisements) remains to be explored. Audio-visual attention prediction is an emerging topic of research [66]-[69], since audio-visual modeling may contribute to better attention prediction on dynamic stimuli. In addition to saliency prediction on graphic designs, the task of quality assessment of screen content images (e.g.…”
Section: Saliency In Natural Images
Citation type: mentioning, confidence: 99%