2021 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip42928.2021.9506089

Deep Audio-Visual Fusion Neural Network for Saliency Estimation

Abstract: In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to utilize the temporal information and a spatial feature pyramid module to fuse the spatial information. Then a multi-scale audio-visual fusion method is used to integrate the different modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation…
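To make the two-branch design described in the abstract concrete, here is a minimal PyTorch sketch of an audio-visual saliency network. The branch depths, channel sizes, fusion by concatenation, and all class and layer names are illustrative assumptions; the paper's actual temporal attention, spatial feature pyramid, and multi-scale fusion modules are not reproduced here.

```python
# Minimal sketch of a two-branch audio-visual saliency model (assumed PyTorch).
# Architectures and shapes are illustrative placeholders, not the paper's design.
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, vis_ch=64, aud_ch=64):
        super().__init__()
        # Visual branch: per-frame 2D features (a real model would use a
        # deeper backbone, e.g. a 3D CNN over a clip of frames).
        self.visual_branch = nn.Sequential(
            nn.Conv2d(3, vis_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(vis_ch, vis_ch, 3, padding=1), nn.ReLU(),
        )
        # Audio branch: features from a log-mel spectrogram treated as a
        # single-channel image, pooled to a global audio descriptor.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, aud_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion + prediction head: broadcast the audio descriptor over the
        # visual feature map and decode a one-channel saliency map.
        self.head = nn.Sequential(
            nn.Conv2d(vis_ch + aud_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frame, spectrogram):
        v = self.visual_branch(frame)               # (B, C_v, H, W)
        a = self.audio_branch(spectrogram)          # (B, C_a, 1, 1)
        a = a.expand(-1, -1, v.size(2), v.size(3))  # broadcast over space
        return self.head(torch.cat([v, a], dim=1))  # (B, 1, H, W) saliency map


sal = AudioVisualSaliencyNet()(torch.randn(2, 3, 112, 112),
                               torch.randn(2, 1, 64, 64))
print(sal.shape)  # torch.Size([2, 1, 112, 112])
```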

Cited by 10 publications (3 citation statements); References 21 publications
“…Many works [134], [135], [136], [137] have adopted the multiplicative-based fusion because it can effectively enhance the consistency and compress the inconsistency between audio and visual saliency-related features. After all, those real salient regions tend to be salient in both the audio domain and visual domain simultaneously.…”
Section: Audio-Visual Saliency Detection (AVSD)
Citation type: mentioning, confidence: 99%
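As a hedged illustration of the multiplicative-based fusion mentioned in this excerpt, the short sketch below (PyTorch, with assumed tensor shapes and sigmoid gating that the cited works may not use) shows why an element-wise product reinforces regions that respond in both modalities while suppressing responses present in only one.

```python
# Hypothetical multiplicative audio-visual fusion: the product is large only
# where both modality gates are large, so single-modality responses are damped.
import torch

def multiplicative_fusion(visual_feat, audio_feat):
    """Element-wise product of modality-wise gates on a shared spatial grid.

    visual_feat, audio_feat: tensors of shape (B, C, H, W), assumed already
    projected to a common channel dimension.
    """
    v_gate = torch.sigmoid(visual_feat)
    a_gate = torch.sigmoid(audio_feat)
    return v_gate * a_gate  # high only where both modalities respond
```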
“…To validate the importance of audio information in a visual attention model and reduce the computational complexity of the model, Zhu et al [36] proposed a lightweight audio-visual saliency (LAVS) model for video sequences. Yao et al [37] designed a novel temporal attention module to utilize the temporal information and a spatial feature pyramid module to fuse the spatial information. In addition, they proposed a new dataset for audio-visual saliency estimation, which can be used as a new benchmark in future work.…”
Section: Multi-modal Audio-visual Information Processing
Citation type: mentioning, confidence: 99%
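The excerpt credits Yao et al. with a temporal attention module but does not specify it; as a rough, assumption-laden sketch of the general idea (learned per-frame weights over a clip, aggregated into a single feature map), one could write it in PyTorch as follows. The spatial pooling, scoring layer, and tensor shapes are hypothetical, not the paper's actual module.

```python
# Minimal temporal-attention sketch: score each frame of a clip, softmax the
# scores over time, and use them to weight-average the per-frame features.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)  # per-frame relevance score

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame feature maps for a clip of T frames
        pooled = feats.mean(dim=(3, 4))                   # (B, T, C) spatial pooling
        weights = torch.softmax(self.score(pooled), 1)    # (B, T, 1) over time
        weights = weights.unsqueeze(-1).unsqueeze(-1)     # (B, T, 1, 1, 1)
        return (weights * feats).sum(dim=1)               # (B, C, H, W) aggregated


att = TemporalAttention(channels=64)
out = att(torch.randn(2, 8, 64, 28, 28))  # clip of 8 frames
print(out.shape)                          # torch.Size([2, 64, 28, 28])
```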
“…TV advertisements) remains to be explored. Audio-visual attention prediction is an emerging topic of research [66]-[69], since audio-visual modeling may contribute to better attention prediction on dynamic stimuli. In addition to saliency prediction on graphic designs, the task of quality assessment of screen content images (e.g.…”
Section: Saliency In Natural Images
Citation type: mentioning, confidence: 99%