2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020
DOI: 10.1109/cvpr42600.2020.00482

STAViS: Spatio-Temporal AudioVisual Saliency Network

Cited by 66 publications (46 citation statements)
References 54 publications
“…While in this paper, we mainly focus on end-to-end deep learning based audio-visual processing model. Recent years have seen some good works on audio-visual saliency estimation or prediction by using deep learning-based end-to-end models, such as STAVIS [14]. STAVIS shares similar research objectives with this paper.…”
Section: Audio-visual Saliency Models (mentioning)
confidence: 74%
“…Each video frame is resized at 112×112 pixels. Following the previous work [59], the data augmentation is also employed for random generation of training samples. The implementation adopts the 3D ResNet-50 [24] as backbone for encoding spatio-temporal visual features, and applies SoundNet [23] as backbone for encoding high-level audio semantic features.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…TASED [78] exploits 3D fully-convolutional network architecture to generate the visual attention map of each frame by considering several past frames. STAViS [59] combines both visual and auditory information for VAP in videos.…”
Section: Comparison With State-of-the-arts (mentioning)
confidence: 99%
“…More recently, there have been some efforts related to audiovisual saliency [60][61][62][63], but very few explicitly incorporate spatial (two-dimensional) audio [62]. However, some of the drawbacks of those models are that they are not based on the neural principles and do not use spatial audio which is an important feature in the auditory domain.…”
Section: Related Work (mentioning)
confidence: 99%