2021
DOI: 10.1007/s11263-021-01519-y
Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction

Abstract: In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated from features extracted at different abstraction levels. We equip the base hierarchical learning mechanism with two techniques for domain adaptation and domain-specific learning. For the former, we encourage the model to learn hierarchical general features in an unsupervised manner, using gradient reversal at multiple scales, …
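As a rough sketch of the gradient-reversal idea mentioned in the abstract (identity on the forward pass, sign-flipped and scaled gradient on the backward pass, as in adversarial domain adaptation), here is a minimal NumPy illustration; the class name and the `lam` strength parameter are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer: identity on the forward pass,
    negates (and scales by lam) the gradient on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (illustrative name)

    def forward(self, x):
        return x  # activations pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip the gradient sign

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)              # same as x
g = grl.backward(np.ones(3))    # each gradient becomes -0.5
```

In a full model, such a layer would sit between shared features and a domain classifier, so that minimizing the classifier's loss pushes the features toward domain invariance.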


Cited by 45 publications (19 citation statements)
References 60 publications
“…UNISAL [6] proposes a multi-objective unified framework for both 2D and 3D saliency, with domain-specific modules and a lightweight recurrent architecture to handle temporal dynamics. While single-decoder approaches are common, multi-decoder output integration has recently attracted interest. DVA [35] and HD2S [1] fuse maps predicted by independent decoders operating at different abstraction levels. RecSal [30] predicts multiple saliency maps in a multi-objective training framework.…”
Section: Related Work
confidence: 99%
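The multi-decoder fusion described above (as in DVA and HD2S) can be sketched as a weighted average of per-decoder conspicuity maps followed by renormalization; the function name, uniform weighting, and min-max rescaling below are illustrative assumptions, not the exact fusion schemes of the cited papers:

```python
import numpy as np

def fuse_conspicuity_maps(maps, weights=None):
    """Fuse per-decoder saliency (conspicuity) maps into one prediction:
    weighted average over decoders, then rescale to [0, 1]."""
    maps = np.stack(maps)                          # (n_decoders, H, W)
    if weights is None:
        weights = np.ones(len(maps)) / len(maps)   # uniform by default
    fused = np.tensordot(weights, maps, axes=1)    # (H, W)
    fused = fused - fused.min()
    return fused / (fused.max() + 1e-8)            # values in [0, 1]

# three toy maps standing in for decoders at different abstraction levels
rng = np.random.default_rng(0)
maps = [rng.random((4, 4)) for _ in range(3)]
fused = fuse_conspicuity_maps(maps)
```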
“…Many solutions have been proposed, based on different assumptions about how to capture video saliency. It is interesting to note that, in spite of the remarkably different research directions followed by the variety of works in the literature, top results on video saliency prediction benchmarks are very close [1,3,36], suggesting that the predictions of different models are similar. We assessed the validity of this conclusion by comparing three of the best-performing methods on the DHF1K dataset [36] - TASED [25], HD2S [1] and ViNet [16] - not in terms of their scores on summary metrics, but in terms of the relative similarity of the predicted saliency maps.…”
Section: Introduction
confidence: 99%
“…Optical flow can be used to direct attention to the human foreground when appropriate compensation is applied for camera motion. We combine a spatial-stream embedding CNN and a temporal-stream CNN into a dual-stream convolutional neural network to learn video features [20]. An optical-flow attention layer is introduced from the temporal network to the spatial network to guide the spatial stream to pay more attention to the human foreground region and to reduce the effect of background noise.…”
Section: Video Content Analysis Of High-level Semantic Recognition Model Sports Under Engineering Management
confidence: 99%
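The optical-flow attention described above can be sketched as a gate computed from flow magnitude that re-weights the spatial-stream features toward moving regions; the function name `flow_attention` and the sigmoid gating are hypothetical choices, not the cited paper's exact layer:

```python
import numpy as np

def flow_attention(spatial_feats, flow):
    """Re-weight spatial-stream features (H, W, C) by an attention map
    derived from optical-flow magnitude (H, W, 2): fast-moving regions
    get weights near 1, static background near 0."""
    mag = np.linalg.norm(flow, axis=-1)                 # (H, W) flow magnitude
    attn = 1.0 / (1.0 + np.exp(-(mag - mag.mean())))    # sigmoid gate in (0, 1)
    return spatial_feats * attn[..., None]              # broadcast over channels

H, W, C = 6, 6, 8
feats = np.ones((H, W, C))
flow = np.zeros((H, W, 2))
flow[2:4, 2:4] = 5.0          # a moving patch in an otherwise static frame
out = flow_attention(feats, flow)
```

With this toy input, features inside the moving patch are weighted more heavily than those in the static background.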
“…Visual saliency prediction. Visual saliency models have been widely developed to predict where people look in images (Huang et al, 2015; Zhang and Sclaroff, 2016; Pan et al, 2017; Wang and Shen, 2017; Cornia et al, 2018; Li et al, 2014) or videos (Hossein Khatoonabadi et al, 2015; Bak et al, 2017; Liu et al, 2017; Jiang et al, 2021; Wang et al, 2018; Min and Corso, 2019; Zanca et al, 2019; Bellitto et al, 2021; Li et al, 2010; Souly and Shah, 2016). The seminal work of Itti et al (1998) proposed a computational model to predict image saliency by combining three low-level features: color, intensity, and orientation.…”
Section: Saliency Prediction
confidence: 99%
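The Itti et al. (1998) combination of low-level features can be sketched as normalizing each feature (conspicuity) map and averaging the three into a master saliency map; this is a simplified sketch that omits the center-surround differences and the normalization operator of the original model:

```python
import numpy as np

def itti_combination(color, intensity, orientation):
    """Combine three low-level feature maps into a master saliency map:
    min-max normalize each map to [0, 1], then average them."""
    def norm(m):
        m = m - m.min()
        return m / (m.max() + 1e-8)
    return (norm(color) + norm(intensity) + norm(orientation)) / 3.0

# toy feature maps standing in for color, intensity, and orientation channels
rng = np.random.default_rng(1)
sal = itti_combination(rng.random((8, 8)),
                       rng.random((8, 8)),
                       rng.random((8, 8)))
```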