2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00248
|View full text |Cite
|
Sign up to set email alerts
|

TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection

Abstract: TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. It consists of two building blocks: first, the encoder network extracts lowresolution spatiotemporal features from an input clip of several consecutive frames, and then the following prediction network decodes the encoded features spatially while aggregating all the temporal information. As a result, a single prediction map is produced from an input clip of multiple frames. Frame-wise saliency maps can be predicted by appl… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
4
1

Citation Types

0
119
0

Year Published

2020
2020
2021
2021

Publication Types

Select...
5
2

Relationship

0
7

Authors

Journals

citations
Cited by 144 publications
(119 citation statements)
references
References 40 publications
0
119
0
Order By: Relevance
“…Earlier attempts to predict saliency typically utilized handcrafted image features, such as color, intensity, and orientation contrast [39], motion contrast [40], and camera motion [41]. Later on, DNN-based semantic-level features were extensively investigated for both image content [42]- [48] and video sequences [49]- [55]. Among these features, image saliency prediction only exploits spatial information, while video saliency prediction often relies on spatial and temporal attributes jointly.…”
Section: A Saliency-based Video Preprocessingmentioning
confidence: 99%
“…Earlier attempts to predict saliency typically utilized handcrafted image features, such as color, intensity, and orientation contrast [39], motion contrast [40], and camera motion [41]. Later on, DNN-based semantic-level features were extensively investigated for both image content [42]- [48] and video sequences [49]- [55]. Among these features, image saliency prediction only exploits spatial information, while video saliency prediction often relies on spatial and temporal attributes jointly.…”
Section: A Saliency-based Video Preprocessingmentioning
confidence: 99%
“…The other is the ineffective structures in extracting joint spatio-temporal FIGURE 1. Cases that TASED-Net [16] fails to allocate saliency and predict shifts of saliency focus. The rows from top to bottom are the input, ground-truth, and saliency maps predicted by TASED-Net [16].…”
Section: Introductionmentioning
confidence: 99%
“…Deep models based on LSTM or its variants [7]- [15] are difficult to train for long-range temporal evolution due to the frame-by-frame data feeding scheme. Models based on 3D fully CNN (3DCNN) [16], [17] still may fail to allocate saliency or to predict shifts of saliency focus when there are multiple instances.…”
Section: Introductionmentioning
confidence: 99%
See 2 more Smart Citations