2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01594

Adaptive Focus for Efficient Video Recognition

Abstract: In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency. It is observed that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model the patch localization problem as a sequential decision task, and propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus). Specifically, a lightweight ConvNet is first adopt…
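The sequential patch-selection idea described in the abstract can be sketched in a few lines. This is a generic illustration only, not the paper's actual implementation: `policy`, `classifier`, and the patch size are hypothetical stand-ins for the light-weighted policy network and local recognition network the method trains.

```python
import numpy as np

def crop_patch(frame, center, size=96):
    """Crop a square `size`-by-`size` patch around `center` (y, x),
    clamping the window so it stays inside the frame."""
    h, w, _ = frame.shape
    half = size // 2
    y = min(max(center[0], half), h - half)
    x = min(max(center[1], half), w - half)
    return frame[y - half:y + half, x - half:x + half]

def recognize(frames, policy, classifier, size=96):
    """Pick one patch per frame via a sequential policy, classify each
    patch, and average the per-frame logits into a video prediction."""
    logits = []
    state = None  # the policy can carry state across frames
    for frame in frames:
        center, state = policy(frame, state)   # sequential decision
        patch = crop_patch(frame, center, size)
        logits.append(classifier(patch))       # only the patch is processed
    return np.mean(logits, axis=0)
```

Because only a small patch per frame passes through the expensive classifier, the per-frame cost drops roughly with the ratio of patch area to frame area.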

Cited by 83 publications
(26 citation statements)
References 49 publications
“…The training in AR-Net is simplified using the Gumbel-Softmax trick. Later, this idea was extended to adaptively select a proper modality [20] or patches [42]. Our approach is motivated by these prior works to apply a similar framework to adaptive computation on deep learning-based VIO for the first time.…”
Section: Adaptive Inference
confidence: 99%
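The Gumbel-Softmax trick mentioned in the statement above can be illustrated with a minimal sketch (a generic numpy version, not AR-Net's actual code): it draws a near-one-hot, differentiable sample from a categorical distribution over discrete choices, such as which resolution, modality, or patch to process.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample from softmax(logits) via Gumbel noise.

    Lower `tau` pushes the sample toward a hard one-hot vector, while the
    whole expression stays differentiable with respect to `logits` -- the
    property that lets models train discrete selections with backprop."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = y - y.max()                                # numerical stability
    e = np.exp(y)
    return e / e.sum()
```

At inference time, one would typically take the argmax of the sample to commit to a single discrete choice.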
“…Reducing spatio-temporal redundancy for efficient video analysis has recently been a popular research topic. The mainstream approaches mostly train an additional lightweight network to achieve: (i) adaptive frame selection [12]–[14], [16], [44], i.e., dynamically determining the relevant frames for the recognition networks; (ii) adaptive frame resolution [12], i.e., learning an optimal resolution for each frame online; (iii) early stopping [45], i.e., terminating the inference process before observing all frames; (iv) adaptive spatio-temporal regions [10], [11], i.e., localizing the most task-relevant spatio-temporal regions; (v) adaptive network architectures [15], [16], [46], i.e., adjusting the network architecture to save computation on less informative features. Another line of work manually defines low-redundancy sampling rules, such as MGSampler [47], which selects frames containing rich motion information based on the cumulative motion distribution.…”
Section: B. Spatio-Temporal Redundancy
confidence: 99%
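Early stopping, item (iii) in the taxonomy above, follows a simple pattern: stop consuming frames once the running prediction is confident enough. The sketch below is a generic illustration of that pattern; the per-frame `classifier` and the confidence `threshold` are assumed placeholders, not any cited method's implementation.

```python
import numpy as np

def early_exit_predict(frames, classifier, threshold=0.9):
    """Accumulate per-frame class probabilities and exit as soon as the
    running mean puts at least `threshold` mass on one class.
    Returns (class_probabilities, number_of_frames_actually_used)."""
    acc = None
    for t, frame in enumerate(frames, start=1):
        probs = classifier(frame)
        acc = probs if acc is None else acc + probs
        mean = acc / t
        if mean.max() >= threshold:   # confident: skip remaining frames
            return mean, t
    return acc / len(frames), len(frames)
```

On easy videos the loop exits after one or two frames, which is where the computational savings come from; hard videos still consume the full clip.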
“…Although this yields decent performance, computation over full videos is highly redundant due to the pervasive spatio-temporal redundancy of visual information in videos [9]–[13]. In light of this, a branch of previous works has proposed to reduce this redundancy by training an additional model to focus on relevant frames [12]–[17] or spatio-temporal regions [10], [11], which can significantly reduce the computation cost. However, they mostly require complicated operations, such as reinforcement learning and multi-stage training.…”
Section: Introduction
confidence: 99%
“…For example, a dynamic network spends less computation on easy samples or less informative spatial areas/temporal locations of an input. For image [32,60] or video-related [30,59] tasks, sample-wise, spatial-wise, or temporal-wise adaptive inference could be conducted by formulating the recognition or detection task as a sequential decision problem and allowing early exiting during inference.…”
Section: Dynamic Neural Network
confidence: 99%