2021
DOI: 10.48550/arxiv.2112.14238
Preprint

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Abstract: Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing the spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), leading to slow convergence and is unfriendly…
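The approach summarized above, cropping a small informative patch from each frame so that a heavier network only processes that patch, can be made end-to-end trainable when the crop itself is differentiable. The following is a minimal PyTorch-style sketch of that general idea under illustrative assumptions (the module names, patch scale, and tiny backbones are not taken from the AdaFocus code): a cheap glance network proposes a patch centre, and an interpolation-based crop keeps gradients flowing back to it.

```python
# Minimal sketch (assumptions: PyTorch available; all names are illustrative).
# A lightweight glance network predicts a patch centre in [-1, 1] coordinates,
# and the patch is cut out with an interpolation-based crop (affine_grid +
# grid_sample), so gradients from the recognition loss reach the glance network.
import torch
import torch.nn as nn
import torch.nn.functional as F


def crop_patch(frames, centers, scale=0.5, patch_size=96):
    """Differentiably crop a patch of relative size `scale` around `centers`.

    frames:  (N, C, H, W) input frames
    centers: (N, 2) patch centres in normalized [-1, 1] coordinates
    """
    n = frames.size(0)
    theta = frames.new_zeros(n, 2, 3)
    theta[:, 0, 0] = scale           # horizontal zoom
    theta[:, 1, 1] = scale           # vertical zoom
    theta[:, :, 2] = centers         # translation = patch centre
    grid = F.affine_grid(theta, (n, frames.size(1), patch_size, patch_size),
                         align_corners=False)
    return F.grid_sample(frames, grid, align_corners=False)


class SpatialFocusSketch(nn.Module):
    """Toy two-branch model: a cheap global glance proposes where to look,
    and a local branch classifies the cropped patch."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.glance = nn.Sequential(               # cheap policy backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 2), nn.Tanh())           # patch centre in [-1, 1]
        self.local = nn.Sequential(                # stronger patch classifier
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes))

    def forward(self, frames):
        centers = self.glance(frames)
        patches = crop_patch(frames, centers)
        return self.local(patches)


if __name__ == "__main__":
    model = SpatialFocusSketch()
    x = torch.randn(4, 3, 224, 224)                # a batch of video frames
    logits = model(x)
    logits.sum().backward()                        # gradients reach the glance
    print(logits.shape)                            # torch.Size([4, 10])
```

Because the crop is bilinear interpolation rather than a hard index, the whole pipeline can be optimized with plain backpropagation instead of a multi-stage reinforcement-learning scheme.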

Cited by 2 publications (3 citation statements)
References 53 publications (118 reference statements)
“…Reducing spatio-temporal redundancy for efficient video analysis has recently been a popular research topic. The mainstream approaches mostly train an additional lightweight network to achieve: (i) adaptive frame selection [12]-[14], [16], [44], i.e., dynamically determining the relevant frames for the recognition networks; (ii) adaptive frame resolution [12], i.e., learning an optimal resolution for each frame online; (iii) early stopping [45], i.e., terminating the inference process before observing all frames; (iv) adaptive spatio-temporal regions [10], [11], i.e., localizing the most task-relevant spatio-temporal regions; (v) adaptive network architectures [15], [16], [46], i.e., adjusting the network architecture to save computation on less informative features. Another line is to manually define low-redundancy sampling rules, such as MGSampler [47], which selects frames containing rich motion information based on the cumulative motion distribution.…”
Section: B. Spatio-Temporal Redundancy (mentioning)
confidence: 99%
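As a rough illustration of strategy (i) in the passage above, adaptive frame selection driven by an additional lightweight network, the sketch below scores frames with a tiny policy and forwards only the top-scoring ones to an expensive recognition backbone. The network sizes and the hard top-k rule are assumptions for illustration only, not the mechanism of any particular cited method; note that hard selection is non-differentiable, which is one reason such selectors are often trained with reinforcement learning or differentiable relaxations.

```python
# Generic sketch of adaptive frame selection (illustrative, not a cited method):
# a tiny policy network scores each frame, only the top-k frames are passed to
# the expensive recognition backbone, and the remaining frames are skipped.
import torch
import torch.nn as nn


class FrameSelectorSketch(nn.Module):
    def __init__(self, num_classes=10, keep_frames=4):
        super().__init__()
        self.keep_frames = keep_frames
        self.policy = nn.Sequential(                # cheap per-frame scorer
            nn.Conv2d(3, 8, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
        self.backbone = nn.Sequential(              # expensive recognizer
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, video):                       # video: (N, T, C, H, W)
        n, t = video.shape[:2]
        flat = video.flatten(0, 1)                  # (N*T, C, H, W)
        scores = self.policy(flat).view(n, t)       # relevance score per frame
        keep = scores.topk(self.keep_frames, dim=1).indices   # hard selection
        idx = keep.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
        idx = idx.expand(-1, -1, *video.shape[2:])
        selected = video.gather(1, idx)             # (N, k, C, H, W)
        logits = self.backbone(selected.flatten(0, 1))
        return logits.view(n, self.keep_frames, -1).mean(dim=1)


if __name__ == "__main__":
    model = FrameSelectorSketch()
    clip = torch.randn(2, 8, 3, 112, 112)           # 8 frames per clip
    print(model(clip).shape)                        # torch.Size([2, 10])
```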
“…Although this yields decent performance, the computation over full videos is highly redundant due to the excessive and widely present spatio-temporal redundancy of visual information in videos [9]-[13]. In light of this, a branch of previous works has proposed to reduce the spatio-temporal redundancy by training an additional model to focus on relevant frames [12]-[17] or spatio-temporal regions [10], [11], which can significantly reduce the computation cost. However, they mostly require complicated operations, such as reinforcement learning and multi-stage training.…”
Section: Introduction (mentioning)
confidence: 99%
“…Wu et al. [47] utilize multi-agent reinforcement learning to model parallel frame sampling, and Lin et al. [24] make a one-step decision with a holistic view. Meng et al. [27] and Wang et al. [42,44] focus their attention on spatial redundancy. Panda et al. adaptively decide modalities for video segments.…”
Section: Related Work (mentioning)
confidence: 99%