2021
DOI: 10.1109/access.2021.3134694
|View full text |Cite
|
Sign up to set email alerts
|

Deep Neural Networks Using Residual Fast-Slow Refined Highway and Global Atomic Spatial Attention for Action Recognition and Detection

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
4
1

Citation Types

0
10
0

Year Published

2022
2022
2024
2024

Publication Types

Select...
5
1

Relationship

0
6

Authors

Journals

citations
Cited by 19 publications
(10 citation statements)
references
References 51 publications
0
10
0
Order By: Relevance
“…In SlowFast [65], a low-and a high-frame rate pathway, consisting of differentdepth 3D-ResNets, are used to capture the spatial frame information and rapidly changing motion, respectively. In [66], a 3D-CNN is first used to produce a feature representation for each video segment, which are then processed using an attention network with fast and slow pathways. In [67], 3D-CNN architectures are build using a temporal one-shot aggregation module to capture multiple temporal receptive fields, and depth-wise spatiotemporal factorized components for modeling short-and long-term motion dynamics.…”
Section: ) Top-down Approachesmentioning
confidence: 99%
“…In SlowFast [65], a low-and a high-frame rate pathway, consisting of differentdepth 3D-ResNets, are used to capture the spatial frame information and rapidly changing motion, respectively. In [66], a 3D-CNN is first used to produce a feature representation for each video segment, which are then processed using an attention network with fast and slow pathways. In [67], 3D-CNN architectures are build using a temporal one-shot aggregation module to capture multiple temporal receptive fields, and depth-wise spatiotemporal factorized components for modeling short-and long-term motion dynamics.…”
Section: ) Top-down Approachesmentioning
confidence: 99%
“…In order to improve performance in visual perception, several generations of CNNs have been created with the input vectors taking care of one image or multiple images. Particularly, multiple images are commonly adopted as an input vector which has the embedded temporal information as well as the spatial information [2], [4]. In addition to improving learning, many researchers used temporal networks to perform large-scale visual learning and activity classification from video clips, where temporal networks had recurrent connections to aid in video context understanding regarding time [2], [4]- [7].…”
Section: Introductionmentioning
confidence: 99%
“…The motion being performed can be at a fast-refreshing speed, and individual frames can be ambiguous. Therefore, motion cues provide a necessary approach by allowing the compensated optical flows to pick up potential [2], [4]. Another important reason is that current CNNs architectures are not able to take full advantage of temporal information and their performance is consequently often dominated by appearance recognition.…”
Section: Introductionmentioning
confidence: 99%
See 2 more Smart Citations