Deep Neural Networks Using Residual Fast-Slow Refined Highway and Global Atomic Spatial Attention for Action Recognition and Detection

Ha, Manh-Hung; Chen, Oscal Tzyh-Chiang

doi:10.1109/access.2021.3134694

Cited by 19 publications

(10 citation statements)

References 51 publications


“…In SlowFast [65], a low- and a high-frame-rate pathway, consisting of different-depth 3D-ResNets, are used to capture the spatial frame information and rapidly changing motion, respectively. In [66], a 3D-CNN is first used to produce a feature representation for each video segment, which are then processed using an attention network with fast and slow pathways. In [67], 3D-CNN architectures are built using a temporal one-shot aggregation module to capture multiple temporal receptive fields, and depth-wise spatiotemporal factorized components for modeling short- and long-term motion dynamics.…”

Section: Top-down Approaches (mentioning)

confidence: 99%

ViGAT: Bottom-Up Event Recognition and Explanation in Video Using Factorized Graph Attention Network

2022


In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is proposed. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, MiniKinetics, ActivityNet). Source code and trained models will be made available upon acceptance. Index terms: video event recognition, eXplainable AI (XAI), graph attention network, factorized attention, bottom-up.
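The weighted in-degree idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes that `adjacency[i][j]` holds the attention weight on the edge from node i to node j, so a node's weighted in-degree is the sum of its column; this orientation, the function names, and the toy values are illustrative assumptions, not details taken from the ViGAT paper.

```python
def weighted_in_degrees(adjacency):
    """Return the weighted in-degree of every node.

    adjacency[i][j] is taken to be the attention weight of the edge from
    node i to node j (an assumed convention), so the in-degree of node j
    is the sum of column j.
    """
    n = len(adjacency)
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def most_salient(adjacency, labels, top_k=1):
    """Rank nodes by weighted in-degree, highest first."""
    wids = weighted_in_degrees(adjacency)
    ranked = sorted(zip(labels, wids), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy 3-node attention matrix (each row sums to 1); values are illustrative only.
A = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
]
labels = ["person", "ball", "tree"]
top = most_salient(A, labels, top_k=1)
print(top)
```

In this toy matrix every node attends most strongly to "ball", so it receives the largest weighted in-degree and is ranked as the most salient node.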



“…In order to improve performance in visual perception, several generations of CNNs have been created with the input vectors taking care of one image or multiple images. Particularly, multiple images are commonly adopted as an input vector which has the embedded temporal information as well as the spatial information [2], [4]. In addition to improving learning, many researchers used temporal networks to perform large-scale visual learning and activity classification from video clips, where temporal networks had recurrent connections to aid in video context understanding regarding time [2], [4]-[7].…”

Section: Introduction (mentioning)

confidence: 99%

“…The motion being performed can be at a fast-refreshing speed, and individual frames can be ambiguous. Therefore, motion cues provide a necessary approach by allowing the compensated optical flows to pick up potential [2], [4]. Another important reason is that current CNN architectures are not able to take full advantage of temporal information and their performance is consequently often dominated by appearance recognition.…”

Section: Introduction (mentioning)

confidence: 99%

“…To reliably and precisely generate subject descriptors, the recognition process may focus on the meaningful parts to increase the accuracy. For example, attention features were generated automatically from the DNN's intermediate layer(s) and then used to focus on the most meaningful part of an image for identification [2], [4]. In [5], the recurrent mechanism that assigned the weighted attention to the feature map from the convolutional layer was proposed for action recognition based on RGB images.…”

Section: Introduction (mentioning)

confidence: 99%

“…Instead of using the RGB stream, the spatiotemporal attention mechanism adopts the joint points from the 3D skeleton for action recognition. They developed an end-to-end network with three temporal networks that individually performed the classifications, by selectively focusing on the discriminative joints of the skeleton (spatial attention), and assigning weights to the key sequential images (temporal attention) [2], [4].…”

Section: Introduction (mentioning)

confidence: 99%


Attention correlated appearance and motion feature followed temporal learning for activity recognition

Pham, Thanh, et al.

2023

IJECE


Recent advances in deep neural networks have been successfully demonstrated with fairly good accuracy for multi-class activity identification. However, existing methods have limitations in achieving complex spatial-temporal dependencies. In this work, we design two stream fusion attention (2SFA) connected to a temporal bidirectional gated recurrent unit (GRU) one-layer model and classified by prediction voting classifier (PVC) to recognize the action in a video. Particularly in the proposed deep neural network (DNN), we present 2SFA for capturing appearance information from red green blue (RGB) and motion from optical flow, where both streams are correlated by proposed fusion attention (FA) as the input of a temporal network. On the other hand, the temporal network with a bi-directional temporal layer using a GRU single layer is preferred for temporal understanding because it yields practical merits against six topologies of temporal networks in the UCF101 dataset. Meanwhile, the new proposed classifier scheme called PVC employs multiple nearest class mean (NCM) and the SoftMax function to yield multiple features outputted from temporal networks, and then votes their properties for high-performance classifications. The experiments achieve the best average accuracy of 70.8% in HMDB51 and 91.9%, the second best in UCF101 in terms of 2DConvNet for action recognition.
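The abstract does not spell out how PVC combines its classifiers, so the following is only a rough sketch of one plausible reading: several nearest class mean (NCM) predictions and one SoftMax prediction are combined by majority vote. All function names, the Euclidean distance, the tie-breaking rule, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def class_means(samples, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, Counter(labels)
    for vec, lab in zip(samples, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def ncm_predict(means, vec):
    """Return the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda lab: math.dist(means[lab], vec))


def softmax_predict(logits, classes):
    """Return the class with the highest softmax probability."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return classes[probs.index(max(probs))]


def vote(predictions):
    """Majority vote; ties go to the earliest-seen label."""
    return Counter(predictions).most_common(1)[0][0]


# Toy training features and labels (illustrative values only).
train = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
train_labels = ["walk", "walk", "run", "run"]
means = class_means(train, train_labels)

# Two hypothetical feature vectors for the same video, plus hypothetical logits.
predictions = [
    ncm_predict(means, [0.7, 0.9]),
    ncm_predict(means, [0.3, 0.2]),
    softmax_predict([0.2, 1.5], ["walk", "run"]),
]
final = vote(predictions)
print(predictions, final)
```

Here the two NCM predictions disagree and the SoftMax prediction breaks the tie, which is the kind of case where voting over several classifiers can change the final label.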


Deep multiple aggregation networks for action recognition

Mazari, Sahbi

2024

Int J Multimed Info Retr

