“…In sequential NLP problems (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017; Lin et al. 2017b; Xu et al. 2015), attention mechanisms are widely adopted in recurrent neural networks (RNNs) (Pang et al. 2019), Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997), SCS (Pang et al. 2020b), and the Transformer (Vaswani et al. 2017) to capture the relationships between words or sentences. In computer vision, many tasks such as fine-grained recognition (Fu, Zheng, and Mei 2017; Wang et al. 2015; Fang et al. 2018; Pang et al. 2020c), image captioning (Anderson et al. 2018; Anne Hendricks et al. 2016; Xu et al. 2015), classification (Mnih et al. 2014; Hu, Shen, and Sun 2018; Woo et al. 2018; Wang et al. 2017; Tang et al. 2020), and segmentation (Ren and Zemel 2017; Chen et al. 2016; Cao et al. 2020) also employ attention mechanisms based on soft attention maps or bounding boxes to locate salient regions. Moreover, self-attention structures (Wang et al. 2018; Zhu et al. 2019; Huang et al. 2018; Dai et al. 2019), which focus on the combination weights of elements (pixels in vision), are another family of attention methods that use an adjacency-like matrix to represent attention.…”
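To make the combination-weight view concrete, the following is a minimal sketch of scaled dot-product self-attention in the spirit of Vaswani et al. (2017), where the N×N attention matrix plays the role of the adjacency-like matrix over elements; the function and variable names are illustrative, not taken from any of the cited works.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over N elements (rows of x).

    x:             (N, d) element features (e.g., word embeddings or pixel features)
    w_q, w_k, w_v: (d, d_k) learned projection matrices (random here for illustration)
    Returns the attended features and the (N, N) attention matrix,
    which acts as a soft adjacency matrix of combination weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise similarities, shape (N, N)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over elements (rows sum to 1)
    return attn @ v, attn                          # weighted combination of values

# Illustrative usage with random features
rng = np.random.default_rng(0)
N, d, d_k = 6, 16, 8
x = rng.standard_normal((N, d))
w_q, w_k, w_v = (rng.standard_normal((d, d_k)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (6, 8) (6, 6)
```

Each row of `attn` gives the combination weights with which one element attends to all others, which is the sense in which self-attention can be read as a learned, input-dependent adjacency matrix.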