2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00092
Actor-Transformers for Group Activity Recognition

Abstract: This paper strives to recognize individual actions and group activities from videos. While existing solutions for this challenging problem explicitly model spatial and temporal relationships based on the locations of individual actors, we propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition. We feed the transformer with rich actor-specific static and dynamic representations expressed by features from a 2D pose network and a 3D CNN, respectively. W…
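The abstract describes feeding per-actor static (2D pose) and dynamic (3D CNN) features into a transformer that selectively attends to information relevant for the group activity. A minimal PyTorch sketch of that idea, where each actor is one token; all dimensions, head counts, layer counts, and class counts here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ActorTransformer(nn.Module):
    """Hypothetical sketch of the actor-transformer idea: each actor is a
    token, and self-attention across actors selects information relevant
    to the group activity. Sizes are illustrative, not the paper's."""
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2,
                 n_actions=9, n_activities=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(feat_dim, n_actions)       # per actor
        self.activity_head = nn.Linear(feat_dim, n_activities)  # per group

    def forward(self, actor_feats):
        # actor_feats: (batch, n_actors, feat_dim), e.g. fused pose + 3D-CNN
        h = self.encoder(actor_feats)
        actions = self.action_head(h)                 # (B, N, n_actions)
        activity = self.activity_head(h.mean(dim=1))  # pool actors -> group
        return actions, activity

feats = torch.randn(2, 12, 256)   # 2 clips, 12 actors, 256-d features
actions, activity = ActorTransformer()(feats)
print(actions.shape, activity.shape)
```

Mean-pooling over actors for the group head is one simple aggregation choice; any permutation-invariant pooling would fit the same scheme.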

Cited by 175 publications (119 citation statements)
References 40 publications
“…The talking activity is recorded for both indoor and outdoor scenes, allowing us to test our 3D localization performance in different scenarios. Compared to other deep learning methods [115]-[117], we analyze each frame independently, with no temporal information, and we perform no training for this task, using the entire dataset for testing.…”
Section: Social Interactions (mentioning)
confidence: 99%
“…Background clutter and occlusions between multiple people occur frequently [12].

Dataset | Activities | Actions | Year | Video type | Best accuracy | Method
BEHAVE | 10 | N/A | 2009 | Surveillance video | 77.6% | Zhang et al. [13]
CAD1 | 5 | 6 | 2009 | Surveillance video | 95.7% | Tang et al. [14]
CAD2 | 6 | 8 | 2011 | Surveillance video | 85.5% | Khamis et al. [15]
CAD3 | 6 | 3 | 2012 | Surveillance video | 87.2% | Amer et al. [16]
UCLA Courtyard | 6 | 10 | 2012 | Surveillance video | 83.7% | Amer et al. [17]
Nursing Home | 2 | 6 | 2012 | Surveillance video | 85.5% | Deng et al. [18]
Broadcast Field Hockey | 3 | 11 | 2012 | Sports video | 62.9% | Lan et al. [19]
NCAA Basketball | 11 | N/A | 2016 | Sports video | 58.1% | Wu et al. [20]
Volleyball | 8 | 8 | 2016 | Sports video | 94.4% | Gavrilyuk et al. [21]
C-Sports | 5 | N/A | 2020 | Sports video | 81.3% | Zalluhoglu and Ikizler-Cinbis [22]
NBA | 9 | N/A | 2020 | Sports video | 47.5% | Yan et al. [23]…”
Section: Surveillance Datasets (mentioning)
confidence: 99%
“…Zhang et al. [69] propose a unified modeling framework (83.8/86.0), extending the framework in [20]. A two-stage scheme for event classification in basketball videos is proposed.…”
Section: Hierarchical Temporal Modeling (mentioning)
confidence: 99%
“…Previous approaches [2], [3], [21], [22] for group activity recognition focus on designing suitable features and modeling relations among the actors using probabilistic graphical models or AND-OR grammars. Recently, significant progress has been made in group activity recognition [5], [13], [16]-[18], [23], [29], [32], [40], mainly due to the advent of convolutional neural networks (CNNs). Ibrahim et al. [18] propose a two-stage deep temporal model to capture temporal dynamics.…”
Section: Related Work (mentioning)
confidence: 99%
“…Wu et al. [40] build an actor-relation graph using a GCN to model relational features among the actors. Gavrilyuk et al. [13] use a self-attention mechanism to model dependencies among the people present in a scene. These approaches mainly focus on designing appropriate models to understand the interaction patterns among the people present in a scene.…”
Section: Related Work (mentioning)
confidence: 99%
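The two lines of work contrasted above, relation graphs (Wu et al. [40]) and self-attention (Gavrilyuk et al. [13]), both reduce to computing pairwise actor affinities and aggregating features over them. A hedged sketch in the spirit of an actor-relation graph, where the dot-product similarity and one round of message passing are illustrative assumptions rather than the published design:

```python
import torch

def actor_relation_graph(feats):
    """Sketch of an actor-relation graph: edge weights come from pairwise
    feature similarity, then one graph-convolution step aggregates each
    actor's neighbours. Similarity choice and single step are assumptions."""
    # feats: (n_actors, d) per-actor appearance features
    sim = feats @ feats.t()            # pairwise dot-product affinities
    adj = torch.softmax(sim, dim=-1)   # row-normalized relation graph
    return adj @ feats                 # aggregate features over the graph

x = torch.randn(12, 64)               # 12 actors, 64-d features
out = actor_relation_graph(x)
print(out.shape)
```

With learned query/key/value projections in place of the raw features, the same computation becomes the self-attention used by the transformer-based approach.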