Actor-Transformers for Group Activity Recognition

Gavrilyuk, Kirill; Sanford, Ryan; Javan, Mehrsan; Snoek, Cees G. M.

doi:10.1109/cvpr42600.2020.00092

Cited by 175 publications

(119 citation statements)

References 40 publications

“…The talking activity is recorded for both indoor and outdoor scenes, allowing us to test our 3D localization performance on different scenarios. Compared to other deep learning methods [115]-[117], we analyze each frame independently with no temporal information, and we do not perform any training for this task, using all the dataset for testing.…”

Section: Social Interactions (mentioning)

confidence: 99%

Perceiving Humans: From Monocular 3D Localization to Social Distancing

Bertoni

Kreiss

Alahi

2022

IEEE Trans. Intell. Transport. Syst.

Perceiving humans in the context of Intelligent Transportation Systems (ITS) often relies on multiple cameras or expensive LiDAR sensors. In this work, we present a new cost-effective vision-based method that perceives humans' locations in 3D and their body orientation from a single image. We address the challenges related to the ill-posed monocular 3D tasks by proposing a neural network architecture that predicts confidence intervals in contrast to point estimates. Our neural network estimates human 3D body locations and their orientation with a measure of uncertainty. Our proposed solution (i) is privacy-safe, (ii) works with any fixed or moving cameras, and (iii) does not rely on ground plane estimation. We demonstrate the performance of our method with respect to three applications: locating humans in 3D, detecting social interactions, and verifying the compliance of recent safety measures due to the COVID-19 outbreak. We show that it is possible to rethink the concept of "social distancing" as a form of social interaction in contrast to a simple location-based rule. We publicly share the source code towards an open science mission.

“…Background clutter and occlusions between multiple people occur frequently. [12]
BEHAVE | 10 | N/A | 2009 | Surveillance video | 77.6% | Zhang et al [13]
CAD1 | 5 | 6 | 2009 | Surveillance video | 95.7% | Tang et al [14]
CAD2 | 6 | 8 | 2011 | Surveillance video | 85.5% | Khamis et al [15]
CAD3 | 6 | 3 | 2012 | Surveillance video | 87.2% | Amer et al [16]
UCLA Courtyard | 6 | 10 | 2012 | Surveillance video | 83.7% | Amer et al [17]
Nursing Home | 2 | 6 | 2012 | Surveillance video | 85.5% | Deng et al [18]
Broadcast Field Hockey | 3 | 11 | 2012 | Sports video | 62.9% | Lan et al [19]
NCAA Basketball | 11 | N/A | 2016 | Sports video | 58.1% | Wu et al [20]
Volleyball | 8 | 8 | 2016 | Sports video | 94.4% | Gavrilyuk et al [21]
C-Sports | 5 | N/A | 2020 | Sports video | 81.3% | Zalluhoglu and Ikizler-Cinbis [22]
NBA | 9 | N/A | 2020 | Sports video | 47.5% | Yan et al [23]
(a)…”

Section: Surveillance Datasets (mentioning)

confidence: 99%

“…Zhang et al [69] Unified modeling framework 83.8/N 86.0 framework in [20]. A two-stage scheme for event classification in basketball videos is proposed.…”

Section: Hierarchical Temporal Modeling (mentioning)

confidence: 99%

A Comprehensive Review of Group Activity Recognition in Videos

Wang

Jian

et al.

2021

Int. J. Autom. Comput.

Human group activity recognition (GAR) has attracted significant attention from computer vision researchers due to its wide practical applications in security surveillance, social role understanding and sports video analysis. In this paper, we give a comprehensive overview of the advances in group activity recognition in videos during the past 20 years. First, we provide a summary and comparison of 11 GAR video datasets in this field. Second, we survey the group activity recognition methods, including those based on handcrafted features and those based on deep learning networks. For better understanding of the pros and cons of these methods, we compare various models from the past to the present. Finally, we outline several challenging issues and possible directions for future research. From this comprehensive literature review, readers can obtain an overview of progress in group activity recognition for future studies.

“…Previous approaches [2], [3], [21], [22] for group activity recognition focus on designing suitable features and modeling relation among the actors using probabilistic graphical models or AND-OR grammars. Recently, significant progress has been made in the domain of group activity recognition [5], [13], [16]-[18], [23], [29], [32], [40], mainly due to the advent of convolutional neural networks (CNNs). Ibrahim et al [18] propose a two-stage deep temporal model to capture temporal dynamics.…”

Section: Related Work (mentioning)

confidence: 99%

“…Wu et al [40] build an actor-relation graph using a GCN to model the relational feature among the actors. Gavrilyuk et al [13] use self attention mechanism to model the dependency among the people present in a scene. These approaches mainly focus on designing appropriate models to understand the interaction pattern involving people present in a scene.…”

Section: Related Work (mentioning)

confidence: 99%

Context Aware Group Activity Recognition

Dasgupta

Jawahar

Alahari

2021

2020 25th International Conference on Pattern Recognition (ICPR)

This paper addresses the task of group activity recognition in multi-person videos. Existing approaches decompose this task into feature learning and relational reasoning. Despite showing progress, these methods only rely on appearance features for people and overlook the available contextual information, which can play an important role in group activity understanding. In this work, we focus on the feature learning aspect and propose a two-stream architecture that not only considers person-level appearance features, but also makes use of contextual information present in videos for group activity recognition. In particular, we propose to use two types of contextual information beneficial for two different scenarios: pose context and scene context that provide crucial cues for group activity understanding. We combine appearance and contextual features to encode each person with an enriched representation. Finally, these combined features are used in relational reasoning for predicting group activities. We evaluate our method on two benchmarks, Volleyball and Collective Activity and show that joint modeling of contextual information with appearance features benefits in group activity understanding.
