Recognition of Group Activities in Videos Based on Single-and Two-Person Descriptors

Lathuiliere, Stephane; Evangelidis, Georgios; Horaud, Radu

doi:10.1109/wacv.2017.31

Cited by 4 publications

(4 citation statements)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Furthermore, our approac presents a homogeneous error distribution among the classes, as indicated by the lowest standard deviation. Our overall accuracy was similar to (Sun et al, 2016) and (Amer et al, 2014), and better than recent approaches such as SSVM (Lathuilière et al, 2017) and RMIC (Wang et al, 2017). It should be noticed that the best overall method DCGF+GRU (Kim et al, 2018) was trained and evaluated with augmented data, so that a direct comparison of the results might be biased.…”

Section: Collective Behavior Recognitionsupporting

confidence: 57%

“…With the objective of detecting groups and their characteristics in a crowd scenario, Shao et al (2016) define priors that aim to add temporal smoothness and consistency for collective transitionsgroups are obtained by searching for pedestrians sets that fit well these priors. Lathuilière et al (2017) presented an approach for group activity recognition using single and two-person descriptors using Structured Support Vector Machines (SSVMs). For that purpose, they use a dictionary-based method, exploring geometric characteristics of the relative pose and motion between the two persons.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Collective behavior recognition using compact descriptors

Fuhr,

Jung

2018

Preprint

View full text Add to dashboard Cite

This paper presents a novel hierarchical approach for collective behavior recognition based solely on ground-plane trajectories. In the first layer of our classifier, we introduce a novel feature called Personal Interaction Descriptor (PID), which combines the spatial distribution of a pair of pedestrians within a temporal window with a pyramidal representation of the relative speed to detect pairwise interactions. These interactions are then combined with higher level features related to the mean speed and shape formed by the pedestrians in the scene, generating a Collective Behavior Descriptor (CBD) that is used to identify collective behaviors in a second stage. In both layers, Random Forests were used as classifiers, since they allow features of different natures to be combined seamlessly. Our experimental results indicate that the proposed method achieves results on par with state of the art techniques with a better balance of class errors. Moreover, we show that our method can generalize well across different camera setups through cross-dataset experiments.

show abstract

Section: Collective Behavior Recognitionsupporting

confidence: 57%

Section: Introductionmentioning

confidence: 99%

Collective behavior recognition using compact descriptors

Fuhr,

Jung

2018

Preprint

View full text Add to dashboard Cite

show abstract

“…Yin et al [162] employ 3D-SIFT to describe local motion events, but used a HOF to model the global motion in an image. Similarly, Lathuilière et al [73] combined HOG descriptors and trajectory information from linked local features. Single-person and two-person interaction attributes such as "two persons are standing side-by-side" were calculated from these features.…”

Section: Template Based Approachesmentioning

confidence: 99%

Analyzing human–human interactions: A survey

Stergiou

Poppe

2019

Computer Vision and Image Understanding

View full text Add to dashboard Cite

Many videos depict people, and it is their interactions that inform us of their activities, relation to one another and the cultural and social setting. With advances in human action recognition, researchers have begun to address the automated recognition of these human-human interactions from video. The main challenges stem from dealing with the considerable variation in recording settings, the appearance of the people depicted and the performance of their interaction. This survey provides a summary of these challenges and datasets, followed by an in-depth discussion of relevant vision-based recognition and detection methods. We focus on recent, promising work based on convolutional neural networks (CNNs). Finally, we outline directions to overcome the limitations of the current state-of-the-art. Main challenges in the fieldWe identify challenges when dealing with the visual and structural aspects of interaction videos. Additionally, we outline practical challenges in the development of methods of automated human-human action recognition.

show abstract

“…Yin et al [303] employ 3D-SIFT to describe local motion events, but used a HOF to model the global motion of the video sequence. Similarly, Lathuiliere et al [138] combined HOG descriptors and trajectory information from linked local features. Single-person and two-person interaction attributes, such as "two persons are standing side-by-side", were calculated from these features.…”

Section: Template-based Approachesmentioning

confidence: 99%

Efficient Modelling Across Time of Human Actions and Interactions

Stergiou¹

View full text Add to dashboard Cite

This thesis focuses on video understanding for human action and interaction recognition. We start by identifying the main challenges related to action recognition from videos and review how they have been addressed by current methods.Based on these challenges, and by focusing on the temporal aspect of actions, we argue that current fixed-sized spatio-temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input. Our contributions are based on the enlargement of the convolutional receptive fields through the introduction of spatio-temporal size-varying segments of videos, as well as the discovery of the local feature relevance over the entire video sequence. The resulting extracted features encapsulate information that includes the importance of local features across multiple temporal durations, as well as the entire video sequence.Subsequently, we study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture. The hierarchical extraction of features models variations of relatively similar classes the same as very dissimilar classes. Therefore, distinctions between similar classes are less likely to be modelled. The proposed approach regularises feature maps by amplifying features that correspond to the class of the video that is processed. We move away from class-agnostic networks and make early predictions based on feature amplification mechanism.The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results. In terms of performance, we compete with the state-of-the-art while being more efficient in terms of GFLOPs.Finally, we present a human-understandable approach aimed at providing visual explanations for features learned over spatio-temporal networks. We isolate spatio-temporal regions in 3D-CNNs that are informative for an action class. We extend this approach to allow for the traversal over the entire network architecture, incrementally discovering kernels at different complexities, and modelling layers related to a specific class.ix Chapter 1 IntroductionChapter 2, Related Work. We detail a synopsis of current progress made in action recognition. We distinguish between approaches that have been based on hand-crafted features and methods that used learned features through optimisation. We then present the main motivation for the works that have been included in this thesis. We discuss how our approaches differe from previous efforts and the challenges that our methods address.Chapter 3, Datasets for Video Understanding. We provide an overview of historic and current datasets used as action recognition benchmarks. We focus on datasets that exemplify milestones achieved in terms of data collection and increases in the data complexity. We then present the datasets that are used in this thesis.Chapter 4, Improving Action Recognition through Time-Consistent Features. We present a novel method to address var...

show abstract

Recognition of Group Activities in Videos Based on Single-and Two-Person Descriptors

Cited by 4 publications

References 37 publications

Collective behavior recognition using compact descriptors

Collective behavior recognition using compact descriptors

Analyzing human–human interactions: A survey

Efficient Modelling Across Time of Human Actions and Interactions

Contact Info

Product

Resources

About