Searching Video for Complex Activities with Finite State Models

İkizler, Nazlı; Forsyth, David

doi:10.1109/cvpr.2007.383168

Cited by 81 publications

(51 citation statements)

References 33 publications


“…Hidden semi-Markov Models (HSMM) [10] , CRFs [31], and finite-state-machines [13] have been used to model the temporal evolution of human activities. Recently, Tang et al [32] propose a conditional variant of HSMM incorporating the max-margin framework in the training phase.…”

Section: Sequential Models (mentioning)

confidence: 99%

Poselet Key-Framing: A Model for Human Activity Recognition

Raptis

Sigal

2013

2013 IEEE Conference on Computer Vision and Pattern Recognition

206

126


“…This approach has difficulties in aligning non-repetitive actions in complex scenes. Moreover, some researchers model the configuration of the human body and its evolution in the time domain [9,10], and others solely perform action recognition from still images by computing pose primitives [11,12].…”

Section: Human Action Recognition (mentioning)

confidence: 99%

Selective spatio-temporal interest points

Chakraborty

Holte

Moeslund

et al. 2012

Computer Vision and Image Understanding

121


Abstract. Recent progress in the field of human action recognition points towards the use of Spatio-Temporal Interest Points (STIPs) for local descriptor-based recognition strategies. In this paper, we present a novel approach for robust and selective STIP detection, by applying surround suppression combined with local and temporal constraints. This new method is significantly different from existing STIP detection techniques and improves the performance by detecting more repeatable, stable and distinctive STIPs for human actors, while suppressing unwanted background STIPs. For action representation we use a bag-of-video words (BoV) model of local N-jet features to build a vocabulary of visual-words. To this end, we introduce a novel vocabulary building strategy by combining spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency. Action class specific Support Vector Machine (SVM) classifiers are trained for categorization of human actions. A comprehensive set of experiments on popular benchmark datasets (KTH and Weizmann), more challenging datasets of complex scenes with background clutter and camera motion (CVC and CMU), movie and YouTube video clips (Hollywood 2 and YouTube), and complex scenes with multiple actors (MSR I and Multi-KTH), validates our approach and shows state-of-the-art performance. Due to the unavailability of ground truth action annotation data for the Multi-KTH dataset, we introduce an actor specific spatio-temporal clustering of STIPs to address the problem of automatic action annotation of multiple simultaneous actors. Additionally, we perform cross-data action recognition by training on source datasets (KTH and Weizmann) and testing on completely different and more challenging target datasets (CVC, CMU, MSR I and Multi-KTH). This documents the robustness of our proposed approach in the realistic scenario, using separate training and test datasets, which in general has been a shortcoming in the performance evaluation of human action recognition techniques.


“…Several works have considered a general approach of action recognition, for instance aiming to distinguish among several different activities like walking, jogging, waving, running, boxing and clapping [4,5]. The state-of-the-art research focus the limb tracking to model the human activities [6], an approach that is limited to high resolution targets and uncluttered environments [7]. In order to cope with cluttered environments, several works model activities using motion-based features [8,3], shape-based features [9], space-time interest points [4] or a combination of some of the above features [10].…”

Section: Related Work (mentioning)

confidence: 99%

Waving Detection Using the Local Temporal Consistency of Flow-Based Features for Real-Time Applications

Moreno

Bernardino

Santos-Victor

2009

Lecture Notes in Computer Science


Abstract. We present a method to detect people waving using video streams from a fixed camera system. Waving is a natural means of calling for attention and can be used by citizens to signal emergency events or abnormal situations in future automated surveillance systems. Our method is based on training a supervised classifier using a temporal boosting method based on optical flow-derived features. The base algorithm shows a low false positive rate and it further improves through the definition of a minimum time for the duration of the waving event. The classifier generalizes well to scenarios very different from where it was trained. We show that a system trained indoors with high resolution and frontal postures can operate successfully, in real-time, in an outdoor scenario with large scale differences and arbitrary postures.


