Human action recognition is used in areas such as surveillance, entertainment, and healthcare. This paper proposes a system to recognize both single and continuous human actions from monocular video sequences, based on 3D human modeling and cyclic hidden Markov models (CHMMs). First, for each frame in a monocular video sequence, the 3D coordinates of joints belonging to a human object, through actions of multiple cycles, are extracted using 3D human modeling techniques. The 3D coordinates are then converted into a set of geometrical relational features (GRFs) for dimensionality reduction and discrimination increase. For further dimensionality reduction, k-means clustering is applied to the GRFs to generate clustered feature vectors. These vectors are used to train CHMMs separately for different types of actions, based on the Baum-Welch re-estimation algorithm. For recognition of continuous actions that are concatenated from several distinct types of actions, a designed graphical model is used to systematically concatenate different separately trained CHMMs. The experimental results show the effective performance of our proposed system in both single and continuous action recognition problems.
Keywords
I. IntroductionHuman action recognition is a growing topic in video analysis and understanding -one of the most popular areas in the community of computer vision -thanks to its applications in surveillance, entertainment, and healthcare. In surveillance, human action recognition can be used in conjunction with video camera footage to help with the recognition and analysis of human actions. In entertainment, human-computer interaction can be helped to appear more natural via human action recognition, which in turn can help increase the entertainment experience. In healthcare, human action recognition can help detect abnormal gaits or assist in a patient's rehabilitation through an analysis of their actions.However, it is challenging to recognize various human actions due to the high number of degrees of freedom associated with the average human body -namely, variations in human poses; variations in the colors of a person's clothing; changes in lighting and illumination; variations in viewpoints; and frequent self-occlusion. Moreover, the use of monocular video sequences further increases the difficulty for human action recognition.Generally, the two main stages in human action recognition are: the feature extraction and representation stage and the classification stage.In the feature extraction and representation stage, the features or characteristics of video frames, such as silhouette, shape, color, and motion, are extracted and represented in a systematic and efficient way. consists of stacking segmented silhouettes (frame by frame) to form a 3D spatial-temporal shape. In a similar way, Ke and others [2] build STVs, for shape-based matching, from image features that are based on the consecutive silhouettes of objects along a time axis, including spatial-temporal region extraction and region matching. Kim and oth...