Motivation. Automatic recognition of human activities (or events) from video is important to many potential applications of computer vision. One of the most common approaches is the bag-of-visual-features, which aggregates space-time features globally over the entire video clip containing the complete execution of a single activity. The bag-of-visual-features representation does not encode the spatio-temporal structure of the video. For this reason, there is growing interest in modeling the spatio-temporal structure between visual features in order to improve activity recognition results.

The proposed framework. We model the spatio-temporal structure by exploiting qualitative relationships between pairs of visual features. The proposed approach is inspired by [3, 4]. The goal is to find pairs of visual features whose spatio-temporal relationships are sufficiently discriminative, and temporally consistent, to distinguish various activities. The framework is applied to recognize activities from a continuous live video (egocentric view) of a person performing manipulative tasks in an industrial setup. In such environments, the purpose of activity recognition is to assist users by providing on-the-fly instructions from an automatic system that maintains an understanding of the ongoing activities.

In order to recognize activities in real time, we propose a random forest with a discriminative Markov decision tree algorithm that considers a random subset of relational features at a time and a Markov temporal structure that provides temporally smoothed output (Fig. 1). Our algorithm differs from conventional decision trees [2]: it uses a linear SVM as the classifier at each non-terminal node and effectively explores temporal dependency at the terminal nodes of the trees. We explicitly model the spatial relationships left, right, top, bottom, very-near, near, far and very-far, as well as the temporal relationships during, before and after, between a pair of visual features (Fig. 2), which are selected randomly at the non-terminal nodes of a given Markov decision tree. Our hypothesis is that the proposed relationships are particularly suitable for detecting complex, non-periodic manipulative tasks and can easily be applied to existing visual descriptors such as SIFT, STIP, CUBOID and SURF.

Growing discriminative Markov decision trees. Each tree is trained separately on a random subset of frames belonging to the training videos. Learning proceeds recursively by splitting the training frames at internal nodes into left and right subsets. This is done in the following four stages: randomly assign all frames from each activity class to a binary label; randomly sample a pair of visual words; compute the spatio-temporal relationship histogram h between them; and use a linear SVM to learn a binary split on the extracted h. The binary SVM at each internal node sends a frame to the left child if w^T h ≤ 0 and to the right child otherwise, where w is the set of weights learned by the linear SVM. Using an information gain criterion, ea...
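
The following is a minimal sketch (not the authors' code) of the node-split idea described above: it builds a qualitative spatio-temporal relation histogram h for a sampled pair of visual words and learns the binary split with a linear SVM. The feature layout (x, y, t) per detected feature, the distance and time thresholds, the function names, and the use of scikit-learn's LinearSVC are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of one internal-node split of a discriminative Markov decision tree.
# Assumptions: each visual feature is an (x, y, t) tuple already assigned to a
# visual word; thresholds below are placeholders, not values from the paper.
import numpy as np
from sklearn.svm import LinearSVC

SPATIAL = ["left", "right", "top", "bottom", "very-near", "near", "far", "very-far"]
TEMPORAL = ["during", "before", "after"]

def pairwise_relations(fa, fb, near_thr=20.0, far_thr=80.0, t_thr=5):
    """Qualitative spatial and temporal relations between features fa (word A)
    and fb (word B), each given as (x, y, t). Thresholds are assumptions."""
    dx, dy, dt = fb[0] - fa[0], fb[1] - fa[1], fb[2] - fa[2]
    rels = []
    rels.append("right" if dx > 0 else "left")    # horizontal relation
    rels.append("bottom" if dy > 0 else "top")    # vertical relation (image coords)
    d = np.hypot(dx, dy)                          # qualitative distance
    if d < near_thr / 2:
        rels.append("very-near")
    elif d < near_thr:
        rels.append("near")
    elif d < far_thr:
        rels.append("far")
    else:
        rels.append("very-far")
    if abs(dt) <= t_thr:                          # qualitative temporal relation
        rels.append("during")
    elif dt > 0:
        rels.append("before")                     # A occurs before B
    else:
        rels.append("after")
    return rels

def relation_histogram(feats_a, feats_b):
    """Normalized histogram h of relations over all (A, B) feature pairs in a frame."""
    bins = {r: i for i, r in enumerate(SPATIAL + TEMPORAL)}
    h = np.zeros(len(bins))
    for fa in feats_a:
        for fb in feats_b:
            for r in pairwise_relations(fa, fb):
                h[bins[r]] += 1
    return h / max(h.sum(), 1.0)

def learn_node_split(histograms, binary_labels):
    """Learn the binary split at an internal node: frames whose activity classes
    were randomly mapped to 0/1 are separated by a linear SVM on their histograms."""
    svm = LinearSVC()
    svm.fit(np.vstack(histograms), binary_labels)
    return svm

def route_frame(svm, h):
    """Send the frame to the left child if w^T h <= 0, otherwise to the right
    (the SVM bias term is ignored here to mirror the rule stated above)."""
    return "left" if float(svm.coef_[0] @ h) <= 0 else "right"
```

In this sketch, growing a tree would repeat the sample-pair / build-h / fit-SVM steps at each internal node and recurse on the resulting left and right frame subsets, scoring candidate splits with the information gain criterion mentioned above.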