“…In the video activity recognition literature, spatial information is often captured by local space-time features such as those defined in [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] and [12]. These features capture frame-wise spatial information in three steps. First, interest points are detected in each frame, either with an interest point detector (Harris detector, Hessian detectors, edge detectors, corner detectors) or with a sampling method (dense sampling [13] or motion-adaptive sampling [14]). Next, a spatio-temporal region is defined around every detected point in each frame. Finally, each spatio-temporal region is described using one of the local space-time features.…”
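
The detect, define, describe pipeline in the passage can be illustrated with a minimal NumPy sketch. It is not the method of any of the cited works. The function names (`harris_points`, `extract_cuboid`, `describe_cuboid`, `local_features`), the box-filter smoothing, the constant `k = 0.05`, the cuboid size and the gradient-orientation histogram used as a descriptor are all assumptions chosen for illustration. The video is assumed to be a grayscale array of shape `(T, H, W)`.

```python
import numpy as np

def box_filter(img, radius):
    """Mean filter over a (2*radius+1)^2 window, with edge padding."""
    size = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def harris_points(frame, k=0.05, radius=1, max_points=5):
    """Step 1: (y, x) locations with the highest positive Harris response."""
    gy, gx = np.gradient(frame.astype(np.float64))
    # Structure tensor entries, smoothed over a local window.
    sxx = box_filter(gx * gx, radius)
    syy = box_filter(gy * gy, radius)
    sxy = box_filter(gx * gy, radius)
    response = (sxx * syy - sxy ** 2) - k * (sxx + syy) ** 2
    order = np.argsort(response, axis=None)[::-1][:max_points]
    ys, xs = np.unravel_index(order, response.shape)
    return [(int(y), int(x)) for y, x in zip(ys, xs) if response[y, x] > 0]

def extract_cuboid(video, t, y, x, half_t=1, half_s=4):
    """Step 2: spatio-temporal region centred on (t, y, x), clipped at borders."""
    T, H, W = video.shape
    return video[max(t - half_t, 0):min(t + half_t + 1, T),
                 max(y - half_s, 0):min(y + half_s + 1, H),
                 max(x - half_s, 0):min(x + half_s + 1, W)]

def describe_cuboid(cuboid, bins=8):
    """Step 3: magnitude-weighted histogram of spatial gradient orientations.

    This simple descriptor is an illustrative stand-in (assumption), not one
    of the descriptors from the cited references.
    """
    gy, gx = np.gradient(cuboid.astype(np.float64), axis=(1, 2))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)  # values in [-pi, pi]
    hist, _ = np.histogram(angle, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

def local_features(video, max_points=5, bins=8):
    """Run the three steps on every frame; returns [((t, y, x), descriptor)]."""
    features = []
    for t, frame in enumerate(video):
        for y, x in harris_points(frame, max_points=max_points):
            cuboid = extract_cuboid(video, t, y, x)
            features.append(((t, y, x), describe_cuboid(cuboid, bins=bins)))
    return features

if __name__ == "__main__":
    # Synthetic clip: a static bright square whose corners act as interest points.
    video = np.zeros((3, 20, 20))
    video[:, 5:15, 5:15] = 1.0
    for location, descriptor in local_features(video)[:3]:
        print(location, np.round(descriptor, 2))
```

Replacing `harris_points` with a regular grid of locations would give the dense-sampling variant mentioned in the passage; the other two steps would stay unchanged.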