“…Having no previous knowledge about the location of the person in each video frame, the human action in a video stream can be recovered from a great number of local descriptors extracted from the video frames (Sekma et al, 2013), (Dammak et al, 2012), , (Sekma et al, 2014). Local descriptors, coupled with the bag-of-words (BOW) encoding method (Sivic and Zisserman, 2003) (Mejdoub et al, 2008) (Mejdoub et al, 2007) have recently become a very popular video representation (Ben Aoun et al, 2014), (Knopp et al, 2010), (Laptev et al, 2008), (Wang et al, 2009), (Alexander et al, 2008), (Wang et al, 2011), (Raptis and Soatto, 2010), (Pyry et al, 2010), (Jiang et al, 2012) and (Jain et al, 2013). The BOW uses a codebook to create a representation based on the visual content of a video, where the codebook is a set of visual words that represents the distribution of features of all the video.…”