Human action recognition remains a challenging task, especially in real-world settings, because of the diversity of body movements and the variability of recording conditions. This paper proposes a new method that represents a video with a mid-level visual representation extracted from discriminative supervoxels. The discriminative supervoxels, identified in a learning phase, occur frequently within a class and are distinctive enough to separate classes. They capture the meaningful parts of the video, including the moving human body and the background specific to an action. The video is first oversegmented into supervoxels, which are described using dense trajectories within a Bag-of-Words framework. The discriminative supervoxels are then extracted by an iterative procedure of training and selection. Finally, each video is represented in terms of the discriminative supervoxels. Experimental results on the KTH, YouTube, and UT-Interaction datasets demonstrate performance comparable to state-of-the-art models.
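As a rough illustration of the description and representation steps, the sketch below encodes one supervoxel as a Bag-of-Words histogram over its trajectory descriptors and then summarizes a video by its strongest response to each discriminative-supervoxel detector. It is a minimal sketch under stated assumptions, not the method as implemented in the paper: the function names `bow_histogram` and `video_representation`, the nearest-codeword assignment, the linear detectors (`weights`, `biases`), and the max-pooling over supervoxels are all hypothetical choices made for illustration.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """L1-normalized Bag-of-Words histogram for one supervoxel.

    descriptors: (n, d) array of trajectory descriptors inside the supervoxel.
    codebook:    (k, d) array of visual words (assumed precomputed, e.g. by k-means).
    """
    k = codebook.shape[0]
    if descriptors.shape[0] == 0:
        # A supervoxel with no trajectories gets an all-zero histogram.
        return np.zeros(k)
    # Squared Euclidean distance from every descriptor to every codeword.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)  # hard assignment to the nearest codeword
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

def video_representation(supervoxel_hists, weights, biases):
    """Max-pooled detector responses over all supervoxels of one video.

    supervoxel_hists: (m, k) histograms, one row per supervoxel.
    weights, biases:  (p, k) and (p,) parameters of p linear detectors
                      (assumed form of the discriminative supervoxels).
    """
    scores = supervoxel_hists @ weights.T + biases  # (m, p) response matrix
    return scores.max(axis=0)                       # one value per detector
```

With this pooling, the resulting vector has one entry per discriminative supervoxel, so videos with different numbers of supervoxels map to a fixed-length feature that a standard classifier could consume.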