Human Activity Recognition (HAR) plays an essential role in applications such as security, gaming, and assisted living. Recent studies have introduced deep learning to reduce the effort of manual feature extraction (i.e., data representation) and to achieve high accuracy. However, learning accurate representations of sensory data remains challenging due to weak representation modules and inter-subject variance. We propose a scheme called Distance-based HAR from Ensembled spatial-temporal Representations (DHARER) to address these challenges. The idea behind DHARER is straightforward: the same activities should have similar representations. We first learn representations of the input sensory segments, together with latent prototype representations of each class, using a Convolutional Neural Network (CNN)-based dual-stream representation module; the learned representations are then mapped to activity types by measuring their similarity to the learned prototypes. We have conducted extensive experiments under a strict subject-independent setting on three large-scale datasets, and the results demonstrate the superior performance of DHARER over several state-of-the-art methods.
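To make the distance-based classification idea concrete, below is a minimal PyTorch sketch of classifying sensor segments by their similarity to learnable class prototypes. It is illustrative only: the single-stream 1D CNN encoder, the embedding size, and the use of negative squared Euclidean distance as logits are assumptions for the example, not the paper's exact dual-stream architecture.

```python
# Sketch: distance-based classification with learned class prototypes,
# in the spirit of DHARER's description. The encoder is a placeholder
# single-stream 1D CNN standing in for the dual-stream module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHAR(nn.Module):
    def __init__(self, in_channels: int, num_classes: int, embed_dim: int = 64):
        super().__init__()
        # Placeholder encoder: maps a sensory segment (channels x time)
        # to a fixed-length representation.
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
            nn.Flatten(),
        )
        # One learnable prototype vector per activity class.
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.as_tensor(self.encoder(x))     # (batch, embed_dim)
        # Negative squared Euclidean distance to each prototype serves as
        # the class logit: the closest prototype wins at prediction time.
        dists = torch.cdist(z, self.prototypes)  # (batch, num_classes)
        return -dists.pow(2)

# Usage: treat negative distances as logits for cross-entropy training,
# so segments are pulled toward their own class prototype.
model = PrototypeHAR(in_channels=6, num_classes=5)  # e.g., accel + gyro axes
segments = torch.randn(8, 6, 128)                   # batch of 128-sample windows
logits = model(segments)
labels = torch.randint(0, 5, (8,))
loss = F.cross_entropy(logits, labels)
loss.backward()
```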