The self-regulated recognition of human activities from time-series smartphone sensor data is a growing research area in smart and intelligent health care. Deep learning (DL) approaches have exhibited improvements over traditional machine learning (ML) models in various domains, including human activity recognition (HAR). Several issues are involved with traditional ML approaches; these include handcrafted feature extraction, which is a tedious and complex task involving expert domain knowledge, and the use of a separate dimensionality reduction module to overcome overfitting problems and hence provide model generalization. In this article, we propose a DL-based approach for activity recognition with smartphone sensor data, i.e., accelerometer and gyroscope data. Convolutional neural networks (CNNs), autoencoders (AEs), and long short-term memory (LSTM) possess complementary modeling capabilities, as CNNs are good at automatic feature extraction, AEs are used for dimensionality reduction and LSTMs are adept at temporal modeling. In this study, we take advantage of the complementarity of CNNs, AEs, and LSTMs by combining them into a unified architecture. We explore the proposed architecture, namely, "ConvAE-LSTM", on four different standard public datasets (WISDM, UCI, PAMAP2, and OPPORTUNITY). The experimental results indicate that our novel approach is practical and provides relative smartphone-based HAR solution performance improvements in terms of computational time, accuracy, F1-score, precision, and recall over existing state-of-the-art methods.