Physical exercise affects many facets of life, including mental health, social interaction, physical fitness, and illness prevention, among many others. Therefore, several AI-driven techniques have been developed in the literature to recognize human physical activities. However, these techniques fail to adequately learn the temporal and spatial features of the data patterns. Additionally, these techniques are unable to fully comprehend complex activity patterns over different periods, emphasizing the need for enhanced architectures to further increase accuracy by learning spatiotemporal dependencies in the data individually. Therefore, in this work, we develop an attention-enhanced dual-stream network (PAR-Net) for physical activity recognition with the ability to extract both spatial and temporal features simultaneously. The PAR-Net integrates convolutional neural networks (CNNs) and echo state networks (ESNs), followed by a self-attention mechanism for optimal feature selection. The dual-stream feature extraction mechanism enables the PAR-Net to learn spatiotemporal dependencies from actual data. Furthermore, the incorporation of a self-attention mechanism makes a substantial contribution by facilitating targeted attention on significant features, hence enhancing the identification of nuanced activity patterns. The PAR-Net was evaluated on two benchmark physical activity recognition datasets and achieved higher performance by surpassing the baselines comparatively. Additionally, a thorough ablation study was conducted to determine the best optimal model for human physical activity recognition.