This work develops Deep Neural Networks (DNNs) that adopt Capsule Networks (CapsNets) and spatiotemporal skeleton-based attention to effectively recognize subject actions from the abundant spatial and temporal contexts of videos. The proposed generic DNN comprises four 3D Convolutional Neural Networks (3D_CNNs), Attention-Jointed Appearance (AJA) and Attention-Jointed Motion (AJM) generation layers, two Reduction Layers (RLs), two Attention-based Recurrent Neural Networks (A_RNNs), and an inference classifier, with RGB, transformed-skeleton, and optical-flow channel streams as inputs. The AJA and AJM generation layers apply skeleton-based attention to the appearances and motions of a subject, respectively. The A_RNNs generate attention weights over time steps to highlight rich temporal contexts. To integrate CapsNets into this generic DNN, three types of CapsNet-based DNNs are devised, in which a CapsNet replaces the classifier, the A_RNN+classifier, and the RL+A_RNN+classifier, respectively. The experimental results reveal that the proposed DNN using a CapsNet as the inference classifier outperforms the other two CapsNet-based DNNs as well as the generic DNN that adopts a feedforward neural network as the inference classifier. Additionally, to the best of our knowledge, our best CapsNet-based DNN achieves average accuracies of 98.5% (state-of-the-art) on UCF101, 82.1% (near state-of-the-art) on HMDB51, and 95.3% on panoramic videos. In particular, we find that the generic CapsNet serves as an outstanding inference classifier but is slightly worse than the A_RNN at interpreting temporal evidence for recognition. Therefore, the proposed DNN, which employs a CapsNet as the inference classifier, is well suited to various context-aware visual applications.

INDEX TERMS Deep neural network, convolutional neural network, recurrent neural network, capsule network, spatiotemporal attention, skeleton, action recognition.