Gesture recognition aims to interpret dynamic gestures of the human body and is one of the most important modes of human–computer interaction. To extract more effective spatiotemporal features from gesture videos and thereby classify gestures more accurately, this study proposes a novel feature extractor network, the spatiotemporal attention 3D DenseNet. We extend DenseNet with 3D kernels and with a Refined Temporal Transition Layer, which builds on the Temporal Transition Layer, and we also explore attention mechanisms in 3D ConvNets. Embedding the Refined Temporal Transition Layer and the attention mechanism in DenseNet3D yields the proposed network, which we name “spatiotemporal attention 3D DenseNet.” Our experiments show that the Refined Temporal Transition Layer performs better than the Temporal Transition Layer, and that the proposed network outperforms the current state-of-the-art methods in each modality on the ChaLearn LAP Large-Scale Isolated gesture dataset. The code and pretrained model are available at https://github.com/dzf19927/STA3D .
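As a rough illustration of what extending DenseNet with 3D kernels implies for feature-map sizes, the sketch below traces tensor shapes through dense blocks separated by transition layers. It is only shape bookkeeping under stated assumptions, not the released STA3D implementation: the function names, the growth rate, the block depths, the 0.5 channel compression, and the pooling factor of 2 along time, height, and width are all hypothetical values chosen for the example, and neither the Refined Temporal Transition Layer nor the attention mechanism is modeled.

```python
from typing import List, Tuple

# (channels, frames, height, width) of a video feature map
Shape = Tuple[int, int, int, int]


def dense_block_shape(shape: Shape, num_layers: int, growth_rate: int) -> Shape:
    """Shape after a dense block.

    Each layer adds `growth_rate` channels that are concatenated onto its
    input; padded 3D convolutions are assumed, so T, H, W are unchanged.
    """
    c, t, h, w = shape
    return (c + num_layers * growth_rate, t, h, w)


def transition_shape(shape: Shape, compression: float = 0.5, pool: int = 2) -> Shape:
    """Shape after a transition layer (assumed, simplified form).

    Channels are compressed by `compression`, then time, height, and width
    are each pooled by `pool`.
    """
    c, t, h, w = shape
    return (int(c * compression), t // pool, h // pool, w // pool)


def trace_shapes(shape: Shape, block_layers: List[int], growth_rate: int) -> List[Shape]:
    """Return the shape after every dense block and transition, in order."""
    shapes = [shape]
    for i, num_layers in enumerate(block_layers):
        shape = dense_block_shape(shape, num_layers, growth_rate)
        shapes.append(shape)
        # No transition after the final dense block.
        if i < len(block_layers) - 1:
            shape = transition_shape(shape)
            shapes.append(shape)
    return shapes


if __name__ == "__main__":
    # Hypothetical input: 64 channels, 16 frames, 56x56 spatial resolution.
    for s in trace_shapes((64, 16, 56, 56), block_layers=[6, 12], growth_rate=32):
        print(s)
```

The trace makes the trade-off visible: dense connectivity grows the channel count linearly within a block, while each transition shrinks both the channel count and the temporal and spatial extent, which is where a temporal transition layer acts on the time axis.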