Millimeter-wave radar is widely accepted by the public because it is largely insusceptible to interference such as lighting changes and it preserves personal privacy. With the development of deep learning, deep learning methods have come to dominate the millimeter-wave radar field, typically using convolutional neural networks (CNNs) for feature extraction. In recent years, transformer networks have also attracted researchers' attention for their parallel processing and long-range dependency modeling. However, traditional CNNs and vision transformers each have limitations: CNNs tend to overlook the global features of an image, while vision transformers may neglect local image continuity, and either shortcoming can impede gesture recognition performance. In addition, both CNNs and transformers are hampered by the scarcity of public radar gesture datasets. To address these limitations, this paper proposes a new recognition method based on millimeter-wave radar that uses a local pyramid visual transformer (LPVT). LPVT captures both global and local features in dynamic gesture spectrograms, ultimately improving gesture recognition. This work comprises two main tasks: dataset construction and gesture recognition. First, we construct a gesture dataset for training: a 77 GHz radar collects the echo signals of gestures, which are then preprocessed to build the dataset. Second, we propose the LPVT network, designed specifically for gesture recognition. By integrating local sensing into the globally focused transformer, we improve its capacity to capture both global and local features in dynamic gesture spectrograms. Experiments on the constructed dataset show that the proposed LPVT network achieves a gesture recognition accuracy of 92.2%, exceeding the compared networks.
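The abstract does not specify how the 77 GHz echo signals are turned into dynamic gesture spectrograms. A common preprocessing step for such data is a short-time Fourier transform (STFT) over slow time, which yields a time-frequency (micro-Doppler) image of the hand motion. The sketch below is a minimal illustration under that assumption; the sampling rate, window sizes, and the placeholder echo signal are illustrative values, not parameters taken from the paper.

```python
import numpy as np
from scipy.signal import stft

# Illustrative assumption: a down-mixed, sampled beat signal from a
# 77 GHz FMCW radar; the values below are placeholders, not the paper's.
fs = 2000                                        # slow-time sampling rate (Hz), assumed
echo = np.random.randn(4000).astype(np.float32)  # placeholder gesture echo

# STFT over slow time produces a time-frequency (micro-Doppler)
# spectrogram: frequency bins on one axis, time frames on the other.
f, t, Z = stft(echo, fs=fs, nperseg=128, noverlap=96)
spectrogram = 20 * np.log10(np.abs(Z) + 1e-6)    # log-magnitude in dB
print(spectrogram.shape)                         # (freq bins, time frames)
```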
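To make the architectural idea concrete, the following is a minimal sketch (not the authors' code) of a pyramid-style transformer block augmented with local sensing: a depthwise convolution injects local spatial continuity into the token sequence before global self-attention operates on it. The module names (`LocalPerceptionUnit`, `LPVTBlock`) and all hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class LocalPerceptionUnit(nn.Module):
    """Depthwise 3x3 convolution adding local continuity to patch tokens."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens -> (B, C, H, W) feature map, mix locally, and back
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = feat + self.dwconv(feat)          # residual local mixing
        return feat.flatten(2).transpose(1, 2)


class LPVTBlock(nn.Module):
    """One block: local perception (CNN-like) + global self-attention + MLP."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.local = LocalPerceptionUnit(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x, h, w):
        x = self.local(x, h, w)                  # local features
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # global features
        return x + self.mlp(self.norm2(x))


# Example: 14x14 patch tokens from a gesture spectrogram, 64-dim embeddings
tokens = torch.randn(2, 14 * 14, 64)
block = LPVTBlock(dim=64)
out = block(tokens, h=14, w=14)
print(out.shape)  # torch.Size([2, 196, 64])
```

Stacking such blocks at progressively coarser resolutions would give the pyramid structure the name suggests; the sketch shows only the combination of local and global feature extraction within a single block.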