Wireless sensing is increasingly used in smart homes, human–computer interaction and other fields because of its comprehensive coverage, contactless operation and absence of privacy leakage. However, most existing methods recognize gestures from the amplitude or phase of the Wi-Fi signal, which yields insufficient recognition accuracy. To solve this problem, we design a deep spatiotemporal gesture recognition method based on Wi-Fi signals, named Wi-GC. First, the gesture-sensitive antennas are selected, and the signals of these fixed antennas are denoised and smoothed using a combined filter. Consecutive gestures are then segmented using a time series difference algorithm. The segmented gesture data are fed into our proposed RAGRU model, in which BAGRU extracts temporal features of Channel State Information (CSI) sequences and RNet18 extracts spatial features of CSI amplitudes. In addition, we introduce an attention mechanism to pick out essential gesture features. Finally, the extracted spatial and temporal features are fused and fed into a softmax layer for classification. We have extensively verified Wi-GC in a real-world environment; its average gesture recognition rate is between 92% and 95.6%, which demonstrates strong robustness.
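The segmentation step can be illustrated with a minimal sketch. The abstract does not specify the details of the time series difference algorithm, so the following is only one plausible reading: a sample is treated as part of a gesture when the absolute first-order difference of the CSI amplitude exceeds a threshold. The function name `segment_gestures`, the parameters `threshold` and `min_len`, and the synthetic signal are all assumptions made for illustration and are not taken from Wi-GC.

```python
import numpy as np

def segment_gestures(amplitude, threshold, min_len=5):
    """Return (start, end) index pairs of runs where the absolute
    first-order difference of the amplitude exceeds `threshold`.
    Hypothetical helper; not the segmentation rule used by Wi-GC."""
    amplitude = np.asarray(amplitude, dtype=float)
    diff = np.abs(np.diff(amplitude))   # |x[t+1] - x[t]|
    active = diff > threshold           # True where the signal is changing
    segments = []
    start = None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i                   # a run of activity begins
        elif not flag and start is not None:
            if i - start >= min_len:    # discard runs that are too short
                segments.append((start, i))
            start = None
    # close a run that lasts until the end of the series
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

# Synthetic amplitude: static, then a fluctuating "gesture", then static again.
signal = np.concatenate([np.zeros(20), np.tile([1.0, 0.0], 5), np.zeros(20)])
segments = segment_gestures(signal, threshold=0.5)
```

Here `segments` holds index pairs into the difference series, with `end` exclusive. In practice the threshold would have to be tuned to the noise level left after filtering, and a smoothed or windowed difference may be more robust than the raw one used here.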