Hand gestures are a visual input modality for human-computer interaction, with applications in smart homes, healthcare, and eldercare. Most deep learning-based techniques adopt standard convolutional neural networks (CNNs), which require a large number of model parameters and incur high computational complexity; they are therefore unsuitable for devices with limited computational resources. However, reducing the number of model parameters can degrade system accuracy. To address this challenge, we propose a lightweight heterogeneous deep learning-based gesture recognition system, coined CSI-DeepNet. CSI-DeepNet comprises four steps: i) data collection, ii) data processing, iii) feature extraction, and iv) classification. We utilize a low-power system-on-chip (SoC), the ESP-32, for the first time to collect alphanumeric hand gesture datasets using channel state information (CSI), with 1,800 trials of 20 gestures, including the steady-state data of ten people. In the data processing step, a Butterworth low-pass filter with Gaussian smoothing is applied to remove noise; the data is then split into windows of sufficient dimensions before being fed to the model. The feature extraction section uses a depthwise separable convolutional neural network (DS-Conv) with a feature attention (FA) block and a residual block (RB) to extract fine-grained features while reducing complexity through fewer model parameters. Finally, the refined features are classified in the classification section. The proposed system achieves an average accuracy of 96.31% with much lower computational complexity, outperforming state-of-the-art pre-trained CNNs and two deep learning models that use CSI data.

INDEX TERMS Hand gesture recognition, channel state information (CSI), deep learning, depthwise separable convolutional neural network (DS-Conv), feature attention, residual block, system-on-chip (SoC).
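The parameter savings that the abstract attributes to DS-Conv can be illustrated with a simple count. The following is a minimal sketch, assuming square k x k kernels and no bias terms; the function names and the layer sizes (k = 3, 64 input channels, 128 output channels) are hypothetical and are not taken from CSI-DeepNet.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def ds_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution (bias omitted).

    Depthwise stage: one k x k filter per input channel.
    Pointwise stage: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    k, c_in, c_out = 3, 64, 128
    std = standard_conv_params(k, c_in, c_out)
    ds = ds_conv_params(k, c_in, c_out)
    print(f"standard: {std}, depthwise separable: {ds}, ratio: {ds / std:.4f}")
```

For these assumed sizes the ratio equals 1/c_out + 1/k^2, so with 3 x 3 kernels the separable layer needs roughly one ninth of the weights of a standard layer once the output channel count is large.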