In recent years, Wi-Fi infrastructure has become ubiquitous, making device-free passive sensing possible. Wi-Fi signals are affected by reflection, refraction, and absorption caused by moving objects in their path. The channel state information (CSI) of a Wi-Fi signal, an indicator of its signal properties, can therefore be analyzed for human activity recognition (HAR). Deep learning-based HAR models can enhance accuracy without sacrificing computational efficiency. To save further computational power, an inception network, which uses a variety of techniques to boost speed and accuracy, can be adopted. In addition, spatial attention can be applied to obtain refined features. In this paper, we propose a human-human interaction (HHI) classifier, CSI-IANet, which uses a modified inception CNN with a spatial-attention mechanism. CSI-IANet consists of three steps: i) data processing, ii) feature extraction, and iii) recognition. The data processing step first denoises the CSI signal with a second-order Butterworth low-pass filter and then segments it before feeding it to the model. The feature extraction step uses a multilayer modified inception CNN with an attention mechanism that applies spatial attention in an intense structure to extract features from the captured CSI signals. Finally, the recognition step exploits the refined features to classify HHIs. To validate the performance of the proposed CSI-IANet, we used a publicly available HHI CSI dataset containing a total of 4800 trials of 12 interactions and compared the model with existing state-of-the-art methods. The experimental results show that CSI-IANet achieved an average accuracy of 91.30%, which exceeds that of the best existing method by 5%.
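The data processing step (second-order Butterworth low-pass denoising followed by segmentation) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the 10 Hz cutoff, the 1000 Hz sampling rate, and the window and step lengths are all assumptions chosen for illustration, since the abstract does not state them. The filter is a standard biquad derived through the bilinear transform, applied to the amplitude series of a single subcarrier.

```python
import math


def butter2_lowpass_coeffs(cutoff_hz, fs_hz):
    """Coefficients of a second-order Butterworth low-pass biquad.

    Derived with the bilinear transform (quality factor Q = 1/sqrt(2)).
    Returns (b, a) normalized so that a[0] == 1.
    """
    if not 0 < cutoff_hz < fs_hz / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / 2.0 / a0, (1.0 - cos_w0) / a0, (1.0 - cos_w0) / 2.0 / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def lowpass(samples, cutoff_hz, fs_hz):
    """Filter a 1-D sequence (e.g. one subcarrier's CSI amplitude)."""
    b, a = butter2_lowpass_coeffs(cutoff_hz, fs_hz)
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def segment(samples, window, step):
    """Split a sequence into fixed-length, possibly overlapping windows."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]


if __name__ == "__main__":
    fs = 1000.0     # assumed sampling rate in Hz (hypothetical)
    cutoff = 10.0   # assumed cutoff in Hz (hypothetical)
    # Synthetic stand-in for a CSI amplitude stream: slow motion plus fast noise.
    raw = [math.sin(2 * math.pi * 2 * n / fs) + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
           for n in range(4000)]
    clean = lowpass(raw, cutoff, fs)
    windows = segment(clean, window=1000, step=500)  # hypothetical window sizes
    print(len(windows), "segments of", len(windows[0]), "samples")
```

A causal single-pass filter like this one introduces phase delay. If zero-phase output matters, a forward-backward pass (as provided by common signal-processing libraries) would be a natural substitute, although the abstract does not say which variant was used.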