Human interaction recognition is an active research topic in computer vision with a wide range of potential applications. Recognition accuracy is currently limited by three difficulties: the spatial complexity of human interaction, the variation in action characteristics across different time periods, and the complexity of interactive action features. To address the variation in action characteristics across time periods, we propose an improved Gaussian model with fused time-phase features, which extracts video keyframes and removes a large amount of redundant information. To address the complexity of interactive action features, we propose a multi-feature fusion network based on parallel Inception and ResNet branches. This network reduces the number of parameters while improving performance; it alleviates the degradation caused by increasing network depth and achieves higher classification accuracy. To address the spatial complexity of human interaction, we propose a human interaction recognition algorithm based on whole–individual detection, which combines the features of the whole video with those of the individual videos: the whole video contains the global features of both participants in the action, and each individual video contains the detailed features of a single person. Making full use of the feature information in both the whole video and the individual videos is the main contribution of this paper to human interaction recognition. Experimental results on the UT-Interaction dataset (UT dataset) show that the method achieves an accuracy of 91.7%.
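The whole–individual idea can be illustrated with a minimal sketch. The abstract does not say how the whole-video and individual-video features are combined, so the code below assumes a simple late fusion of class probabilities. The function names, the `whole_weight` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_whole_individual(whole_logits, individual_logits, whole_weight=0.5):
    """Late-fuse one whole-video stream with one stream per person.

    ASSUMPTION: weighted averaging of class probabilities is used here only
    for illustration; whole_weight is a hypothetical mixing coefficient.
    Returns (predicted_class_index, fused_probabilities).
    """
    if not 0.0 <= whole_weight <= 1.0:
        raise ValueError("whole_weight must lie in [0, 1]")
    if not individual_logits:
        raise ValueError("need at least one individual stream")

    whole_probs = softmax(whole_logits)
    person_probs = [softmax(scores) for scores in individual_logits]
    n_classes = len(whole_probs)
    if any(len(p) != n_classes for p in person_probs):
        raise ValueError("all streams must score the same set of classes")

    # Average the per-person streams so each person contributes equally.
    mean_person = [
        sum(p[c] for p in person_probs) / len(person_probs)
        for c in range(n_classes)
    ]
    # Blend the global (whole-video) view with the per-person detail view.
    fused = [
        whole_weight * w + (1.0 - whole_weight) * m
        for w, m in zip(whole_probs, mean_person)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


if __name__ == "__main__":
    # Made-up scores for three interaction classes.
    whole = [2.0, 0.5, 0.1]
    persons = [[1.5, 0.2, 0.3], [0.1, 0.4, 2.2]]
    label, fused = fuse_whole_individual(whole, persons)
    print(label, [round(p, 3) for p in fused])
```

Because the fusion operates on probabilities rather than raw scores, each stream contributes on the same scale, and setting `whole_weight` to 1 or 0 recovers the whole-only or individual-only prediction for comparison.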