In Automatic Modulation Recognition (AMR), modulation modes with similar characteristics are easily confused under adverse conditions such as noise, which degrades recognition accuracy. To address this problem, this paper proposes TC-MSNet, a novel deep learning-based neural network built on the collaboration of multi-scale spatial-temporal features. TC-MSNet extracts Temporal Correlation (TC) features and Multi-Scale Spatial (MSS) features separately to enhance the diversity of feature extraction, while a fusion strategy built on a convolutional attention mechanism enables sufficient collaboration among the multi-scale features. Additionally, a Bilinear-Pooling Mechanism (BPM) is adopted to capture the differences among fine-grained features, helping to eliminate the confusion among distinct modulation modes and improve AMR accuracy. Simulation results on the RML2016.10a and RML2016.10b datasets show that the proposed TC-MSNet not only outperforms existing deep learning-based AMR methods but also has a stronger ability to eliminate confusion among similar features.
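To make the bilinear-pooling idea concrete, the following is a minimal NumPy sketch of generic bilinear pooling over two feature maps. It is not the TC-MSNet implementation: the abstract does not specify the form of the BPM, so the function name `bilinear_pool`, the `(channels, positions)` input layout, and the signed square-root plus L2 normalization steps are all assumptions, chosen because they are common in fine-grained recognition.

```python
import numpy as np

def bilinear_pool(a, b, eps=1e-12):
    """Generic bilinear pooling (illustrative, not the paper's BPM).

    a: array of shape (C1, T), b: array of shape (C2, T), i.e. two
    feature maps sharing T positions (assumed layout).
    Returns a normalized vector of length C1 * C2.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 2 or b.ndim != 2 or a.shape[1] != b.shape[1]:
        raise ValueError("inputs must be 2-D with the same number of positions")

    # Outer product of the two channel vectors at every position,
    # averaged over positions: captures pairwise channel interactions.
    pooled = a @ b.T / a.shape[1]          # shape (C1, C2)
    v = pooled.ravel()

    # Signed square root, then L2 normalization (a common post-processing
    # choice; assumed here, not stated in the abstract).
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    temporal = rng.standard_normal((4, 128))   # stand-in for TC features
    spatial = rng.standard_normal((6, 128))    # stand-in for MSS features
    print(bilinear_pool(temporal, spatial).shape)
```

The pairwise channel products are what let a classifier separate classes whose individual features look alike, which is the motivation the abstract gives for adopting bilinear pooling.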