Blind modulation classification is an important step in implementing cognitive radio networks. The multiple-input multiple-output (MIMO) technique is widely used in military and civil communication systems. Due to the lack of prior information about channel parameters and the overlapping of signals in MIMO systems, the traditional likelihood-based and feature-based approaches cannot be applied in these scenarios directly. Hence, in this paper, to resolve the problem of blind modulation classification in MIMO systems, the time–frequency analysis method based on the windowed short-time Fourier transform was used to analyze the time–frequency characteristics of time-domain modulated signals. Then, the extracted time–frequency characteristics are converted into red–green–blue (RGB) spectrogram images, and the convolutional neural network based on transfer learning was applied to classify the modulation types according to the RGB spectrogram images. Finally, a decision fusion module was used to fuse the classification results of all the receiving antennas. Through simulations, we analyzed the classification performance at different signal-to-noise ratios (SNRs); the results indicate that, for the single-input single-output (SISO) network, our proposed scheme can achieve 92.37% and 99.12% average classification accuracy at SNRs of −4 and 10 dB, respectively. For the MIMO network, our scheme achieves 80.42% and 87.92% average classification accuracy at −4 and 10 dB, respectively. The proposed method greatly improves the accuracy of modulation classification in MIMO networks.