Speech signal processing is an active area of research: speech is the most dominant medium of information exchange among human beings and a natural modality for human–computer interaction (HCI). Assessing human behavior and recognizing emotion from a speech signal, known as speech emotion recognition (SER), is an emerging HCI area of exploration with various real-time applications. The performance of an efficient SER system depends on feature learning, which captures salient and discriminative information such as high-level deep features. In this paper, we propose a two-stream deep convolutional neural network with iterative neighborhood component analysis (INCA) to jointly learn spatial and spectral features and to select the most discriminative features for the final prediction. Our model is composed of two channels, each built on a convolutional neural network structure to extract cues from the speech signals. The first channel extracts features from the spectral domain and the second channel extracts features from the spatial domain; the two feature sets are then fused and fed to INCA, which removes redundancy and selects the optimal features for the final model training. The jointly refined features are passed through a fully connected network with a softmax classifier to yield predictions for the different emotions. We trained the proposed system on three benchmarks, the EMO-DB, SAVEE, and RAVDESS emotional speech corpora, and evaluated its prediction performance, achieving recognition rates of 95%, 82%, and 85%, respectively. These results demonstrate the effectiveness and significance of the proposed system.
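
To make the two-stream pipeline concrete, the sketch below shows a minimal layout of this kind in PyTorch. It is an illustrative assumption rather than the authors' exact configuration: the layer counts, kernel shapes, feature dimensions, and input representations are all hypothetical, and the INCA feature-selection step is only indicated by a comment, not implemented.

```python
# A minimal, hypothetical sketch of a two-stream CNN for SER with
# feature-level fusion. All sizes and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """One convolutional channel (e.g., the spectral or spatial stream)."""
    def __init__(self, in_channels: int = 1, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size vector
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.proj(h)

class TwoStreamSER(nn.Module):
    """Fuses the two streams, then classifies the emotion."""
    def __init__(self, num_classes: int = 7, feat_dim: int = 128):
        super().__init__()
        self.spectral = StreamCNN(feat_dim=feat_dim)
        self.spatial = StreamCNN(feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, spec_in: torch.Tensor, spat_in: torch.Tensor) -> torch.Tensor:
        # Concatenate the two feature vectors (feature-level fusion).
        fused = torch.cat([self.spectral(spec_in), self.spatial(spat_in)], dim=1)
        # In the paper, INCA would select an optimal subset of `fused`
        # before classification; this sketch feeds the fused vector directly.
        return self.classifier(fused)  # logits; softmax is applied in the loss

model = TwoStreamSER(num_classes=7)
spec = torch.randn(4, 1, 64, 64)  # e.g., spectrogram patches (assumed shape)
spat = torch.randn(4, 1, 64, 64)  # e.g., a spatial-domain representation
logits = model(spec, spat)
print(logits.shape)  # torch.Size([4, 7])
```

In a sketch like this, each stream ends in global pooling so the two channels can produce fixed-size vectors regardless of input length, which keeps the concatenation-based fusion straightforward; training would use a cross-entropy loss, which applies the softmax internally.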