Emotion recognition is one of the most challenging problems in human–computer interaction (HCI). EEG signals are widely used for recognizing emotions because they are easy to acquire, mobile, and convenient. Deep neural networks (DNNs) have produced excellent results in emotion recognition studies. Most studies, however, still extract handcrafted features with other methods, such as the Pearson correlation coefficient (PCC), principal component analysis (PCA), and the Higuchi fractal dimension (HFD), even though DNNs are capable of generating meaningful features on their own. Furthermore, most earlier studies largely ignored the spatial information between channels and focused mainly on time-domain and frequency-domain representations. This study applies a pre-trained 3D-CNN MobileNet model with transfer learning to a spatio-temporal representation of EEG signals to extract features for emotion recognition. In addition to fully connected layers, hybrid models were explored in which the decision layer is replaced by a multilayer perceptron (MLP), k-nearest neighbor (KNN), extreme learning machine (ELM), XGBoost (XGB), random forest (RF), or support vector machine (SVM) classifier. This study also investigates the effect of post-processing, or filtering, the output labels. Extensive experiments were conducted on the SJTU Emotion EEG Dataset (SEED, three classes) and SEED-IV (four classes), and the results obtained were comparable to the state of the art. The conventional 3D-CNN with an ELM classifier reached maximum accuracies of 89.18% on SEED and 81.60% on SEED-IV. Post-filtering improved the classification performance of the hybrid 3D-CNN with ELM model to 90.85% on SEED and 83.71% on SEED-IV. Accordingly, spatio-temporal features extracted from the EEG, combined with ensemble classifiers, were found to be the most effective configuration for recognizing emotions, with performance comparable to state-of-the-art methods.
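The hybrid decision stage described above (an ELM classifier on extracted features, followed by filtering of the output labels) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract specifies neither the ELM configuration nor the type of post-filter, so the tanh activation, the hidden-layer size, the sliding-window majority vote, and the window length are all assumptions, as are the names `ELM` and `majority_filter`. Random clustered vectors stand in for the 3D-CNN feature embeddings.

```python
import numpy as np


class ELM:
    """Minimal extreme learning machine (illustrative, hypothetical settings).

    The hidden layer uses fixed random weights; only the output weights
    are fitted, in closed form, by least squares.
    """

    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random projection followed by a tanh nonlinearity (assumed choice).
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]            # one-hot targets
        self.beta = np.linalg.pinv(H) @ T   # least-squares output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


def majority_filter(labels, window=5):
    """Replace each label with the most frequent label in a centered window.

    This is one plausible form of label post-filtering for consecutive
    EEG segments; the window is truncated at the sequence boundaries.
    """
    labels = np.asarray(labels)
    half = window // 2
    out = np.empty_like(labels)
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        out[i] = np.bincount(labels[lo:hi]).argmax()
    return out


# Stand-in data: three well-separated clusters instead of real CNN features.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 8))
y = rng.integers(0, 3, size=150)
X = centers[y] + 0.05 * rng.normal(size=(150, 8))

model = ELM(n_hidden=64).fit(X, y)
raw_pred = model.predict(X)
smoothed = majority_filter(raw_pred, window=5)
```

Smoothing of this kind only helps when consecutive segments tend to share the same emotion label, as they do within a single trial. On shuffled or independent samples, such as the random stand-in data above, it serves only to show the call pattern.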