The accurate identification of the human emotional state is crucial for efficient human–robot interaction (HRI). Accordingly, extensive research efforts have been made to develop robust and accurate brain–computer interfacing models based on diverse biosignals. In particular, previous research has shown that the electroencephalogram (EEG) can provide deep insight into emotional states. Recently, researchers have proposed various handcrafted and deep neural network (DNN) models for extracting emotion-relevant features; handcrafted features, however, offer limited robustness to noise, which reduces precision and increases computational complexity. DNN models, by contrast, have proven effective at extracting robust features relevant to emotion classification, but their massive feature dimensionality leads to a high computational load. In this paper, we propose a bag-of-hybrid-deep-features (BoHDF) extraction model for classifying EEG signals into their respective emotion classes. The invariance and robustness of the BoHDF are further enhanced by transforming the EEG signals into 2D spectrograms before the feature extraction stage; such a time-frequency representation fits the time-varying behavior of EEG patterns well. We propose to combine deep features from the fully connected layer of GoogLeNet (one of the simplest DNN models) with the texture-based OMTLBP_SMC features we recently developed, followed by a K-nearest neighbor (KNN) classifier. When evaluated on the DEAP and SEED databases, the proposed model achieves recognition accuracies of 93.83% and 96.95%, respectively. The experimental results show that the proposed BoHDF-based algorithm outperforms previously reported works with similar setups.
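To make the pipeline concrete, the following is a minimal sketch of the processing chain described above, not the authors' implementation: it assumes torchvision's pretrained GoogLeNet as the deep branch, a plain uniform LBP histogram from scikit-image as a stand-in for the OMTLBP_SMC descriptor, k-means for the bag-of-features codebook, and a KNN classifier on the resulting histograms. All function names and parameter values are illustrative.

```python
# Hedged sketch of a BoHDF-style pipeline (spectrogram -> hybrid features -> BoF -> KNN).
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from scipy.signal import spectrogram
from skimage.feature import local_binary_pattern
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def eeg_to_spectrogram(signal, fs=128):
    """Convert a 1-D EEG channel into a normalized 2-D time-frequency image."""
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=fs, noverlap=fs // 2)
    sxx = 10 * np.log10(sxx + 1e-12)                       # log-power in dB
    return (sxx - sxx.min()) / (sxx.max() - sxx.min() + 1e-12)

# Deep branch: 1024-D embedding taken at the input of GoogLeNet's final FC layer.
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = torch.nn.Identity()                         # expose the 1024-D features
googlenet.eval()
to_tensor = T.Compose([T.ToTensor(), T.Resize((224, 224))])

def deep_features(spec_img):
    rgb = np.repeat(spec_img[..., None], 3, axis=-1).astype(np.float32)
    with torch.no_grad():
        return googlenet(to_tensor(rgb).unsqueeze(0)).squeeze(0).numpy()

def texture_features(spec_img, p=8, r=1):
    """Stand-in texture descriptor (uniform LBP histogram), not the actual OMTLBP_SMC."""
    codes = local_binary_pattern((spec_img * 255).astype(np.uint8), p, r, "uniform")
    hist, _ = np.histogram(codes, bins=p + 2, range=(0, p + 2), density=True)
    return hist

def hybrid_features(trial, fs=128):
    """Per-channel hybrid descriptors for one EEG trial of shape (n_channels, n_samples)."""
    feats = []
    for ch in trial:
        spec = eeg_to_spectrogram(ch, fs)
        feats.append(np.hstack([deep_features(spec), texture_features(spec)]))
    return np.asarray(feats)                               # (n_channels, d)

def bohdf_classify(train_trials, train_labels, test_trials, k=64, n_neighbors=5):
    """Quantize per-channel descriptors with a k-means codebook, then classify with KNN."""
    train_feats = [hybrid_features(t) for t in train_trials]
    test_feats = [hybrid_features(t) for t in test_trials]
    codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(train_feats))
    encode = lambda f: np.bincount(codebook.predict(f), minlength=k) / len(f)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit([encode(f) for f in train_feats], train_labels)
    return knn.predict([encode(f) for f in test_feats])
```

In this sketch, each EEG channel yields one hybrid descriptor, the codebook turns a trial's set of descriptors into a fixed-length histogram, and KNN operates on those histograms; the choice of codebook size, LBP parameters, and spectrogram settings is arbitrary here and would need tuning against DEAP or SEED.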