Human emotions are complex psychological and physiological responses to external stimuli. Correctly identifying and providing feedback on emotions is an important goal in human–computer interaction research. Compared to facial expressions, speech, or other physiological signals, using electroencephalogram (EEG) signals for the task of emotion recognition has advantages in terms of authenticity, objectivity, and high reliability; thus, it is attracting increasing attention from researchers. However, the current methods have significant room for improvement in terms of the combination of information exchange between different brain regions and time–frequency feature extraction. Therefore, this paper proposes an EEG emotion recognition network, namely, self-organized graph pesudo-3D convolution (SOGPCN), based on attention and spatiotemporal convolution. Unlike previous methods that directly construct graph structures for brain channels, the proposed SOGPCN method considers that the spatial relationships between electrodes in each frequency band differ. First, a self-organizing map is constructed for each channel in each frequency band to obtain the 10 most relevant channels to the current channel, and graph convolution is employed to capture the spatial relationships between all channels in the self-organizing map constructed for each channel in each frequency band. Then, pseudo-three-dimensional convolution combined with partial dot product attention is implemented to extract the temporal features of the EEG sequence. Finally, LSTM is employed to learn the contextual information between adjacent time-series data. Subject-dependent and subject-independent experiments are conducted on the SEED dataset to evaluate the performance of the proposed SOGPCN method, which achieves recognition accuracies of 95.26% and 94.22%, respectively, indicating that the proposed method outperforms several baseline methods.