A new kind of sequence-to-sequence model called a transformer has been applied to electroencephalogram (EEG) systems. However, the majority of EEG-based transformer models have applied attention mechanisms to the temporal domain, while the connectivity between brain regions and the relationship between different frequencies have been neglected. In addition, many related studies on imagery-based brain-computer interface (BCI) have been limited to classifying EEG signals within one type of imagery. Therefore, it is important to develop a general model to learn various types of neural representations. In this study, we designed an experimental paradigm based on motor imagery, visual imagery, and speech imagery tasks to interpret the neural representations during mental imagery in different modalities. We conducted EEG source localization to investigate the brain networks. In addition, we propose the multiscale convolutional transformer for decoding mental imagery, which applies multi-head attention over the spatial, spectral, and temporal domains. The proposed network shows promising performance with 0.62, 0.70, and 0.72 mental imagery accuracy with the private EEG dataset, BCI competition IV 2a dataset, and Arizona State University dataset, respectively, as compared to the conventional deep learning models. Hence, we believe that it will contribute significantly to overcoming the limited number of classes and low classification performances in the BCI system.