Owing to their strong ability to capture global information, transformers have become an alternative to CNNs for hyperspectral image classification. However, a general transformer mainly considers global spectral information while ignoring the multiscale spatial information of the hyperspectral image. In this paper, we propose a multiscale spectral–spatial convolutional transformer (MultiFormer) for hyperspectral image classification. First, the method uses multiscale spatial patches as tokens to formulate a spatial transformer, which generates a multiscale spatial representation of each band at each pixel. Second, the spatial representations of all bands at a given pixel are used as tokens to formulate a spectral transformer, which generates the multiscale spectral–spatial representation of that pixel. In addition, a modified spectral–spatial CAF module is constructed in MultiFormer to fuse cross-layer spectral and spatial information. As a result, the proposed MultiFormer captures multiscale spectral–spatial information and provides better performance than most other architectures for hyperspectral image classification. Experiments are conducted on commonly used real-world datasets, and the comparison results show the superiority of the proposed method.
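The two-stage token flow described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a parameter-free self-attention (queries, keys, and values all equal to the tokens), mean pooling to reduce each token set to one vector, a hand-made 4-dimensional patch summary in place of a learned patch embedding, and odd patch sizes of 1, 3, and 5. All function names are hypothetical, and the CAF module and the convolutional components are omitted.

```python
import math

def self_attention(tokens):
    # Parameter-free scaled dot-product self-attention (Q = K = V = tokens).
    # Assumption: a real transformer layer would add learned projections,
    # multiple heads, residual connections, and an MLP; all are omitted here.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) / total
                    for j in range(d)])
    return out

def mean_pool(tokens):
    # Average a list of equal-length token vectors into a single vector.
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def patch_token(band, row, col, size):
    # Hypothetical stand-in for a learned patch embedding: summarize the
    # size x size neighborhood (odd size, clamped at the image borders)
    # as [mean, min, max, center value].
    half = size // 2
    h, w = len(band), len(band[0])
    values = [band[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]
              for r in range(row - half, row + half + 1)
              for c in range(col - half, col + half + 1)]
    return [sum(values) / len(values), min(values), max(values), band[row][col]]

def spatial_representation(band, row, col, scales=(1, 3, 5)):
    # Stage 1: one token per patch scale, mixed by attention within one band.
    tokens = [patch_token(band, row, col, s) for s in scales]
    return mean_pool(self_attention(tokens))

def pixel_representation(cube, row, col, scales=(1, 3, 5)):
    # Stage 2: one token per band (its stage-1 output), mixed across bands.
    # cube is a list of bands, each a 2-D list of floats (bands x height x width).
    band_tokens = [spatial_representation(band, row, col, scales) for band in cube]
    return mean_pool(self_attention(band_tokens))
```

Calling `pixel_representation(cube, row, col)` on a cube stored as a list of bands returns one 4-dimensional vector for that pixel, which in a full model would be passed to a classification head.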