Landslides pose a serious threat to human life, safety, and natural resources. Remote sensing images can be used to monitor landslides effectively at a large scale, which is of great significance for pre-disaster warning and post-disaster assistance. In recent years, deep-learning-based methods have made great progress in landslide detection from remote sensing images. However, landslides in remote sensing images display a wide variety of scales and shapes. In this paper, to better extract and preserve the multi-scale shape information of landslides, a shape-enhanced vision transformer (ShapeFormer) model is proposed. For feature extraction, a pyramid vision transformer (PVT) model is introduced, which directly models the global information of local elements at different scales. To learn the shape information of different landslides, a shape feature extraction branch is designed, which uses adjacent feature maps at different scales in the PVT model to enhance the boundary information. The feature extraction step is followed by a decoder with deconvolutional layers, which combines the multiple features and gradually recovers the original resolution of the combined features. A softmax layer is then applied to the combined features to obtain the final pixelwise result. The proposed ShapeFormer model was tested on two public datasets, the Bijie dataset and the Nepal dataset, which have different spectral and spatial characteristics. Compared with several state-of-the-art methods, the results show the potential of the proposed method for landslide detection with multisource optical remote sensing data.
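The decoding idea described above (combine features from adjacent scales, recover the finer resolution, then apply a pixelwise softmax) can be illustrated with a minimal, dependency-free sketch. It is not the ShapeFormer implementation: the function names are hypothetical, nearest-neighbour upsampling stands in for the learned deconvolutional layers, element-wise addition stands in for the learned feature combination, and the toy score maps are invented for illustration.

```python
import math


def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling of a 2-D map.

    Assumption: a fixed stand-in for the learned deconvolutional layers.
    """
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse(fine, coarse):
    """Add a coarse map, upsampled to the fine resolution, to the fine map."""
    up = upsample_nearest(coarse, factor=len(fine) // len(coarse))
    return [[f + u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(fine, up)]


def pixelwise_softmax(logits):
    """Softmax over the class axis of logits[c][i][j], pixel by pixel."""
    n_cls, height, width = len(logits), len(logits[0]), len(logits[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(n_cls)]
    for i in range(height):
        for j in range(width):
            scores = [logits[c][i][j] for c in range(n_cls)]
            peak = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            for c in range(n_cls):
                probs[c][i][j] = exps[c] / total
    return probs


def predict_mask(probs):
    """Label each pixel with its most probable class (0 = background)."""
    height, width = len(probs[0]), len(probs[0][0])
    return [[max(range(len(probs)), key=lambda c: probs[c][i][j])
             for j in range(width)] for i in range(height)]


# Toy landslide scores at two adjacent scales (hypothetical values).
coarse = [[2.0, -2.0],
          [-2.0, -2.0]]
fine = [[0.5, 0.5, 0.0, 0.0],
        [0.5, -3.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]

fused = fuse(fine, coarse)                       # landslide-class logits
background = [[0.0] * 4 for _ in range(4)]       # background-class logits
probs = pixelwise_softmax([background, fused])
mask = predict_mask(probs)
```

In this toy case the coarse map marks the upper-left quadrant as landslide, while the fine map carves out one pixel inside it, so the fused mask keeps the coarse localisation but follows the finer boundary. This is the intuition behind combining adjacent scales, under the simplifying assumptions stated above.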