Change detection using high-resolution remote sensing images provides crucial information for geospatial monitoring, which is of growing importance as urbanization continues. However, current deep learning models for change detection are mostly based on convolutional neural networks (CNNs), which struggle to capture global context owing to the locality of convolution operations. In this paper, we propose a deep learning model, Siam-Swin-UNet (SSUNet), for remote sensing change detection. SSUNet follows the classic UNet-like encoder-decoder framework but introduces three major innovations: (1) The encoder and decoder are purely transformer-based and hierarchically structured, which avoids the locality problem of CNNs while retaining the capability of hierarchical representation. (2) The encoder adopts a Siamese structure that processes bi-temporal remote sensing images in parallel, together with a fusion module that properly merges the feature maps extracted from the two branches. (3) The backbone of SSUNet consists of Swin Transformer V2 blocks, which offer greater training stability in further applications of the model, such as transfer learning or scaling up the model capacity. We evaluated the proposed SSUNet on the LEVIR-CD dataset against CNN-based models including UNet, UNet++, FC-Siam-Conc, and FC-Siam-Diff. The results show that our model outperforms the CNN-based models by a large margin on evaluation metrics including precision, recall, F1-score, and overall accuracy (OA). Moreover, we conducted ablation studies to further verify the effectiveness of the Siamese structure and the choice of backbone. The proposed SSUNet has great potential for remote sensing change detection tasks.
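To make the Siamese-encoder-plus-fusion idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: the class name SiameseFusionEncoder, the 1x1-convolution fusion, and the toy patch-embedding backbone are all illustrative assumptions standing in for the actual Swin Transformer V2 stages and fusion module described above.

```python
import torch
import torch.nn as nn

class SiameseFusionEncoder(nn.Module):
    """Hypothetical sketch of a weight-sharing (Siamese) encoder that
    processes bi-temporal images in parallel and fuses the resulting
    feature maps. SSUNet uses Swin Transformer V2 blocks as the backbone;
    here `backbone` is any module mapping (B, C, H, W) -> (B, D, h, w).
    """
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone  # shared weights for both time steps
        # Assumed fusion module: channel concatenation doubles the
        # channels; a 1x1 convolution projects back to feat_dim.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, img_t1: torch.Tensor, img_t2: torch.Tensor) -> torch.Tensor:
        f1 = self.backbone(img_t1)  # features at time t1
        f2 = self.backbone(img_t2)  # features at time t2 (same weights)
        return self.fuse(torch.cat([f1, f2], dim=1))

# Usage with a toy patch-embedding backbone standing in for the Swin V2 stages:
toy_backbone = nn.Sequential(nn.Conv2d(3, 96, kernel_size=4, stride=4), nn.GELU())
encoder = SiameseFusionEncoder(toy_backbone, feat_dim=96)
t1 = torch.randn(1, 3, 256, 256)
t2 = torch.randn(1, 3, 256, 256)
print(encoder(t1, t2).shape)  # torch.Size([1, 96, 64, 64])
```

The weight sharing is the essential point of the Siamese design: both temporal images are embedded into the same feature space, so differences between the fused features reflect scene change rather than encoder variation.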