Image inpainting is the task of filling in missing or damaged regions of an image. Recent models combine Transformers and CNNs to improve the diversity and accuracy of inpainting results. However, Transformers incur a high computational cost on high-resolution images because they must model global dependencies. This paper proposes a new inpainting method based on a ViT U-Net structure. The encoder uses a ViT with a single memory-bound multi-head self-attention (MHSA) layer and cascaded group attention, together with an improved adaptive gated convolution, to extract features, while the decoder employs PixelShuffle upsampling to generate the restored image. Additionally, the network recovers local spatial features at multiple levels to refine details. We evaluate our model on the CelebA-HQ, Places2, and Paris StreetView datasets. The experimental results demonstrate that the model trains faster and consumes less memory with fewer parameters, while matching the current state of the art both quantitatively and qualitatively.
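For illustration, the following is a minimal PyTorch sketch of two building blocks named above: a gated convolution for the encoder and a PixelShuffle upsampling step for the decoder. The module names, channel sizes, and activation choices here are assumptions made for demonstration only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: a feature branch modulated by a learned soft mask
    (sigmoid gate), commonly used for free-form image inpainting.
    A generic sketch, not the paper's "improved adaptive" variant."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x):
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

class PixelShuffleUp(nn.Module):
    """Decoder upsampling block: a convolution expands channels by scale^2,
    then nn.PixelShuffle rearranges them into a spatially larger feature map."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Toy usage: downsample a masked image with a gated convolution,
# then upsample the features back to the input resolution.
x = torch.randn(1, 4, 64, 64)                      # RGB image + binary mask channel
enc = GatedConv2d(4, 32, stride=2, padding=1)(x)   # -> (1, 32, 32, 32)
dec = PixelShuffleUp(32, 3)(enc)                   # -> (1, 3, 64, 64)
print(dec.shape)
```

In this sketch, PixelShuffle trades extra channels for spatial resolution, which avoids the checkerboard artifacts that transposed convolutions can introduce and keeps the upsampling path lightweight.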