Defects that arise during wood growth affect the quality and grade of the final product. Current research on wood texture defects focuses mainly on defect localization and ignores the problem of splicing boards while maintaining texture consistency. In this paper, we design the MRS-Transformer network and introduce image inpainting to the field of solid wood board splicing. First, we propose an asymmetric encoder-decoder based on the Vision Transformer, in which the encoder uses a fixed mask (M) strategy: it discards the masked patches and takes only the unmasked visual patches as input, which reduces the model's computation. Second, we design a reverse Swin (RS) module with multi-scale characteristics as the decoder, which adjusts the size of the divided image patches and completes the restoration from coarse to fine. Finally, we propose a weighted L2 loss (L2 being the mean square error, MSE), which assigns different weights to pixels in the unmasked region according to their distance from the defective region, allowing the model to make full use of the valid pixels when repairing the masked region. To demonstrate the effectiveness of the designed modules, we conducted ablation experiments, using MSE, LPIPS (learned perceptual image patch similarity), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity) to measure the quality of the generated wood texture images, and FLOPs (floating point operations) to measure the computational complexity of the model. The results show that the MSE, LPIPS, PSNR, and SSIM of the wood images restored by the MRS-Transformer reached 0.0003, 0.154, 40.12, and 0.9173, respectively, at a computational cost of 20.18 GFLOPs. Compared with images generated by the Vision Transformer, the MSE and LPIPS were reduced by 51.7% and 30%, the PSNR and SSIM were improved by 12.2% and 7.5%, and the GFLOPs were reduced by 38%.
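The way the encoder input shrinks when masked patches are discarded can be illustrated with a minimal NumPy sketch. It is not the MRS-Transformer implementation: the function names, the 16-pixel patch size, and the position and size of the fixed mask block are all assumptions chosen for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def fixed_block_mask(grid_h, grid_w, top, left, size):
    """Boolean mask over the patch grid, True inside one fixed square block.

    The block location is a hypothetical stand-in for the defect region.
    """
    m = np.zeros((grid_h, grid_w), dtype=bool)
    m[top:top + size, left:left + size] = True
    return m.reshape(-1)

def visible_tokens(image, patch, mask):
    """Return only the unmasked patches plus their original grid indices."""
    patches = patchify(image, patch)
    keep = np.flatnonzero(~mask)               # indices of unmasked patches
    return patches[keep], keep

# Assumed example: a 224x224 RGB image gives a 14x14 grid of 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
mask = fixed_block_mask(14, 14, top=4, left=4, size=6)
tokens, keep = visible_tokens(image, 16, mask)
print(tokens.shape)  # (160, 768): 36 of the 196 patches never reach the encoder
```

The kept indices are returned alongside the tokens because a decoder needs them to place the encoded patches back on the grid before filling in the masked positions.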
To verify the superiority of the MRS-Transformer, we compared it with two image inpainting algorithms, Deepfill v2 and TG-Net. Relative to Deepfill v2 and TG-Net, respectively, its MSE was 47.0% and 66.9% lower, its LPIPS was 60.6% and 42.5% lower, its FLOPs were 70.6% and 53.5% lower, its PSNR was 16.1% and 26.2% higher, and its SSIM was 7.3% and 5.8% higher. The MRS-Transformer repairs a single image in 0.05 s, nearly five times faster than Deepfill v2 and TG-Net. The experimental results demonstrate three points: the reverse Swin (RS) module effectively alleviates the sense of fragmentation caused by dividing images into patches; the proposed weighted L2 loss improves the semantic consistency at the edges of the missing regions and makes the generated wood texture more detailed and coherent; and the asymmetric encoder-decoder effectively reduces the computational cost of the model and speeds up training.
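The idea of a distance-dependent weighted L2 loss can be sketched as follows. The abstract does not give the weighting formula, so everything specific here is an assumption: the exponential decay `decay ** distance`, the city-block distance, the weight of 1 for masked pixels, and the normalization by the sum of weights are illustrative choices, not the authors' definition.

```python
import numpy as np

def distance_to_mask(mask):
    """City-block distance from each pixel to the nearest masked pixel.

    mask: (H, W) bool array, True where the defect (masked region) is.
    """
    if not mask.any():
        raise ValueError("mask must contain at least one masked pixel")
    dist = np.where(mask, 0, -1).astype(np.int64)   # -1 marks "not reached"
    frontier = mask.copy()
    d = 0
    while (dist < 0).any():
        d += 1
        grown = frontier.copy()                     # grow by one 4-neighbour step
        grown[1:, :] |= frontier[:-1, :]
        grown[:-1, :] |= frontier[1:, :]
        grown[:, 1:] |= frontier[:, :-1]
        grown[:, :-1] |= frontier[:, 1:]
        dist[grown & (dist < 0)] = d
        frontier = grown
    return dist

def weighted_l2_loss(pred, target, mask, decay=0.9):
    """Weighted mean squared error; weights shrink with distance from the mask.

    Assumed weighting: w = decay ** distance, so masked pixels (distance 0)
    get weight 1 and far-away unmasked pixels contribute less.
    """
    weights = decay ** distance_to_mask(mask).astype(np.float64)
    if pred.ndim == 3:                              # (H, W, C): share weights over channels
        weights = weights[..., None]
    sq_err = (pred - target) ** 2
    total_weight = np.broadcast_to(weights, sq_err.shape).sum()
    return float((weights * sq_err).sum() / total_weight)
```

Under this assumed weighting, an error next to the defect is penalized more heavily than the same error far from it, which matches the stated aim of keeping the edges of the repaired region consistent with their surroundings. For large images, `scipy.ndimage.distance_transform_edt(~mask)` would be a faster way to obtain a (Euclidean) distance map.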