Image editing technology has brought about revolutionary changes in the field of architectural design, garnering significant attention in both the computer and architectural industries. However, architectural image editing is a challenging task due to the complex hierarchical structure of architectural images, which complicates the learning process for the high-dimensional features of architectural images. Some methods invert the images into the latent space of a pre-trained generative adversarial network (GAN) model, completing the editing process by manipulating this latent space. However, the task of striking a balance between reconstruction fidelity and editing efficacy through latent space mapping presents a formidable challenge. To address this issue, we propose a Residual Spatial Cross-Attention Network (RSCAN) for architectural image editing, which is an encoder model integrating multiple latent spaces. Specifically, we introduce the spatial feature extractor, which maps the image to the high-dimensional space F of the synthesis network, to enhance the spatial information retention and preserve the structural consistency of the architectural image. In addition, we propose the residual cross-attention to learn the mapping relationship between the low-dimensional space W and F space, generating modified features corresponding to the latent code and leveraging the benefits of multiple latent spaces to facilitate editing. Extensive experiments are performed on the LSUN Church dataset, and the experimental results indicate that our proposed RSCAN achieves significant improvements over the relevant methods in quantitative analysis metrics including the reconstruction quality, SSIM, FID, L2, LPIPS, PSNR, and editing effect ΔS, with enhancements of 29.49%, 17.29%, 8.81%, 11.43%, 11.26%, and 47.8%, respectively, thereby enhancing the practicality of architectural image editing.