Computer-generated holography (CGH) based on neural networks has been actively investigated in recent years, and convolutional neural networks (CNNs) are frequently adopted. A convolutional kernel, however, captures only local dependencies between neighboring pixels, whereas in CGH each hologram pixel influences every image pixel on the observation plane, so the network must be able to learn long-range dependencies. To address this problem, we propose a CGH model called Holomer. Its single-layer receptive field is 43 times larger than that of a widely used 3×3 convolutional kernel, thanks to embedding-based feature dimensionality reduction and multi-head sliding-window self-attention. In addition, we propose a metric that measures a network’s ability to learn the inverse diffraction process. In simulation, our method achieves a PSNR of 35.59 dB and an SSIM of 0.93 on the DIV2K dataset at a resolution of 1920×1024. Optical experiments show that the reconstructed images exhibit excellent detail and no observable background speckle noise. This work paves the way for high-quality hologram generation.
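The locality argument can be illustrated with a toy model. The sketch below is not the propagation model or the network of this work: it assumes a phase-only hologram and far-field (Fraunhofer) propagation, approximated by a naive 2D discrete Fourier transform on an 8×8 grid. The names `dft2`, `propagate`, `count_changed`, and `conv3x3_footprint` are hypothetical helpers introduced only for this illustration. It perturbs the phase of one hologram pixel and counts how many observation-plane samples change, then compares that with the number of outputs a single 3×3 convolution can reach from one input pixel.

```python
import cmath
import math
import random


def dft2(field):
    """Naive 2D discrete Fourier transform of a square complex field."""
    n = len(field)
    # Twiddle factors exp(-2*pi*i*k/n), reused for rows and columns.
    w = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    # Transform along each row first.
    rows = [
        [sum(field[y][x] * w[(u * x) % n] for x in range(n)) for u in range(n)]
        for y in range(n)
    ]
    # Then transform along each column.
    return [
        [sum(rows[y][u] * w[(v * y) % n] for y in range(n)) for u in range(n)]
        for v in range(n)
    ]


def propagate(phase):
    """Assumed far-field model: complex amplitude = DFT of exp(i * phase)."""
    field = [[cmath.exp(1j * p) for p in row] for row in phase]
    return dft2(field)


def count_changed(a, b, tol=1e-9):
    """Count samples whose complex amplitude differs by more than tol."""
    return sum(abs(x - y) > tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def conv3x3_footprint(n, y0, x0):
    """Outputs of one 3x3 'same' convolution that depend on input (y0, x0)."""
    return sum(
        1
        for y in range(n)
        for x in range(n)
        if abs(y - y0) <= 1 and abs(x - x0) <= 1
    )


random.seed(0)
N = 8
phase = [[random.uniform(0.0, 2.0 * math.pi) for _ in range(N)] for _ in range(N)]

# Shift the phase of a single hologram pixel by pi/2.
perturbed = [row[:] for row in phase]
perturbed[3][4] += math.pi / 2

changed = count_changed(propagate(phase), propagate(perturbed))
print("observation-plane samples affected by one hologram pixel:", changed)
print("outputs reachable by one 3x3 convolution:", conv3x3_footprint(N, 3, 4))
```

Under these assumptions, the single-pixel perturbation changes the complex amplitude at all 64 observation-plane samples, whereas one 3×3 convolution spreads a single input pixel to at most 9 outputs. This gap is what motivates a larger single-layer receptive field.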