In this paper, we propose an image reconstruction method based on convolutional neural network (CNN) layers and attention layers. The proposed method reconstructs a square block from reference pixels located in the areas above and to the left of the block. The first CNN layers of the proposed network extract spatial features in the vertical direction from the top reference pixels and in the horizontal direction from the left reference pixels. The two sets of spatial features are then merged, and the spatial features of the block are predicted by applying self-attention to the merged features. Attention layers then capture the correlation between the block and the reference pixels by operating on the spatial features of both. Finally, the last CNN layers reconstruct the pixels of the block by converting the result from the spatial feature domain to the pixel domain. In simulations on real video sequences, block reconstruction accuracy improves by 14% on average compared with intra prediction in Versatile Video Coding (VVC). The proposed method can accurately reconstruct blocks that contain more nonlinear patterns, such as curves.
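The two attention stages described above can be illustrated with a minimal sketch. It is not the network proposed here. It assumes plain scaled dot-product attention with no learned query, key, or value projections, it assumes the merge is a concatenation of the top and left features along the position axis, and it uses hypothetical toy feature values. The names `softmax`, `attention`, `top_features`, `left_features`, `reference_features`, `block_features`, and `refined_block` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of feature vectors.

    Assumption: no learned projections, unlike a trained network.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy features: 2 top and 2 left reference positions, 2 channels each.
top_features = [[1.0, 0.0], [0.5, 0.5]]
left_features = [[0.0, 1.0], [0.2, 0.8]]

# Merge step (assumed here to be concatenation along the position axis).
reference_features = top_features + left_features

# Stage 1: self-attention over the merged features, used here as a stand-in
# for predicting the spatial features of the block.
block_features = attention(reference_features, reference_features, reference_features)

# Stage 2: attention between block and reference features. Block features act
# as queries; reference features act as keys and values.
refined_block = attention(block_features, reference_features, reference_features)
```

In this sketch every output vector is a convex combination of the reference feature vectors, which is why a trained network would add learned projections and the final CNN layers to map the features back to pixels.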