This study proposes a deep learning network that extracts spatial-temporal features to estimate significant wave height (Hs) and wave period (Ts) from X-band marine radar images. Because the shore-based radar images used here are contaminated by radial noise lines from other radars and by solid targets, the images are first pre-processed to remove this interference so that the convolutional neural network (CNN) can extract image features accurately. A pre-trained GoogLeNet is then used to extract multi-scale deep spatial features from the radar images for Hs and Ts estimation. Because CNN-based models cannot analyze the temporal behavior of wave features in radar image sequences, self-attention is connected after the deep convolutional layer of the CNN, giving a convolutional self-attention (CNNSA)-based model that generates spatial-temporal features for Hs and Ts estimation. Hs and Ts measured by nearby buoys are used for model training and as reference values. The experimental results show that, in Hs estimation, the proposed CNNSA model reduces the root-mean-square deviation (RMSD) by 0.24 m and 0.11 m relative to the traditional signal-to-noise ratio (SNR)-based and CNN-based methods, respectively. In Ts estimation, the RMSD is reduced by 0.3 s and 0.08 s, respectively.
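
As an illustration of the two ideas named above, the following minimal NumPy sketch applies scaled dot-product self-attention across a sequence of per-image feature vectors and computes the RMSD between estimates and reference values. It is not the model described in this study: the sizes `T`, `D`, and `D_ATT`, the random stand-in features, the projection matrices, and the linear output head `w_out` are all hypothetical and chosen only to show how temporal mixing over a radar image sequence and the error metric could be expressed.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(features, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a feature sequence.

    features: (T, D) array, one feature vector per radar image.
    Returns the attended features (T, D_att) and the (T, T) attention weights.
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


def rmsd(estimates, reference):
    """Root-mean-square deviation between estimates and reference values."""
    estimates = np.asarray(estimates, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((estimates - reference) ** 2)))


# Hypothetical sizes: 8 images per sequence, 16-dim features, 4-dim attention.
T, D, D_ATT = 8, 16, 4
rng = np.random.default_rng(0)

# Random stand-in for the per-image spatial features a CNN would produce.
frame_features = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D_ATT)) for _ in range(3))

context, weights = self_attention(frame_features, w_q, w_k, w_v)

# Hypothetical head: average over time, then map linearly to (Hs, Ts).
w_out = rng.normal(size=(D_ATT, 2))
hs_ts = context.mean(axis=0) @ w_out
```

In a trained model the projection matrices and output head would be learned against the buoy measurements, and `rmsd` would be evaluated between the model estimates and the buoy reference series.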