Soil moisture (SM) is a critical variable affecting ecosystem carbon and water cycles and their feedback to climate change. In this study, we propose SMNet, a convolutional neural network (CNN) model with an embedded residual block and attention module, to spatially downscale the European Space Agency (ESA) Climate Change Initiative (CCI) SM product. SMNet integrates a lightweight Convolutional Block Attention Module (CBAM), a dual-attention mechanism, to extract spatial and channel information from the high-resolution input remote sensing products, the reanalysis meteorological dataset, and the topographic data. The model was used to downscale the ESA CCI SM product from its original spatial resolution of 25 km to 1 km over California, USA, for the annual growing season (1 May to 30 September) from 2003 to 2021. The original ESA CCI SM data and in situ SM measurements (0–5 cm depth) from the International Soil Moisture Network were used to validate the model's performance. The results show that the downscaled SM data are consistent with the original ESA CCI SM data, with a mean correlation coefficient (R) of 0.82 and a mean root mean square error (RMSE) of 0.052 m³/m³. Moreover, the model generates reasonable spatiotemporal SM patterns, with higher accuracy in the western region and relatively lower accuracy in the eastern Nevada mountainous area. Validation against in situ sites in the SCAN, SNOTEL, and USCRN networks yields R values of 0.62, 0.63, and 0.77 and RMSE values of 0.077, 0.093, and 0.078 m³/m³, respectively. These accuracies are slightly lower than those obtained when the original ESA CCI SM data are validated against the same sites. Overall, the validation results suggest that the proposed SMNet model performs satisfactorily for soil moisture downscaling. 
The downscaled SM data not only preserve a high level of spatial consistency with the original ESA CCI SM product but also capture finer spatial variations in SM, with the level of detail depending on the spatial resolution of the model input data.
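
The R and RMSE statistics used for validation can be illustrated with a minimal sketch. It assumes paired reference and estimated SM series in m³/m³, with missing observations stored as NaN. The function name `r_and_rmse` and the sample values are hypothetical and are not taken from the study.

```python
import numpy as np

def r_and_rmse(reference, estimate):
    """Return (Pearson R, RMSE) over pairs where both values are finite."""
    reference = np.asarray(reference, dtype=float).ravel()
    estimate = np.asarray(estimate, dtype=float).ravel()

    # Keep only pairs where both the reference and the estimate are valid.
    valid = np.isfinite(reference) & np.isfinite(estimate)
    ref, est = reference[valid], estimate[valid]
    if ref.size < 2:
        raise ValueError("need at least two valid pairs")

    r = np.corrcoef(ref, est)[0, 1]            # Pearson correlation coefficient
    rmse = np.sqrt(np.mean((est - ref) ** 2))  # root mean square error
    return float(r), float(rmse)

# Made-up illustration values (m³/m³); the NaN pair is skipped.
in_situ = np.array([0.10, 0.15, np.nan, 0.22, 0.30])
downscaled = np.array([0.12, 0.14, 0.20, 0.25, 0.27])
r, rmse = r_and_rmse(in_situ, downscaled)
print(f"R = {r:.2f}, RMSE = {rmse:.3f} m3/m3")
```

The same function could be applied per site or per grid cell and the results averaged, which is one plausible way to obtain mean values such as those reported; the exact aggregation used in the study is not stated here.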