Large-scale crop yield estimation is critical for understanding the dynamics of global food security. A persistent challenge at this scale is to quantify both the temporal cumulative effect of crop growth and the spatial variability across regions. In this study, a deep spatial-temporal learning framework, named DeepCropNet (DCN), is developed to hierarchically capture these features for county-level corn yield estimation. Temporal features are learned by an attention-based long short-term memory (LSTM) network, and spatial features are learned by multi-task learning (MTL) output layers. The DCN model is applied to quantify the relationship between meteorological factors and county-level corn yield in the US Corn Belt from 1981 to 2016. Three meteorological factors are used as time-series inputs: growing degree days, killing degree days, and precipitation. The results show that DCN achieves higher estimation accuracy (RMSE = 0.82 Mg ha⁻¹) than conventional methods such as LASSO (RMSE = 1.14 Mg ha⁻¹) and Random Forest (RMSE = 1.05 Mg ha⁻¹). Temporally, the attention values computed by the temporal learning module indicate that DCN captures the temporal cumulative effect and that this temporal pattern is consistent across all states. Spatially, the spatial learning module improves estimation accuracy by using the region-specific features captured by the MTL mechanism. The study highlights that the DCN model provides a promising spatial-temporal learning framework for corn yield estimation under changing meteorological conditions across large spatial regions.
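
The two learning stages described above can be illustrated with a minimal sketch. It is not the DCN implementation. It assumes that per-time-step hidden vectors (standing in for LSTM outputs) are already available, that attention scores come from a dot product with a hypothetical `query` vector, and that each state has its own linear output head. The function names, the scoring form, the state codes, and all numeric values are illustrative assumptions and do not come from the study.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each time step (assumed dot-product form) and return the
    attention-weighted sum of hidden vectors plus the weights."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights

def predict_yield(context, state, heads):
    """Apply the linear output head of the county's state (one task per state)."""
    w, b = heads[state]
    return sum(w_i * c_i for w_i, c_i in zip(w, context)) + b

# Hypothetical hidden vectors for three time steps of one county-year.
hidden_states = [[0.1, 0.2], [0.4, 0.1], [0.3, 0.5]]
query = [1.0, 0.5]  # hypothetical attention parameter
# Hypothetical per-state heads: (weights, bias).
heads = {"IA": ([2.0, 1.0], 9.0), "NE": ([1.5, 0.5], 8.0)}

context, weights = attention_pool(hidden_states, query)
yield_ia = predict_yield(context, "IA", heads)
yield_ne = predict_yield(context, "NE", heads)
```

In this sketch the `weights` list plays the role of the attention values that the study inspects for temporal patterns. The shared `context` vector, passed through state-specific heads, mirrors the idea of shared temporal features combined with region-specific outputs.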