The rich information in multi-temporal satellite images can facilitate pixel-level land-cover classification. However, it remains unclear which deep learning architecture is most suitable for the high-dimensional spatiotemporal representation of remote sensing time series. In this study, we theoretically analyzed how different deep learning structures learn and represent the temporal information in spatiotemporal data, including the commonly used convolutional neural network (CNN), the higher-dimensional CNN (3D CNN), the recurrent neural network (RNN), and the more recent vision transformer (ViT). The performance of the different models was comprehensively evaluated on large-scale Sentinel-1 and Sentinel-2 time-series images covering the whole of Slovenia. Several observations can be made. Firstly, the 3D CNN, long short-term memory (LSTM), and ViT, which all have specific structures that preserve temporal information, can effectively extract the spatiotemporal information, with the 3D CNN and ViT showing the best performance. Secondly, the performance of the 2D CNN, in which the temporal information is collapsed, is lower than that of the 3D CNN, LSTM, and ViT; however, it significantly outperforms the conventional methods of random forest (RF) and XGBoost. We also observed that using both optical and synthetic aperture radar (SAR) images yields almost the same performance as using only optical images, indicating that the information that can be extracted from optical images is sufficient for land-cover classification. However, when optical imaging is affected by poor weather, SAR images, as a beneficial supplement, can provide satisfactory classification results. Finally, the modern deep learning methods can effectively cope with adverse imaging conditions in which parts of an image, or the images for some periods, are missing. The testing data are available at gpcv.whu.edu.cn/data.
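The contrast drawn above between architectures that collapse the temporal dimension and those that preserve it can be illustrated by how a single image time series might be laid out for each model family. The following is only a minimal sketch under stated assumptions: the sizes `T`, `C`, `H`, and `W`, the variable names, and the specific axis orders are hypothetical and are not taken from the study, and the patch tokenization a ViT would apply is not shown.

```python
import numpy as np

# Hypothetical sizes (assumptions for illustration only):
# T acquisition dates, C spectral bands per date, H x W pixel patch.
T, C, H, W = 6, 4, 8, 8
series = np.arange(T * C * H * W, dtype=np.float32).reshape(T, C, H, W)

# 2D CNN layout: dates are folded into the channel axis, so the first
# convolution mixes all dates at once and the time axis disappears.
x_2d = series.reshape(T * C, H, W)            # (T*C, H, W)

# 3D CNN layout: time is kept as its own axis, so a convolution kernel
# can slide along time as well as along the two spatial axes.
x_3d = series.transpose(1, 0, 2, 3)           # (C, T, H, W)

# Recurrent layout (e.g. for an LSTM): every pixel becomes one
# sequence of T steps with C features per step.
x_seq = series.transpose(2, 3, 0, 1).reshape(H * W, T, C)  # (H*W, T, C)

print(x_2d.shape, x_3d.shape, x_seq.shape)
```

In the first layout, date `t` and band `c` end up at channel index `t * C + c`, which is one way to see why a 2D model has no explicit notion of temporal order beyond what its first layer learns from the channel arrangement.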