Traffic congestion is a significant problem that adversely affects the economy, environment, and public health in urban areas worldwide. One promising solution is to forecast road-level congestion in both the short term and the long term, enabling commuters to avoid congested areas and allowing traffic agencies to take appropriate action. In this study, we propose a hybrid deep neural network based on the High-Resolution Network (HRNet) and a ConvLSTM decoder for 10-, 30-, and 60-min traffic congestion prediction. Our model uses HRNet’s multi-scale feature extraction capability to capture rich spatial features from a sequence of past traffic images. The ConvLSTM module learns temporal information from each HRNet multi-scale output and aggregates all feature maps to generate the traffic forecast. Our experiments demonstrate that the proposed model learns both spatial and temporal relationships in traffic congestion and outperforms four other state-of-the-art architectures (PredNet, UNet, ConvLSTM, and Autoencoder) in terms of accuracy, precision, and recall. A case study was conducted on a dataset from Seoul, South Korea.
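To make the evaluation criteria concrete, the following minimal sketch shows how accuracy, precision, and recall could be computed for a predicted congestion map against the observed one. It assumes binary per-road labels (1 = congested, 0 = free-flowing) flattened into a list; the labeling scheme, the function name `congestion_metrics`, and the sample values are illustrative assumptions and are not taken from the study.

```python
from typing import Dict, Sequence


def congestion_metrics(predicted: Sequence[int], actual: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall for binary congestion labels.

    Assumption (not from the study): 1 marks a congested road cell,
    0 marks a free-flowing one.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one label is required")

    # Count the four outcomes of the confusion matrix.
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)

    accuracy = (tp + tn) / len(actual)
    # Fall back to 0.0 when a denominator is empty (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels for eight road cells at one forecast horizon.
predicted = [1, 1, 0, 0, 1, 0, 0, 1]
actual = [1, 0, 0, 0, 1, 1, 0, 1]
metrics = congestion_metrics(predicted, actual)
print(metrics)
```

Precision penalizes false congestion alarms, while recall penalizes missed congestion, so reporting both alongside accuracy matters when congested cells are a minority of the map.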