Accurate prediction of metro passenger flow helps operators optimize scheduling plans, relieve passenger flow pressure, and improve service quality. However, existing prediction models tend to consider only the historical passenger flow of a single station and ignore both the spatial relationships between stations and the correlations between their passenger flows, which results in low prediction accuracy. To address this, a multi-scale residual depthwise separable convolution network (MRDSCNN) is proposed for metro passenger flow prediction. It consists of three components: residual depthwise separable convolution (RDSC), multi-scale depthwise separable convolution (MDSC), and an attention bidirectional gated recurrent unit (AttBiGRU). The RDSC module captures local spatial and temporal correlations by leveraging the diverse temporal patterns of passenger flows. The MDSC module then extracts the inter-station correlations between the target station and the other heterogeneous stations throughout the metro network. These correlations are fed into the AttBiGRU, which extracts global interaction features and produces the passenger flow predictions. Finally, the model is evaluated on Hangzhou metro passenger inflow and outflow data, and the results show that it outperforms the models it is compared against.
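For readers unfamiliar with the building block named in RDSC and MDSC, the following is a minimal plain-Python sketch of a one-dimensional depthwise separable convolution wrapped in a residual (skip) connection. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy two-channel input, and all kernel and weight values are hypothetical, and the multi-scale branches, the attention mechanism, and the BiGRU are not modelled.

```python
# Illustrative sketch only; names and values are assumptions, not taken from the paper.
# Input layout: x[channel][time_step], e.g. two hypothetical channels over 3 time steps.


def depthwise_conv1d(x, kernels):
    """Filter each channel with its own kernel (zero 'same' padding, odd kernel length).

    As in common deep learning frameworks, the kernel is applied without flipping.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def pointwise_conv1d(x, weights):
    """Mix channels at each time step with a 1x1 convolution (weights: out x in)."""
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in weights
    ]


def residual_dsc_block(x, kernels, weights):
    """Depthwise then pointwise convolution, plus an identity skip connection.

    The skip connection requires as many output channels as input channels.
    """
    y = pointwise_conv1d(depthwise_conv1d(x, kernels), weights)
    return [[a + b for a, b in zip(x_row, y_row)] for x_row, y_row in zip(x, y)]


def param_counts(c_in, c_out, k):
    """Weights (biases ignored) of a standard vs. a depthwise separable 1D convolution."""
    return c_in * c_out * k, c_in * k + c_in * c_out


# Toy data (hypothetical): 2 channels, 3 time steps.
x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # one length-3 kernel per channel
weights = [[1.0, 1.0], [0.0, 2.0]]            # 2 output channels x 2 input channels

result = residual_dsc_block(x, kernels, weights)
print(result)                    # [[3.0, 5.0, 7.0], [2.0, 3.0, 2.0]]
print(param_counts(64, 128, 3))  # (24576, 8384)
```

The last line shows the usual motivation for the factorization: for 64 input channels, 128 output channels, and a kernel of length 3, the separable form needs 8,384 weights instead of 24,576, because spatial filtering (depthwise) and channel mixing (pointwise) are learned separately.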