Bus passenger flow prediction is a critical component of advanced transportation information systems for public traffic management, control, and dispatch. With the development of artificial intelligence, many studies have applied machine learning models to extract comprehensive correlations from transit networks and improve passenger flow prediction accuracy, since traffic data are now readily available in large variety and volume. The passenger flow at a station is strongly affected by factors such as the flow at the previous time step and whether the period is peak or nonpeak, so extracting the key features from the data is essential for a passenger flow prediction model. Although neural networks, k-nearest neighbor, and some deep learning models have been adopted to mine the temporal correlations in passenger flow data, the lack of interpretability regarding the influencing variables remains a major problem. Classical tree-based models can mine the correlations between variables and rank the importance of each variable. In this study, we present a method to extract the passenger flow of different routes at a station and implement an XGBoost model to quantify the contributions of variables to passenger flow prediction. Compared with benchmark models, the proposed model achieves state-of-the-art prediction accuracy and computational efficiency on a real-world dataset. Moreover, the XGBoost model can interpret the predicted results. The results show that period is the most important variable for passenger flow prediction, and so the management of buses during peak hours should be improved.
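
The idea of ranking variables by their contribution to a boosted tree model can be illustrated with a minimal sketch. The code below is not the XGBoost model used in this study: it is a simplified stand-in that boosts depth-one regression trees (stumps) on squared error using only the Python standard library, and it scores each variable by the total reduction in squared error from its splits. The feature names (`period`, `prev_flow`, `weekday`), the peak hours, and the data are all hypothetical. The data are synthetic and constructed so that `period` dominates, so the resulting ranking illustrates the mechanism only and is not a result.

```python
import random

FEATURES = ["period", "prev_flow", "weekday"]  # hypothetical feature names


def make_synthetic_data(n=200, seed=0):
    """Build toy hourly records; the flow depends mostly on the peak flag by design."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        hour = i % 24
        period = 1.0 if hour in (7, 8, 17, 18) else 0.0  # assumed peak hours
        prev_flow = rng.uniform(0.0, 10.0)               # stand-in for previous-step flow
        weekday = float((i // 24) % 7)                   # has no effect on y here
        X.append([period, prev_flow, weekday])
        y.append(40.0 * period + prev_flow + rng.gauss(0.0, 1.0))
    return X, y


def best_stump(X, residuals):
    """Return (gain, feature, threshold, left_mean, right_mean) for the best split."""
    n = len(X)
    total = sum(residuals)
    base = total * total / n
    best = None
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(1, n):
            prev_i, cur_i = order[pos - 1], order[pos]
            left_sum += residuals[prev_i]
            if X[prev_i][j] == X[cur_i][j]:
                continue  # cannot split between equal feature values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors achieved by this split.
            gain = left_sum ** 2 / pos + right_sum ** 2 / (n - pos) - base
            if best is None or gain > best[0]:
                threshold = (X[prev_i][j] + X[cur_i][j]) / 2.0
                best = (gain, j, threshold, left_sum / pos, right_sum / (n - pos))
    return best


def gain_importance(X, y, n_rounds=50, learning_rate=0.3):
    """Boost stumps on the residuals and return normalized per-feature gain."""
    n = len(y)
    pred = [sum(y) / n] * n
    gains = [0.0] * len(X[0])
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        split = best_stump(X, residuals)
        if split is None:
            break
        gain, j, threshold, left_mean, right_mean = split
        gains[j] += gain
        for i in range(n):
            step = left_mean if X[i][j] <= threshold else right_mean
            pred[i] += learning_rate * step
    total_gain = sum(gains)
    return [g / total_gain if total_gain > 0 else 0.0 for g in gains]


X, y = make_synthetic_data()
importance = gain_importance(X, y)
ranked = sorted(zip(FEATURES, importance), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

XGBoost additionally uses second-order gradient information, regularization, and deeper trees, so its importance scores would generally differ from this sketch, although the gain-based ranking follows the same principle.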