Using real customer data from a large community bank in the southern United States, this paper analyzes the customer churn prediction problem by constructing and comparing ten machine learning classification models under five sampling techniques. Our results show that the Random Forest, XGBoost, AdaBoost, and Bagging Meta classifiers outperform the others in terms of overall accuracy, F-score, and AUC on the test observations. For these four classifiers, overall accuracy ranges from 87% to 96% across the five sampling methods explored, while AUC values range from 0.90 to 0.93. In terms of overall accuracy and F-score, AdaBoost with the original data and with the MTDF sampling technique outperforms the others. In terms of AUC, XGBoost and Random Forest perform similarly to AdaBoost, and all three slightly outperform Bagging Meta across all sampling techniques. Overall, however, the performance measures for these four classifiers are comparable across all sampling techniques. The paper further presents the features most important to customer churn behavior, as identified by the model. The diagnostic analysis also provides an insightful comparison between churned and non-churned customers.
JEL classification numbers: C0, C5, C8, G21.
Keywords: Machine learning, Big data, Sampling techniques, Customer churn, Customer retention, Financial services, Community bank.