Churn prediction is gaining popularity in the research community as a powerful paradigm that supports data-driven operational decisions. Datasets related to churn prediction are often skewed, with an imbalanced class distribution. Data-level solutions, like over-sampling and under-sampling, have been commonly used by researchers to address this problem. Only a limited number of case studies attempt to evolve these data-level solutions by integrating them with computationally advanced frameworks, like ensembles. Ensembles primarily employ algorithmic diversity using a fixed set of training instances to achieve superior performance. This study aims to introduce algorithmic diversity in ensembles by modifying the fixed set of training instances using diverse sampling strategies to increase predictive performance in imbalanced learning. Data is acquired from the world's largest open hotel commerce platform company. A four-part series of experiments is conducted to analyze the effectiveness of sampling techniques and ensemble solutions on model performance. A new sampling-based stack framework called "Stacking of Samplers for Imbalanced Learning" is proposed. The framework combines the prediction capabilities of sampling solutions to stimulate the information gain of the meta features in the ensemble. It is observed that the proposed framework improves model performance, with an AUC of 86.4% and a top-decile lift of 4.7 for customers of the hotel technology provider. Additionally, results show that the framework records a higher information gain for the meta features used in a stack, compared to commonly used stack frameworks.
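The idea of building a stack from differently sampled copies of one training set can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the k-nearest-neighbour base scorer, the three samplers and the toy data are all assumptions chosen to keep the example self-contained. Each sampler rebalances the training folds in its own way, the same base scorer is fitted on each rebalanced copy, and the out-of-fold scores form one meta feature per sampler.

```python
import random

# Rows are (features, label) pairs; label 1 = churner (assumed minority class).

def identity(rows, rng):
    """No resampling: keep the original class distribution."""
    return list(rows)

def oversample(rows, rng):
    """Random over-sampling: duplicate smaller-class rows until balanced."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return list(rows) + extra

def undersample(rows, rng):
    """Random under-sampling: drop larger-class rows until balanced."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    return small + rng.sample(large, len(small))

def knn_score(train, x, k=5):
    """Hypothetical base scorer: share of churners among the k nearest rows."""
    def dist(r):
        return sum((a - b) ** 2 for a, b in zip(r[0], x))
    nearest = sorted(train, key=dist)[:k]
    return sum(label for _, label in nearest) / len(nearest)

def stacked_meta_features(rows, samplers, n_folds=3, seed=0):
    """Return one out-of-fold score per sampler for every row."""
    rng = random.Random(seed)
    meta = [[0.0] * len(samplers) for _ in rows]
    for fold in range(n_folds):
        held = [i for i in range(len(rows)) if i % n_folds == fold]
        train = [rows[i] for i in range(len(rows)) if i % n_folds != fold]
        for j, sampler in enumerate(samplers):
            # Resample the training folds only; held-out rows stay untouched.
            resampled = sampler(train, rng)
            for i in held:
                meta[i][j] = knn_score(resampled, rows[i][0])
    return meta

# Toy data (assumption): one in five customers churns.
data_rng = random.Random(1)
rows = []
for i in range(60):
    label = 1 if i % 5 == 0 else 0
    rows.append(((data_rng.gauss(2 * label, 1), data_rng.gauss(2 * label, 1)), label))

meta = stacked_meta_features(rows, [identity, oversample, undersample])
print(meta[0])  # three scores for the first customer, one per sampler
```

A meta-learner of any kind could then be trained on the columns of `meta`. The fold assignment by index is a simplification; a stratified split would be the safer choice on real, heavily skewed data.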
The effectiveness of any Machine Learning process depends on the accuracy of the annotated data that is used to train a learner. However, manual annotation is expensive. Hence, researchers adopt a semi-supervised approach called active learning that aims to achieve state-of-the-art performance using a minimal number of samples. Although it boosts classifier performance, the underlying query strategies are unable to eliminate redundancy in the selected samples. Redundant samples lead to increased cost and sub-optimal performance of the learner. Inspired by this challenge, the study proposes a new representation-based query strategy that selects highly informative and representative subsets of samples for manual annotation. The data comprises messages sent by a set of customers to a service provider. A series of experiments is conducted to analyze the effectiveness of the proposed query strategy, called "Entropy-based Min Max Similarity" (E-MMSIM), in the context of topic classification for churn prediction. The foundation of E-MMSIM is an algorithm that is popularly used to sequence proteins in protein databases. The algorithm is modified and utilized to select the most representative and informative samples. The performance is evaluated using F1-score, AUC and accuracy. It is observed that E-MMSIM outperforms popular query strategies and improves the performance of topic classifiers for each of the 4 topics of churn prediction. The trained topic classifiers are used to derive qualitative features. These features are further integrated with structured variables for the same group of customers to predict churn. Experiments provide evidence that the inclusion of qualitative features derived using E-MMSIM enhances the performance of churn classifiers by 5%.
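The tension between informativeness and redundancy can be made concrete with a small sketch. It is not E-MMSIM itself, whose details are not given here. It is a generic greedy selection that weights each candidate's prediction entropy by its dissimilarity to the samples already chosen. The function names, the cosine similarity measure and the toy vectors are assumptions, and the vectors are assumed non-negative (for example, bag-of-words counts).

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_batch(vectors, probs, batch_size):
    """Greedily pick uncertain samples that are unlike those already picked."""
    uncertainty = [entropy(p) for p in probs]
    remaining = set(range(len(vectors)))
    chosen = []
    while remaining and len(chosen) < batch_size:
        def utility(i):
            # Redundancy = highest similarity to any already-selected sample.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in chosen), default=0.0
            )
            return uncertainty[i] * (1.0 - redundancy)
        best = max(sorted(remaining), key=utility)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy pool (assumption): messages 0 and 1 are duplicates of each other.
vectors = [(1, 0), (1, 0), (0, 1), (1, 1)]
probs = [(0.5, 0.5), (0.5, 0.5), (0.6, 0.4), (0.9, 0.1)]
print(select_batch(vectors, probs, 2))  # → [0, 2]
```

Ranking by entropy alone would send both duplicate messages (0 and 1) to the annotator. The redundancy term drops the utility of message 1 to zero once message 0 has been selected, so the slightly less uncertain but distinct message 2 is chosen instead.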
INDEX TERMS Active learning, churn prediction, query strategy, entropy, topic classification.
SOUMI DE (Member, IEEE) received the B.Sc. degree (Hons.