Hideko KAWAKUBO†a), Marthinus Christoffel DU PLESSIS††b), Nonmembers, and Masashi SUGIYAMA††c), Member
SUMMARY    In many real-world classification problems, the class balance often changes between the training and test datasets, due to sample selection bias or the non-stationarity of the environment. Naive classifier training under such a class balance change systematically yields a biased solution. It is known that this systematic bias can be corrected by weighted training according to the test class balance. However, the test class balance is often unknown in practice. In this paper, we consider a semi-supervised learning setup where labeled training samples and unlabeled test samples are available, and we propose a class balance estimator based on the energy distance. Through experiments, we demonstrate that the proposed method is computationally much more efficient than existing approaches, with comparable accuracy.
key words: class balance change, class-prior estimation, energy distance
Introduction

A fundamental assumption in supervised machine learning is that the training and test data follow the same probability distribution. However, in real-world data, this assumption does not necessarily hold, due to intrinsic sample selection bias or the non-stationarity of the environment [1], and naive training yields a biased solution [2]. In this paper, we consider the situation called class balance change in classification [3], where only the class-prior probabilities change between the training and test phases. In principle, the bias caused by a class balance change can be corrected by weighted training according to the class ratio of the test data. However, in practice, the test class balance is often unknown and thus needs to be estimated from data.

So far, semi-supervised class balance estimators that use labeled training samples and unlabeled test samples have been developed, which are based on fitting a mixture of class-wise training input distributions to the test input distribution. A seminal method [4] adopts the expectation-maximization (EM) algorithm [5] to estimate the class ratio. Another earlier paper [3] showed that the EM-based method can be interpreted as indirectly fitting a mixture of class-wise training input distributions to the test input distribution.

The divergence-based methods reviewed above [3], [11] are equipped with cross-validation (CV), and therefore all tuning parameters can be objectively optimized. Thanks to this property, the divergence-based methods work very well in practice, although CV is computationally rather expensive. On the other hand, choosing a kernel function in the MMD-based method is not straightforward because changing the kernel function corresponds to changing the error metric, and thus CV cannot be employed. Using the median distance of samples as the Gaussian kernel width is a popular heuristic in MMD [12], but this can cause significant performance degradation in practice [15]. Using MKL for MMD is potentially powerful, but this implementation is computationally highly...
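To make the weighted-training correction concrete, the following minimal sketch weights each labeled training sample by the ratio of the (estimated) test class prior to the training class prior for its label, and passes these weights to an off-the-shelf classifier. The arrays X_train and y_train, the value of estimated_test_priors, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the specific implementation used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def class_balance_weights(y_train, test_priors):
    """Per-sample weights pi_test(y) / pi_train(y) for integer labels.

    test_priors: array of (estimated) test class priors, indexed by label.
    """
    classes, counts = np.unique(y_train, return_counts=True)
    train_priors = counts / counts.sum()
    ratio = {c: test_priors[c] / train_priors[i] for i, c in enumerate(classes)}
    return np.array([ratio[y] for y in y_train])

# Hypothetical training data with a 0.8 / 0.2 class balance, and a test
# class balance assumed to have been estimated (e.g., as described above).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(2, 1, (20, 2))])
y_train = np.array([0] * 80 + [1] * 20)
estimated_test_priors = np.array([0.4, 0.6])  # assumed value, for illustration

weights = class_balance_weights(y_train, estimated_test_priors)
clf = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
```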
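The EM-based estimator of [4] can be understood as a fixed-point iteration: the posteriors produced by a classifier trained on the labeled data are re-weighted by the current ratio of test to training class priors, and the test prior is then updated to the average of the re-weighted posteriors over the unlabeled test samples. The sketch below illustrates this update under the assumption that a matrix of training-domain posteriors p_train(y | x) on the test inputs is available (e.g., from predict_proba of a probabilistic classifier); it is a generic illustration rather than the exact implementation of [4].

```python
import numpy as np

def em_class_priors(test_posteriors, train_priors, n_iter=100, tol=1e-6):
    """Estimate test class priors from classifier outputs on unlabeled test data.

    test_posteriors: (n_test, n_classes) array of p_train(y | x) on test inputs.
    train_priors:    (n_classes,) array of training class priors.
    """
    priors = np.asarray(train_priors, dtype=float).copy()
    for _ in range(n_iter):
        # Re-weight the training-domain posteriors by the current prior ratio
        # and renormalize them per test sample.
        weighted = test_posteriors * (priors / train_priors)
        posteriors = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average corrected posterior.
        new_priors = posteriors.mean(axis=0)
        if np.max(np.abs(new_priors - priors)) < tol:
            break
        priors = new_priors
    return priors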
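For the energy distance, fitting the mixture theta * p(x | y=1) + (1 - theta) * p(x | y=2) to the test input distribution reduces to minimizing a convex quadratic in theta whose coefficients are expectations of pairwise Euclidean distances, so the class prior admits a closed-form estimate from empirical distance averages. The sketch below, for binary classification with hypothetical arrays X1 and X2 (training inputs of each class) and X_test (unlabeled test inputs), is an illustrative implementation of this reduction under simple V-statistic distance estimates; it is not claimed to be the exact estimator proposed in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mean_pairwise_distance(A, B):
    """Empirical (V-statistic) estimate of E||a - b|| over samples A and B."""
    return cdist(A, B).mean()

def energy_distance_prior(X1, X2, X_test):
    """Closed-form minimizer over theta of the energy distance between
    theta * p1 + (1 - theta) * p2 and the test input distribution."""
    A1 = mean_pairwise_distance(X1, X_test)   # E||x1 - x'||
    A2 = mean_pairwise_distance(X2, X_test)   # E||x2 - x'||
    B11 = mean_pairwise_distance(X1, X1)      # E||x1 - x1~||
    B22 = mean_pairwise_distance(X2, X2)      # E||x2 - x2~||
    B12 = mean_pairwise_distance(X1, X2)      # E||x1 - x2||

    # The theta-dependent part of the energy distance is a*theta^2 + b*theta.
    a = 2.0 * B12 - B11 - B22                 # >= 0, so the objective is convex
    b = 2.0 * (A1 - A2 - B12 + B22)
    theta = -b / (2.0 * a)
    return float(np.clip(theta, 0.0, 1.0))
```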