Reuse of learnt knowledge is of critical importance in most knowledge-intensive application areas, particularly because the operating context can be expected to vary from training to deployment. Dataset shift, where the training and test data follow different distributions, is a crucial example. However, most existing algorithms for handling dataset shift require a costly retraining operation and cannot reuse the existing model. In this paper, we propose a new approach, called reframing, to handle dataset shift. The main objective of reframing is to build a model once and keep it usable without retraining. We propose two efficient reframing algorithms that learn the optimal shift-parameter values using only a small amount of labelled data available at deployment. The shifted input attributes can then be transformed with these optimal parameter values, so that the same existing model can be used in several deployment environments without retraining. We address supervised learning tasks for both classification and regression. Extensive experimental results demonstrate the efficiency and effectiveness of our approach compared to existing solutions. In particular, we report the existence of dataset shift in two real-life datasets; these previously unknown real-life shifts can also be accurately modelled by our algorithms.
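The core idea of the abstract, namely reusing a frozen model by estimating shift parameters from a small labelled deployment sample, can be illustrated with a minimal sketch. This is not the paper's actual algorithm: it assumes a simple affine shift on a single numeric attribute (hypothetical parameters `a`, `b`) and a linear model, and recovers the parameters by least squares.

```python
# Hedged sketch of "reframing": reuse a model trained in one context by
# learning a simple affine input correction from a small labelled sample
# taken in the deployment environment. All names (a, b, fit_reframing)
# are illustrative; the paper's reframing algorithms are more general.
import numpy as np

rng = np.random.default_rng(0)

# --- training context: fit a model once ---
X_train = rng.normal(0.0, 1.0, size=(500, 1))
w_true = 3.0
y_train = w_true * X_train[:, 0]
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]  # frozen linear model

# --- deployment context: the observed inputs are shifted and rescaled ---
a_true, b_true = 2.0, 0.5              # unknown shift parameters
X_dep = rng.normal(0.0, 1.0, size=(20, 1))
X_shifted = a_true + b_true * X_dep    # what we actually observe
y_dep = w_true * X_dep[:, 0]           # small labelled deployment sample

def fit_reframing(X_obs, y_obs, w):
    """Estimate (a, b) such that feeding (X_obs - a) / b to the frozen
    model best reproduces y_obs. The prediction w*(x - a)/b is linear in
    (1/b, a/b), so the two shift parameters come from one least squares."""
    A = np.column_stack([w[0] * X_obs[:, 0], -w[0] * np.ones(len(X_obs))])
    c = np.linalg.lstsq(A, y_obs, rcond=None)[0]  # c = (1/b, a/b)
    b_hat = 1.0 / c[0]
    a_hat = c[1] * b_hat
    return a_hat, b_hat

a_hat, b_hat = fit_reframing(X_shifted, y_dep, w)
X_reframed = (X_shifted - a_hat) / b_hat
preds = X_reframed @ w  # same frozen model, no retraining
```

In this noiseless toy setting the two parameters are recovered exactly, and the frozen model's predictions on the reframed inputs match the deployment labels, which is the effect the abstract describes.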
Companies want to extract value from their relational databases; this is the aim of relational data mining. Propositionalization is one possible approach to relational data mining: it adds new attributes, called features, to the main table, leading to an attribute-value representation, a single table, on which a propositional learner can be applied. However, current relational databases are large and contain mixed numerical and categorical data. Moreover, relational data is characterized by one-to-many relationships. As an example of such data, consider customers purchasing products: each customer can purchase several products. There is therefore a need for techniques able to learn complex aggregates. Learning such features means exploring a combinatorial, possibly infinite, space, and such an approach is prone to overfitting. We introduce a propositionalization approach dedicated to a robust Bayesian classifier. It efficiently samples a given number of features from the language bias, following a distribution over the complex aggregates. This distribution is also used to penalize complex aggregates in the regularization of the robust Bayesian classifier. Experiments show that it performs better than state-of-the-art methods on most investigated benchmarks and can deal with large datasets more easily. A new real, large, mixed relational dataset is introduced, which confirms the ability of our approach to learn complex aggregates.
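The mechanics of propositionalization over a one-to-many relation can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's method: the customer/purchase tables, the tiny aggregate language, and the helper names (`sample_features`, `propositionalize`) are all hypothetical, and the uniform sampling stands in for the paper's distribution over complex aggregates.

```python
# Hedged sketch of propositionalization: flatten a one-to-many relation
# (customers -> purchases) into a single attribute-value table by sampling
# aggregate features. The aggregate language and all names are illustrative.
import random
import statistics

# Main table: one row per customer. Secondary table: many purchases each.
customers = [1, 2, 3]
purchases = {
    1: [10.0, 12.0, 30.0],
    2: [5.0],
    3: [8.0, 9.0, 7.0, 100.0],
}

# A tiny language bias: aggregate functions over a customer's purchase amounts.
AGGREGATES = {
    "count": len,
    "mean": statistics.fmean,
    "max": max,
    "min": min,
}

def sample_features(n, rng):
    """Sample n aggregate features from the language bias. A real system
    would sample from a distribution that down-weights complex aggregates."""
    return [rng.choice(sorted(AGGREGATES)) for _ in range(n)]

def propositionalize(feature_names):
    """Build one attribute-value row per customer, one column per sampled
    aggregate, so a propositional learner can be applied to the result."""
    return {
        cid: {f: AGGREGATES[f](purchases[cid]) for f in feature_names}
        for cid in customers
    }

rng = random.Random(0)
features = sample_features(3, rng)
flat_table = propositionalize(features)  # single-table representation
```

The resulting `flat_table` is the attribute-value representation the abstract refers to; a (robust) Bayesian classifier would then be trained on it, with the sampling distribution reused to regularize against overly complex aggregates.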