This paper presents the performance of a classifier built using the stackingC algorithm in nine different data sets. Each data set is generated using a sampling technique applied on the original imbalanced data set. Five new sampling techniques are proposed in this paper (i.e., SMOTERandRep, Lax Random Oversampling, Lax Random Undersampling, Combined-Lax Random Oversampling Undersampling, and Combined-Lax Random Undersampling Oversampling) that were based on the three sampling techniques (i.e., Random Undersampling, Random Oversampling, and Synthetic Minority Oversampling Technique) usually used as solutions in imbalance learning. The metrics used to evaluate the classifier's performance were F-measure and G-mean. Fmeasure determines the performance of the classifier for every class, while G-mean measures the overall performance of the classifier. The results using F-measure showed that for the data without a sampling technique, the classifier's performance is good only for the majority class. It also showed that among the eight sampling techniques, RU and LRU have the worst performance while other techniques (i.e., RO, C-LRUO and C-LROU) performed well only on some classes. The best performing techniques in all data sets were SMOTE, SMOTERandRep, and LRO having the lowest Fmeasure values between 0.5 and 0.65. The results using Gmean showed that the oversampling technique that attained the highest G-mean value is LRO (0.86), next is C-LROU (0.85), then SMOTE (0.84) and finally is SMOTERandRep (0.83). Combining the result of the two metrics (F-measure and G-mean), only the three sampling techniques are considered as good performing (i.e., LRO, SMOTE, and SMOTERan-dRep)
Abstract. The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has been sidelined in the literature due to the perceived inadequacy of its label-independence assumption. Instead, most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, especially in terms of scalability to large datasets. We exemplify this with a novel chaining method that can model label correlations while maintaining acceptable computational complexity. Empirical evaluation over a broad range of multi-label datasets with a variety of evaluation metrics demonstrates the competitiveness of our chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multi-label datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is when concepts drift or change completely, is becoming one of the core issues. When tackling non-stationary concepts, ensembles of classifiers have several advantages over single classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and they therefore usually also generate more accurate concept descriptions. This paper proposes a new experimental data stream framework for studying concept drift, and two new variants of Bagging: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. Using the new experimental framework, an evaluation study on synthetic and real-world datasets comprising up to ten million examples shows that the new ensemble methods perform very well compared to several known methods.
In learning to classify streaming data, obtaining true labels may require major effort and may incur excessive cost. Active learning focuses on carefully selecting as few labeled instances as possible for learning an accurate predictive model. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and models need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. Changes occurring further from the boundary may be missed, and models may fail to adapt. This paper presents a theoretically supported framework for active learning from drifting data streams and develops three active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time, and randomization of the search space. We empirically demonstrate that these strategies react well to changes that can occur anywhere in the instance space and unexpectedly.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.