Abstract: An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions that change over time. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this paper, we introduce compacted object sample extraction (COMPOSE), a computational geometry-based framework to learn from nonstationary streaming data, where labels are unavailable (or presented very sporadically) after initialization. We introduce the algorithm in detail, and discuss its results and performance on several synthetic and real-world data sets, which demonstrate the ability of the algorithm to learn under several different scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.

Index Terms: Alpha shape, concept drift, nonstationary environment, semisupervised learning (SSL), verification latency.
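To make the setting concrete, here is a minimal sketch of a COMPOSE-style loop: labels exist only at initialization, each new batch is labeled semi-supervisedly using the previous step's core supports as the only labeled data, and a compacted subset is carried forward. This is an illustration under stated assumptions, not the paper's implementation: scikit-learn's LabelPropagation stands in for the SSL step, and `extract_core_supports` with its `keep_fraction` parameter is a hypothetical distance-based stand-in for the paper's α-shape compaction.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def extract_core_supports(X, keep_fraction=0.7):
    """Stand-in for core support extraction: keep the `keep_fraction` of
    instances closest to the class mean (the paper compacts alpha shapes)."""
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
    keep = np.argsort(dist)[: max(1, int(keep_fraction * len(X)))]
    return X[keep]

def compose_stream(initial_X, initial_y, unlabeled_batches, keep_fraction=0.7):
    """Process a stream in which labels are available only at initialization."""
    core_X, core_y = initial_X, initial_y
    for X_new in unlabeled_batches:
        # SSL step: the current core supports are the only labeled instances.
        X_all = np.vstack([core_X, X_new])
        y_all = np.concatenate([core_y, -np.ones(len(X_new), dtype=int)])
        ssl = LabelPropagation().fit(X_all, y_all)  # -1 marks unlabeled points
        y_new = ssl.transduction_[len(core_X):]
        yield X_new, y_new  # hypothesis for the current batch
        # Core support extraction: compact each class of the newly labeled data
        # so that only instances likely to remain relevant are carried forward.
        cores, labels = [], []
        for c in np.unique(y_new):
            Xc = extract_core_supports(X_new[y_new == c], keep_fraction)
            cores.append(Xc)
            labels.append(np.full(len(Xc), c))
        core_X, core_y = np.vstack(cores), np.concatenate(labels)
```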
Learning in nonstationary environments, also called concept drift, requires an algorithm to track and learn from streaming data drawn from a nonstationary (drifting) distribution. When data arrive continuously, a concept drift algorithm must maintain an up-to-date hypothesis that evolves with the changing environment. A more difficult problem that has received less attention, however, is learning from so-called initially labeled nonstationary environments, where the environment provides only unlabeled data after initialization. Since the labels of such data never become available, learning in this setting is also referred to as extreme verification latency: the algorithm must keep the hypothesis current using unlabeled data alone. In this contribution, we analyze COMPOSE, a framework recently proposed for learning in such environments. One of the central processes of COMPOSE is core support extraction, in which the algorithm predicts which data instances will be useful and relevant for classification in future time steps. We compare two options for core support extraction, namely Gaussian mixture model (GMM) based maximum a posteriori sampling and α-shape compaction, and analyze their effects on both the accuracy and the computational complexity of the algorithm. Our findings point to a trade-off, as is the case in most engineering problems: α-shapes are more versatile in most situations, but they are far more computationally complex, especially as the dimensionality of the data set increases. Our proposed GMM procedure allows COMPOSE to operate on data sets of substantially larger dimensionality without affecting its classification performance.
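A hedged sketch of the GMM-based alternative described above: fit a Gaussian mixture to one class's data and retain the instances lying in the densest regions of the fitted mixture as that class's core supports. The component count, keep fraction, and function name are illustrative assumptions, not values or names from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_core_supports(X, n_components=3, keep_fraction=0.7, seed=0):
    """Return the `keep_fraction` of X with the highest density under a GMM
    fit to X (maximum a posteriori sampling of core supports)."""
    gmm = GaussianMixture(n_components=min(n_components, len(X)),
                          random_state=seed).fit(X)
    log_density = gmm.score_samples(X)        # log p(x) under the mixture
    n_keep = max(1, int(keep_fraction * len(X)))
    keep = np.argsort(log_density)[-n_keep:]  # densest instances survive
    return X[keep]
```

Compared with constructing and compacting an α-shape, whose cost grows rapidly with dimensionality, fitting a GMM and scoring samples scales far more gracefully, which is the computational advantage the abstract refers to.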