Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which must be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received much attention over the last decade, many research approaches are designed for well-behaved, controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion of eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve problems such as protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analyzing complex data, and evaluating stream mining algorithms. The resulting analysis is illustrated with practical applications and provides general suggestions concerning future lines of research in data stream mining.
In this paper, we examine approaches for reducing the complexity of evolving fuzzy systems (EFSs) by eliminating local redundancies during training, while evolving the models from on-line data streams. The complexity reduction steps must therefore support fast, incremental, single-pass processing. In EFSs, such reduction steps are important for several reasons: (1) originally distinct rules representing distinct local regions in the input/output data space may move together over time and become significantly overlapping as data samples fill the gaps between them; (2) two or more fuzzy sets in the fuzzy partitions may become redundant because of projecting high-dimensional clusters onto the single axes; (3) complexity reduction can also be seen as a first step towards better readability and interpretability of fuzzy systems, as unnecessary information is discarded and the models are made more transparent. The first technique performs a new rule merging approach directly in the product cluster space, using a novel concept for calculating the similarity degree between an updated rule and the remaining ones. Inconsistent rules, elicited by comparing the similarity of two redundant rule antecedent parts with the similarity of their consequents, are specifically handled in the merging procedure. The second technique operates directly in the fuzzy partition space, where redundant fuzzy sets are merged based on their joint α-cut levels. Redundancy is measured by a novel kernel-based similarity measure. The complexity reduction approaches are evaluated on high-dimensional noisy real-world measurements and an artificially generated data stream containing 1.2 million samples. Based on this empirical comparison, it will be shown that the novel techniques are (1) fast enough to cope with on-line demands and (2) produce fuzzy systems with fewer structural components while achieving accuracies similar to EFSs that do not integrate any reduction steps.
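To make the redundancy-detection-and-merge idea concrete, here is a minimal sketch assuming Gaussian membership functions, a simple set-theoretic (Jaccard-style) similarity on a discretized universe, and a moment-matching merge. These are illustrative stand-ins: the paper's actual kernel-based similarity measure and its merge update differ, and the threshold value is an arbitrary choice.

```python
import numpy as np

def gaussian_mf(x, c, sigma):
    """Gaussian membership function with center c and width sigma."""
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def jaccard_similarity(mu_a, mu_b):
    """Set-theoretic similarity of two discretized fuzzy sets:
    |A ∩ B| / |A ∪ B|, with min as intersection and max as union."""
    inter = np.minimum(mu_a, mu_b).sum()
    union = np.maximum(mu_a, mu_b).sum()
    return inter / union if union > 0 else 0.0

def merge_gaussians(c1, s1, c2, s2, w1=1.0, w2=1.0):
    """Merge two Gaussian fuzzy sets into one, weighting by support
    counts (generic moment matching, not the paper's exact update)."""
    w = w1 + w2
    c = (w1 * c1 + w2 * c2) / w
    # Match the second moment of the weighted mixture.
    var = (w1 * (s1**2 + c1**2) + w2 * (s2**2 + c2**2)) / w - c**2
    return c, np.sqrt(var)

# Two fuzzy sets that have drifted close together on one input axis.
x = np.linspace(-5.0, 5.0, 1001)
mu_a = gaussian_mf(x, 0.0, 1.0)
mu_b = gaussian_mf(x, 0.4, 1.1)

sim = jaccard_similarity(mu_a, mu_b)
if sim > 0.6:  # redundancy threshold (illustrative choice)
    c_new, s_new = merge_gaussians(0.0, 1.0, 0.4, 1.1)
    print(f"merged: center={c_new:.2f}, sigma={s_new:.2f}")
```

The same single-pass pattern applies at the rule level: after each incremental update, compare the updated rule against the remaining ones and merge when the similarity exceeds a threshold, so the check costs only one pass over the current rule base.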