The need for efficient tools is pressing in the era of big data, particularly in streaming data applications. As data streams are ubiquitous, the ability to accurately detect multiple changepoints, without affecting the continuous flow of data, is an important issue. Change detection for categorical data streams is understudied, and existing work commonly introduces fixed control parameters while providing little insight into how they may be chosen. This is ill-suited to the streaming paradigm, motivating the need for an approach that introduces few parameters which may be set without requiring any prior knowledge of the stream. This paper introduces such a method, which can accurately detect changepoints in categorical data streams with fixed storage and computational requirements. The detector relies on the ability to adaptively monitor the category probabilities of a multinomial distribution, where temporal adaptivity is introduced using forgetting factors. A novel adaptive threshold is also developed which can be computed given a desired false positive rate. This method is then compared to sequential and nonsequential change detectors in a large simulation study which verifies the usefulness of our approach. A real data set consisting of nearly 40 million events from a computer network is also investigated.
Sequentially detecting multiple changepoints in a data stream is a challenging task. Difficulties relate to both computational and statistical aspects, and in the latter, specifying control parameters is a particular problem. Choosing control parameters typically relies on unrealistic assumptions, such as the distributions generating the data, and their parameters, being known. This is implausible in the streaming paradigm, where several changepoints will exist. Further, current literature is mostly concerned with streams of continuous-valued observations, and focuses on detecting a single changepoint. There is a dearth of literature dedicated to detecting multiple changepoints in transition matrices, which arise from a sequence of discrete states. This paper makes the following contributions: a complete framework is developed for adaptively and sequentially estimating a Markov transition matrix in the streaming data setting. A change detection method is then developed, using a novel moment matching technique, which can effectively monitor for multiple changepoints in a transition matrix. This adaptive detection and estimation procedure for transition matrices, referred to as ADEPT-M, is compared to several change detectors on synthetic data streams, and is implemented on two real-world data streams – one consisting of over nine million HTTP web requests, and the other being a well-studied electricity market data set.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.