We present PROUD, a PRObabilistic approach to processing similarity queries over Uncertain Data streams, where the streams are mainly time series. In contrast to data with certainty, an uncertain series is an ordered sequence of random variables, so the distance between two uncertain series is also a random variable. We use a general uncertain data model in which only the mean and the deviation of each random variable at each timestamp are available. We derive mathematical conditions for progressively pruning candidates to reduce the computation cost. We then apply PROUD to a streaming environment where only sketches of streams, such as wavelet synopses, are available. Extensive experiments evaluate the effectiveness of PROUD and compare it with Det, a deterministic approach that processes the data directly without considering uncertainty. The results show that PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a computation cost similar to that of Det. Det does not provide such flexibility. This trade-off is important because in some applications false negatives are more costly, while in others it is more critical to keep false positives low.
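To make the idea of a random-variable distance concrete, the following is a minimal sketch and not the derivation from the paper. It assumes, purely for illustration, that the values at different timestamps are independent, that each value is Gaussian with the given mean and deviation, and that the squared Euclidean distance is close enough to normal (central limit theorem) to estimate the probability that it falls within a query radius. The names `match_probability`, `is_candidate`, `radius`, and `tau` are hypothetical.

```python
import math


def match_probability(mean_x, dev_x, mean_y, dev_y, radius):
    """Estimate P(squared distance <= radius^2) for two uncertain series.

    Assumptions (illustrative, not taken from the paper): independent
    Gaussian values per timestamp and a normal approximation of the sum.
    """
    exp_d2 = 0.0  # expected squared distance
    var_d2 = 0.0  # variance of the squared distance
    for mx, sx, my, sy in zip(mean_x, dev_x, mean_y, dev_y):
        m = mx - my               # mean of the difference at this timestamp
        s2 = sx * sx + sy * sy    # variance of the difference
        exp_d2 += m * m + s2
        # Variance of the square of a Gaussian N(m, s2) variable.
        var_d2 += 2.0 * s2 * s2 + 4.0 * m * m * s2
    if var_d2 == 0.0:
        # No uncertainty at all: the distance is deterministic.
        return 1.0 if exp_d2 <= radius * radius else 0.0
    z = (radius * radius - exp_d2) / math.sqrt(var_d2)
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def is_candidate(mean_x, dev_x, mean_y, dev_y, radius, tau):
    """Keep the pair only if the estimated match probability reaches tau."""
    return match_probability(mean_x, dev_x, mean_y, dev_y, radius) >= tau
```

In this sketch, raising `tau` discards more pairs (fewer false positives, more false negatives) and lowering it keeps more pairs, which is the kind of threshold-controlled trade-off the abstract describes.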
In applications involving multiple data streams, such as stock market trading and sensor network data analysis, the clusters of streams change at different times because of data evolution. Information about evolving clusters is valuable for supporting the corresponding online decisions. In this paper, we present a framework for Clustering Over Multiple Evolving sTreams by CORrelations and Events, abbreviated as COMET-CORE, which monitors the distribution of clusters over multiple data streams based on their correlations. Instead of periodically re-clustering the data streams directly, COMET-CORE applies efficient cluster split and merge processes only when significant cluster evolution happens. Accordingly, we devise an event detection mechanism to signal the cluster adjustments. Incoming streams are smoothed into sequences of end points by piecewise linear approximation. Whenever end points are generated, the weighted correlations between streams are updated. End points are good indicators of significant change in a stream, and such changes are a main cause of cluster evolution events. When an event occurs, split and merge operations produce the latest clustering results. Our experimental studies show that COMET-CORE performs effectively and achieves good clustering quality.
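The end-point idea can be illustrated with a small sketch. It is not the COMET-CORE algorithm itself: it uses a generic sliding-window piecewise linear approximation with a hypothetical error tolerance `eps`, and a plain (unweighted) Pearson correlation stands in for the weighted correlation mentioned above, whose weighting scheme is not given here. The function names `end_points` and `pearson` are assumptions.

```python
import math


def end_points(values, eps):
    """Return indices of segment end points for a sliding-window PLA.

    A segment grows while every point inside it stays within eps of the
    straight line from the segment start to the newest point.
    """
    if not values:
        return []
    ends = [0]
    anchor = 0  # index where the current segment starts
    for i in range(2, len(values)):
        slope = (values[i] - values[anchor]) / (i - anchor)
        fits = all(
            abs(values[anchor] + slope * (j - anchor) - values[j]) <= eps
            for j in range(anchor + 1, i)
        )
        if not fits:
            # The previous point closes the segment and starts the next one.
            ends.append(i - 1)
            anchor = i - 1
    if len(values) > 1:
        ends.append(len(values) - 1)
    return ends


def pearson(a, b):
    """Unweighted Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(var_a * var_b)
```

Under these assumptions, a new end point marks a change in trend, and recomputing correlations only at such points, rather than on every arriving value, is one way to decide when a split or merge check is worth running.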