A major problem in detecting events in streams of data is that the data can be imprecise (e.g., RFID data). However, current state-of-the-art event detection systems such as Cayuga [14], SASE [46], or SnoopIB [1] assume the data is precise. Noise in the data can be captured using techniques such as hidden Markov models. Inference on these models creates streams of probabilistic events, which cannot be directly queried by existing systems. To address this challenge, we propose Lahar, an event processing system for probabilistic event streams. By exploiting the probabilistic nature of the data, Lahar yields much higher recall and precision than deterministic techniques that operate over only the most probable tuples. By using a novel static analysis and novel algorithms, Lahar processes data orders of magnitude more efficiently than a naïve approach based on sampling. In this paper, we present Lahar's static analysis and core algorithms. We demonstrate the quality and performance of our approach through experiments with our prototype implementation and comparisons with alternate methods.
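To make the contrast with "only the most probable tuples" concrete, here is a minimal sketch. It is not Lahar's algorithm: the stream values, the function names, and the assumption that timesteps are independent are all hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical probabilistic stream for one RFID tag: one distribution
# over locations per timestep (illustrative values, not from the paper).
stream = [
    {"hall": 0.60, "office": 0.40},
    {"hall": 0.55, "office": 0.45},
    {"hall": 0.70, "office": 0.30},
]

def most_probable_tuples(stream):
    """Deterministic view: keep only the most likely state per timestep."""
    return [max(dist, key=dist.get) for dist in stream]

def detected_deterministically(stream, state):
    """True if the state ever appears in the most-probable-tuple view."""
    return state in most_probable_tuples(stream)

def prob_ever_in(stream, state):
    """Probability that the state occurs at least once.

    ASSUMPTION: timesteps are treated as independent, which real
    streams derived from a hidden Markov model generally are not.
    """
    return 1.0 - prod(1.0 - dist.get(state, 0.0) for dist in stream)

print(most_probable_tuples(stream))
print(detected_deterministically(stream, "office"))
print(round(prob_ever_in(stream, "office"), 3))
```

In this toy stream the deterministic view never reports "office", although under the stated independence assumption the tag was more likely than not in the office at some point. This is the kind of lost recall that the abstract attributes to deterministic techniques.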
Model-based views have recently been proposed as an effective method for querying noisy sensor data. Commonly used models from the AI literature (e.g., the hidden Markov model) expose to applications a stream of probabilistic and correlated state estimates computed from the sensor data. Many applications want to detect sophisticated patterns of states from these Markovian streams. Such queries are called event queries. In this paper, we present a new Markovian stream storage manager, Caldera. We develop and evaluate Caldera as a component of Lahar, a Markovian stream event query processing system developed in previous work. At the heart of Caldera is a set of access methods for Markovian streams that can improve event query performance by orders of magnitude compared to existing techniques, which must scan the entire stream. Our access methods use new adaptations of traditional B+ tree indexes, and a new index, called the Markov-chain index. They efficiently extract only the relevant timesteps from a stream, while retaining the stream's Markovian properties. We have implemented our prototype system on BDB and demonstrate its effectiveness on both synthetic data and real data from a building-wide RFID deployment.

I. INTRODUCTION

Applications that make decisions based on sensor data are increasingly common, with sensor deployments now playing integral roles in supply chain automation [5], [39], environment monitoring [17], elder-care [25], [28], etc. Unfortunately, building applications on top of raw sensor data remains challenging because sensors produce inaccurate information, frequently fail, and can rarely collect data on an entire region of interest. As an example, consider a Radio Frequency IDentification (RFID) tracking application [38] in which RFID readers are distributed throughout an environment.
Ideally, when a tag (carried by a person or attached to an object) passes close to a reader, the reader detects and logs the tag's presence: e.g., Bob's tag was sighted by reader A at time 7, reader B at time 8, etc. In practice, however, readers often fail to detect nearby tags [40], and cannot provide information about a tag's position within the reader's range. Applications are thus forced to deal with imprecise input streams.

The reduction of errors and gaps in sensor data streams is the focus of a large body of probabilistic modeling/inference techniques developed in the AI community [34]. While a limited number of these techniques can be applied in real time, the most effective (Bayesian smoothing [13]) can be applied only as a post-processing step, after the raw data stream is archived. Our goal is to support archive-based applications that leverage this smoothed data in order to provide the most accurate possible answers to historical queries (e.g., "Was Bob in his office yesterday?", "Did Margot take her medication before breakfast every day last month?", etc.).

The result of any smoothing technique is a probabilistic stream in which each timestep encodes not a single state, but a distribution over possibl...
Abstract: A large amount of the world's data is both sequential and imprecise. Such data is commonly modeled as Markovian streams; examples include words/sentences inferred from raw audio signals, or discrete location sequences inferred from RFID or GPS data. The rich semantics and large volumes of these streams make them difficult to query efficiently. In this paper, we study the effects, on both efficiency and accuracy, of two common stream approximations. Through experiments on a real-world RFID data set, we identify conditions under which these approximations can improve performance by several orders of magnitude, with only minimal effects on query results. We also identify cases in which the full rich semantics are necessary.
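As a rough illustration of what the "rich semantics" of a Markovian stream can mean for a query, the sketch below represents a stream as an initial distribution plus per-timestep conditional probability tables, and compares an exact sequence probability with one computed from marginals alone. The abstract does not name the two approximations it studies; dropping correlations is used here purely as an assumed example, and all names and numbers are hypothetical.

```python
# Hypothetical two-state Markovian stream: an initial distribution and
# one conditional probability table (CPT) per transition between timesteps.
initial = {"hall": 0.5, "office": 0.5}
transitions = [
    {"hall": {"hall": 0.9, "office": 0.1},
     "office": {"hall": 0.1, "office": 0.9}},
]

def marginals(initial, transitions):
    """Per-timestep marginal distributions derived from the CPTs."""
    out = [dict(initial)]
    for cpt in transitions:
        nxt = {}
        for state, p in out[-1].items():
            for next_state, q in cpt[state].items():
                nxt[next_state] = nxt.get(next_state, 0.0) + p * q
        out.append(nxt)
    return out

def path_prob_exact(initial, transitions, path):
    """P(path) using the correlations between consecutive timesteps."""
    p = initial.get(path[0], 0.0)
    for t in range(1, len(path)):
        p *= transitions[t - 1][path[t - 1]].get(path[t], 0.0)
    return p

def path_prob_independent(initial, transitions, path):
    """ASSUMED approximation: multiply marginals, ignoring correlations."""
    p = 1.0
    for dist, state in zip(marginals(initial, transitions), path):
        p *= dist.get(state, 0.0)
    return p

query = ["hall", "office"]  # "in the hall, then in the office"
print(path_prob_exact(initial, transitions, query))
print(path_prob_independent(initial, transitions, query))
```

With these toy numbers the marginals-only estimate is several times larger than the exact value, because the stream strongly favors staying in place. For a single-timestep query the two would agree, which hints at why an approximation may be harmless for some queries and not for others.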