Mining of Frequent Itemsets from Streams of Uncertain Data

Leung, Carson K.; Hao, Boyu

doi:10.1109/icde.2009.157

Cited by 79 publications

(40 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Our work is similar to that of [5], whose algorithm processes uncertain data flows. Flows are characterized by a number of transactions containing items, and must be treated immediately because they are no longer available for reuse.…”

Section: Discussionmentioning

confidence: 75%

Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web

Cryans

Ratté

Champagne

2010

2010 Second International Conference on Advances in Databases, Knowledge, and Data Applications

View full text Add to dashboard Cite

Abstract-The Semantic Web has made possible the use of the Internet to extract useful content, a task that could necessitate an infrastructure across the Web. With Hadoop, a free implementation of the MapReduce programming paradigm created by Google, we can treat these data reliably over hundreds of servers. This article describes how the Apriori algorithm was adapted to MapReduce in the search for relations between entities to deal with thousands of Web pages coming from RSS feeds daily. First, every feed is looked up five times per day and each entry is registered in a database with MapReduce. Second, the entries are read and their content sent to the Web service OpenCalais for the detection of named entities. For each Web page, the set of all itemsets found is generated and stored in the database. Third, all generated sets, from first to last, are counted and their support is registered. Finally, various analytical tasks are executed to present the relationships found. Our tests show that the third step, executed over 3,000,000 sets, was 4.5 times faster using five servers than using a single machine. This approach allows us to easily and automatically distribute treatments on as many machines as are available, and be able to process datasets that one server, even a very powerful one, would not be able to manage alone. Based on these findings, we can generalize that, with this great scalability, processing more data would be faster, depending on the number of servers used.

show abstract

Section: Discussionmentioning

confidence: 75%

Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web

Cryans

Ratté

Champagne

2010

2010 Second International Conference on Advances in Databases, Knowledge, and Data Applications

View full text Add to dashboard Cite

show abstract

“…Sliding windows allow maintaining the result any time the stream is updated, but they need more CPU. Today, existing methods for probabilistic data stream mining are batchbased and work with Expected Support [17], [11], [10]. Meanwhile, working with sliding windows is a major matter for numerous monitoring applications where handling "anytime queries" is crucial.…”

Section: Huan Huanmentioning

confidence: 99%

“…Their approaches allow finding items (itemsets of only one item) in static data and likely frequent items in data streams. [11] proposes to extract frequent itemset from streaming probabilistic data by means of Expected Support and a batch model. In [10], we find a batch-based approach to extract frequent itemsets using Expected Support in probabilistic data streams with a technique inspired from [7].…”

Section: Related Workmentioning

confidence: 99%

“…Dealing with probabilistic data has gained increasing attention these past few years in both static and streaming data management and mining [3], [9], [2], [11], [10]. There are many possible reasons for probabilistic data, such as noise occurring when data are collected, noise injected for privacy reasons, semantics of the results of a search engine (often ambiguous), etc.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Fast and Exact Mining of Probabilistic Data Streams

Akbarinia

Masséglia

2013

Advanced Information Systems Engineering

View full text Add to dashboard Cite

Abstract. Discovering Probabilistic Frequent Itemsets (PFI) is very challenging since algorithms designed for deterministic data are not applicable in probabilistic data. The problem is even more difficult for probabilistic data streams where massive frequent updates need to be taken into account while respecting data stream constraints. In this paper, we propose FEMP (Fast and Exact Mining of Probabilistic data streams), the first solution for exact PFI mining in data streams with sliding windows. FEMP allows updating the frequentness probability of an itemset whenever a transaction is added or removed from the observation window. Using these update operations, we are able to extract PFI in sliding windows with very low response times. Furthermore, our method is exact, meaning that we are able to discover the exact probabilistic frequentness distribution function for any monitored itemset, at any time. We implemented FEMP and conducted an extensive experimental evaluation over synthetic and real-world data sets; the results illustrate its very good performance.

show abstract

“…In [32], the authors consider the problem of identifying frequent itemsets in uncertain data streams. Uncertain data streams are processed through a sliding window containing a fixed number of batches (each batch contains a fixed number of transactions).…”

Section: Related Workmentioning

confidence: 99%

Sliding windows over uncertain data streams

et al. 2014

View full text Add to dashboard Cite

Uncertain data streams can have tuples with both value and existential uncertainty. A tuple has value uncertainty when it can assume multiple possible values. A tuple is existentially uncertain when the sum of the probabilities of its possible values is <1. A situation where existential uncertainty can arise is when applying relational operators to streams with value uncertainty. Several prior works have focused on querying and mining data streams with both value and existential uncertainty. However, none of them have studied, in depth, the implications of existential uncertainty on sliding window processing, even though it naturally arises when processing uncertain data. In this work, we study the challenges arising from existential uncertainty, more specifically the management of count-based sliding windows, which are a basic building block of stream processing applications. We extend the semantics of sliding window to define the novel concept of uncertain sliding windows and provide both exact and approximate algorithms for managing windows under existential uncertainty. We also show how current state-of-the-art techniques for answering similarity join queries can be easily adapted to be used with uncertain sliding windows. We evaluate our proposed techniques under a variety of configurations using real data. The results show that the algorithms used to maintain uncertain sliding windows can efficiently operate while providing a highquality approximation in query answering. In addition, we show that sort-based similarity join algorithms can perform better than index-based techniques (on 17 real datasets) when the number of possible values per tuple is low, as in many real-world applications.

show abstract

Mining of Frequent Itemsets from Streams of Uncertain Data

Cited by 79 publications

References 17 publications

Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web

Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web

Fast and Exact Mining of Probabilistic Data Streams

Sliding windows over uncertain data streams

Contact Info

Product

Resources

About