Finding associations and computing similarity via biased pair sampling

Campagna, Andrea; Pagh, Rasmus

doi:10.1007/s10115-011-0428-y

Cited by 19 publications

(17 citation statements)

References 36 publications


“…In [19], authors derive an upper bound of Kulczynski, which was shown to be effective only for the comparatively high minimum support thresholds. The techniques based on sampling were recently proposed in [4], which are much faster, but at the cost of the incompleteness of results. Our approach works well for all null-invariant measures including Kulczynski and Cosine, which did not have efficient algorithms for low support, and it produces the complete results.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Mining of Top Correlated Patterns Based on Null-Invariant Measures

Kim, Barsky. 2011. Machine Learning and Knowledge Discovery in Databases.

Abstract. Mining strong correlations from transactional databases often leads to more meaningful results than mining association rules. In such mining, null (transaction)-invariance is an important property of the correlation measures. Unfortunately, some useful null-invariant measures such as Kulczynski and Cosine, which can discover correlations even for the very unbalanced cases, lack the (anti)-monotonicity property. Thus, they could only be applied to frequent itemsets as the post-evaluation step. For large datasets and for low supports, this approach is computationally prohibitive. This paper presents new properties for all known null-invariant measures. Based on these properties, we develop efficient pruning techniques and design the Apriori-like algorithm NICOMINER for mining strongly correlated patterns directly. We develop both the threshold-bounded and the top-k variations of the algorithm, where top-k is used when the optimal correlation threshold is not known in advance and to give user control over the output size. We test NICOMINER on real-life datasets from different application domains, using Cosine as an example of the null-invariant correlation measure. We show that NICOMINER outperforms support-based approach more than an order of magnitude, and that it is very useful for discovering top correlations in itemsets with low support.
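The null-invariance property mentioned in this abstract can be illustrated with a minimal sketch. It uses the standard two-item definitions of Cosine and Kulczynski; the function names, the toy transactions, and the contrast with lift (a measure not discussed in the abstract) are assumptions made for illustration, not NICOMINER itself.

```python
from math import sqrt

def supports(transactions, a, b):
    """Count transactions containing a, containing b, and containing both."""
    sup_a = sum(1 for t in transactions if a in t)
    sup_b = sum(1 for t in transactions if b in t)
    sup_ab = sum(1 for t in transactions if a in t and b in t)
    return sup_a, sup_b, sup_ab

def cosine(sup_a, sup_b, sup_ab):
    # sup(AB) / sqrt(sup(A) * sup(B)); the total transaction count never appears.
    return sup_ab / sqrt(sup_a * sup_b)

def kulczynski(sup_a, sup_b, sup_ab):
    # Average of the two conditional probabilities P(B|A) and P(A|B).
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def lift(sup_a, sup_b, sup_ab, n):
    # Shown only for contrast: lift depends on n, so it is not null-invariant.
    return sup_ab * n / (sup_a * sup_b)

# Hypothetical toy data: 'padded' adds 100 null transactions (neither a nor b).
base = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
padded = base + [{"c"}] * 100

for name, data in (("base", base), ("padded", padded)):
    s = supports(data, "a", "b")
    print(name, cosine(*s), kulczynski(*s), lift(*s, len(data)))
```

Adding transactions that contain neither item leaves Cosine and Kulczynski unchanged, while lift grows with the number of transactions.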


“…The overlap coefficient measure has the property that finding pairs having similarity over a certain threshold implies finding all association rules with confidence over the same threshold. As argued in [18], [19], Jaccard similarity can be handled via Dice similarity. Parameter s determines the space usage of the algorithm, which is O(n + s) words.…”

Section: Lower Bound (mentioning)

confidence: 99%
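The link between the overlap coefficient and rule confidence in the statement above can be checked on a small example. This is a sketch under the usual definitions (overlap coefficient as intersection size over the smaller set, confidence of i → j as intersection size over the support of i); the sets and function names are hypothetical.

```python
def overlap_coefficient(s_i, s_j):
    """|S_i ∩ S_j| / min(|S_i|, |S_j|), with S_x the set of transaction ids containing x."""
    return len(s_i & s_j) / min(len(s_i), len(s_j))

def confidence(s_from, s_to):
    """Confidence of the rule from -> to: |S_from ∩ S_to| / |S_from|."""
    return len(s_from & s_to) / len(s_from)

# Hypothetical occurrence sets: ids of the transactions containing each item.
s_i = {1, 2, 3, 4}
s_j = {2, 3, 4, 5, 6, 7, 8, 9}

print(overlap_coefficient(s_i, s_j))
print(confidence(s_i, s_j), confidence(s_j, s_i))
```

Because dividing by the smaller support gives the larger of the two ratios, the overlap coefficient equals the higher of the two rule confidences, so a threshold on one acts as a threshold on the other.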

“…For itemsets of size two (or more) the paper lacks a theoretical analysis of the proposed algorithm, but claims an empirical space usage bounded by m^3/k^3. Sampling according to the similarity: Our algorithm builds on top of an idea presented in [18], [19]. The sampling technique used in that algorithm is such that pairs are sampled a number of times that is proportional to their similarity.…”

Section: A Previous Work (mentioning)

confidence: 99%

“…(A more technical explanation can be found in section III-A where we improve the sampling procedure to make it suitable for a streaming environment.) The algorithms presented in [18], [19] have near-optimal running time, when no information on the distribution of similarities is given. As a matter of fact, the running time is linear in the size of the input and output (when there are many pairs of roughly the same similarity).…”

Section: A Previous Work (mentioning)

confidence: 99%

“…We base our technique on the sampling method of the BISAM algorithm [18], [19]. For each transaction, the pairs are sampled according to their support, such that the pair {i, j} is sampled with probability τ · f(|S_i|, |S_j|), where f is a function that depends on the similarity measure considered, and τ is a parameter that is used to control the sampling rate.…”

Section: A Pair Sampling (mentioning)

confidence: 99%
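The sampling rule quoted above can be sketched in a few lines. This is an illustration under stated assumptions, not the BISAM implementation: the function names are hypothetical, f is taken to be 1/sqrt(|S_i| · |S_j|) (an assumed choice that makes the expected sample count of a pair equal to τ times its cosine similarity), probabilities are clipped at 1, and every pair in a transaction is enumerated explicitly for clarity.

```python
import random
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine_weight(size_i, size_j):
    # Assumed f for cosine similarity: 1 / sqrt(|S_i| * |S_j|).
    return 1.0 / sqrt(size_i * size_j)

def biased_pair_sample(transactions, tau, f=cosine_weight, rng=None):
    """Sample each co-occurrence of {i, j} with probability min(1, tau * f(|S_i|, |S_j|))."""
    if rng is None:
        rng = random.Random()
    # First pass: support |S_i| of every item (number of transactions containing it).
    support = Counter(item for t in transactions for item in set(t))
    # Second pass: flip a biased coin for every pair occurrence.
    samples = Counter()
    for t in transactions:
        for i, j in combinations(sorted(set(t)), 2):
            p = min(1.0, tau * f(support[i], support[j]))
            if rng.random() < p:
                samples[(i, j)] += 1
    return samples

# Hypothetical toy input; a fixed seed keeps the run repeatable.
transactions = [["a", "b"], ["a", "b"], ["a", "c"], ["b", "c"]]
print(biased_pair_sample(transactions, tau=1.5, rng=random.Random(0)))
```

A pair that co-occurs in |S_i ∩ S_j| transactions is sampled about τ · f(|S_i|, |S_j|) · |S_i ∩ S_j| times in expectation when no clipping occurs, which is how the sample count tracks similarity.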


On Finding Similar Items in a Stream of Transactions

Campagna, Pagh. 2010. 2010 IEEE International Conference on Data Mining Workshops. Self-citation.

Abstract: While there has been a lot of work on finding frequent itemsets in transaction data streams, none of these solve the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets: Any algorithm that (even only approximately and with a chance of error) finds the most frequent k-itemset must use space Ω(min{mb, n^k, (mb/ϕ)^k}) bits, where mb is the number of items in the stream so far, n is the number of distinct items and ϕ is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random order, and show that surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy improves with the length of the stream for any fixed support threshold.


Improved Counter Based Algorithms for Frequent Pairs Mining in Transactional Data Streams

Kutzkov. 2012. Machine Learning and Knowledge Discovery in Databases.

A straightforward approach to frequent pairs mining in transactional streams is to generate all pairs occurring in transactions and apply a frequent items mining algorithm to the resulting stream. The well-known counter based algorithms Frequent and Space-Saving are known to achieve a very good approximation when the frequencies of the items in the stream adhere to a skewed distribution. Motivated by observations on real datasets, we present a general technique for applying Frequent and Space-Saving to transactional data streams for the case when the transactions vary considerably in length. Despite its simplicity, we show through extensive experiments that our approach is considerably more efficient and precise than the naïve application of Frequent and Space-Saving.
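The straightforward approach described at the start of this abstract can be sketched as follows. This shows only the naive baseline (generate every pair, feed the pair stream to Space-Saving), not the improved technique the paper proposes; the function names and toy transactions are assumptions for illustration.

```python
from itertools import combinations

def pairs_of(transactions):
    """Turn a stream of transactions into a stream of item pairs."""
    for t in transactions:
        yield from combinations(sorted(set(t)), 2)

def space_saving(stream, k):
    """Space-Saving with at most k counters; returned counts never underestimate."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Evict an element with the smallest count; the newcomer inherits it plus one.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters

# Hypothetical toy stream of transactions.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "b"], ["c", "d"]]
print(space_saving(pairs_of(transactions), k=2))
print(space_saving(pairs_of(transactions), k=10))
```

With enough counters the counts are exact; with few counters the estimates can exceed the true counts, while the counter values still sum to the number of pairs processed.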

