2011
DOI: 10.1007/s10115-011-0428-y

Finding associations and computing similarity via biased pair sampling

Abstract: Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to find high-confidence associa…

Cited by 19 publications (17 citation statements)
References 36 publications
“…In [19], authors derive an upper bound of Kulczynski, which was shown to be effective only for the comparatively high minimum support thresholds. The techniques based on sampling were recently proposed in [4], which are much faster, but at the cost of the incompleteness of results. Our approach works well for all null-invariant measures including Kulczynski and Cosine, which did not have efficient algorithms for low support, and it produces the complete results.…”
Section: Related Work (mentioning)
confidence: 99%
“…The overlap coefficient measure has the property that finding pairs having similarity over a certain threshold implies finding all association rules with confidence over that the same threshold. As argued in [18], [19], Jaccard similarity can be handled via dice similarity. parameter s determines the space usage of the algorithm, which is O(n + s) words.…”
Section: Lower Bound (mentioning)
confidence: 99%
“…For itemsets of size two (or more) the paper lacks a theoretical analysis of the proposed algorithm, but claims an empirical space usage bounded by m^3/k^3. Sampling according to the similarity: Our algorithms builds on top of an idea presented in [18], [19]. The sampling technique used in that algorithm is such that pairs are sampled a number of times that is proportional to their similarity.…”
Section: A Previous Work (mentioning)
confidence: 99%