Proceedings of the 21st ACM International Conference on Information and Knowledge Management 2012
DOI: 10.1145/2396761.2398445

Efficient Jaccard-based diversity analysis of large document collections

Abstract: We propose two efficient algorithms for exploring topic diversity in large document corpora such as user generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity is useful for obtaining insights into knowledge evolution, trends, periodicities, and topic heterogeneity of such collections. Calculating diversity statistics requires averaging over the similarity of all object pairs, which, for large corpora, is prohibitive from a computational point of view. Our propo…

Cited by 12 publications (7 citation statements)
References 29 publications
“…The kappa can range between −1 and + 1, and the kappa result is explained as follows: if the values are ≤0, there is no agreement; between 0.01 and 0.20, there is minor arrangement; between 0.21 and 0.40, there is known fair agreement; between 0.41 and 0.60, there is moderate agreement, between 0.61 and 0.80, there is substantial agreement; and from 0.81 to 1.00, there is nearly perfect agreement [ 29 ]. The Jaccard index determines how close the commonality of the two datasets can be a measured [ 30 ]. The Jaccard coefficient is given in the following equation: …”
Section: Results (mentioning)
confidence: 99%
“…The above definition is used to compute the pairwise diversity of a set of paths P. This in general requires O(|P|²) computations. To avoid pairwise computations, one can use min-wise hashing [32].…”
Section: Path Diversity (mentioning)
confidence: 99%
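
The idea in the statement above, replacing the quadratic pairwise computation with min-wise hashing, can be illustrated with a minimal sketch. It is not taken from the cited paper or from the citing work. The function names, the affine hash family, and the parameter `k` are all assumptions chosen for illustration. The sketch relies on the standard property that two sets share the same min-hash value with probability roughly equal to their Jaccard similarity, so counting colliding pairs per hash function estimates the mean pairwise similarity without enumerating pairs.

```python
import hashlib
import random
from collections import Counter
from itertools import combinations

PRIME = (1 << 61) - 1  # modulus for the (assumed) affine hash family


def _base_hash(item):
    # Deterministic 64-bit hash of the item's string form, reduced mod PRIME.
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % PRIME


def exact_mean_pairwise_jaccard(sets):
    # Baseline: O(n^2) average over all pairs, for comparison only.
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def estimate_mean_pairwise_jaccard(sets, k=256, seed=0):
    # Estimate the mean pairwise Jaccard similarity of non-empty sets.
    # For each of k hash functions, sets are grouped by min-hash value;
    # a bucket holding c sets contributes c*(c-1)/2 colliding pairs.
    n = len(sets)
    if n < 2 or any(len(s) == 0 for s in sets):
        raise ValueError("need at least two non-empty sets")
    rng = random.Random(seed)
    hashed = [[_base_hash(x) for x in s] for s in sets]
    total_pairs = n * (n - 1) // 2
    colliding = 0
    for _ in range(k):
        a = rng.randrange(1, PRIME)
        b = rng.randrange(0, PRIME)
        buckets = Counter(min((a * h + b) % PRIME for h in hs) for hs in hashed)
        colliding += sum(c * (c - 1) // 2 for c in buckets.values())
    return colliding / (k * total_pairs)


if __name__ == "__main__":
    docs = [set(range(i, i + 20)) for i in range(0, 30, 5)]
    print("exact:    ", exact_mean_pairwise_jaccard(docs))
    print("estimated:", estimate_mean_pairwise_jaccard(docs, k=512))
```

The estimate costs one pass over the sets per hash function, so the work grows linearly with the number of sets. The price is sampling error that shrinks as `k` grows.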
“…To express the similarity between graph nodes, a meaningful similarity measure is required. One such measure is the Jaccard similarity, which has been applied successfully in areas such as duplicate detection [6,19], link prediction [15], similarity evaluation in wikipedia [4], triangle counting in massive graphs [5] and diversity analysis in documents [9].…”
Section: Related Work (mentioning)
confidence: 99%
“…In general, similarity is expressed by a function V × V → [0,1], where a value close to 0 means low similarity and a value close to 1 denotes a high similarity between a node pair. In this work, we express similarity by means of the Jaccard similarity coefficient, which enjoys a widespread use in diverse areas such as link prediction and recommendation [15], data cleaning [3], near duplicate detection [19], diversity analysis [9], whereas it is one of the most important measures for set similarity. We associate with each node u the set of its immediate neighbors N(u) (u inclusive).…”
Section: Introduction (mentioning)
confidence: 99%
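
The node similarity described in the statement above, the Jaccard coefficient over neighbor sets N(u) that include u itself, can be sketched as follows. The adjacency-dictionary representation and the function names are assumptions made for illustration. They are not drawn from the citing paper.

```python
def closed_neighborhood(adj, u):
    # N(u): the immediate neighbors of u together with u itself.
    return set(adj.get(u, ())) | {u}


def node_jaccard(adj, u, v):
    # Jaccard similarity |N(u) ∩ N(v)| / |N(u) ∪ N(v)|, a value in [0, 1].
    # The union is never empty because each set contains its own node.
    nu = closed_neighborhood(adj, u)
    nv = closed_neighborhood(adj, v)
    return len(nu & nv) / len(nu | nv)


if __name__ == "__main__":
    # Hypothetical undirected graph given as an adjacency dictionary.
    graph = {
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a", "b", "d"},
        "d": {"c"},
    }
    print(node_jaccard(graph, "a", "b"))  # N(a) and N(b) are both {a, b, c}
    print(node_jaccard(graph, "a", "d"))  # the two sets share only c
```

Including u in N(u) means that two adjacent nodes with otherwise identical neighbors reach similarity 1, which would not hold for open neighborhoods.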