Finding low-entropy sets and trees from binary data

Heikinheimo, Hannes; Hinkkanen, Eino; Mannila, Heikki; Mielikäinen, Taneli; Seppänen, Janne

doi:10.1145/1281192.1281232

Cited by 32 publications

(45 citation statements)

References 27 publications

“…Continuing the above line of research, Heikinheimo et al. define two related problems, namely, mining high- and low-entropy sets [5]. Zhang and Masseglia [6] extended their method to work on streaming data and proposed to reduce its output by removing similar sets according to criteria based on mutual information [20].…”

Section: Entropy-based Measures Of Itemset Interestingness (mentioning)

confidence: 99%

Diverse Dimension Decomposition of an Itemset Space

Tsytsarau

Bonchi

Gionis

et al. 2011

2011 IEEE 11th International Conference on Data Mining

We introduce the problem of diverse dimension decomposition in transactional databases. A dimension is a set of mutually-exclusive itemsets, and our problem is to find a decomposition of the itemset space into dimensions, which are orthogonal to each other, and that provide high coverage of the input database. The mining framework we propose effectively represents a dimensionality-reducing transformation from the space of all items to the space of orthogonal dimensions. Our approach relies on information-theoretic concepts, and we are able to formulate the dimension-finding problem with a single objective function that simultaneously captures constraints on coverage, exclusivity and orthogonality. We describe an efficient greedy method for finding diverse dimensions from transactional databases. The experimental evaluation of the proposed approach using two real datasets, flickr and del.icio.us, demonstrates the effectiveness of our solution. Although we are motivated by the applications in the collaborative tagging domain, we believe that the mining task we introduce in this paper is general enough to be useful in other application domains.

“…In other words, in each iteration the algorithm tries to reduce the total description length as much as possible. If a merge reduces the lowest description length seen yet, we remember it (6-7), and finally return the best clustering (10).…”

Section: Mining Attribute Clusterings (mentioning)

confidence: 99%

“…Most related to our method are low-entropy sets [10], itemsets for which the entropy of the data is below a given threshold. As entropy is strongly monotonically increasing, typically very many low-entropy sets are discovered even for low thresholds.…”

Section: Related Work (mentioning)

confidence: 99%
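
The definition quoted above (low-entropy sets as itemsets whose entropy in the data falls below a given threshold) can be illustrated with a minimal sketch. This is not the algorithm from the cited paper; it is a brute-force enumeration over a made-up toy dataset, and the function names `itemset_entropy` and `low_entropy_sets` are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def itemset_entropy(rows, items):
    """Empirical joint entropy (in bits) of the binary columns in `items`."""
    # Count how often each 0/1 pattern occurs when rows are projected onto `items`.
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    n = len(rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def low_entropy_sets(rows, size, threshold):
    """Brute force: all itemsets of the given size with entropy <= threshold."""
    n_items = len(rows[0])
    return [
        items
        for items in combinations(range(n_items), size)
        if itemset_entropy(rows, items) <= threshold
    ]


# Toy data (an assumption for illustration): columns 0 and 1 are identical,
# column 2 is independent of both.
rows = [
    (0, 0, 1),
    (0, 0, 0),
    (1, 1, 1),
    (1, 1, 0),
]

print(itemset_entropy(rows, (0, 1)))   # two equally likely patterns
print(itemset_entropy(rows, (0, 2)))   # four equally likely patterns
print(low_entropy_sets(rows, 2, 1.5))
```

Because adding an item to a set can never lower its joint entropy, any superset of a set that exceeds the threshold can be skipped, which is also why, as the quoted statement notes, many sets can qualify even at low thresholds.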

“…Instead, we view the data symmetrically with regard to 0s and 1s and aim to optimally group those items that interact most strongly. In this regard, our approach is also related to selecting low-entropy sets [10], itemsets that identify strong interactions in the data. An existing proposal to this end, LESS [11], requires a collection of low-entropy sets as input, and the resulting model cannot easily be queried.…”

Section: Introduction (mentioning)

confidence: 99%

Summarising Data by Clustering Items

Mampaey

Vreeken

2010

Machine Learning and Knowledge Discovery in Databases

For a book, the title and abstract provide a good first impression of what to expect from it. For a database, getting a first impression is not so straightforward. While low-order statistics only provide limited insight, mining the data quickly provides too much detail. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality summaries for binary data. Our method builds a summary by grouping items that strongly correlate, and uses the Minimum Description Length principle to identify the best grouping, without requiring a distance measure between items. Besides offering a practical overview of which attributes interact most strongly, these summaries are also easily-queried surrogates for the data. Experiments show that our method discovers high-quality results: correlated attributes are correctly grouped and the supports of frequent itemsets are closely approximated.

“…Ref. [12] proposed to find those low-entropy sets, and introduced two low entropy trees. They discussed properties of their trees and proposed some mining algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

Mining non-redundant diverse patterns: an information theoretic perspective

Sha

Gong

Zhou

2010

Front. Comput. Sci. China

The discovery of diversity patterns from binary data is an important data mining task. In this paper, we propose the problem of mining highly diverse patterns called non-redundant diversity patterns (NDPs). In this framework, entropy is adopted to measure the diversity of itemsets. In addition, an algorithm called NDP miner is proposed to exploit both monotone properties of entropy diversity measure and pruning power for the efficient discovery of non-redundant diversity patterns. Finally, our experimental results are given to show that the NDP miner can efficiently identify non-redundant diversity patterns.
