Abstract: We examine a new approach to building decision trees by introducing a geometric splitting criterion based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users, and it yields decision trees that are smaller and have fewer leaves than trees built with standard methods, with comparable or better accuracy.
“…We have shown in [14] that the conditional β-entropy enjoys the property specified next. Theorem 2.3 Let π, σ, σ ′ be three partitions of a finite set…”
Section: An Axiomatization Of Generalized Entropy (mentioning)
confidence: 99%
“…These metrics are used for a variety of data mining tasks ranging from clustering [7,15] to classification [13,14] and discretization [10].…”
Starting from an axiomatization of a generalization of Shannon entropy we introduce a set of axioms for a parametric family of distances over sets of partitions of finite sets. This family includes some well-known metrics used in data mining and in the study of finite functions.
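To make the idea of an entropy-based distance between partitions concrete, here is a minimal sketch in Python. It rests on assumptions not stated in the abstract above: that the generalized entropy takes the Havrda-Charvát form H_β(π) = (1 − Σ p_i^β) / (1 − 2^(1−β)), with Shannon entropy as the β → 1 limit, and that the distance is d_β(π, σ) = 2·H_β(π ∧ σ) − H_β(π) − H_β(σ), where π ∧ σ is the common refinement. The function names `beta_entropy` and `partition_distance` are illustrative only and do not come from the cited papers.

```python
from collections import Counter
from math import log2


def beta_entropy(labels, beta=2.0):
    """Generalized entropy of the partition induced by `labels`.

    Assumed form (Havrda-Charvat): (1 - sum p_i**beta) / (1 - 2**(1 - beta)).
    For beta == 1 the Shannon entropy (base 2) is used as the limit case.
    """
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    if abs(beta - 1.0) < 1e-12:
        return -sum(p * log2(p) for p in probs)
    return (1.0 - sum(p ** beta for p in probs)) / (1.0 - 2.0 ** (1.0 - beta))


def partition_distance(a, b, beta=2.0):
    """Assumed distance: 2*H(a ^ b) - H(a) - H(b), where a ^ b is the
    common refinement (two elements share a block only if they share a
    block in both a and b)."""
    refinement = list(zip(a, b))  # block id of each element in a ^ b
    return (2.0 * beta_entropy(refinement, beta)
            - beta_entropy(a, beta)
            - beta_entropy(b, beta))


# Two partitions of a four-element set, given as block labels per element.
pi = [0, 0, 1, 1]
sigma = [0, 1, 0, 1]
d = partition_distance(pi, sigma, beta=2.0)
```

Under these assumptions, β = 2 gives a Gini-style measure, and identical partitions are at distance zero. In a decision tree setting, one would compare the partition induced by a candidate attribute with the partition induced by the class labels.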
“…Before defining the distance between sparse context trees we introduce the notion of β-entropy of a tree τ . Following Simovici and Szymon (2006) we define, for all β > 0,…”
The goal of this paper is to study the similarity between sequences using a distance between the context trees associated to the sequences. These trees are defined in the framework of Sparse Probabilistic Suffix Trees (SPST), and can be estimated using the SPST algorithm. We implement the Phyl-SPST package to compute the distance between the sparse context trees estimated with the SPST algorithm. The distance takes into account the structure of the trees, and indirectly the transition probabilities. We apply this approach to reconstruct a phylogenetic tree of protein sequences in the globin family of vertebrates. We compare this tree with the one obtained using the well-known PAM distance.
“…With the notion of entropy and the definition of the maximal partition between two partitions, one derives the definition of distance introduced in (Simovici & Szymon, 2006). This is the distance we will use to study the similarity between protein sequences.…”
Section: A Metric Space of Trees (unclassified)
“…For this purpose, a distance between context trees, introduced in Simovici & Szymon (2006), is used. The study is carried out on sequences of globins and of fibroblast growth factors (FGF).…”