rCOSA: A Software Package for Clustering Objects on Subsets of Attributes

Kampert, Maarten M.; Meulman, Jacqueline J.; Friedman, Jerome H.

doi:10.1007/s00357-017-9240-z

Cited by 7 publications

(12 citation statements)

References 24 publications


“…1, region ML and FL), COSA (clustering objects on subsets of attributes) analysis was used. This analysis is appropriate when the differentiation among groups of objects is unclear and when there are objects that do not clearly belong to any of the groups (Kampert, Meulman & Friedman, 2016). It is a method involving iterations that minimize the distance among individuals.…”

Section: Methods (mentioning)

confidence: 99%


Darwin’s naturalization hypothesis does not explain the spread of nonnative weed species naturalized in México

Sánchez-Blanco

Vega-Peña

Espinosa‐García

2018

PeerJ


Background: Despite numerous tests of Darwin’s naturalization hypothesis (DNH), evidence for its support or rejection is still contradictory. We tested a DNH-derived prediction stating that nonnative species (NNS) without native congeneric relatives (NCR) will spread to a greater number of localities than species with close relatives in the new range. This test controlled the effect of residence time (Rt) on the spread of NNS and used naturalized species beyond their lag phase to avoid the effect of stochastic events in the establishment and the lag phases that could obscure the NCR effects on NNS.

Methods: We compared the number of localities (spread) occupied by NNS with and without NCR using 13,977 herbarium records for 305 NNS of weeds. We regressed the number of localities occupied by NNS versus Rt to determine the effect of time on the spread of NNS. Then, we selected the species with Rt greater than the expected span of the lag phase, whose residuals were above and below the regression confidence limits; these NNS were classified as widespread (those occupying more localities than expected by Rt) and limited-spread (those occupying fewer localities than expected). These sets were again subclassified into two groups: NNS with and without NCR at the genus level. The number of NNS with and without NCR was compared using χ² tests and Spearman correlations between the residuals and the number of relatives. Then, we grouped the NNS using 34 biological attributes and five usages to identify the groups’ possible associations with spread and to test DNH. To identify species groups, we performed a nonmetric multidimensional scaling (NMDS) analysis and evaluated the influences of the number of relatives, localities, herbarium specimens, Rt, and residuals of regression. The Spearman correlation and the Mann–Whitney U test were used to determine if the DNH prediction was met. Additionally, we used the clustering objects on subsets of attributes (COSA) method to identify possible syndromes (sets of biological attributes and usages) associated with four groups of NNS useful to test DNH (those with and without NCR and those in more and fewer localities than expected by Rt).

Results: Residence time explained 33% of the variation in localities occupied by nonnative trees and shrubs and 46% of the variation for herbs and subshrubs. The residuals of the regression for NNS were not associated with the number or presence of NCR. In each of the NMDS groups, the number of localities occupied by NNS with and without NCR did not significantly differ. The COSA analysis detected that only NNS with NCR in more and fewer localities than expected share biological attributes and usages, but they differ in their relative importance.

Discussion: Our results suggest that DNH does not explain the spread of naturalized species in a highly heterogeneous country. Thus, the presence of NCR is not a useful characteristic in risk analyses for naturalized NNS.



“…(1): where d_ijk is the dissimilarity between objects i and j as evaluated for attribute k, and N is the number of objects. This method was implemented using the software rCOSA (Kampert, Meulman & Friedman, 2016). The NNS were grouped a priori based on two criteria: occurrence in more/fewer localities and the presence/absence of NCR.…”

Section: Methods (mentioning)

confidence: 99%
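
The citation statement above refers to a per-attribute dissimilarity between objects i and j on attribute k, but the equation it belongs to is cut off in the excerpt and is not reconstructed here. As a rough illustration only, the sketch below computes one plausible per-attribute dissimilarity for numeric data and combines it with attribute weights. The scaling (mean absolute difference over all N² object pairs), the function names, and the toy data are all assumptions made for this example; none of this is the rCOSA interface.

```python
def attribute_dissimilarities(data):
    """Return d[i][j][k] = |x_ik - x_jk| / s_k for a list of numeric rows.

    Assumption: s_k is the mean absolute difference on attribute k taken
    over all N^2 ordered pairs of objects, so each attribute is on a
    comparable scale.
    """
    n = len(data)
    p = len(data[0])
    scales = []
    for k in range(p):
        total = sum(abs(data[i][k] - data[j][k])
                    for i in range(n) for j in range(n))
        scales.append(total / (n * n))
    return [[[abs(data[i][k] - data[j][k]) / scales[k] if scales[k] > 0 else 0.0
              for k in range(p)]
             for j in range(n)]
            for i in range(n)]


def weighted_distance(d, weights, i, j):
    """Combine per-attribute dissimilarities of objects i and j using weights."""
    return sum(w * d_k for w, d_k in zip(weights, d[i][j]))


# Hypothetical toy data: 3 objects, 2 attributes on very different scales.
data = [[0.0, 10.0], [1.0, 10.0], [2.0, 40.0]]
d = attribute_dissimilarities(data)

# Putting all weight on attribute 0 ignores attribute 1 entirely.
only_first = weighted_distance(d, [1.0, 0.0], 0, 2)
```

Giving each attribute its own weight is what lets a distance focus on a subset of attributes; how the weights are chosen is exactly the part this sketch leaves out.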

Darwin’s naturalization hypothesis does not explain the spread of nonnative weed species naturalized in México

Sánchez-Blanco

Vega-Peña

Espinosa‐García

2018

PeerJ


“…The testing procedure proposed by Janitza et al is particularly appealing here, but it is expected to work only if the holdout permutation importance is centred around 0 and symmetric for noisy variables. Following Kampert et al, 102 samples were generated having 500 normally distributed attributes. The first 50 attributes are generated from normal distributions with different means so that three groups containing 34 samples each are generated (μ_1 = −1.2, μ_2 = 0, μ_3 = 1.2, and σ = 0.2) with a total of 500 attributes; 450 of which are normally distributed noise that do not contribute to the clusters.…”

Section: Applications (mentioning)

confidence: 99%
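
The simulation design quoted above can be sketched in a few lines. The group sizes, means, and σ come from the excerpt. The noise distribution (standard normal), the seed, and the function name are assumptions made for this example, since the excerpt only says the 450 noise attributes are normally distributed.

```python
import random


def simulate_clustered_data(seed=0):
    """Generate 102 samples x 500 attributes with 3 groups of 34 samples.

    The first 50 attributes carry the group structure; the remaining 450
    are noise. Assumption: noise is drawn from N(0, 1).
    """
    rng = random.Random(seed)
    means = [-1.2, 0.0, 1.2]        # group means from the quoted design
    sigma = 0.2                     # within-group standard deviation
    group_size, n_signal, n_noise = 34, 50, 450

    rows, labels = [], []
    for group, mu in enumerate(means):
        for _ in range(group_size):
            signal = [rng.gauss(mu, sigma) for _ in range(n_signal)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]  # assumed N(0, 1)
            rows.append(signal + noise)
            labels.append(group)
    return rows, labels


rows, labels = simulate_clustered_data(seed=0)
```

Only 10% of the attributes separate the groups, which is why a distance computed over all 500 attributes tends to hide the cluster structure.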

On the behaviour of permutation‐based variable importance measures in random forest clustering

Nembrini

2019

Journal of Chemometrics


Unsupervised random forest (RF) is a popular clustering method that can be implemented by artificially creating a two-class problem. Variable importance measures (VIMs) can be used to determine which variables are relevant for defining the RF dissimilarity, but they have not received as much attention as the supervised case. Here, I show that sampling schemes used in generating the artificial data, including the original one, can influence the behaviour of the permutation importance in a way that can affect conclusions on variable relevance and also propose a solution. Generating the artificial data using a Bayesian bootstrap keeps the desirable properties of the permutation VIM.

Keywords: random forest clustering, variable importance measures, variable selection

Journal of Chemometrics. 2019;33:e3135. wileyonlinelibrary.com/journal/cem


“…For categorical data, the dissimilarity function called simple matching (KAUFMAN; ROUSSEEUW, 1990) is commonly used to measure the difference between qualitative attributes. Such a function is expressed by:…”

Section: Dissimilarity Functions (mentioning)

confidence: 99%
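
The formula itself is cut off in the excerpt above. As a hedged illustration, the sketch below uses the usual textbook form of simple matching for categorical data: the share of attributes on which two objects disagree. The function name and example values are hypothetical, and the thesis may state the function in a different but equivalent form.

```python
def simple_matching_dissimilarity(a, b):
    """Fraction of categorical attributes on which objects a and b differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    if not a:
        raise ValueError("objects must have at least one attribute")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Hypothetical objects described by three qualitative attributes.
obj_1 = ("red", "small", "round")
obj_2 = ("red", "large", "round")
one_mismatch = simple_matching_dissimilarity(obj_1, obj_2)  # differs on 1 of 3
```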

“…It is very common to use such a strategy to define the number of clusters a dataset might have, although this can also be done using the third strategy, the relative performance measures. Some examples of internal performance measures are silhouette width (ROUSSEEUW, 1987; KAUFMAN; ROUSSEEUW, 1990), Dunn index (DUNN, 1974), Davies-Bouldin index (DAVIES; BOULDIN, 1979), also known as DB index, PBM index (PAKHIRA; BANDYOPADHYAY; MAULIK, 2004), and c-index (HUBERT; LEVIN, 1975). For more information about internal performance measures, see Vendramin, Campello and Hruschka (2010) and Xiong and Li (2014).…”

Section: Performance Measures (mentioning)

confidence: 99%
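
Of the internal measures listed above, silhouette width is perhaps the simplest to compute by hand. The sketch below follows the standard definition, s(i) = (b − a) / max(a, b), where a is the mean distance from object i to its own cluster and b is the smallest mean distance to any other cluster. The use of Euclidean distance, the convention of scoring singleton clusters as 0, the function name, and the toy points are assumptions made for this example.

```python
from math import dist


def silhouette_widths(points, labels):
    """Return the silhouette width of each point (Euclidean distance assumed)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    n = len(points)
    widths = []
    for i in range(n):
        own = [dist(points[i], points[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            widths.append(0.0)  # singleton cluster: scored 0 by convention
            continue
        a = sum(own) / len(own)
        # Smallest mean distance from point i to the members of another cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in range(n) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_widths(points, [0, 0, 1, 1])   # labels match the structure
bad = silhouette_widths(points, [0, 1, 0, 1])    # labels cut across it
```

Averaging the widths for several candidate numbers of clusters and keeping the largest average is one common way to use this measure to choose the number of clusters.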

Contributions to the mixed data clustering problem: from a conceptual codification and classification proposal to the usage of optimization methods

Fróes


This thesis is dedicated to my parents, Andrea and Antonio, my sister Thais, my brother Rafael, my husband Thiago. Each line of this document shows the support I received from all of you and is a way of expressing the love I feel for each one of you. "Every garden begins with a dream of love. Before any tree is planted or any lake is built, the trees and the lakes must have been born within the soul. Whoever has no gardens within does not plant gardens outside, nor stroll through them..." Rubem Alves
