A stability based method for discovering structure in clustered data

Ben-Hur, Asa; Elisseeff, André; Guyon, Isabelle

doi:10.1142/9789812799623_0002

Cited by 299 publications

(347 citation statements)

References 13 publications

Supporting

Mentioning

345

Contrasting


“…In spite of its straightforwardness, the proposed measure has revealed useful for analyzing the structure of patients clusters, as shown by our experiments. Nevertheless, if the main goal is to estimate the "natural" or "optimal" number of clusters we suggest to use also other more principled global measures based on distribution of some property of the data, such as measures based on distribution of pairwise similarity between clusterings of subsamples of a dataset [30].…”

Section: Discussion (mentioning)

confidence: 99%

“…The Smolkin and Gosh method based on random subspace does not provide a technique to estimate the number of clusters. As suggested by the authors, the model explorer algorithm [30] has been applied to estimate the correct number of cluster. The model explorer algorithm is specifically designed to estimate only the number of cluster (no estimation of the reliability of each individual cluster is provided) and it exploits the overall distribution of the similarity measures to asses the stability of the clustering.…”

Section: Experimental Comparison With Other Stability-based Methods (mentioning)

confidence: 99%
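The idea quoted above, judging a candidate number of clusters by the distribution of similarities between clusterings of subsamples, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the cited paper: the tiny 1-D k-means (`kmeans_1d`), the co-membership Jaccard index, the 0.8 subsampling fraction, the function names, and the toy data.

```python
import random
from itertools import combinations

def kmeans_1d(values, k, iters=25):
    """Tiny 1-D k-means (illustrative); centers start at evenly spaced quantiles."""
    s = sorted(values)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda c: abs(v - centers[c]))].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
    return [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]

def jaccard_comembership(labels_a, labels_b, shared):
    """Jaccard index over pairs of shared points that are placed in the same cluster."""
    both = either = 0
    for i, j in combinations(shared, 2):
        a = labels_a[i] == labels_a[j]
        b = labels_b[i] == labels_b[j]
        both += a and b
        either += a or b
    return both / either if either else 1.0

def stability_scores(data, k, n_pairs=20, fraction=0.8, seed=0):
    """Similarity of clusterings for n_pairs pairs of random subsamples."""
    rng = random.Random(seed)
    n, m = len(data), int(fraction * len(data))
    scores = []
    for _ in range(n_pairs):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, kmeans_1d([data[i] for i in idx_a], k)))
        lab_b = dict(zip(idx_b, kmeans_1d([data[i] for i in idx_b], k)))
        shared = sorted(set(idx_a) & set(idx_b))  # compare only on common points
        scores.append(jaccard_comembership(lab_a, lab_b, shared))
    return scores

# Toy data (assumed): two well-separated groups of 30 points each.
data = [0.1 * i for i in range(30)] + [10 + 0.1 * i for i in range(30)]
scores_k2 = stability_scores(data, k=2)
scores_k3 = stability_scores(data, k=3)
print("k=2 mean similarity:", sum(scores_k2) / len(scores_k2))
print("k=3 mean similarity:", sum(scores_k3) / len(scores_k3))
```

In a sketch like this, a value of k whose scores concentrate near 1 would be read as a stable choice, while a spread-out distribution suggests the partition depends on which points were sampled.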

“…For instance the stability of hierarchical clustering [48] as well as of more general clustering methods [49] have been addressed, but with respect to the overall partition, while less work has been dedicated to the evaluation of the stability of the individual clusters. Several methods proposed to evaluate the "natural" number of clusters, ranging from strategies that attempt to maximize measures of cluster compactness [50] to jackknife and resampling-based approaches [27,30,31].…”

Section: Related Work (mentioning)

confidence: 99%

“…5): their results support clusterings with N ≤ 9 clusters, but they do not provide an individual cluster stability measure. Note that the application of the Ben-Hur et al method [30], based on bootstrapping techniques to estimate the "natural" number of clusters, found N = 4 clusters as the most reliable number of estimated clusters in the data. [ Table 4 …”

Section: Melanoma Patients (mentioning)

confidence: 99%

“…Most of the works focused on the estimate of the number of clusters in gene expression data [27,[29][30][31][32], while the problem of stability of each individual cluster has been less investigated. Nevertheless, the stability and reliability of the obtained clusters is crucial to assess the confidence and the significance of a bio-medical discovery [33,34].…”

Section: Introduction (mentioning)

confidence: 99%


Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses

Bertoni

Valentini

2006

Artificial Intelligence in Medicine


Objective: Clustering algorithms may be applied to the analysis of DNA microarray data to identify novel subgroups that may lead to new taxonomies of diseases defined at bio-molecular level. A major problem related to the identification of biologically meaningful clusters is the assessment of their reliability, since clustering algorithms may find clusters even if no structure is present.

Methodology: Recently, methods based on random "perturbations" of the data, such as bootstrapping, noise-injection techniques and random subspace methods, have been applied to the problem of cluster validity estimation. In this framework, we propose stability measures that exploit the high dimensionality of DNA microarray data and the redundancy of information stored in microarray chips. To this end we randomly project the original gene expression data into lower dimensional subspaces, approximately preserving the distance between the examples according to the Johnson-Lindenstrauss (JL) theory. The stability of the clusters discovered in the original high dimensional space is estimated by comparing them with the clusters discovered in randomly projected lower dimensional subspaces. The proposed cluster-stability measures may be applied to validate and to quantitatively assess the reliability of the clusters obtained by a large class of clustering algorithms.

Results and conclusion: We tested the effectiveness of our approach with high dimensional synthetic data, whose distribution is a priori known, showing that the stability measures based on randomized maps correctly predict the number of clusters and the reliability of each individual cluster. Then we showed how to apply the proposed measures to the analysis of DNA microarray data, whose underlying distribution is unknown. We evaluated the validity of clusters discovered by hierarchical clustering algorithms in diffuse large B-cell lymphoma (DLBCL) and malignant melanoma patients, showing that the proposed reliability measures can support bio-medical researchers in the identification of stable clusters of patients and in the discovery of new subtypes of diseases characterized at bio-molecular level.
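The abstract's premise, that a random projection into a lower dimensional subspace approximately preserves distances between examples, can be checked numerically with a small sketch. This is not the authors' stability measure: the Gaussian projection matrix scaled by 1/sqrt(d_out), the dimensions (400 to 100), the synthetic points, and all function names are assumptions chosen for illustration.

```python
import math
import random

def random_projection_matrix(d_in, d_out, rng):
    """Gaussian random matrix scaled so squared lengths are preserved on average."""
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]

def project(matrix, point):
    """Map a d_in-dimensional point to d_out dimensions."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]

def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

rng = random.Random(0)
d_in, d_out, n_points = 400, 100, 8  # assumed sizes, kept small for speed

# Synthetic high dimensional examples (stand-ins for expression profiles).
points = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n_points)]
matrix = random_projection_matrix(d_in, d_out, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected to original distance for every pair of examples.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(n_points)
    for j in range(i + 1, n_points)
]
print("min ratio:", min(ratios), "max ratio:", max(ratios))
```

Ratios close to 1 indicate that pairwise distances, and therefore distance-based clusterings, are only mildly perturbed by the projection; a stability analysis along the lines of the abstract would then compare clusterings across many such projections.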


Characterizing Heterogeneity in Neuroimaging, Cognition, Clinical Symptoms, and Genetics Among Patients With Late-Life Depression

Wen

Tosun

et al. 2022

JAMA Psychiatry


IMPORTANCE Late-life depression (LLD) is characterized by considerable heterogeneity in clinical manifestation. Unraveling such heterogeneity might aid in elucidating etiological mechanisms and support precision and individualized medicine.

OBJECTIVE To cross-sectionally and longitudinally delineate disease-related heterogeneity in LLD associated with neuroanatomy, cognitive functioning, clinical symptoms, and genetic profiles.

DESIGN, SETTING, AND PARTICIPANTS The Imaging-Based Coordinate System for Aging and Neurodegenerative Diseases (iSTAGING) study is an international multicenter consortium investigating brain aging in pooled and harmonized data from 13 studies with more than 35 000 participants, including a subset of individuals with major depressive disorder. Multimodal data from a multicenter sample (N = 996), including neuroimaging, neurocognitive assessments, and genetics, were analyzed in this study. A semisupervised clustering method (heterogeneity through discriminative analysis) was applied to regional gray matter (GM) brain volumes to derive dimensional representations. Data were collected from July 2017 to July 2020 and analyzed from July 2020 to December 2021.

MAIN OUTCOMES AND MEASURES Two dimensions were identified to delineate LLD-associated heterogeneity in voxelwise GM maps, white matter (WM) fractional anisotropy, neurocognitive functioning, clinical phenotype, and genetics.

RESULTS A total of 501 participants with LLD (mean [SD] age, 67.39 [5.56] years; 332 women) and 495 healthy control individuals (mean [SD] age, 66.53 [5.16] years; 333 women) were included. Patients in dimension 1 demonstrated relatively preserved brain anatomy without WM disruptions relative to healthy control individuals. In contrast, patients in dimension 2 showed widespread brain atrophy and WM integrity disruptions, along with cognitive impairment and higher depression severity. Moreover, 1 de novo independent genetic variant (rs13120336; chromosome: 4, 186387714; minor allele, G) was significantly associated with dimension 1 (odds ratio, 2.35; SE, 0.15; P = 3.14 × 10⁻⁸) but not with dimension 2. The 2 dimensions demonstrated significant single-nucleotide variant-based heritability of 18% to 27% within the general population (N = 12 518 in UK Biobank). In a subset of individuals having longitudinal measurements, those in dimension 2 experienced a more rapid longitudinal change in GM and brain age (Cohen f² = 0.03; P = .02) and were more likely to progress to Alzheimer disease (Cohen f² = 0.03; P = .03) compared with those in dimension 1 (N = 1431 participants and 7224 scans from the Alzheimer's Disease Neuroimaging Initiative [ADNI], Baltimore Longitudinal Study of Aging [BLSA], and Biomarkers for Older Controls at Risk for Dementia [BIOCARD] data sets).

CONCLUSIONS AND RELEVANCE This study characterized heterogeneity in LLD into 2 dimensions with distinct neuroanatomical, cognitive, clinical, and genetic profiles. This dimensional approach provides a potential mechanism for investigating the heterogeneity of LLD and ...


The structure and dynamics of cocitation clusters: A multiple‐perspective cocitation analysis

Chen

Ibekwe-Sanjuan

Hou

2010

J. Am. Soc. Inf. Sci.

1,367

980


A multiple-perspective co-citation analysis method is introduced for characterizing and interpreting the structure and dynamics of co-citation clusters. The method facilitates analytic and sense-making tasks by integrating network visualization, spectral clustering, automatic cluster labeling, and text summarization. Co-citation networks are decomposed into co-citation clusters. The interpretation of these clusters is augmented by automatic cluster labeling and summarization. The method focuses on the interrelations between a co-citation cluster's members and their citers. The generic method is applied to a three-part analysis of the field of Information Science as defined by 12 journals published between 1996 and 2008: 1) a comparative author co-citation analysis (ACA), 2) a progressive ACA of a time series of co-citation networks, and 3) a progressive document co-citation analysis (DCA). Results show that the multiple-perspective method increases the interpretability and accountability of both ACA and DCA networks.
