Clustering in the presence of side information: a non-linear approach

Abin, Ahmad Ali

doi:10.1108/ijicc-04-2018-0046

Cited by 3 publications

(3 citation statements)

References 25 publications


“…Clustering categories in rows and/or columns of a contingency table is also desirable to enhance interpretability and transparency (Baesens et al 2003; Carrizosa et al 2017b, 2022; Goodman and Flaxman 2017; Ustun and Rudin 2016), by easing the presentation of the table as well as the conclusions of the analysis from a statistical perspective. Furthermore, constrained clustering allows the analyst to incorporate knowledge about the problem under study and support meaningful decision making (Abin 2019; Śmieja and Wiercioch 2017). However, it is known that the conclusions on independence depend, in general, on the granularity chosen for each of the categorical variables.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and ensure reasonable sample sizes in the cells of the clustered table from which trustful statistical conclusions can be derived. This constrained clustering approach allows us to incorporate background knowledge to support the analysis and extract meaningful conclusions (Abin 2019; Śmieja and Wiercioch 2017).…”

Section: Introduction (mentioning)

confidence: 99%


On mathematical optimization for clustering categories in contingency tables

Carrizosa

Guerrero

Morales

2022

Adv Data Anal Classif


Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. The purpose of this is to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in the Simpson paradox. In this paper we propose a methodology to, for a given contingency table and a fixed granularity, find a clustered table with the highest $\chi^2$ statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that it does not exist and that the two variables are dependent regardless of the size of the clustered table. For this problem, we propose an assignment mathematical formulation and a set partitioning one. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and ensure reasonable sample sizes in the cells of the clustered table from which trustful statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset of a medical study.
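To make the quantity in this abstract concrete, the following minimal sketch computes the Pearson $\chi^2$ statistic of a contingency table before and after merging row categories. It is not the assignment or set partitioning formulation proposed by the authors; the helper names `chi2_statistic` and `merge_rows` and the toy counts are assumptions made purely for illustration.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows/columns to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_rows(table, groups):
    """Sum the rows listed in each group, yielding one clustered row per group."""
    n_cols = len(table[0])
    return [[sum(table[r][c] for r in group) for c in range(n_cols)]
            for group in groups]


# Hypothetical 3x2 table: rows 0 and 1 have the same column profile (1:2).
table = [[10, 20],
         [20, 40],
         [30, 10]]

# Two candidate groupings of the three row categories into two clusters.
merged_similar = merge_rows(table, [[0, 1], [2]])
merged_dissimilar = merge_rows(table, [[0], [1, 2]])

chi2_full = chi2_statistic(table)
chi2_similar = chi2_statistic(merged_similar)
chi2_dissimilar = chi2_statistic(merged_dissimilar)
```

In this toy table, merging the two rows with proportional counts leaves the statistic unchanged, while the other grouping lowers it, which is why the choice of grouping at a fixed granularity matters. A must-link or cannot-link constraint would simply rule out some of the candidate `groups`.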


“…In contrast to some previous efforts that implicitly encode ML and CL constraints by modifying the graph Laplacian or constraining the underlying Eigenspace, they present a more natural and principled formulation, which explicitly encodes the constraints as part of a constrained optimization problem. Abin (2019) suggests a new perspective for constrained clustering by finding an effective transformation of data into target space on the reference of background knowledge. Most of the existing methods in constrained clustering are limited to learn a distance metric or kernel matrix from the background knowledge while looking for transformation of data in target space.…”

Section: Introduction (mentioning)

confidence: 99%

CDEC: a constrained deep embedded clustering

Amirizadeh

Boostani

2021

IJICC


Purpose: The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; also, the authors show that applying this information improves the performance of clustering and also increase the speed of the network training convergence.

Design/methodology/approach: In data mining, semisupervised learning is an interesting approach because good performance can be achieved with a small subset of labeled data; one reason is that the data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information of some pairs of instances (side information), and these instances maybe are in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering was studied extensively; however, little works have focused on constrained clustering for big datasets. In this paper, the authors have presented a constrained clustering for big datasets, and the method uses a DNN. The authors inject the constraints (ML and CL) to this DNN to promote the clustering performance and call it constrained deep embedded clustering (CDEC). In this manner, an autoencoder was implemented to elicit informative low dimensional features in the latent space and then retrain the encoder network using a proposed Kullback–Leibler divergence objective function, which captures the constraints in order to cluster the projected samples. The proposed CDEC has been compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means was applied to the known MNIST, Reuters-10k and USPS datasets, and their performance were assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC in terms of clustering accuracy to the counterparts.

Findings: First of all, this is the first DNN-constrained clustering that uses side information to improve the performance of clustering without using labels in big datasets with high dimension. Second, the author defined a formula to inject side information to the DNN. Third, the proposed method improves clustering performance and network convergence speed.

Originality/value: Little works have focused on constrained clustering for big datasets; also, the studies in DNNs for clustering, with specific loss function that simultaneously extract features and clustering the data, are rare. The method improves the performance of big data clustering without using labels, and it is important because the data labeling is expensive and time-consuming, especially for big datasets.
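The must-link (ML) and cannot-link (CL) side information described in this abstract can be illustrated with a short sketch that counts how many pairwise constraints a given cluster assignment violates. This is only an illustration of the constraint semantics, not the CDEC loss or any published implementation; the function name `count_violations` and the toy data are assumptions.

```python
def count_violations(labels, must_link, cannot_link):
    """Return (ML violations, CL violations) for a cluster assignment.

    labels[i] is the cluster index of instance i; each constraint is a pair
    of instance indices (a, b).
    """
    # A must-link pair is violated when its two instances land in different clusters.
    ml_violations = sum(1 for a, b in must_link if labels[a] != labels[b])
    # A cannot-link pair is violated when its two instances share a cluster.
    cl_violations = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return ml_violations, cl_violations


# Hypothetical assignment of four instances to two clusters.
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(2, 3), (0, 3)]

violations = count_violations(labels, must_link, cannot_link)
```

A constrained clustering method can use such counts as a hard feasibility check or, as in penalty-based approaches, fold them into the objective being optimized.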


The distance and entropy measures-based intuitionistic fuzzy C-means and similarity matrix clustering algorithms and their applications

Zhang

Huang

2025

Applied Soft Computing

