Yuqiang Guan scite author profile

A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel k-means are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different methods--in particular, a general weighted kernel k-means objective is mathematically equivalent to a weighted graph clustering objective. We exploit this equivalence to develop a fast, high-quality multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, and ratio association criteria. This eliminates the need for any eigenvector computation for graph clustering problems, which can be prohibitive for very large graphs. Previous multilevel graph partitioning methods, such as Metis, have suffered from the restriction of equal-sized clusters; our multilevel algorithm removes this restriction by using kernel k-means to optimize weighted graph cuts. Experimental results show that our multilevel algorithm outperforms a state-of-the-art spectral clustering algorithm in terms of speed, memory usage, and quality. We demonstrate that our algorithm is applicable to large-scale clustering tasks such as image segmentation, social network analysis and gene network analysis.

show abstract

Minimum Sum-Squared Residue Co-clustering of Gene Expression Data

Cho

Dhillon²,

Guan

et al. 2004

232

229

View full text Add to dashboard Cite

Microarray experiments have been extensively used for simultaneously measuring DNA expression levels of thousands of genes in genome research. A key step in the analysis of gene expression data is the clustering of genes into groups that show similar expression values over a range of conditions. Since only a small subset of the genes participate in any cellular process of interest, by focusing on subsets of genes and conditions, we can lower the noise induced by other genes and conditions -a co-cluster characterizes such a subset of interest. Cheng and Church [3] introduced an effective measure of co-cluster quality based on mean squared residue. In this paper, we use two similar squared residue measures and propose two fast k-means like co-clustering algorithms corresponding to the two residue measures. Our algorithms discover k row clusters and l column clusters simultaneously while monotonically decreasing the respective squared residues. Our co-clustering algorithms inherit the simplicity, efficiency and wide applicability of the k-means algorithm. Minimizing the residues may also be formulated as trace optimization problems that allow us to obtain a spectral relaxation that we use for a principled initialization for our iterative algorithms. We further enhance our algorithms by an incremental local search strategy that helps avoid empty clusters and escape poor local minima. We illustrate co-clustering results on a yeast cell cycle dataset and a human B-cell lymphoma dataset. Our experiments show that our co-clustering algorithms are efficient and are able to discover coherent co-clusters.

show abstract

Efficient Clustering of Very Large Document Collections

Dhillon¹,

Fan²,

Guan³

2001

150

View full text Add to dashboard Cite

An invaluable portio~of scientific data occurs naturally in text form.Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented -a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.

show abstract

A fast kernel-based multilevel algorithm for graph clustering

Dhillon

Guan

Kulis

2005

View full text Add to dashboard Cite

Graph clustering (also called graph partitioning)-clustering the nodes of a graph-is an important problem in diverse data mining applications. Traditional approaches involve optimization of graph clustering objectives such as normalized cut or ratio association; spectral methods are widely used for these objectives, but they require eigenvector computation which can be slow. Recently, graph clustering with a general cut objective has been shown to be mathematically equivalent to an appropriate weighted kernel k-means objective function. In this paper, we exploit this equivalence to develop a very fast multilevel algorithm for graph clustering. Multilevel approaches involve coarsening, initial partitioning and refinement phases, all of which may be specialized to different graph clustering objectives. Unlike existing multilevel clustering approaches, such as METIS, our algorithm does not constrain the cluster sizes to be nearly equal. Our approach gives a theoretical guarantee that the refinement step decreases the graph cut objective under consideration. Experiments show that we achieve better final objective function values as compared to a state-of-the-art spectral clustering algorithm: on a series of benchmark test graphs with up to thirty thousand nodes and one million edges, our algorithm achieves lower normalized cut values in 67% of our experiments and higher ratio association values in 100% of our experiments. Furthermore, on large graphs, our algorithm is significantly faster than spectral methods. Finally, our algorithm requires far less memory than spectral methods; we cluster a 1.2 million node movie network into 5000 clusters, which due to memory requirements cannot be done directly with spectral methods.

show abstract

Iterative clustering of high dimensional text data augmented by local search

Dhillon

Guan

Kogan

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yuqiang Guan

Weighted Graph Cuts without Eigenvectors A Multilevel Approach

Minimum Sum-Squared Residue Co-clustering of Gene Expression Data

Efficient Clustering of Very Large Document Collections

A fast kernel-based multilevel algorithm for graph clustering

Iterative clustering of high dimensional text data augmented by local search

Contact Info

Product

Resources

About