In this paper we consider a clustering problem that arises in qualitative data analysis. This problem can be transformed to a combinatorial optimization problem, the clique partitioning problem. We have studied the latter problem from a polyhedral point of view and determined large classes of facets of the associated polytope. These theoretical results are utilized in this paper. We describe a cutting plane algorithm that is based on the simplex method and uses exact and heuristic separation routines for some of the classes of facets mentioned before. We discuss some details of the implementation of our code and present our computational results. We mention applications from, e.g., zoology, economics, and the political sciences.
Introduction

The need to analyse data that arise from measuring a number of characteristics (or attributes) associated with each object of a given set occurs very frequently in sociology, zoology, economics, and many other sciences. The areas of study concerned with this type of problem are known as data analysis, multivariate analysis, and taxonomy.

We consider here a problem occurring in qualitative data analysis, of the following type.
Given a data set consisting of the description of a set of objects with respect to a number of characteristics, find a best partition of the object set into "homogeneous" disjoint classes (or clusters).

In this paper we give a precise formulation of this clustering problem and show how it can be reduced to a graph optimization problem, which we call the clique partitioning problem (CPP). This is done in Section 2. In Section 3 we summarize some results of Grötschel and Wakabayashi (1987) on the polyhedron associated with the CPP. In Section 4 we describe a cutting plane algorithm for the CPP which is based on these theoretical results. Finally, in Section 5 we report on the computational results with our code. Many applications from zoology, marketing, and the political sciences are given, and the optimization process for each of these applications is illustrated.

M. Grötschel and Y. Wakabayashi / On a clustering problem
1. Definitions and notation

We assume that the reader is familiar with the basic concepts of graph theory. The definitions not given here can be found in Bondy and Murty (1976). All graphs we consider are simple. We denote a graph $G$ with node set $V$ and edge set $E$ by $G = (V, E)$. An edge $e$ with endnodes $u$ and $v$ is denoted by $uv$. If $S$ is a node set of $G = (V, E)$ then we denote the set of edges in $G$ with both endnodes in $S$ by $E(S)$, that is,

$$E(S) = \{uv \in E \mid u, v \in S\}.$$

Moreover, if $S_1, \ldots, S_k$ are subsets of $V$ then

$$E(S_1, \ldots, S_k) := \bigcup_{i=1}^{k} E(S_i).$$
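The two edge-set definitions above can be illustrated with a minimal Python sketch. It is not part of the original paper: the function names `edges_within` and `edges_within_all`, the representation of an edge as a `frozenset` of its two endnodes, and the four-node example graph are all assumptions made here for illustration.

```python
from itertools import combinations


def edges_within(edges, nodes):
    """Return E(S): the edges whose two endnodes both lie in the node set S."""
    s = set(nodes)
    # An edge is a frozenset {u, v}; it belongs to E(S) iff {u, v} is a subset of S.
    return {e for e in edges if e <= s}


def edges_within_all(edges, *node_sets):
    """Return E(S_1, ..., S_k): the union of E(S_i) over all given node sets."""
    result = set()
    for s in node_sets:
        result |= edges_within(edges, s)
    return result


# Hypothetical example: the complete graph on the nodes 1, 2, 3, 4.
V = [1, 2, 3, 4]
E = {frozenset(pair) for pair in combinations(V, 2)}

inside = edges_within(E, {1, 2, 3})          # edges 12, 13, 23
split = edges_within_all(E, {1, 2}, {3, 4})  # edges 12, 34
```

Storing each edge as an unordered pair mirrors the notation uv for an edge of a simple graph, so uv and vu are the same object and no orientation has to be tracked.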
If $S, T \subseteq V$ and $S \cap T = \emptyset$ then

$$[S : T] := \{uv \mid u \in S,\ v \in T\}$$

denotes the set of edges with one endnode in $S$ and the other in $T$.

A graph is called complete if every pair of its nodes is linked by an edge. A clique is a subgraph of a graph that is complete (a clique is not necessarily a maximal complete subgraph). We will denote the (up to isomorphism unique) complete graph with $n$ nodes by $K_n$ = ...