One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is the Jarvis-Patrick (J-P) method (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025-1034). The implementation of J-P under Daylight software, using Daylight's fingerprints and the Tanimoto similarity index, can deal with sets of 100 k molecules in a matter of a few hours. However, the J-P clustering algorithm has several associated problems which make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced depend strongly on the choice of the two parameters needed to run J-P clustering, so that the method tends to produce clusters which are either very large and heterogeneous or homogeneous but too small. In either case, J-P requires time-consuming manual tuning. This paper describes an algorithm which identifies dense clusters, in a consistent and automated manner, where similarity within each cluster reflects the Tanimoto value used for the clustering and, more importantly, where the cluster centroid is similar, at the given Tanimoto value or above, to every other molecule within the cluster. The term similarity, as used throughout this paper, refers to the overall similarity between two given molecules, as defined by Daylight's fingerprints and the Tanimoto similarity index.
INTRODUCTION
Clustering 2,3 has been described as 'the art of finding groups in data' 4 and is widely used within the pharmaceutical industry to design representative sets. The most common uses of representative sets are as training sets in the development of structure-activity models and as screening sets for biological screens. In both cases, one assumes that the cluster centroid is a good representative member of the corresponding cluster. It is therefore of great importance to be able to create homogeneous clusters in a consistent way and to deal with small and very large sets equally well. Our approach uses the desired similarity within the cluster, as defined by the Tanimoto index, as the only input to the clustering program.
METHODOLOGY
There are three key steps in this clustering approach:
1. generation of standard Daylight fingerprints (ASCII);
2. identification of potential cluster centroids;
3. clustering based on the exclusion spheres.

1. Generation of Fingerprints. Fingerprints for each molecule are generated, using Daylight software, as an ASCII string of 1's and 0's (fixed width of 1024). See Appendix 1 for more details on the concept of Daylight's fingerprints.

2. Identifying Potential Cluster Centroids. It is reasonable to postulate that the molecule within a given cluster which has the largest number of neighbors, and is therefore 'most like' the rest of the cluster, is a good choice to become the cluster centroid. To identify such molecules, we calculate the number of neighbors for each molecule in the set, at the Tanimoto level chosen for the clustering. The set is then sorted in descending order of the number of neighbors, so that the potential cluster centroids, ...
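
The neighbor counting and sorting of step 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the fingerprints are already available as ASCII strings of 1's and 0's (their generation with Daylight software is not shown), that a neighbor is any other molecule whose Tanimoto similarity is at or above the chosen level, and that a molecule is not counted as its own neighbor. The function names `tanimoto`, `neighbor_counts`, and `rank_candidates` are hypothetical, and the brute-force pairwise loop is chosen for clarity rather than speed.

```python
from typing import List


def _popcount(x: int) -> int:
    """Number of bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto index of two equal-length ASCII bit strings.

    Defined as (bits set in both) / (bits set in either).
    Two all-zero fingerprints are given similarity 0.0 (an assumption).
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same width")
    a, b = int(fp_a, 2), int(fp_b, 2)
    either = _popcount(a | b)
    return _popcount(a & b) / either if either else 0.0


def neighbor_counts(fps: List[str], level: float) -> List[int]:
    """For each molecule, count the other molecules with similarity >= level."""
    n = len(fps)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Similarity is symmetric, so each pair is evaluated once.
            if tanimoto(fps[i], fps[j]) >= level:
                counts[i] += 1
                counts[j] += 1
    return counts


def rank_candidates(fps: List[str], level: float) -> List[int]:
    """Molecule indices sorted by descending neighbor count.

    The sort is stable, so ties keep their input order (an assumption;
    the text does not say how ties are broken).
    """
    counts = neighbor_counts(fps, level)
    return sorted(range(len(fps)), key=lambda i: -counts[i])


if __name__ == "__main__":
    # Toy 8-bit fingerprints; real ones would be 1024 characters wide.
    toy = ["11110000", "11100000", "11110001", "00001111"]
    print(neighbor_counts(toy, 0.7))
    print(rank_candidates(toy, 0.7))
```

Converting each bit string to an integer lets the shared and combined bits be counted with bitwise AND and OR. For sets of the size discussed here (100 k molecules), the quadratic number of comparisons would dominate the run time, so a production version would need a faster similarity search than this sketch.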