Rapid Quantification of Molecular Diversity for Selective Database Acquisition

Turner, David B.; Tyrrell, Simon; Willett, Peter

doi:10.1021/ci960463h

Cited by 105 publications

(96 citation statements)

References 13 publications


“…The validity of this assumption was challenged by Gillet et al (1997), who took three published combinatorial syntheses, generated libraries by both of the procedures described above, and then calculated the diversities of the two libraries using the diversity index described by Turner et al (1997): in all cases, the library L had a diversity that was greater than that of the library c. Thus, the greater effort involved in generating L, which involves the analysis of N₁N₂ product molecules as against the analysis of the N₁ + N₂ reactant molecules required to generate c, results in an increase in the diversity of the final library. However, while L is a library, it is not a combinatorial library in that it contains a maximally diverse set of independent product molecules, rather than a set that can be synthesised using a combinatorial reaction.…”

Section: Product-based Design Of Combinatorial Libraries

mentioning

confidence: 99%

“…The index used here was the mean pairwise dissimilarity (specifically the complement of the Tanimoto coefficient) when averaged over all the pairs of molecules in a size-n₁n₂ library, the molecules being represented by molecular fingerprints. This index is discussed by Pickett et al (1998) and Turner et al (1997) and was used here since it can be calculated very rapidly, a pre-requisite for use in a GA-based application where very large numbers of fitness values may need to be calculated. The GA operators are applied to maximise the average diversity and hence to identify the maximally diverse library.…”

Section: Product-based Design Of Combinatorial Libraries

mentioning

confidence: 99%
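
The index described in the statement above (the complement of the Tanimoto coefficient, averaged over all pairs of molecules) can be illustrated with a short sketch. This is not the cited authors' implementation: it assumes each fingerprint is stored as a set of on-bit positions, and the names `tanimoto`, `mean_pairwise_dissimilarity`, and `library` are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def mean_pairwise_dissimilarity(fingerprints):
    """Average of (1 - Tanimoto) over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # fewer than two molecules: no pairs to average
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy library of three made-up fingerprints (illustrative data only).
library = [
    frozenset({0, 1, 2, 3}),
    frozenset({2, 3, 4, 5}),
    frozenset({6, 7}),
]
diversity = mean_pairwise_dissimilarity(library)
print(round(diversity, 4))
```

This direct form visits every pair, so its cost grows quadratically with library size. It is meant only to make the definition concrete, not to reproduce the rapid calculation the statement refers to.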

“…This has led to the development of several diversity indices, which provide a single-number quantification of the degree of structural variation within a dataset. Examples of such approaches include a count of the number of bits that are set in the union of all of the fingerprints for a dataset (Martin et al, 1995), the number of distinct substructures that can be generated from all of the molecules in a dataset (Bone and Villar, 1997), the fraction of the bins in a partition that contain some minimal number of molecules (Pickett et al, 1996), and the sum of the pairwise inter-molecular dissimilarities for a dataset (Turner et al, 1997).…”

mentioning

confidence: 99%
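
The first index listed in the statement above, a count of the bits set in the union of all fingerprints, is simple enough to sketch directly. The example assumes fingerprints are sets of on-bit positions; `union_bit_count` and the toy data are hypothetical names and values, not taken from the cited work.

```python
def union_bit_count(fingerprints):
    """Number of distinct bit positions set in at least one fingerprint."""
    union = set()
    for fp in fingerprints:
        union |= fp
    return len(union)


# Made-up fingerprints: the second set adds a molecule with new bits.
narrow_set = [frozenset({0, 1, 2}), frozenset({1, 2, 3})]
broad_set = narrow_set + [frozenset({10, 11, 12})]

narrow_score = union_bit_count(narrow_set)
broad_score = union_bit_count(broad_set)
print(narrow_score, broad_score)
```

Under this index a dataset scores higher only when a molecule contributes bit positions that no other molecule has set, so exact duplicates add nothing.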


Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds

Willett

1999

Journal of Computational Biology

Self Cite


“…In other words, the more similar molecules are in the given set, the higher set similarity will be.[6,7] In our case, the tighter the clusters are (the more similar molecules are within the cluster), the higher set similarities will be obtained. For example:…”

Section: Appendix 4 Set Similarity

mentioning

confidence: 85%

Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets

Butina

1999

J. Chem. Inf. Comput. Sci.


One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is Jarvis-Patrick's (J-P) (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025-1034). The implementation of J-P under Daylight software, using Daylight's fingerprints and the Tanimoto similarity index, can deal with sets of 100k molecules in a matter of a few hours. However, the J-P clustering algorithm has several associated problems which make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced are greatly dependent on the choice of the two parameters needed to run J-P clustering, such that this method tends to produce clusters which are either very large and heterogeneous or homogeneous but too small. In any case, J-P always requires time-consuming manual tuning. This paper describes an algorithm which will identify dense clusters where similarity within each cluster reflects the Tanimoto value used for the clustering, and, more importantly, where the cluster centroid will be at least similar, at the given Tanimoto value, to every other molecule within the cluster in a consistent and automated manner. The similarity term used throughout this paper reflects the overall similarity between two given molecules, as defined by Daylight's fingerprints and the Tanimoto similarity index.

INTRODUCTION

Clustering [2,3] has been described as 'the art of finding groups in data' [4] and is widely used within the pharmaceutical industry to design different representative sets. Most common uses of representative sets could be as training sets in the development of different structure-activity models and for screening in different biological screens. In both cases, one would assume that the cluster centroid is a good representative member of the corresponding cluster. It is therefore of great importance to be able to create homogeneous clusters in a consistent way and to deal with either small or very large sets equally well. Our approach uses desired similarity within the cluster, as defined by Tanimoto index, as the only input to the clustering program.

METHODOLOGY

There are three key steps in this clustering approach: 1. generation of standard Daylight's fingerprints (ASCII); 2. identification of potential cluster centroids; 3. clustering based on the exclusion spheres.

1. Generation of Fingerprints. Fingerprints for each molecule are generated, using Daylight software, as an ASCII string of 1's and 0's (fixed width at 1024). See Appendix 1 for more details on the concept of Daylight's fingerprints.

2. Identifying Potential Cluster Centroids. It is reasonable to postulate that a molecule within a given cluster which has the largest number of neighbors and is therefore 'most like' the rest of the cluster is a good choice to become a cluster centroid. To identify such molecules, we calculate the number of neighbors for each molecule in the set, at the Tanimoto level chosen for the clustering. The set is then sorted in descending order, so that the potential cluster centroids, ...
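
The three steps named above can be sketched as follows. The excerpt is truncated before the exclusion-sphere step is described, so the assignment rule used here (each centroid claims its still-unassigned neighbors, and molecules with no neighbors end up as singletons) is an assumption, not the paper's stated procedure. Fingerprints are represented as sets of on-bit positions rather than Daylight ASCII strings, and all function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sphere_exclusion_clusters(fingerprints, threshold):
    """Cluster by neighbor count; returns lists of indices, centroid first."""
    n = len(fingerprints)
    # Neighbor lists at the chosen Tanimoto level (step 2 in the text).
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Sort in descending order of neighbor count: best centroid candidates first.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue  # already inside an earlier centroid's sphere
        # Assumed exclusion step: claim every neighbor not yet clustered.
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Made-up fingerprints forming two obvious groups.
fps = [
    frozenset({0, 1, 2, 3}),
    frozenset({0, 1, 2, 4}),
    frozenset({0, 1, 2, 5}),
    frozenset({10, 11}),
    frozenset({10, 11, 12}),
]
clusters = sphere_exclusion_clusters(fps, threshold=0.5)
print(clusters)
```

Because every member is within the threshold of its centroid by construction, the Tanimoto level is the only input, which matches the stated aim of avoiding manual parameter tuning.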


“…To quantify the diversity of a set of structures, we calculated the average of Tc (avTc) with all possible pairs of structures, which is widely used as a measure of diversity. [36,37] The DAECS generates various structures by searching several areas on the chemical space. Therefore, the generated structures with several targets are integrated to con…”

Section: Special Issue Japan

mentioning

confidence: 99%

Development of a New De Novo Design Algorithm for Exploring Chemical Space

Mishima

Kaneko

Funatsu

2014

Molecular Informatics


In the first stage of development of new drugs, various lead compounds with high activity are required. To design such compounds, we focus on chemical space defined by structural descriptors. New compounds close to areas where highly active compounds exist will show the same degree of activity. We have developed a new de novo design system to search a target area in chemical space. First, highly active compounds are manually selected as initial seeds. Then, the seeds are entered into our system, and structures slightly different from the seeds are generated and pooled. Next, seeds are selected from the new structure pool based on the distance from target coordinates on the map. To test the algorithm, we used two datasets of ligand binding affinity and showed that the proposed generator could produce diverse virtual compounds that had high activity in docking simulations.
