Comparison of algorithms for dissimilarity-based compound selection

Snarey, M.; Terrett, N.K.; Willett, Peter; Wilton, David J.

doi:10.1016/s1093-3263(98)00008-4

Cited by 191 publications

(156 citation statements)

References 18 publications

Supporting

Mentioning

154

Contrasting

Unclassified

“…We have already noted that the identification of the n most diverse molecules in a dataset containing N molecules is generally infeasible for non-trivial values of n and N (but see Section 4 below for an exception to this general rule), and practicable approaches to dissimilarity-based compound selection hence involve approximate methods that are not guaranteed to result in the identification of the most dissimilar possible subset (see, e.g., Bawden, 1993; Clark, 1997; Hudson et al, 1996; Lajiness, 1990; Marengo and Todeschini, 1992; Nilakantan et al, 1997; Pickett et al, 1998; Polinsky et al, 1996); that said, there is evidence to suggest that the subsets identified are only marginally sub-optimal (Gillet et al, 1997). Thus far, two major classes of algorithm have been described: maximum-dissimilarity algorithms and sphere-exclusion algorithms (Snarey et al, 1998). The basic maximum-dissimilarity algorithm for selecting a size-n Subset from a size-N Dataset is shown in Figure 1. This algorithm, which was first described by Kennard and Stone (1969) and which was applied to compound selection by Lajiness (1990) and Bawden (1993), permits many variants depending upon the precise implementation of Steps 1 and 3.…”

Section: Selection Of Compounds From a Database

mentioning

confidence: 99%

“…Holliday et al (1995) described a MaxSum selection algorithm with a time complexity of O(nN), using an equivalence that had been developed for the rapid implementation of hierarchic agglomerative document clustering using the group-average clustering method (Voorhees, 1986). However, an analysis of the MaxSum definition by Agrafiotis and Lobanov (1999) suggested that it could result in subsets containing groups of closely-related molecules, and this limitation was subsequently demonstrated by Snarey et al (1998) (Higgs et al, 1997; Polinsky et al, 1996) and the comparative evaluation of Snarey et al (1998) showed it to be more effective than MaxSum in identifying database subsets exhibiting a range of biological activities; accordingly, it is probably the method of choice for this class of selection algorithms.…”

Section: Insert Figure 1 About Here

mentioning

confidence: 99%
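
The greedy maximum-dissimilarity selection discussed in the statements above can be sketched as follows. Figure 1 is not reproduced on this page, so the loop structure here is an assumption based on the usual form of the algorithm: start from a seed, then repeatedly transfer the most dissimilar remaining item to the subset. The function name `greedy_select`, the 2-D toy points and the use of Euclidean distance are all illustrative choices, not taken from the cited papers. The `"maxsum"` criterion scores a candidate by its summed dissimilarity to the subset; the `"maxmin"` criterion scores it by its dissimilarity to its nearest neighbour in the subset.

```python
import math


def greedy_select(points, n, criterion="maxmin", seed=0):
    """Greedy maximum-dissimilarity selection (illustrative sketch).

    points    : list of coordinate tuples (a stand-in for molecular descriptors)
    n         : size of the subset to select
    criterion : "maxmin" (distance to nearest selected item) or
                "maxsum" (sum of distances to all selected items)
    seed      : index of the first item placed in the subset (assumed choice)
    """
    selected = [seed]
    remaining = [i for i in range(len(points)) if i != seed]

    while len(selected) < n and remaining:
        def score(candidate):
            dists = [math.dist(points[candidate], points[s]) for s in selected]
            return min(dists) if criterion == "maxmin" else sum(dists)

        # Transfer the highest-scoring remaining item to the subset.
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy data: item 2 lies very close to item 0, item 3 is midway between 0 and 1.
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (5.0, 0.0)]
print(greedy_select(points, 3, "maxsum"))  # [0, 1, 2]
print(greedy_select(points, 3, "maxmin"))  # [0, 1, 3]
```

On this toy data the summed criterion picks item 2, which sits next to the already selected item 0, whereas the nearest-neighbour criterion picks the midpoint. This is only a small illustration of how a sum-based score can admit closely related items; it is not a reproduction of any published experiment.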

“…Having provided a brief overview of the current status of computational tools for the analysis of molecular diversity, we now focus on dissimilarity-based methods for compound selection, illustrating the range of procedures that are available by reference to work carried out over the last three years in the University of Sheffield (Gardiner et al, 1998; Gillet et al, 1997, 1999; Holliday et al, 1995; Snarey et al, 1998).…”

mentioning

confidence: 99%

Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds

Willett

1999

Journal of Computational Biology

Self Cite

“…In subsequent stages, that non-excluded molecule is chosen for inclusion in the subset that has the largest dissimilarity to those molecules that have already been selected, and further molecules excluded if they are nearest neighbours of the one that has been chosen [85] (other approaches have also been described [86]). These approaches involve the identification of the most dissimilar molecule at each stage, and different results can be obtained depending on how 'most dissimilar' is defined: the MaxMin approach is widely used, and involves selecting that molecule for inclusion that has the maximum dissimilarity to its nearest neighbour in the current subset of selected molecules [87].…”

Section: Molecular Diversity Analysis

mentioning

confidence: 99%
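
The exclusion-based procedure described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions: "nearest neighbours" is interpreted as all molecules within a fixed dissimilarity `threshold` of the chosen one, the next molecule is chosen with the MaxMin rule quoted above, and the function name `sphere_exclusion`, the seed choice and the small dissimilarity matrix are hypothetical.

```python
def sphere_exclusion(dissim, threshold, seed=0):
    """Exclusion-based diverse subset selection (illustrative sketch).

    dissim    : symmetric matrix (list of lists) of pairwise dissimilarities
    threshold : molecules at or below this dissimilarity from a chosen
                molecule are treated as its neighbours and excluded (assumed)
    seed      : index of the first molecule chosen (assumed)
    """
    selected = []
    available = set(range(len(dissim)))
    current = seed

    while True:
        selected.append(current)
        # Exclude the chosen molecule and everything inside its sphere.
        available -= {i for i in available if dissim[i][current] <= threshold}
        if not available:
            break
        # MaxMin: pick the non-excluded molecule whose nearest selected
        # neighbour is farthest away (sorted() makes tie-breaking stable).
        current = max(
            sorted(available),
            key=lambda c: min(dissim[c][s] for s in selected),
        )

    return selected


# Hypothetical dissimilarities for five molecules (0 = identical, 1 = unrelated).
D = [
    [0.00, 0.40, 0.90, 0.80, 0.95],
    [0.40, 0.00, 0.90, 0.80, 0.90],
    [0.90, 0.90, 0.00, 0.50, 0.90],
    [0.80, 0.80, 0.50, 0.00, 0.85],
    [0.95, 0.90, 0.90, 0.85, 0.00],
]
print(sphere_exclusion(D, threshold=0.7))  # [0, 4, 2]
```

The size of the subset is not fixed in advance here; it follows from the threshold, which is one practical difference from the fixed-size greedy selection.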

Similarity‐based data mining in files of two‐dimensional chemical structures using fingerprint measures of molecular resemblance

Willett

2011

WIREs Data Min & Knowl

Self Cite

This paper reviews the use of measures of inter-molecular similarity for processing databases of chemical structures, which play an important role in the discovery of new drugs by the pharmaceutical industry. The similarity measures considered here are based on the use of a fingerprint representation of molecular structure, where a fingerprint is a vector encoding the presence of fragment substructures in a molecule and where the similarity between pairs of such fingerprints is computed using an association coefficient such as the Tanimoto coefficient. The Similar Property Principle provides the basic rationale for the use of similarity methods in three important chemoinformatics applications: similarity searching, database clustering, and molecular diversity analysis. Similarity searching enables the identification of those molecules in a database that are most similar to a user-defined, biologically active query molecule, with data fusion providing an effective way of combining the results of multiple similarity searches. Cluster analysis, typically using the Jarvis-Patrick, Ward or divisive k-means clustering methods, enables the cost-effective selection of molecules for biological testing, for property prediction and for investigating database overlap. Molecular diversity analysis, typically using cluster-based, dissimilarity-based or optimisation-based approaches, enables the identification of structurally diverse sets of molecules, so as to ensure that the full chemical space spanned by a database is tested in the search for novel bioactive molecules.
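
As a small illustration of the fingerprint-based similarity searching described in this abstract, the sketch below represents each fingerprint as a set of fragment identifiers (the "on" bits) and ranks a toy database by Tanimoto similarity to a query. The function names, molecule names and fragment identifiers are hypothetical; real systems typically use fixed-length bit vectors produced by a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two binary fingerprints held as sets."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def similarity_search(query, database, top_k=3):
    """Rank database molecules by decreasing similarity to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical fingerprints: each integer stands for one fragment substructure.
query = {1, 2, 3, 4}
database = {
    "mol_a": {1, 2, 3, 4, 5},  # 4 shared of 5 in the union
    "mol_b": {1, 2, 9},        # 2 shared of 5 in the union
    "mol_c": {7, 8},           # nothing shared
}
print(similarity_search(query, database, top_k=2))
# [('mol_a', 0.8), ('mol_b', 0.4)]
```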

“…In each case, ten representative reference structures from an activity class were chosen for searching: the choices were made using a MaxMin diversity selection procedure, to ensure that the reference structures covered the full range of structural types within each activity class [40]. The numbers of actives retrieved in these similarity searches were then averaged over the ten reference structures, using cut-offs of the top-1% and the top-5% of the similarity rankings.…”

Section: Datasets

mentioning

confidence: 99%
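
The evaluation step described in the statement above (counting actives within a percentage cut-off of each ranking, then averaging over the reference structures) might look like the sketch below. The function names and the toy rankings are hypothetical, and only two reference structures are used here rather than ten, to keep the example short.

```python
def actives_in_top(ranking, actives, percent):
    """Count active molecules in the top `percent` % of a similarity ranking."""
    # Integer arithmetic avoids floating-point surprises at the cut-off.
    cutoff = max(1, len(ranking) * percent // 100)
    return sum(1 for mol in ranking[:cutoff] if mol in actives)


def mean_actives(rankings, actives, percent):
    """Average the count over the rankings from several reference structures."""
    counts = [actives_in_top(r, actives, percent) for r in rankings]
    return sum(counts) / len(counts)


# Toy data: 100 molecule ids, two rankings (one per reference structure).
rankings = [list(range(100)), list(reversed(range(100)))]
actives = {0, 2, 4, 97, 99}

print(mean_actives(rankings, actives, percent=1))  # 1.0
print(mean_actives(rankings, actives, percent=5))  # 2.5
```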

Analysis and use of fragment-occurrence data in similarity-based virtual screening

Arif

Holliday

Willett

2009

J Comput Aided Mol Des

Self Cite

Current systems for similarity-based virtual screening use similarity measures in which all the fragments in a fingerprint contribute equally to the calculation of structural similarity. This paper discusses the weighting of fragments on the basis of their frequencies of occurrence in molecules. Extensive experiments with sets of active molecules from the MDL Drug Data Report and the World of Molecular Bioactivity databases, using fingerprints encoding Tripos holograms, Pipeline Pilot ECFC_4 circular substructures and Sunset Molecular keys, demonstrate clearly that frequency-based screening is generally more effective than conventional, unweighted screening. The results suggest that standardising the raw occurrence frequencies by taking the square root of the frequencies will maximise the effectiveness of virtual screening. An upper-bound analysis shows the complex interactions that can take place between representations, weighting schemes and similarity coefficients when similarity measures are computed, and provides a rationalisation of the relative performance of the various weighting schemes.
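
To make the square-root standardisation mentioned in this abstract concrete, the sketch below applies it to fragment occurrence counts before computing a similarity. The abstract does not state which coefficient form was used, so the continuous (non-binary) form of the Tanimoto coefficient here is an assumption, and the function names and fragment labels are hypothetical.

```python
import math


def sqrt_standardise(counts):
    """Replace each raw fragment occurrence count by its square root."""
    return {frag: math.sqrt(c) for frag, c in counts.items()}


def tanimoto_continuous(x, y):
    """Tanimoto coefficient for non-binary vectors stored as dicts:
    sum(x*y) / (sum(x^2) + sum(y^2) - sum(x*y))."""
    dot = sum(v * y[f] for f, v in x.items() if f in y)
    denom = sum(v * v for v in x.values()) + sum(v * v for v in y.values()) - dot
    return dot / denom if denom else 0.0


# Hypothetical occurrence counts: molecule a contains one fragment four times.
a = {"frag_1": 4, "frag_2": 1}
b = {"frag_1": 1, "frag_2": 1}

raw = tanimoto_continuous(a, b)
weighted = tanimoto_continuous(sqrt_standardise(a), sqrt_standardise(b))
print(round(raw, 3), round(weighted, 3))  # 0.357 0.75
```

In this toy case the square root damps the influence of the repeated fragment, so the two molecules look more alike than they do with raw counts.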
