We studied the similarity search performance of differently designed molecular fingerprints using multiple reference structures and different search strategies. For this purpose, nine compound activity classes were assembled that consisted exclusively of molecules with different core structures and that represented different levels of intra-class structural diversity. Thus, there was a strict one-to-one correspondence between test compounds and core structures. Analysis of unique core structures was found to be a better measure of class diversity than distributions of simplified scaffolds. On increasingly diverse classes, a trainable fingerprint using a unique search strategy performed better than the others tested herein. Overall, clear preferences were detected for nearest-neighbor search strategies over fingerprint-averaging techniques. Nearest-neighbor searching that relied on selecting database compounds most similar to one of the reference structures often improved compound recovery over averaging methods, but at the cost of decreasing the ability to detect hits that were structurally distinct from reference molecules.
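The contrast between nearest-neighbor and fingerprint-averaging strategies can be illustrated with a minimal sketch. It is not the protocol used in the study: the function names, the toy 4-bit fingerprints, and the use of the continuous Tanimoto coefficient to compare a candidate against an averaged (centroid) fingerprint are all assumptions made for illustration.

```python
from typing import List, Sequence


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Continuous Tanimoto coefficient; equals the binary form for 0/1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def nearest_neighbor_score(candidate: Sequence[float],
                           references: List[Sequence[float]]) -> float:
    """1-NN strategy: score a candidate by its most similar reference."""
    return max(tanimoto(candidate, ref) for ref in references)


def centroid_score(candidate: Sequence[float],
                   references: List[Sequence[float]]) -> float:
    """Averaging strategy: compare the candidate to the mean reference fingerprint."""
    n = len(references)
    centroid = [sum(bits) / n for bits in zip(*references)]
    return tanimoto(candidate, centroid)


# Hypothetical toy data: two structurally unrelated references.
references = [[1, 1, 0, 0], [0, 0, 1, 1]]
close_to_one = [1, 1, 0, 0]   # identical to the first reference
in_between = [1, 0, 1, 0]     # shares one bit with each reference

nn_close = nearest_neighbor_score(close_to_one, references)
nn_between = nearest_neighbor_score(in_between, references)
avg_close = centroid_score(close_to_one, references)
avg_between = centroid_score(in_between, references)
```

In this toy case the nearest-neighbor score clearly separates the candidate that matches one reference from the one that lies between both, whereas the centroid score rates the two candidates equally. This mirrors the trade-off described above: nearest-neighbor scoring rewards closeness to an individual reference molecule.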
Recent attempts to increase similarity search performance using molecular fingerprints have mostly focused on the evaluation of alternative similarity metrics or scoring schemes, rather than the development of new types of fingerprints. Here, we introduce a novel 2D fingerprint design (property descriptor value range-derived fingerprint or PDR-FP) that involves activity-oriented selection of property descriptors and the transformation of descriptor value ranges into a binary format such that each fingerprint bit position represents a specific value interval. The design is tailored toward multiple-template similarity searching and permits training on specific activity classes. In search calculations on 15 compound classes of increasing structural diversity, the PDR fingerprint performed better than other state-of-the-art 2D fingerprints. Among the structurally diverse classes were six compound sets with peptide character, which represent a notoriously difficult chemotype for 2D similarity searching. In these cases, PDR-FP produced promising results, whereas other fingerprint methods mostly failed. PDR-FP is specifically designed for search calculations on structurally diverse compounds, and these calculations are not influenced by molecular size effects, which represent a general problem for similarity searching using bit string representations.
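The idea that each bit position represents a value interval of a property descriptor can be sketched as follows. This is only an illustration under stated assumptions, not the published PDR-FP encoding: equal-width intervals, four bins per descriptor, the descriptor names `logp` and `mw`, and their value ranges are all hypothetical choices.

```python
from typing import Dict, List, Tuple


def encode_descriptor(value: float, low: float, high: float, n_bins: int) -> List[int]:
    """Set the single bit whose equal-width interval contains the value."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)  # out-of-range values map to the edge bins
    index = min(int((clamped - low) / (high - low) * n_bins), n_bins - 1)
    bits = [0] * n_bins
    bits[index] = 1
    return bits


def encode_molecule(descriptors: Dict[str, float],
                    ranges: Dict[str, Tuple[float, float]],
                    n_bins: int = 4) -> List[int]:
    """Concatenate the per-descriptor bit segments in a fixed (sorted) order."""
    fingerprint: List[int] = []
    for name in sorted(ranges):
        low, high = ranges[name]
        fingerprint.extend(encode_descriptor(descriptors[name], low, high, n_bins))
    return fingerprint


# Hypothetical descriptor ranges and one hypothetical molecule.
ranges = {"logp": (0.0, 8.0), "mw": (100.0, 500.0)}
molecule = {"logp": 3.0, "mw": 450.0}
fp = encode_molecule(molecule, ranges)
```

In this sketch every descriptor sets exactly one bit, so all fingerprints contain the same number of on-bits regardless of molecular size.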
Recently, systematic similarity calculations using Tversky coefficients have suggested that putting higher weight on the bit settings of active reference molecules (templates) than on those of database compounds increases hit rates in similarity searching using 2D fingerprints. These findings have been interpreted as evidence for "asymmetry" in chemical similarity searching. We have thoroughly analyzed this phenomenon and demonstrate that apparent asymmetry in similarity search calculations is a direct consequence of differences in fingerprint bit densities, which often correlate with differences in molecular size. Accordingly, a size-independent fingerprint with constant bit density does not produce asymmetrical search results. For Tversky similarity calculations, differences in fingerprint bit densities between active and inactive compounds determine which weighting factors produce high hit rates.
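The dependence of Tversky weighting on bit density can be made concrete with a small sketch. It uses the standard Tversky form c / (alpha * (a - c) + beta * (b - c) + c), where a and b are the numbers of bits set in the template and the database compound and c is the number of shared bits. The function name and the toy bit sets are hypothetical.

```python
from typing import Set


def tversky(template: Set[int], candidate: Set[int], alpha: float, beta: float) -> float:
    """Tversky similarity on sets of on-bit positions.

    alpha weights bits unique to the template, beta weights bits unique to
    the candidate; alpha = beta = 1 gives the Tanimoto coefficient.
    """
    common = len(template & candidate)
    only_template = len(template - candidate)
    only_candidate = len(candidate - template)
    denom = alpha * only_template + beta * only_candidate + common
    return common / denom if denom else 0.0


# Hypothetical fingerprints with different bit densities.
template = {0, 1, 2, 3}
sparse = {0, 1}                       # fewer bits set than the template
dense = {0, 1, 2, 3, 4, 5, 6, 7}      # more bits set than the template

# Full weight on template-unique bits favors the dense compound ...
template_weighted = (tversky(template, sparse, 1.0, 0.0),
                     tversky(template, dense, 1.0, 0.0))
# ... full weight on candidate-unique bits favors the sparse one.
candidate_weighted = (tversky(template, sparse, 0.0, 1.0),
                      tversky(template, dense, 0.0, 1.0))

# With equal bit counts, swapping alpha and beta leaves the score unchanged.
same_density = {2, 3, 4, 5}
forward = tversky(template, same_density, 0.9, 0.1)
swapped = tversky(template, same_density, 0.1, 0.9)
```

When both fingerprints have the same number of bits set, the two "unique" counts are equal, so swapping the weights cannot change the score. This is consistent with the observation above that a fingerprint with constant bit density does not produce asymmetrical results.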