The study of structure-activity relationships (SARs) of small molecules is of fundamental importance in medicinal chemistry and drug design. Here, we introduce an approach that combines the analysis of similarity-based molecular networks and SAR index distributions to identify multiple SAR components present within sets of active compounds. Different compound classes produce molecular networks of distinct topology. Subsets of compounds related by different local SARs are often organized in small communities in networks annotated with potency information. Many local SAR communities are not isolated but connected by chemical bridges, i.e., similar molecules occurring in different local SAR contexts. The analysis makes it possible to relate local and global SAR features to each other and identify key compounds that are major determinants of SAR characteristics. In many instances, such compounds represent start and end points of chemical optimization pathways and aid in the selection of other candidates from their communities.
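The network construction described above can be illustrated with a minimal sketch. It assumes each compound is represented by a set of feature identifiers (a toy stand-in for a real fingerprint), that similarity is measured by the Tanimoto coefficient, and that two compounds are connected when their similarity reaches a chosen threshold. The compound names, feature sets, the threshold of 0.5, and the use of connected components as a crude proxy for communities are all illustrative assumptions, not details taken from the abstract.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_network(fingerprints, threshold):
    """Return an adjacency map linking compounds with similarity >= threshold."""
    edges = {name: set() for name in fingerprints}
    for u, v in combinations(fingerprints, 2):
        if tanimoto(fingerprints[u], fingerprints[v]) >= threshold:
            edges[u].add(v)
            edges[v].add(u)
    return edges


def connected_components(edges):
    """Group compounds into connected components with a depth-first search."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy data: feature-ID sets standing in for fingerprints.
toy_fingerprints = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 5, 6},
    "D": {7, 8, 9},
    "E": {7, 8, 10},
}
network = similarity_network(toy_fingerprints, threshold=0.5)
groups = connected_components(network)
```

In this toy set, compound B is similar to both A and C, while A and C are not directly connected, so B plays a role loosely analogous to a bridge between two local neighborhoods. Annotating nodes with potency values would be the next step toward the SAR analysis the abstract describes.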
Neural networks are powerful data mining tools with a wide range of applications in drug design. This paper largely concentrates on self-organizing neural networks that can be used for investigating datasets both by unsupervised and by supervised learning. The representation of chemical structures is the key to success in establishing useful relationships. Applications are shown for exploring different structure representations, for establishing quantitative structure-activity relationships and for handling compounds having multicategory activities. The applications comprise the separation of compounds according to different biological activities, the location of biologically active compounds in large chemical spaces, the analysis of high-throughput screening data and the classification of compounds according to mode of toxic action.
A hierarchical clustering algorithm, NIPALSTREE, was developed that is able to analyze large data sets in high-dimensional space. The result can be displayed as a dendrogram. At each tree level the algorithm projects a data set via principal component analysis onto one dimension. The data set is sorted according to this one dimension and split at the median position. To avoid distortion of clusters at the median position, the algorithm identifies a potentially better suited split point left or right of the median. The procedure is applied recursively to the resulting subsets until the maximal distance between cluster members no longer exceeds a user-defined threshold. The approach was validated in a retrospective screening study for angiotensin converting enzyme (ACE) inhibitors. The resulting clusters were assessed for their purity and enrichment in actives belonging to this ligand class. Enrichment was observed in individual branches of the dendrogram. In further retrospective virtual screening studies employing the MDL Drug Data Report (MDDR), COBRA, and the SPECS catalog, NIPALSTREE was compared with the hierarchical k-means clustering approach. Results show that both algorithms can be used in the context of virtual screening. Intersecting the result lists obtained with both algorithms improved enrichment factors while losing only a few chemotypes.
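The recursive procedure can be sketched in a few lines of NumPy. This is a simplified illustration, not the published implementation: it assumes the first principal component is obtained with a basic NIPALS iteration, that each subset is split exactly at the median (the refinement that searches for a better split point near the median is omitted because the abstract does not specify it), and that recursion stops once the maximal pairwise Euclidean distance within a subset is at or below the threshold. All function names and parameters are hypothetical.

```python
import numpy as np


def first_pc_scores(X, n_iter=200, tol=1e-10):
    """Scores on the first principal component via a basic NIPALS iteration."""
    Xc = X - X.mean(axis=0)
    # Start from the column with the largest variance.
    t = Xc[:, np.argmax(Xc.var(axis=0))].copy()
    for _ in range(n_iter):
        p = Xc.T @ t
        norm = np.linalg.norm(p)
        if norm == 0.0:  # all points identical: nothing to project
            return np.zeros(len(X))
        p /= norm  # unit-length loading vector
        t_new = Xc @ p
        converged = np.linalg.norm(t_new - t) < tol
        t = t_new
        if converged:
            break
    return t


def max_pairwise_distance(X):
    """Largest Euclidean distance between any two rows (O(n^2) memory)."""
    diffs = X[:, None, :] - X[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=-1)).max())


def split_recursively(X, idx, threshold):
    """Return leaf clusters as lists of row indices into X."""
    if len(idx) < 2 or max_pairwise_distance(X[idx]) <= threshold:
        return [idx.tolist()]
    scores = first_pc_scores(X[idx])
    order = idx[np.argsort(scores)]  # sort members along the first PC
    mid = len(order) // 2            # plain median split (no refinement)
    return (split_recursively(X, order[:mid], threshold)
            + split_recursively(X, order[mid:], threshold))


def nipalstree_sketch(X, threshold):
    """Cluster the rows of X; hypothetical entry point for this sketch."""
    X = np.asarray(X, dtype=float)
    return split_recursively(X, np.arange(len(X)), threshold)


# Toy example: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
clusters = nipalstree_sketch(points, threshold=1.0)
```

The sketch returns only the leaves of the tree. Recording the split at each level would give the dendrogram mentioned in the abstract, and the all-pairs distance check would need a cheaper approximation for large data sets.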