Persistent homology is constrained to purely topological persistence, while multiscale graphs account only for geometric information. This work introduces persistent spectral theory to create a unified low‐dimensional multiscale paradigm for revealing topological persistence and extracting geometric shapes from high‐dimensional datasets. For a point‐cloud dataset, a filtration procedure is used to generate a sequence of chain complexes and associated families of simplicial complexes and chains, from which we construct persistent combinatorial Laplacian matrices. We show that a full set of topological persistence can be completely recovered from the harmonic persistent spectra, that is, the spectra that have zero eigenvalues, of the persistent combinatorial Laplacian matrices. However, non‐harmonic spectra of the Laplacian matrices induced by the filtration offer another powerful tool for data analysis, modeling, and prediction. In this work, fullerene stability is predicted by using both harmonic spectra and non‐harmonic persistent spectra, while the latter spectra are successfully devised to analyze the structure of fullerenes and model protein flexibility, which cannot be straightforwardly extracted from the current persistent homology. The proposed method is found to provide excellent predictions of the protein B‐factors for which current popular biophysical models break down.
Current drug discovery is expensive and time-consuming. It remains a challenging task to create a wide variety of novel compounds that not only have desirable pharmacological properties but also are cheaply available to low-income people. In this work, we develop a generative network complex (GNC) to generate new drug-like molecules based on the multiproperty optimization via the gradient descent in the latent space of an autoencoder. In our GNC, both multiple chemical properties and similarity scores are optimized to generate drug-like molecules with desired chemical properties. To further validate the reliability of the predictions, these molecules are reevaluated and screened by independent 2D fingerprint-based predictors to come up with a few hundreds of new drug candidates. As a demonstration, we apply our GNC to generate a large number of new BACE1 inhibitors, as well as thousands of novel alternative drug candidates for eight existing market drugs, including Ceritinib, Ribociclib, Acalabrutinib, Idelalisib, Dabrafenib, Macimorelin, Enzalutamide, and Panobinostat.
The ability of molecular property prediction is of great significance to drug discovery, human health, and environmental protection. Despite considerable efforts, quantitative prediction of various molecular properties remains a challenge. Although some machine learning models, such as bidirectional encoder from transformer, can incorporate massive unlabeled molecular data into molecular representations via a self-supervised learning strategy, it neglects three-dimensional (3D) stereochemical information. Algebraic graph, specifically, element-specific multiscale weighted colored algebraic graph, embeds complementary 3D molecular information into graph invariants. We propose an algebraic graph-assisted bidirectional transformer (AGBT) framework by fusing representations generated by algebraic graph and bidirectional transformer, as well as a variety of machine learning algorithms, including decision trees, multitask learning, and deep neural networks. We validate the proposed AGBT framework on eight molecular datasets, involving quantitative toxicity, physical chemistry, and physiology datasets. Extensive numerical experiments have shown that AGBT is a state-of-the-art framework for molecular property prediction.
Recently, molecular fingerprints extracted from three-dimensional (3D) structures using advanced mathematics, such as algebraic topology, differential geometry, and graph theory have been paired with efficient machine learning, especially deep learning algorithms to outperform other methods in drug discovery applications and competitions. This raises the question of whether classical 2D fingerprints are still valuable in computer-aided drug discovery. This work considers 23 datasets associated with four typical problems, namely protein-ligand binding, toxicity, solubility and partition coefficient to assess the performance of eight 2D fingerprints. Advanced machine learning algorithms including random forest, gradient boosted decision tree, single-task deep neural network and multitask deep neural network are employed to construct efficient 2D-fingerprint based models. Additionally, appropriate consensus models are built to further enhance the performance of 2D-fingerprintbased methods. It is demonstrated that 2D-fingerprint-based models perform as well as the state-of-the-art 3D structure-based models for the predictions of toxicity, solubility, partition coefficient and protein-ligand binding affinity based on only ligand information. However, 3D structure-based models outperform 2D fingerprint-based methods in complex-based protein-ligand binding affinity predictions. * Corresponding to Guo-Wei Wei.Molecular fingerprints are one way of encoding the structural features of a molecule. They play a fundamental role in QSAR/QSPR analysis, virtual screening, similarity-based compound search, target molecule ranking, drug ADMET prediction, and other drug discovery processes. Molecular fingerprints are property profiles of a molecule, usually in the form of vectors with each vector element indicating the existence, the degree or the frequency of one particular structure feature. [12][13][14] Various fingerprints have been developed for molecular feature encoding in the past few decades. [15][16][17] Most fingerprints are 2D fingerprints which can be extracted from molecular connection tables without 3D structure information. However, high dimensional fingerprints have also been developed to utilize 3D molecular structure and other information. 18 There are four main categories of 2D fingerprints, namely substructure key-based fingerprints, topological or path-based fingerprints, circular fingerprints, and pharmacophore fingerprints. Substructure key-based fingerprints are bit strings representing the presence of certain substructures or fragments from a given list of structural keys in the compound. Molecular access system (MACCS) 19 is one of the most popular substructure keybased fingerprint methods. Topological or path-based fingerprints are based on analyzing all the fragments of a molecule following a (usually linear) path up to a certain number of bonds, and then hashing every one of these paths to create one fingerprint. The most prominent ones in this category are FP2, 20 Daylight 21 and electrotopological sta...
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithm, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches, including algebraic topology, differential geometry, and graph theory. We elucidate how the physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis on the protein-ligand binding predictions in this review although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energy, toxicity, partition coefficient, protein folding stability changes upon mutation, etc.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.