Natural products are a rich source of bioactive compounds with valuable applications in fields such as food, agriculture, and medicine. For natural product discovery, high-throughput in silico screening offers a cost-effective alternative to traditional, resource-heavy, assay-guided exploration of structurally novel chemical space. In this data descriptor, we report a characterized database of 67,064,204 natural product-like molecules generated using a recurrent neural network trained on known natural products, a 165-fold expansion in library size over the approximately 400,000 known natural products. This study highlights the potential of deep generative models to explore novel natural product chemical space for high-throughput in silico discovery.
Natural products have proven to be valuable, particularly in the fields of drug discovery and chemogenomics. Tandem mass spectrometry, together with reference mass spectral libraries, is frequently used to help characterize natural products present in unknown complex mixtures. Because current spectral libraries contain only a small percentage of known natural products, their continual expansion is crucial for accurate molecular identification. However, expanding them experimentally is often expensive and time-consuming. This study explores the use of ab initio molecular dynamics (AIMD) simulations, based on the lightweight GFN2-xTB semiempirical Hamiltonian, to generate mass spectra for small natural product molecules. Through this approach, more than 2,700 unique mass spectra were generated and analysed in relation to the Global Natural Products Social Molecular Networking (GNPS) database. The study found that AIMD performs relatively well (mean cosine similarity score of 0.68), with improved performance for aromatic molecules but limitations for molecules with carboxylic acid groups. Other key findings relating to experimental and simulated conditions led to several recommendations for future work in this area. Overall, AIMD showed strong potential for developing a putative natural product mass spectral library.
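To make the cosine similarity score concrete, the following is a minimal sketch of how a simulated spectrum might be compared with an experimental one. It assumes a plain binned cosine with a hypothetical bin width of 1.0 m/z; the abstract does not state the exact scoring procedure (peak tolerance, intensity weighting, or any modified-cosine variant), so the function names, the bin width, and the example peak lists are all illustrative assumptions rather than the study's method.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (m/z, intensity) pairs. The bin width is an
    assumption for illustration, not a value taken from the study.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned intensity vectors (0 to 1)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Made-up peak lists, for illustration only.
simulated = [(77.04, 100.0), (105.03, 60.0), (122.04, 20.0)]
experimental = [(77.05, 90.0), (105.04, 70.0), (51.02, 10.0)]
score = cosine_similarity(simulated, experimental)
```

A score of 1 means the binned intensity patterns are identical and 0 means the spectra share no populated bins, so a mean of 0.68 across a library indicates substantial but imperfect agreement between simulated and reference spectra.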