Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem MS to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. We present a method for searching molecular structure databases using tandem MS data of small molecules. Our method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. We use the fragmentation tree to predict the molecular structure fingerprint of the unknown compound using machine learning. This fingerprint is then used to search a molecular structure database such as PubChem. Our method is shown to improve on the competing methods for computational metabolite identification by a considerable margin.

mass spectrometry | small compound identification | metabolomics | bioinformatics | machine learning

Metabolites, small molecules that are involved in cellular reactions, can provide detailed information about cellular state. Untargeted metabolomic studies may use NMR or MS technologies, but liquid chromatography followed by MS (LC/MS) can detect the highest number of metabolites from minimal amounts of sample (1, 2). Untargeted metabolomics comprehensively compares the mass spectral intensities of metabolite signals (peaks) between two or more samples (3, 4). Advances in MS instrumentation allow us to simultaneously detect thousands of metabolites in a biological sample. Identification of these compounds relies on tandem MS (MS/MS) data, produced by fragmenting the compound and recording the masses of the fragments. Structural elucidation remains a challenging problem, in particular for compounds that cannot be found in any spectral library (1): In total, all available spectral MS/MS libraries of pure chemical standards cover fewer than 20,000 compounds (5).
Growth of spectral libraries is limited by the unavailability of pure reference standards for many compounds. In contrast, molecular structure databases such as PubChem (6) and ChemSpider (7) contain millions of compounds, with PubChem alone having surpassed 50 million entries. Searching in molecular structure databases using MS/MS data is therefore considered a powerful tool for assisting an expert in the elucidation of a compound. This problem is considerably harder than the fundamental analysis step in the shotgun proteomics workflow, namely, searching peptide MS/MS data in a peptide sequence database (8): Unlike proteins and peptides, metabolites show a large structural variability and, consequently, also large variations in MS/MS fragmentation. Computational approaches for interpreting and predicting MS/MS data of small molecules date back to the 1960s (9): Due to the unavailability of molecular structure databases at that time, structure libraries were combinatorially generated and then "searched" using the experimental MS/MS data. "Modern" methods for this question have been developed since mid-2000. Particular progress has been made for restricted metabolite cl...
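To make the idea of searching a structure database with a predicted fingerprint more concrete, the following is a minimal sketch and not the scoring used by the authors, which this excerpt does not describe. It assumes the machine learning step outputs one probability per fingerprint bit, thresholds those probabilities into a bit set, and ranks candidate structures by Tanimoto similarity, a standard fingerprint comparison swapped in here for illustration. The function names, the threshold of 0.5, and the candidate names and bits are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def rank_candidates(
    predicted_probs: Sequence[float],
    candidates: Dict[str, FrozenSet[int]],
    threshold: float = 0.5,  # assumption: cutoff for calling a bit "present"
) -> List[Tuple[str, float]]:
    """Rank candidate structures by similarity to a predicted fingerprint."""
    # Turn per-bit probabilities into the set of bits predicted to be on.
    predicted_bits = frozenset(
        i for i, p in enumerate(predicted_probs) if p >= threshold
    )
    scored = [
        (name, tanimoto(predicted_bits, bits))
        for name, bits in candidates.items()
    ]
    # Highest similarity first; ties are broken by name for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: five fingerprint bits and three database candidates.
predicted = [0.9, 0.8, 0.1, 0.7, 0.2]
database = {
    "candidate_A": frozenset({0, 1, 3}),
    "candidate_B": frozenset({0, 2}),
    "candidate_C": frozenset({1, 3, 4}),
}
ranking = rank_candidates(predicted, database)
print(ranking)
```

In this toy setting the candidate whose bits match the thresholded prediction exactly is ranked first. A real search would retrieve candidates from a database such as PubChem, typically filtered by molecular formula or precursor mass, and would likely weight bits by how reliably each one is predicted.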
Metabolomics experiments can employ non-targeted tandem mass spectrometry to detect hundreds to thousands of molecules in a biological sample. Structural annotation of molecules is typically carried out by searching their fragmentation spectra in spectral libraries or, recently, in structure databases. Annotations are limited to structures present in the library or database employed, prohibiting a thorough utilization of the experimental data. We present a computational tool for systematic compound class annotation: CANOPUS uses a deep neural network to predict 1,270 compound classes from fragmentation spectra, and explicitly targets compounds where neither spectral nor structural reference data are available. CANOPUS even predicts classes for which no MS/MS training data are available. We demonstrate the broad utility of CANOPUS by investigating the effect of the microbial colonization in the digestive system in mice, and through analysis of the chemodiversity of different Euphorbia plants; both uniquely revealing biological insights at the compound class level.