We engineered a machine learning approach, MSHub, to enable auto-deconvolution of gas chromatography-mass spectrometry (GC-MS) data. We then designed workflows to enable the community to store, process, share, annotate, compare and perform molecular networking of GC-MS data within the Global Natural Product Social (GNPS) Molecular Networking analysis platform. MSHub/GNPS performs auto-deconvolution of compound fragmentation patterns via unsupervised non-negative matrix factorization and quantifies the reproducibility of fragmentation patterns across samples.
Given its ease of use and low operational cost, GC-MS has applications with broad societal effect, such as detection of metabolic disease in newborns, toxicology, doping, forensics, food science and clinical testing. The predominant ionization technique in GC-MS is electron ionization (EI), in which all compounds are ionized by high-energy (70-eV) electrons. Because fragmentation occurs with ionization, EI GC-MS data are subjected to spectral deconvolution, a process that separates fragmentation ion patterns for each eluting molecule into a composite mass spectrum.
An ionizing electron energy of 70 eV has long been the standard in GC-MS, making it possible to use decades-old EI reference spectra for annotation [1]. There are ~1.2 million reference spectra that have been accumulated and curated over a period of more than 50 years [2]. Many tools and repositories for GC-MS data have been introduced [3][4][5][6][7][8][9][10][11][12][13][14][15]; however, much of GC-MS data processing is restricted to vendor-specific formats and software [8]. Currently, deconvolution requires setting multiple parameters manually [3][4][5] or possessing the computational skills to run the software [7]. Also, the lack of data sharing in a uniform format precludes data comparison between laboratories and prevents taking advantage of repository-scale information and community knowledge, resulting in infrequent reuse of GC-MS data [8][11][12][13][14][15].
Although batch modes exist, deconvolution quality is currently not enhanced by using information from all other files. To leverage across-file information, improve scalability of spectral deconvolution and eliminate the need for manually setting the deconvolution parameters (m/z error correction of the ions, peak shape (slopes of rising and trailing edges), peak RT shifts and noise/intensity thresholds), we developed an algorithmic learning strategy for auto-deconvolution (Fig. 1a-f). We deployed this functionality within GNPS/MassIVE (https://gnps.ucsd.edu) [16] (Fig. 1f-i). To promote analysis reproducibility, all GNPS jobs performed are retained in the 'My User' space and can be shared as hyperlinks.
This user-independent 'automatic' parameter optimization is accomplished via fast Fourier transform (FFT), multiplication and inverse Fourier transform for each ion across an entire data set, followed by an unsupervised non-negative matrix factorization (NMF) (one-layer neural network). Then, the compositional consistency of spectral patterns for each spec...
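The NMF step named above can be illustrated with a minimal sketch. This is not MSHub's code: it assumes a toy intensity matrix whose rows are m/z channels and whose columns are scans, and it uses Lee-Seung multiplicative updates, a common NMF solver that the excerpt does not confirm MSHub uses. The FFT-based steps are not covered. All function names here (`toy_matrix`, `nmf`, `frobenius_error`) are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH (reconstruction error)."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, rank, n_iter=500, seed=0, eps=1e-12):
    """Factor non-negative V (m x n) into W (m x rank) and H (rank x n).

    Uses Lee-Seung multiplicative updates (an assumed solver choice).
    Columns of W play the role of fragmentation patterns; rows of H
    play the role of elution profiles.
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(m)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


def toy_matrix():
    """Two co-eluting compounds with made-up spectra and elution profiles."""
    s1, s2 = [10, 0, 5, 0, 1, 0], [0, 8, 0, 4, 0, 2]          # spectra (m/z channels)
    c1, c2 = [0, 1, 3, 5, 3, 1, 0, 0], [0, 0, 0, 1, 3, 5, 3, 1]  # elution over scans
    return [[a * x + b * y for x, y in zip(c1, c2)] for a, b in zip(s1, s2)]
```

Calling `nmf(toy_matrix(), rank=2)` returns a pair `(W, H)`; comparing `frobenius_error` before and after the iterations shows how much of the overlapping signal the two components explain. In practice a numerical library would replace the list-based matrix helpers.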
We report the development of a spectral knowledgebase named ADAP-KDB for tracking and prioritizing unknown gas chromatography-mass spectrometry (GC-MS) spectra in the NIH's Metabolomics Data Repository, a national and international repository for metabolomics data. ADAP-KDB consists of two parts. One part is a computational workflow that preprocesses raw mass spectrometry data and derives consensus mass spectra. The other part is a web portal for users to browse the consensus spectra and match query spectra against them. For each consensus spectrum, the Gini-Simpson diversity index and the p-value from the χ² goodness-of-fit test are calculated to measure its statistical significance, which enables prioritization of unknown mass spectra for subsequent costly compound identification.
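The two statistics named above can be sketched in a few lines. The abstract does not say which categories are counted or which expected distribution is tested against, so both are assumptions here: `counts` stands for hypothetical per-category tallies for one consensus spectrum (for example, how many source studies fall into each sample type), and the expected proportions are supplied by the caller. The function names are hypothetical and this is not ADAP-KDB's implementation.

```python
import math


def gini_simpson(counts):
    """Gini-Simpson diversity index: 1 - sum(p_i^2) over category proportions."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    with y = x / 2, starting from Q(1, y) = exp(-y) for even dof or
    Q(1/2, y) = erfc(sqrt(y)) for odd dof.
    """
    if dof < 1:
        raise ValueError("dof must be at least 1")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < dof / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def chi2_goodness_of_fit(observed, expected_props):
    """Return (statistic, p-value) for observed counts vs. expected proportions."""
    if len(observed) != len(expected_props) or len(observed) < 2:
        raise ValueError("need matching lists with at least two categories")
    total = sum(observed)
    expected = [p * total for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_sf(stat, len(observed) - 1)
```

For instance, `gini_simpson([5, 5])` gives 0.5 (evenly split across two categories), while a spectrum seen in only one category gives 0. A low diversity index together with a small p-value would flag a spectrum whose occurrence is concentrated in particular categories, which is the kind of signal the abstract describes using for prioritization.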
The number of metabolomics studies and spectral libraries for compound annotation (i.e., assigning possible compound identities to a fragmentation spectrum) has been growing steadily in recent years. Accompanying this growth is an increase in the number of mass spectra available for searching through those libraries. As the size of spectral libraries grows, accurate and fast compound annotation becomes more challenging. We herein report a prescreening algorithm that was developed to address the speed of spectral search under the constraint of low memory requirements. This prescreening has been incorporated into the Automated Data Analysis Pipeline Spectral Knowledgebase (ADAP-KDB) and can be applied to compound annotation by searching other spectral libraries as well. Performance of the prescreening algorithm was evaluated for different sets of parameters and compared to the original ADAP-KDB spectral search and the MSSearch software. The comparison demonstrated that the new algorithm is about four times faster than the original when searching for low-resolution mass spectra, and about as fast as the original when searching for high-resolution mass spectra. However, the new algorithm is still slower than MSSearch due to the relational database design of the former. The new search workflow can be tried out at the ADAP-KDB web portal.