Spectrum matching is the most common method for compound identification in mass spectrometry (MS). However, its effectiveness is limited by the coverage of spectral libraries and by the accuracy and speed of matching. In this study, a million-scale in-silico EI-MS library is established. Furthermore, an ultra-fast and accurate spectrum matching method (FastEI) is proposed, which improves accuracy using Word2vec spectral embedding and boosts speed using a hierarchical navigable small-world (HNSW) graph. It achieves a recall@10 of 80.4% (88.3% with a 5 Da mass filter) with a speedup of two orders of magnitude compared with the weighted cosine similarity (WCS) method. When FastEI is applied to identify molecules beyond the NIST 2017 library, it achieves a recall@1 of 50%. FastEI is packaged as standalone, user-friendly software for users with limited computational backgrounds. Overall, FastEI combined with the million-scale in-silico library provides an accurate and ultra-fast tool for compound identification.
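The retrieval setup described above can be illustrated with a small sketch. It is not FastEI itself: the Word2vec embedding and the HNSW index are replaced here by a plain weighted cosine similarity and a brute-force scan, so the sketch only shows how a mass filter narrows the candidate list and how recall@k is scored. The function names, the weighting exponents, the toy spectra, and the assumption that the filter keeps candidates whose molecular mass lies within a tolerance of the query's mass are all illustrative and not taken from the paper.

```python
import math

def weighted_vector(spectrum, mz_power=1.0, intensity_power=0.5):
    """Map {m/z: intensity} to {m/z: weight}; the exponents are illustrative assumptions."""
    return {mz: (mz ** mz_power) * (inten ** intensity_power)
            for mz, inten in spectrum.items()}

def weighted_cosine(query, reference):
    """Weighted cosine similarity between two spectra keyed by integer m/z."""
    q, r = weighted_vector(query), weighted_vector(reference)
    dot = sum(w * r[mz] for mz, w in q.items() if mz in r)
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in r.values())))
    return dot / norm if norm else 0.0

def search(query, query_mass, library, k=10, mass_tolerance=None):
    """Brute-force top-k search; library entries are (name, mass, spectrum).

    If mass_tolerance is given (in Da), only candidates whose mass lies
    within that tolerance of query_mass are scored (assumed filter behavior).
    """
    candidates = [
        (name, spec) for name, mass, spec in library
        if mass_tolerance is None or abs(mass - query_mass) <= mass_tolerance
    ]
    ranked = sorted(candidates,
                    key=lambda c: weighted_cosine(query, c[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

def recall_at_k(hit_lists, true_names):
    """Fraction of queries whose true compound appears in its top-k hit list."""
    hits = sum(1 for hit_list, truth in zip(hit_lists, true_names)
               if truth in hit_list)
    return hits / len(true_names)

# Hypothetical toy library: (name, molecular mass in Da, spectrum).
library = [
    ("A", 100.0, {41: 30.0, 57: 100.0, 85: 20.0}),
    ("B", 102.0, {43: 100.0, 58: 40.0, 87: 10.0}),
    ("C", 180.0, {41: 25.0, 57: 100.0, 85: 30.0}),
]
query = {41: 28.0, 57: 100.0, 85: 22.0}  # noisy copy of compound "A"

top_unfiltered = search(query, 100.0, library, k=2)
top_filtered = search(query, 100.0, library, k=2, mass_tolerance=5.0)
```

In this toy case the unfiltered search ranks the spectrally similar but much heavier compound "C" second, while the 5 Da filter removes it from the candidate list before scoring. An approximate nearest-neighbor index such as HNSW would replace the `sorted` scan when the library holds millions of spectra.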
Region of interest (ROI) extraction is a fundamental step in analyzing metabolomic datasets acquired by liquid chromatography–mass spectrometry (LC–MS). However, noise and background signals in LC–MS data often degrade the quality of extracted ROIs. Therefore, effective ROI evaluation algorithms are needed to eliminate false positives while keeping the false-negative rate as low as possible. In this study, a deep fused filter of ROIs (dffROI) was proposed to improve the accuracy of ROI extraction by combining handcrafted evaluation metrics with convolutional neural network (CNN)-learned representations. To evaluate its performance, dffROI was compared with peakonly (a CNN-learned representation) and five handcrafted metrics on three LC–MS datasets and a gas chromatography–mass spectrometry (GC–MS) dataset. The results show that dffROI achieves higher accuracy, a higher true-positive rate, and a lower false-positive rate. Its accuracy, true-positive rate, and false-positive rate on the test set are 0.9841, 0.9869, and 0.0186, respectively. The classification error rate of dffROI (1.59%) is significantly lower than that of peakonly (2.73%). The model-agnostic feature importance analysis demonstrates the necessity of fusing handcrafted evaluation metrics with the CNN-learned representations. dffROI is an automatic, robust, and universal method for ROI filtering by virtue of information fusion and end-to-end learning. It is implemented in the Python programming language and open-sourced under the BSD License. Furthermore, it has been integrated into the KPIC2 framework previously proposed by our group to facilitate the analysis of real metabolomic LC–MS datasets.
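The evaluation metrics quoted above are related through the standard confusion-matrix definitions, and the reported error rate of 1.59% matches one minus the reported accuracy of 0.9841. The following is a minimal sketch of those definitions. The function name and the counts are hypothetical and are not taken from the dffROI test set.

```python
def roi_filter_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics for an ROI filter.

    tp: true ROIs kept, fn: true ROIs discarded,
    fp: false ROIs kept, tn: false ROIs discarded.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        # Error rate is the complement of accuracy.
        "error_rate": (fp + fn) / total,
    }

# Hypothetical counts chosen only to exercise the formulas.
metrics = roi_filter_metrics(tp=90, fp=5, tn=95, fn=10)
```

A filter that aims to remove false positives without losing real signals needs a low false-positive rate and a high true-positive rate at the same time, which is why the abstract reports both rather than accuracy alone.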