Tandem mass spectrometry (MS/MS) is a powerful technique for chemical
analysis in many areas of science. The large volume of MS/MS spectra generated
in liquid chromatography–mass spectrometry (LC-MS) experiments
requires efficient analysis and interpretation methods for subsequent
compound identification. In this study, we propose MSBERT, a model based on
self-supervised learning strategies that maps MS/MS spectra to meaningful
embeddings for efficient compound identification. It adopts the transformer
encoder as the backbone for mask learning and uses the same spectra
with different masks for contrastive learning. MSBERT is trained on
the GNPS data set and tested on the GNPS data set, the MoNA data set,
and the MTBLS1572 data set. It exhibits enhanced library matching
and analogous compound searching capabilities compared to existing
methods. The recalls at 1, 5, and 10 on a GNPS test subset with structures
not in the training set are 0.7871, 0.8950, and 0.9080, respectively.
These results exceed those of Spec2Vec (0.6898, 0.8276,
and 0.8620) and DreaMS (0.7158, 0.8327, and 0.8635). The quality
of the embeddings is demonstrated by t-SNE visualization, structural similarity,
spectra clustering, compound identification, and analogous compound
searching. A user-friendly web server is provided for efficient spectral
analysis, and the source code for MSBERT is available at .
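
The recall at k values reported above can be illustrated with a minimal sketch. It assumes that library matching ranks library spectra by the cosine similarity of their embeddings to the query embedding and counts a hit when the correct compound appears among the top k candidates. The function names, the toy two-dimensional vectors, and the compound labels are hypothetical and are not taken from the MSBERT source code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(query_embs, query_labels, lib_embs, lib_labels, k):
    """Fraction of queries whose true compound label is among the top-k library hits."""
    hits = 0
    for query, true_label in zip(query_embs, query_labels):
        # Rank library entries from most to least similar to the query.
        ranked = sorted(
            range(len(lib_embs)),
            key=lambda i: cosine(query, lib_embs[i]),
            reverse=True,
        )
        top_labels = {lib_labels[i] for i in ranked[:k]}
        if true_label in top_labels:
            hits += 1
    return hits / len(query_embs)


# Toy data (assumed for illustration only): three library embeddings, two queries.
lib_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lib_labels = ["A", "B", "C"]
query_embs = [[0.9, 0.1], [0.6, 0.5]]
query_labels = ["A", "B"]

for k in (1, 2, 3):
    print(k, recall_at_k(query_embs, query_labels, lib_embs, lib_labels, k))
```

In practice the embeddings would come from the trained model, and the ranking would be computed with vectorized operations rather than a Python loop.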