Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem MS to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. We present a method for searching molecular structure databases using tandem MS data of small molecules. Our method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. We use the fragmentation tree to predict the molecular structure fingerprint of the unknown compound using machine learning. This fingerprint is then used to search a molecular structure database such as PubChem. Our method is shown to improve on the competing methods for computational metabolite identification by a considerable margin.

mass spectrometry | small compound identification | metabolomics | bioinformatics | machine learning

Metabolites, small molecules that are involved in cellular reactions, can provide detailed information about cellular state. Untargeted metabolomic studies may use NMR or MS technologies, but liquid chromatography followed by MS (LC/MS) can detect the highest number of metabolites from minimal amounts of sample (1, 2). Untargeted metabolomics comprehensively compares the mass spectral intensities of metabolite signals (peaks) between two or more samples (3, 4). Advances in MS instrumentation allow us to simultaneously detect thousands of metabolites in a biological sample. Identification of these compounds relies on tandem MS (MS/MS) data, produced by fragmenting the compound and recording the masses of the fragments. Structural elucidation remains a challenging problem, in particular for compounds that cannot be found in any spectral library (1): In total, all available spectral MS/MS libraries of pure chemical standards cover fewer than 20,000 compounds (5).
Growth of spectral libraries is limited by the unavailability of pure reference standards for many compounds. In contrast, molecular structure databases such as PubChem (6) and ChemSpider (7) contain millions of compounds, with PubChem alone having surpassed 50 million entries. Searching in molecular structure databases using MS/MS data is therefore considered a powerful tool for assisting an expert in the elucidation of a compound. This problem is considerably harder than the fundamental analysis step in the shotgun proteomics workflow, namely, searching peptide MS/MS data in a peptide sequence database (8): Unlike proteins and peptides, metabolites show a large structural variability and, consequently, also large variations in MS/MS fragmentation. Computational approaches for interpreting and predicting MS/MS data of small molecules date back to the 1960s (9): Due to the unavailability of molecular structure databases at that time, structure libraries were combinatorially generated and then "searched" using the experimental MS/MS data. "Modern" methods for this question have been developed since mid-2000. Particular progress has been made for restricted metabolite cl...