Mass spectrometry‐based proteomics is a popular and powerful method for precise and highly multiplexed protein identification. The most common method of analyzing untargeted proteomics data is database searching, where the database is a collection of protein sequences from the target organism, derived from genome sequencing. Experimental peptide tandem mass spectra are compared to simplified models of theoretical spectra calculated from the translated genomic sequences. However, in several interesting application areas, such as forensics, archaeology, and venomics, a genome sequence may not be available, or the correct genome sequence to use is not known. In these cases, de novo peptide identification can play an important role. De novo methods infer peptide sequence directly from the tandem mass spectrum without reference to a sequence database, usually using graph‐based or machine learning algorithms. In this review, we provide a basic overview of de novo peptide identification methods and applications, briefly covering de novo algorithms and tools, and focusing in more depth on recent applications from venomics, metaproteomics, forensics, and characterization of antibody drugs.
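To make the idea of a "simplified model of a theoretical spectrum" concrete, the following is a minimal sketch, not the method of any particular search engine or de novo tool. It assumes unmodified peptides, singly charged fragments, and only b and y ions; the function name `fragment_ions` is hypothetical. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O (Da)
PROTON = 1.007276   # mass of a proton (Da)


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for each cleavage site.

    Illustrative assumption: no modifications, charge 1+, b/y ions only.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]

    # b ions: N-terminal prefixes of length 1 .. n-1, plus one proton.
    b_ions, running = [], 0.0
    for m in masses[:-1]:
        running += m
        b_ions.append(running + PROTON)

    # y ions: C-terminal suffixes of length 1 .. n-1, plus water and one proton.
    y_ions, running = [], 0.0
    for m in reversed(masses[1:]):
        running += m
        y_ions.append(running + WATER + PROTON)

    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    print("b ions:", [round(x, 4) for x in b])
    print("y ions:", [round(x, 4) for x in y])
```

A database search compares peak lists like these against the experimental spectrum for every candidate peptide in the database, whereas a de novo method works in the other direction, inferring residues from mass differences between observed peaks. Because leucine and isoleucine have identical residue masses, a model of this kind cannot distinguish them.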
Metaproteomics has been increasingly utilized for high-throughput characterization of proteins in complex environments and has been demonstrated to provide insights into microbial composition and functional roles. However, significant challenges remain in metaproteomic data analysis, including creation of a sample-specific protein sequence database. A well-matched database is a requirement for successful metaproteomics analysis, and the accuracy and sensitivity of PSM identification algorithms suffer when the database is incomplete or contains extraneous sequences. When matched DNA sequencing data of the sample is unavailable or incomplete, creating the proteome database that accurately represents the organisms in the sample is a challenge. Here, we leverage a de novo peptide sequencing approach to identify the sample composition directly from metaproteomic data. First, we created a deep learning model, Kaiko, to predict the peptide sequences from mass spectrometry data and trained it on 5 million peptide–spectrum matches from 55 phylogenetically diverse bacteria. After training, Kaiko successfully identified organisms from soil isolates and synthetic communities directly from proteomics data. Finally, we created a pipeline for metaproteome database generation using Kaiko. We tested the pipeline on native soils collected in Kansas, showing that the de novo sequencing model can be employed as an alternative and complementary method to construct the sample-specific protein database instead of relying on (un)matched metagenomes. Our pipeline identified all highly abundant taxa from 16S rRNA sequencing of the soil samples and uncovered several additional species which were strongly represented only in proteomic data.
Bottom-up proteomics is increasingly being used to characterize unknown environmental, clinical, and forensic samples. Proteomics-based bacterial identification typically proceeds by tabulating peptide "hits" (i.e., confidently identified peptides) associated with the organisms in a database; those organisms with enough hits are declared present in the sample. This approach has proven successful in laboratory studies; however, important research gaps remain. First, the common-practice reliance on unique peptides for identification is susceptible to a phenomenon known as signal erosion. Second, no general guidelines are available for determining how many hits are needed to make a confident identification. These gaps inhibit the transition of this approach to real-world forensic samples, where conditions vary and large databases may be needed. In this work, we propose statistical criteria that overcome the problem of signal erosion and can be applied regardless of the sample quality or data analysis pipeline. These criteria are straightforward, producing a p-value on the result of an organism or toxin identification. We test the proposed criteria on 919 LC-MS/MS data sets originating from 2 toxins and 32 bacterial strains acquired using multiple data collection platforms. Results reveal a >95% correct species-level identification rate, demonstrating the effectiveness and robustness of proteomics-based organism/toxin identification.
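The hit-tabulation step described above can be sketched as follows. This illustrates only the conventional counting approach, not the statistical criteria proposed in the abstract, which are not specified there. The toy database, the peptide sequences, the function names `tabulate_hits` and `call_present`, and the threshold of 2 hits are all hypothetical.

```python
from collections import Counter


def tabulate_hits(identified_peptides, database):
    """Count total and unique peptide hits per organism.

    identified_peptides: set of confidently identified peptide sequences.
    database: dict mapping organism name -> set of its peptide sequences
              (a hypothetical toy format, chosen for illustration).
    """
    # Number of database organisms that contain each peptide.
    owners = Counter()
    for peptides in database.values():
        for peptide in peptides:
            owners[peptide] += 1

    table = {}
    for organism, peptides in database.items():
        hits = identified_peptides & peptides
        # A hit is "unique" if no other organism in the database shares it.
        unique = {p for p in hits if owners[p] == 1}
        table[organism] = {"hits": len(hits), "unique_hits": len(unique)}
    return table


def call_present(table, min_hits, key="unique_hits"):
    """Declare present every organism with at least min_hits (arbitrary cutoff)."""
    return sorted(org for org, counts in table.items() if counts[key] >= min_hits)


# Toy data (made up for illustration).
database = {
    "organism_A": {"PEPTIDEK", "LSSGAVR", "AVDTTLK"},
    "organism_B": {"PEPTIDEK", "GGWTFEK"},
}
identified = {"PEPTIDEK", "LSSGAVR", "AVDTTLK"}

if __name__ == "__main__":
    table = tabulate_hits(identified, database)
    print(table)
    print(call_present(table, min_hits=2))

    # Adding a close relative that shares peptides with organism_A removes
    # organism_A's unique hits even though the sample is unchanged.
    larger = dict(database, organism_A2={"PEPTIDEK", "LSSGAVR", "AVDTTLK"})
    print(tabulate_hits(identified, larger))
```

In this toy case, adding a near-identical relative to the database reduces the unique-hit count for `organism_A` to zero. That is one way counts based on unique peptides can shrink as databases grow, and the fixed cutoff in `call_present` shows why a principled threshold is needed.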
Antimicrobial resistance (AMR) is a well-recognized, widespread, and growing issue of concern. With increasing incidence of AMR, the ability to respond quickly to infection with or exposure to an AMR pathogen is critical. Approaches that identify more quickly and accurately whether a pathogen is AMR are also needed to respond more rapidly to existing and emerging biological threats. We examined proteins associated with paired AMR and antimicrobial susceptible (AMS) strains of Yersinia pestis and Francisella tularensis, causative agents of the diseases plague and tularemia, respectively, to determine whether proteins could be used as signatures of AMR. We found that protein expression was significantly impacted by AMR status. Antimicrobial resistance-conferring proteins were expressed even in the absence of antibiotics in growth media, and the abundance of 10–20% of cellular proteins beyond those that directly confer AMR was also significantly changed in both Y. pestis and F. tularensis. Most strikingly, the abundance of proteins involved in specific metabolic pathways and biological functions was altered in all AMR strains examined, independent of species, resistance mechanism, and affected cellular antimicrobial target. We have identified features that distinguish between AMR and AMS strains, including a subset of features shared across species with different resistance mechanisms, which suggest shared biological signatures of resistance. These features could form the basis of novel approaches to identify AMR phenotypes in unknown strains.