Small proteins play essential roles in bacterial physiology and virulence; however, automated genome annotation algorithms often cannot yet accurately predict the corresponding genes. The accuracy and reliability of genome annotations, particularly for small open reading frames (sORFs), can be significantly improved by integrating protein evidence from experimental approaches. Here we present a highly optimized and flexible bioinformatics workflow for bacterial proteogenomics covering all steps from (i) generation of protein databases, (ii) database searches and (iii) peptide-to-genome mapping to (iv) visualization of results. We used the workflow to identify high-quality peptide spectrum matches (PSMs) for small proteins (≤ 100 aa, SP100) in Staphylococcus aureus Newman. Protein extracts from S. aureus were subjected to different experimental workflows for protein digestion and prefractionation and measured with highly sensitive mass spectrometers. In total, 175 proteins of up to 100 aa (SP100) were identified. Of these, 24 (ranging from 9 to 99 aa) were novel and not contained in the genome annotation used. A total of 144 SP100 are highly conserved and were found in at least 50% of the publicly available S. aureus genomes, while 127 are additionally conserved in other staphylococci. Almost half of the identified SP100 were basic, suggesting a role in binding to more acidic molecules such as nucleic acids or phospholipids.
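Step (iii), peptide-to-genome mapping, can be illustrated with a minimal sketch. This is not the workflow's own implementation: it assumes exact peptide matches only, uses the standard codon table (whose amino acid assignments are the same as in the bacterial table 11), and the function names `map_peptide`, `translate` and `revcomp` are hypothetical. It translates all six reading frames and reports where a peptide sequence occurs, in forward-strand coordinates.

```python
# Codon table built from the canonical TCAG ordering (standard genetic code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def map_peptide(genome, peptide):
    """Return (strand, start, end) for every exact match of the peptide.

    Coordinates are 0-based, half-open, and always refer to the forward strand.
    """
    hits = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets to forward-strand coordinates.
                    start, end = n - end, n - start
                hits.append((strand, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy genome (hypothetical): two flanking bases, then ATG AAA CGC TGG TAA.
toy_genome = "CC" + "ATGAAACGCTGGTAA" + "G"
print(map_peptide(toy_genome, "MKRW"))
```

A real pipeline would additionally have to handle alternative start codons, isoleucine/leucine ambiguity and peptides that map to several loci, none of which this sketch attempts.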
Emerging evidence places small proteins (≤ 50 amino acids) more centrally in physiological processes. Yet, their functional identification and the systematic genome annotation of their cognate small open reading frames (smORFs) remain challenging both experimentally and computationally. Ribosome profiling, or Ribo-Seq (i.e., deep sequencing of ribosome-protected fragments), enables the detection of actively translated open reading frames (ORFs) and the empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple ORF identifiers that use the 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation. However, they have difficulties evaluating prokaryotic genomes because of their unique architecture (e.g., polycistronic messages, overlapping ORFs, leaderless translation and non-canonical initiation). Here, we present a new algorithm, smORFer, which detects putative smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is its integrated approach: it considers structural features of the genetic sequence along with in-frame translation and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way, and depending on the data available for a particular organism, different modules can be selected for the smORF search.
Emerging evidence places small proteins (≤ 50 amino acids) more centrally in physiological processes. Yet, the identification of functional small proteins and the systematic genome annotation of their cognate small open reading frames (smORFs) remain challenging both experimentally and computationally. Ribosome profiling, or Ribo-Seq (i.e., deep sequencing of ribosome-protected fragments), enables the detection of actively translated open reading frames (ORFs) and the empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple ORF identifiers that use 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation. Yet, they have difficulties evaluating prokaryotic genomes because of their unique architecture (e.g., polycistronic messages, overlapping ORFs, leaderless translation and non-canonical initiation). Here, we present our new algorithm, smORFer, which detects smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is its integrated approach: it considers structural features of the genetic sequence along with in-register translation and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way and, depending on the data available for a particular organism, allows different modules to be used for the smORF search.
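The idea of turning 3-nt periodicity into a score with a Fourier transform can be sketched as follows. This is only an illustration under stated assumptions, not smORFer's actual scoring, which the abstract does not specify: the input is assumed to be a list of per-nucleotide Ribo-Seq read counts over a candidate ORF, and the hypothetical function `periodicity_score` reports the fraction of non-constant spectral power that falls at the frequency corresponding to a period of 3 nt.

```python
import cmath


def periodicity_score(counts):
    """Fraction of non-DC spectral power located at the period-3 frequency.

    `counts` is assumed to hold per-nucleotide read counts over one ORF.
    Returns a value between 0 and 1 (up to floating-point error).
    """
    n = len(counts) - len(counts) % 3  # trim to whole codons
    if n < 6:
        return 0.0
    x = counts[:n]
    if len(set(x)) == 1:
        return 0.0  # flat coverage carries no periodic signal

    def power(k):
        # Squared magnitude of the k-th discrete Fourier coefficient.
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x)
        )
        return abs(coeff) ** 2

    # Non-redundant, non-DC part of the spectrum of a real-valued signal.
    total = sum(power(k) for k in range(1, n // 2 + 1))
    # With n divisible by 3, index n // 3 corresponds to a period of 3 nt.
    return power(n // 3) / total


in_frame = [10, 1, 1] * 10   # reads pile up on one codon position
alternating = [10, 1] * 15   # periodic, but with period 2 rather than 3
print(periodicity_score(in_frame), periodicity_score(alternating))
```

A naive O(n²) transform is used here to keep the example dependency-free; for genome-scale data an FFT routine would be the natural choice, and a usable threshold on the score would have to be calibrated against known translated ORFs.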
Small proteins play diverse and essential roles in bacterial physiology and virulence. Despite their importance, automated genome annotation algorithms still cannot accurately annotate all of the corresponding small open reading frames (sORFs), because sORFs usually provide insufficient sequence information for domain and homology searches, tend to be species-specific, and only a few experimentally validated examples are covered in standard proteomics studies. The accuracy and reliability of genome annotations, particularly for sORFs, can be significantly improved by integrating protein evidence from experimental approaches that enrich for small proteins. Here we present a highly optimized and flexible workflow for bacterial proteogenomics, which covers all steps from (i) creation of protein databases, (ii) database searches and (iii) peptide-to-genome mapping to (iv) result interpretation, and whose automated execution is supported by two open-source tools (SALT & Pepper). We used the workflow to identify high-quality peptide spectrum matches (PSMs) for both annotated and unannotated small proteins (≤ 100 aa; SP100) in Staphylococcus aureus Newman. Proteins isolated from cells in the exponential and stationary growth phases were digested with different endopeptidases (trypsin, Lys-C, AspN), and the resulting peptides were fractionated by gel-based and gel-free methods and measured with highly sensitive mass spectrometers. PSMs or sORF predictions from sORFfinder were stringently filtered, allowing us to detect 185 soluble SP100, 69 of which were missing from the genome annotation used. Most interestingly, almost half of the identified SP100 were basic, suggesting a role in binding to more acidic molecules such as nucleic acids or phospholipids. In addition, phage-related functions were proposed for 30 SP100, based on the localization of their coding sequences in the genome.
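Why digesting with several endopeptidases helps for small proteins can be seen in a short in silico digestion sketch. The cleavage rules below are simplified textbook rules and are assumptions of this example, not settings taken from the study: trypsin cuts after K or R (the proline exception is ignored), Lys-C cuts after K, and Asp-N cuts before D. The 7 to 30 residue length window for detectable peptides and the function name `digest` are likewise hypothetical.

```python
import re

# Simplified cleavage rules expressed as zero-width regular expressions.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])",  # cut C-terminal to K or R (no proline rule)
    "Lys-C": r"(?<=K)",       # cut C-terminal to K
    "Asp-N": r"(?=D)",        # cut N-terminal to D
}


def digest(protein, enzyme, min_len=7, max_len=30):
    """Return fully cleaved peptides whose length lies in [min_len, max_len]."""
    # Splitting on zero-width matches requires Python 3.7 or newer.
    pieces = [p for p in re.split(CLEAVAGE_RULES[enzyme], protein) if p]
    return [p for p in pieces if min_len <= len(p) <= max_len]


# Toy sequence (hypothetical) digested without a length filter.
for enzyme in CLEAVAGE_RULES:
    print(enzyme, digest("ADKRC", enzyme, min_len=1))
```

A short protein may yield no peptide in the detectable length window with one enzyme but a suitable one with another, which is the motivation for combining proteases; missed cleavages and modifications are deliberately left out of this sketch.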