Recent functional, proteomic and ribosome profiling studies in eukaryotes have concurrently demonstrated the translation of alternative open-reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by these altORFs. The putative alternative proteins translated from altORFs have orthologs in many species and contain functional domains. Evolutionary analyses indicate that altORFs often show more extreme conservation patterns than their CDSs. Thousands of alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many genes are multicoding genes and code for a large protein and one or several small proteins.
Advances in proteomics and sequencing have highlighted many non-annotated open reading frames (ORFs) in eukaryotic genomes. Genome annotations, cornerstones of today's research, mostly rely on prior knowledge of proteins and on ab initio prediction algorithms. Such algorithms notably enforce an arbitrary criterion of one coding sequence (CDS) per transcript, leading to a substantial underestimation of the coding potential of eukaryotes. Here, we present OpenProt, the first database fully endorsing a polycistronic model of eukaryotic genomes to date. OpenProt contains all possible ORFs longer than 30 codons across 10 species, and accumulates supporting evidence such as protein conservation, translation and expression. OpenProt annotates all known proteins (RefProts), novel predicted isoforms (Isoforms) and novel predicted proteins from alternative ORFs (AltProts). It incorporates cutting-edge algorithms to evaluate protein orthology and re-interrogate publicly available ribosome profiling and mass spectrometry datasets, supporting the annotation of thousands of predicted ORFs. The constantly growing database currently gathers evidence from 87 ribosome profiling and 114 mass spectrometry studies from several species, tissues and cell lines. All data are freely available and downloadable from a web platform (www.openprot.org) supporting a genome browser and advanced queries for each species. Thus, OpenProt enables a more comprehensive view of the coding potential of eukaryotic genomes.
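To make the idea of enumerating every ORF on a transcript concrete, the following is a minimal Python sketch. It is not OpenProt's pipeline: the function name `find_orfs`, the restriction to ATG start codons, the three forward frames, and the reading of the 30-codon threshold as "at least 30 codons, stop codon excluded" are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, min_codons=30):
    """Return (start, end, frame) for each ATG-to-stop ORF on the forward strand.

    Hypothetical helper, not part of OpenProt. Coordinates are 0-based and
    half-open, and `end` includes the stop codon. An ORF is kept when it has
    at least `min_codons` codons, not counting the stop codon (an assumed
    reading of the length threshold).
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# One transcript can yield several ORFs in different frames, which is the
# situation a one-CDS-per-transcript rule cannot represent.
example = "ATG" + "GCT" * 30 + "TAA"
print(find_orfs(example))
```

Because every frame is scanned independently, overlapping ORFs in different frames are all reported instead of only the longest one.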
OpenProt (www.openprot.org) is the first proteogenomic resource supporting a polycistronic annotation model for eukaryotic genomes. It provides a deeper annotation of open reading frames (ORFs) while mining experimental data for supporting evidence using cutting-edge algorithms. This update presents the major improvements since the initial release of OpenProt. All species support recent NCBI RefSeq and Ensembl annotations, with changes in annotations being reported in OpenProt. Using the 131 ribosome profiling datasets re-analysed by OpenProt to date, non-AUG initiation sites are reported alongside a confidence score for the initiating codon. For the 177 mass spectrometry datasets re-analysed by OpenProt to date, the uniqueness of the detected peptides is checked at each implementation. Furthermore, to guide users, detectability statistics and protein relationships (isoforms) are now reported for each protein. Finally, to foster access to deeper ORF annotation regardless of one’s bioinformatics skills or computational resources, OpenProt now offers a data analysis platform. Users can submit their dataset for analysis and receive the results of the analysis from OpenProt. All data on OpenProt are freely available and downloadable for each species, and the release-based format ensures continuous access to the data. Thus, OpenProt enables a more comprehensive annotation of eukaryotic genomes and fosters functional proteomic discoveries.
RNA G-quadruplexes (G4) have been shown to be structural motifs present in transcriptomes that play important regulatory roles in several post-transcriptional mechanisms. However, the full picture of RNA G4 locations and the extent of their involvement remain elusive. Only a computational prediction analysis of the whole transcriptome may reveal all potential G4, since experimental identifications are always limited to specific conditions or specific cell lines. The present study reports the first in-depth computational prediction of potential G4 regions across the complete human transcriptome. Despite the use of a relatively stringent approach based on three prediction scores that account for the composition of G4 sequences, the composition of their neighboring sequences, and the various forms of G4, over 1.1 million potential G4 (pG4) were predicted. The abundance of G4 was computationally confirmed in both the 5′ and 3′UTRs as well as at the splicing junctions of mRNAs, and observed for the first time in long ncRNAs, whereas G4 were almost absent from most small ncRNA families. The present results constitute an important step toward a full understanding of the roles of G4 in post-transcriptional mechanisms.
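For readers unfamiliar with sequence-based G4 prediction, the sketch below scans an RNA sequence for the textbook canonical motif: four runs of at least three guanines separated by loops of 1 to 7 nucleotides. This is a deliberately simpler technique than the three-score approach described in the abstract, which also covers non-canonical forms. The function name `find_canonical_g4` and the loop-length bounds are assumptions chosen for illustration.

```python
import re

# Canonical G4 motif: four G-runs (>= 3 G each) separated by 1-7 nt loops.
# The loop bounds are a common convention, assumed here for illustration.
CANONICAL_G4 = re.compile(
    r"G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}"
)


def find_canonical_g4(sequence):
    """Return (start, end, matched_sequence) for non-overlapping canonical G4 motifs.

    Hypothetical helper, not the method of the study. DNA input is accepted
    by converting T to U; coordinates are 0-based and half-open.
    """
    rna = sequence.upper().replace("T", "U")
    return [(m.start(), m.end(), m.group()) for m in CANONICAL_G4.finditer(rna)]


print(find_canonical_g4("AAGGGAGGGAGGGAGGGAA"))
```

A pattern search of this kind misses G4 with bulges or long loops and ignores the sequence context, which is one reason score-based predictors are used for transcriptome-wide analyses.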
Since assembled eukaryotic genomes became available, the first being that of a budding yeast, many computational methods for the reconstruction of ancestral karyotypes and gene orders have been developed. The difficulty has always been to assess their reliability, since we often lack good knowledge of the true ancestral genomes against which to compare their results, as well as good knowledge of the evolutionary mechanisms needed to test them on realistic simulated data. In this study, we propose measures of reliability for several kinds of methods, and apply them to infer and analyse the architectures of two ancestral yeast genomes, based on the sequences of seven assembled extant ones. The pre-duplication common ancestor of S. cerevisiae and C. glabrata was inferred manually by Gordon et al. (PLoS Genet. 2009). We show why, in this case, the good convergence of the methods is explained by some properties of the data, and why the results are reliable. In another study, Jean et al. (J. Comput. Biol. 2009) proposed an ancestral architecture of the last common ancestor of S. kluyveri, K. thermotolerans, K. lactis, A. gossypii, and Z. rouxii, inferred by a computational method. In this case, we show that the dataset does not seem to contain enough information to infer a reliable architecture, and we construct a higher-resolution dataset that supports a new ancestral configuration with good reliability.