2020
DOI: 10.21203/rs.3.rs-50810/v1
Preprint

Understanding the Causes of Errors in Eukaryotic Protein-coding Gene Prediction: A Case Study of Primate Proteomes

Abstract: Background. Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon-intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. Results. We first investigated the prevalence of gene prediction errors in a lar…

Cited by 5 publications (11 citation statements)
References 15 publications
“…Genscan [7], Snap [8] or GlimmerHMM [9]. Despite these developments, the annotation of gene structure remains a major challenge, especially for eukaryotic organisms [10][11][12][13] due to their complex exon-intron mosaics [14] (Fig. 1).…”
mentioning
confidence: 99%
“…Some important causes of erroneous sequences have been identified, including the genome sequence quality and gene structure complexity (40), as well as redundant or conflicting information in different resources or in the literature (34,41). Consequently, it has been estimated that 40 to 60% of the protein sequences in public databases are erroneous (42)(43)(44).…”
Section: Discussion
mentioning
confidence: 99%
“…Typical errors include missing exons, non-coding sequence retention in exons, wrong exon and gene boundaries, fragmenting genes and merging neighboring genes. Thus, the development of automated methods to identify and correct mispredicted protein sequences remains an important research topic (43, 45–47). Our study showing high error rates at a family-wide level is further evidence of the potential of domain-centric approaches for sequence annotation correction and of the urgent need to clean the databases.…”
Section: Discussion
mentioning
confidence: 99%
“…In order to train and evaluate our object detector, we used multiple sequence alignments from an in-house built dataset [15]. These MSAs are extracted from the Uniprot reference proteomes [5] and RefSeq [17] databases, and were automatically annotated using SIBIS algorithm [12].…”
Section: Dataset
mentioning
confidence: 99%
“…Algorithms used to predict protein-coding genes in DNA sequences, for instance, are not always accurate, and often lead to sequence prediction errors. Consequently, today's protein databases are riddled with inconsistencies [15].…”
Section: Introduction
mentioning
confidence: 99%