2021
DOI: 10.1101/2021.05.21.445150
Preprint

No one tool to rule them all: Prokaryotic gene prediction tool performance is highly dependent on the organism of study

Abstract: Motivation: The biases in Open Reading Frame (ORF) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information, as it biases predictions towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any ORF prediction tool and allow them to choose the right tool for their analys…

Citations: cited by 4 publications (3 citation statements)
References: 66 publications
“…The sequences between pairs of stop codons are then searched for start codons, which are paired with downstream stop codons in the same reading frame, generating an open-reading frame (ORF). As bacterial genes have alternative start sites due to reuse of start codons within an exon (Dimonaco et al., 2021), the best supported start site is chosen based on three criteria (Step 4). For each potential start site, a score is calculated from the start site population-frequency, a translation initiation site (TIS) score using a sequence-scoring model from BALROG (Sommer & Salzberg, 2021), and how many times the start site has been ‘chosen’ before as the best supported start in other potential orthologues.…”
Section: Results (mentioning)
confidence: 99%
“…Most gene prediction tools provide high-quality prediction of genes. A study by Nicholas et al offers a comparative insight into the difference between the efficiency of joint prediction by integrating Prodigal, MetaGeneAnnotator and MetaGeneMark and that of using the best tool for each specific organism [142]. Compared with the best tool, the joint prediction model was shown to offer a negligible increase (∼0.47%) in the number of genes predicted.…”
Section: Performance Comparison and Computational Requisites (mentioning)
confidence: 99%
“…However, mounting evidence suggests that short genes are widespread and play significant biological roles [155]. Emerging deep learning-based gene predictors without manual feature selection promise to improve prediction efficiency and quality for genes with irregular features [92], [142]. According to Almeida et al [20], approximately 40% of the proteins predicted in MAGs do not have similar sequences in the current databases, such as eggNOG, InterPro, COG and KEGG.…”
Section: Outlook Potential Challenges and Strategies To Address Them (mentioning)
confidence: 99%
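
The first citation statement above describes pairing start codons with the next downstream stop codon in the same reading frame to generate candidate ORFs. A minimal Python sketch of that pairing step might look like the following. It is an illustration only, not the implementation of the citing paper or of any tool named here: the function name `find_orfs`, the `min_codons` parameter, the choice of ATG/GTG/TTG as start codons, and the restriction to the forward strand are all assumptions made for the example. The start-site scoring step (population frequency, TIS score, orthologue support) is not modelled.

```python
# Assumed codon sets for illustration; real tools may use other genetic codes.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) tuples for forward-strand candidate ORFs.

    Every start codon found between two in-frame stop codons is paired
    with the next downstream stop, so one stop can yield several
    alternative start sites. `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # start positions seen since the previous in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                for s in starts:
                    # Length counted in codons, excluding the stop codon.
                    if (i - s) // 3 >= min_codons:
                        orfs.append((s, i + 3, frame))
                starts = []
            elif codon in START_CODONS:
                starts.append(i)
    return orfs


if __name__ == "__main__":
    # Two alternative starts (ATG at 0, GTG at 6) share the stop at 12.
    for orf in find_orfs("ATGAAAGTGCCCTAA"):
        print(orf)
```

Returning every start that shares a stop, rather than only the longest ORF, keeps the alternative start sites available for a later selection step such as the scoring the statement mentions.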