2020
DOI: 10.1186/s12864-020-6707-9

A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms

Abstract: Background: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations. Results: We describe the construction of a new benchmark, called G3PO (benc…

Cited by 71 publications (54 citation statements)
References 48 publications
“…To investigate the impact of the quality of the initial training data on the CNN models, we extracted data from public databases such as Ensembl [4] and UniProt [68], where it has been estimated that many proteins (with the exception of Swiss-Prot, which represents 0.3% of UniProt) have errors [69]. We then built a dataset called ‘All Sequences’ (AS), that includes some badly predicted gene sequences [45] and thus introduces noise in the form of wrong or missing SS. We compared the CNN model trained on the AS dataset with a second model trained on a ‘Gold Standard’ (GS) dataset, which was cleaned by removing all error-prone sequences.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…To build the positive and negative subsets, gene sequences from the multi-species benchmark G3PO [45] were used. G3PO is based on 147 phylogenetically disperse organisms and contains 1793 sequences including 20 human Bardet-Biedl Syndrome (BBS) genes (Additional file 1: Table S4) and their orthologous sequences (ranging from primates to protists) extracted from the OrthoInspector database v3.0 [74].…”
Section: Methods (citation type: mentioning)
confidence: 99%
“…Prior to structural annotation, repetitive elements were identified and masked (Supplementary File S1) using RepeatMasker (Smit et al 2013) with a custom library that included all Viridiplantae entries from Repbase (Bao et al 2015) along with repetitive elements from P. vulgaris (Gao et al 2014). Using the masked assembly, gene predictions were generated by AUGUSTUS with the Arabidopsis thaliana training set (Keller et al 2011; Scalzitti et al 2020). The completeness of the genome assembly and AUGUSTUS gene models was analyzed with BUSCO, which measures completeness in terms of evolutionarily informed expectations of gene content (Simão et al 2015).…”
Section: Methods (citation type: mentioning)
confidence: 99%
“…The prediction of a gene structure can be defined as the capacity to determine the start and the stop of the gene as well as the positions of introns, if present. Despite the number of performant gene prediction programs combining ab initio and homology-based approaches (Mathe et al, 2002; Hoff and Stanke, 2015), the rate of mis-predicted genes is not negligible and can be due to several factors (Scalzitti et al, 2020). For example, unusually long introns, short exons or long genes can generate incomplete or partially predicted gene structure; short intergenic regions can lead to gene fusion; DNA sequencing errors (nucleotide deletions or insertions) introducing frameshifts can affect predictions; non-canonical splice sites, overlapping genes and genes located within introns are also a source of erroneous predictions.…”
Section: Introduction (citation type: mentioning)
confidence: 99%