FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences

Banerjee, Sagnik; Bhandary, Priyanka; Woodhouse, Margaret R.; Sen, Taner Z.; Wise, Roger P.; Andorf, Carson M.

doi:10.1186/s12859-021-04120-9

Cited by 25 publications

(35 citation statements)

References 165 publications

“…When extrinsic evidence from RNA-seq and protein alignments are available, workflow packages like MAKER (Cantarel et al, 2008) and BRAKER (Hoff et al, 2019) can assist in training ab initio prediction tools. While these workflows can simplify the integration across external evidence, downstream packages are still required to select or modify the resulting predictions (Haas et al, 2008;Banerjee et al, 2021;Gabriel et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Welcome to the big leaves: best practices for improving genome annotation in non-model plant genomes

Vuruputoor, Monyak, Fetter, et al. 2022

Preprint

Premise of the study: Robust standards to evaluate quality and completeness are lacking for eukaryotic structural genome annotation. Genome annotation software is developed with model organisms and does not typically include benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. Plant genomes are particularly challenging with their large genome sizes, abundant transposable elements (TEs), and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and approach on protein-coding gene prediction. Methods: The impact of repeat masking, long-read, and short-read inputs, de novo, and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. Annotations were benchmarked for structural traits and sequence similarity. Results: Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long-reads can improve genome annotation. Adding protein evidence from de novo or genome-guided approaches generates more false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended. Discussion: While annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation, and present a more robust set of metrics to evaluate the resulting predictions.

“…Gene annotation MAKER2 (Holt & Yandell, 2011), BRAKER2 (Brůna et al, 2021), FINDER (Banerjee et al, 2021), Funannotate (Palmer & Stajich, 2017) Not available Performing independent annotation of individual genomes may lead to artefacts (Bayer et al, 2017;König et al, 2016) Annotations of linear genomes can be projected onto graph using VG annotate/rna.…”

Section: Structural Variation (SV) Genotyping (mentioning)

confidence: 99%

“…Without change, the world is on track to reach 840 million undernourished people by 2030 (FAO et al., 2020). The reasons for an increasingly undernourished population are manifold including unequal resource distribution, food waste, and crop loss arising from climate change (Barrera & Hertel, 2021; Hasegawa et al., 2016; Janssens et al., 2020). While an integrated approach is vital to successfully curb this alarming trend, climate‐change‐resilient crops are needed to counter more frequent and extreme weather events that affect mostly populations with already high rates of undernourishment (FAO et al., 2020).…”

Section: Introduction (mentioning)

confidence: 99%

Pangenomics in crop improvement—from coding structural variations to finding regulatory variants with pangenome graphs

et al. 2021

Since the first reported crop pangenome in 2014, advances in high‐throughput and cost‐effective DNA sequencing technologies facilitated multiple such studies including the pangenomes of oilseed rape (Brassica napus L.), soybean [Glycine max (L.) Merr.], rice (Oryza sativa L.), wheat (Triticum aestivum L.), and barley (Hordeum vulgare L.). Compared with single‐reference genomes, pangenomes provide a more accurate representation of the genetic variation present in a species. By combining the genomic data of multiple accessions, pangenomes allow for the detection and annotation of complex DNA polymorphisms such as structural variations (SVs), one of the major determinants of genetic diversity within a species. In this review we summarize the current literature on crop pangenomics, focusing on their application to find candidate SVs involved in traits of agronomic interest. We then highlight the potential of pangenomes in the discovery and functional characterization of noncoding regulatory sequences and their variations. We conclude with a summary and outlook on innovative data structures representing the complete content of plant pangenomes including annotations of coding and noncoding elements and outcomes of transcriptomic and epigenomic experiments.

“…Multiple pieces of software exist for each analysis step, and attempts have been made to link these tools together in cohesive pipelines [4–10], as has been done for other analysis types [11–15]. However, these pipelines; summarized in Fig.…”

Section: Introduction (mentioning)

confidence: 99%

SEAseq: a portable and cloud-based chromatin occupancy analysis suite

Adetunji, Abraham 2022

BMC Bioinformatics

Background: Genome-wide protein-DNA binding is popularly assessed using specific antibody pulldown in Chromatin Immunoprecipitation Sequencing (ChIP-Seq) or Cleavage Under Targets and Release Using Nuclease (CUT&RUN) sequencing experiments. These technologies generate high-throughput sequencing data that necessitate the use of multiple sophisticated, computationally intensive genomic tools to make discoveries, but these genomic tools often have a high barrier to use because of computational resource constraints. Results: We present a comprehensive, infrastructure-independent, computational pipeline called SEAseq, which leverages field-standard, open-source tools for processing and analyzing ChIP-Seq/CUT&RUN data. SEAseq performs extensive analyses from the raw output of the experiment, including alignment, peak calling, motif analysis, promoters and metagene coverage profiling, peak annotation distribution, clustered/stitched peaks (e.g. super-enhancer) identification, and multiple relevant quality assessment metrics, as well as automatic interfacing with data in GEO/SRA. SEAseq enables rapid and cost-effective resource for analysis of both new and publicly available datasets as demonstrated in our comparative case studies. Conclusions: The easy-to-use and versatile design of SEAseq makes it a reliable and efficient resource for ensuring high quality analysis. Its cloud implementation enables a broad suite of analyses in environments with constrained computational resources. SEAseq is platform-independent and is aimed to be usable by everyone with or without programming skills. It is available on the cloud at https://platform.stjude.cloud/workflows/seaseq and can be locally installed from the repository at https://github.com/stjude/seaseq.
