2021
DOI: 10.1101/2021.04.09.438957
Preprint

Widespread false gene gains caused by duplication errors in genome assemblies

Abstract: False duplications in genome assemblies lead to false biological conclusions. We quantified false duplications in previous genome assemblies and their new counterparts of the same species (platypus, zebra finch, Anna's hummingbird) generated by the Vertebrate Genomes Project (VGP). Whole genome alignments revealed that 4 to 16% of the sequences were falsely duplicated in the previous assemblies, impacting hundreds to thousands of genes. These led to overestimated gene family expansions. The main source of the …

citations
Cited by 11 publications
(9 citation statements)
references
References 52 publications
“…For the remaining novel genes, RNAcode (Washietl et al 2011) was used to further estimate the coding potential. To prevent the divergent homologous haplotypes that can cause false gene duplications (Ko et al 2021), we merged novel coding genes that have high similarity (identity ≥95%) with each other or can be annotated to the same gene, and then performed the manual check. We generated customized whole-genome alignments for each de novo assembly against Japanese quail (GCF_001577835.1), turkey (GCF_000146605.3), and helmeted guineafowl (GCF_002078875.1), which we used to estimate coding potential.…”
Section: Methods (citation type: mentioning)
confidence: 99%
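
The merging step described in the statement above (collapsing novel coding genes that share ≥95% identity) can be illustrated with a small sketch. This is not the citing authors' pipeline: the function name `merge_similar_genes`, the gene IDs, the identity values, and the choice of single-linkage grouping via union-find are all assumptions made for illustration. The pairwise identities are assumed to have been computed beforehand by an alignment tool.

```python
from collections import defaultdict


def merge_similar_genes(gene_ids, pairwise_identity, threshold=0.95):
    """Group genes linked by identity >= threshold (single linkage).

    gene_ids: iterable of gene identifiers (hypothetical names).
    pairwise_identity: dict mapping (gene_a, gene_b) -> identity in [0, 1],
        assumed to come from a prior alignment step.
    Returns a sorted list of sorted clusters.
    """
    parent = {g: g for g in gene_ids}

    def find(g):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every pair whose identity meets the threshold.
    for (a, b), identity in pairwise_identity.items():
        if identity >= threshold:
            union(a, b)

    clusters = defaultdict(list)
    for g in gene_ids:
        clusters[find(g)].append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical input: illustrative identities, not real data.
identities = {
    ("geneA", "geneB"): 0.97,
    ("geneB", "geneC"): 0.96,
    ("geneC", "geneD"): 0.80,
}
print(merge_similar_genes(["geneA", "geneB", "geneC", "geneD"], identities))
```

Single linkage places geneA and geneC in the same cluster through geneB even without a direct comparison between them; a stricter scheme (for example, requiring every pair in a cluster to pass the threshold) would be an equally plausible reading of the quoted text. Each resulting cluster would then be a candidate for the manual check the statement mentions.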
“…Thousands of such false gains and losses in previous reference assemblies have been corrected in our VGP assemblies (more details in refs. 27,44), demonstrating that assembly quality has a critical effect on subsequent annotations and functional genomics.…”
Section: Article (citation type: mentioning)
confidence: 99%
“…The sequencing enzymes used often have difficulty reading through regions with complex structures, such as GC-rich regions often found in promoters that regulate gene expression 9, 10. It is also now clear that mixing diverse haplotypes in a single assembly, even from the same individual, can introduce many errors with standard assembly tools 8, 10, 11. These errors include: switch errors where variants from each haplotype are assembled into the same pseudo-haplotype; false duplications and associated gaps where more divergent haplotype homologs are assembled as separate false paralogs; and consensus errors due to collapses between haplotypes.…”
Section: Main (citation type: mentioning)
confidence: 99%