2020
DOI: 10.1093/bioinformatics/btaa025

Identifying and removing haplotypic duplication in primary genome assemblies

Abstract: Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus…

Citations: cited by 1,819 publications (630 citation statements)
References: 11 publications
“…they do not preserve phase across their entire length). Canu also does not assign contigs to haplotypes, and requires postprocessing with a tool such as Purge_dups (Guan et al 2020) to split the diploid assembly into primary and alternate alleles. While recent studies have successfully integrated HiFi data with additional long-range linkage information Porubsky et al 2019), we do not expect that significant improvements in phasing can be achieved by HiFi-only assemblies without an increase of HiFi read lengths.…”
Section: Discussion (mentioning)
confidence: 99%
“…Similar to many modern assemblers, when faced with a diploid genome, HiCanu outputs contigs as "pseudo-haplotypes" that preserve local allelic phasing but may switch between haplotypes across longer distances. A single set of contigs representing all resolved alleles is output regardless of ploidy, and additional processing with a tool such as Purge_dups (Guan et al 2020) is required to partition the contigs into primary and alternate allele sets. Figure 1.…”
Section: HiCanu Overview (mentioning)
confidence: 99%
“…VGP assemblies exceeding Q40 contained fewer frameshift errors, as predicted 76 , and therefore we recommend targeting a minimum QV of 40 whenever possible. Haplotype phasing and false duplications are the most underdeveloped measures, presumably because this is an under-appreciated area of need, but recent tools, developed here and elsewhere 27,48,51,53,77 , are helping to address this. The six broad quality categories in the 1st column are split into sub-metrics in the 2nd column.…”
Section: The Vertebrate Genomes Project (VGP) (mentioning)
confidence: 99%
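
The QV threshold in the statement above can be made concrete with a little arithmetic. The sketch below assumes that QV is the usual Phred-scaled consensus quality, QV = -10 * log10(per-base error rate); the function names are illustrative and do not come from the cited paper or from any tool it mentions.

```python
import math


def phred_qv(error_rate: float) -> float:
    """Convert a per-base error rate into a Phred-scaled quality value.

    Assumes the standard Phred definition: QV = -10 * log10(error_rate).
    """
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Invert the Phred scale: expected errors per base at a given QV."""
    return 10.0 ** (-qv / 10.0)


def expected_errors(qv: float, n_bases: int) -> float:
    """Expected number of erroneous bases in a sequence of n_bases."""
    return error_rate_from_qv(qv) * n_bases


if __name__ == "__main__":
    # Under this definition, QV 40 is one expected error per 10,000 bases.
    print(error_rate_from_qv(40))
    # Expected erroneous bases in a hypothetical 1 Gb assembly at QV 40.
    print(expected_errors(40, 1_000_000_000))
```

Under this definition, each 10-point increase in QV corresponds to a tenfold drop in the expected per-base error rate.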
“…The k-mer frequencies were estimated with 21-mers using khist.sh (one of the BBTools v38.49 modules, [17]). Redundant heterozygous regions were removed with an identity cutoff of 60 using two rounds of Purge_dups v1.0.0 [16]. Non-redundant assembly was polished with Illumina short reads using two rounds of NextPolish v1.1.0 [17].…”
Section: Genome Assembly (mentioning)
confidence: 99%