2020
DOI: 10.1101/2020.02.25.964445
Preprint

SDip: A novel graph-based approach to haplotype-aware assembly based structural variant calling in targeted segmental duplications sequencing

Abstract: Segmental duplications are important for understanding human diseases and evolution. The challenge to distinguish allelic and duplication sequences has hindered their phased assembly as well as characterization of structural variant calls. Here we have developed a novel graph-based approach that leverages single nucleotide differences in overlapping reads to distinguish allelic and duplication sequences information from long read accurate PacBio HiFi sequencing. These differences enable to generate allelic and…

Cited by 13 publications (14 citation statements)
References 26 publications
“…Developing benchmarks in these regions will require the development of methods to characterize these regions with confidence (e.g., using diploid assembly), standards for representing variants in these regions, and benchmarking methodology and tools. For example, for variants inside segmental duplications for which the individual has more copies than the reference, methods are actively being developed to assemble these regions, 21,23 but no standards exist for representing which copy the variants fall on or how to compare to a benchmark.…”
Section: Discussion
mentioning
confidence: 99%
“…The problem of distinguishing reads originating from different paralogs without a reference genome is even more challenging but can allow for assembling segmental duplications that may be collapsed or incorrectly represented in the reference genome. Several novel methods have been designed to specifically assemble segmental duplications that leverage long reads, particularly accurate HiFi reads (27, 54). The SDip method has been shown to assemble diploid contigs for many duplicated genes such as SMN1 (54).…”
Section: Discussion
mentioning
confidence: 99%
“…Several novel methods have been designed to specifically assemble segmental duplications that leverage long reads, particularly accurate HiFi reads (27, 54). The SDip method has been shown to assemble diploid contigs for many duplicated genes such as SMN1 (54). As these methods develop further and more complete benchmarks for reference human genomes become available, it would be useful to compare the performance of reference-based and haplotype-aware assembly-based methods for segmental duplications.…”
Section: Discussion
mentioning
confidence: 99%
“…Similar to the collapsed approach, it works particularly well for human genomes when the heterozygosity rate is low, but fails in regions or genomes with high repeat and heterozygosity rates. However, the most promising uncollapsed approaches overcome these limitations by directly determining haplotype-specific overlaps in the overlap step of graph generation using SNP information from overlapping reads 74. The core idea is to preserve heterozygosity and repeat information from various data types in the graph space.…”
mentioning
confidence: 99%
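
The idea in the citation statement above, keeping only overlaps whose reads agree at single-nucleotide positions, can be illustrated with a toy sketch. This is not SDip's implementation: the function names, the `(i, j, offset)` overlap tuples, and the `max_mismatches` parameter are hypothetical, and the sketch assumes error-free reads, known overlap offsets, and no indels.

```python
def overlap_mismatches(read_a: str, read_b: str, offset: int) -> list[int]:
    """Positions in read_a where read_b, placed at read_a[offset], differs."""
    if not 0 <= offset < len(read_a):
        raise ValueError("offset must fall inside read_a")
    span = min(len(read_a) - offset, len(read_b))
    return [offset + i for i in range(span) if read_a[offset + i] != read_b[i]]


def haplotype_specific_overlaps(reads, candidate_overlaps, max_mismatches=0):
    """Keep candidate overlaps (i, j, offset) whose overlapping bases agree.

    With highly accurate reads, a single-nucleotide difference inside an
    overlap suggests the two reads come from different haplotypes or
    duplication copies, so that edge is left out of the overlap graph.
    """
    kept = []
    for i, j, offset in candidate_overlaps:
        if len(overlap_mismatches(reads[i], reads[j], offset)) <= max_mismatches:
            kept.append((i, j, offset))
    return kept


# Toy data: two haplotypes that differ by one SNP (G vs T at position 6).
hap1 = "ACGTACGTACGGA"
hap2 = "ACGTACTTACGGA"
reads = [hap1[0:8], hap1[4:12], hap2[4:12]]

# Both later reads overlap read 0 at offset 4; only the same-haplotype one is kept.
candidates = [(0, 1, 4), (0, 2, 4)]
edges = haplotype_specific_overlaps(reads, candidates)
print(edges)
```

Real long-read data would need alignment-based overlaps and some tolerance for sequencing errors, which is what a parameter like the hypothetical `max_mismatches` stands in for here.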
“…Standard tools use run-length encoding or base-level alignment 75 in the overlap step. Thus, a haplotype and repeat-aware overlap graph is generated with subsequent graph cleaning steps, finally reporting phased contigs. The recent invention of PacBio HiFi technology has made the diploid assembly process, that includes ordering as well as the phasing in the assembly process, easier 74. A whole generation of new algorithms based on uncollapsed approaches have become possible due to the availability of accurate long-read data and are implemented in tools such as HiFiasm (https://github.com/chhylp123/hifiasm), HiCanu 75, and SDip 74, producing contigs with lengths of several tens of Mb having base quality scores >Q50, but phased blocks of only a few hundreds of kb. In these systems, the field is moving towards accurate HiFi data using k-mer based strategies for haplotype-aware error correction of phased contigs, which can be completed in a few hours for human-scale genomes.…”
mentioning
confidence: 99%