Gap Filling as Exact Path Length Problem

Salmela, Leena; Sahlin, Kristoffer; Mäkinen, Veli; Tomescu, Alexandru I.

doi:10.1007/978-3-319-16706-0_29

Cited by 6 publications

(4 citation statements)

References 20 publications


“…With some optimizations however, the above algorithm can be accelerated. As observed by Salmela et al [39] in the context of gap-filling problem, we expect d₂ ≪ |V|, therefore, it should be possible to compute a sub-graph containing vertices within ≤ d₂/2 distance from v₁ or v₂, before solving the recurrence. While this strategy was shown to be effective for gap-filling between assembled contigs, the count of vertex pairs to evaluate during read mapping process is expected to be significantly higher for large read sets.…”

Section: A Pseudo-polynomial Time Algorithm (mentioning)

confidence: 76%
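
The pruning idea in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names `bounded_bfs` and `prune_for_query`, the adjacency-dict graph representation, and the use of hop counts as distances are all assumptions made for the example. It relies on the observation that any vertex on a walk of length d from v₁ to v₂ lies within ⌊d/2⌋ hops forward of v₁ or within ⌊d/2⌋ hops backward of v₂.

```python
from collections import deque


def bounded_bfs(adj, source, limit):
    """Return {vertex: hop distance} for vertices within `limit` hops of `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue  # do not expand beyond the hop limit
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def prune_for_query(adj, v1, v2, d):
    """Hypothetical helper: keep only vertices that could lie on a length-d walk.

    `adj` maps each vertex to a list of its out-neighbours (an assumed format).
    """
    # Build the reversed graph so we can search backward from v2.
    reverse = {}
    for u, nbrs in adj.items():
        for w in nbrs:
            reverse.setdefault(w, []).append(u)

    half = d // 2
    keep = set(bounded_bfs(adj, v1, half)) | set(bounded_bfs(reverse, v2, half))
    # Induced subgraph on the kept vertices.
    return {u: [w for w in adj.get(u, ()) if w in keep] for u in keep}


graph = {
    "a": ["b", "x"],
    "b": ["c"],
    "c": ["d"],
    "x": ["y"],
    "y": ["z"],
    "z": [],
}
pruned = prune_for_query(graph, "a", "d", 3)
```

In this toy graph the branch through `y` and `z` is dropped because it is more than one hop from either endpoint, so the recurrence would only need to be solved on five vertices.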

“…The exact-path length problem determines if a path of a specified distance exists between two vertices in a weighted graph. An extension of this problem, referred to as the gap-filling problem [39], has been explored in the context of genome assembly using paired-end or mate pair read sets. Although the exact-path length problem has been shown to be NP-complete [33], we will demonstrate a simple and practical polynomial-time algorithm for our problem with unweighted edges.…”

Section: Related Problems In Graph Theory (mentioning)

confidence: 99%
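
For the unweighted case mentioned in the statement above, one simple way to test for a path of an exact length is to advance a frontier of vertices one edge at a time. The sketch below is only an illustration under stated assumptions, not the algorithm of the citing paper: the name `has_walk_of_length` and the adjacency-dict format are hypothetical, and "path" is taken to mean a walk that may revisit vertices. Each step touches every edge at most once, so the cost is O(length · |E|).

```python
def has_walk_of_length(adj, source, target, length):
    """Return True if some walk with exactly `length` edges goes from source to target.

    `adj` maps each vertex to a list of out-neighbours (assumed format).
    Walks may repeat vertices and edges.
    """
    frontier = {source}  # vertices reachable by walks of exactly 0 edges
    for _ in range(length):
        # Advance every vertex in the frontier by one edge.
        frontier = {w for u in frontier for w in adj.get(u, ())}
        if not frontier:
            return False  # no walk of this length exists at all
    return target in frontier


graph = {0: [1, 2], 1: [3], 2: [4], 4: [3], 3: []}
two_edges = has_walk_of_length(graph, 0, 3, 2)    # via 0 -> 1 -> 3
three_edges = has_walk_of_length(graph, 0, 3, 3)  # via 0 -> 2 -> 4 -> 3
four_edges = has_walk_of_length(graph, 0, 3, 4)   # no such walk
```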


Validating Paired-end Read Alignments in Sequence Graphs

Jain, Zhang, Dilthey, et al. 2019

Preprint


Graph based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence to graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. Effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph, and a pan-genome de-Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.

Acknowledgements: The authors thank Abdurrahman Yasar, Siva Rajamanickam and Srinivas Eswar for sharing their insights on sparse matrix manipulations.

Problem Formulation. Definition 1. Sequence Graph: A sequence graph G(V, E) is a directed graph with vertices V and edges E, where each vertex v ∈ V is labeled with a character from alphabet Σ.


“…The number of gaps could also be efficiently reduced by FGAP [22], which aligned long reads to the gaps using BLAST algorithm [23]. More tools modified the algorithm and extended for different purposes [24][25][26][27][28]. However, most tools mentioned above share the same crucial shortcoming: they only accept pre-error-corrected long reads or alternative assembled contigs.…”

Section: Problems In Current Tgs Assemblies and Tgs Gap-closing Tools (mentioning)

confidence: 99%

TGS-GapCloser: fast and accurately passing through the Bermuda in large genome using error-prone third-generation long reads

Guo et al. 2019

Preprint


The completeness and accuracy of genome assemblies determine the quality of subsequent bioinformatics analysis. Despite benefiting from the medium/long-range information of third-generation sequencing techniques, current gap-closing tools to enhance assemblies suffer from multi-alignments and high error rates, resulting in huge time and money costs. We developed a software tool, TGS-GapCloser, that uses low-depth (>=10X) single-molecule sequencing long reads without any error correction to close gaps. The algorithm distinguishes gap regions from the alignments of long reads against original scaffolds, corrects only the candidate regions, and assigns the best sequences to each gap. We demonstrate that TGS-GapCloser improves the contig N50 value of draft assembly by 25-fold on average, updating above 90% gaps with 93.96% positive predictive value. Despite the high error rate of raw long reads, improved assemblies achieve Q50 (99.999%) single-base accuracy with only 11.8% decrement to inputs. Besides, it could complete more gaps, and is also ~29-fold faster than mainstream gap-closing tools. BUSCO analysis revealed that 3.4%-13.1% more expected genes were complete. TGS-GapCloser also shows its power to fill gaps for ultra large genome assembly of ginkgo (~12Gb) with 71.6% of gaps closed. The validation of inserted or merged gap sequences was conducted with NGS reads and reference genomes, respectively. The updated genome assemblies may promote gene annotation and structural variant calling, and thus improve the downstream analysis of ontogeny, phylogeny, and evolution.


“…about gene content) or perform comparative genomic analysis. When mate pairs are available, contigs can be fed to later assembly stages, such as scaffolding [34,2,20] and then gap filling [35,3].…”

Section: Introduction (mentioning)

confidence: 99%

Safe and Complete Contig Assembly Via Omnitigs

Tomescu, Medvedev 2016

Lecture Notes in Computer Science

Self Cite


Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs: a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question remains: given a genome graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we answer this question using a model where the genome is a circular covering walk. We also give a polynomial time algorithm to find such strings, which we call omnitigs. Our experiments show that omnitigs are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.
