Label-guided seed-chain-extend alignment on annotated De Bruijn graphs

Mustafa, Harun; Karasikov, Mikhail; Rätsch, Gunnar; Kahles, André

doi:10.1101/2022.11.04.514718

Cited by 2 publications

(4 citation statements)

References 101 publications


“…Then, the respective annotations are retrieved and the aggregated result is returned as output (Extended Data Figure 1c). For increased sensitivity, we developed algorithms for sequence-to-graph alignment 40,46, which identify the closest matching path in the whole graph (Extended Data Figure 1d; Methods). We also designed a batch query algorithm (schematic in Extended Data Figure 1e, Methods), exploiting the presence of k-mers shared between individual queries by forming a fast intermediate query subgraph, that increases throughput up to 100-fold for large repetitive queries (e.g., sets of sequencing reads).…”

Section: Results (mentioning)

confidence: 99%
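The batch-query idea in the excerpt above (looking up k-mers shared between queries only once) can be illustrated with a minimal sketch. This is not MetaGraph's implementation: the dictionary `toy_index`, the helper names, and the label strings are all hypothetical, and a plain Python dict stands in for the intermediate query subgraph.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def batch_query(queries: List[str],
                index: Dict[str, Set[str]],
                k: int) -> List[Set[str]]:
    """Annotate a batch of queries, looking up each distinct k-mer once.

    `index` is a hypothetical stand-in for an annotated graph:
    it maps a k-mer to the set of labels it carries.
    """
    # Collect every distinct k-mer across the whole batch.
    unique = {km for q in queries for km in kmers(q, k)}
    # One lookup per distinct k-mer; this cache plays the role of the
    # intermediate query structure (an assumption of this sketch).
    cache = {km: index.get(km, set()) for km in unique}
    # Aggregate the cached labels per query.
    results = []
    for q in queries:
        labels: Set[str] = set()
        for km in kmers(q, k):
            labels |= cache[km]
        results.append(labels)
    return results


toy_index = {
    "ACG": {"sampleA"},
    "CGT": {"sampleA", "sampleB"},
    "GTT": {"sampleC"},
}
hits = batch_query(["ACGT", "CGTT", "TTTT"], toy_index, k=3)
```

Here the k-mer `CGT` occurs in two queries but is looked up in `toy_index` only once; the saving grows with the amount of k-mer sharing in the batch.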

“…When label recombination is not desired, we support an alternative approach where queries are aligned to subgraphs of the joint graph induced by single annotation labels (columns of the annotation matrix). We call this approach label-consistent graph alignment (or alignment to columns) and is implemented by the MetaGraph-LA algorithm 46. However, instead of aligning to all the subgraphs independently, we perform the alignment with a single search procedure while keeping track of the annotations corresponding to the alignments.…”

Section: Methods (mentioning)

confidence: 99%
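The consistency condition described in the excerpt above can be sketched as a running intersection of label sets along a path. This only illustrates what "label-consistent" means; it is not the MetaGraph-LA search procedure, and the `annotation` dictionary and function name are hypothetical.

```python
from typing import Dict, Iterable, Set


def shared_labels(path_kmers: Iterable[str],
                  annotation: Dict[str, Set[str]]) -> Set[str]:
    """Return the labels carried by every k-mer on the path.

    An empty result means the path is not label-consistent: no single
    label (annotation column) contains all of its k-mers.
    """
    shared = None
    for km in path_kmers:
        labels = annotation.get(km, set())
        # Track the labels that are still common to all k-mers seen so far.
        shared = set(labels) if shared is None else shared & labels
        if not shared:
            return set()
    return shared if shared is not None else set()


# Hypothetical toy annotation: k-mer -> labels.
annotation = {
    "ACG": {"L1", "L2"},
    "CGT": {"L1"},
    "GTA": {"L2"},
}
consistent = shared_labels(["ACG", "CGT"], annotation)
recombined = shared_labels(["ACG", "CGT", "GTA"], annotation)
```

The first path lies entirely within label `L1`, so it is label-consistent. The second path needs both `L1` and `L2`, which is the kind of label recombination this approach excludes.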

“…When mapping to joint graphs, we only considered mapping results that retrieved the ground-truth label of each query read. For all granularities, we mapped the reads via both exact k-mer matching and label-consistent sequence-to-graph alignment using MetaGraph-LA 46. We measure how well the reads aligned as the percentage of characters in the query that are covered by at least one reported mapping.…”

Section: Methods (mentioning)

confidence: 99%
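The coverage measure in the excerpt above (the percentage of query characters covered by at least one reported mapping) can be sketched as follows. The interval representation is an assumption of this sketch: each mapping is taken to be a half-open `(start, end)` range of query positions, and the function name is hypothetical.

```python
from typing import List, Tuple


def covered_percentage(query_len: int,
                       mappings: List[Tuple[int, int]]) -> float:
    """Percentage of query positions covered by at least one mapping.

    Each mapping is a half-open interval (start, end) on the query.
    Overlapping mappings are counted only once per position.
    """
    if query_len == 0:
        return 0.0
    covered = [False] * query_len
    for start, end in mappings:
        # Clip to the query so out-of-range intervals cannot raise.
        for i in range(max(0, start), min(query_len, end)):
            covered[i] = True
    return 100.0 * sum(covered) / query_len


# Positions 0-5 and 8-9 are covered: 8 of 10 characters.
pct = covered_percentage(10, [(0, 4), (2, 6), (8, 10)])
```

Marking positions rather than summing mapping lengths is what keeps overlapping mappings from being counted twice.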


Indexing All Life’s Known Biological Sequences

Karasikov, Mustafa, Danciu, et al. 2020

Preprint

Self Cite

108

The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by an index and its query performance. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph indexes can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework's scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI's Sequence Read Archive, representing a total input of more than three petabases. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Notably, processing of data sets ranging from 1 TB of raw WGS reads to 20 TB of human RNA-sequencing data results in indexes whose memory footprints are small enough to host on standard desktop workstations. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes of over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 40,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. 
All indexes will be available for download and in the cloud. In total, indexes comprising more than 1 million sequencing records are available for download. As an example of our indexes' integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.
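The core of differential assembly as described in the abstract above (sequences present in a foreground set but absent from a background set) can be illustrated at the k-mer level with a set difference. This is a minimal sketch under strong simplifications: it works on plain k-mer sets rather than an annotated graph, it stops before the assembly of the remaining k-mers into contigs, and all function names and sequences are hypothetical.

```python
from typing import Iterable, Set


def kmer_set(seqs: Iterable[str], k: int) -> Set[str]:
    """Collect all distinct k-mers from a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def differential_kmers(foreground: Iterable[str],
                       background: Iterable[str],
                       k: int) -> Set[str]:
    """k-mers seen in the foreground samples but in no background sample."""
    return kmer_set(foreground, k) - kmer_set(background, k)


# Toy example: the foreground extends the background sequence by "AC".
fg_only = differential_kmers(["ACGTAC"], ["ACGT"], k=3)
```

In this toy case the k-mers `GTA` and `TAC` survive the subtraction; a real pipeline would then assemble such foreground-specific k-mers into contigs.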


“…Such exhaustive approaches, sans additional theoretical insights, are unlikely to achieve a significant breakthrough due to inherent computational burden of storing or operating over the representations. Many other tasks addressing similar issues include genome alignment [16, 18, 21] and error correction [1].…”

Section: Introduction (mentioning)

confidence: 99%

Prokrustean Graph: A substring index for rapid k-mer size analysis

Park, Koslicki 2023

Preprint

Despite the widespread adoption of k-mer-based methods in bioinformatics, a fundamental question persists: How to elucidate the structural transition of a k-mer set when the order switches to k′? Attaining a generalized answer has significant implications for k-mer-based methods where the influence of k has been empirically analyzed (e.g., in areas of assembly, genome comparison, etc.). We unravel the problem with a principle: k-mers and k′-mers can be grouped by their co-occurrences, and those in the same group behave similarly regardless of applications. This concept is embodied in a model, the Prokrustean graph, which embraces all similarity information of a given sequence set and has a channel to access k-mers of any k. This gives us a theoretical framework in which we can understand the presence and frequency of k-mers as k changes. Practically, a Prokrustean graph is a space-efficient data structure that can quickly be queried to extract essentially arbitrary information regarding k-mers with time complexity independent of k-mer size range. We provide a series of examples that perform in competitive time and space when compared to purpose-built tools that operate on a single k, such as KMC and GGCAT. For example, with large read sets, we can count all k-mers for k = 30 … 150 in ≃ 30 seconds in comparison to KMC requiring ≃ 30 seconds for a single k. Similarly, on long read sets, we can count all unitigs in ≃ 30 seconds for the entire range k = 30 … 50000, a task that is prohibitively burdensome for GGCAT. Our construction algorithm of Prokrustean graph utilizes an (extended) Burrows-Wheeler transform as input, is easily parallelizable, and operates in O(N) time where N is the cumulative input sequence length. We provide theoretical justification that the size of a Prokrustean graph is O(N), and in practice, is significantly smaller than N. All code and algorithms are publicly accessible at: https://github.com/KoslickiLab/prokrustean.
