Algorithms in Stringomics (I): Pattern-Matching against “Stringomes”

Ferragina, Paolo; Mishra, Bud

doi:10.1101/001669

Cited by 6 publications

(9 citation statements)

References 31 publications


“…Prior to the predominant prefix-sorting approach that we are going to discuss in detail in the next subsections, the problem of solving indexed path queries on labeled graphs has been tackled in the literature by resorting to geometric data structures [61,62]. These solutions work in the hypertext model: the objects being indexed are node-labeled graphs G = (V, E, Σ, λ), where function λ : V → Σ* associates a string to each node (note the difference with our edge-labeled model, where each edge is labeled with a single character).…”

Section: Hypertext Indexing (mentioning)

confidence: 99%

“…This labeled graph model is well suited for applications where the strings labeling each node are very long (for example, a transcriptome), in which case the label component (rather than the graph's topology) dominates the data structure's space. Both solutions discussed in Reference [61,62] resort to geometric data structures. First, a classic text index (for example, a compressed suffix array) is built over the concatenation λ(u_1) · # · ⋯ · # · λ(u_n) of the strings labeling all the graph's nodes u_1, .…”

Section: Hypertext Indexing (mentioning)

confidence: 99%
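
The first step described in the statement above (a text index over the `#`-separated concatenation of node labels) can be sketched as follows. This is a minimal illustration, not the construction of [61,62]: it assumes a plain, uncompressed suffix array built by naive sorting in place of a compressed one, and the names `build_index`, `locate`, and `SEP` are hypothetical. Because the separator never occurs in a pattern, every match it reports lies entirely inside one node label.

```python
from bisect import bisect_left, bisect_right

SEP = "#"  # assumed separator; must not occur in any label or pattern


def build_index(labels):
    """Index the concatenation label_1 # label_2 # ... # label_n.

    Returns (text, suffix_array, starts), where starts[i] is the
    position in text at which the label of node i begins.
    """
    text = SEP.join(labels)
    starts, pos = [], 0
    for label in labels:
        starts.append(pos)
        pos += len(label) + 1  # +1 skips the separator
    # Naive suffix array: sort suffix start positions lexicographically.
    # Fine for a toy example; real indexes use linear-time construction.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, starts


def locate(index, pattern):
    """Return sorted (node, offset) pairs for occurrences inside one node."""
    if not pattern or SEP in pattern:
        raise ValueError("pattern must be non-empty and free of the separator")
    text, suffix_array, starts = index
    m = len(pattern)
    # Length-m prefixes of sorted suffixes are themselves sorted,
    # so the matching suffixes form one contiguous range.
    keys = [text[i:i + m] for i in suffix_array]
    lo, hi = bisect_left(keys, pattern), bisect_right(keys, pattern)
    hits = []
    for i in suffix_array[lo:hi]:
        node = bisect_right(starts, i) - 1  # node whose label contains i
        hits.append((node, i - starts[node]))
    return sorted(hits)
```

With the labels `["ACGT", "GTAC", "CGTT"]`, `locate` finds `"GT"` in all three nodes, while `"TG"` (which would need the end of one label followed by the start of another) is not found at all. Handling such occurrences requires the graph's edges, which this index ignores.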


Subpath Queries on Compressed Graphs: A Survey

Prezza

2021

Algorithms


Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately making it possible to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.



“…Prior to the predominant prefix-sorting approach that we are going to discuss in detail in the next subsections, the problem of solving indexed path queries on labeled graphs has been tackled in the literature by resorting to geometric data structures [61,62]. These solutions work in the hypertext model, where the graph's nodes are long strings and directed edges between those strings indicate how they are connected according to an arbitrarily complicated topology.…”

Section: Hypertext Indexing (mentioning)

confidence: 99%

“…Pattern occurrences entirely contained in a single node are instead matched using a standard compressed index like the ones discussed in Section 2.1. The main issue with these solutions is that they cannot efficiently locate pattern occurrences spanning two or more edges; the solutions proposed in [61,62], based on seed-and-extend, require to visit the whole graph in the worst case (even though on realistic datasets they do work well). In practice, the problem is mitigated by the fact that the strings stored in each node are assumed to be very long.…”

Section: Hypertext Indexing (mentioning)

confidence: 99%
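
To make the notion of an occurrence that crosses edges concrete, here is a brute-force sketch in the spirit of seed-and-extend. It is not the geometric approach of [61,62]: it assumes an adjacency-list graph with non-empty node labels, and the name `spanning_occurrences` is hypothetical. A seed is a proper prefix of the pattern that ends a node's label; the extension then walks outgoing edges to read the rest of the pattern.

```python
def spanning_occurrences(labels, edges, pattern):
    """Return sorted (node, offset) pairs for occurrences crossing >= 1 edge.

    labels: list of non-empty strings, one per node (assumption).
    edges:  dict mapping a node id to a list of successor node ids.
    """

    def extend(node, rest):
        # Can `rest` be read starting at the first character of `node`?
        label = labels[node]
        if len(rest) <= len(label):
            return label.startswith(rest)
        if not rest.startswith(label):
            return False
        # The whole label is consumed; continue along outgoing edges.
        # Labels are non-empty, so `rest` shrinks and recursion terminates.
        tail = rest[len(label):]
        return any(extend(v, tail) for v in edges.get(node, ()))

    hits = []
    for u, label in enumerate(labels):
        for k in range(1, len(pattern)):
            # Seed: the first k pattern characters end the label of u.
            if len(label) >= k and label.endswith(pattern[:k]):
                # Extend: the remainder must be readable from a successor.
                if any(extend(v, pattern[k:]) for v in edges.get(u, ())):
                    hits.append((u, len(label) - k))
    return sorted(hits)
```

With labels `["ACGT", "GTAC", "CGTT"]` and edges `{0: [1, 2], 1: [2]}`, the pattern `"TG"` is found starting at offset 3 of node 0. Every node is tried as a seed and the extension may follow many paths, which is consistent with the worst-case cost of visiting the whole graph noted in the statement above.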

Subpath Queries on Compressed Graphs: a Survey

Prezza

2020

Preprint


Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query's length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in a space proportional to the text's entropy. These contributions had an enormous impact in bioinformatics: nowadays, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately making it possible to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today's compressed indexes for labeled graphs and regular languages.


“…A graph is a diagram that allows any kind of genetic variant, large or small, to be represented as a path through the space of possible ways of gluing sequences together to form a genome. Past studies have considered the question of how to construct and represent genome graphs in a way that is both memory efficient and fast to query (18,41). Recent software tools make it easy to construct such graphs from genome sequences (43).…”

Section: Sequence Differences From the Reference (mentioning)

confidence: 99%

Alignment of Next-Generation Sequencing Reads

Reinert, Langmead, Weese, et al.

2015

Annu. Rev. Genom. Hum. Genet.

108


High-throughput DNA sequencing has considerably changed the possibilities for conducting biomedical research by measuring billions of short DNA or RNA fragments. A central computational problem, and for many applications a first step, consists of determining where the fragments came from in the original genome. In this article, we review the main techniques for generating the fragments, the main applications, and the main algorithmic ideas for computing a solution to the read alignment problem. In addition, we describe pitfalls and difficulties connected to determining the correct positions of reads.

