2018
DOI: 10.1007/978-3-030-00479-8_13

The Colored Longest Common Prefix Array Computed via Sequential Scans

Abstract: Due to the increased availability of large datasets of biological sequences, tools for sequence comparison now rely to a greater extent on efficient alignment-free approaches. Most alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (c…

Cited by 4 publications (4 citation statements) · References 30 publications

“…However, the bitvectors of pairs of genomes from different species are recalcitrant to compression, even when the species are related: run-length encoding expands those files by a factor of two (Figure 3, insert in the left panel), while RRR expands most of them slightly (by a factor of 1.1) and manages to compress just a few pairs, with rate 1.25 (Figure 8 in the supplement). The same happens with pairs of artificial strings with controlled mutation rate (see Figures 16, 17 in the supplement). In some applications, including genome comparison, short matches are considered noise by the user, and the precise length of a match can be discarded safely as long as we keep track that at that position the match was short. Given an array MS_{S,T} and a user-defined threshold τ, let a thresholded matching statistics array MS_{S,T,τ} be such that MS_{S,T,τ}[i] = MS_{S,T}[i] if MS_{S,T}[i] ≥ τ, and MS_{S,T,τ}[i] equals an arbitrary (possibly negative) value smaller than τ otherwise.…”
Section: Compressing the MS Bitvector
confidence: 99%
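
A minimal sketch (not code from the cited paper) of the thresholding defined in the quote above: values of MS_{S,T} below the user threshold τ are replaced by a sentinel smaller than τ, since only the fact that the match was short needs to be preserved. The sentinel value -1 is an illustrative choice; per the definition, any value below τ would do.

# Thresholded matching statistics: keep values >= tau, collapse the rest
# to a sentinel smaller than tau (here -1, an arbitrary illustrative choice).
def threshold_ms(ms, tau, sentinel=-1):
    assert sentinel < tau, "sentinel must be smaller than the threshold"
    return [v if v >= tau else sentinel for v in ms]

# Example: with tau = 3, short matches collapse to the sentinel.
# threshold_ms([5, 1, 0, 4, 2], 3) -> [5, -1, -1, 4, -1]
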
“…We do not detect any clear difference in performance between the variants, with D being significantly smaller in some but not all cases (Figure 11 in the supplement). A detailed analysis of how the permutation schemes compare when varying the similarity between query and text is provided in Figures 16, 17 in the supplement. For pairs of genomes from human individuals, run-length encoding the original ms bitvector already brings its size down to approximately 4.5% of the original, and increasing τ shrinks the bitvectors to 2% of the input (Figure 3, right panel).…”
Section: Compressing the MS Bitvector
confidence: 99%
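
The quote above reports run-length encoding as the compression step applied to the ms bitvector. The following sketch illustrates that idea on a plain 0/1 list; the output format (a list of run lengths, normalized to start with the length of the initial run of 0s) is an assumption made here for illustration, not the exact representation used by the citing paper.

# Run-length encode a 0/1 bitvector as a list of run lengths.
from itertools import groupby

def rle(bits):
    runs = [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]
    if runs and runs[0][0] == 1:
        # Normalize so the sequence of lengths always starts with a 0-run
        # (possibly of length 0), making the bit values implicit.
        runs.insert(0, (0, 0))
    return [length for _, length in runs]

# Example: rle([0, 0, 1, 1, 1, 0, 1]) -> [2, 3, 1, 1]
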
“…Computing MS_{S,T} is a classical problem in string processing, and in practice it involves building an index on a fixed T to answer a large number of queries S. Thus, solutions typically differ on the index they use, which can be the textbook suffix tree, the compressed suffix tree (Ohlebusch et al., 2010) or compressed suffix array, the colored longest common prefix array (Garofalo et al., 2018), a Burrows–Wheeler index combined with the suffix tree topology (Belazzougui and Cunial, 2014; Belazzougui et al., 2018), or the r-index combined with balanced grammars (Boucher et al., 2021). In the frequent case where T consists of one genome (or proteome), or of the concatenation of a few similar genomes or of many dissimilar genomes, the Burrows–Wheeler transform of T does not compress well, and the best space–time tradeoffs are achieved by the implementation in Belazzougui et al. (2018) (see Boucher et al., 2021 for a runtime comparison, and see Supplementary Fig.…”
Section: Introduction
confidence: 99%
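
For context, the matching statistics that all of the indexes listed in the quote above are built to compute can be stated as a brute-force scan: MS_{S,T}[i] is the length of the longest prefix of S[i:] occurring anywhere in T. The quadratic-time sketch below is for illustration only and is not the method of the cLCP paper or of any of the cited works; those indexes exist precisely to avoid this kind of scan.

# Naive matching statistics: for each position i of S, the length of the
# longest prefix of S[i:] that occurs as a substring of T.
def matching_statistics(S, T):
    ms = []
    for i in range(len(S)):
        length = 0
        while i + length < len(S) and S[i:i + length + 1] in T:
            length += 1
        ms.append(length)
    return ms

# Example: matching_statistics("banana", "ananas") -> [0, 5, 4, 3, 2, 1]
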