2018
DOI: 10.1007/978-3-030-00479-8_13

The Colored Longest Common Prefix Array Computed via Sequential Scans

Abstract: Due to the increased availability of large datasets of biological sequences, tools for sequence comparison now rely to a greater extent on efficient alignment-free approaches. Most alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (c…

Cited by 4 publications (4 citation statements) · References 30 publications

“…However, the bitvectors of pairs of genomes from different species are recalcitrant to compression, even when the species are related: run-length encoding expands those files by a factor of two (Figure 3, insert in the left panel), while RRR expands most of them slightly (by a factor of 1.1) and manages to compress just a few pairs, with rate 1.25 (Figure 8 in the supplement). The same happens with pairs of artificial strings with controlled mutation rate (see Figures 16, 17 in the supplement). In some applications, including genome comparison, short matches are considered noise by the user, and the precise length of a match can be discarded safely as long as we keep track that at that position the match was short. Given an array MS_{S,T} and a user-defined threshold τ, let a thresholded matching statistics array MS_{S,T,τ} be such that MS_{S,T,τ}[i] = MS_{S,T}[i] if MS_{S,T}[i] ≥ τ, and MS_{S,T,τ}[i] equals an arbitrary (possibly negative) value smaller than τ otherwise.…”
Section: Compressing the MS Bitvector
confidence: 99%
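
A minimal sketch (not code from the cited paper) of the thresholding defined in the quote above: values of MS_{S,T} below the user threshold τ are replaced by a sentinel smaller than τ, since only the fact that the match was short needs to be preserved. The sentinel value -1 is an illustrative choice; per the definition, any value below τ would do.

# Thresholded matching statistics: keep values >= tau, collapse the rest
# to a sentinel smaller than tau (here -1, an arbitrary illustrative choice).
def threshold_ms(ms, tau, sentinel=-1):
    assert sentinel < tau, "sentinel must be smaller than the threshold"
    return [v if v >= tau else sentinel for v in ms]

# Example: with tau = 3, short matches collapse to the sentinel.
# threshold_ms([5, 1, 0, 4, 2], 3) -> [5, -1, -1, 4, -1]
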
“…We do not detect any clear difference in performance between the variants, with D being significantly smaller in some but not all cases (Figure 11 in the supplement). A detailed analysis of how the permutation schemes compare when varying the similarity between query and text is provided in Figures 16, 17 in the supplement. For pairs of genomes from human individuals, run-length encoding the original ms bitvector already brings its size down to approximately 4.5% of the original, and increasing τ shrinks the bitvectors to 2% of the input (Figure 3, right panel).…”
Section: Compressing the MS Bitvector
confidence: 99%
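
The quote above reports run-length encoding as the compression step applied to the ms bitvector. The following sketch illustrates that idea on a plain 0/1 list; the output format (a list of run lengths, normalized to start with the length of the initial run of 0s) is an assumption made here for illustration, not the exact representation used by the citing paper.

# Run-length encode a 0/1 bitvector as a list of run lengths.
from itertools import groupby

def rle(bits):
    runs = [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]
    if runs and runs[0][0] == 1:
        # Normalize so the sequence of lengths always starts with a 0-run
        # (possibly of length 0), making the bit values implicit.
        runs.insert(0, (0, 0))
    return [length for _, length in runs]

# Example: rle([0, 0, 1, 1, 1, 0, 1]) -> [2, 3, 1, 1]
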
“…Computing MS_{S,T} is a classical problem in string processing, and in practice it involves building an index on a fixed T to answer a large number of queries S. Thus, solutions typically differ on the index they use, which can be the textbook suffix tree, the compressed suffix tree (Ohlebusch et al., 2010) or compressed suffix array, the colored longest common prefix array (Garofalo et al., 2018), a Burrows–Wheeler index combined with the suffix tree topology (Belazzougui and Cunial, 2014; Belazzougui et al., 2018), or the r-index combined with balanced grammars (Boucher et al., 2021). In the frequent case where T consists of one genome (or proteome), or of the concatenation of a few similar genomes or of many dissimilar genomes, the Burrows–Wheeler transform of T does not compress well, and the best space–time tradeoffs are achieved by the implementation in Belazzougui et al. (2018) (see Boucher et al., 2021 for a runtime comparison, and see Supplementary Fig.…”
Section: Introduction
confidence: 99%
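
For context, the matching statistics that all of the indexes listed in the quote above are built to compute can be stated as a brute-force scan: MS_{S,T}[i] is the length of the longest prefix of S[i:] occurring anywhere in T. The quadratic-time sketch below is for illustration only and is not the method of the cLCP paper or of any of the cited works; those indexes exist precisely to avoid this kind of scan.

# Naive matching statistics: for each position i of S, the length of the
# longest prefix of S[i:] that occurs as a substring of T.
def matching_statistics(S, T):
    ms = []
    for i in range(len(S)):
        length = 0
        while i + length < len(S) and S[i:i + length + 1] in T:
            length += 1
        ms.append(length)
    return ms

# Example: matching_statistics("banana", "ananas") -> [0, 5, 4, 3, 2, 1]
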