2021
DOI: 10.1101/2021.07.17.452767
Preprint

CoLoRd: Compressing long reads

Abstract: The costs of maintaining exabytes of data produced by sequencing experiments every year have become a major issue in today's genomics. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

Cited by 4 publications (7 citation statements)
References 33 publications
“…As shown in Section 3, all the experimented pipelines yield very similar performance on the original and quantized versions of each data set. These results complement and reinforce the aforementioned conclusion in [9], indicating that the effect of information loss caused by quality score quantization is not significant in practice. For the assembly polishing of a mock microbial community, setting all quality scores to the fixed value 10 results in a number of mismatches that is, on average over three independent runs, less than 1.2% higher than that obtained with the original data.…”
Section: Introduction; citation type: supporting
confidence: 89%
“…Nanopore sequencing, however, is a much more recent technology and few specific data compressors suitable for nanopore data are available, developed by our group [6,7] and others [8,9]. Moreover, the lossy compression of quality scores for nanopore data has only been explored in [9], where the impact of quality score information loss is assessed for some downstream analyses. Specifically, in [9], it is shown that this information loss has little or no impact on the construction of consensus sequences with Racon [10] for long CHM13 reads, either for HiFi or for Nanopore data.…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…For the analysis of Tuberculosis data, a newer version of the SPLASH pipeline was run, SPLASH2 (26), to generate the contingency tables. Then, the same optimization procedures were run to generate the optimized p-value bounds.…”
Section: E Tuberculosis Additional Plots; citation type: mentioning
confidence: 99%