Guillermo Dufort y Álvarez scite author profile

Seroussi

Smircich

et al. 2020

Motivation: The amount of genomic data generated globally is seeing explosive growth, leading to increasing needs for processing, storage, and transmission resources, which motivates the development of efficient compression tools for these data. Work so far has focused mainly on the compression of data generated by short-read technologies. However, nanopore sequencing technologies are rapidly gaining popularity due to the advantages offered by the large increase in the average size of the produced reads, the reduction in their cost, and the portability of the sequencing technology. We present ENANO, a novel lossless compression algorithm especially designed for nanopore sequencing FASTQ files. Results: The main focus of ENANO is on the compression of the quality scores, as they dominate the size of the compressed file. ENANO offers two modes, Maximum Compression and Fast (default), which trade off compression efficiency and speed. We tested ENANO, the current state-of-the-art compressor SPRING, and the general compressor pigz, on several publicly available nanopore datasets. The results show that the proposed algorithm consistently achieves the best compression performance (in both modes) on every considered nanopore dataset, with an average improvement over pigz and SPRING of more than 24.7% and 6.3%, respectively. In addition, in terms of encoding and decoding speeds, ENANO is 2.9 and 1.7 times faster than SPRING, respectively, with memory consumption up to 0.2 GB. Availability: ENANO is freely available for download at: https://github.com/guilledufort/EnanoFASTQ Supplementary information: Supplementary data are available at Bioinformatics online.

RENANO: a REference-based compressor for NANOpore FASTQ files

Seroussi

Smircich

et al. 2021

Motivation: Nanopore sequencing technologies are rapidly gaining popularity, in part, due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in less than 72 hours). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. Results: We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient base call sequence compression component. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor; (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. RENANO improves the base call sequence compression of ENANO by 39.8% in scenario (1), and by 33.5% in scenario (2), on average, over all the datasets. As for total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip. Availability: RENANO is freely available for download at: https://github.com/guilledufort/RENANO Supplementary information: Supplementary data are available at Bioinformatics online.

RENANO: a REference-based compressor for NANOpore FASTQ files

Seroussi

Smircich

et al. 2021

Preprint

Nanopore sequencing technologies are rapidly gaining popularity, in part, due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in less than 72 hours). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. Unlike short-read technologies, nanopore sequencing generates long, noisy reads of variable length. In this note we introduce RENANO, a reference-based lossless compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO builds on the recent compressor ENANO, which is currently the state of the art. It focuses on improving the compression of the base call sequence portion of the FASTQ file, leaving the other parts of ENANO intact. Two novel reference-based compression algorithms are introduced, covering different scenarios: in the first scenario, a reference genome is available without cost to both the compressor and the decompressor; in the second, the reference genome is available only on the compressor side, and a compacted version of the reference is transmitted to the decompressor as part of the compressed file. To evaluate the proposed algorithms, we compare RENANO against ENANO on several publicly available nanopore datasets. In the first scenario, RENANO improves the base call sequence compression of ENANO by 40.8%, on average, over all the datasets. As for total compression (including the other parts of the FASTQ file), the average improvement is 13.1%. In the second scenario, the base call compression improvements of RENANO over ENANO range from 15.2% to 49.0%, depending on the coverage of the compressed dataset, while in terms of total size, the improvements range from 5.1% to 16.5%.

Wireless EEG System Achieving High Throughput and Reduced Energy Consumption Through Lossless and Near-Lossless Compression

IEEE Trans. Biomed. Circuits Syst.

Favaro

Lecumberry

et al. 2018

This work presents a wireless multichannel electroencephalogram (EEG) recording system featuring lossless and near-lossless compression of the digitized EEG signal. Two novel, low-complexity, efficient compression algorithms were developed and tested on a low-power platform. The algorithms were tested on six public EEG databases, comparing favorably with the best compression rates reported to date in the literature. In its lossless mode, the platform is capable of encoding and transmitting 59-channel EEG signals, sampled at 500 Hz and 16 bits per sample, at a current consumption of 337 μA per channel; this comes with a guarantee that the decompressed signal is identical to the sampled one. The near-lossless mode allows for significant energy savings and/or higher throughputs in exchange for a small guaranteed maximum per-sample distortion in the recovered signal. Finally, we address the tradeoff between computation cost and transmission savings by evaluating three alternatives: sending raw data, or encoding with one of two compression algorithms that differ in complexity and compression performance. We observe that the higher the throughput (number of channels and sampling rate), the larger the benefits obtained from compression.
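The "guaranteed maximum per-sample distortion" of a near-lossless mode can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not describe; it is a plain uniform quantizer with bins of width 2δ + 1, a common way to bound the per-sample error by δ. The function names and the sample values are assumptions made for illustration.

```python
def quantize(samples, delta):
    """Map integer samples to bin indices; each bin is 2*delta + 1 wide."""
    step = 2 * delta + 1
    # Adding delta before floor division centres every bin on a multiple of step.
    return [(x + delta) // step for x in samples]


def reconstruct(indices, delta):
    """Return the centre of each bin, which is within delta of the original sample."""
    step = 2 * delta + 1
    return [q * step for q in indices]


# Hypothetical 16-bit EEG samples, not taken from any dataset.
samples = [-32768, -7, -1, 0, 3, 100, 32767]
delta = 2
recovered = reconstruct(quantize(samples, delta), delta)
max_error = max(abs(a - b) for a, b in zip(samples, recovered))
```

With `delta = 0` the step is 1 and the round trip is exact, which corresponds to the lossless case. Larger values of `delta` reduce the number of distinct symbols an entropy coder has to represent, at the cost of a bounded error.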

Nanopore quality score resolution can be reduced with little effect on downstream analysis

Rivara-Espasandín

Balestrazzi

et al. 2022

Motivation: The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files. Results: We polished de novo assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, under various coverage level scenarios, and various quality score quantizers. In all cases we found that the quantization of quality scores makes little difference to (and sometimes even improves) the results obtained with the original (non-quantized) data. This suggests that the precision that is currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, like gzip, yields large storage space savings after quantization of quality scores. Availability: Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer Supplementary information: Available at https://github.com/mrivarauy/QS-Quantizer
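As a rough illustration of what a quality score quantizer does, the sketch below maps each Phred score in a FASTQ quality string to one of a few representative levels. It is not taken from the QS-Quantizer repository: the function name, the threshold of 7, and the levels 5 and 15 are assumptions chosen only for the example. The Phred+33 ASCII offset is the standard FASTQ encoding.

```python
import bisect


def quantize_quality(qual, thresholds, levels, offset=33):
    """Quantize a FASTQ quality string encoded as Phred+offset ASCII.

    thresholds: sorted bin boundaries (hypothetical values).
    levels: one representative score per bin, so len(levels) == len(thresholds) + 1.
    """
    if len(levels) != len(thresholds) + 1:
        raise ValueError("need exactly one more level than thresholds")
    out = []
    for ch in qual:
        score = ord(ch) - offset
        # bisect_right gives the index of the bin containing this score;
        # a score equal to a threshold falls in the upper bin.
        bin_index = bisect.bisect_right(thresholds, score)
        out.append(chr(levels[bin_index] + offset))
    return "".join(out)


# Two-level example: scores below 7 become 5, all others become 15.
quantized = quantize_quality("!(+5I", thresholds=[7], levels=[5, 15])
```

Here the input scores 0, 7, 10, 20, and 40 become 5, 15, 15, 15, and 15, so `quantized` is `"&0000"`. The output uses only two distinct symbols, which is why a general-purpose compressor such as gzip can store quantized quality strings in far less space.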