Motivation: The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources.

Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression.

Availability: PgRC can be downloaded from https://github.com/kowallus/PgRC.

Contact: tomasz.kowalski@p.lodz.pl

ORCOM (Grabowski et al., 2015) scans the input reads and distributes them into buckets. Its key concept, however, is to use so-called minimizers (Roberts et al., 2004) for the bucket labels. A minimizer of length p for a read R of length m is the lexicographically smallest of the (m − p + 1) p-mers of R. A canonical minimizer, which is actually used by ORCOM, is a minimizer taken over the read and its reverse-complemented form (a sketch of this computation is given at the end of this section). Two reads with a large overlap are likely to share the same (canonical or non-canonical) minimizer and thus land in the same bucket. The contents of each bucket are compressed separately, with sorting of the reads by their minimizer's position, careful modeling of mismatches and other minor improvements, combined with arithmetic coding or PPMd (context-based) compression applied to several resulting data streams. The compression ratio achieved by ORCOM on a 134 Gbp human genome sequencing dataset was 0.317 bits per base, improving on BEETL's result of 0.518 bits per base.

Mince (Patro and Kingsford, 2015) is a related algorithm, but its distribution of reads into buckets is based on the number of shared k-mers. More precisely, a read R is assigned to the bucket which maximizes the number of k-mers of R occurring in any read the bucket contains. Its compression ratio is in most cases a few percent higher than ORCOM's (see, e.g., the extensive comparisons in (Liu et al., 2018)), but it is less efficient in terms of time and memory usage.

FaStore (Roguski et al., 2018) also follows the ORCOM approach, but improves its compression ratio (typically by a factor of about 1.2), mostly thanks to re-distribution of reads from the buckets and assembling reads into contigs; in other words, it allows merging of similar clusters of reads. FaStore also boasts good performance (decompression speed exceeding 100 MB/s, and even 250 MB/s in one of the modes, using 8 threads) and several lossy modes for the quality and header streams.

HARC (Chandak et al., 2018a) resigns from disk-based bucketing in favor of succinct in-memory hash tables. Its basic idea is to find maximum overlaps between reads and create consensus sequences, using majority voting ...
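To make the minimizer-based bucket labeling concrete, the following minimal C++ sketch computes a canonical minimizer of a read: the lexicographically smallest p-mer taken over the read and its reverse complement. It only illustrates the definition given above; the parameter value, the naive quadratic scan and the handling of non-ACGT symbols are choices made here for brevity, not ORCOM's actual implementation.

    #include <algorithm>
    #include <iostream>
    #include <string>

    // Reverse complement of a DNA string (A<->T, C<->G).
    std::string reverseComplement(const std::string& seq) {
        std::string rc(seq.rbegin(), seq.rend());
        for (char& c : rc) {
            switch (c) {
                case 'A': c = 'T'; break;
                case 'T': c = 'A'; break;
                case 'C': c = 'G'; break;
                case 'G': c = 'C'; break;
                default: break;  // leave N and other symbols untouched
            }
        }
        return rc;
    }

    // Lexicographically smallest p-mer of a read of length m
    // (there are m - p + 1 candidates).
    std::string minimizer(const std::string& read, std::size_t p) {
        std::string best = read.substr(0, p);
        for (std::size_t i = 1; i + p <= read.size(); ++i)
            best = std::min(best, read.substr(i, p));
        return best;
    }

    // Canonical minimizer: minimizer over the read and its
    // reverse-complemented form, used as an ORCOM-style bucket label.
    std::string canonicalMinimizer(const std::string& read, std::size_t p) {
        return std::min(minimizer(read, p),
                        minimizer(reverseComplement(read), p));
    }

    int main() {
        std::string read = "ACGTTGCATGACG";
        std::cout << canonicalMinimizer(read, 4) << '\n';  // bucket label
    }

Reads sharing a large overlap tend to produce the same label, so hashing or sorting on this string groups them into the same bucket.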
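The majority-voting step mentioned for HARC can be illustrated in a similarly small sketch. It assumes the reads have already been placed at offsets on a common layout (in HARC this placement results from its overlap search); the data structures and the example offsets below are purely illustrative and are not HARC's code.

    #include <algorithm>
    #include <array>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Build a consensus sequence by majority voting over reads that have
    // already been placed at offsets on a common layout.
    std::string majorityConsensus(
            const std::vector<std::pair<std::size_t, std::string>>& placed) {
        std::size_t length = 0;
        for (const auto& [offset, read] : placed)
            length = std::max(length, offset + read.size());

        const std::string alphabet = "ACGTN";
        std::vector<std::array<int, 5>> counts(length, std::array<int, 5>{});
        for (const auto& [offset, read] : placed)
            for (std::size_t i = 0; i < read.size(); ++i) {
                auto symbol = alphabet.find(read[i]);
                if (symbol != std::string::npos)
                    ++counts[offset + i][symbol];
            }

        std::string consensus(length, 'N');
        for (std::size_t pos = 0; pos < length; ++pos) {
            int best = 0;  // pick the most frequent symbol at this column
            for (int s = 1; s < 5; ++s)
                if (counts[pos][s] > counts[pos][best]) best = s;
            consensus[pos] = alphabet[best];
        }
        return consensus;
    }

    int main() {
        // Three overlapping reads; one base of the second read disagrees
        // with the others and is outvoted.
        std::vector<std::pair<std::size_t, std::string>> reads = {
            {0, "ACGTAC"}, {2, "GTGCGT"}, {4, "ACGTTT"}};
        std::cout << majorityConsensus(reads) << '\n';  // ACGTACGTTT
    }

Once such a consensus is formed, each read can be encoded compactly as its offset in the consensus plus its few mismatching positions.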