Compression of nucleotide databases for fast searching

Williams, Hugh E.; Zobel, Justin

doi:10.1093/bioinformatics/13.5.549

Cited by 16 publications

(12 citation statements)

References 15 publications

“…Elias coding [2] is a non-parameterised method of coding integers that is, for example, used in large text database indexes [8] and specialist applications [10,11]. Elias coding, like the other schemes described in this paper, allows unambiguous coding of integers and does not require separators between each integer of a stored array.…”

Section: Non-parameterised Variable-bit Coding (mentioning)

confidence: 99%
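To illustrate the property described in the statement above (unambiguous codes that need no separators between integers), here is a minimal sketch of Elias gamma coding. The function names are hypothetical and bit strings are used for readability; a real index would pack the bits into bytes. It is not taken from the citing paper.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]                       # e.g. 9 -> "1001"
    # Unary prefix of (length - 1) zeros, then the binary value itself.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits: str) -> List[int]:
    """Decode a concatenation of gamma codes stored without separators."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


stream = "".join(gamma_encode(n) for n in [1, 9, 2])
print(stream, gamma_decode_all(stream))
```

Because the zero prefix announces how many value bits follow, the decoder always knows where one integer ends and the next begins.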

Compressing Integers for Fast File Access

Williams

1999

The Computer Journal

Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding such a representation is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
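The abstract names a variable-byte integer scheme without describing its layout. As a rough sketch, the code below uses one common variant (7 data bits per byte, with the high bit set when more bytes follow). The bit layout and function names are assumptions for illustration and may differ from the scheme evaluated in the paper.

```python
from typing import List


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer, least significant 7-bit group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)   # low 7 bits, continuation flag set
        n >>= 7
    out.append(n)                        # final byte, continuation flag clear
    return bytes(out)


def vbyte_decode_all(data: bytes) -> List[int]:
    """Decode a byte string holding back-to-back variable-byte integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:                  # more bytes belong to this integer
            shift += 7
        else:                            # last byte of this integer
            values.append(current)
            current, shift = 0, 0
    return values


numbers = [0, 127, 128, 300, 2 ** 20]
packed = b"".join(vbyte_encode(n) for n in numbers)
print(len(packed), vbyte_decode_all(packed))
```

Small values occupy a single byte, which is the source of the space saving over fixed 32-bit storage, and decoding works on whole bytes rather than individual bits.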

“…This has a negligible effect on accuracy: Table 2 shows there is no perceivable change in ROC score for the SCOP test, despite a small change in total hits between the queries and subject sequences. The approach of replacing wildcard characters with bases is already employed by BLAST for nucleotide searches, as originally proposed by Williams and Zobel (1997). Our final optimization is to store query positions as 16-bit integers where possible, instead of 32-bit integers as used in NCBI-BLAST.…”

Section: Cameron Et Al (mentioning)

confidence: 99%

A Deterministic Finite Automaton for Faster Protein Hit Detection in BLAST

Cameron

Williams

Cannane

2006

Journal of Computational Biology

BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, taking typically minutes on modern workstations. Therefore, continuing evolution of BLAST, by improving its algorithms and optimizations, is essential to improve search times in the face of exponentially increasing collection sizes. We present an optimization to the first stage of the BLAST algorithm specifically designed for protein search. It produces the same results as NCBI-BLAST but in around 59% of the time on Intel-based platforms; we also present results for other popular architectures. Overall, this is a saving of around 15% of the total typical BLAST search time. Our approach uses a deterministic finite automaton (DFA), inspired by the original scheme used in the 1990 BLAST algorithm. The techniques are optimized for modern hardware, making careful use of cache-conscious approaches to improve speed. Our optimized DFA approach has been integrated into a new version of BLAST that is freely available for download at http://www.fsa-blast.org/.

“…The FASTA format also defines other wildcard characters, but they do not occur in this release of the human genome. Since in the present work we focus on DNA compression and do not address special file formats such as FASTA, we do not elaborate upon the representation of wildcard symbols, yet for the importance of this particular test we have extended our GeNML program to support the encoding and decoding of N symbols as well (for an algorithm concerning the storage and retrieval of wildcard characters in the FASTA format, see for example, Williams and Zobel [1997]). Nevertheless, it is important to mention that the statistical nature of the N symbols (and the wildcard characters in general), since they seem to always come in long runs, is quite different from the regular bases.…”

Section: Human Genome Compression (mentioning)

confidence: 99%
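The statement above observes that N symbols tend to occur in long runs. One way to exploit this, sketched below, is to record each run as a (start, length) pair and substitute a placeholder base, so that the remaining stream contains only A, C, G and T. The function names and the choice of placeholder are hypothetical. This is an illustration of the idea only, not the algorithm of Williams and Zobel (1997) or of GeNML.

```python
from typing import List, Tuple


def split_wildcards(seq: str, placeholder: str = "A") -> Tuple[str, List[Tuple[int, int]]]:
    """Replace runs of 'N' with a placeholder base and list the runs separately."""
    runs: List[Tuple[int, int]] = []
    bases: List[str] = []
    i = 0
    while i < len(seq):
        if seq[i] == "N":
            start = i
            while i < len(seq) and seq[i] == "N":   # extend over the whole run
                i += 1
            runs.append((start, i - start))
            bases.append(placeholder * (i - start))
        else:
            bases.append(seq[i])
            i += 1
    return "".join(bases), runs


def restore_wildcards(bases: str, runs: List[Tuple[int, int]]) -> str:
    """Put the recorded 'N' runs back into the base-only sequence."""
    chars = list(bases)
    for start, length in runs:
        chars[start:start + length] = "N" * length
    return "".join(chars)


bases, runs = split_wildcards("ACNNNNGTN")
print(bases, runs, restore_wildcards(bases, runs))
```

A long run costs two integers however many symbols it covers, and the base-only stream can then be coded with a fixed four-symbol alphabet.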

An efficient normalized maximum likelihood algorithm for DNA sequence compression

Korodi

Tăbuş

2005

ACM Trans. Inf. Syst.

This article presents an efficient algorithm for DNA sequence compression, which achieves the best compression ratios reported over a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximate matching blocks, (2) encoding by a first order context coding and (3) representing strings in clear, to make efficient use of the redundancy sources in DNA data, under fast execution times. One of the main algorithmic features is the constraint on the matching blocks to include reasonably long contiguous matches, which not only reduces significantly the search time, but also can be used to modify the NML model to exploit the constraint for getting smaller code lengths. The algorithm handles the changing statistics of DNA data in an adaptive way and by predictively encoding the matching pointers it is successful in compressing long approximate matches. Apart from comparison with previous DNA encoding methods, we present compression results for the recently published human genome data.
