1997
DOI: 10.1093/bioinformatics/13.5.549

Compression of nucleotide databases for fast searching

Abstract: Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly; that sequences can be accessed independently of the order in which they were stored; and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme f…

Citations: cited by 16 publications (12 citation statements)
References: 15 publications
“…Elias coding [2] is a non-parameterised method of coding integers that is, for example, used in large text database indexes [8] and specialist applications [10,11]. Elias coding, like the other schemes described in this paper, allows unambiguous coding of integers and does not require separators between each integer of a stored array.…”
Section: Non-parameterised Variable-bit Coding (mentioning)
confidence: 99%
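
The property described in the statement above (unambiguous integer codes that need no separators) can be illustrated with Elias gamma coding, one member of the Elias family. This is a minimal sketch, not the implementation used in the cited papers: the function names are hypothetical, and bits are represented as a Python string of "0" and "1" characters for readability rather than packed into bytes.

```python
def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string.

    The code is floor(log2(n)) zero bits followed by n in binary.
    """
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary digits of n, leading bit is always 1
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode_stream(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes without any separators."""
    values = []
    i = 0
    while i < len(bits):
        # Count the leading zeros; they give the number of bits after the leading 1.
        zeros = 0
        while i < len(bits) and bits[i] == "0":
            zeros += 1
            i += 1
        end = i + zeros + 1
        if end > len(bits):
            raise ValueError("truncated gamma code")
        values.append(int(bits[i:end], 2))
        i = end
    return values


# Usage: the codes are simply concatenated, yet decoding is unambiguous.
numbers = [1, 5, 17, 2]
stream = "".join(elias_gamma_encode(n) for n in numbers)
decoded = elias_gamma_decode_stream(stream)
```

Small integers get short codes (1 is a single bit), which is why such codes suit arrays dominated by small values.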
“…This has a negligible effect on accuracy: Table 2 shows there is no perceivable change in ROC score for the SCOP test, despite a small change in total hits between the queries and subject sequences. The approach of replacing wildcard characters with bases is already employed by BLAST for nucleotide searches, as originally proposed by Williams and Zobel (1997). Our final optimization is to store query positions as 16-bit integers where possible, instead of 32-bit integers as used in NCBI-BLAST.…”
Section: Cameron Et Al (mentioning)
confidence: 99%
“…The FASTA format also defines other wildcard characters, but they do not occur in this release of the human genome. Since in the present work we focus on DNA compression and do not address special file formats such as FASTA, we do not elaborate upon the representation of wildcard symbols, yet for the importance of this particular test we have extended our GeNML program to support the encoding and decoding of N symbols as well (for an algorithm concerning the storage and retrieval of wildcard characters in the FASTA format, see for example, Williams and Zobel [1997]). Nevertheless, it is important to mention that the statistical nature of the N symbols (and the wildcard characters in general), since they seem to always come in long runs, is quite different from the regular bases.…”
Section: Human Genome Compression (mentioning)
confidence: 99%
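
The two statements above touch on the same idea: wildcard characters can be replaced by ordinary bases in the main sequence, and because wildcards tend to occur in long runs they can be recorded compactly on the side. The following is a rough sketch of that idea only, not the algorithm of Williams and Zobel (1997). The function names, the fixed fill base "A", and the `(start, length, symbol)` run format are all assumptions made for illustration.

```python
def split_wildcards(seq: str, fill: str = "A") -> tuple[str, list[tuple[int, int, str]]]:
    """Replace non-ACGT symbols with a fill base and record them as runs.

    Returns the base-only sequence and a list of (start, length, symbol) runs.
    The fill base is an arbitrary choice made for this sketch.
    """
    bases = []
    runs: list[tuple[int, int, str]] = []
    for i, ch in enumerate(seq):
        if ch in "ACGT":
            bases.append(ch)
            continue
        # Extend the previous run if it has the same symbol and ends right before i.
        if runs and runs[-1][2] == ch and runs[-1][0] + runs[-1][1] == i:
            start, length, symbol = runs[-1]
            runs[-1] = (start, length + 1, symbol)
        else:
            runs.append((i, 1, ch))
        bases.append(fill)
    return "".join(bases), runs


def restore_wildcards(bases: str, runs: list[tuple[int, int, str]]) -> str:
    """Rebuild the original sequence by writing each run back over the fill bases."""
    chars = list(bases)
    for start, length, symbol in runs:
        chars[start:start + length] = symbol * length
    return "".join(chars)


# Usage: a run of four N symbols and a single R wildcard.
original = "ACGNNNNTTRA"
base_only, wildcard_runs = split_wildcards(original)
restored = restore_wildcards(base_only, wildcard_runs)
```

With this split, the base-only sequence uses a four-letter alphabet, so a search can run over it directly, and the run list stays short when wildcards cluster as the statement describes.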