Data compression in full-text retrieval systems

Bell, Timothy C.; Moffat, Alistair; Nevill-Manning, Craig G.; Witten, Ian H.; Zobel, Justin

doi:10.1002/(sici)1097-4571(199310)44:9<508::aid-asi2>3.0.co;2-a

Cited by 46 publications

(29 citation statements)

References 31 publications


“…In the absence of compression four bytes and two bytes respectively might be allocated for the d and f_{d,t} values, that is, six bytes for each (d, f_{d,t}) pair. Using compression the space required can be reduced to about one byte per pair [1]. On the 2 Gb TREC collection, described below, these methods compress the inverted file from 1100 Mb to 184 Mb, an irresistible saving.…”

Section: Document Databases (mentioning)

confidence: 99%

“…Without compression, an inverted file can easily be as large or larger than the text it indexes. Compression results in a net space reduction of as much as 80% of the inverted file size [1], but even with fast decompression (decoding at approximately 400,000 numbers per second on a Sun Sparc 10) it involves a substantial overhead on processing time.…”

Section: Introduction (mentioning)

confidence: 99%


Self-indexing inverted files for fast text retrieval

Moffat

Zobel

1996

ACM Trans. Inf. Syst.


Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
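The idea of an internal index inside each inverted list can be illustrated with a small sketch. This is not the authors' implementation: the class name `SkippedList`, the `block_size` parameter, and the use of an uncompressed Python list with a binary search over block heads are all assumptions made for illustration. In the abstract's setting the list is compressed and the internal index lets the query processor avoid decoding most of it; here the `scanned` counter stands in for that decoding work.

```python
from bisect import bisect_right


class SkippedList:
    """Sorted posting list with a small internal index (hypothetical layout)."""

    def __init__(self, postings, block_size=4):
        self.postings = list(postings)          # sorted document numbers
        self.block_size = block_size            # assumed tuning parameter
        # Internal index: the first document number of every block.
        self.block_firsts = self.postings[::block_size]
        self.scanned = 0                        # entries actually inspected

    def contains(self, doc_id):
        # Find the only block that could hold doc_id, skipping the rest.
        block = bisect_right(self.block_firsts, doc_id) - 1
        if block < 0:
            return False
        start = block * self.block_size
        for d in self.postings[start:start + self.block_size]:
            self.scanned += 1
            if d == doc_id:
                return True
            if d > doc_id:
                return False
        return False


def intersect(candidates, long_list):
    """Conjunctive (AND) step: keep candidates that occur in long_list."""
    return [d for d in candidates if long_list.contains(d)]


long_list = SkippedList(range(0, 100, 3), block_size=4)
result = intersect([3, 10, 42, 99], long_list)
```

With a short candidate set and a long list, only one block per candidate is inspected, which is the effect the abstract attributes to self-indexing for conjunctive Boolean queries.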


“…However, because databases are divided into records that must be independently decompressible, adaptive techniques are generally not effective. Similarly, arithmetic coding is in general the preferred coding technique; but it is slow for database applications (Bell et al, 1993).…”

Section: Database Compression (mentioning)

confidence: 99%

“…We have used the Elias gamma codes to encode each count w and Golomb codes to represent each sequence of offsets. These techniques are a variation on techniques used for inverted file compression, which has been successfully applied to large text databases (Bell et al, 1993) and to genomic databases (Williams and Zobel, 1996a; Williams and Zobel, 1996b).…”

Section: Direct Coding (mentioning)

confidence: 99%
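The two codes named in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of any cited paper: the function names, the bit-string output, the choice to code gaps between successive offsets, and the Golomb parameter `b` are all hypothetical. The unary convention used here (zeros for the gamma prefix, ones terminated by a zero for the Golomb quotient) is one of several in use.

```python
def to_bits(value, width):
    """Fixed-width binary string; empty when width is zero."""
    return format(value, "0{}b".format(width)) if width > 0 else ""


def elias_gamma(n):
    """Elias gamma code of n >= 1: (len - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma codes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def golomb(n, b):
    """Golomb code of n >= 1 with parameter b >= 1."""
    if n < 1 or b < 1:
        raise ValueError("n and b must be positive")
    q, r = divmod(n - 1, b)
    k = (b - 1).bit_length()        # ceil(log2(b))
    short = (1 << k) - b            # remainders that fit in k - 1 bits
    if r < short:
        tail = to_bits(r, k - 1)
    else:
        tail = to_bits(r + short, k)
    return "1" * q + "0" + tail     # unary quotient, truncated-binary remainder


def encode_offsets(offsets, b):
    """Gamma-code the count, then Golomb-code gaps between sorted offsets."""
    bits = elias_gamma(len(offsets))
    previous = 0
    for offset in offsets:
        bits += golomb(offset - previous, b)
        previous = offset
    return bits


encoded = encode_offsets([3, 7, 8], b=3)
```

Coding gaps rather than absolute positions keeps the integers small, which is what lets short codes such as these bring a list entry down toward the one-byte-per-pair figure quoted elsewhere on this page.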

Compression of nucleotide databases for fast searching

Williams

Zobel

1997

Bioinformatics


Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly; that sequences can be accessed independently of the order in which they were stored; and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools.


“…On the two gigabyte TREC collection these techniques compress the inverted file from 1000 megabytes to 135 megabytes, a dramatic saving. For this reason, if the information retrieval system is to be available on CD-ROM, and if we wish to maximise the amount of information stored on each disk, we should employ compression of both the index and also the stored text [2,3,9]. This is the environment that we consider here.…”

Section: Document Databases (mentioning)

confidence: 99%