Fast and flexible word searching on compressed text (2000)
DOI: 10.1145/348751.348754

Abstract: We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where …

Cited by 191 publications (86 citation statements); references 30 publications.
“…The basic point is that a text is more compressible when regarded as a sequence of words rather than characters. In [12,17], a compression scheme that uses this strategy combined with a Huffman code is presented. From a compression viewpoint, character-based Huffman methods are able to reduce English texts to approximately 60% of their original size, while word-based Huffman methods are able to reduce them to 25% of their original size, because the distribution of words is much more biased than the distribution of characters.…”
Section: Word-based Huffman Compression
confidence: 99%
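The statement above attributes the gain of word-based Huffman coding to the much more biased distribution of words compared with characters. A minimal Python sketch of that comparison on a toy string (illustrative only; this is not the implementation of [12,17], and it ignores separator handling and the cost of storing the vocabulary):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix-free Huffman code (symbol -> bit string) from a frequency map."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "the quick brown fox jumps over the lazy dog and the fox"
words = text.split()

char_code = huffman_code(Counter(text))   # character-based model
word_code = huffman_code(Counter(words))  # word-based model

char_bits = sum(len(char_code[c]) for c in text)
word_bits = sum(len(word_code[w]) for w in words)
# The word model needs far fewer bits for the same text, because the
# word distribution is much more skewed than the character distribution.
print(char_bits, word_bits)
```

Even on this tiny input the word model encodes the text in a fraction of the bits of the character model; on real English corpora the cited figures are roughly 25% versus 60% of the original size.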
“…The compression schemes presented in [12,17] use a semi-static model, that is, the encoder makes a first pass over the text to obtain the frequency of all the words in the text and then the text is coded in the second pass. During the coding phase, original symbols (words) are replaced by codewords.…”
Section: Word-based Huffman Compression
confidence: 99%
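The two-pass semi-static scheme described in the statement above can be sketched as follows (a toy Python sketch under stated assumptions, not the code of [12,17]): pass 1 collects word frequencies to build the model, pass 2 replaces each word with its codeword, and the decoder shares the same model.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Prefix-free Huffman code (symbol -> bit string) from a frequency map."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def compress(words):
    """Pass 1: build the model from word frequencies.  Pass 2: emit codewords."""
    model = huffman_code(Counter(words))
    bits = "".join(model[w] for w in words)
    return model, bits

def decompress(model, bits):
    """Walk the bit stream, emitting a word at each complete codeword."""
    inv = {c: w for w, c in model.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return out

words = "to be or not to be".split()
model, bits = compress(words)
assert decompress(model, bits) == words
```

A real implementation would serialize the vocabulary and codewords into the file header so the decoder can rebuild the model; here the model is simply passed in memory.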