Fast and flexible word searching on compressed text (2000)
DOI: 10.1145/348751.348754

Abstract: We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where …

Cited by 191 publications (86 citation statements); references 30 publications.
“…The basic point is that a text is more compressible when regarded as a sequence of words rather than characters. In [12,17], a compression scheme that uses this strategy combined with a Huffman code is presented. From a compression viewpoint, character-based Huffman methods are able to reduce English texts to approximately 60% of their original size, while word-based Huffman methods are able to reduce them to 25% of their original size, because the distribution of words is much more biased than the distribution of characters.…”
Section: Word-based Huffman Compression
confidence: 99%
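The statement above attributes the gain of word-based Huffman coding to the much more biased distribution of words compared with characters. A minimal Python sketch of that comparison on a toy string (illustrative only; this is not the implementation of [12,17], and it ignores separator handling and the cost of storing the vocabulary):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix-free Huffman code (symbol -> bit string) from a frequency map."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "the quick brown fox jumps over the lazy dog and the fox"
words = text.split()

char_code = huffman_code(Counter(text))   # character-based model
word_code = huffman_code(Counter(words))  # word-based model

char_bits = sum(len(char_code[c]) for c in text)
word_bits = sum(len(word_code[w]) for w in words)
# The word model needs far fewer bits for the same text, because the
# word distribution is much more skewed than the character distribution.
print(char_bits, word_bits)
```

Even on this tiny input the word model encodes the text in a fraction of the bits of the character model; on real English corpora the cited figures are roughly 25% versus 60% of the original size.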
“…The compression schemes presented in [12,17] use a semi-static model, that is, the encoder makes a first pass over the text to obtain the frequency of all the words in the text and then the text is coded in the second pass. During the coding phase, original symbols (words) are replaced by codewords.…”
Section: Word-based Huffman Compression
confidence: 99%
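The two-pass semi-static scheme described in the statement above can be sketched as follows (a toy Python sketch under stated assumptions, not the code of [12,17]): pass 1 collects word frequencies to build the model, pass 2 replaces each word with its codeword, and the decoder shares the same model.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Prefix-free Huffman code (symbol -> bit string) from a frequency map."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def compress(words):
    """Pass 1: build the model from word frequencies.  Pass 2: emit codewords."""
    model = huffman_code(Counter(words))
    bits = "".join(model[w] for w in words)
    return model, bits

def decompress(model, bits):
    """Walk the bit stream, emitting a word at each complete codeword."""
    inv = {c: w for w, c in model.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return out

words = "to be or not to be".split()
model, bits = compress(words)
assert decompress(model, bits) == words
```

A real implementation would serialize the vocabulary and codewords into the file header so the decoder can rebuild the model; here the model is simply passed in memory.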