(S,C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases

Brisaboa, Nieves R.; Fariña, Antonio; Navarro, Gonzalo; Esteller, María F.

doi:10.1007/978-3-540-39984-1_10

Cited by 68 publications

(52 citation statements)

References 16 publications


“…Inverted indexes are designed to take advantage of a myriad of different compression techniques. As such, our baselines also support several state-of-the-art byte and word aligned compression algorithms [3,9,28,39,43]. So, when we report the space usage for an inverted index, the numbers are reported using compressed inverted indexes and compressed document collections.…”

Section: Space Usage (mentioning)

confidence: 99%

Efficient in-memory top-k document retrieval

Culpepper

Petri

Scholer

2012

Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval


For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in-memory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for in-memory bag-of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and the compressed suffix tree data structures, and evaluate the various algorithmic trade-offs for performing efficient queries entirely in-memory. We compare our approach with two classic approaches to bag-of-words queries using inverted indexes, term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show that our framework is competitive with state-of-the-art indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.


“…This dense coding, however, is interesting by itself as a bound for the compression that can be obtained with a Huffman code. In this section we present this coding and some of its properties, generalizing the previous proposal of [3]. It should be clear that a stop-cont coding is just a base-c numerical representation, with the exception that the last digit is between c and c + s − 1, i.e., the last digit is a base-s number that is distinguished from previous digits by adding c. Digits between 0 and c−1 are called "continuers" and those between c and c + s − 1 are called "stoppers".…”

Section: Dense Coding (mentioning)

confidence: 99%
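The stop-cont scheme quoted above can be illustrated with a minimal sketch. It assumes the digit convention of the quoted passage (continuers are 0 to c−1, stoppers are c to c+s−1) and assigns codewords densely by 0-based frequency rank, shortest codewords first. The function names `sc_encode` and `sc_decode` are hypothetical and do not come from the cited papers.

```python
def sc_encode(rank, s, c):
    """Encode a 0-based frequency rank as zero or more continuers plus one stopper.

    Assumed convention (from the quoted passage): continuers are 0..c-1,
    stoppers are c..c+s-1.
    """
    if rank < 0 or s < 1 or c < 1:
        raise ValueError("rank must be >= 0 and s, c must be >= 1")
    digits = [c + rank % s]      # final digit: a base-s value shifted by c
    x = rank // s
    while x > 0:                 # remaining digits: continuers, emitted last-to-first
        x -= 1                   # the -1 keeps the code dense (no codeword is skipped)
        digits.append(x % c)
        x //= c
    digits.reverse()
    return digits


def sc_decode(digits, s, c):
    """Inverse of sc_encode: recover the rank from one complete codeword."""
    if not digits:
        raise ValueError("empty codeword")
    x = 0
    for d in digits[:-1]:
        if not 0 <= d < c:
            raise ValueError("expected a continuer before the final digit")
        x = x * c + d + 1
    last = digits[-1]
    if not c <= last < c + s:
        raise ValueError("codeword must end with a stopper")
    return x * s + (last - c)


# Toy alphabet with s = 2 stoppers and c = 2 continuers (digits 0..3).
print([sc_encode(r, 2, 2) for r in range(7)])
# → [[2], [3], [0, 2], [0, 3], [1, 2], [1, 3], [0, 0, 2]]
```

With byte-oriented codes one would pick s + c = 256. Under this sketch the first s ranks get one-byte codewords, the next s·c ranks get two bytes, the next s·c² get three, and so on, which is why the choice of s and c affects the compressed size.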

“…In [3] we proposed Dense Coding as a more efficient alternative to Tagged Huffman Coding [14] for direct compressed text searching on natural language texts. This dense coding, however, is interesting by itself as a bound for the compression that can be obtained with a Huffman code.…”

Section: Dense Coding (mentioning)

confidence: 99%


New bounds on D-ary optimal codes

Navarro

Brisaboa

2005

Information Processing Letters

Self Cite


We propose a simple method that, given a symbol distribution, yields upper and lower bounds on the average code length of a D-ary optimal code over that distribution. Thanks to its simplicity, the method permits deriving analytical bounds for families of parametric distributions. We demonstrate this by obtaining new bounds, much better than the existing ones, for Zipf and exponential distributions when D > 2.


“…The loss incurred by not using an optimal (Huffman) code is often tolerable, and other non-optimal variants with desirable features, such as faster processing and simplicity have been suggested, for example Tagged Huffman codes [5], EndTagged Dense codes [3] and (s, c)-Dense codes [2]. Similarly, the loss of optimality caused by moving to not fully sorted frequencies can also be acceptable in certain applications, for example when based on estimations rather than on actual counts.…”

Section: Introduction (mentioning)

confidence: 99%

Huffman Coding with Non-Sorted Frequencies

Klein

Shapira

2011

Math.Comput.Sci.


Abstract. A standard way of implementing Huffman's optimal code construction algorithm is by using a sorted sequence of frequencies. Several aspects of the algorithm are investigated as to the consequences of relaxing the requirement of keeping the frequencies in order. Using only partial order may speed up the code construction, which is important in some applications, at the cost of increasing the size of the encoded file.
