Abstract. A perfect hash function (PHF) h : U → [0, m − 1] for a key set S is a function that maps the keys of S to unique values. The minimum amount of space needed to represent a PHF for a given set S is known to be approximately 1.44n²/m bits, where n = |S|. In this paper we present new algorithms for the construction and evaluation of PHFs of a given set (for m = n and m = 1.23n) with the following properties: (1) evaluation of a PHF requires constant time; (2) the algorithms are simple to describe and implement, and run in linear time; (3) the amount of space needed to represent the PHFs is within a factor of 2 of the information-theoretic minimum. No previously known algorithm has all of these properties. To our knowledge, every algorithm in the literature with the third property either requires exponential time for construction and evaluation, or uses near-optimal space only asymptotically, for extremely large n. Thus, our main contribution is a scheme that gives low space usage for realistic values of n. The main technical ingredient is a new way of basing PHFs on random hypergraphs. Previously, this approach had been used to design simple PHFs with superlinear space usage.

⋆ This work was supported in part by GERINDO Project grant MCT/CNPq/CT-INFO 552.087/02-5 and by CNPq grants 30.5237/02-0 (Nivio Ziviani) and 142786/2006-3 (Fabiano C. Botelho).

This version of the paper is identical to the one published in the WADS 2007 proceedings. Unfortunately, it does not give reference and credit to: (i) Chazelle et al. [5], which presents a way of constructing PHFs equivalent to the one presented in this paper. It is described there as a modification of the "Bloomier Filter" data structure, although it is not made explicit that a PHF is being constructed. We independently designed an algorithm that builds a PHF mapping the keys of a set S of size n to the range [0, (2.0 + ε)n − 1] based on random 2-graphs, where ε > 0; the resulting functions require 2.0 + ε bits per key to be stored. (ii) Belazzougui [3], who suggested a method to construct PHFs that map to the range [0, (1.23 + ε)n − 1] based on random 3-graphs; the resulting functions are stored in 2.46 bits per key, and this space usage was further improved to 1.95 bits per key by using arithmetic coding. Thus, the simple PHF construction described here must be attributed to Chazelle et al. The new contributions of this paper are the analysis and optimization of the constant in the space usage, taking implementation aspects into account, and a way of constructing MPHFs from those PHFs.
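The hypergraph-based construction mentioned above can be sketched in a few lines. The following Python code is an illustrative sketch under stated assumptions, not the authors' implementation: it hashes each key to one vertex in each of three disjoint ranges (an edge of a random 3-graph on roughly 1.23n vertices), peels the graph edge by edge, and then assigns 2-bit values g[v] in reverse peeling order so that the sum of the three g-values of a key's edge, taken modulo 3, selects a vertex that is unique to that key. The hash function, the seed retry loop, and the plain Python lists standing in for a packed 2-bit array are assumptions made for readability.

```python
import hashlib

def _hash(key: bytes, seed: int, part: int, part_size: int) -> int:
    # Map `key` to a vertex in part `part` of the vertex set; the three
    # parts are disjoint, so the three vertices of an edge are distinct.
    digest = hashlib.sha256(bytes([seed, part]) + key).digest()
    return part * part_size + int.from_bytes(digest[:8], "big") % part_size

def build_phf(keys, c=1.23, max_tries=100):
    n = len(keys)
    part_size = max(1, int(c * n / 3) + 1)
    m = 3 * part_size                              # range of the PHF, about 1.23 n
    for seed in range(max_tries):
        # Each key becomes one edge of a random 3-graph on m vertices.
        edges = [tuple(_hash(k, seed, j, part_size) for j in range(3)) for k in keys]
        incident = [[] for _ in range(m)]
        for e, edge in enumerate(edges):
            for v in edge:
                incident[v].append(e)
        deg = [len(lst) for lst in incident]
        # Peel: repeatedly remove an edge that has a vertex of degree 1.
        removed = [False] * n
        stack = []
        queue = [v for v in range(m) if deg[v] == 1]
        while queue:
            v = queue.pop()
            live = [e for e in incident[v] if not removed[e]]
            if not live:
                continue
            e = live[0]
            removed[e] = True
            stack.append(e)
            for u in edges[e]:
                deg[u] -= 1
                if deg[u] == 1:
                    queue.append(u)
        if len(stack) < n:                         # not acyclic: retry with a new seed
            continue
        # Assign 2-bit values g[v] in reverse peeling order so that the sum of
        # the three g-values of a key's edge, mod 3, points at its own vertex.
        g = [3] * m                                # 3 == "unassigned" (3 mod 3 == 0)
        marked = [False] * m
        for e in reversed(stack):
            edge = edges[e]
            j = next(i for i, v in enumerate(edge) if not marked[v])
            g[edge[j]] = (j - sum(g[v] for i, v in enumerate(edge) if i != j)) % 3
            for v in edge:
                marked[v] = True

        def phf(key):
            edge = tuple(_hash(key, seed, j, part_size) for j in range(3))
            return edge[(g[edge[0]] + g[edge[1]] + g[edge[2]]) % 3]

        return phf, m
    raise RuntimeError("no acyclic 3-graph found; increase c or max_tries")

keys = [w.encode() for w in ("apple", "banana", "cherry", "date", "fig", "grape")]
phf, m = build_phf(keys)
assert len({phf(k) for k in keys}) == len(keys) and all(0 <= phf(k) < m for k in keys)
```

Storing g as two bits per vertex over about 1.23n vertices is what yields the roughly 2.46 bits per key mentioned above for the 3-graph construction.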
We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half that of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique on inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for display purposes.
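The coding scheme described above can be made concrete with a short sketch. The following Python code is an illustrative toy, not the paper's implementation: it builds a semistatic word-based model (words and separators as source symbols) and assigns byte-oriented Huffman codewords, so every codeword is a whole number of bytes and the compressed text can be scanned with ordinary byte-wise string matching. The tokenizer, the 256-ary tree construction, and all names are assumptions made for brevity; the searchable variant described in the paper additionally marks the first byte of each codeword so that a match cannot start in the middle of a codeword.

```python
import heapq
import re
from collections import Counter

def tokenize(text):
    # Alternate maximal runs of word characters and separators -- a crude
    # stand-in for the word/separator model used by word-based compressors.
    return re.findall(r"\w+|\W+", text)

def byte_huffman_codes(freqs, degree=256):
    # Build a `degree`-ary Huffman tree; returns a dict symbol -> bytes codeword.
    heap = []
    counter = 0                                    # tie-breaker for the heap
    for sym, f in freqs.items():
        heap.append((f, counter, sym))
        counter += 1
    # Pad with zero-frequency dummies so every merge takes exactly `degree` nodes.
    while (len(heap) - 1) % (degree - 1) != 0:
        heap.append((0, counter, None))
        counter += 1
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(degree)]
        merged = (sum(c[0] for c in children), counter, [c[2] for c in children])
        heapq.heappush(heap, merged)
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, list):                 # internal node: one byte per child
            for i, child in enumerate(node):
                walk(child, prefix + bytes([i]))
        elif node is not None:                     # real symbol (dummies are None)
            codes[node] = prefix or b"\x00"
    walk(heap[0][2], b"")
    return codes

def compress(text):
    tokens = tokenize(text)
    codes = byte_huffman_codes(Counter(tokens))
    return b"".join(codes[t] for t in tokens), codes

text = "to be or not to be, that is the question"
data, codes = compress(text)
# Frequent words get short (here 1-byte) codewords; an exact word search can be
# run over `data` with any byte-wise string-matching algorithm.
print(len(data), "compressed bytes,", len(codes), "distinct codewords")
```

Because codewords are byte-aligned, decoding a portion of the text that starts at a known codeword boundary is a simple byte-driven table walk, which is what makes partial decompression and compressed-domain search cheap.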