We address the problem of improving variable-length-to-fixed-length codes (VF codes). A VF code is an encoding scheme that uses a fixed-length code, and thus, one can easily access the compressed data. However, conventional VF codes usually have an inferior compression ratio to that of variable-length codes. Although a method proposed by T. Uemura et al. in 2010 achieves a good compression ratio comparable to that of gzip, it is very time-consuming. In this study, we propose a new VF coding method that applies a fixed-length code to the set of rules extracted by the Re-Pair algorithm, proposed by N. J. Larsson and A. Moffat in 1999. The Re-Pair algorithm is a simple off-line grammar-based compression method that has good compression-ratio performance with moderate compression speed. Moreover, we present several experimental results to show that the proposed coding is superior to the existing VF coding.
Introduction

Our objective is to develop an effective variable-length-to-fixed-length code (VF code). A VF code is a coding scheme that parses an input text into a consecutive sequence of substrings and then assigns a fixed-length codeword to each parsed substring.

[...] Combining such algorithms with VF coding is a promising idea.

In this study, we propose a method to apply fixed-length coding to the rules ex- [...] Re-Pair algorithm with fixed-length codewords, whereas the original algorithm utilizes variable-length codewords to achieve an extremely good compression ratio. To minimize the decrease in the compression ratio compared to the original algorithm, we exploit a simple characteristic of the algorithm: the minimum output size frequently occurs at an intermediate point in the process of repeated bigram replacement. Because all the codewords have equal length in our method, we can easily estimate the final output size for each intermediate rule set of the Re-Pair algorithm. Therefore, by preserving the best point and rewinding the rule set back to this point, we can obtain the minimum output at a reasonable cost.

The performance of the proposed method is evaluated experimentally on some corpora. The experimental results show that the compression ratio of the proposed method is approximately equal to that of bzip even though it uses fixed-length codewords. The compression speed is approximately the same as that of the original Re-Pair algorithm. Pattern-matching performance on compressed texts is also demonstrated, and it is confirmed that compressed pattern matching with our VF code is faster than UNIX zgrep, which is a typical decompress-then-search method, i.e., gunzip-then-grep.

Our contributions can be summarized as follows:

• We developed a new VF coding that has a superior compression ratio and compression time compared with those of the existing VF coding. The proposed method is based on a general concept. However, it was not so obvious whether the method was really effective.

• We demonstrated experimentally that pattern matching can be performed faster on a text compressed by our method than on the text compressed by the decompress-then...
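
The "best point" idea described in the introduction can be illustrated with a small sketch. This is not the authors' implementation: it uses a naive quadratic-time bigram replacement in place of the efficient Re-Pair data structures, and the cost model in `estimated_bits` (a codeword width of ceil(log2(number of symbols)) bits, plus two codewords per rule for the dictionary) is an assumption made here for illustration. All function names are hypothetical.

```python
from collections import Counter

BASE = 256  # assumption: input symbols are bytes, rule symbols start at 256


def estimated_bits(seq_len, num_rules):
    """Assumed cost model (not from the paper): all codewords share one
    width, and each rule costs two codewords in the dictionary."""
    width = (BASE + num_rules - 1).bit_length()  # ceil(log2(symbol count))
    return (seq_len + 2 * num_rules) * width


def replace_bigram(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def expand_above(seq, rules, limit):
    """Expand every symbol >= limit back into the pair it stands for."""
    out, stack = [], list(reversed(seq))
    while stack:
        s = stack.pop()
        if s >= limit:
            left, right = rules[s - BASE]
            stack.append(right)  # pushed first so that left is popped first
            stack.append(left)
        else:
            out.append(s)
    return out


def repair_vf(text):
    """Run bigram replacement to the end, then rewind to the rule count
    that gave the smallest estimated fixed-length output."""
    seq, rules = list(text), []
    best_bits, best_k = estimated_bits(len(seq), 0), 0
    while len(seq) >= 2:
        # Naive count: overlapping occurrences (e.g. "aaa") are overcounted.
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break
        seq = replace_bigram(seq, pair, BASE + len(rules))
        rules.append(pair)
        bits = estimated_bits(len(seq), len(rules))
        if bits < best_bits:
            best_bits, best_k = bits, len(rules)
    # Rewind: undo every rule created after the best point.
    seq = expand_above(seq, rules, BASE + best_k)
    return seq, rules[:best_k], best_bits


if __name__ == "__main__":
    text = b"abracadabra " * 50
    seq, rules, bits = repair_vf(text)
    print(len(text) * 8, "bits in,", bits, "bits estimated out,", len(rules), "rules")
    assert bytes(expand_above(seq, rules, BASE)) == text  # lossless round trip
```

Rewinding by expansion avoids storing a snapshot of the sequence at every step; only the rule count at the best point needs to be remembered. Because every codeword has the same width under the assumed model, the estimate at each step is a single multiplication.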