Given a string T of length N, the goal of grammar compression is to construct a small context-free grammar that generates only T. Among existing grammar compression methods, RePair (recursive pairing) [Larsson and Moffat, 1999] is notable for achieving good compression ratios in practice. Although the original paper already gave a time-optimal algorithm that computes the RePair grammar RePair(T) in expected O(N) time, reducing its working space so that it can be applied to large-scale data remains an active area of study. In this paper, we propose the first RePair algorithm working in compressed space, i.e., potentially o(N) space for highly compressible texts. The key idea is a new way to restructure an arbitrary grammar S for T into RePair(T) in compressed space and time. Based on the recompression technique, we propose an algorithm for RePair(T) in O(min(N, nm log N)) space and expected O(min(N, nm log N) m) time or O(min(N, nm log N) log log N) time, where n is the size of S and m is the number of variables in RePair(T). We implemented the variant of our algorithm running in O(min(N, nm log N) m) time and show that it can actually run in compressed space. We also present a new approach that reduces the peak memory usage of existing RePair algorithms by combining them with our algorithms, and show that the new approach outperforms, in both computation time and space, the most space-efficient linear-time RePair implementation to date.

ACM Subject Classification: Data structures design and analysis → Data compression
Digital Object Identifier: 10.4230/LIPIcs.CVIT.2016.2

Footnote: To be precise, the improvement is achieved only when m = ω(log log N), which is likely to hold for compressible texts.
Footnote 3: log N ≤ m is not necessarily true, since RePair stops producing variables when the input text is compressed into a string w containing no bigram with frequency ≥ 2. Still, it holds that log N ≤ m + |w|.
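The stopping rule mentioned above (no bigram with frequency ≥ 2 remains) can be illustrated with a minimal sketch of the textbook RePair procedure. This is a naive version that recounts bigrams in every round, not the expected O(N) time algorithm of Larsson and Moffat and not the compressed-space algorithm proposed in the abstract. The function names `count_bigrams`, `repair`, and `expand` are hypothetical, and representing variables as integers is an assumption made for the example.

```python
from collections import Counter


def count_bigrams(seq):
    """Count non-overlapping occurrences of each adjacent pair in seq."""
    counts = Counter()
    last_counted = {}  # pair -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_counted.get(pair) == i - 1:
            continue  # overlaps the previous occurrence (happens in runs like "aaa")
        counts[pair] += 1
        last_counted[pair] = i
    return counts


def repair(text):
    """Naive RePair sketch: returns (final sequence w, rules).

    Terminals are single-character strings; variables are integers
    (an assumption of this sketch). rules maps variable -> (left, right).
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        # Ties are broken by first occurrence; this choice is arbitrary here.
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # no bigram with frequency >= 2 remains
        var = len(rules)
        rules[var] = pair
        # Replace occurrences of the pair greedily from left to right.
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(var)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Expand a sequence of terminals and variables back into the text."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if isinstance(sym, int):
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return "".join(out)
```

For example, `w, rules = repair("abracadabra")` followed by `expand(w, rules)` reproduces the input. Here `len(rules)` plays the role of m and `len(w)` the role of |w| in the footnote's inequality log N ≤ m + |w|.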
Re-Pair is a grammar compression scheme with favorably good compression rates. Computing Re-Pair requires maintaining large frequency tables, which makes it hard to compute on large-scale data sets. As a solution to this problem, given a text of length n whose characters are drawn from an integer alphabet of size σ = n^{O(1)}, we present an O(min(n^2, n^2 lg log_τ n lg lg lg n / log_τ n)) time algorithm computing Re-Pair with max((n/c) lg n, n lg τ) + O(lg n) bits of working space including the text space, where c ≥ 1 is a fixed user-defined constant and τ is the sum of σ and the number of non-terminals. We give variants of our solution working in parallel or in the external memory model. Unfortunately, the algorithm does not seem practical, since a preliminary version already needs roughly one hour to compute Re-Pair on one megabyte of text.
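To get a feel for the working-space bound, the following sketch evaluates only its dominant term, max((n/c) lg n, n lg τ), in bits. It omits the additive O(lg n) term and any rounding of the logarithms, and the function name `working_space_bits` and the sample values of n, τ, and c are assumptions chosen for illustration, not figures from the paper.

```python
import math


def working_space_bits(n, tau, c=1):
    """Dominant term of the stated bound, max((n/c) lg n, n lg tau), in bits.

    The additive O(lg n) term is omitted and logarithms are not rounded;
    both simplifications are assumptions of this sketch.
    """
    return max((n / c) * math.log2(n), n * math.log2(tau))


if __name__ == "__main__":
    n = 10**6    # hypothetical text length (about one megabyte of bytes)
    tau = 256    # hypothetical value of sigma plus the number of non-terminals
    for c in (1, 2, 4):
        bits = working_space_bits(n, tau, c)
        print(f"c={c}: about {bits / 8 / 1e6:.2f} MB")
```

Increasing c shrinks the first term until the second term, n lg τ, becomes the larger of the two, which is why the bound is stated as a maximum.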