Space-Efficient Re-Pair Compression

Bille, Philip; Gørtz, Inge Li; Prezza, Nicola

doi:10.1109/dcc.2017.24

Cited by 20 publications

(31 citation statements)

References 11 publications


“…In practice, the MR-order varies depending on the implementation of the priority queue that manages pairs. For this reason, we used four different implementations of RePair in the comparative analysis, and they were implemented by Maruyama (https://code.google.com/archive/p/re-pair/), Navarro (https://www.dcc.uchile.cl/~gnavarro/software/index.html), Prezza (https://github.com/nicolaprezza/Re-Pair) [7], and Wan (https://github.com/rwanwork/Re-Pair); we ran it with level 0 (no heuristic option), respectively. Table 1 lists the details of the texts that we used in the experiments.…”

Section: Methods (mentioning)

confidence: 99%

“…Despite its simple scheme, RePair is known for its high compression in practice [3][4][5], and hence, it has been comprehensively studied. Some examples of studies on the RePair algorithm include its extension to an online algorithm [6], practical working time/space improvements [7,8], applications to various fields [3,9,10], and theoretical analysis of generated grammar sizes [1,11,12].…”

Section: Introduction (mentioning)

confidence: 99%


Practical Grammar Compression Based on Maximal Repeats

et al. 2020


This study presents an analysis of RePair, which is a grammar compression algorithm known for its simple scheme, while also being practically effective. First, we show that the main process of RePair, that is, the step by step substitution of the most frequent symbol pairs, works within the corresponding most frequent maximal repeats. Then, we reveal the relation between maximal repeats and grammars constructed by RePair. On the basis of this analysis, we further propose a novel variant of RePair, called MR-RePair, which considers the one-time substitution of the most frequent maximal repeats instead of the consecutive substitution of the most frequent pairs. The results of the experiments comparing the size of constructed grammars and execution time of RePair and MR-RePair on several text corpora demonstrate that MR-RePair constructs more compact grammars than RePair does, especially for highly repetitive texts.
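The step-by-step substitution of the most frequent symbol pair described in this abstract can be illustrated with a minimal sketch. It is a naive quadratic-time version written for clarity, not the linear-time algorithm of any of the cited implementations. The function names (`count_pairs`, `replace_pair`, `repair`, `expand`), the use of integers as nonterminals, and the tie-breaking rule (first pair encountered) are assumptions made for this example.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair (left-to-right scan)."""
    counts = Counter()
    last_end = {}  # pair -> index just past its most recently counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_end.get(pair, -1) <= i:  # skip occurrences overlapping a counted one
            counts[pair] += 1
            last_end[pair] = i + 2
    return counts


def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair with symbol, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (final sequence, rules); terminals are characters, nonterminals are ints."""
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        # Assumed tie-breaking: max() keeps the first pair seen among equally frequent ones.
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:  # stop once no pair occurs at least twice
            break
        symbol = len(rules)
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def expand(seq, rules):
    """Derive the original text from the final sequence and the rules."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return "".join(out)


seq, rules = repair("abracadabra")
print(seq, rules, expand(seq, rules))
```

Different tie-breaking rules can yield different grammars for the same input, so the grammar printed here should be read as one possible outcome rather than the canonical one.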


“…In practice, the MR-order varies how we implement the priority queue managing symbol pairs. To see this, we used five RePair implementations in the comparison; they were implemented by Maruyama 3, Navarro 4, Prezza 5 [5], Wan 6, and Yoshida 7. Table 1 summarizes the details of the texts we used in the comparison.…”

Section: Methods (mentioning)

confidence: 99%

MR-RePair: Grammar Compression Based on Maximal Repeats

Furuya, Takagi, Nakashima, et al. 2019

2019 Data Compression Conference (DCC)


We analyze the grammar generation algorithm of the RePair compression algorithm, and show the relation between a grammar generated by RePair and maximal repeats. We reveal that RePair replaces step by step the most frequent pairs within the corresponding most frequent maximal repeats. Then, we design a novel variant of RePair, called MR-RePair, which substitutes the most frequent maximal repeats at once instead of substituting the most frequent pairs consecutively. We implemented MR-RePair and compared the size of the grammar generated by MR-RePair to that by RePair on several text corpora. Our experiments show that MR-RePair generates more compact grammars than RePair does, especially for highly repetitive texts.

Introduction

Grammar compression is a method of lossless data compression that reduces the size of a given text by constructing a small context free grammar that uniquely derives the text. While the problem of generating the smallest such grammar is NP-hard [6], several approximation techniques have been proposed. Among them, RePair [11] is known as an off-line method that achieves a high compression ratio in practice [7,9,20], despite its simple scheme. There have been many studies concerning RePair, such as extending it to an online algorithm [13], improving its practical working time or space [5,17], applications to other fields [7,12,18], and analyzing the generated grammar size theoretically [6,15,16].

Recently, maximal repeats have been considered as a measure for estimating how repetitive a given string is: Belazzougui et al. [4] showed that the number of extensions of maximal repeats is an upper bound on the number of runs in the Burrows-Wheeler transform and the number of factors in the Lempel-Ziv parsing. Also, several index structures whose size is bounded by the number of extensions of maximal repeats have been proposed [2,3,19].

In this paper, we analyze the properties of RePair with regard to its relationship to maximal repeats. As stated above, several works have studied RePair, but, to the best of our knowledge, none of them associate RePair with maximal repeats. Moreover, we propose a grammar compression algorithm, called MR-RePair, that focuses on the property of maximal repeats. Ahead of this work, several off-line grammar compression schemes focusing on (non-maximal) repeats have been proposed [1,10,14]. Very recently, Gańczorz and Jeż addressed to heuristically improve the compression ratio of RePair with regard to the grammar size [8]. However, none of these techniques use the properties of maximal repeats. We show that, under a specific condition, there is a theoretical guarantee that the size of the grammar generated by MR-RePair is smaller than or equal to that generated by RePair. We also confirmed the effectiveness of MR-RePair compared to RePair through computational experiments.

Contributions: The primary contributions of this study are as follows.

arXiv:1811.04596v2 [cs.DS] 18 Feb 2019

2. We design a novel variant of RePair called MR-RePair, which is based on substituting the ...
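To make the notion of a maximal repeat concrete, the following brute-force sketch enumerates the maximal repeats of a short string. It assumes the usual definition: a substring that occurs at least twice (overlaps allowed) and whose every one-character extension to the left or right occurs strictly fewer times. The helper names `occurrences` and `maximal_repeats` are hypothetical, and the approach is only suitable for tiny inputs; it is not the method used in MR-RePair.

```python
def occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    return sum(
        1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)
    )


def maximal_repeats(text):
    """Return {substring: frequency} for every maximal repeat of text (brute force)."""
    alphabet = set(text)
    seen = set()
    result = {}
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            freq = occurrences(text, sub)
            if freq < 2:  # not a repeat
                continue
            # A repeat is maximal if no one-character extension keeps the same frequency.
            extendable = any(
                occurrences(text, c + sub) == freq
                or occurrences(text, sub + c) == freq
                for c in alphabet
            )
            if not extendable:
                result[sub] = freq
    return result


print(maximal_repeats("abracadabra"))
```

For "abracadabra" the sketch reports "a" (5 occurrences) and "abra" (2 occurrences); a repeat such as "ab" is excluded because "abr" occurs just as often.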


“…In our experiments, we combine our implementation described above with a well-tuned implementation of linear-time RePair by Maruyama [1] (denote it by RP). Setting t ∈ {2, 3, 4, 5}, we compare our method with RP and the most space efficient linear-time algorithm [6,2] to date (denote it by SERP). In theory, SERP runs in O(N/ε) time using at most (1.5 + ε)N words of space for arbitrary small ε ≤ 1, but ε is fixed to 1 in their implementation.…”

Section: Methods (mentioning)

confidence: 99%

RePair in Compressed Space and Time

Sakai, Ohno, Goto, et al. 2019

2019 Data Compression Conference (DCC)


Given a string T of length N, the goal of grammar compression is to construct a small context-free grammar generating only T. Among existing grammar compression methods, RePair (recursive pairing) [Larsson and Moffat, 1999] is notable for achieving good compression ratios in practice. Although the original paper already achieved a time-optimal algorithm to compute the RePair grammar RePair(T) in expected O(N) time, the study to reduce its working space is still active so that it is applicable to large-scale data. In this paper, we propose the first Re-Pair algorithm working in compressed space, i.e., potentially o(N) space for highly compressible texts. The key idea is to give a new way to restructure an arbitrary grammar S for T into RePair(T) in compressed space and time. Based on the recompression technique, we propose an algorithm for RePair(T) in O(min(N, nm log N)) space and expected O(min(N, nm log N)m) time or O(min(N, nm log N) log log N) time, where n is the size of S and m is the number of variables in RePair(T). We implemented our algorithm running in O(min(N, nm log N)m) time and show it can actually run in compressed space. We also present a new approach to reduce the peak memory usage of existing RePair algorithms combining with our algorithms, and show that the new approach outperforms, both in computation time and space, the most space efficient linear-time RePair implementation to date.

ACM Subject Classification: Data structures design and analysis → Data compression

Digital Object Identifier: 10.4230/LIPIcs.CVIT.2016.2

to be precise, the improvement is achieved only when m = ω(log log N), which is likely to hold for compressible texts

3 log N ≤ m is not necessarily true since RePair stops producing variables when the input text is compressed into a string w containing no bigram with frequency ≥ 2. Still, it holds that log N ≤ m + |w|.
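As a small illustration of working with a grammar without expanding it, the sketch below computes the length of the string derived by each variable. It assumes a grammar in which every variable has exactly one rule with two right-hand-side symbols and no variable derives itself; the dictionary representation and the name `derived_lengths` are assumptions made for this example and are not taken from the paper.

```python
def derived_lengths(rules):
    """Map each variable to the length of the string it derives.

    rules: dict mapping a variable to a (left, right) pair of symbols.
    Any symbol that is not a key of rules is treated as a terminal of length 1.
    """
    lengths = {}

    def length(sym):
        if sym not in rules:
            return 1
        if sym not in lengths:  # memoize so each variable is evaluated once
            left, right = rules[sym]
            lengths[sym] = length(left) + length(right)
        return lengths[sym]

    for var in rules:
        length(var)
    return lengths


# Hypothetical grammar: X -> ab, Y -> XX, Z -> YX, so Z derives "ababab".
rules = {"X": ("a", "b"), "Y": ("X", "X"), "Z": ("Y", "X")}
print(derived_lengths(rules))
```

Because results are memoized, the work is proportional to the number of rules rather than to the length of the derived text; the recursive helper may need to be made iterative for very deep grammars.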
