2011
DOI: 10.3390/a4040262

The Smallest Grammar Problem as Constituents Choice and Minimal Grammar Parsing

Abstract: The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial t…
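
The second task in the abstract (finding a small grammar once the constituents are fixed) can be illustrated with a short dynamic program. This is only a sketch under stated assumptions, not the authors' algorithm: the function names `min_parse`, `minimal_grammar` and `grammar_size` are hypothetical, every constituent keeps its own rule even if it ends up unused, and grammar size is taken to be the total number of symbols on the right-hand sides.

```python
def min_parse(word, constituents):
    """Write `word` with as few symbols as possible.

    A symbol is either a single character (terminal) or one of the
    given constituents other than `word` itself (assumed length >= 2).
    """
    n = len(word)
    usable = [c for c in constituents if c != word and len(c) >= 2]
    best = [0] + [None] * n      # best[i]: fewest symbols covering word[:i]
    back = [None] * (n + 1)      # back[i]: (previous index, symbol used)
    for i in range(1, n + 1):
        # Default choice: emit the single character word[i-1].
        best[i] = best[i - 1] + 1
        back[i] = (i - 1, word[i - 1])
        # Alternative: end a whole constituent at position i.
        for c in usable:
            j = i - len(c)
            if j >= 0 and word[j:i] == c and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = (j, c)
    # Walk the back-pointers to recover the parse.
    parse, i = [], n
    while i > 0:
        j, symbol = back[i]
        parse.append(symbol)
        i = j
    return parse[::-1]


def minimal_grammar(sequence, constituents):
    """One rule for the sequence and one per constituent (sketch assumption)."""
    rules = {sequence: min_parse(sequence, constituents)}
    for c in constituents:
        rules[c] = min_parse(c, constituents)
    return rules


def grammar_size(rules):
    """Assumed size measure: total right-hand-side length."""
    return sum(len(rhs) for rhs in rules.values())


rules = minimal_grammar("abcabcab", ["abc", "ab"])
for lhs, rhs in rules.items():
    print(lhs, "->", rhs)
print("size:", grammar_size(rules))
```

Because a constituent can only be built from strictly shorter constituents, each word is parsed independently and the sketch runs in time polynomial in the total length of the sequence and the constituents.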

Cited by 9 publications (8 citation statements)
References: 22 publications
“…This paper extends the work of Sidorov et al (2014), and validates the hypothesis experimentally, through the use of several compressors: Lempel-Ziv Welch (Ziv & Lempel, 1978), Burrows-Wheeler with run-length encoding (Burrows & Wheeler, 1994), and GZIP (Deutsch, 1996). We compare the performance of these general-purpose compression algorithms against two algorithms designed specifically for grammar-based compression: Zig-Zag (ZZ) (Carrascosa et al, 2010, 2011, 2012) and Iterative Repeat Replacement with Most Compressive score function (Carrascosa et al, 2010, 2011, 2012). Our experiments are conducted on a collection of 7928 musical scores gathered from sources which include the Acadia Early Music Archive (Callon, 1998-2009), the Choral Public Domain Library (CPDL organisation, 2018), and Musopen (Musopen organisation, 2018).…”
Section: Introduction (supporting)
confidence: 71%
“…Its ability to correctly select motifs from two Bach chorales was also demonstrated. Carrascosa et al (2010, 2011, 2012) showed that a variety of existing grammar-based compressors performed identical steps during the construction process. These compressors differed only in score function, selecting one of three specific functions: Maximal Length (ML), where the repeating term with the greatest length was chosen; Most Frequent (MF), where the repeating term with the highest number of occurrences in the input was chosen; and Most Compressive (MC), where both term length l and frequency f were combined as lf to allow selection of the term offering the greatest reduction in encoding length when all its instances were replaced within the input.…”
Section: Grammars and Compressors (mentioning)
confidence: 99%
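
The shared construction loop and the three score functions described in the statement above can be sketched as follows. This is an illustrative sketch, not the cited authors' implementation: the names `irr`, `candidates`, `replace_all` and `SCORES` are hypothetical, occurrences are counted greedily without overlap, MC is taken literally from the quote as the product of length and frequency, and the loop stops as soon as the chosen replacement no longer shrinks the grammar (an assumed stopping rule).

```python
def count_nonoverlapping(body, pat):
    """Greedy left-to-right count of non-overlapping occurrences of pat."""
    count, i, k = 0, 0, len(pat)
    while i + k <= len(body):
        if tuple(body[i:i + k]) == pat:
            count += 1
            i += k
        else:
            i += 1
    return count


def candidates(rules):
    """Repeats of length >= 2 with their frequency summed over all rule bodies."""
    pats = set()
    for body in rules.values():
        for i in range(len(body)):
            for j in range(i + 2, len(body) + 1):
                pats.add(tuple(body[i:j]))
    freq = {p: sum(count_nonoverlapping(b, p) for b in rules.values()) for p in pats}
    return {p: f for p, f in freq.items() if f >= 2}


# Score functions named in the quoted statement (l = length, f = frequency).
SCORES = {
    "ML": lambda l, f: l,      # Maximal Length
    "MF": lambda l, f: f,      # Most Frequent
    "MC": lambda l, f: l * f,  # Most Compressive, "combined as lf" per the quote
}


def replace_all(body, pat, symbol):
    """Replace non-overlapping occurrences of pat in body by symbol."""
    out, i, k = [], 0, len(pat)
    while i < len(body):
        if tuple(body[i:i + k]) == pat:
            out.append(symbol)
            i += k
        else:
            out.append(body[i])
            i += 1
    return out


def irr(sequence, score="MC"):
    """Iterative repeat replacement with a pluggable score function (sketch)."""
    rules = {"S": list(sequence)}
    fn = SCORES[score]
    n = 0
    while True:
        cands = candidates(rules)
        if not cands:
            break
        # sorted() makes tie-breaking deterministic; max keeps the first best.
        pat = max(sorted(cands), key=lambda p: fn(len(p), cands[p]))
        n += 1
        name = f"N{n}"
        new_rules = {k: replace_all(b, pat, name) for k, b in rules.items()}
        new_rules[name] = list(pat)
        old_size = sum(len(b) for b in rules.values())
        new_size = sum(len(b) for b in new_rules.values())
        if new_size >= old_size:  # assumed stopping rule: no further gain
            break
        rules = new_rules
    return rules


for name in ("ML", "MF", "MC"):
    print(name, irr("abcabcabc", name))
```

Only the `score` argument changes between the three variants, which mirrors the observation in the quoted statement that the compressors differ only in their score function.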
“…In DNA, the problem seems more complicated since the challenge is then to deal with palindromes and copies, requiring us to use and learn more expressive grammars. For DNA, recent advances have thus rather been on a simpler task: discovering the hierarchical structure of DNA as an instance of the smallest grammar problem, along the lines initiated by Sequitur [116] and its successors [117,118,119,120,121,122,123]. These studies have not been presented in this chapter since it is still difficult to assert and compare their biological pertinence, but these approaches based on repeats may help us to better understand what are the important words and where are their occurrences in DNA and to decipher its word structure as a preliminary step to learning grammars.…”
Section: Results (mentioning)
confidence: 99%
“…Subword regularization (Kudo, 2018) could be used to make the training more robust to this tokenization mismatch. Moreover, that approach could be adapted to work with any of the inferred tokenization: while a vanilla optimal parsing does not seem to support a probabilistic approach which could allow a sampling procedure, there might be several optimal parsings of one word (Carrascosa et al (2011) show both theoretical and empirical evidence that there can be an exponential number of parses with the same size) and generating several of those could make the translation system more robust.…”
Section: Methods (mentioning)
confidence: 99%
“…Despite its simplicity, BPE performs very well in standard compression benchmarks (Gage, 1994; Carrascosa et al, 2012). The best performing ones have an unreasonable time-complexity (including one of complexity O(n^7) (Carrascosa et al, 2011) and an even slower genetic algorithm (Benz and Kötzing, 2013)). Based on these benchmarks, we decided to use the so-called IRRMGP algorithm, which outperforms BPE, as well as other worse performing algorithms.…”
Section: Other Compression Algorithms (mentioning)
confidence: 99%