The Smallest Grammar Problem as Constituents Choice and Minimal Grammar Parsing

Carrascosa, Rafael; Coste, F; Gallé, Matthias; Infante-López, Gabriel

doi:10.3390/a4040262

Cited by 9 publications

(8 citation statements)

References 22 publications

“…This paper extends the work of Sidorov et al (2014), and validates the hypothesis experimentally, through the use of several compressors: Lempel-Ziv Welch (Ziv & Lempel, 1978), Burrows-Wheeler with run-length encoding (Burrows & Wheeler, 1994), and GZIP (Deutsch, 1996). We compare the performance of these general-purpose compression algorithms against two algorithms designed specifically for grammar-based compression: Zig-Zag (ZZ) (Carrascosa et al, 2010, 2011, 2012) and Iterative Repeat Replacement with Most Compressive score function (Carrascosa et al, 2010, 2011, 2012). Our experiments are conducted on a collection of 7928 musical scores gathered from sources which include the Acadia Early Music Archive (Callon, 1998-2009), the Choral Public Domain Library (CPDL organisation, 2018), and Musopen (Musopen organisation, 2018).…”

Section: Introduction (supporting)

confidence: 71%

“…Its ability to correctly select motifs from two Bach chorales was also demonstrated. Carrascosa et al (2010, 2011, 2012) showed that a variety of existing grammar-based compressors performed identical steps during the construction process. These compressors differed only in score function, selecting one of three specific functions: Maximal Length (ML), where the repeating term with the greatest length was chosen; Most Frequent (MF), where the repeating term with the highest number of occurrences in the input was chosen; and Most Compressive (MC), where both term length l and frequency f were combined as lf to allow selection of the term offering the greatest reduction in encoding length when all its instances were replaced within the input.…”

Section: Grammars and Compressors (mentioning)

confidence: 99%
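
The three score functions described in the statement above can be illustrated with a minimal Python sketch. It covers only the selection step (choosing which repeat to replace next), not the full iterative replacement loop. The names `repeats`, `SCORES` and `best_repeat` are hypothetical, occurrences are counted without overlap, and MC is taken literally as the product lf from the quoted text, which may differ from the exact formula in the cited papers.

```python
def repeats(seq, min_len=2):
    """Map each substring of length >= min_len that occurs at least
    twice (non-overlapping, as counted by str.count) to its frequency."""
    found = {}
    n = len(seq)
    # A substring longer than n // 2 cannot occur twice without overlap.
    for length in range(min_len, n // 2 + 1):
        for start in range(n - length + 1):
            sub = seq[start:start + length]
            if sub not in found:
                freq = seq.count(sub)
                if freq >= 2:
                    found[sub] = freq
    return found


# Score functions over (length l, frequency f), following the quoted text.
SCORES = {
    "ML": lambda l, f: l,      # Maximal Length
    "MF": lambda l, f: f,      # Most Frequent
    "MC": lambda l, f: l * f,  # Most Compressive, taken here as l * f
}


def best_repeat(seq, score_name):
    """Return (substring, frequency) with the highest score, or None.
    Ties are broken by length, then by string order (an arbitrary choice)."""
    score = SCORES[score_name]
    candidates = repeats(seq)
    if not candidates:
        return None
    return max(
        candidates.items(),
        key=lambda kv: (score(len(kv[0]), kv[1]), len(kv[0]), kv[0]),
    )


if __name__ == "__main__":
    # Digits and punctuation act as unique separators between blocks.
    seq = "abcdefgh1abcdefgh2xy3xy4xy5xy6xy7xy8pqrs9pqrs0pqrs-pqrs+pqrs"
    for name in ("ML", "MF", "MC"):
        print(name, best_repeat(seq, name))
```

On this input the three scores pick different repeats: ML selects the longest one ("abcdefgh"), MF the most frequent one ("xy"), and MC the one with the largest product of length and frequency ("pqrs"). The quadratic enumeration of substrings is for clarity only and would not scale to long inputs.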

An investigation of music analysis by the application of grammar-based compressors

Humphreys

Sidorov

Jones

et al. 2021

Journal of New Music Research

Many studies have presented computational models of musical structure, as an important aspect of musicological analysis. However, the use of grammar-based compressors to automatically recover such information is a relatively new and promising technique. We investigate their performance extensively using a collection of nearly 8000 scores, on tasks including error detection, classification, and segmentation, and compare this with a range of more traditional compressors. Further, we detail a novel method for locating transcription errors based on grammar compression. Despite its lack of domain knowledge, we conclude that grammar-based compression offers competitive performance when solving a variety of musicological tasks.

“…In DNA, the problem seems more complicated since the challenge is then to deal with palindromes and copies, requiring us to use and learn more expressive grammars. For DNA, recent advances have thus rather been on a simpler task: discovering the hierarchical structure of DNA as an instance of the smallest grammar problem, along the lines initiated by Sequitur [116] and its successors [117,118,119,120,121,122,123]. These studies have not been presented in this chapter since it is still difficult to assert and compare their biological pertinence, but these approaches based on repeats may help us to better understand what are the important words and where are their occurrences in DNA and to decipher its word structure as a preliminary step to learning grammars.…”

Section: Results (mentioning)

confidence: 99%

Learning the Language of Biological Sequences

Coste

2016

Topics in Grammatical Inference

Self Cite

Learning the language of biological sequences is an appealing challenge for the grammatical inference research field. While some first successes have already been recorded, such as the inference of profile hidden Markov models or stochastic context-free grammars which are now part of the classical bioinformatics toolbox, it is still a source of open and nice inspirational problems for grammatical inference, enabling us to confront our ideas to real fundamental applications. As an introduction to this field, we survey here the main ideas and concepts behind the approaches developed in pattern/motif discovery and grammatical inference to characterize successfully the biological sequences with their specificities.

“…Subword regularization (Kudo, 2018) could be used to make the training more robust to this tokenization mismatch. Moreover, that approach could be adapted to work with any of the inferred tokenization: while a vanilla optimal parsing does not seem to support a probabilistic approach which could allow a sampling procedure, there might be several optimal parsings of one word (Carrascosa et al (2011) show both theoretical and empirical evidence that there can be an exponential number of parses with the same size) and generating several of those could make the translation system more robust.…”

Section: Methods (mentioning)

confidence: 99%
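
The remark above about several optimal parsings of one word can be made concrete with a small dynamic program. The sketch below is a simplification: it segments a word against a fixed vocabulary and counts how many segmentations reach the minimum number of tokens. It is not the minimal grammar parsing procedure of the cited paper, and the function name `min_parses` and the toy vocabulary are hypothetical.

```python
def min_parses(word, vocab):
    """Return (min_tokens, count): the fewest vocabulary items needed to
    spell `word`, and how many distinct segmentations reach that minimum.
    If no segmentation exists, min_tokens is float('inf') and count is 0."""
    inf = float("inf")
    n = len(word)
    best = [inf] * (n + 1)  # best[i]: fewest tokens covering word[:i]
    ways = [0] * (n + 1)    # ways[i]: number of segmentations reaching best[i]
    best[0], ways[0] = 0, 1
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] == inf or word[start:end] not in vocab:
                continue
            cand = best[start] + 1
            if cand < best[end]:
                best[end], ways[end] = cand, ways[start]
            elif cand == best[end]:
                ways[end] += ways[start]
    return best[n], ways[n]


if __name__ == "__main__":
    vocab = {"a", "aa"}
    for word in ("aaa", "aaaaa"):
        print(word, min_parses(word, vocab))
```

With the toy vocabulary {"a", "aa"}, the word "aaa" already has two minimal segmentations ("a" + "aa" and "aa" + "a"), and "aaaaa" has three. Any of these could be sampled when several equally small parses are wanted.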

“…Despite its simplicity, BPE performs very well in standard compression benchmarks (Gage, 1994; Carrascosa et al, 2012). The best performing ones have an unreasonable time-complexity (including one of complexity O(n^7) (Carrascosa et al, 2011) and an even slower genetic algorithm (Benz and Kötzing, 2013)). Based on these benchmarks, we decided to use the so-called IRRMGP algorithm, which outperforms BPE, as well as other worse performing algorithms.…”

Section: Other Compression Algorithms (mentioning)

confidence: 99%

Investigating the Effectiveness of BPE: The Power of Shorter Sequences

Gallé

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Self Cite

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary size show that-given a fixed vocabulary size budget-the fewer tokens an algorithm needs to cover the test set, the better the translation (as measured by BLEU).
