PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for
            Automatic Text Simplification

Brunato, Dominique; Cimino, Andrea; Dell’Orletta, Felice; Venturi, Giulia

doi:10.18653/v1/d16-1034

Cited by 23 publications

(19 citation statements)

References 16 publications


“…In terms of sentence aligned corpora for text simplification, different versions of aligned Wiki-Simple Wikipedia sentences have been used in NLP research (Zhu et al, 2010; Coster and Kauchak, 2011; Hwang et al, 2015). Different supervised and unsupervised approaches were proposed to construct such corpora (Bott and Saggion, 2011; Klerke and Søgaard, 2012; Klaper et al, 2013; Brunato et al, 2016). Our corpus adds a new resource for the English text simplification task.…”

Section: Introduction (mentioning)

confidence: 99%

OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification

Vajjala, Lučić

2018

Proceedings of the Thirteenth Workshop on Innovative Use of NLP For Building Educational Applications


This paper describes the collection and compilation of the OneStopEnglish corpus of texts written at three reading levels, and demonstrates its usefulness through two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total). The corpus is now freely available under a CC BY-SA 4.0 license and we hope that it will foster further research on the topics of readability assessment and text simplification.


“…and it is thus more suitable to catch the "layman" intuition of sentence complexity. For these reasons, this method has been used in recent works in the field of readability and text simplification; it is the case of Lasecki et al (2015); Clercq et al (2013); Brunato et al (2016) where the crowd was asked to evaluate the level of complexity or the degree of informativeness of simplified sentences compared to the original one.…”

Section: Introduction (mentioning)

confidence: 99%

Is this Sentence Difficult? Do you Agree?

Brunato, Mattei, Dell’Orletta

et al. 2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Self Cite


In this paper, we present a crowdsourcing-based approach to model the human perception of sentence complexity. We collect a large corpus of sentences rated with judgments of complexity for two typologically different languages, Italian and English. We test our approach in two experimental scenarios aimed to investigate the contribution of a wide set of lexical, morpho-syntactic and syntactic phenomena in predicting i) the degree of agreement among annotators independently from the assigned judgment and ii) the perception of sentence complexity.


“…• A subset of the PaCCSS-it corpus (Brunato et al, 2016), which contains 63,000 complex-to-simple sentence pairs automatically extracted from the Web. In order to extract only the pairs of higher quality, we pre-processed the corpus by discarding sentence pairs with special characters, misspellings, non-matching numerals or dates, and a cosine similarity below 0.5. […] mal language, including Italian Opensubtitles, the Paisà corpus (Lyding et al, 2014), Wikipedia and the collection of Italian laws.…”

Section: Italian (mentioning)

confidence: 99%

Neural Text Simplification in Low-Resource Conditions Using Weak Supervision

Aprosio, Tonelli, Turchi

et al. 2019

Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation


Neural text simplification has gained increasing attention in the NLP community thanks to recent advancements in deep sequence-to-sequence learning. Most recent efforts with such a data-demanding paradigm have dealt with the English language, for which sizeable training datasets are currently available to deploy competitive models. Similar improvements on less resource-rich languages are conditioned either to intensive manual work to create training data, or to the design of effective automatic generation techniques to bypass the data acquisition bottleneck. Inspired by the machine translation field, in which synthetic parallel pairs generated from monolingual data yield significant improvements to neural models, in this paper we exploit large amounts of heterogeneous data to automatically select simple sentences, which are then used to create synthetic simplification pairs. We also evaluate other solutions, such as oversampling and the use of external word embeddings to be fed to the neural simplification system. Our approach is evaluated on Italian and Spanish, for which a few thousand gold sentence pairs are available. The results show that these techniques yield performance improvements over a baseline sequence-to-sequence configuration.
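The citation statement quoted for this paper describes filtering PaCCSS-IT pairs by special characters, misspellings, non-matching numerals or dates, and a cosine similarity below 0.5. The following is a minimal Python sketch of that kind of filter, not the authors' implementation. Several things here are assumptions: the cosine is computed over bag-of-words token counts (the statement does not say which representation was used), the set of allowed characters is a hypothetical whitelist, the names `bow_cosine` and `keep_pair` are invented for illustration, and the misspelling check is left out because it would need a dictionary.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")
# Numerals, including simple dates or decimals such as 10/05 or 3,5.
NUMBER_RE = re.compile(r"\d+(?:[.,/]\d+)*")
# Hypothetical whitelist: letters, digits, whitespace, common punctuation.
ALLOWED_RE = re.compile(r"[\w\s.,;:!?'()\-]+")


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (an assumption)."""
    va = Counter(TOKEN_RE.findall(a.lower()))
    vb = Counter(TOKEN_RE.findall(b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(complex_s: str, simple_s: str, min_cosine: float = 0.5) -> bool:
    """Return True if the pair passes all (sketched) quality filters."""
    # Discard pairs containing characters outside the whitelist.
    if not (ALLOWED_RE.fullmatch(complex_s) and ALLOWED_RE.fullmatch(simple_s)):
        return False
    # Discard pairs whose numerals or dates do not match.
    if sorted(NUMBER_RE.findall(complex_s)) != sorted(NUMBER_RE.findall(simple_s)):
        return False
    # Discard pairs that are too dissimilar.
    return bow_cosine(complex_s, simple_s) >= min_cosine
```

A call such as `keep_pair("Il treno parte alle 10 dalla stazione centrale.", "Il treno parte alle 10 dalla stazione.")` keeps the pair, while changing the numeral in one sentence, or adding a character such as `@`, rejects it.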
