Proceedings of the Thirteenth Workshop on Innovative Use of NLP For Building Educational Applications 2018
DOI: 10.18653/v1/w18-0535

OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification

Abstract: This paper describes the collection and compilation of the OneStopEnglish corpus of texts written at three reading levels, and demonstrates its usefulness through two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total). The corpus is now freely available under a CC BY-SA 4.0 license, and we hope that it will foster further research on readability assessment and text simplification.

Cited by 96 publications (97 citation statements)
References 26 publications
“…It was already noted by (Vajjala and Meurers, 2012) that the actual readability level of each test was difficult to predict accurately based on the sole FKGL. OSE complexity levels are also in agreement with the Flesch-Kincaid index, and in agreement with the numbers reported in (Vajjala and Lucic, 2018); again we see a large overlap between levels for this index. Overall, OSE texts are somewhat more complex than Weebit's, with OSE level 1 comparable in difficulty to Weebit level 4.…”
Section: Source (supporting)
confidence: 90%
“…However, we know it is possible to automatically distinguish between these levels in this corpus using machine learning models (Ambati et al, 2016; Vajjala and Lucic, 2018). Whether the variation between texts of any specific linguistic property (e.g., lexical richness, syntactic complexity, coherence) can be correlated with the differences in comprehension scores instead of "reading level" assigned by the teachers should be explored as a part of future work.…”
Section: Discussion (mentioning)
confidence: 99%
“…Texts: We randomly selected 15 texts from the OneStopEnglish corpus (Vajjala and Lucic, 2018), consisting of manually simplified news articles from The Guardian, by English teachers, to suit beginner, intermediate, and advanced readers of English as Second Language (ESL). This corpus was also used in past user studies related to readability assessment (Crossley et al, 2014; Participants: 112 non-native English speaking participants were recruited for this study from among the student population of an American university by means of an internal email advertisement.…”
Section: Methods and Experiments Procedures (mentioning)
confidence: 99%
“…This advantage of this method lies in its high reliability, but inviting the experts is carefully considered. This was also the method of Slyh & Hansen [15], Todirascu et al [16], Xia et al [17], Vajjala & Lučić [18], Nguyen & Henkin [8,9], among others.…”
Section: Criteria For Building the Corpus (mentioning)
confidence: 99%