A Comparative Study of Feature Types for Age-Based Text Classification

Glazkova, Anna; Egorov, Yury; Glazkov, Maksim

doi:10.1007/978-3-030-72610-2_9

Cited by 10 publications

(12 citation statements)

References 21 publications


“…In addition to establishing the relationship between word frequency and the complexity of a text composed of those words, a separate research problem is the choice of a source of lexical frequency data relevant to the chosen age category, since word frequency data depend heavily on the type and content of the corpus on which the counts are based (Ляшевская, Шаров 2009, preface to the dictionary). A number of researchers use data from large national text corpora for this purpose (Dorofeeva et al 2019, Glazkova et al 2021, Иомдин, Морозов 2021). Arguments in favor of this choice include the large size of such corpora and the representation in them of various genres of the "official" codified language that students will encounter in life: fiction, news, and journalism all form the basis of the data.…”

Section: Word frequency as a parameter for assessing text complexity: theor... (unclassified)

“…The classical approach, introduced in early readability formulas, calculates the percentage of a text's words that appear in a relevant word list, one variety of which can be a frequency list. This method is still used in a number of text complexity studies (Glazkova et al 2021, Sato 2014). Another popular way to account for the frequency of a text's words is to calculate the mean or median of the frequencies of each word in the text (Francois & Fairon 2012, Reynolds 2016).…”

Section: Word frequency as a parameter for assessing text complexity: theor... (unclassified)
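The two ways of accounting for word frequency described in the statement above (coverage by a word list, and the mean or median of per-word frequencies) can be illustrated with a minimal sketch. The function names, the tokenizer, and the toy frequency table are assumptions made for illustration; they are not taken from any of the cited studies, which rely on real corpus-based frequency dictionaries and lemmatization.

```python
import re
from statistics import mean, median


def tokenize(text):
    """Lowercase the text and extract runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def list_coverage(tokens, word_list):
    """Share of tokens that appear in the reference word list."""
    if not tokens:
        return 0.0
    return sum(t in word_list for t in tokens) / len(tokens)


def frequency_summary(tokens, freq):
    """Mean and median frequency over tokens; unknown words count as 0."""
    values = [freq.get(t, 0.0) for t in tokens]
    return mean(values), median(values)


# Hypothetical toy frequency table (illustrative values, not corpus data).
toy_freq = {"the": 50000.0, "cat": 120.0, "sat": 80.0, "on": 30000.0}

tokens = tokenize("The cat sat on the ottoman")
coverage = list_coverage(tokens, set(toy_freq))
avg, med = frequency_summary(tokens, toy_freq)
```

Treating out-of-list words as frequency 0 is one possible design choice; skipping them instead would raise both the mean and the median.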

“…Accordingly, a text with a high proportion of frequent words should be easier to comprehend (Chen & Meurers 2018). Word frequency, as one of the features that affect complexity, is widely used both in studies of English-language materials (Lexile 2007, Graesser et al 2014) and for texts in Russian (Solovyev et al 2018, Glazkova et al 2021, Иомдин, Морозов 2021). On the other hand, researchers point to the absence of a significant correlation between frequency and text complexity as assessed with the classical Flesch-Kincaid formula (Мартынова et al.…”

Section: Introduction (unclassified)


Word frequency and text complexity: an eye-tracking study of young Russian readers

Laposhina

Lebedeva

Khenis

2022

Russian Journal of Linguistics


Although word frequency is often associated with the cognitive load on the reader and is widely used for automated text complexity assessment, to date, no eye-tracking data have been obtained on the effectiveness of this parameter for text complexity prediction for the Russian primary school readers. Besides, the optimal ways for taking into account the frequency of individual words to assess an entire text complexity have not yet been precisely determined. This article aims to fill these gaps. The study was conducted on a sample of 53 children of primary school age. As a stimulus material, we used 6 texts that differ in the classical Flesch readability formula and data on the frequency of words in texts. As sources of the frequency data, we used the common frequency dictionary based on the material of the Russian National Corpus and DetCorpus - the corpus of literature addressed to children. The speed of reading the text aloud in words per minute averaged over the grades was employed as a measure of the text complexity. The best predictive results of the relative reading time were obtained using the lemma frequency data from the DetCorpus. At the text level, the highest correlation with the reading speed was shown by the text coverage with a list of 5,000 most frequent words, while both sources of the lists - Russian National Corpus and DetCorpus - showed almost the same correlation values. For a more detailed analysis, we also calculated the correlation of the frequency parameters of specific word forms and lemmas with three parameters of oculomotor activity: the dwell time, fixations count, and the average duration of fixations. At the word-by-word level, the lemma frequency by DetCorpus demonstrated the highest correlation with the relative reading time. The results we obtained confirm the feasibility of using frequency data in the text complexity assessment task for primary school children and demonstrate the optimal ways to calculate frequency data.
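The abstract above measures complexity as reading speed in words per minute and relates it to frequency parameters through correlation. A minimal sketch of both calculations follows, assuming a plain Pearson coefficient (the abstract does not say which coefficient was used). All numbers below are hypothetical and are not results from the study.

```python
from math import sqrt


def words_per_minute(n_words, seconds):
    """Reading speed for a text of n_words read aloud in the given time."""
    return n_words * 60.0 / seconds


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical per-text values: coverage by a frequency list and the time
# (in seconds) needed to read a 150-word text aloud.
coverage = [0.62, 0.71, 0.75, 0.83]
seconds = [150, 140, 120, 100]
speed = [words_per_minute(150, s) for s in seconds]
r = pearson(coverage, speed)
```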


“…The reported results were obtained from text corpora of widely differing sizes and domains. Moreover, the authors used different machine learning (ML) models and text representation techniques (Feng et al 2010, Cantos & Almela 2019, Isaeva & Sorokin 2020, Deutsch et al 2020, Glazkova et al 2021, Martinc et al 2021). This makes it complicated to achieve an objective evaluation of the impact of different types of features.…”

Section: Introduction (mentioning)

confidence: 99%

Text complexity and linguistic features: Their correlation in English and Russian

Морозов

Glazkova

Iomdin

2022

Russian Journal of Linguistics


Text complexity assessment is a challenging task requiring various linguistic aspects to be taken into consideration. The complexity level of the text should correspond to the reader’s competence. An overly complicated text could be incomprehensible, whereas an overly simple one could be boring. For many years, simple features were used to assess readability, e.g. average length of words and sentences or vocabulary variety. Thanks to the development of natural language processing methods, the set of text parameters used for evaluating readability has expanded significantly. In recent years, many articles have investigated the contribution of various lexical, morphological, and syntactic features to the readability level. Nevertheless, as the methods and corpora are quite diverse, it may be hard to draw general conclusions as to the effectiveness of linguistic information for evaluating text complexity. Moreover, a cross-lingual impact of different features on various datasets has not been investigated. The purpose of this study is to conduct a large-scale comparison of features of different nature. We experimentally assessed seven commonly used feature types (readability, traditional features, morphological features, punctuation, syntax frequency, and topic modeling) on six corpora for text complexity assessment in English and Russian employing four common machine learning models: logistic regression, random forest, convolutional neural network and feedforward neural network. One of the corpora, the corpus of fiction literature read by Russian school students, was constructed for the experiment using a large-scale survey to ensure the objectivity of the labeling. We showed which feature types can significantly improve the performance and analyzed their impact according to the dataset characteristics, language, and data source.
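The "simple features" named in the abstract above (average word length, average sentence length, vocabulary variety) can be sketched as follows. This is an illustrative approximation only: the function name, the regex-based sentence splitting, and the use of the type-token ratio as the measure of vocabulary variety are assumptions, not the feature definitions used in the study.

```python
import re


def traditional_features(text):
    """Compute three simple surface features of a text."""
    # Naive sentence split on terminal punctuation (an assumption; a real
    # pipeline would use a proper sentence segmenter).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: distinct words divided by all words.
        "type_token_ratio": len(set(words)) / len(words),
    }


features = traditional_features("The cat sat. The cat ran away!")
```

Note that the type-token ratio depends on text length, so comparing it across texts of very different sizes would need some normalization.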


“…All of these corpora can be used for the creation and/or evaluation of automatic text simplification systems. As for the Russian language, the linguistic complexity of texts for children was studied on educational materials for Russian-speaking students at primary school (Laposhina et al, 2019) and secondary school (Solovyev et al, 2018; Vakhrusheva et al, 2021) and the collection of book previews labelled with one of two categories, children's or adult (Glazkova et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

A Comparative Study of Educational Texts for Native, Foreign, and Bilingual Young Speakers of Russian: Are Simplified Texts Equally Simple?

2021


Studies on simple language and simplification are often based on datasets of texts, either for children or learners of a second language. In both cases, these texts represent an example of simple language, but simplification likely involves different strategies. As such, this data may not be entirely homogeneous in terms of text simplicity. This study investigates linguistic properties and specific simplification strategies used in Russian texts for primary school children with different language backgrounds and levels of language proficiency. To explore the structure and variability of simple texts for young readers of different age groups, we have trained models for multiclass and binary classification. The models were based on quantitative features of texts. Subsequently, we evaluated the simplification strategies applied to readers of the same age with different linguistic backgrounds. This study is particularly relevant for the Russian language material, where the concept of easy and plain language has not been sufficiently investigated. The study revealed that the three types of texts cannot easily be distinguished from each other by judging the performance of multiclass models based on various quantitative features. Therefore, it can be said that texts of all types exhibit a similar level of accessibility to young readers. In contrast, binary classification tasks demonstrated better results, especially in the R-native vs. non R-native track (with 0.78 F1-score). These results may indicate that the strategies used for adapting or creating texts for each type of audience are different.
