Text complexity assessment is a challenging task that requires various linguistic aspects to be taken into consideration. The complexity level of a text should correspond to the reader’s competence: a text that is too complicated may be incomprehensible, whereas one that is too simple may be boring. For many years, simple features were used to assess readability, e.g. average length of words and sentences or vocabulary variety. Thanks to the development of natural language processing methods, the set of text parameters used for evaluating readability has expanded significantly. In recent years, many articles have investigated the contribution of various lexical, morphological, and syntactic features to the readability level. Nevertheless, because the methods and corpora are quite diverse, it is hard to draw general conclusions about the effectiveness of linguistic information for evaluating text complexity. Moreover, the cross-lingual impact of different features on various datasets has not been investigated. The purpose of this study is to conduct a large-scale comparison of features of different nature. We experimentally assessed seven commonly used feature types (readability, traditional features, morphological features, punctuation, syntax frequency, and topic modeling) on six corpora for text complexity assessment in English and Russian, employing four common machine learning models: logistic regression, random forest, convolutional neural network, and feedforward neural network. One of the corpora, a corpus of fiction read by Russian school students, was constructed for the experiment using a large-scale survey to ensure the objectivity of the labeling. We show which feature types can significantly improve performance and analyze their impact according to dataset characteristics, language, and data source.
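As an illustration of the "simple features" mentioned in the abstract above (average word length, average sentence length, and vocabulary variety), the following is a minimal sketch in Python. It is not the authors' feature extractor: the function name `traditional_features`, the regex-based tokenization, and the use of the type-token ratio as a proxy for vocabulary variety are all assumptions made for illustration.

```python
import re


def traditional_features(text: str) -> dict:
    """Compute three simple surface features of a text (illustrative only)."""
    # Rough sentence split on ., ! and ?; real pipelines use a proper tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ matches Unicode word characters, so this works for English and Russian.
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sent_len": 0.0, "type_token_ratio": 0.0}
    return {
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Mean number of words per sentence.
        "avg_sent_len": len(words) / len(sentences),
        # Distinct words divided by total words (assumed proxy for vocabulary variety).
        "type_token_ratio": len(set(words)) / len(words),
    }


features = traditional_features("The cat sat. The cat ran!")
```

Features of this kind could then be passed as input columns to a classifier such as logistic regression or random forest, the model families named in the abstract.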
Our study tackles Russian interrogative-relative pronouns (wh-words) as a lexicographic type which requires a unified treatment. Our objective is to give a systematic description and explanation of the numerous collocational and constructional properties of the Russian wh-words using lexicographic and corpus methods. The dataset and statistics were extracted from the Russian National Corpus; at least 100 examples were analysed for each of the pronouns. Methodologically, the study is based on the principles of the Moscow School of Semantics (namely, integral description of language and systematic lexicography), which are to a large extent rooted in the “Meaning⇔Text” theory. These principles include analysis of linguistic items on all levels of language, a focus on their semantic and combinatorial properties, and the acknowledged validity of the dictionary as an instrument of linguistic research. The paper considers semantic, syntactic and co-occurrence properties shared by many Russian interrogative pronouns and analyzes the reasons for their almost complete absence in the pronouns zachem ‘what for’ and pochemu ‘why’. As demonstrated in the study, most of the constructional and co-occurrence properties typical of Russian interrogative pronouns (for example, co-occurrence with the particles imenno ‘exactly’ and khot’ ‘at least’, constructions with mnogo ‘many’, malo ‘few’, etc.) are motivated by the semantics of multiplicity and choice, which are incompatible with ‘what for’ and ‘why’. In addition, as the findings show, different interrogative pronouns have different frequencies of occurrence in the described constructions, which is explained not by their general corpus frequencies or by the animacy hierarchy, but by the compatibility of their semantics with the meanings of multiplicity and choice.
The results suggest that the combinatorial properties of wh-words are motivated by their semantics, which in turn reflects the meta-linguistic characteristics of the situations to which they refer.