Identifying which words in a text may be difficult for average readers to understand is a well-known subtask in text complexity analysis. The advent of deep language models has also established a new state of the art in this task through semi-supervised pre-training and downstream fine-tuning of, mainly, transformer-based neural networks. Nevertheless, the usefulness of traditional linguistic features in combination with neural encodings is worth exploring, as the computational cost of training and running such networks is becoming increasingly relevant under energy-saving constraints. This study explores lexical complexity prediction (LCP) by combining pre-trained and fine-tuned transformer networks with different types of traditional linguistic features, which we also feed to classical machine learning classifiers. Our best results are obtained by applying Support Vector Machines on an English corpus in an LCP task solved as a regression problem. The results show that linguistic features can be useful in LCP tasks and may improve the performance of deep learning systems.
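The following is a minimal sketch, not the authors' exact pipeline, of how contextual transformer encodings can be concatenated with hand-crafted linguistic features and fed to a support vector regressor for LCP treated as regression. The embeddings and linguistic features (word length, log frequency, syllable count) are illustrative placeholders; in practice the embeddings would come from a pre-trained transformer's representation of the target word.

```python
# Sketch: combine (placeholder) transformer encodings with traditional
# linguistic features and train an SVM regressor on complexity scores.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim = 200, 768

# Placeholder for target-word encodings from a pre-trained transformer.
embeddings = rng.normal(size=(n_words, emb_dim))

# Hypothetical traditional linguistic features per target word.
ling_features = np.column_stack([
    rng.integers(3, 15, n_words),   # word length
    rng.uniform(0.0, 7.0, n_words), # log corpus frequency
    rng.integers(1, 6, n_words),    # syllable count
])

X = np.hstack([embeddings, ling_features])
y = rng.uniform(0.0, 1.0, n_words)  # gold complexity scores in [0, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```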
This article describes a system to predict the complexity of words for the Lexical Complexity Prediction (LCP) shared task hosted at SemEval 2021 (Task 1), which introduced a new English dataset annotated on a Likert scale. Located in the Lexical Semantics track, the task consisted of predicting the complexity value of words in context. A machine learning approach was applied, based on word frequency and several additional word-level features, on which a supervised random forest regression algorithm was trained. Several runs were performed with different parameter values to observe the performance of the algorithm. In the evaluation, our best results reported an MAE of 0.07347, an MSE of 0.00938, and an RMSE of 0.096871. Our experiments showed that prediction accuracy increases as more features are added.
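As a rough illustration of the kind of setup described, the sketch below trains a random forest regressor on word-level features and reports MAE, MSE, and RMSE. The feature set and synthetic data are assumptions for demonstration only, not the shared-task data or the authors' configuration.

```python
# Sketch: random forest regression on word-level features for LCP,
# evaluated with MAE, MSE, and RMSE on synthetic placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(0.0, 7.0, n),  # log word frequency
    rng.integers(3, 15, n),    # word length
    rng.integers(1, 6, n),     # syllable count
])
y = rng.uniform(0.0, 1.0, n)   # complexity scores (normalized Likert scale)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_tr, y_tr)
pred = rf.predict(X_te)

mae = mean_absolute_error(y_te, pred)
mse = mean_squared_error(y_te, pred)
rmse = np.sqrt(mse)
print(f"MAE={mae:.5f}  MSE={mse:.5f}  RMSE={rmse:.5f}")
```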
Students often need a better understanding of the vocabulary that teachers use when presenting an assignment in class or in written texts provided as supporting material. Identifying and labelling difficult words has allowed us to examine this problem. A sample of students from the University of Guayaquil (Ecuador) took part in an experiment on a corpus of video transcripts corresponding to different degree programs. The analysis of the tagged words confirms the conclusions reached by other studies in lexical simplification and corroborates the recommendations of the Easy Reading guide published by Inclusion Europe in 2009. The investigation determined that the words labeled as difficult included specialized terms, common lexical words, slang, English words, and acronyms, among others. Students found it difficult to understand the meaning of these words; in some cases they did not know their definitions or simply had a mistaken idea of them. This work aims to contribute to future research in lexical simplification applied to the development of solutions for detecting difficult words in the university academic domain. In addition, the types of complex expressions identified in the VYTEDU-CW corpus were characterized, which enriches this resource and opens the possibility of organizing a workshop to promote research on the detection of difficult words in Spanish. The resources needed to validate such solutions are available to the scientific community.