Text readability is a measure of how easy or difficult a text is to read. It plays a crucial role in both drafting and comprehending texts, and it affects the choice of suitable reading material. Research on text readability dates back to the late nineteenth century and has produced many practical applications. However, most of this work has been carried out on English and other widely studied languages. For Vietnamese, text readability remains relatively unexplored and has received attention only in recent years, as part of efforts to improve the curriculum and teaching methods. Recent studies on Vietnamese text readability are still limited, largely because of a lack of text resources, namely corpora graded by difficulty level. Therefore, in this study, we focus on building a corpus for assessing the readability of Vietnamese texts in the literature domain through a process of collecting, processing and evaluating documents. The result is a corpus of 1,825 Vietnamese texts divided into four levels of difficulty (Very easy, Easy, Medium and Difficult). Experiments with existing Vietnamese readability assessment methods show that the corpus is reliable and usable for further research on text readability.
While English text readability has been studied for a long time, the investigation of text readability in Vietnamese, a low-resource language with limited research tools and data sets of questionable international standing, is only beginning. In readability research, it is generally the “word” that has been most carefully investigated. Based on a comparison with the elements that affect readability at the “word” level in English, we determine the parts of speech (POS) that influence Vietnamese text readability. In this study, prose texts from Vietnamese textbooks at different difficulty levels were taken as the data for finding the POS frequencies and their correlations. In terms of frequency, our findings can initially assist users in editing documents and in reforming textbooks and question banks for native Vietnamese speakers in general and foreign learners in particular. More importantly, these findings let us identify the linguistic elements that can be considered the “potential” POS affecting Vietnamese text readability, and they lay the groundwork for further studies.
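As a rough illustration of the kind of analysis described above, the sketch below computes the relative frequency of each POS tag in a text and then the Pearson correlation between each tag's frequency and the difficulty level. It is only a minimal sketch under stated assumptions, not the procedure used in the study: the texts are assumed to be already POS-tagged, the tag labels (`N`, `V`, `A`) and the toy corpus are hypothetical placeholders, difficulty levels are assumed to be encoded as integers, and Pearson correlation is an assumed choice, since the abstract does not name the statistic used.

```python
from collections import Counter
from math import sqrt


def pos_frequencies(tagged_text):
    """Return the relative frequency of each POS tag in one tagged text.

    tagged_text is a list of (word, tag) pairs.
    """
    counts = Counter(tag for _, tag in tagged_text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def pos_level_correlations(corpus):
    """Correlate each tag's relative frequency with the difficulty level.

    corpus is a list of (level, tagged_text) pairs, with level an integer.
    """
    levels = [level for level, _ in corpus]
    freqs = [pos_frequencies(text) for _, text in corpus]
    tags = sorted({tag for f in freqs for tag in f})
    return {
        tag: pearson([f.get(tag, 0.0) for f in freqs], levels)
        for tag in tags
    }


# Hypothetical toy corpus: (difficulty level, POS-tagged text).
toy_corpus = [
    (1, [("w1", "N"), ("w2", "V"), ("w3", "N")]),
    (2, [("w1", "N"), ("w2", "V"), ("w3", "A"), ("w4", "V")]),
    (3, [("w1", "A"), ("w2", "V"), ("w3", "A"), ("w4", "N"), ("w5", "A")]),
]

if __name__ == "__main__":
    for tag, r in pos_level_correlations(toy_corpus).items():
        print(f"{tag}: r = {r:+.3f}")
```

A tag whose frequency rises or falls consistently with the level would be a candidate for the “potential” POS mentioned above; with a real corpus, a rank-based statistic may be a better fit for ordinal difficulty levels.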