The paper presents the pilot version of the first RST discourse treebank for Russian. The project started in 2016. At present, the treebank consists of sixty news texts annotated for rhetorical relations according to the RST scheme. The scheme was slightly modified to achieve a higher inter-annotator agreement score. During annotation, we also recorded discourse connectives of different types and mapped them onto the corresponding rhetorical relations. In this paper, we discuss our experience of adapting the RST scheme to Russian news texts. We also report on the distribution of the most frequent discourse connectives in our corpus.
This work presents the first fully-fledged discourse parser for Russian based on the Rhetorical Structure Theory of Mann and Thompson (1988). For segmentation, discourse tree construction, and discourse relation classification, we employ deep learning models. Using multiple word embedding techniques, we achieve a new state of the art for discourse segmentation of Russian texts. We find that neural classifiers using contextual word representations outperform previously proposed feature-based models for discourse relation classification. By ensembling both methods, we further improve the performance of discourse relation classification, achieving a new state of the art for Russian.
Results of the first experimental evaluation of machine learning models trained on Ru-RSTreebank, the first Russian corpus annotated within the RST framework, are presented. Various lexical, quantitative, morphological, and semantic features were used. In rhetorical relation classification, an ensemble of a CatBoost model with selected features and a linear SVM model provides the best score (macro F1 = 54.67 ± 0.38). We find that most of the important features for rhetorical relation classification are related to discourse connectives derived from the connectives lexicon for Russian and from other sources.
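For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a macro-averaged F1 score is computed: F1 is calculated separately for each relation label and the per-label scores are averaged without weighting by label frequency. The function name `macro_f1` and the toy labels are hypothetical illustrations and are not taken from the paper or the corpus; the result is a fraction between 0 and 1.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from Ru-RSTreebank.
gold = ["Contrast", "Elaboration", "Elaboration", "Cause"]
pred = ["Contrast", "Elaboration", "Cause", "Cause"]
score = macro_f1(gold, pred)
```

Because every label contributes equally to the average, rare relations influence the score as much as frequent ones, which is one reason the metric is commonly used for imbalanced label sets.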
The paper presents a corpus study of discourse features in a corpus of blogs. It is based on the data of Ru-RSTreebank, annotated within the framework of Rhetorical Structure Theory [Mann, Thompson 1988]. Ru-RSTreebank covers news, popular science texts, scientific papers, and blog texts. The blog subcorpus covers topics such as travelling, cosmetics, sports and health, psychology, IT and tech, and some others. Blog texts constitute a specific genre, as they combine properties of written and spoken discourse. The purpose of the paper is to investigate discourse features of blogs in comparison with other genres. We analyze the variation in the distribution of rhetorical relations among genres and single out the differences in the usage of discourse connectives. Furthermore, we check whether other discourse features reported in different studies for spoken discourse and for social media occur in the Ru-RSTreebank blog subcorpus. The general frequency analysis and experiments applying a RandomForest classifier to genre recognition show that the most important rhetorical relations specific to blogs are Evaluation and Contrast, and that blogs tend to use shorter discourse units and not to express discourse relations overtly via subordinating conjunctions.