This article describes an original method for the automatic summarization of scientific and technical texts that is based on rhetorical analysis and topic modeling. The proposed method combines a linguistic knowledge base with machine learning. Key terms are detected with topic modeling: unigram topic models containing only one-word terms are constructed first and are then extended with multiword terms. The most significant fragments of the original document are identified during rhetorical analysis with the help of discourse markers. The importance estimate for a text fragment also takes into account keywords, multiword terms, and the scientific lexicon characteristic of scientific and technical texts. A linguistic knowledge base has been created to store information about the markers and the scientific lexicon. The experiments showed that the method is effective, needs a comparatively small amount of training data, and can be adapted to texts from different subject fields and in other languages.
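The fragment-scoring idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker list, the weights, the lexicon, and the function names are all hypothetical, and the key terms are passed in by hand rather than produced by a topic model. Matching is by simple substring search, which is a simplification.

```python
import re

# Hypothetical stand-ins for the linguistic knowledge base described in the
# abstract; the real markers, lexicon, and weights are not given in the source.
DISCOURSE_MARKERS = {"we propose": 2.0, "the results show": 2.0, "in conclusion": 1.5}
SCIENTIFIC_LEXICON = {"method", "experiment", "model", "analysis"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, key_terms):
    """Score one sentence from discourse markers, key terms, and scientific lexicon."""
    lowered = sentence.lower()
    tokens = set(re.findall(r"[a-z]+", lowered))
    # Discourse markers contribute their (assumed) weight when present.
    score = sum(weight for marker, weight in DISCOURSE_MARKERS.items() if marker in lowered)
    # Each key term (one-word or multiword, lowercase) found in the sentence adds 1.0.
    score += sum(1.0 for term in key_terms if term in lowered)
    # Each distinct scientific-lexicon word adds 0.5.
    score += 0.5 * len(tokens & SCIENTIFIC_LEXICON)
    return score


def summarize(text, key_terms, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], key_terms),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


example_text = (
    "We propose a method for summarization. "
    "The weather was nice. "
    "The results show that the topic model works."
)
example_summary = summarize(example_text, ["topic model", "summarization"])
```

In this toy input the second sentence contains no marker, key term, or lexicon word, so it is the one left out of the two-sentence summary.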
The paper describes a generalization of Niraj Kumar's summarization algorithm. The proposed method uses the Link Grammar Parser. Our investigation is oriented toward processing news articles, reviews from social networks, and similar texts. We consider applying this algorithm to estimate the relevance of posts published on the Internet to selected articles published earlier. This approach is useful for identifying the source of information dissemination.
The paper describes methods for comparing natural-language sentences in order to estimate their similarity. One way to solve this problem is to use the semantic-syntactic relations between words constructed by the Link Grammar Parser software system. We plan to use the results of this research in information retrieval systems. The application of these methods to the study of Turkic languages is briefly described.
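As a rough illustration of comparing sentences through word-to-word relations, the sketch below computes a Jaccard overlap between two sets of links. This is an assumed stand-in measure, not the one used in the paper, and the links are written by hand as hypothetical (label, left word, right word) triples rather than produced by the Link Grammar Parser.

```python
def link_similarity(links_a, links_b):
    """Jaccard overlap of two collections of (label, left_word, right_word) links."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty link sets count as dissimilar
    return len(a & b) / len(a | b)


# Hand-written, illustrative links for "the cat sleeps" and "the cat eats fish".
links_first = [("D", "the", "cat"), ("S", "cat", "sleeps")]
links_second = [("D", "the", "cat"), ("S", "cat", "eats"), ("O", "eats", "fish")]

similarity = link_similarity(links_first, links_second)
```

The two sentences share one link out of four distinct links, so the score is 0.25. A practical measure would likely need to tolerate synonyms and differing word forms, which exact triple matching does not.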