The aim of this research is to improve the quality of domain dictionaries by expanding the corpus of documents under study to include short documents. A document model is proposed that makes it possible to identify a short document and to determine whether it must be combined with other documents in order to extract verbose (multi-word) terms. An algorithm for extracting the substantive part of a document has been developed, since in a short document the heading and closing parts usually contain terms unrelated to the domain under study. A method for preliminary clustering of short documents for the extraction of verbose terms has been developed. The method is based on extracting nouns (one-word terms) and counting their occurrences across all analyzed documents. The concept of document proximity is introduced; it is determined by a combination of two criteria: the relative number of matching terms and the relative frequency of occurrence of matching terms. The way documents are grouped at the customer's site often does not correspond to the grouping needed to build a domain dictionary. In a short document, it is usually impossible to isolate a verbose term because terms are repeated too rarely. A method has been developed for virtually combining short documents so as to achieve the necessary repetition of one-word terms. The merged document has the highest possible term frequency for the cluster it belongs to. At the same time, the original text of the documents is preserved, as is the ability to associate each extracted verbose term with the documents in which it occurs. The experiment made it possible to find the best ratio between the elements of the document proximity coefficient and confirmed the effectiveness of the proposed preliminary clustering method.
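The two proximity criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the function name `proximity`, the normalisation of each criterion, and the linear combination controlled by `weight` (standing in for the ratio that the experiment tunes). Documents are assumed to be already reduced to lists of nouns.

```python
from collections import Counter


def proximity(doc_a, doc_b, weight=0.5):
    """Hypothetical proximity score in [0, 1] for two lists of one-word terms.

    `weight` balances the two criteria; the linear combination is an
    assumption, not the formula used in the paper.
    """
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    shared = counts_a.keys() & counts_b.keys()
    if not shared:
        return 0.0

    # Criterion 1: relative number of matching terms
    # (shared distinct terms over all distinct terms in both documents).
    distinct = counts_a.keys() | counts_b.keys()
    term_overlap = len(shared) / len(distinct)

    # Criterion 2: relative frequency of occurrence of matching terms
    # (occurrences of shared terms over all term occurrences).
    total = sum(counts_a.values()) + sum(counts_b.values())
    shared_occurrences = sum(counts_a[t] + counts_b[t] for t in shared)
    freq_overlap = shared_occurrences / total

    return weight * term_overlap + (1 - weight) * freq_overlap


if __name__ == "__main__":
    a = ["pump", "valve", "pump"]
    b = ["pump", "pipe"]
    print(proximity(a, b))
```

Under these assumptions, documents whose score exceeds a chosen threshold would be candidates for the same cluster and for virtual merging.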
In this paper, a method is proposed for forming definitions of terms for a subject-domain dictionary using existing explanatory dictionaries. It is shown that a combined search for terms and their interpretations finds only about ten percent of the definitions, which is clearly not enough. A method of automated search for the interpretations of terms is proposed that makes use of existing explanatory dictionaries. Mathematical models are proposed for a subject-domain dictionary entry and for an explanatory dictionary entry; the latter takes into account the headword, the set of interpretations of the word, usage labels, and set phrases. A mechanism has been developed for extracting the definitions of a term from an explanatory dictionary depending on the structure of its dictionary entry. Algorithms have been developed for the automated search for definitions of single-word terms and of verbose (multi-word) terms, the latter based on selecting the nouns from the term. A mechanism is proposed for assessing the quality of candidate interpretations according to the occurrence of subject-domain terms. A mechanism has been developed for choosing definitions when a term from the subject-domain dictionary and a term in the explanatory dictionary coincide only partially; it is based on decomposing the term, searching for partial interpretations, and synthesizing the resulting interpretation. Software has been developed that searches for interpretations of terms both in local explanatory dictionaries (previously loaded into the system) and in online dictionaries. The expert's task is to evaluate the interpretations found and, where necessary, to edit them.
Experimental evaluation of the software showed that it reduces the expert’s working time approximately fourfold compared with the “manual mode”.
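The search procedure described above (look up the whole term, fall back to its component nouns, and rank candidate interpretations by the subject-domain terms they mention) can be sketched as follows. This is an illustration under stated assumptions, not the software described in the paper: the explanatory dictionary is modelled as a plain mapping from headword to a list of interpretations, the words of a verbose term are assumed to be nouns already (no part-of-speech tagging is done), and the names `score`, `best_definition`, and `define_term` are hypothetical.

```python
import re


def score(interpretation, domain_terms):
    """Count how many domain terms occur as words in an interpretation."""
    words = set(re.findall(r"\w+", interpretation.lower()))
    return sum(1 for term in domain_terms if term in words)


def best_definition(headword, dictionary, domain_terms):
    """Return the highest-scoring interpretation of a headword, or None."""
    candidates = dictionary.get(headword, [])
    if not candidates:
        return None
    return max(candidates, key=lambda d: score(d, domain_terms))


def define_term(term, dictionary, domain_terms):
    """Look up the whole term; if absent, decompose it into its words.

    Assumption: every word of a multi-word term is treated as a noun.
    Returns a mapping from each matched headword to its chosen interpretation,
    leaving the synthesis of a final definition to the expert.
    """
    whole = best_definition(term, dictionary, domain_terms)
    if whole is not None:
        return {term: whole}
    parts = {}
    for word in term.split():
        definition = best_definition(word, dictionary, domain_terms)
        if definition is not None:
            parts[word] = definition
    return parts


if __name__ == "__main__":
    dictionary = {
        "pump": ["a light shoe", "a device that moves fluid through a pipe"],
        "station": ["a place where a service is based"],
    }
    print(define_term("pump station", dictionary, {"fluid", "pipe"}))
```

The partial interpretations returned for a decomposed term correspond to the input of the synthesis step, which in this sketch is left to the expert rather than automated.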