This paper presents the results of our participation in the Russian News Clustering task within Dialogue Evaluation 2021. News clustering is a common task in industry; its purpose is to group news items by event. We propose two BERT-based methods for news clustering, one of which shows competitive results in the Dialogue 2021 evaluation. The first method uses supervised representation learning. The second reduces the problem to binary classification.
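The reduction of clustering to binary classification mentioned above can be illustrated with a minimal sketch. It assumes a pairwise predicate that decides whether two news items describe the same event, and merges every positive pair with a union-find structure. The function `same_event` below is a hypothetical toy stand-in (token overlap) for a fine-tuned BERT pair classifier, and the merging step is an assumption of this sketch; the abstract does not say how the authors turn pairwise decisions into clusters.

```python
from itertools import combinations


def same_event(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for a trained pair classifier.

    Uses Jaccard overlap of lowercase tokens; a real system would call
    a fine-tuned model here instead.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def cluster_by_pairs(docs, is_same=same_event):
    """Assign a cluster label to each document from pairwise decisions."""
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair judged "same event" merges the two clusters.
    for i, j in combinations(range(len(docs)), 2):
        if is_same(docs[i], docs[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Relabel roots as consecutive integers in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(docs))]


docs = [
    "central bank raises key rate",
    "central bank raises key rate again",
    "team wins hockey final",
    "hockey team wins final",
]
print(cluster_by_pairs(docs))
```

Merging all positive pairs is equivalent to taking connected components, so a single false positive can join two unrelated clusters. Comparing all pairs is also quadratic in the number of documents, which is acceptable only for small batches.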
Computing text similarity is one of the most challenging tasks in NLP, as it requires understanding semantics beyond the meaning of individual words (tokens). Because labelled data are scarce, this task is often handled with unsupervised methods such as clustering. Within the DE2021 task "Russian News Clustering and Headline Selection", we propose a method for building robust text embeddings based on the Sentence Transformers architecture, pretrained on a large in-domain dataset and then fine-tuned on a small dataset of paraphrases using GlobalMultiheadPooling.
We propose to use a generative model to classify nonverbal communicative movements (gestures, head and body postures, facial expressions). This model represents the communicative movement as a function of the emotional state, the communicative goal, and reference, and demonstrates the pathways of how the movement could be generated in communication. The proposed generative model distributes communicative movements based on the stimulus for their generation. We distinguish movements (a) related to the meaning of the sentence, (b) emotional uncontrolled movements and their controlled variants, (c) emotional states, simulated to influence the addressee, and (d) movements related to utterance production, expectation of feedback, or lack of feedback. This model can potentially classify the entire spectrum of nonverbal actions and can be applied to control robots and emotional computer agents.
In this paper, we address the problem of speeding up a recognition-based algorithm for compressing hyperspectral images (HSI). Two methods are proposed to reduce the computational complexity of the lossy compression algorithm. The first method reuses compression results obtained with other parameters, including those of the recognition method. The second method adaptively partitions the pixels of the hyperspectral image into clusters and computes similarity estimates only against the templates of one of the subsets. Theoretical and practical estimates of the resulting speed-up of the compression algorithm are obtained.
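The second speed-up idea, comparing each pixel only with the templates of one subset, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm: pixels and templates are treated as plain spectral vectors, similarity is taken to be Euclidean distance, the cluster centroids are given rather than adapted to the image, and the names `build_index` and `nearest_template` are hypothetical.

```python
import math


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vector, centroids[k]))


def build_index(templates, centroids):
    """Group template indices by their nearest centroid."""
    index = {c: [] for c in range(len(centroids))}
    for t_id, template in enumerate(templates):
        index[nearest_centroid(template, centroids)].append(t_id)
    return index


def nearest_template(pixel, templates, centroids, index):
    """Search only the templates that share the pixel's cluster.

    Falls back to all templates if that cluster holds none.
    """
    cluster = nearest_centroid(pixel, centroids)
    candidates = index[cluster] or range(len(templates))
    return min(candidates, key=lambda t: math.dist(pixel, templates[t]))


# Toy three-band "spectra"; the values are made up for illustration.
templates = [(0.1, 0.1, 0.2), (0.2, 0.1, 0.1), (0.9, 0.8, 0.9), (0.8, 0.9, 0.9)]
centroids = [(0.15, 0.1, 0.15), (0.85, 0.85, 0.9)]
index = build_index(templates, centroids)

print(nearest_template((0.88, 0.82, 0.9), templates, centroids, index))
print(nearest_template((0.12, 0.1, 0.19), templates, centroids, index))
```

With k roughly balanced clusters, each pixel is compared with about 1/k of the templates plus k centroids. The price is that the search becomes approximate: the globally closest template may sit in a neighbouring cluster and be missed.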
The paper focuses on divergent ways of conveying discourse relations in translation. For data collection, we used the supracorpora database of connectives, which stores parallel texts from the Russian-French subcorpus of the Russian National Corpus. These data show which logical-semantic relations tend to be translated in divergent ways, i.e. by means other than connectives (exclusion in its various gradations, propositional concomitance and substitution, with the share of divergent translations ranging from 30% to 50%). The data also help identify what causes divergent translations to be used. The causes may be as follows: (a) the lack of an adequate equivalent of a given connective in the target language; (b) differences in the syntactic structure of the source and target languages; (c) usage differences; (d) contextually determined use of divergent translation. If there is a prototypical indicator of logical-semantic relations (i.e. a connective) in the source text, it also occurs in the translation in more than 90% of cases. The data on human translations are then compared with those on machine translations, which shows that the machine translation system also tends to keep a connective if there is one in the source text (this happens in almost 98% of cases). However, there are cases where the machine translation system has difficulty processing a multiword connective (failing to perceive it as a whole) or a polyfunctional unit (failing to tell a connective from a non-connective) and thus translates it divergently. Some causes of divergent translation of connectives are likely to be the same for human and machine translation, namely differences in the syntactic structure of the languages and usage differences. Further research on divergent means of conveying discourse relations will make it possible to draw a sharper borderline between explicitly expressed and implicit discourse relations.
The data collected from annotated corpora (monolingual as well as multilingual and parallel) will help determine what the divergent ways of expressing logical-semantic relations are and how frequently they are used. The research results can be used in both automatic text processing and automatic text generation. The data on divergent translations of discourse relations can also serve to improve machine translation quality.