Assessing the population's quality of life is an important and topical sociological task. Machine learning, used to classify the digital traces of social network users, makes it possible to build a basis for calculating a subjective quality-of-life index. The article reviews, step by step, all stages of applying machine learning algorithms to assess the quality of life of the population in the regions of the Russian Federation, along with the issues involved in improving neural network accuracy. To train the neural network, the authors compiled a labelled dataset extracted from regional communities of the social network “VKontakte”. Various approaches to text vectorisation, publicly available neural network models pre-trained on large Russian-language text corpora, and metrics for evaluating the algorithms' results were analysed. Computational experiments with different algorithms were carried out, and on the basis of their results the Rubert-tiny model was selected for its high training and classification speed. After tuning the model parameters, a macro-averaged F1 score of 0.545 was achieved. Computational experiments were carried out using Python scripts. Typical errors that a neural network makes during automatic content classification were considered. The results of the study can be used to calculate an online activity index for VKontakte users from various Russian regions, which will in turn serve as the basis for calculating the subjective quality-of-life index in the future. Improving the neural network's accuracy will make it possible to obtain more reliable data for assessing quality of life in Russian regions from users' digital traces.
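As an illustration of the metric reported above, the following is a minimal Python sketch of macro-averaged F1: the per-class F1 scores are computed separately and then averaged with equal weight, so rare classes count as much as frequent ones. The function name `macro_f1` and the class labels in the usage note are hypothetical and are not taken from the study; the authors' own scripts and label set are not described in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)

    return sum(per_class) / len(per_class)


# Hypothetical labels, for illustration only
gold = ["health", "health", "housing", "work"]
pred = ["health", "housing", "housing", "work"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, a single poorly recognised category pulls the macro score down noticeably, which is one reason this metric is often preferred over plain accuracy on imbalanced label sets.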
Sentiment analysis is one of the most in-demand natural language processing operations for solving applied problems. One of the key methods of automated sentiment analysis is supervised machine learning. Although a large selection of ready-made solutions for determining sentiment is available, the models' results contain significant errors because the linguistic expression of emotions is complex and context-dependent. The article presents the results of validating 6 models for determining the sentiment of Russian-language publications against a research validation dataset: 300 expert-labelled statements extracted from social network messages on the subject of quality of life, each corresponding to one of three sentiment types (positive, negative, neutral). To evaluate the performance of the models, inter-annotator agreement coefficients were used, in particular Krippendorff’s alpha, Cohen’s kappa and Fleiss’s kappa. The obtained coefficient values showed a low level of agreement between the expert labels and the labels assigned by the models. Among the experiments performed, the lowest agreement coefficients were obtained for the Blanchefort model trained on Rusentiment data, and the highest for the model by the same developer trained on medical feedback data. Based on the results, conclusions were drawn about the most common causes of disagreement in sentiment determination by machine learning models. Machine learning models correctly identify the tone of a text if it contains vivid lexical markers whose tone matches the overall tone of the statement. Conversely, a model has trouble determining the tone of an emotionally charged message when the message contains a word with the opposite tone.
Emotive vocabulary that does not match the tone of the entire statement, marker words used outside their direct meanings, the use of uppercase, and forms of complicated communication (including irony and sarcasm) remain risk factors when relying on automated analysis: with a high degree of probability, the automatic classification model will not be able to determine the tone of the text correctly. The main reason for the “difficulties” of automated sentiment determination is the complexity of focusing on the utterance as an integral unit rather than on individual formal indicators. The utterance is the minimum communicative unit of speech. Capturing its semantic and emotionally expressive integrity is an overarching challenge for machine learning models in sentiment analysis. It is therefore still quite difficult to trust machine learning models with a task as complex as the automated categorization of emotions. Prospects for research in this area should be associated, first of all, with the development of high-quality, linguistically sound training datasets.
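To make the agreement-based evaluation above more concrete, here is a minimal Python sketch of Cohen’s kappa for two label sequences, for example expert labels versus one model's labels. It is an illustration under stated assumptions, not the study's procedure: the function name `cohen_kappa` and the six example labels are hypothetical, and the abstract does not say how the coefficients were actually computed.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n

    # Expected agreement if both raters labelled independently,
    # each following their own label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: expert labels versus model labels
expert = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
model = ["positive", "negative", "negative", "negative", "neutral", "positive"]
kappa = cohen_kappa(expert, model)
```

In this toy example the raters agree on 4 of 6 items (observed agreement 2/3), while chance agreement is 1/3, giving a kappa of 0.5. Values near 0 mean agreement is no better than chance, which is the sense in which low coefficients indicate weak reliability of model labels relative to expert labels.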