The paper presents the General Internet Corpus of the Russian Language (GICR) as a tool for linguistic research. It identifies problems that are common to any web corpus and that affect the reliability of such research. The problems considered include the need to account for sociolinguistic variability, the influence of falsely attributed texts, thematic biases, and the prospects and drawbacks of new methods for aggregating corpus output. A distinctive feature of our approach is its emphasis on the linguistic significance, reliability, and interpretability of the results obtained.
This paper presents the results of a study of the applicability of state-of-the-art (SOTA) methods for morphological corpus annotation (based on GramEval2020) to analytical sociolinguistic research. The study shows that statistically successful morphosyntactic annotation technologies create a number of problems for researchers when they are used on their own, i.e. without any linguistic knowledge. The paper presents methods for improving the reliability of morphological annotation that have been successfully implemented in GICR.