2017
DOI: 10.1088/1742-6596/937/1/012046

A comparison of Data Driven models of solving the task of gender identification of author in Russian language texts for cases without and with the gender deception

Abstract: In this work we compare several data-driven approaches to the task of author's gender identification for texts with or without gender imitation. The data corpus was gathered specially for this task through crowdsourcing. The best models are a convolutional neural network with morphological data as input (F1-measure: 88% ± 3) for texts without imitation, and a gradient boosting model with a vector of character n-gram frequencies as input (F1-measure: 64% ± 3) for texts with gender imitation. The metho…
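As a rough illustration of two terms used in the abstract, the sketch below builds a character n-gram frequency vector for a text and computes an F1-measure for binary labels. It is not the authors' pipeline: the function names, the choice of n = 3, and the single-positive-class F1 are assumptions made only for this example, since the abstract specifies neither the n-gram length nor the averaging scheme.

```python
from collections import Counter


def char_ngram_frequencies(text: str, n: int = 3) -> dict:
    """Relative frequency of each character n-gram in `text`.

    The n-gram length is an assumption; the abstract does not state it.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


def f1_measure(y_true, y_pred, positive=1) -> float:
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical usage: a feature vector for one short text, and F1 on toy labels.
features = char_ngram_frequencies("привет, как дела?", n=3)
score = f1_measure([1, 1, 0, 0], [1, 0, 1, 0])
```

In a full model, frequency dictionaries like `features` would be aligned to a shared vocabulary and passed to a classifier such as gradient boosting; that step is omitted here.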

Cited by 4 publications (3 citation statements)
References 6 publications
“…For such languages as English, Spanish, and Arabic, there are large text corpora to create data-driven models to identify author's profile (see Section 3). A few last years the author's profiling identification task has been rapidly developing for Russian, and it is strongly related to formation of similar Russian corpora, but the sizes of these corpora are currently smaller: in our work we use Gender-Imitation-Crowdsource ("GI cs") corpus [1], and Gender Imitation corpus [2]. Naturally, it makes it more difficult to reach the high precision of task solution.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The adaptation of complex models, based on convolutional neural networks, gradient boosting methods, LSTM, Siamese networks is described in Section 4.2. We then apply the same methods to the 'GI cs' datasets for the task of gender prediction (Section 7) and compare them to our previous results [1,3], that demonstrated the accuracy of 88% ±3%, which is about 30% more than the baseline.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The purpose of this paper is to evaluate the accuracy of solving multigenre profiling task more correctly with cross-validation, using an extended set of text features along with neural net and machine learning models. Models, which are effective and largely independent of genre and external dictionaries, have been proposed in our previous work (Sboev A., 2017). The sets of used features included: morpho-syntactic, linguistic features (LIWC), a generalized dictionary approach of low displacement rank (LDR), along with different variants of vector representation of the text. The adaptations of these models and more complex models, based on convolutional neural networks (CNN), long short-term memories (LSTM), gradient boosting methods, are described in Section Materials and Methods, see Subsection Models.…”
Section: Introduction (citation type: mentioning)
confidence: 99%