Proceedings of the 3rd Multidisciplinary International Social Networks Conference on SocialInformatics 2016, Data Science 2 2016
DOI: 10.1145/2955129.2955182

Gender Inference using Statistical Name Characteristics in Twitter

Abstract: Much attention has been given to the task of gender inference of Twitter users. Although names are strong gender indicators, the names of Twitter users are rarely used as a feature; probably due to the high number of ill-formed names, which cannot be found in any name dictionary. Instead of relying solely on a name database, we propose a novel name classifier. Our approach extracts characteristics from the user names and uses those in order to assign the names to a gender. This enables us to classify internati…

Cited by 21 publications (11 citation statements)
References 23 publications

“…We acknowledge that there is a distinction between gender and sex (West and Zimmerman, 1987), but we use gender estimates as a proxy for sex in order to be consistent with CDC measures. This technique combines three classification approaches: (a) matching users’ first names to data from the US Social Security Administration [24] (which captures approximately 60% of Twitter names), (b) an SVM classifier applied to word and character n-gram features from users’ names [25] and (c) a decision tree classifier applied to features constructed from the linguistic structure of users’ names, including the count of syllables, vowels, consonants, bouba (round) and kiki (sharp) vowels and consonants [26, 27], and whether or not the last character is a vowel [28]. For each user, we combined the predictions from all three classifiers using a weighted stacked logistic regression framework [29].…”
Section: Methods (mentioning)
confidence: 99%
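
The name-structure features listed in approach (c) can be illustrated with a minimal Python sketch. This is not the cited authors' code: the function name `name_features`, the restriction to ASCII letters, and the syllable estimate (counting runs of vowels, with "y" treated as a vowel for that purpose only) are all assumptions made for illustration. The bouba/kiki counts are left out because the statement does not say which letters belong to each group.

```python
import re

# Assumption: only the five ASCII vowels count as vowels.
VOWELS = set("aeiou")


def name_features(name: str) -> dict:
    """Return simple structural features of a name (illustrative only)."""
    # Keep ASCII letters only; digits, symbols and accented letters are dropped.
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    # Heuristic syllable estimate: number of maximal vowel runs,
    # treating "y" as a vowel here (e.g. "mary" gives 2).
    syllables = len(re.findall(r"[aeiouy]+", "".join(letters)))
    ends_with_vowel = bool(letters) and letters[-1] in VOWELS
    return {
        "vowels": vowels,
        "consonants": consonants,
        "syllables": syllables,
        "ends_with_vowel": ends_with_vowel,
    }


if __name__ == "__main__":
    print(name_features("Maria"))
    print(name_features("John"))
```

A feature dictionary like this could be fed to a decision tree classifier, as the statement describes; the choice of classifier settings is not given in the text.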
“…Data from Twitter does not include demographic characteristics of users. To address this limitation, we developed a scalable and efficient ensemble approach for inferring gender by combining predictions from three previously proposed methods that focus only on the metadata available on users' profile (Burger et al, 2011; Mislove et al, 2011; Longley et al, 2015; Mueller and Stumme, 2016). The three approaches included, method (1), Twitter users' first names were matched to data from the U.S. Social Security Administration (Longley et al, 2015) (this captured approximately 60% of Twitter names); method (2), we used word and character n-grams from users' names and a Support Vector Machine (SVM) classifier with a linear kernel (Burger et al, 2011); and method (3), we applied a decision tree classifier to the linguistic structure of users' names, including the count of syllables, vowels, consonants, bouba (round) and kiki (sharp) vowels and consonants (Maurer et al, 2006; Nielsen and Rendall, 2011), and whether or not the last character was a vowel.…”
Section: Methods (mentioning)
confidence: 99%
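
As a rough illustration of the character n-gram features mentioned in method (2), the sketch below counts character n-grams of a name using only the Python standard library. The function name `char_ngrams`, the `^` and `$` boundary markers, and the default n-gram range of 1 to 3 are assumptions for illustration; the statement does not specify these details, and the SVM training step is not shown.

```python
from collections import Counter


def char_ngrams(name: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count character n-grams of a name for n in [n_min, n_max]."""
    # Assumption: lowercase the name and mark its start and end so that
    # prefixes and suffixes become distinct features.
    padded = f"^{name.strip().lower()}$"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded name.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    for gram, count in sorted(char_ngrams("Ana", 2, 2).items()):
        print(gram, count)
```

Counts like these would typically be turned into a sparse feature vector before being passed to a linear classifier.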
“…Previous works explored demographic characteristics using names from different sources or combined with other factors. Several works predicted gender based on users' names only [18,19]. Burger et al distinguished gender on Twitter using names and screen names for classification [20].…”
Section: Demographic Characteristics From Names (mentioning)
confidence: 99%