2020
DOI: 10.31234/osf.io/94xcp
Preprint

PANDORA Talks: Personality and Demographics on Reddit

Abstract: Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three exp…

citations
Cited by 22 publications
(21 citation statements)
references
References 64 publications
“…Jiang et al (2020) simply concatenate all the utterances from a single user into a document and encode it with BERT (Devlin et al, 2019) and RoBERTa (Liu et al, 2019). Gjurković et al (2020) first encode each post by BERT and then use CNN (LeCun et al, 1998) to aggregate the post representations. Most of them focus on how to obtain more effective contextual representations, with only several exceptions that try to introduce psycholinguistic features into DNNs, such as Majumder et al (2017) and Xue et al (2018).…”
Section: Personality Detection (mentioning)
confidence: 99%
“…Personality detection can be formulated as a multi-document multi-label classification task (Lynn et al, 2020; Gjurković et al, 2020). Formally, each user has a set P = {p_1, p_2, …”
Section: Our Approach (mentioning)
confidence: 99%
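The multi-document multi-label formulation quoted above can be pictured with a small data structure. This is only a sketch under stated assumptions, not the cited authors' implementation: the `UserExample` class, the `documents` field, and the placeholder trait names are hypothetical, since the quoted passage is truncated before it defines the label set.

```python
from dataclasses import dataclass
from typing import Dict, List

# Placeholder trait names (assumption): the quoted passage does not list them.
TRAITS = ["trait_a", "trait_b", "trait_c"]


@dataclass
class UserExample:
    """One training example: several documents by one user, several labels."""
    documents: List[str]    # the user's set P of texts (multi-document input)
    labels: Dict[str, int]  # one binary label per trait (multi-label target)


def to_label_vector(example: UserExample) -> List[int]:
    """Order the per-trait labels into a fixed-length target vector."""
    return [example.labels[trait] for trait in TRAITS]


example = UserExample(
    documents=["first comment", "second comment"],
    labels={"trait_a": 1, "trait_b": 0, "trait_c": 1},
)
print(to_label_vector(example))
```

The point of the sketch is the shape of the problem: the input is a set of texts per user rather than a single text, and the target is a vector of labels rather than a single class.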
“…To detect MBTI typologies in a more natural way and without necessity for trained human assessors, many studies have attempted at building systems for automatic detection of MBTI personality types from text in the last several years. Attempts have been made for automatic detection of MBTI personality types from: tweets written in English (Plank and Hovy, 2015), six other Western European languages (Verhoeven et al, 2016), and Japanese (Yamada et al, 2019); English posts collected from Personality Cafe forum available in Kaggle; and English Reddit comments (Gjurković and Šnajder, 2018; Gjurković et al, 2020). Despite being trained on large amounts of textual data (over one million), and modelled as four binary classification tasks, the best systems performed only slightly better than the random and majority-class baselines, regardless of the architecture used.…”
Section: Introduction (mentioning)
confidence: 99%
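The statement above mentions modelling MBTI detection as four binary classification tasks and comparing against a majority-class baseline. A minimal sketch of both ideas follows, assuming the standard four MBTI letter pairs. The function names and the toy labels are hypothetical and do not come from any of the cited papers.

```python
from typing import List

# The four MBTI dichotomies; each becomes one binary task (0 = first letter).
MBTI_AXES = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def mbti_to_binary(mbti_type: str) -> List[int]:
    """Split a four-letter MBTI type into four binary labels."""
    mbti_type = mbti_type.upper()
    if len(mbti_type) != len(MBTI_AXES):
        raise ValueError(f"expected 4 letters, got {mbti_type!r}")
    labels = []
    for letter, (first, second) in zip(mbti_type, MBTI_AXES):
        if letter == first:
            labels.append(0)
        elif letter == second:
            labels.append(1)
        else:
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
    return labels


def majority_baseline_accuracy(label_rows: List[List[int]]) -> List[float]:
    """Per-axis accuracy of always predicting the most frequent class."""
    n = len(label_rows)
    accuracies = []
    for axis in range(len(MBTI_AXES)):
        ones = sum(row[axis] for row in label_rows)
        accuracies.append(max(ones, n - ones) / n)
    return accuracies


# Toy labels for illustration only; not real data.
rows = [mbti_to_binary(t) for t in ["INTJ", "ISTP", "ENTP"]]
print(rows)
print(majority_baseline_accuracy(rows))
```

A trained classifier for a given axis is informative only to the extent that it beats the corresponding baseline value, which is the comparison the quoted statement refers to.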