Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018
DOI: 10.18653/v1/d18-1395

Native Language Identification with User Generated Content

Abstract: We address the task of native language identification in the context of social media content, where authors are highly-fluent, advanced nonnative speakers (of English). Using both linguistically-motivated features and the characteristics of the social media outlet, we obtain high accuracy on this challenging task. We provide a detailed analysis of the features that sheds light on differences between native and nonnative speakers, and among nonnative speakers with different backgrounds.

Cited by 18 publications (39 citation statements)
References 27 publications
“…We create a dataset of sentences from comments by users who self-identify as being from L1 English countries, as well as a set of comments by users who self-identify as being from Russia. These datasets are constructed using similar methodology to recent work in native language identification [13]. This test is used to demonstrate the tendency of each model to generate more false positives when considering English comments written by users who speak Russian as a first language, as opposed to English native speakers.…”
Section: Methods (mentioning)
confidence: 99%
“…Reddit has been the data source for past work on Native-Language Identification (NLI) on sophisticated second-language speakers [11] [13]. This work entailed the creation of datasets of Reddit comments from users of a variety of different languages by looking for self-identified "flair" in European subreddits.…”
Section: Corpus III: Augmented L2 Reddit Dataset (mentioning)
confidence: 99%
“…Linear classifier with content-independent features (LR): Replicating Goldin et al. (2018), we trained a logistic regression classifier with three types of features: function words, POS trigrams, and sentence length, all of which are reflective of the style of writing. We deliberately avoided using content features (e.g., word frequencies).…”
Section: Baselines (mentioning)
confidence: 99%
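
The three style feature types named in the statement above (function words, POS trigrams, sentence length) could be extracted roughly as follows. This is a minimal sketch, not the cited authors' implementation: the function-word list, the naive sentence splitter, the feature-name prefixes, and the assumption that POS tags arrive pre-computed from some external tagger are all hypothetical choices made for illustration. Training the logistic regression classifier itself is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption);
# a real study would use a much longer, curated list.
FUNCTION_WORDS = ["the", "a", "of", "in", "to", "and", "that", "which"]


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not a real tokenizer)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def function_word_freqs(tokens):
    """Relative frequency of each listed function word among all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {f"fw={w}": counts[w] / total for w in FUNCTION_WORDS}


def pos_trigram_counts(pos_tags):
    """Counts of consecutive POS-tag triples; tags come from an external tagger."""
    trigrams = zip(pos_tags, pos_tags[1:], pos_tags[2:])
    return Counter("pos3=" + "_".join(t) for t in trigrams)


def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def style_features(text, pos_tags):
    """Merge the three content-independent feature groups into one dict."""
    tokens = re.findall(r"[A-Za-z']+", text)
    features = function_word_freqs(tokens)
    features.update(pos_trigram_counts(pos_tags))
    features["mean_sentence_length"] = mean_sentence_length(text)
    return features
```

A call such as `style_features(comment_text, tags)` returns a single feature dictionary per text, which could then be vectorized and passed to any logistic regression implementation. No content words enter the feature set, which matches the statement's stated aim of avoiding content features.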
“…This work considers the problem of learning to compare users on social media. A related task which has received considerably more attention is predicting user attributes (Han et al., 2014; Sap et al., 2014; Dredze et al., 2013; Culotta et al., 2015; Volkova et al., 2015; Goldin et al., 2018). The inferred user attributes have proven useful for social science and public health research (Mislove et al., 2011; Morgan-Lopez et al., 2017).…”
Section: Related Work (mentioning)
confidence: 99%