Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-2096
Predicting Foreign Language Usage from English-Only Social Media Posts

Abstract: Social media is known for its multi-cultural and multilingual interactions, a natural product of which is code-mixing. Multilingual speakers mix languages in their tweets to address a different audience, express certain feelings, or attract attention. This paper presents a large-scale analysis of 6 million tweets produced by 27 thousand multilingual users speaking 12 languages besides English. We rely on this corpus to build predictive models to infer non-English languages that users speak exclusively from the…

Cited by 10 publications (13 citation statements) | References 29 publications
“…These are content-based features that are highly domain-dependent and are not likely to generalize across domains. Volkova et al. (2018) explored the contribution of various (lexical, syntactic, and stylistic) signals for predicting the foreign language of non-English speakers based on their English posts on Twitter. This effectively results in a 12-way classification task, with 12 different L1s (data sizes are distributed very unevenly), and the best results are unsurprisingly obtained with word unigrams and bigrams.…”
Section: Related Work
Mentioning confidence: 99%
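The word unigram and bigram feature space mentioned in the statement above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the cited authors' implementation; the function name and example tweet are hypothetical.

```python
def ngram_features(text, n_max=2):
    """Count word unigrams and bigrams in a tweet (lowercased,
    whitespace-tokenized) -- the feature space the citation statement
    reports as the strongest for the 12-way L1 classification task."""
    tokens = text.lower().split()
    feats = {}
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            key = " ".join(tokens[i:i + n])
            feats[key] = feats.get(key, 0) + 1
    return feats

feats = ngram_features("so happy to be here")
# feats["so happy"] == 1 and feats["happy"] == 1
```

In practice these sparse count dictionaries would be vectorized and fed to a linear classifier; the sketch stops at feature extraction, which is the part the statement describes.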
“…• Feature Space. Previous NLI studies with UGC [9,18,45] rely on (i) content-specific features (linguistic features) such as word- and character-based n-grams, which are not likely to generalize to other domains [9] (see Section 2.2 for more details); and (ii) social network features, which are specific to the Reddit dataset only [30] and are hard to generalize to other datasets. These features include karma, average score, average number of submissions, average number of comments, and most popular subreddits [9].…”
Section: Introduction
Mentioning confidence: 99%
“…These features include karma, average score, average number of submissions, average number of comments, and most popular subreddits [9]. Unlike previous NLI studies with UGC [9,18,45], we define a topic-independent feature space without relying on the social network features and content-specific features, which makes our solution generalizable to other domains and datasets. Specifically, our feature space is based on (i) part-of-speech (POS) n-grams, (ii) function words, (iii) context-free grammar production rules (CFG rules), and (iv) the structure of the text sample, such as average sentence length (see Section 4.1 for more details).…”
Section: Introduction
Mentioning confidence: 99%
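Two of the four topic-independent feature families listed above (function-word rates and average sentence length) can be computed with the standard library alone; POS n-grams and CFG rules would additionally need a tagger and parser. A minimal sketch, with an illustrative (abridged) function-word list that is an assumption, not the cited work's lexicon:

```python
import re

# Illustrative subset of English function words (assumed, not from the paper).
FUNCTION_WORDS = ["the", "a", "an", "of", "in", "to", "and", "that", "it", "is"]

def stylistic_features(text):
    """Topic-independent features: per-token function-word rates and
    average sentence length. POS n-grams and CFG production rules from
    the cited feature space are omitted (they require a tagger/parser)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    feats = {f"fw_{w}": tokens.count(w) / n for w in FUNCTION_WORDS}
    feats["avg_sent_len"] = len(tokens) / max(len(sentences), 1)
    return feats
```

Because these features ignore content words entirely, they carry no topic signal, which is the generalization property the statement argues for.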
“…We model the presence of the words in one of the following cue categories: assertive verbs [109], bias [187], factive verbs [125], implicative verbs [123], hedges (Hyland, 2018), report verbs. Previous work has used these bias cues to identify bias in suspicious news posts on Twitter [226] (6 Features).…”
Section: Models
Mentioning confidence: 99%
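The cue-category modeling described above (one feature per category) can be sketched as simple lexicon lookups. The abridged lexicons below are hypothetical placeholders; the cited work draws its cue lists from the referenced literature.

```python
# Hypothetical, abridged cue lexicons (assumed for illustration only).
CUE_LEXICONS = {
    "hedges": {"maybe", "perhaps", "possibly", "apparently"},
    "report_verbs": {"said", "claimed", "reported", "argued"},
    "factive_verbs": {"realize", "know", "regret"},
}

def cue_features(text):
    """One count per cue category: number of tokens in the post that
    appear in that category's lexicon."""
    tokens = text.lower().split()
    return {cat: sum(t in lex for t in tokens)
            for cat, lex in CUE_LEXICONS.items()}
```

With the six categories the statement lists, this yields the six-dimensional feature vector it mentions.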
“…We aim to detect IRA trolls by identifying their way of writing English tweets. As shown in [226], English tweets generated by non-English speakers exhibit different syntactic patterns. Thus, we use SOTA NLI features to detect this unique pattern [45,142,96].…”
Section: Models
Mentioning confidence: 99%