Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017
DOI: 10.18653/v1/p17-2009
Incorporating Dialectal Variability for Socially Equitable Language Identification

Abstract: Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves st…
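As a rough illustration of the character-based modeling the abstract describes, the sketch below shows a minimal character-level LID classifier. It is not the authors' released implementation; the character vocabulary, label set, and hyperparameters are placeholder assumptions.

import torch
import torch.nn as nn

# Minimal character-level LID classifier (illustrative sketch only).
class CharLID(nn.Module):
    def __init__(self, n_chars, n_langs, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_langs)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        h, _ = self.gru(self.emb(char_ids))
        # Mean-pool over the character sequence, then score each language.
        return self.out(h.mean(dim=1))

# Toy usage with an assumed 27-character alphabet and a two-label set.
charset = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
langs = ["en", "es"]

def encode(text, max_len=32):
    ids = [charset.get(c, 0) for c in text.lower()[:max_len]]
    return torch.tensor(ids + [0] * (max_len - len(ids))).unsqueeze(0)

model = CharLID(n_chars=len(charset) + 1, n_langs=len(langs))
logits = model(encode("where are you going"))
print(langs[logits.argmax(dim=-1).item()])  # untrained, so the label is arbitrary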

Cited by 61 publications (50 citation statements)
References 31 publications
“…The results prove that BERT expresses strong preferences for male pronouns, raising concerns with using BERT in downstream tasks like resume filtering. NLP applications ranging from core tasks such as coreference resolution (Rudinger et al., 2018) and language identification (Jurgens et al., 2017), to downstream systems such as automated essay scoring (Amorim et al., 2018), exhibit inherent social biases which are attributed to the datasets used to train the embeddings (Barocas and Selbst, 2016; Zhao et al., 2017; Yao and Huang, 2017).…”
Section: Real World Implications
confidence: 99%
“…We primarily evaluate on the task of language identification ("LangID": Cavnar and Trenkle (1994)), using the corpora of Lui and Baldwin (2012), which combine large training sets over a diverse range of text domains. Domain adaptation is an important problem for this task (Lui and Baldwin, 2014; Jurgens et al., 2017), where text resources are collected from numerous sources, and exhibit a wide variety of language use. We show that while domain adversarial training overall improves over baselines, gains are modest.…”
Section: Introduction
confidence: 99%
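The domain adversarial training mentioned in the statement above is typically implemented with a gradient-reversal layer (Ganin et al., 2016). The sketch below shows that mechanism in a minimal, assumed setup with one language head and one adversarial domain head on top of a shared encoder; the layer sizes and label counts are illustrative, not the cited papers' configuration.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses (and scales) gradients in the
    # backward pass so the shared encoder unlearns domain-specific features.
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainAdversarialHeads(nn.Module):
    # Assumed two-head setup: one head predicts the language (the task),
    # the other predicts the text domain through gradient reversal (the adversary).
    def __init__(self, feat_dim=128, n_langs=10, n_domains=3, lamb=0.1):
        super().__init__()
        self.lamb = lamb
        self.lang_head = nn.Linear(feat_dim, n_langs)
        self.domain_head = nn.Linear(feat_dim, n_domains)

    def forward(self, feats):
        # feats: (batch, feat_dim) encoder output for each text.
        lang_logits = self.lang_head(feats)
        domain_logits = self.domain_head(GradReverse.apply(feats, self.lamb))
        return lang_logits, domain_logits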
“…used byte-level representations of sentences as input for the networks. Recently, Hanani et al. (2016) and also used LSTMs. Later, GRUs were successfully used for LI by Jurgens et al. (2017) and Kocmi and Bojar (2017). In addition to GRUs, Bjerva (2016) also experimented with deep residual networks ("ResNets") at DSL 2016.…”
confidence: 99%