“…Within traditional machine learning, NLI is usually approached as a multi-class classification problem in which texts written in a second language (L2) are assigned class labels representing the author's first language (L1). The main focus is on designing features that capture the systematic fingerprints of the first language in second-language writing, known as native language interference (Odlin, 1989). These features include: spelling errors (Koppel et al., 2005; Chen et al., 2017); lexical features, e.g., word and lemma n-grams (Jarvis et al., 2013), cognates (Markov et al., 2019), and etymologically related words; syntactic features, e.g., context-free grammar features (Wong and Dras, 2011) and Stanford parser dependency features (Tetreault et al., 2012); stylometric features, e.g., punctuation (Markov et al., 2018a) and character n-gram features (Kulmizev et al., 2017); and emotion-based features (Markov et al., 2018b), among others. The combination of such features provides the best results for NLI, as shown by the two shared tasks organized in 2013 and 2017 (Malmasi et al., 2017), where the two top-ranked systems (Cimino and Dell'Orletta, 2017; Markov et al., 2017) used Support Vector Machines (SVM) with a variety of engineered features.…”
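As a rough illustration of this traditional setup, the sketch below trains a linear SVM over a union of word and character n-gram features, broadly in the spirit of the shared-task systems cited above (which combined many more engineered feature types). The toy texts, L1 labels, n-gram ranges, and hyperparameters are illustrative assumptions, not the configurations used in the cited work.

```python
# Minimal sketch of the traditional NLI pipeline: multi-class classification
# with an SVM over simple engineered features. All data and settings here are
# hypothetical; the cited systems used far richer feature combinations.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy L2 texts with L1 labels (hypothetical data for illustration only).
texts = [
    "I am agree with the statement about travelling in group.",
    "He explained me the rule before we started the exam.",
    "In my country is very common to eat late in the evening.",
    "She suggested me to apply for the scholarship next year.",
]
labels = ["Spanish", "French", "Spanish", "French"]

# Combine a lexical view (word n-grams) with a stylometric view (character n-grams).
features = FeatureUnion([
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

clf = Pipeline([
    ("features", features),
    ("svm", LinearSVC()),  # linear SVM, as in the top-ranked shared-task systems
])

clf.fit(texts, labels)
print(clf.predict(["In my city is normal to have dinner at ten."]))
```

In practice, such systems would add further feature groups (spelling errors, syntactic production rules, punctuation patterns, etc.) to the same feature union and tune the SVM on a held-out set.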