Unsupervised Deep Language and Dialect Identification for Short Texts

Goswami, Koustava; Sarkar, Rajdeep; Chakravarthi, Bharathi Raja; Fransen, Theodorus; McCrae, John P.

doi:10.18653/v1/2020.coling-main.141

Cited by 3 publications

(5 citation statements)

References 24 publications


“…Character Encoding. The character level CNN layer generates word representation on n-gram (n ∈ {2,3,4,5,6}) characters, which helps to understand the representation of words on different sub-word levels (Goswami et al, 2020). A word's different sub-word level representations are achieved using a 1-dimensional CNN (Zhang et al, 2015).…”

Section: Word Encoder (mentioning)

confidence: 99%
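The sub-word levels named in the statement above can be illustrated with a minimal sketch. It only enumerates the character windows of width n ∈ {2,3,4,5,6} that a 1-dimensional convolution of matching kernel width would slide over; it is not the cited authors' encoder, and the function names `char_ngrams` and `subword_views` are hypothetical, chosen here for illustration.

```python
from typing import Dict, Iterable, List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return every contiguous character n-gram of `word`.

    A word shorter than n yields an empty list.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def subword_views(word: str, sizes: Iterable[int] = (2, 3, 4, 5, 6)) -> Dict[int, List[str]]:
    """Map each n-gram size to the windows of that width (hypothetical helper)."""
    return {n: char_ngrams(word, n) for n in sizes}


if __name__ == "__main__":
    # Print the windows seen at each sub-word level for one example word.
    for n, grams in subword_views("dialect").items():
        print(n, grams)
```

In a real character-level CNN each window would be mapped to a learned feature vector and pooled; the sketch stops at listing the windows.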

“…where k is the number of classes. We train the unsupervised model based on the maximum likelihood clustering loss proposed by Goswami et al (2020), where they try to maximize the probability distribution function for each class and at the same time try to minimize the probability of all the datasets to be assigned to one class using Equation 7.…”

Section: Weakly-supervised/unsupervised Cognate Detector (mentioning)

confidence: 99%

“…To alleviate the above challenges, in this paper, we propose a language-agnostic weakly-supervised cognate detection framework based on Siamese architecture with an iterative clustering approach (Xie et al, 2016) during back-propagation. Our encoder design is inspired by Goswami et al (2020), where they learn the n-gram character features of a sentence with attention. We introduce a positional encoder on n-gram features, which, in combination with the attention mechanism, learns sub-word representations of a word.…”

Section: Introduction (mentioning)

confidence: 99%


Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages

Goswami, Rani, Fransen et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform for most under-resourced languages. This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages using morphological knowledge from closely related languages. We train an encoder to gain morphological knowledge of a language and transfer the knowledge to perform unsupervised and weakly-supervised cognate detection tasks with and without the pivot language for the closely-related languages. While unsupervised, it overcomes the need for hand-crafted annotation of cognates. We performed experiments on different published cognate detection datasets across language families and observed not only significant improvement over the state-of-the-art but also our method outperformed the state-of-the-art supervised and unsupervised methods. Our model can be extended to a wide range of languages from any language family as it overcomes the requirement of the annotation of the cognate pairs for training. The code and dataset building scripts can be found at https://github.com/koustavagoswami/Weakly_supervised-Cognate_Detection


“…On the side of CNNs, we evaluate two concrete architectures: Kim_CNN (Kim et al, 2016) and Zhang_CNN (Zhang et al, 2015), which are known to perform well on the task of text classification and are widely used. On the side of RNNs, we evaluate the architecture Lin_SA_BiLSTM (Lin et al, 2017), which has been shown to give good results on the task of dialect classifications (Goswami et al, 2020). We manipulated the tokenizers of these models using different granularity levels without changing the overall architecture.…”

Section: Models For Classification (mentioning)

confidence: 99%

Optimizing the Size of Subword Vocabularies in Dialect Classification

Kanjirangat, Samardžić, Dolamic et al. 2023

Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)


Pre-trained models usually come with a predefined tokenization and little flexibility as to what subword tokens can be used in downstream tasks. This problem concerns especially multilingual NLP and low-resource languages, which are typically processed using cross-lingual transfer. In this paper, we aim to find out if the right granularity of tokenization is helpful for a text classification task, namely dialect classification. Aiming at generalizations beyond the studied cases, we look for the optimal granularity in four dialect datasets, two with relatively consistent writing (one Arabic and one Indo-Aryan set) and two with considerably inconsistent writing (one Arabic and one Swiss German set). To gain more control over subword tokenization and ensure direct comparability in the experimental settings, we train a CNN classifier from scratch comparing two subword tokenization methods (Unigram model and BPE). For reference, we compare the results obtained in our analysis to the state of the art achieved by fine-tuning pre-trained models. We show that models trained from scratch with an optimal tokenization level perform better than fine-tuned classifiers in the case of highly inconsistent writing. In the case of relatively consistent writing, fine-tuned models remain better regardless of the tokenization level.


“…akshara n-grams) work better than features based on bytes. Goswami et al (2020) experiment with supervised and unsupervised methods in dialect identification using, among others, a Dravidian data set containing Tamil, Telugu, Malayalam, and Kannada.…”

Section: Language Identification Of South Dravidian Languages (mentioning)

confidence: 99%

Comparing Approaches to Dravidian Language Identification

Jauhiainen, Ranasinghe, Zampieri 2021

Preprint


This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages: Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Bayes classifier with adaptive language models, which has shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.
